Upload an audio clip and its transcript to get word-level timestamps. Powered by Qwen/Qwen3-ForcedAligner-0.6B-hf.
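The output can be thought of as a list of words, each with a start and end time. The sketch below assumes each aligned word is a `(word, start, end)` record with times in seconds from the start of the clip. `WordStamp`, `format_timestamp`, and `words_between` are hypothetical names used only for illustration. They are not part of the model's or the app's API, and the app's real output format may differ.

```python
from dataclasses import dataclass


# Hypothetical record for one aligned word (assumed format, not the app's API).
@dataclass
class WordStamp:
    word: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip


def format_timestamp(seconds: float) -> str:
    """Render a time in seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def words_between(stamps: list[WordStamp], t0: float, t1: float) -> list[str]:
    """Return the words whose [start, end] interval overlaps [t0, t1]."""
    return [s.word for s in stamps if s.start < t1 and s.end > t0]


if __name__ == "__main__":
    # Illustrative values only, not real model output.
    stamps = [
        WordStamp("hello", 0.12, 0.48),
        WordStamp("world", 0.55, 0.98),
    ]
    for s in stamps:
        print(f"{format_timestamp(s.start)} - {format_timestamp(s.end)}  {s.word}")
```

Keeping the times as plain seconds makes it easy to search, trim, or re-render them (for example as subtitle cues) without depending on how the app displays them.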