Transcribe a sung vocal into lyrics with word-level timings (Whisper large-v3 on GPU). Use a vocal from a previous stems test or upload one. This is speech-to-text only — phoneme alignment is a later pipeline stage.
Each word carries start_ms/end_ms timings. Best results come from a clean vocal stem (run the stems service first). Singing ASR is imperfect, so expect some misheard words; the editor allows correction downstream.
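
As a rough illustration of the word-level start_ms/end_ms output, the sketch below flattens a Whisper-style result into per-word records. It assumes the result has the shape the open-source whisper package produces with word timestamps enabled (segments containing a "words" list with "word", "start", and "end" in seconds). The function name words_to_ms and the output key "text" are hypothetical and not taken from this service.

```python
from typing import Any


def words_to_ms(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten segment-level word lists into records with millisecond timings.

    Assumes each word entry has "word", "start", and "end", with times in seconds.
    """
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append(
                {
                    "text": w["word"].strip(),  # Whisper words often carry a leading space
                    "start_ms": round(w["start"] * 1000),
                    "end_ms": round(w["end"] * 1000),
                }
            )
    return words


# Hand-written sample standing in for a real transcription result.
sample = {
    "segments": [
        {
            "words": [
                {"word": " hello", "start": 0.5, "end": 0.75},
                {"word": " world", "start": 1.0, "end": 1.25},
            ]
        }
    ]
}
timings = words_to_ms(sample)
```

Integer milliseconds keep the records easy to compare and edit downstream; the actual response format of this service may differ.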