Drop in audio or video. CaptionFit transcribes it and renders a finished video with burned-in captions in 9:16 or 16:9, with your font and your color.
Transcribe audio. Sync captions. Render video. No timeline-scrubbing, no manually nudging timestamps, no weird XML.
MP3, WAV, M4A, MP4, and more, up to 2 hours per file on paid plans. Pick a language (or auto-detect) and a caption length, then hit Transcribe.
Got the words already? Paste them — one line per caption — to fix spelling and word grouping. We snap them to the audio.
Pick a font, size, color, and aspect ratio (9:16 or 16:9). Hit Render Video, or just download the SRT.
A 4-minute song aligns in about 20 seconds. We run on dedicated GPUs so your queue stays empty.
Paste lyrics or a script and CaptionFit aligns to those exact words. No more "ahh-vuh-tahn-deal" mishears.
Pick a font, dial in size and color, choose 9:16 for Reels or 16:9 for YouTube. Hit Render Video.
Burn-in preview updates live. Edit a line, nudge a timestamp, split or merge — render when it feels right.
I had a 3-minute song and the lyrics in a Notes file. CaptionFit gave me an SRT in 18 seconds and I uploaded the video before my coffee was cold.
The lyric-paste feature is the unlock. Other tools mishear half my band's vocals — pasting the words means it just works.
Replaced an internal Python script we'd been duct-taping for a year. The keyboard shortcuts in the editor are chef's kiss.
No credit card required. All plans include transcription, MP4 rendering, the captions editor, multilingual fonts, and AI tools.
For trying it out — short clips and quick experiments.
For regular creators — weekly clips, social posts, podcasts.
Cancel anytime · Stripe-secured
For power users and studios — full-length content with AI covers.
Cancel anytime · Stripe-secured
All plans include transcription, MP4 rendering, the captions editor, multilingual fonts, AI tools and blurred-backdrop covers.
When you paste lyrics or a script, alignment is typically within 80–150 ms of the spoken word, close enough that you'll rarely need to nudge anything. Accuracy for audio-only transcription depends on the quality of the recording, but you can always paste a correction and re-align.
MP3, WAV, M4A, FLAC, AAC, OGG, plus video formats (MP4, MOV, WebM, MKV). Up to 2 hours per file on paid plans, 10 minutes on Free.
More than 99 languages, including English, Spanish, French, German, Mandarin, Hindi, Arabic, and Japanese. Leave the language field on Auto-detect and CaptionFit will identify it for you.
No account needed to start. Drop a file and CaptionFit gets to work — your project is saved in your browser. Sign in any time to access your projects across devices.
Yes — choose from 30+ fonts, set your own color, adjust size and vertical position. Everything is live-previewed before you commit to a render.
Karaoke mode highlights each word as it's spoken, keeping viewers locked in. Choose Color style (the active word changes color) or Stroke style (the active word gets an outline). Works on both 9:16 and 16:9 renders.
9:16 for Reels, TikTok, and Shorts, and 16:9 for YouTube and the web. Toggle Cover to fit a horizontal source into a vertical canvas (or vice versa).
Yes — download a clean SRT any time, even before rendering. SRT files work directly in YouTube Studio, Adobe Premiere, DaVinci Resolve, and most other editing tools.
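For anyone post-processing the exported file, here is a minimal Python sketch of what an SRT file contains: a running cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the caption text, with a blank line between cues. The function names and the `(start_ms, end_ms, text)` tuple shape are assumptions made for illustration; this is not CaptionFit's own exporter.

```python
def format_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_ms, end_ms, text) tuples as SRT text.

    The tuple layout is a hypothetical input format chosen for this sketch.
    """
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        time_range = f"{format_timestamp(start_ms)} --> {format_timestamp(end_ms)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0, 1500, "Hello"), (1500, 3000, "world")]))
```

Because the format is plain text, a file like this can be opened and corrected in any text editor before it is imported elsewhere.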
No. CaptionFit does not train any models on your audio or transcripts. Your files are stored only while your project exists; delete a project (or your account) and its files are removed from our servers. To produce captions we send your audio to specialist AI providers (ElevenLabs and Anthropic), which process it on our behalf solely to return your results.
On paid plans you can drop a folder of files at once and we'll align them in parallel. Long files (lectures, audiobooks) are chunked automatically — you still get one clean SRT at the end.
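To illustrate how chunked results can still end up as one caption track, here is a rough Python sketch that shifts each chunk's cue times by that chunk's start offset and concatenates them in order. The data shapes and the `merge_chunks` name are hypothetical, and this is only one plausible approach under those assumptions, not a description of CaptionFit's pipeline.

```python
def merge_chunks(chunks):
    """Merge per-chunk cues into one timeline.

    `chunks` is a list of (chunk_start_ms, cues) pairs, where each cue is a
    (start_ms, end_ms, text) tuple timed relative to the start of its chunk.
    Both shapes are assumptions made for this sketch.
    """
    merged = []
    # Process chunks in timeline order, whatever order they finished in.
    for chunk_start_ms, cues in sorted(chunks, key=lambda chunk: chunk[0]):
        for start_ms, end_ms, text in cues:
            # Shift chunk-relative times onto the full file's timeline.
            merged.append((chunk_start_ms + start_ms, chunk_start_ms + end_ms, text))
    return merged


if __name__ == "__main__":
    chunks = [
        (60_000, [(0, 1_000, "second chunk")]),
        (0, [(500, 1_500, "first chunk")]),
    ]
    for cue in merge_chunks(chunks):
        print(cue)
```

Sorting by chunk start means chunks processed in parallel can be merged regardless of which one completes first.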
No card required, no setup call, no "book a demo." The Free tier covers most one-off projects.