Transcribe · sync · render

Captioned video in minutes,
not Saturdays.

Drop in audio or video. CaptionFit transcribes it, syncs the captions, and renders a finished video with burned-in captions: 9x16 or 16x9, your font, your color.

Design as you wish
9x16 or 16x9 render
Download SRT or MP4
How it works

Three steps. That's the whole thing.

Transcribe audio. Sync captions. Render video. No timeline-scrubbing, no manually nudging timestamps, no weird XML.

— STEP 01 · TRANSCRIBE

Drop in audio or video

MP3, WAV, M4A, MP4 and more, up to 2 hours per file on paid plans (10 minutes on Free). Pick a language (or auto-detect) and a caption length, then hit Transcribe.

rooftop_demo.mp3
3.4 MB · 0:24
verse_takes_03.wav
28.1 MB · 4:12
aligning…
+ drop more files
— STEP 02 · SYNC

Paste script or lyrics (optional)

Got the words already? Paste them — one line per caption — to fix spelling and word grouping. We snap them to the audio.

LYRICS.TXT
When the night is bright
we run a little wild
city lights below ↳ snap to 0:08
I'll meet you on the rooftop
underneath the neon
for one more song
— STEP 03 · RENDER

Render video or grab the SRT

Pick a font, size, color and aspect ratio (9x16 or 16x9). Hit Render Video — or just download the SRT.

1
00:00:01,200 → 00:00:03,800
When the night is bright

2
00:00:04,100 → 00:00:06,500
we run a little wild

3
00:00:08,400 → 00:00:11,100
city lights below

What you get

Built for the last-mile stuff that always eats your night.

Fast

Faster than real-time

A 4-minute song aligns in about 20 seconds. We run on dedicated GPUs so your queue stays empty.

CaptionFit · 12s
Service B · 1m 42s
Service C · 3m 08s
By hand · ~40m
Lyric-aware

It uses what you give it

Paste lyrics or a script and CaptionFit aligns to those exact words. No more "ahh-vuh-tahn-deal" mishears.

Audio-only
we run a little while
city light's below
underneath the neo
+ Lyrics
we run a little wild
city lights below
underneath the neon
Render-ready

Burn-in, your way

Pick a font, dial in size and color, choose 9x16 for Reels or 16x9 for YouTube. Hit Render Video.

Noto Sans ▾ A− A+
9x16
16x9 · Cover
Render queue: 12s · Render Video →
In the editor

Preview, tweak, render. All in one tab.

captionfit / projects / rooftop_demo

Recent projects

rooftop_demo · just now
verse_takes_03 · 2h ago
podcast_ep_42 · yesterday
livestream_clip · Mon
lecture_intro · Apr 28
New transcription
rooftop_demo.mp3 · 6 segments
00:24 · english · lyric-aligned
00:01,200 · When the night is bright · 0.99
00:04,100 · we run a little wild · 0.97
00:08,400 · city lights below · 0.99
00:12,300 · I'll meet you on the rooftop · 0.95
00:16,000 · underneath the neon · 0.98
00:20,100 · for one more song · 0.99
Preview & edit

Tweak captions while the video plays.

Burn-in preview updates live. Edit a line, nudge a timestamp, split or merge — render when it feels right.

16x9 · 1080p
00:00 / 00:24
Noto Sans · 48 · White
Position · Bottom
Render Video
J back 1s · K play/pause · L ahead 1s · split here
From the inbox

People who used to dread Sunday captioning.

I had a 3-minute song and the lyrics in a Notes file. CaptionFit gave me an SRT in 18 seconds and I uploaded the video before my coffee was cold.

MR
Mara Reyes
Independent songwriter

The lyric-paste feature is the unlock. Other tools mishear half my band's vocals — pasting the words means it just works.

DW
Devon Wu
Music video editor

Replaced an internal Python script we'd been duct-taping for a year. The keyboard shortcuts in the editor are chef's kiss.

PK
Priya Kothari
Podcast producer, Loopfield
Pricing

Start free. Scale when you're ready.

No credit card required.

Free
$0
20 tokens / month
  • ~10 min transcribed & rendered video
  • 4 AI cover images
  • Unlimited Script Fix & lyrics align

For trying it out — short clips and quick experiments.

Get started free
Expert
$20 /mo
500 tokens / month
  • ~250 min transcribed & rendered video
  • 100 AI cover images
  • Unlimited Script Fix & lyrics align
  • No watermark on renders

For power users and studios — full-length content with AI covers.

Start with Expert

Cancel anytime · Stripe-secured

All plans include transcription, MP4 rendering, the captions editor, multilingual fonts, AI tools and blurred-backdrop covers.

FAQ

Things people ask before signing up.

How accurate is the alignment?

When you paste lyrics or a script, alignment is typically within 80–150 ms of the spoken word, close enough that you'll rarely need to nudge anything. Audio-only transcription accuracy depends on the recording, but you can always paste a correction and re-align.

Which formats can I upload?

MP3, WAV, M4A, FLAC, AAC, OGG, plus video formats (MP4, MOV, WebM, MKV). Up to 2 hours per file on paid plans, 10 minutes on Free.

What languages are supported?

More than 99 languages, including English, Spanish, French, German, Mandarin, Hindi, Arabic, and Japanese. Leave the language field on Auto-detect and CaptionFit will identify the language for you.

Do I need an account to get started?

No account needed to start. Drop a file and CaptionFit gets to work — your project is saved in your browser. Sign in any time to access your projects across devices.

Can I style the captions?

Yes — choose from 30+ fonts, set your own color, adjust size and vertical position. Everything is live-previewed before you commit to a render.

What is Karaoke mode?

Karaoke mode highlights each word as it's spoken, keeping viewers locked in. Choose Color style (the active word changes color) or Stroke style (the active word gets an outline). Works on both 9:16 and 16:9 renders.

What aspect ratios can I render?

9:16 for Reels, TikTok, and Shorts, and 16:9 for YouTube and the web. Toggle Cover to fit a horizontal source into a vertical canvas (or vice versa).

Can I download captions as something other than a video?

Yes — download a clean SRT any time, even before rendering. SRT files work directly in YouTube Studio, Adobe Premiere, DaVinci Resolve, and most other editing tools.
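
If you're curious what's inside that file, here is a minimal Python sketch of the SRT layout: a running index, a start and end timestamp in HH:MM:SS,mmm form joined by `-->`, then the caption text, with a blank line between entries. The helper names (`srt_timestamp`, `to_srt`) and the segment tuples are our own illustration, not CaptionFit's internals or API.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) tuples.

    The tuple layout is a hypothetical input shape chosen for this sketch.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Entries are separated by a blank line.
    return "\n".join(blocks)


# Sample captions taken from the rooftop_demo preview on this page.
segments = [
    (1.2, 3.8, "When the night is bright"),
    (4.1, 6.5, "we run a little wild"),
    (8.4, 11.1, "city lights below"),
]
srt_text = to_srt(segments)
print(srt_text)
```

Any editor that imports SRT reads this same structure, which is why the download works without a conversion step.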

Does CaptionFit train on my audio?

No — CaptionFit does not train any models on your audio or transcripts. Your files are stored only while your project exists; delete a project (or your account) and its files are removed from our servers. To produce captions we send your audio to specialist AI providers (ElevenLabs and Anthropic) acting as our processing providers, solely to return your results.

What about long files or batch jobs?

On paid plans you can drop a folder of files at once and we'll align them in parallel. Long files (lectures, audiobooks) are chunked automatically — you still get one clean SRT at the end.

Ready when you are

Drop a track. Get a captioned video.

No card required, no setup call, no "book a demo." The Free tier covers most one-off projects.