
As I pointed out recently, while Whisper is top of mind and still a pretty good transcription model, OpenAI has moved away from it. That said, the fact that Apple’s new transcription API is faster than Whisper is great news. But how accurate is it? We tested it out.
Full disclosure: the idea for this post came from developer Prakash Pax, who did his own tests. As he explains it:
I recorded 15 audio samples in English, randomly ranging from 15 seconds to 2 minutes. And tested against these 3 speech-to-text tools.
- Apple’s New Transcription APIs
- OpenAI Whisper Large v3 Turbo
- ElevenLabs' Scribe v1
I won’t include his results here, otherwise you’d have no reason to head to his interesting post and check it out for yourself.
But he did add this caveat about his methodology: “I’m non-native English speaker. So the results might slightly vary for others.” His tests got me curious about how Apple and OpenAI would stack up against NVIDIA’s Parakeet, which is by far the fastest transcription model out there.
How I did it
Since I’m not a native English speaker either, I decided to use a recent 9to5Mac Daily episode, which was 7:31 long.
I used MacWhisper to run OpenAI’s Whisper Large V3 Turbo, and NVIDIA’s Parakeet v2. For Apple’s speech API, I used Finn Vorhees’ excellent Yap project. I ran them on my M2 Pro MacBook Pro with 16GB of RAM.
For the actual Character Error Rate (CER) and Word Error Rate (WER) analysis, since there are many ways to calculate these rates (e.g., do you normalize spacing? Do you ignore casing? Do you ignore punctuation?), I turned to these two Hugging Face Spaces: Metric: cer and Metric: wer.
Both outline their methodology on their respective pages, so I won’t go into it here. What matters is that all models were evaluated using the same approach, which helps ensure that the baseline is consistent and the overall trends remain reliable, even if the exact numbers would differ under slightly different methods.
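If you’d rather reproduce the numbers locally instead of pasting transcripts into those Spaces, the same two metrics are also available through Hugging Face’s evaluate Python package. WER is essentially (substitutions + deletions + insertions) divided by the number of words in the reference, and CER is the same idea at the character level. Here’s a minimal sketch; the file names are placeholders for illustration:

```python
# Minimal sketch: compute WER/CER with Hugging Face's `evaluate` package
# (pip install evaluate jiwer). File names below are placeholders.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

reference = open("baseline.txt").read()  # the hand-checked transcript

for name in ("apple", "parakeet", "whisper"):
    hypothesis = open(f"{name}.txt").read()  # each model's output
    wer = wer_metric.compute(predictions=[hypothesis], references=[reference])
    cer = cer_metric.compute(predictions=[hypothesis], references=[reference])
    print(f"{name}: WER {wer:.1%}, CER {cer:.1%}")
```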
How they did
| Model | Transcription time | Character Error Rate | Word Error Rate |
| --- | --- | --- | --- |
| Parakeet v2 | 2 seconds | 5.8% | 12.3% |
| Whisper Large V3 Turbo | 40 seconds | 0.2% | 1.5% |
| Apple | 9 seconds | 1.9% | 10.3% |
Taking things a bit further, I decided to have ChatGPT, Claude, and Gemini also calculate CER and WER. Here are the results, along with each model’s self-reported description of its methodology:
ChatGPT (o4-mini-high)
Reference (“baseline”) and each hypothesis (Apple, Parakeet, Whisper) were first normalized by:
- Converting all text to lower case
- Removing all punctuation (periods, commas, quotes, etc.)
- Collapsing any runs of whitespace to a single space
This ensures we’re comparing only the words (or characters) themselves, not formatting or capitalization.
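For reference, those steps map to a small normalization pass along these lines (my own sketch, not the code ChatGPT actually ran; note it only strips ASCII punctuation, so curly quotes and em dashes would need extra handling):

```python
# Sketch of the described normalization: lowercase, strip (ASCII) punctuation,
# collapse whitespace runs into single spaces.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hello,   WORLD... it's fine."))  # -> "hello world its fine"
```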
| Model | Transcription time | Character Error Rate | Word Error Rate |
| --- | --- | --- | --- |
| Parakeet v2 | 2 seconds | 6.0% | 12.3% |
| Whisper Large V3 Turbo | 40 seconds | 0.4% | 1.4% |
| Apple | 9 seconds | 2.1% | 10.2% |
Claude (Sonnet 4)
Text Normalization Protocol: Before computing error rates, all texts undergo standardized normalization to ensure fair comparison:
- Case normalization: Convert to lowercase
- Punctuation standardization:
  - Normalize curly single quotes (‘ ’) to straight apostrophes (')
  - Normalize curly double quotes (“ ”) to straight quotes (")
  - Normalize em dashes (—) to hyphens (-)
- Whitespace normalization: Replace multiple whitespace characters with single spaces
- Trimming: Remove leading and trailing whitespace
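The notable difference from ChatGPT’s pass is that punctuation gets normalized to ASCII equivalents rather than stripped. Roughly, something like this (again my own sketch of the described steps, not Claude’s actual code):

```python
# Sketch of the described normalization: lowercase, map curly quotes and
# em dashes to ASCII equivalents, then collapse and trim whitespace.
import re

CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight apostrophe
    "\u201C": '"', "\u201D": '"',   # curly double quotes -> straight quotes
    "\u2014": "-",                  # em dash -> hyphen
})

def normalize(text: str) -> str:
    text = text.lower().translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()
```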
Note: I tried to use Claude 4 Opus, but got error messages due to server capacity limits on Anthropic’s end.
| Model | Transcription time | Character Error Rate | Word Error Rate |
| --- | --- | --- | --- |
| Parakeet v2 | 2 seconds | 8.4% | 11.0% |
| Whisper Large V3 Turbo | 40 seconds | 0.1% | 1.0% |
| Apple | 9 seconds | 3.5% | 8.2% |
Gemini (2.5 Pro)
All texts (both the baseline and the hypothesis files) were normalized to ensure a fair comparison. The normalization process included:
- Converting all text to lowercase.
- Removing all punctuation (e.g., commas, periods, question marks, and quotes).
- Replacing hyphens with spaces (e.g., “end-to-end” becomes “end to end”).
| Model | Transcription time | Character Error Rate | Word Error Rate |
| --- | --- | --- | --- |
| Parakeet v2 | 2 seconds | 7.6% | 12.3% |
| Whisper Large V3 Turbo | 40 seconds | 0.3% | 0.4% |
| Apple | 9 seconds | 3.4% | 5.3% |
So which is better?
Here’s nobody’s favorite answer: it depends.
Whisper is clearly the most accurate of the three, but if you’re working on a time-sensitive project and transcribing something longer, the processing time might be a problem.
Parakeet, on the other hand, is definitely your best bet when speed matters more than precision. Say you’ve got the recording of a two-hour lecture, and just need a quick way to find a specific segment. In that case, giving up some precision for speed might be the way to go.
Apple’s model lands in the middle of the road, but not in a bad way. It’s closer to Parakeet in terms of speed, yet already manages to outperform it on accuracy. That’s pretty good for a first crack at it.
Granted, it’s still a far cry from Whisper, especially for high-stakes transcription work that requires minimal or no manual adjustments. But the fact that it runs natively, with no reliance on third-party APIs or external installs, is a big deal, especially as developer adoption ramps up and Apple continues to iterate.