Benchmarking Salad Transcription APIs: Salad Transcription and Transcription Lite
We recently completed extensive accuracy benchmarks of our two transcription APIs, Salad Transcription API and Transcription Lite. Our goal was to measure their accuracy across multiple languages using widely recognized, publicly available datasets, and to compare them against existing transcription solutions. In this post, we break down the methodology, workflow, and results of our transcription accuracy benchmark. For readers who want to reproduce the results, we also provide publicly available benchmarking scripts.
Overview of our AI Transcription APIs
At Salad, we provide two AI transcription APIs offering different features and capabilities to the market.
Our two main APIs for Speech-to-text transcription are:
- Salad Transcription API: Delivers the No. 1 transcription accuracy in the market at the lowest cost ($0.16/hour). This API includes all the standard transcription features, such as speaker identification, timestamps, and captions, as well as more comprehensive LLM-driven features such as summarization, multilingual translation, and insight analytics such as sentiment analysis and classification.
- Transcription Lite: Offers quicker, lower-latency transcription with standard accuracy and includes essential features like timestamps, speaker diarization, and captions. Pricing starts from $0.03 per hour, again the lowest cost in the industry compared to APIs with comparable features and accuracy.
For more information about all the features, check out our documentation.
Transcription accuracy benchmarking methodology
Accuracy is often the most critical factor when evaluating transcription services, particularly for professional applications. To fairly assess our services, we adopted a benchmarking approach similar to the one AssemblyAI used, relying on publicly available datasets. We initially focused on English-language datasets already processed by AssemblyAI so that we could compare directly against their results.
Datasets Used
We selected three datasets for our benchmarks:
- CommonVoice: An extensive, crowdsourced, multilingual collection of speech datasets provided by Mozilla. We used Common Voice Corpus 5.1, which features over 1 million validated English audio files totaling more than 1,500 hours of speech.
- Meanwhile Dataset: Consisting of 64 segments from “The Late Show with Stephen Colbert,” published as part of OpenAI’s Whisper release. Dataset Details
- TED-LIUM Dataset: A collection of English-language TED talk recordings. Dataset Details. Note: We excluded segments without audible speech to ensure accuracy.
Workflow
Our benchmarking process included:
Audio Preprocessing: Audio samples were uploaded to Salad S4 storage.
Transcription: Audio files were transcribed using both the Salad Transcription API and Transcription Lite.
Normalization: Both the predicted transcripts and the ground truth were normalized using the open-source Whisper Normalizer to ensure consistency by standardizing punctuation, capitalization, and formatting. Normalization ensures that minor formatting differences do not affect accuracy results.
Below are examples of how transcripts were adjusted:
Original:
- Truth: “everybody talks about happiness these days”
- Result: “ Everybody talks about happiness these days.”
After Normalization:
- Truth: “everybody talks about happiness these days”
- Result: “everybody talks about happiness these days”
Original:
- Truth: “i had somebody count the number of books with happiness in the title published in the last five years”
- Result: “ I had somebody count the number of books with happiness in the title published in the last five years.”
After Normalization:
- Truth: “i had somebody count the number of books with happiness in the title published in the last 5 years”
- Result: “i had somebody count the number of books with happiness in the title published in the last 5 years”
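As a rough illustration of the normalization step, the sketch below applies a simplified stand-in for the Whisper Normalizer: it lowercases the text, strips punctuation, and collapses whitespace. The function name `simple_normalize` is hypothetical, and this is not the library the benchmark used. In particular, it does not rewrite number words as digits ("five" to "5"), which the real normalizer does in the second example above.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Simplified stand-in for a transcript normalizer (NOT the Whisper Normalizer)."""
    # Fold compatibility characters and lowercase everything.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace punctuation with spaces, keeping word characters and apostrophes.
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


truth = "everybody talks about happiness these days"
result = " Everybody talks about happiness these days."
print(simple_normalize(truth) == simple_normalize(result))  # True
```

Applying the same function to both the ground truth and the prediction is what keeps capitalization and punctuation differences from counting as word errors.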
Accuracy Evaluation: We calculated Word Error Rate (WER) for each file, using the JiWER library, to objectively compare transcription accuracy across datasets. The average WER was then determined for each dataset. You can find all our benchmarking scripts here: https://github.com/SaladTechnologies/salad-transcription-accuracy-benchmarks
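For readers who want to see what the metric measures, here is a minimal pure-Python sketch of WER: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The benchmark itself used JiWER; the names `word_error_rate` and `mean_wer` below are hypothetical, and both functions expect already-normalized text. Averaging per-file scores with a plain mean is an assumption about how the dataset-level figure was computed.

```python
def word_error_rate(truth: str, result: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = truth.split(), result.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_wer(pairs):
    """Plain average of per-file WER scores (assumed aggregation)."""
    scores = [word_error_rate(truth, result) for truth, result in pairs]
    return sum(scores) / len(scores)


pairs = [
    ("denise hoovered the rug", "denise who was the rug"),
    ("capobianco wrote four novels jointly with william barton",
     "capo bianco wrote four novels jointly with william barton"),
]
print(word_error_rate(*pairs[0]))  # 0.5
print(mean_wer(pairs))             # 0.375
```

Note that a single misheard word can cost more than one error: splitting "capobianco" into two words counts as one substitution plus one insertion.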
Here is an example of the per-file output:
{
"truth": "The other man, dressed casually, watches the multicoloured radioactive clouds advance upon them.",
"result": " The other man, dressed casually, watches the multicolored radioactive clouds advance upon them.",
"wer": 0.0
},
{
"truth": "The Dutch outnumbered the Spanish army, but were caught off-guard by the Spanish attack.",
"result": " The Dutch outnumbered the Spanish army but were caught off guard by the Spanish attack.",
"wer": 0.0
},
{
"truth": "When Alvin was a little boy, he loved to watch Bud Spencer and Terence Hill.",
"result": " When Alvin was a little boy, he loved to watch Bud Spencer and Terrence Hill.",
"wer": 0.06666666666666667
},
{
"truth": "Capobianco wrote four novels jointly with William Barton.",
"result": " Capo Bianco wrote four novels jointly with William Barton.",
"wer": 0.25
},
{
"truth": "Denise hoovered the rug.",
"result": " Denise, who was the rug?",
"wer": 0.5
},
Benchmark results: Word Error Rate (WER) for English
| Dataset | Salad Transcription API | Salad Transcription Lite API | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
|---|---|---|---|---|---|---|---|---|
| Common Voice | 4.90% | 18.70% | 6.67% | 8.98% | 17.59% | 7.81% | 12.43% | 8.83% |
| Meanwhile | 4.30% | 16.70% | 4.77% | 7.27% | 11.67% | 6.73% | 5.56% | 9.75% |
| TED-LIUM | 4.20% | 8.20% | 7.21% | 9.12% | 11.69% | 9.27% | 8.98% | 7.30% |
Our benchmarks show that the Salad Transcription API delivers the lowest WER on all three English datasets compared to the other transcription services tested.
Expanding our benchmarks to more languages
After comparing our transcription APIs against all major competitors, we expanded our benchmarking efforts to include additional datasets and languages. Our goal is to measure performance across all languages and identify areas for further improvement.
The following table presents our latest benchmark results, showing accuracy and Word Error Rate (WER) for Salad Transcription API and Transcription Lite across multiple languages.
| Dataset | Sub-dataset | Language | Full API Accuracy | Lite Accuracy | Full API WER | Lite WER |
|---|---|---|---|---|---|---|
| TED-LIUM | tedlium | English | 95.8% | 91.8% | 4.2% | 8.2% |
| Meanwhile | Meanwhile | English | 95.7% | 83.3% | 4.3% | 16.7% |
| CommonVoice | cv-corpus-5.1-2020-06-22 | English | 95.1% | 81.3% | 4.9% | 18.7% |
| CommonVoice | cv-corpus-20.0-delta-2024-12-06 | English | 93.1% | 78.1% | 6.9% | 21.9% |
| CommonVoice | cv-corpus-8.0-2022-01-19 | Portuguese | 92% | 55% | 8% | 45% |
| CommonVoice | cv-corpus-10.0-delta-2022-07-04 | French | 92% | 54.3% | 8% | 45.7% |
| CommonVoice | cv-corpus-12.0-delta-2022-12-07 | Spanish | 94% | 58.2% | 6% | 42.8% |
| CommonVoice | cv-corpus-14.0-delta-2023-06-23 | Spanish | 96.8% | 79.5% | 3.2% | 20.5% |
| CommonVoice | cv-corpus-16.1-delta-2023-12-06 | Spanish | 95.7% | 70.9% | 4.3% | 29.1% |
| CommonVoice | cv-corpus-13.0-delta-2023-03-09 | German | 96.3% | 71.1% | 3.7% | 28.9% |
| CommonVoice | cv-corpus-20.0-2024-12-06 | Hindi | 84% | 0% (translates to Eng) | 16% | 100% |
| CommonVoice | | Italian | 93.3% | 54% | 6.7% | 46% |
| CommonVoice | | Russian | 96.4% | 60% | 3.6% | 40% |
| CommonVoice | cv-corpus-17.0-2024-03-15 | Hebrew | 84.2% | 12% | 15.8% | 88% |
| CommonVoice | cv-corpus-19.0-2024-09-13 | Kazakh | 51% | 0% | 49% | 100% |
| CommonVoice | cv-corpus-9.0-2022-04-27 | Urdu | 78.8% | 8.3% | 21.2% | 91.7% |
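The accuracy columns in the table above appear to be derived as 100% minus WER. Here is a minimal sketch of that conversion, assuming this relationship holds (the function name `accuracy_from_wer` is hypothetical). Because WER counts insertions, it can exceed 100%, so the result is floored at zero.

```python
def accuracy_from_wer(wer_percent: float) -> float:
    """Convert a WER percentage to an accuracy percentage, floored at 0."""
    return max(0.0, round(100.0 - wer_percent, 1))


print(accuracy_from_wer(4.2))    # 95.8
print(accuracy_from_wer(100.0))  # 0.0
```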
Salad Transcription API performs exceptionally well in English and major European languages, achieving high accuracy in: English, Spanish, German, French, Portuguese, Italian, and Russian.
However, there is room for improvement in certain languages, particularly Thai, Kazakh, Hebrew, Hindi, and Urdu. Transcription Lite, which is optimized for speed, currently performs well mainly in English, its base language.
Industry-Leading Pricing
While accuracy is a very important factor in choosing a transcription service, cost is just as important, especially for large-scale applications. Salad’s Transcription APIs are not only among the most accurate but also the most affordable compared to competitors.
Pricing Breakdown
- Salad Transcription API: Just $0.16 per audio hour
- Transcription Lite: Just $0.03 per audio hour
This makes Salad Transcription API the cheapest high-accuracy solution on the market, and Transcription Lite one of the most cost-effective near-real-time transcription services available.
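To make the price difference concrete, the short sketch below estimates the cost of a transcription workload from the per-hour rates listed above. The function name `estimate_cost` and the 1,000-hour workload are illustrative assumptions; actual billing granularity may differ.

```python
from decimal import Decimal

# Per-audio-hour rates as stated in the pricing breakdown.
RATES_PER_AUDIO_HOUR = {
    "Salad Transcription API": Decimal("0.16"),
    "Transcription Lite": Decimal("0.03"),
}


def estimate_cost(api: str, audio_hours: int) -> Decimal:
    """Estimated cost in USD for a given number of audio hours."""
    return RATES_PER_AUDIO_HOUR[api] * audio_hours


for api in RATES_PER_AUDIO_HOUR:
    print(f"{api}: ${estimate_cost(api, 1000):.2f}")
# Salad Transcription API: $160.00
# Transcription Lite: $30.00
```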
Key Takeaways from Our Benchmarks
Our benchmarking process, comparing Salad Transcription API and Salad Transcription Lite against major transcription providers and across multiple languages, has revealed several insights:
1. Leading accuracy in transcription
- Salad Transcription API consistently outperformed other transcription providers, achieving the lowest Word Error Rate (WER) on all three English datasets tested.
- In European languages such as Spanish, German, French, Portuguese, and Italian, our model also maintained accuracy levels above 90%.
2. Challenges in low-resource languages
- Some languages, particularly Hindi, Kazakh, Thai, and Hebrew, had lower accuracy, highlighting areas where further improvements are needed.
3. Transcription Lite accuracy vs speed
- While Transcription Lite offers near real-time transcription, its accuracy is lower than that of Salad Transcription API, particularly for non-English languages.
- It remains a great option for users who need fast, timestamped English speech-to-text processing at a lower cost.
Next Steps
- Expanding our dataset coverage to include more datasets and languages.
- Improving transcription for non-English languages, particularly low-resource languages.
We will continue updating our benchmarks and improving our transcription models to provide the best value, accuracy, and performance in the market. Stay tuned for more updates!
Schedule a call with our expert transcription team.

SaladCloud is the world’s largest distributed cloud computing network with 11,000+ daily GPUs and 450,000 GPUs contributing compute, all at the lowest cost in the market.
