Salad Transcription API Accuracy Benchmark: How it outperforms Deepgram, Assembly AI & AWS

SaladCloud

Benchmarking Salad Transcription APIs: Salad Transcription and Transcription Lite

We recently completed extensive accuracy benchmarks of our two transcription APIs, the Salad Transcription API and Transcription Lite. Our goal was to measure their accuracy across multiple languages using widely recognized, publicly available datasets, and to compare them against existing transcription solutions. In this post, we break down the methodology, workflow, and results of the benchmark. For readers who want to reproduce it, we also provide publicly available scripts.

Overview of our AI Transcription APIs

At Salad, we provide two AI speech-to-text transcription APIs with different features and capabilities:

  • Salad Transcription API: Delivers the highest transcription accuracy in our benchmarks at the lowest cost ($0.16 per audio hour). It includes the standard transcription features (speaker identification, timestamps, and captions) as well as LLM-driven features such as summarization, multilingual translation, and analytics such as sentiment analysis and classification.
  • Transcription Lite: Offers faster, lower-latency transcription with standard accuracy and includes essential features such as timestamps, speaker diarization, and captions. Pricing starts from $0.03 per audio hour, again the lowest in the industry among APIs with comparable features and accuracy.

For more information about all the features, check out our documentation.

Transcription accuracy benchmarking methodology

Accuracy is often the most critical factor when evaluating transcription services, particularly for professional applications. To assess our services fairly, we adopted a benchmarking approach similar to the one AssemblyAI used, based on publicly available datasets. We initially focused on English-language datasets that AssemblyAI had already processed, so that our results could be compared directly with theirs.

Datasets Used

We selected three datasets for our benchmarks:

  • CommonVoice: An extensive, crowdsourced multilingual speech corpus provided by Mozilla. We used Common Voice Corpus 5.1, which features over 1 million validated English audio files, or more than 1,500 hours of speech.
  • Meanwhile Dataset: Consists of 64 segments from “The Late Show with Stephen Colbert,” published as part of OpenAI’s Whisper release.
  • TED-LIUM Dataset: A collection of English-language TED talk recordings. Note: we excluded segments without audible speech so that they would not skew the results.

Workflow

Our benchmarking process included:

Audio Preprocessing: Audio samples were uploaded to Salad S4 storage.

Transcription: Audio files were transcribed using both the Salad Transcription API and Transcription Lite.

Normalization: Both the predicted transcripts and the ground truth were normalized using the open-source Whisper Normalizer to ensure consistency by standardizing punctuation, capitalization, and formatting. Normalization ensures that minor formatting differences do not affect accuracy results.

Below are examples of how transcripts were adjusted:

Original:

  • Truth: “everybody talks about happiness these days”
  • Result: “ Everybody talks about happiness these days.”

After Normalization:

  • Truth: “everybody talks about happiness these days”
  • Result: “everybody talks about happiness these days”

Original:

  • Truth: “i had somebody count the number of books with happiness in the title published in the last five years”
  • Result: “ I had somebody count the number of books with happiness in the title published in the last five years.”

After Normalization:

  • Truth: “i had somebody count the number of books with happiness in the title published in the last 5 years”
  • Result: “i had somebody count the number of books with happiness in the title published in the last 5 years”
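
To illustrate the idea, here is a minimal sketch of a transcript normalizer. It is a simplified stand-in and not the Whisper Normalizer used in the benchmark: the function name simple_normalize is hypothetical, and it only lowercases text, strips punctuation, and collapses whitespace. As the second example above shows, the Whisper Normalizer also converts spelled-out numbers to digits, which this sketch does not attempt.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Simplified, illustrative normalizer (an assumption, not the Whisper Normalizer)."""
    # Unify Unicode forms and ignore capitalization.
    text = unicodedata.normalize("NFKC", text).lower()
    # Replace every character that is not a word character, whitespace,
    # or an apostrophe with a space (so "off-guard" becomes "off guard").
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


truth = "everybody talks about happiness these days"
result = " Everybody talks about happiness these days."
print(simple_normalize(truth) == simple_normalize(result))  # True
```

Applying the same function to both the ground truth and the prediction is what keeps formatting differences from being counted as word errors.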

Accuracy Evaluation: We calculated Word Error Rate (WER) for each file, using the JiWER library, to objectively compare transcription accuracy across datasets. The average WER was then determined for each dataset. You can find all our benchmarking scripts here: https://github.com/SaladTechnologies/salad-transcription-accuracy-benchmarks

Here is a sample of the per-file results:

    [
        {
            "truth": "The other man, dressed casually, watches the multicoloured radioactive clouds advance upon them.",
            "result": " The other man, dressed casually, watches the multicolored radioactive clouds advance upon them.",
            "wer": 0.0
        },
        {
            "truth": "The Dutch outnumbered the Spanish army, but were caught off-guard by the Spanish attack.",
            "result": " The Dutch outnumbered the Spanish army but were caught off guard by the Spanish attack.",
            "wer": 0.0
        },
        {
            "truth": "When Alvin was a little boy, he loved to watch Bud Spencer and Terence Hill.",
            "result": " When Alvin was a little boy, he loved to watch Bud Spencer and Terrence Hill.",
            "wer": 0.06666666666666667
        },
        {
            "truth": "Capobianco wrote four novels jointly with William Barton.",
            "result": " Capo Bianco wrote four novels jointly with William Barton.",
            "wer": 0.25
        },
        {
            "truth": "Denise hoovered the rug.",
            "result": " Denise, who was the rug?",
            "wer": 0.5
        }
    ]
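
For readers who want to see how such per-file values arise, the sketch below computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the ground truth, then averages the per-file scores. It is a plain-Python illustration rather than the JiWER-based code from our scripts; the function name word_error_rate is hypothetical, and the inputs are assumed to be lowercased and stripped of punctuation already.

```python
from statistics import mean


def word_error_rate(truth: str, result: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = truth.split(), result.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two pairs from the sample results above, lowercased and without punctuation.
pairs = [
    ("denise hoovered the rug", "denise who was the rug"),
    ("capobianco wrote four novels jointly with william barton",
     "capo bianco wrote four novels jointly with william barton"),
]
scores = [word_error_rate(t, r) for t, r in pairs]
print(scores)        # [0.5, 0.25]
print(mean(scores))  # 0.375
```

In the first pair, "hoovered" is replaced by "who" and "was" is inserted, so two errors against four reference words give a WER of 0.5.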

Benchmark results: Word Error Rate (WER) for English

| Dataset | Salad Transcription API | Salad Transcription Lite API | AssemblyAI Universal | Amazon Transcribe | Google Latest-long | Microsoft Azure Batch v3.1 | Deepgram Nova 2 | OpenAI Whisper |
|---|---|---|---|---|---|---|---|---|
| Common Voice | 4.90% | 18.70% | 6.67% | 8.98% | 17.59% | 7.81% | 12.43% | 8.83% |
| Meanwhile | 4.30% | 16.70% | 4.77% | 7.27% | 11.67% | 6.73% | 5.56% | 9.75% |
| TED-LIUM | 4.20% | 8.20% | 7.21% | 9.12% | 11.69% | 9.27% | 8.98% | 7.30% |


Our benchmarks show that the Salad Transcription API had the lowest WER on all three English datasets among the transcription services compared.

Expanding our benchmarks to more languages

After comparing our transcription APIs against all major competitors, we expanded our benchmarking efforts to include additional datasets and languages. Our goal is to measure performance across all languages and identify areas for further improvement.

The following table presents our latest benchmark results, showing accuracy and Word Error Rate (WER) for Salad Transcription API and Transcription Lite across multiple languages.

| Dataset | Sub-dataset | Language | Full API Accuracy | Lite Accuracy | Full API WER | Lite WER |
|---|---|---|---|---|---|---|
| TED-LIUM | tedlium | English | 95.8% | 91.8% | 4.2% | 8.2% |
| Meanwhile | Meanwhile | English | 95.7% | 83.3% | 4.3% | 16.7% |
| CommonVoice | cv-corpus-5.1-2020-06-22 | English | 95.1% | 81.3% | 4.9% | 18.7% |
| CommonVoice | cv-corpus-20.0-delta-2024-12-06 | English | 93.1% | 78.1% | 6.9% | 21.9% |
| CommonVoice | cv-corpus-8.0-2022-01-19 | Portuguese | 92% | 55% | 8% | 45% |
| CommonVoice | cv-corpus-10.0-delta-2022-07-04 | French | 92% | 54.3% | 8% | 45.7% |
| CommonVoice | cv-corpus-12.0-delta-2022-12-07 | Spanish | 94% | 58.2% | 6% | 42.8% |
| CommonVoice | cv-corpus-14.0-delta-2023-06-23 | Spanish | 96.8% | 79.5% | 3.2% | 20.5% |
| CommonVoice | cv-corpus-16.1-delta-2023-12-06 | Spanish | 95.7% | 70.9% | 4.3% | 29.1% |
| CommonVoice | cv-corpus-13.0-delta-2023-03-09 | German | 96.3% | 71.1% | 3.7% | 28.9% |
| CommonVoice | cv-corpus-20.0-2024-12-06 | Hindi | 84% | 0% (translates to English) | 16% | 100% |
| CommonVoice | | Italian | 93.3% | 54% | 6.7% | 46% |
| CommonVoice | | Russian | 96.4% | 60% | 3.6% | 40% |
| CommonVoice | cv-corpus-17.0-2024-03-15 | Hebrew | 84.2% | 12% | 15.8% | 88% |
| CommonVoice | cv-corpus-19.0-2024-09-13 | Kazakh | 51% | 0% | 49% | 100% |
| CommonVoice | cv-corpus-9.0-2022-04-27 | Urdu | 78.8% | 8.3% | 21.2% | 91.7% |

Salad Transcription API performs exceptionally well in English and major European languages, achieving high accuracy in: English, Spanish, German, French, Portuguese, Italian, and Russian.

However, there is room for improvement in certain languages, particularly Thai, Kazakh, Hebrew, Hindi, and Urdu. Transcription Lite currently performs best in English, its base language, because it is optimized for speed.

Industry-Leading Pricing

While accuracy is a very important factor in choosing a transcription service, cost is just as important, especially for large-scale applications. Salad’s Transcription APIs are not only among the most accurate but also the most affordable compared to competitors.

Pricing Breakdown

  • Salad Transcription API: Just $0.16 per audio hour
  • Transcription Lite: Just $0.03 per audio hour
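
As a rough illustration of what these rates mean at scale, the sketch below multiplies a number of audio hours by each hourly rate. The 10,000-hour workload and the function name transcription_cost are hypothetical, and the calculation assumes a flat per-hour rate with no volume tiers or minimums.

```python
from decimal import Decimal

# Per-audio-hour rates quoted in this post (USD).
RATES = {
    "Salad Transcription API": Decimal("0.16"),
    "Transcription Lite": Decimal("0.03"),
}


def transcription_cost(audio_hours: int, rate_per_hour: Decimal) -> Decimal:
    """Cost in USD for a number of audio hours at a flat hourly rate (assumed)."""
    return audio_hours * rate_per_hour


# Hypothetical workload of 10,000 audio hours.
for name, rate in RATES.items():
    print(f"{name}: ${transcription_cost(10_000, rate):,.2f}")
# Salad Transcription API: $1,600.00
# Transcription Lite: $300.00
```

Decimal is used instead of float so that the dollar amounts stay exact.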

This makes the Salad Transcription API the cheapest high-accuracy solution on the market, and Transcription Lite one of the most cost-effective near-real-time transcription services available.

Key Takeaways from Our Benchmarks

Our benchmarking process, comparing Salad Transcription API and Salad Transcription Lite against major transcription providers and across multiple languages, has revealed several insights:

1. Leading accuracy in transcription

  • Salad Transcription API outperformed the other transcription providers, achieving the lowest Word Error Rate (WER) on all three English datasets tested.
  • In European languages such as Spanish, German, French, Portuguese, and Italian, our model also maintained accuracy levels above 90%.

2. Challenges in low-resource languages

  • Some languages, particularly Hindi, Kazakh, Thai, and Hebrew, had lower accuracy, highlighting areas where further improvements are needed.

3. Transcription Lite accuracy vs speed

  • While Transcription Lite offers near real-time transcription, its accuracy is lower than that of the Salad Transcription API, particularly for non-English languages.
  • It remains a great option for English-language users who need fast, timestamped speech-to-text processing at a lower cost.

Next Steps

Expanding our dataset coverage to include more datasets and languages.

Improving transcription for non-English languages, particularly in low-resource languages.

We will continue updating our benchmarks and improving our transcription models to provide the best value, accuracy, and performance in the market. Stay tuned for more updates!
