SaladCloud Blog


Inference Benchmark on Salad: Distil-Whisper Large V2 vs. Whisper Large V3 for Speech-to-text

Salad Technologies

Hugging Face Distil-Whisper Large V2 is a distilled version of the OpenAI Whisper model that is 6 times faster, 49% smaller and performs within 1%  WER (word error rates) on out-of-distribution evaluation sets. However, it is currently only available for English speech recognition.

Building upon the insights from the previous inference benchmark of Whisper Large V3, we have conducted tests with Distill-Whisper Large V2 by using the same batch-processing infrastructure, against the same English CommonVoice and Spoken Wikipedia Corpus English (Part1, Part2) datasets. Let’s explore the distinctions between these two automatic speech recognition models in terms of cost and performance, and see which one better fits your needs.

Batch-Processing Framework

We used the same batch processing framework to test the two models, consisting of:

  • GPU Resource Pool: Hundreds of Salad nodes equipped with dedicated GPUs. Within each node, two processes are utilized following best practices: one dedicated on GPU inference to transcribe audio file, and another focused on I/O- and CPU-bound tasks, such as pulling jobs and downloading audio files, uploading generated assets and reporting task results.
  • Cloud Storage: Audio files and generated assets stored in Cloudflare R2, which is AWS S3-compatible and incurs zero egress fees.
  • Job Queue System: The Salad nodes retrieve jobs via AWS SQS, providing unique identifiers and accessible URLs for audio clips in Cloudflare R2. Direct data access without a job queue is also possible based on specific business logic. 
  • Job Recording System: Job results, including processing time, input audio URLs, output text URLs, etc., are stored in DynamoDB.

Advanced Batch-Processing Architecture for Massive Transcription

We aimed to adopt a decoupled architecture that provides the flexibility to choose the best and most cost-effective solution for each component from the industry. Simultaneously, we strived to keep the framework components fully managed and serverless to closely simulate the experience of using managed transcription services. 

Furthermore, two processes are employed in each node to segregate GPU-bound tasks from I/O and CPU-bound tasks, and fetching the next audio clips earlier while the current one is still being transcribed, allows us to eliminate any waiting period.  After one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

Advanced batch processing architecture for massive transcription

We also offer a data exploration tool – Jupyter notebook, designed to assist in preparing and analyzing inference benchmarks for various recognition models running on SaladCloud.

Discover our open-source code for a deeper dive:

Implementation of Inference and Benchmark Worker

Docker Images

Data Exploration Tool

Massive Transcription Tests for Whisper Large on SaladCloud

We launched two container groups on Salad, with each one dedicated to one of the two models. 

Each group was configured with 100 replicas (2vCPU and 12 GB RAM with all GPU types with 8GB or more VRAM) in SaladCloud, and ran for approximately 10 hours. Here is the comparison:

10-hour Transcription with 100 Replicas in Salad CloudWhisper Large V3Distil-Whisper Large V2
Number of Transcribed Audio Files 2,364,838 3,846,559
Total Audio Length (s)28,554,156  (8000 hours)47,207,294 (13113 hours)
Average Audio Length (s)12 12
Cost on SaladCloud
(GPU resource pool)
Around $100
100 Replicas (2vCPU,12GB RAM),
20 GPU types actually used
Around $100
100 Replicas (2vCPU,12GB RAM),
22 GPU types actually used
Cost on AWS and Cloudflare
(Job queue/Recording system & cloud storage)
Less than $10Around $15
Most Cost-Effective GPU Type for transcribing long audio files exceeding 30 secondsRTX 3060 
196 hours per dollar
RTX 2080 Ti
500 hours per dollar 
Most Cost-Effective GPU Type for transcribing short audio files lasting less than 30 secondsRTX 2080/3060/3060Ti/3070Ti
47 hours per dollar
RTX 2080/3060 Ti
90 hours per dollar
Best-Performing GPU Type for transcribing long audio files exceeding 30 secondsRTX 4080
Average real-time factor: 40,
transcribing 40 seconds of audio per second
RTX 4090
Average real-time factor: 93,
transcribing 93 seconds of audio per second
Best-Performing GPU Type for transcribing short audio files lasting less than 30 secondsRTX 3080Ti/4070Ti/4090
Average real-time factor: 8,
transcribing 8 seconds of audio per second
RTX 4090
Average real-time factor: 14,
transcribing 14 seconds of audio per second
System Throughput Transcribing 793 seconds of audio per secondTranscribing 1311 seconds of audio per second

Performance evaluation of the two Whisper Large models

Different from those obtained in local tests with a few machines in a LAN, all these numbers are achieved in a global and distributed cloud environment that provides the transcription at a large scale, including the entire process from receiving requests to transcribing and sending the responses.

Evidently, Distill-Whisper Large V2 outperforms Whisper Large V3 in both cost and performance. With a model size half that of Whisper, Distill-Whisper can run faster and allow for the utilization of a broader range of GPU types. This significant reduction in cost and improvement in performance are noteworthy outcomes. If the requirement is mainly for English speech recognition, Distill-Whisper is recommended.

While Distil-Whisper boasts a sixfold increase in speed compared to Whisper, the real-world tests indicate that its system throughput is only 165% of that of Whisper. Similar to a car, its top speed depends on factors such as the engine, gears, chassis, and tires. Simply upgrading the engine to one that is 200% more powerful doesn’t assure a doubling of the maximum speed. 

In a global and distributed cloud environment offering services at a large scale, various factors can impact overall performance, including distance, transmission and processing delays, model size, and performance, among others. Hence, a comprehensive approach is essential for system design and implementation, encompassing use cases, business goals, resource configuration, and algorithms, among other considerations.

Performance Comparison across Different Clouds

Thanks to the open-source speech recognition model, Whisper Large V3, and the advanced batch-processing architecture harnessing hundreds of consumer GPUs on SaladCloud, we have already achieved a remarkable 500-fold cost reduction while retaining the same level of accuracy as other public cloud providers: $1 dollar can transcribe 11,736 minutes of audio (nearly 200 hours) with the most cost-effective GPU type.

Distil-Whisper Large V2 propels us even further, delivering an incredible 1000-fold cost reduction for English speech recognition, compared to managed transcription services. This equates to $1 transcribing 29,994 minutes of audio, or nearly 500 hours. We believe these will fundamentally transform the speech recognition industry.

Minutes per dollar transcribed for Distil-Whisper 2: An inference comparison across SaladCloud and Speech-to-text APIs

SaladCloud: The Most Affordable GPU Cloud for Massive Audio Transcription

For voice AI companies in pursuit of cost-effective and robust GPU solutions at scale, SaladCloud emerges as a game-changing solution. Boasting the market’s most competitive GPU prices, it tackles the issues of soaring cloud expenses and constrained GPU availability.

In an era where cost-efficiency and performance take precedence, choosing the right tools and architecture can be transformative. Our recent Inference Benchmark of Whisper Large V3 and Distil-Whisper Large V2 exemplifies the savings and efficiency that can be achieved through innovative approaches. We encourage developers and startups to explore our open-source resources and witness the potential firsthand.

Have questions about SaladCloud for your workload?

Book a 15 min call with our team. Get $50 in testing credits.

Related Blog Posts

AI transcription - Parakeet TRT 1.1B batch transription compared against APIs

AI Transcription Benchmark: 1 Million Hours of Youtube Videos with Parakeet TDT 1.1B for Just $1260, a 1000-fold cost reduction 

Building upon the inference benchmark of Parakeet TDT 1.1B on SaladCloud and with our ongoing efforts to enhance the system architecture and implementation for batch jobs, we have achieved a 1000-fold...
Read More
Self-managed Openvoice vs Metavoice comparison: A Text to speech API alternative

Text-to-Speech (TTS) API Alternative: Self-Managed OpenVoice vs MetaVoice Comparison

A cost-effective alternative to Text-to-speech APIs In the realm of text-to-speech (TTS) technology, two open-source models have recently garnered everyone's attention: OpenVoice and MetaVoice. Each model has unique capabilities in...
Read More

Cost-effective Stable Diffusion fine tuning on Salad

Stable Diffusion XL (SDXL) fine tuning as a service I recently wrote a blog about fine tuning Stable Diffusion XL (SDXL) on interruptible GPUs at low cost, starring my dog...
Read More

Don’t miss anything!

Subscribe To SaladCloud Newsletter & Stay Updated.