SaladCloud Blog

AI Batch Transcription Benchmark: Transcribing 1 Million+ Hours of Videos in just 7 days for $1800

AI batch transcription of 1 million hours of video

AI batch transcription benchmark: Speech-to-text at scale

Building upon the inference benchmark of Parakeet TDT 1.1B for YouTube videos on SaladCloud, and with our ongoing efforts to enhance the system architecture and implementation for batch jobs, we successfully transcribed over 66,000 hours of YouTube videos using a Salad container group of 100 replicas running for 10 hours. Through this approach, we achieved a 1000-fold cost reduction while maintaining the same level of accuracy as managed transcription services. In this deep dive, we examine what the system architecture, performance/throughput, time and cost would look like if we were to transcribe 1 million hours of YouTube videos.

Prior to the test, we created a dataset based on publicly available videos on YouTube. It comprises over 4 million video URLs sourced from more than 5,000 YouTube channels, amounting to approximately 1.6 million hours of content. For detailed methods on collecting and processing data from YouTube on SaladCloud, as well as the reference design and example code, please refer to the guide.

System architecture for AI batch transcription pipeline

The transcription pipeline comprises:

Job Injection Strategy and Job Queue Settings

The provided job filler supports multiple job injection strategies. It can inject millions of hours of video URLs into the job queue instantly and then remain idle until the pipeline completes all tasks. A potential issue with this approach arises when certain nodes in the pipeline experience downtime and fail to process and remove jobs from the queue: those jobs reappear for other nodes to attempt, which can cause earlier injected jobs to be processed last and may not suit certain use cases.

For this test, we used a different approach. Each day we injected a large batch of jobs into the pipeline and monitored progress. When the queue neared emptiness, we switched to injecting only a few jobs at a time, with the goal of keeping the number of available jobs in the queue as low as possible for a period of time. This strategy allows us to prioritize completing older jobs before injecting a massive influx of new ones.

For time-sensitive tasks, we can also implement autoscaling. By continually monitoring the job count in the queue, the job filler dynamically adjusts the number of Salad node groups. This adaptive approach ensures that specific quantities of tasks can be completed within a predefined timeframe, while also offering the flexibility to manage costs efficiently during periods of reduced demand.

For the job queue system, we set the AWS SQS visibility timeout to 1 hour. This allows sufficient time for downloading, chunking, buffering, and processing by most of the nodes on SaladCloud until the final results are merged and uploaded to Cloudflare. If a node fails to process and remove polled jobs within the hour, the jobs become available again for other nodes to process. Additionally, the AWS SQS retention period is set to 2 days: once the retention period is reached, messages are automatically deleted. This prevents jobs from lingering in the queue for an extended period without being processed for any reason, thereby avoiding wasted node resources.
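A minimal sketch of those queue settings with boto3, assuming an existing queue; the region and queue URL are placeholders, and the values mirror the 1-hour visibility timeout and 2-day retention described above.

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")  # region is illustrative
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/transcription-jobs"  # placeholder

sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "VisibilityTimeout": str(60 * 60),             # 1 hour to download, chunk and transcribe a job
        "MessageRetentionPeriod": str(2 * 24 * 3600),  # messages auto-delete after 2 days
    },
)
```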
Enhanced node implementation

Transcribing audio involves resource-intensive operations on both the CPU and GPU, including format conversion, re-sampling, segmentation, transcription and merging. The more CPU operations involved, the lower the GPU utilization. Within each node in the GPU resource pool on SaladCloud, we follow best practices by utilizing two processes.

The inference process concentrates on GPU operations and runs on a single thread. It begins by loading the model and warming up the GPU, then listens on a TCP port by running a Python/FastAPI app on a Uvicorn server. Upon receiving a request, it invokes the transcription inference and promptly returns the generated assets.

The benchmark worker process primarily handles various I/O- and CPU-bound tasks, such as downloading/uploading, pre-processing, and post-processing. To maximize performance and scalability, we use multiple threads to handle tasks concurrently, with two queues created to facilitate information exchange among these threads (see the sketch at the end of this section):

Downloader: In most cases, we require 3 threads to concurrently pull jobs from the job queue, download audio files from YouTube, and efficiently feed the inference server. The downloader also performs the following pre-processing steps: 1) removal of bad audio files; 2) format conversion from M4A to MP3; 3) chunking very long audio into 10-minute clips; 4) metadata extraction (URL, file/clip ID, length). The pre-processed audio files are stored in a shared folder, and their metadata is added to the transcribing queue. To prevent the download of excessive audio files, we enforce a maximum length on the transcribing queue; when the queue reaches capacity, the downloader sleeps for a while.

Caller: It reads metadata from the transcribing queue and sends a synchronous request, including the audio filename, to the inference server. Upon receiving the response, it forwards the generated texts, along with statistics and the transcribed audio filename, to the reporting queue. The simplicity of the caller is crucial, as it directly influences inference performance.

Reporter: Upon reading the reporting queue, the reporter deletes the processed audio files from the shared folder and manages post-processing tasks, including merging results and calculating the real-time factor and word count. Finally, it uploads the generated assets to Cloudflare, reports the job results to AWS DynamoDB and deletes the processed jobs from AWS SQS.

By running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks, and by fetching and preparing the next audio clips concurrently while the current one is still being transcribed, we eliminate any waiting period. As soon as one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

1 Million hours of YouTube video batch transcription tests on SaladCloud

We established a container group with 100 replicas, each equipped with 2 vCPUs, 12 GB RAM, and a GPU with 8 GB or more VRAM on SaladCloud.
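To make the division of labor described above concrete, here is a minimal sketch of the worker's thread-and-queue layout. It is a simplification under stated assumptions: the helper functions (pull_job_from_sqs, download_and_preprocess, upload_and_report, cleanup) and the local inference endpoint are illustrative placeholders, not the actual implementation.

```python
import queue
import threading

import requests

# Placeholders for the real pipeline steps (SQS polling, YouTube download and
# pre-processing, Cloudflare upload, DynamoDB reporting, cleanup).
def pull_job_from_sqs(): raise NotImplementedError          # placeholder
def download_and_preprocess(job): raise NotImplementedError  # placeholder
def upload_and_report(job, result): raise NotImplementedError  # placeholder
def cleanup(job): raise NotImplementedError                  # placeholder

transcribing_q = queue.Queue(maxsize=20)  # bounded, so downloading pauses when the backlog is full
reporting_q = queue.Queue()

def downloader():
    while True:
        job = pull_job_from_sqs()
        audio_path = download_and_preprocess(job)
        transcribing_q.put((job, audio_path))  # blocks while the queue is at capacity

def caller():
    while True:
        job, audio_path = transcribing_q.get()
        resp = requests.post("http://127.0.0.1:8000/transcribe",
                             json={"audio_path": audio_path}, timeout=3600)
        reporting_q.put((job, resp.json()))

def reporter():
    while True:
        job, result = reporting_q.get()
        upload_and_report(job, result)
        cleanup(job)

threads = [threading.Thread(target=downloader, daemon=True) for _ in range(3)]
threads += [threading.Thread(target=caller, daemon=True),
            threading.Thread(target=reporter, daemon=True)]
for t in threads:
    t.start()
```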

AI Transcription Benchmark: 1 Million Hours of YouTube Videos with Parakeet TDT 1.1B for Just $1260, a 1000-fold cost reduction

AI transcription - Parakeet TDT 1.1B batch transcription compared against APIs

Building upon the inference benchmark of Parakeet TDT 1.1B on SaladCloud, and with our ongoing efforts to enhance the system architecture and implementation for batch jobs, we have achieved a 1000-fold cost reduction for AI transcription with Salad. This cost-performance comes while maintaining the same level of accuracy as other managed transcription services.

YouTube is the world's most widely used video-sharing platform, featuring a wealth of public content including talks, news, courses, and more. There might be instances where you need to quickly understand updates on a global event or summarize a topic, but you cannot watch the videos individually. In addition, the millions of YouTube videos are a gold mine of training data for many AI applications. Many companies need to do large-scale, batch AI transcription today, but cost is a prohibiting factor. In this deep dive, we use publicly available YouTube videos as datasets together with the high-speed ASR (Automatic Speech Recognition) model Parakeet TDT 1.1B, and explore methods for constructing a batch-processing system for large-scale AI transcription of videos using the substantial computational power of SaladCloud's massive network of consumer GPUs across a global, high-speed distributed network.

How to download YouTube videos for batch AI transcription

The Python library pytube is a lightweight tool for handling YouTube videos that can simplify our tasks significantly. First, pytube offers APIs for interacting with YouTube playlists, which are collections of videos usually organized around specific themes. Using these APIs, we can retrieve all the video URLs within a specific playlist. Second, prior to downloading a video, we can access its metadata, including details such as the title, video resolution, frames per second (fps), video codec, audio bit rate (abr), and audio codec. If a video on YouTube offers an audio-only stream, we can enhance efficiency by exclusively downloading its audio. This approach not only reduces bandwidth requirements but also results in substantial time savings, given that the video size is typically ten times larger than its corresponding audio. Below is a code snippet for downloading audio from YouTube.
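A minimal sketch using pytube, assuming audio-only streams are available; the playlist URL, stream selection details and output paths are illustrative.

```python
from pytube import Playlist, YouTube

playlist = Playlist("https://www.youtube.com/playlist?list=PLxxxxxxxx")  # placeholder URL

for url in playlist.video_urls:
    yt = YouTube(url)
    # Pick the highest-bitrate audio-only stream instead of the full video
    stream = yt.streams.filter(only_audio=True).order_by("abr").desc().first()
    if stream is None:
        continue  # skip videos without an audio-only stream
    stream.download(output_path="downloads", filename=f"{yt.video_id}.m4a")
    print(yt.video_id, yt.title, yt.length)  # metadata: ID, title, length in seconds
```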
The audio files downloaded from YouTube primarily utilize the MPEG-4 audio (M4A) format, commonly employed for streaming large audio tracks. We can convert these audio files from M4A to MP3, a format universally accepted by all ASR models. Additionally, the duration of audio files sourced from YouTube varies considerably, ranging from a few minutes to tens of hours. To leverage massive and cost-effective GPU types, and to optimize GPU resource utilization, it is essential to segment all lengthy audio into fixed-length clips before inputting them into the model (a sketch appears at the end of this section). The results can then be aggregated before returning the final transcription.

Advanced system architecture for massive video transcription

We can reuse our existing system architecture for audio transcription with a few enhancements. In a long-running batch-job system, implementing autoscaling becomes crucial. By continuously monitoring the job count in the message queue, we can dynamically adjust the number of Salad nodes or groups. This adaptive approach allows us to respond effectively to variations in system load, providing the flexibility to efficiently manage costs during lower-demand periods or enhance throughput during peak loads.

Enhanced node implementation for both video and audio AI transcription

Modifications have been made to the node implementation, enabling it to handle both video and audio for AI transcription. The inference process remains unchanged, running on a single thread and dedicated to GPU-based transcription. We have introduced additional features in the benchmark worker process, which is specifically designed to handle I/O- and CPU-bound tasks and runs multiple threads. Running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks provides the flexibility to update each component independently. Introducing multiple threads in the benchmark worker process to handle different tasks eliminates waiting periods by fetching and preparing the next audio clips in advance while the current one is still being transcribed. Consequently, as soon as one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time and increases system throughput but also results in more significant cost savings.

Massive YouTube video transcription tests on SaladCloud

We created a container group with 100 replicas (2 vCPUs and 12 GB RAM, with 20+ different GPU types) on SaladCloud. The group was operational for approximately 10 hours, from 10:00 pm to 8:00 am PST during weekdays, successfully downloading and transcribing a total of 68,393 YouTube videos. The cumulative length of these videos amounted to 66,786 hours, with an average duration of 3,515 seconds.

Hundreds of Salad nodes from networks worldwide actively engaged in the tasks. They are all positioned on high-speed networks, near the edges of YouTube's global CDN (with an average latency of 33 ms). This setup guarantees local access and ensures optimal system throughput for downloading content from YouTube.

According to the AWS DynamoDB metrics (specifically writes per second), which serve as a monitoring tool for transcription jobs, the system reached its maximum capacity roughly one hour after the container group was launched, processing approximately 2 videos (totaling 7,500 seconds) per second. The selected YouTube videos for this test vary widely in length, ranging from a few minutes to over 10 hours, causing notable fluctuations in the processing curve.

Let's compare the results of the two benchmark tests conducted on Parakeet TDT 1.1B for audio and video:

| | Parakeet Audio | Parakeet Video |
|---|---|---|
| Datasets | English CommonVoice and Spoken Wikipedia Corpus English | English YouTube videos, including public talks, news and courses |
| Average Input Length (s) | 12 | 3,515 |
| Cost on SaladCloud (GPU Resource Pool and Global Distribution Network) | Around $100: 100 replicas (2 vCPUs, 12 GB RAM, 20+ GPU types) for 10 hours | Around $100: 100 replicas (2 vCPUs, 12 GB RAM, 20+ GPU types) for 10 hours |
| Cost on AWS and Cloudflare (Job Queue/Recording System and Cloud Storage) | Around $20 | Around $2 |
| Node Implementation | 3 downloader threads; segmentation of long audio; merging texts | Download audio from YouTube playlists and videos; 3 downloader threads; segmentation of long audio; format conversion from M4A to MP3; merging texts |
| Number of Transcriptions | 5,209,130 | 68,393 |
| Total | | |
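As referenced above, here is a minimal sketch of the M4A-to-MP3 conversion and fixed-length chunking, assuming ffmpeg is installed on the node; the file names are illustrative and the 600-second clip length matches the 10-minute clips described in this article.

```python
import os
import subprocess

def convert_and_chunk(src_m4a: str, dst_prefix: str, clip_seconds: int = 600) -> None:
    """Convert an M4A download to MP3 and split it into fixed-length clips."""
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src_m4a,
            "-f", "segment", "-segment_time", str(clip_seconds),  # 600 s = 10-minute clips
            "-c:a", "libmp3lame",
            f"{dst_prefix}_%03d.mp3",
        ],
        check=True,
    )

os.makedirs("clips", exist_ok=True)
convert_and_chunk("downloads/abc123.m4a", "clips/abc123")
```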

Text-to-Speech (TTS) API Alternative: Self-Managed OpenVoice vs MetaVoice Comparison

Self-managed OpenVoice vs MetaVoice comparison: a text-to-speech API alternative

A cost-effective alternative to text-to-speech APIs

In the realm of text-to-speech (TTS) technology, two open-source models have recently garnered everyone's attention: OpenVoice and MetaVoice. Each model has unique capabilities in voice synthesis, and both were recently open sourced. We benchmarked both models on SaladCloud, highlighting the platform's efficiency, cost-effectiveness and ability to democratize advanced voice synthesis technologies. The benchmarks focused on self-managed OpenVoice and MetaVoice as a far cheaper alternative to popular text-to-speech APIs.

In this article, we delve deeper into each of these models, exploring their distinctive features, capabilities, price, speed, quality and how they can be used in real-world applications. Our goal is to provide a comprehensive understanding of these technologies, enabling you to make informed decisions about which model best suits your voice synthesis requirements. If you are serving TTS inference at scale, utilizing a self-managed, open-source model framework on a distributed cloud like Salad is 50-90% cheaper compared to APIs.

Efficiency and affordability on Salad's distributed cloud

Recently, we benchmarked OpenVoice and MetaVoice on SaladCloud's global network of distributed GPUs. Tapping into thousands of latent consumer GPUs, Salad's GPU prices start from $0.02/hour. With more than 1 million PCs on the network, Salad's distributed infrastructure provides the computational power needed to process large datasets swiftly, while its cost-efficient pricing model ensures that businesses can leverage these advanced technologies without breaking the bank. Running OpenVoice on Salad comes out to be 300 times cheaper than the Azure Text to Speech service. Similarly, MetaVoice on Salad is 11x cheaper than AWS Polly Long Form.

A common thread: open-source text-to-speech innovation

OpenVoice TTS, OpenVoice Cloning, and MetaVoice share a foundational principle: they are all open-source text-to-speech models. These models are not only free to use but also offer transparency in their development processes. Users can inspect the source code, contribute improvements, and customize the models to fit their specific needs, driving innovation in the TTS domain.

A closer look at each model: OpenVoice and MetaVoice

OpenVoice is an open-source, instant voice cloning technology that enables the creation of realistic and customizable speech from just a short audio clip of a reference speaker. Developed by MyShell.ai, OpenVoice stands out for its ability to replicate a voice's tone color while offering extensive control over various speech attributes such as emotion and rhythm. The OpenVoice voice replication process involves several key steps that can be used together or separately.

OpenVoice Base TTS

OpenVoice's base text-to-speech (TTS) engine is a cornerstone of its framework, efficiently transforming written text into spoken words. This component is particularly valuable in scenarios where the primary goal is text-to-speech conversion without the need for specific voice toning or cloning. The ease with which this part of the model can be isolated and utilized independently makes it a versatile tool, ideal for applications that demand straightforward speech synthesis.
OpenVoice Benchmark: 6 Million+ words per $ on Salad

OpenVoice Cloning

Building upon the base TTS engine, this feature adds a layer of sophistication by enabling the replication of a reference speaker's unique vocal characteristics. This includes the extraction and embodiment of tone color, allowing for the creation of speech that not only sounds natural but also carries the emotional and rhythmic nuances of the original speaker. OpenVoice's cloning capabilities extend to zero-shot cross-lingual voice cloning, a remarkable feature that allows for the generation of speech in languages not present in the training dataset. This opens up a world of possibilities for multilingual applications and global reach.

MetaVoice-1B

MetaVoice-1B is a robust 1.2-billion-parameter base model, trained on an extensive dataset of 100,000 hours of speech. Its design is focused on achieving natural-sounding speech with an emphasis on emotional rhythm and tone in English. A standout feature of MetaVoice-1B is its zero-shot cloning capability for American and British voices, requiring just 30 seconds of reference audio for effective replication. The model also supports cross-lingual voice cloning with fine-tuning, showing promising results with as little as one minute of training data for Indian speakers. MetaVoice-1B is engineered to capture the nuances of emotional speech, ensuring that the synthesized output resonates with listeners on a deeper level.

MetaVoice Benchmark: 23,300 words per $ on Salad

Benchmark results: price comparison of voice synthesis models on SaladCloud

The following table presents the results of our benchmark tests, where we ran OpenVoice TTS, OpenVoice Cloning, and MetaVoice on SaladCloud GPUs. For consistency, we used the text of Isaac Asimov's book "Robots and Empire", available on the Internet Archive, comprising approximately 150,000 words, and processed it through all compatible Salad GPUs.

| Model Name | Most Cost-Efficient GPU | Words per Dollar | Second Most Cost-Efficient GPU | Words per Dollar |
|---|---|---|---|---|
| OpenVoice TTS | RTX 2070 | 6.6 million | GTX 1650 | 6.1 million |
| OpenVoice Cloning | GTX 1650 | 4.7 million | RTX 2070 | 4.02 million |
| MetaVoice | RTX 3080 | 23,300 | RTX 3080 Ti | 15,400 |

Table: Comparison of OpenVoice Text-to-Speech, OpenVoice Cloning and MetaVoice

The benchmark results clearly indicate that OpenVoice, both in its TTS and Cloning variants, is significantly more cost-effective than MetaVoice. The OpenVoice TTS model, when run on an RTX 2070 GPU, achieves an impressive 6.6 million words per dollar, making it the most efficient option among the tested models. The price of the RTX 2070 on SaladCloud is $0.06/hour, which, together with the vCPU and RAM we used, comes to a total of $0.072/hour (a quick worked example follows at the end of this section). OpenVoice Cloning also demonstrates strong cost efficiency, particularly on the GTX 1650, which processes 4.7 million words per dollar. This is a notable advantage for applications requiring a less robotic voice. In contrast, MetaVoice's performance on the RTX 3080 and RTX 3080 Ti GPUs yields significantly fewer words per dollar, indicating a higher cost for processing speech. However, don't rush to dismiss MetaVoice just yet; the upcoming comparisons may offer a different perspective that could sway your opinion.
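Illustrative arithmetic only, derived from the figures above rather than an additional measurement: how the words-per-dollar number translates into effective throughput at the quoted $0.072/hour node price.

```python
words_per_dollar = 6_600_000   # OpenVoice TTS on RTX 2070 (table above)
node_cost_per_hour = 0.072     # $0.06 GPU plus vCPU/RAM on SaladCloud

words_per_hour = words_per_dollar * node_cost_per_hour
print(f"{words_per_hour:,.0f} words/hour (~{words_per_hour / 3600:.0f} words/second)")
# ~475,200 words/hour, i.e. roughly 130 words of synthesis per second
```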

MetaVoice AI Text-to-Speech (TTS) Benchmark: Narrate 100,000 words for only $4.29 on Salad

MetaVoice text-to-speech GPU benchmark on SaladCloud

Note: Do not miss out on listening to voice clones of 10 different celebrities reading Harry Potter and the Sorcerer's Stone towards the end of the blog.

Introduction to MetaVoice-1B

MetaVoice-1B is an advanced text-to-speech (TTS) model boasting 1.2 billion parameters, meticulously trained on a vast corpus of 100,000 hours of speech. Engineered with a focus on producing emotionally resonant English speech rhythms and tones, MetaVoice-1B stands out for its accuracy and realistic voice synthesis. One standout feature of MetaVoice-1B is its ability to perform zero-shot voice cloning: it requires only a 30-second audio snippet to accurately replicate American and British voices. It also includes cross-lingual cloning capabilities, demonstrated with as little as one minute of training data for Indian accents. A versatile tool released under the permissive Apache 2.0 license, MetaVoice-1B is designed for long-form synthesis.

The architecture of MetaVoice-1B

MetaVoice-1B's architecture is a testament to its innovative design. Combining causal GPT structures and non-causal transformers, it predicts a series of hierarchical EnCodec tokens from text and speaker information. This process includes condition-free sampling, enhancing the model's cloning proficiency. The text is processed through a custom-trained BPE tokenizer, optimizing the model's linguistic capabilities without the need to predict semantic tokens, a step often deemed necessary in similar technologies.

MetaVoice cloning benchmark methodology on SaladCloud GPUs

Encountered Limitations and Adaptations

During the evaluation, we encountered limitations with the maximum length of text that MetaVoice could process in one go. The default limit is 2,048 tokens per batch, but we noticed that even with a smaller number of tokens the model starts to behave unexpectedly. To work around this limit, we preprocessed our data by dividing the text into smaller segments, specifically two-sentence pieces, to accommodate the model's capabilities. To break the text into sentences, we used the Punkt Sentence Tokenizer.
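A minimal sketch of that two-sentence chunking with NLTK's Punkt tokenizer, assuming the nltk package is installed; the grouping size mirrors the description above.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # fetch the Punkt sentence tokenizer models

def two_sentence_chunks(text: str) -> list[str]:
    """Split text into sentences, then group them into two-sentence pieces."""
    sentences = sent_tokenize(text)
    return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
```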
The text source remained consistent with previous benchmarks, utilizing Isaac Asimov's "Robots and Empire", available from the Internet Archive. For the voice cloning component, we utilized a one-minute sample of Benedict Cumberbatch's narration. The synthesized output very closely mirrored the distinctive qualities of Cumberbatch's narration, demonstrating MetaVoice's cloning capabilities, which are the best we've yet seen. Here is a voice cloning example featuring Benedict Cumberbatch:

GPU Specifications and Selection

The MetaVoice documentation specifies the need for GPUs with 12 GB of VRAM or more. Despite this, our trials included GPUs with lower VRAM, which still performed adequately, though this required a careful selection process from Salad's GPU fleet to ensure compatibility. We standardized each node with 1 vCPU and 8 GB of RAM to maintain a consistent testing environment.

Benchmarking Workflow

The benchmarking procedure incorporated multi-threaded operations to enhance efficiency. The process involved downloading parts of the text and the voice reference sample from Azure in parallel and processing the text through the MetaVoice model. Completing the cycle, the resulting audio was then uploaded back to Azure. This comprehensive workflow was designed to simulate a typical application scenario, providing a realistic assessment of MetaVoice's operational performance on SaladCloud GPUs.

Benchmark Findings: Cost-Performance and Inference Speed

Words per Dollar Efficiency

Our benchmarking results reveal that the RTX 3080 GPU leads in cost-efficiency for MetaVoice, achieving an impressive 23,300 words per dollar. The RTX 3080 Ti follows with 15,400 words per dollar. These figures highlight the resource-intensive nature of MetaVoice, which requires powerful GPUs to operate efficiently.

Speed Analysis and GPU Requirements

Our speed analysis revealed that GPUs with 10 GB or more of VRAM performed consistently, processing approximately 0.8 to 1.2 words per second. In contrast, GPUs with lower VRAM demonstrated significantly reduced performance, rendering them unsuitable for running MetaVoice. This aligns with the developers' recommendation of using GPUs with at least 12 GB of VRAM to ensure optimal functionality.

Cost Analysis for an Average Book

To provide a practical perspective, let's consider the cost of converting an average book into speech using MetaVoice on SaladCloud GPUs, assuming an average book contains approximately 100,000 words. Creating a narration of "Harry Potter and the Sorcerer's Stone" by Benedict Cumberbatch would cost around $3.30 with an RTX 3080 and $5.00 with an RTX 3080 Ti. Here is an example of a voice clone of Benedict Cumberbatch reading Harry Potter:

Note that we did not change any model parameters or add business logic; we only added batch processing, sentence by sentence. We also cloned other celebrity voices to read the first page of Harry Potter and the Sorcerer's Stone. Here's a collection of different voice clones reading Harry Potter using MetaVoice.

MetaVoice GPU Benchmark on SaladCloud – Conclusion

In conclusion, the combination of MetaVoice and SaladCloud GPUs presents a cost-effective and high-quality solution for text-to-speech and voice cloning projects. Whether for large-scale audiobook production or specialized projects like celebrity-narrated books, this technology offers a new level of accessibility and affordability in voice synthesis. As we move forward, it will be exciting to see how these advancements continue to shape the landscape of digital content creation.

Parakeet TDT 1.1B Inference Benchmark on SaladCloud: 1,000,000 hours of transcription for Just $1260

Parakeet TDT 1.1B speech-to-text transcription benchmark

Parakeet TDT 1.1B GPU benchmark

The Automatic Speech Recognition (ASR) model Parakeet TDT 1.1B is the latest addition to NVIDIA's Parakeet family. It boasts unparalleled accuracy and significantly faster performance compared to other models in the same family. Using our latest batch-processing framework, we conducted comprehensive tests with Parakeet TDT 1.1B against extensive datasets, including English CommonVoice and the Spoken Wikipedia Corpus English (Part 1, Part 2). In this detailed GPU benchmark, we delve into the design and construction of a high-throughput, reliable and cost-effective batch-processing system on SaladCloud. Additionally, we conduct a comparative analysis of the inference performance of Parakeet TDT 1.1B and other popular ASR models, such as Whisper Large V3 and Distil-Whisper Large V2.

Advanced system architecture for batch jobs

Our latest batch-processing framework consists of:

HTTP handlers using AWS Lambda or Azure Functions can be implemented for both the Job Queue System and the Job Recording System. This provides convenient access, eliminating the need to install a specific cloud provider's SDK/CLIs within the application container image. We aimed to keep the framework components fully managed and serverless to closely simulate the experience of using managed transcription services. A decoupled architecture provides the flexibility to choose the best and most cost-effective solution for each component from the industry.

Enhanced Node Implementation for High Performance and Throughput

We have refined the node implementation to further enhance system performance and throughput. Within each node in the GPU resource pool on SaladCloud, we follow best practices by utilizing two processes:

1) Inference Process

Transcribing audio involves resource-intensive operations on both the CPU and GPU, including format conversion, re-sampling, segmentation, transcription and merging. The more CPU operations involved, the lower the GPU utilization. While it has the capacity to fully leverage the CPU, multiprocessing- or multithreading-based concurrent inference over a single GPU might limit optimal GPU cache utilization and impact performance, because each inference runs at its own layer or stage. The multiprocessing approach also consumes more VRAM, as every process requires a CUDA context and loads its own copy of the model into GPU VRAM for inference.

Following best practices, we delegate the more CPU-bound pre-processing and post-processing tasks to the benchmark worker process. This allows the inference process to concentrate on GPU operations and run on a single thread. The process begins by loading the model and warming up the GPU, then listens on a TCP port by running a Python/FastAPI app on a Uvicorn server. Upon receiving a request, it invokes the transcription inference and promptly returns the generated assets.

Batch inference can be employed to enhance performance by effectively leveraging the GPU cache and parallel processing capabilities. However, it requires more VRAM and might delay the return of results until every sample in the input batch is processed. The choice of whether to use batch inference, and of the optimal batch size, depends on the model, dataset, hardware characteristics and use case. It also requires experimentation and ongoing performance monitoring.
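A minimal sketch of such an inference process, assuming the NVIDIA NeMo toolkit; the endpoint name, payload shape and omitted warm-up step are illustrative rather than the exact implementation used in the benchmark.

```python
import nemo.collections.asr as nemo_asr
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load Parakeet TDT 1.1B once at startup; a short warm-up inference could follow here.
model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-1.1b")

class TranscribeRequest(BaseModel):
    audio_path: str  # path to a pre-processed clip in the shared folder

@app.post("/transcribe")
def transcribe(req: TranscribeRequest):
    # transcribe() takes a list of file paths; depending on the NeMo version the
    # result items may be plain strings or Hypothesis objects.
    result = model.transcribe([req.audio_path])[0]
    return {"text": str(result)}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```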
2) Benchmark Worker Process

The benchmark worker process primarily handles various I/O- and CPU-bound tasks, such as downloading/uploading, pre-processing, and post-processing. The Global Interpreter Lock (GIL) in Python permits only one thread at a time to execute Python code within a process. While the GIL can impact the performance of multithreaded applications, certain operations remain unaffected, such as I/O operations and calls to external programs. To maximize performance and scalability, we adopt multiple threads to handle various tasks concurrently, with several queues created to facilitate information exchange among these threads.

Downloader: In most cases, we require 2 to 3 threads to concurrently pull jobs and download audio files, efficiently feeding the inference pipeline while preventing the download of excessive audio files. The actual number depends on the characteristics of the application and dataset, as well as network performance. The downloader also performs the following pre-processing steps: 1) removal of bad audio files; 2) format conversion and re-sampling; 3) chunking very long audio into 15-minute clips; 4) metadata extraction (URL, file/clip ID, length). The pre-processed audio files and their corresponding metadata JSON files are stored in a shared folder, and the filenames of the JSON files are added to the transcribing queue.

Caller: It reads a JSON filename from the transcribing queue, retrieves the metadata by reading the corresponding file in the shared folder, and then sends a synchronous request, including the audio filename, to the inference server. Upon receiving the response, it forwards the generated texts along with statistics to the reporting queue, while sending the transcribed audio and JSON filenames to the cleaning queue. The simplicity of the caller is crucial, as it directly influences inference performance.

Reporter: Upon reading the reporting queue, the reporter manages post-processing tasks, including merging results and format conversion. Finally, it uploads the generated assets and reports the job results. Multiple threads may be required if the post-processing is resource-intensive.

Cleaner: After reading the cleaning queue, the cleaner deletes the processed audio files and their corresponding JSON files from the shared folder.

Running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks, and fetching and preparing the next audio clips concurrently and in advance while the current one is still being transcribed, eliminates any waiting period. After one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

Single-Node Test using JupyterLab on SaladCloud

Before deploying the application container image at large scale on SaladCloud, we can build a specialized application image with JupyterLab and conduct single-node tests across various types of Salad nodes. With JupyterLab's terminal, we can log into a container instance running on SaladCloud, gaining OS-level access. This enables us to conduct various tests and optimize the configurations and parameters of the model and application. These include:

Analysis of single-node test using JupyterLab

Based on our tests using JupyterLab, we found that the inference of Parakeet TDT 1.1B for audio files lasting

Whisper Large V3 Speech Recognition Benchmark: 1 Million hours of audio transcription for just $5110

Whisper Large V3 automatic speech recognition GPU benchmark

Save over 99.8% on audio transcription using Whisper Large V3 and consumer GPUs

A 99.8% cost saving for automatic speech recognition sounds unreal. But with the right choice of GPUs and models, it is very much possible, and it highlights the needless overspending on managed transcription services today. In this deep dive, we benchmark the latest Whisper Large V3 model from OpenAI for inference against the extensive English CommonVoice and Spoken Wikipedia Corpus English (Part 1, Part 2) datasets, delving into how we accomplished an exceptional 99.8% cost reduction compared to other public cloud providers. Building upon the inference benchmark of Whisper Large V2, and with our continued effort to enhance the system architecture and implementation for batch jobs, we have achieved substantial reductions in both audio transcription costs and time while maintaining the same level of accuracy as managed transcription services.

Behind The Scenes: Advanced System Architecture for Batch Jobs

Our batch-processing framework comprises the following:

We aimed to keep the framework components fully managed and serverless to closely simulate the experience of using managed transcription services. A decoupled architecture provides the flexibility to choose the best and most cost-effective solution for each component from the industry. Within each node in the GPU resource pool on SaladCloud, two processes are utilized following best practices: one dedicated to GPU inference and another focused on I/O- and CPU-bound tasks, such as downloading/uploading, pre-processing, and post-processing.

1) Inference Process

The inference process operates on a single thread. It begins by loading the Whisper Large V3 model and warming up the GPU, then listens on a TCP port by running a Python/FastAPI app on a Uvicorn server. Upon receiving a request, it calls the transcription inference and returns the generated assets. The chunking algorithm is configured for batch processing: long audio files are segmented into 30-second clips, and these clips are fed into the model simultaneously. Batch inference significantly enhances performance by effectively leveraging the GPU cache and parallel processing capabilities.
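A minimal sketch of that chunked, batched inference using the Hugging Face transformers pipeline, assuming a local audio file; the checkpoint name, chunk length and batching mirror the description above, while the file name, batch size and device index are illustrative.

```python
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Long audio is split into 30-second chunks that are batched through the model together.
result = asr("audio.mp3", chunk_length_s=30, batch_size=16, return_timestamps=False)
print(result["text"])
```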
2) Benchmark Worker Process

The benchmark worker process primarily handles various I/O tasks, as well as pre- and post-processing. Multiple threads perform various tasks concurrently: one thread pulls jobs and downloads audio clips; another calls the inference; and the remaining threads manage tasks such as uploading generated assets, reporting job results and cleaning the environment. Several queues are created to facilitate information exchange among these threads. Running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks, and fetching the next audio clips early while the current one is still being transcribed, allows us to eliminate any waiting period. After one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

Deployment on SaladCloud

We created a container group with 100 replicas (2 vCPUs and 12 GB RAM, with 20 different GPU types) on SaladCloud and ran it for approximately 10 hours. In this period, we successfully transcribed over 2 million audio files, totaling nearly 8,000 hours in length. The test incurred around $100 in SaladCloud costs and less than $10 across AWS and Cloudflare.

Results from the Whisper Large V3 benchmark

Among the 20 GPU types, based on the current datasets, the RTX 3060 stands out as the most cost-effective GPU type for long audio files exceeding 30 seconds. Priced at $0.10 per hour on SaladCloud, it can transcribe nearly 200 hours of audio per dollar. For short audio files lasting less than 30 seconds, several GPU types exhibit similar performance, transcribing approximately 47 hours of audio per dollar.

On the other hand, the RTX 4080 outperforms the others as the best-performing GPU type for long audio files exceeding 30 seconds, with an average real-time factor of 40. This implies that the system can transcribe 40 seconds of audio per second. For short audio files lasting less than 30 seconds, the best average real-time factor is approximately 8, achieved by a couple of GPU types, indicating the ability to transcribe 8 seconds of audio in just 1 second.

Analysis of the benchmark results

Unlike results obtained in local tests with several machines on a LAN, all of these numbers were achieved in a global, distributed cloud environment that provides transcription at large scale, covering the entire process from receiving requests to transcribing and sending the responses. There are various methods to optimize the results; aiming for reduced costs, improved performance, or both, different approaches may yield distinct outcomes.

The Whisper models come in five configurations of varying size: tiny, base, small, medium, and large (v1/v2/v3). The large versions are multilingual and offer better accuracy, but they demand more powerful GPUs and run relatively slowly. The smaller versions support only English with slightly lower accuracy, but they require less powerful GPUs and run very fast. Choosing more cost-effective GPU types in the resource pool will result in additional cost savings. If performance is the priority, selecting higher-performing GPU types is advisable, while still remaining significantly less expensive than managed transcription services. Additionally, audio length plays a crucial role in both performance and cost, and it is essential to optimize the resource configuration based on your specific use cases and business goals.

Discover our open-source code for a deeper dive: the implementation of the Inference and Benchmark Worker Docker images, the Data Exploration Tool, and the Performance Comparison across Different Clouds.

The results indicate that AI transcription companies are massively overpaying for cloud today. With the open-source automatic speech recognition model Whisper Large V3 and the advanced batch-processing architecture leveraging hundreds of consumer GPUs on SaladCloud, we can deliver transcription services at massive scale and at an exceptionally low cost, while maintaining the same level of accuracy as managed transcription services. With the most cost-effective GPU type for Whisper Large V3 inference on SaladCloud, $1 can transcribe 11,736 minutes of audio (nearly 200 hours), showcasing a 500-fold cost reduction.
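A quick check of those headline figures, derived purely from the numbers quoted above (the RTX 3060's $0.10/hour price and 11,736 minutes per dollar); this is illustrative arithmetic, not an additional measurement.

```python
minutes_per_dollar = 11_736
gpu_price_per_hour = 0.10  # RTX 3060 on SaladCloud

audio_hours_per_dollar = minutes_per_dollar / 60                      # ~195.6 hours of audio per dollar
audio_hours_per_gpu_hour = audio_hours_per_dollar * gpu_price_per_hour

print(f"~{audio_hours_per_dollar:.0f} hours of audio per dollar, "
      f"implying a real-time factor of roughly {audio_hours_per_gpu_hour:.0f} on this GPU type")
# -> ~196 hours per dollar, i.e. a real-time factor of about 20
```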