SaladCloud Blog

Text-to-Speech (TTS) API Alternative: Self-Managed OpenVoice vs MetaVoice Comparison

A cost-effective alternative to text-to-speech APIs

In the realm of text-to-speech (TTS) technology, two open-source models have recently attracted wide attention: OpenVoice and MetaVoice. Each model has its own strengths in voice synthesis, and both were open sourced recently. We benchmarked both models on SaladCloud, and the results show how efficient and cost-effective self-managed voice synthesis can be. The benchmarks focused on self-managed OpenVoice and MetaVoice as a far cheaper alternative to popular text-to-speech APIs.

In this article, we look more closely at each model: its distinctive features, capabilities, price, speed, and quality, and how it can be used in real-world applications. Our goal is to help you decide which model best suits your voice synthesis requirements.

If you are serving TTS inference at scale, running a self-managed, open-source model on a distributed cloud like Salad is 50-90% cheaper than using APIs.

Efficiency and affordability on Salad’s distributed cloud

Recently, we benchmarked OpenVoice and MetaVoice on SaladCloud’s global network of distributed GPUs. Tapping into thousands of latent consumer GPUs, Salad’s GPU prices start from $0.02/hour. With more than 1 million PCs on the network, Salad’s distributed infrastructure provides the computational power needed to process large datasets swiftly, while its pricing lets businesses use these technologies without breaking the bank.

In our benchmark, running OpenVoice on Salad came out 300 times cheaper than the Azure Text to Speech service. Similarly, MetaVoice on Salad was 11 times cheaper than AWS Polly Long Form.
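To make comparisons like these concrete, an API price quoted per million characters can be converted to words per dollar and set against a self-managed figure. The following is a minimal sketch, not the method used for the benchmark: the function names, the $100 per million characters price, and the average of 5 characters per word are illustrative assumptions. Only the 23,300 words-per-dollar figure comes from this article.

```python
def api_words_per_dollar(price_per_million_chars: float,
                         avg_chars_per_word: float) -> float:
    """Convert a per-million-characters API price into words per dollar."""
    chars_per_dollar = 1_000_000 / price_per_million_chars
    return chars_per_dollar / avg_chars_per_word


def cost_ratio(self_managed_words_per_dollar: float,
               api_wpd: float) -> float:
    """How many times more words one dollar buys on the self-managed setup."""
    return self_managed_words_per_dollar / api_wpd


if __name__ == "__main__":
    # Assumption: a hypothetical API charging $100 per 1M characters,
    # and an assumed average of 5 characters per word (including spaces).
    api_wpd = api_words_per_dollar(price_per_million_chars=100.0,
                                   avg_chars_per_word=5.0)

    # From the article: MetaVoice on an RTX 3080 reached 23,300 words per dollar.
    ratio = cost_ratio(23_300, api_wpd)
    print(f"API: {api_wpd:,.0f} words/$, self-managed advantage: {ratio:.1f}x")
```

The result is sensitive to the characters-per-word assumption, so it is worth measuring that average on your own text before drawing conclusions.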
A common thread: open-source text-to-speech innovation

OpenVoice TTS, OpenVoice Cloning, and MetaVoice share a foundational principle: they are all open-source text-to-speech models. These models are not only free to use but also transparent in how they are developed. With access to the source code, developers and researchers can inspect it, contribute improvements, and customize the models to fit their specific needs, driving innovation in the TTS domain.

A closer look at each model: OpenVoice and MetaVoice

OpenVoice is an open-source, instant voice cloning technology that creates realistic, customizable speech from just a short audio clip of a reference speaker. Developed by MyShell.ai, OpenVoice stands out for its ability to replicate the reference voice’s tone color while offering extensive control over speech attributes such as emotion and rhythm. OpenVoice’s voice replication process involves several key steps that can be used together or separately:

OpenVoice Base TTS

OpenVoice’s base text-to-speech (TTS) engine is a cornerstone of its framework, efficiently transforming written text into spoken words. This component is particularly valuable when the primary goal is text-to-speech conversion without specific voice toning or cloning. Because it can easily be isolated and used independently, it is a versatile tool for applications that need straightforward speech synthesis.

OpenVoice Benchmark: 6 Million+ words per $ on Salad

OpenVoice Cloning

Building upon the base TTS engine, this feature adds the ability to replicate a reference speaker’s unique vocal characteristics.
This includes extracting and reproducing tone color, so that the generated speech not only sounds natural but also carries the emotional and rhythmic nuances of the original speaker. OpenVoice’s cloning capabilities extend to zero-shot cross-lingual voice cloning, which allows speech to be generated in languages not present in the training dataset. This opens up many possibilities for multilingual applications and global reach.

MetaVoice-1B

MetaVoice-1B is a 1.2 billion parameter base model, trained on 100,000 hours of speech. Its design focuses on natural-sounding speech, with an emphasis on emotional rhythm and tone in English. A standout feature of MetaVoice-1B is its zero-shot cloning capability for American and British voices, which requires just 30 seconds of reference audio. The model also supports cross-lingual voice cloning with fine-tuning, showing promising results with as little as one minute of training data for Indian speakers. MetaVoice-1B is engineered to capture the nuances of emotional speech, so that the synthesized output resonates with listeners.

MetaVoice Benchmark: 23,300 words per $ on Salad

Benchmark results: Price comparison of voice synthesis models on SaladCloud

The following table presents the results of our benchmark tests, in which we ran OpenVoice TTS, OpenVoice Cloning, and MetaVoice on SaladCloud GPUs. For consistency, we used the text of Isaac Asimov’s book “Robots and Empire”, available on the Internet Archive and comprising approximately 150,000 words, and processed it on all compatible Salad GPUs.
| Model Name | Most Cost-Efficient GPU | Words per Dollar | Second Most Cost-Efficient GPU | Words per Dollar |
| OpenVoice TTS | RTX 2070 | 6.6 million | GTX 1650 | 6.1 million |
| OpenVoice Cloning | GTX 1650 | 4.7 million | RTX 2070 | 4.02 million |
| MetaVoice | RTX 3080 | 23,300 | RTX 3080 Ti | 15,400 |

Table: Comparison of OpenVoice Text-to-Speech, OpenVoice Cloning and MetaVoice

The benchmark results clearly indicate that OpenVoice, in both its TTS and Cloning variants, is significantly more cost-effective than MetaVoice. The OpenVoice TTS model, when run on an RTX 2070 GPU, achieves 6.6 million words per dollar, making it the most efficient option among the tested models. An RTX 2070 on SaladCloud costs $0.06/hour; together with the vCPU and RAM we used, the total came to $0.072/hour.

OpenVoice Cloning also demonstrates strong cost efficiency, particularly on the GTX 1650, which processes 4.7 million words per dollar. This is a notable advantage for applications that need a less robotic-sounding voice.

In contrast, MetaVoice on the RTX 3080 and RTX 3080 Ti GPUs yields significantly fewer words per dollar, indicating a higher cost for processing speech. However, don’t rush to dismiss MetaVoice just yet; upcoming comparisons may offer a different perspective that could sway your opinion.
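The relationship between hourly price, throughput, and words per dollar can be sketched as follows. This is an illustration rather than the benchmark code: it assumes the words-per-dollar figures were computed against the total hourly price ($0.072/hour for the RTX 2070 configuration), which the text above implies but does not state, and the function names are hypothetical.

```python
def words_per_dollar(words_per_hour: float, cost_per_hour: float) -> float:
    """Words processed for each dollar spent, given throughput and hourly price."""
    return words_per_hour / cost_per_hour


def implied_words_per_hour(wpd: float, cost_per_hour: float) -> float:
    """Invert the formula: throughput implied by a words-per-dollar figure."""
    return wpd * cost_per_hour


def job_cost(total_words: float, wpd: float) -> float:
    """Dollar cost of synthesizing a text of the given length."""
    return total_words / wpd


if __name__ == "__main__":
    # Figures from the article: OpenVoice TTS on an RTX 2070.
    wpd = 6_600_000        # words per dollar
    hourly_total = 0.072   # $/hour including vCPU and RAM (assumed basis)

    per_hour = implied_words_per_hour(wpd, hourly_total)
    print(f"Implied throughput: {per_hour:,.0f} words/hour "
          f"({per_hour / 3600:.0f} words/second)")

    # Cost of the benchmark text, approximately 150,000 words.
    print(f"Cost for 150,000 words: ${job_cost(150_000, wpd):.4f}")
```

Under that assumption, the RTX 2070 figure corresponds to roughly 475,000 words per hour, and the full benchmark text costs a little over two cents to synthesize.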