SaladCloud Blog

benchmark

Stable diffusion 1.5 benchmark on SaladCloud

Stable diffusion 1.5 benchmark: 14,000+ images per dollar on SaladCloud

Stable Diffusion 1.5 benchmark on consumer GPUs

Since our last Stable Diffusion benchmark nearly a year ago, a lot has changed. While we previously used SD.Next for inference, ComfyUI has become the de facto image generation inference server for most professional use, owing to its high degree of flexibility, best-in-class performance, and the fact that it is nearly always first to support new models and technologies. SaladCloud has introduced new priority pricing levels, offering significantly lower prices on all GPUs, including top-of-the-line models like the RTX 4090. These factors combine to yield much lower per-image inference costs than we achieved previously, with a much simpler build. In this Stable Diffusion 1.5 benchmark, we evaluate the performance of SD 1.5 on three consumer GPUs on SaladCloud: the RTX 4090, RTX 3090, and RTX 3060 Ti.

Stable Diffusion 1.5 benchmark design

We deployed the “Dreamshaper 8 – ComfyUI” recipe on SaladCloud, using the default configuration but setting priority to “Batch” and requesting 10 replicas. We started the benchmark once we had at least 8/10 replicas running. We used Postman’s collection runner feature to simulate load, first from 10 concurrent users, then ramping up to 25 concurrent users. Each test ran for 1 hour. Each virtual user submits requests to generate 1 image (an illustrative request is sketched below). We duplicated this setup to test the RTX 3060 Ti (8 GB VRAM), RTX 3090 (24 GB VRAM), and RTX 4090 (24 GB VRAM).

What we measured:

Deployment on SaladCloud

Click through the Dreamshaper 8 recipe, available from the Container Groups interface, and set the replica count to 10. Optionally, set a non-default priority and/or enable authentication. For our benchmark, we used “Batch” priority and did not enable authentication.

Results from the Stable Diffusion 1.5 benchmark

While the RTX 4090 unsurprisingly had the best raw performance, the RTX 3090 came in very close, at better cost-performance. The default configuration with the RTX 3060 Ti showed remarkably good response times and the best cost-performance. Across all tests, average round-trip time increases as load increases. We did not always have the maximum requested replicas running, which is expected. SaladCloud only bills for running instances, so this really just means we’d want to set our desired replica count marginally higher than what we actually think we need. We saw a small number of failed requests that coincided with node reallocations. This is expected, and you should handle this case in your application via retries.

RTX 4090 (24 GB VRAM)
RTX 3090 (24 GB VRAM)
RTX 3060 Ti (8 GB VRAM) [default]

Shawn Rushefsky
Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
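For illustration, a request to a deployed ComfyUI instance might look like the sketch below. The access domain, workflow file name, and graph contents are assumptions for this example rather than the recipe's exact defaults; ComfyUI queues work via a POST to /prompt with an API-format workflow graph.

```python
import json
import requests

# Hypothetical access domain; the real one is shown in the SaladCloud portal
# after the container group is deployed.
COMFY_URL = "https://example-access-domain.salad.cloud"

# An API-format workflow exported from ComfyUI ("Save (API Format)"); here it
# would contain the Dreamshaper 8 text-to-image graph used in the benchmark.
with open("workflow_api.json") as f:
    workflow = json.load(f)

# Queue one image generation; the response includes a prompt_id to poll for results.
resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow}, timeout=300)
resp.raise_for_status()
print(resp.json())
```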

Stable diffusion 1.5 benchmark: 14,000+ images per dollar on SaladCloud Read More »

Stable diffusion XL (SDXL) GPU benchmark on SaladCloud

Stable Diffusion XL (SDXL) benchmark: 3405 images per dollar on SaladCloud

Stable Diffusion XL (SDXL) benchmark on 3 RTX GPUs

Since our last SDXL benchmark nearly a year ago, a lot has changed. Community adoption of SDXL has increased significantly, and along with that comes better tooling, performance increases, and a better understanding of how to get good results from the model. While we previously used SD.Next for inference, ComfyUI has become the de facto image generation inference server for most professional use, owing to its high degree of flexibility, best-in-class performance, and the fact that it is nearly always first to support new models and technologies. SaladCloud has also introduced new priority pricing levels, offering significantly lower prices on all GPUs, including top-of-the-line models like the RTX 4090. These factors combine to yield much lower per-image inference costs than we achieved previously, with a much simpler build. With ComfyUI and the new GPU pricing, we benchmark SDXL on 3 different consumer GPUs: the RTX 4090, RTX 4080, and RTX 3090.

SDXL Benchmark Design

We deployed the “SDXL with Refiner – ComfyUI” recipe on Salad, using the default configuration but setting priority to “Batch” and requesting 10 replicas. We started the benchmark once we had at least 8/10 replicas running. We used Postman’s collection runner feature to simulate load, first from 10 concurrent users, then ramping up to 18 concurrent users. Each test ran for 1 hour. Each virtual user submits requests to generate 1 image. We duplicated this setup to test the RTX 3090 (24 GB VRAM), RTX 4080 (16 GB VRAM), and RTX 4090 (24 GB VRAM).

What we measured:

Deployment on SaladCloud

Click through the SDXL with Refiner recipe, available from the Container Groups interface, and set the replica count to 10. Optionally, set a non-default priority and/or enable authentication. For our benchmark, we used “Batch” priority and did not enable authentication.

Results from the SDXL benchmark

The RTX 4090 achieved the best performance, both in terms of inference time and cost per image, returning images in as little as 6.2 s per image and at a cost as low as 3,405 images per dollar. Across all tests, average round-trip time increases as load increases. We did not always have the maximum requested replicas running, which is expected. SaladCloud only bills for running instances, so this really just means we’d want to set our desired replica count marginally higher than what we actually think we need. We saw a small number of failed requests that coincided with node reallocations. This is expected, and you should handle this case in your application via retries (a minimal retry sketch follows below).

RTX 4090 (24 GB VRAM)
RTX 4080 (16 GB VRAM)
RTX 3090 (24 GB VRAM)

Shawn Rushefsky
Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
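Since occasional failures during node reallocations are expected, client-side retries are worth building into the load generator or application. A minimal sketch, with illustrative function and parameter names:

```python
import time
import requests

def submit_with_retries(url, payload, attempts=5, base_delay=2.0):
    """POST a generation request, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            resp = requests.post(url, json=payload, timeout=300)
            resp.raise_for_status()
            return resp.json()
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```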

Stable Diffusion XL (SDXL) benchmark: 3405 images per dollar on SaladCloud Read More »

Molecular Simulation GROMACS Benchmark on SaladCloud

Molecular Simulation: GROMACS Benchmark on 30 GPUs on SaladCloud, 90+% Cost Savings

Benchmarking GROMACS for Molecular Simulation on consumer GPUs

In this deep dive, we benchmark GROMACS on SaladCloud, analyzing simulation speed and cost-effectiveness across a spectrum of molecular systems: small, medium, and large. We also provide recommendations for selecting the most appropriate resource types for various workloads on SaladCloud. Building on the OpenMM benchmark on SaladCloud and our continuous efforts to optimize system architecture and batch job implementation, we have achieved 90% cost savings by using consumer GPUs for molecular simulations with GROMACS, compared to CPUs and data center GPUs.

GROMACS is a highly optimized, open-source software package for molecular dynamics simulations. Researchers in fields like biochemistry, biophysics, and materials science widely use it to study the physical movements of atoms and molecules over time. GROMACS stands out for its exceptional performance compared to other programs, efficiently leveraging both CPU and GPU resources. This capability enables effective static and dynamic load balancing across the system’s various components.

GROMACS benchmark methodology

gmx mdrun is the main computational chemistry engine within GROMACS. A command along the lines sketched at the end of this section performs molecular dynamics simulations in the target environment. The mdrun program reads the input TPR file (-s), which contains the initial molecular topology and parameters, and produces several output files (-deffnm) with different extensions for logs, trajectories, structures, and energies.

GROMACS relies on close collaboration between the CPU and GPU to achieve optimal performance. Although many calculations can be offloaded to the GPU using the options (-nb, -pme, -bonded, -update), the program still demands considerable CPU processing power and multiple threads for task management, communication, and I/O operations. To fully utilize a powerful GPU, GROMACS also depends on robust CPU performance. Although running more OpenMP threads than the number of physical cores can be beneficial in certain situations, for our benchmark test we only selected Salad nodes with CPUs that have 8 or more cores and configured each node to run 8 OpenMP threads (-ntmpi, -ntomp).

We used GROMACS 2024.1 with CUDA 11.8 to build the container image. When running on SaladCloud, the container first runs the simulations against typical molecular systems, reports the test data to an AWS DynamoDB table, and then exits. Finally, the data is downloaded and analyzed using Pandas on JupyterLab.

Two key performance indicators are collected and analyzed during the test:
ns/day stands for nanoseconds per day. It measures simulation speed, indicating how many nanoseconds of simulated time can be computed in one day of real time.
ns/dollar stands for nanoseconds per dollar. It measures cost-effectiveness, showing how many nanoseconds of simulated time can be computed for one dollar.

Below are the two scenarios and the methods used to collect data and calculate the final results:

Consumer GPUs (8 cores for 8 OpenMP threads; 30 GPU types)
Simulation speed (ns/day): Create a container group with 100 instances across all GPU types on SaladCloud and run it for a few hours. Once code execution finishes on an instance, SaladCloud allocates a new node and continues running the instance. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type, and calculate the average performance for each GPU type.
Cost-effectiveness (ns/dollar): Pricing from the SaladCloud Price Calculator: $0.072/hour for 16 vCPUs, 8 GB RAM; $0.015–$0.18/hour for different GPU types (priority: Batch). https://salad.com/pricing

Data center GPUs (16 cores for 16 OpenMP threads; A40 48 GB, A100 40 GB, H100 80 GB)
Simulation speed (ns/day): Use the test data in the GROMACS benchmarks by NHR@FAU.
Cost-effectiveness (ns/dollar): The lowest prices that closely match the resource requirements are selected from the data center GPU market: $1.86/hour for A40 (24 vCPUs); $1.29/hour for A100 (30 vCPUs); $2.99/hour for H100 (30 vCPUs). https://getdeploying.com/reference/cloud-gpu

It is worth mentioning that performance can be influenced by many factors, such as operating systems (Windows, Linux, or WSL) and their versions, CPU models, GPU models and driver versions, CUDA framework versions, GROMACS versions, and additional features enabled in the runtime environment. It is very common to see different results between our benchmarks and those of others.

Benchmark Results

Here are six typical biochemical systems used to benchmark GROMACS:
1. R-143a in hexane (20,248 atoms) with very high output rate – Small
2. A short RNA piece with explicit water (31,889 atoms) – Small
3. A protein inside a membrane surrounded by explicit water (80,289 atoms) – Medium
4. A protein in explicit water (170,320 atoms) – Medium
5. A protein membrane channel with explicit water (615,924 atoms) – Large
6. A huge virus protein (1,066,628 atoms) – Large

Model 1: R-143a in hexane (20,248 atoms) with very high output rate
Model 2: A short RNA piece with explicit water (31,889 atoms)
Model 3: A protein inside a membrane surrounded by explicit water (80,289 atoms)
Model 4: A protein in explicit water (170,320 atoms)
Model 5: A protein membrane channel with explicit water (615,924 atoms)
Model 6: A huge virus protein (1,066,628 atoms)

Observations from the GROMACS benchmark

Here are some interesting observations from the GROMACS benchmarks:
The VRAM usage for all simulations is only 1–2 GB, which means nearly all GPU types can theoretically be utilized to run these models.
GROMACS primarily utilizes the CUDA cores of GPUs (not Tensor Cores) and typically operates in single precision (FP32).
High-end GPUs generally outperform low-end models in simulation speed due to their greater number of CUDA cores and higher memory bandwidth. However, the flagship model of a GPU generation often surpasses the low-end models of the following generation.
For smaller models, GPUs are often underutilized, and communication between the CPU and GPU can become a bottleneck, making CPU performance a critical factor in overall system performance. On nodes with GPUs of similar performance, higher CPU clock speeds and more physical cores usually lead to better performance. Data center GPUs are typically paired with more powerful CPUs that have additional cores, allowing them to run GROMACS significantly faster than consumer GPUs in Models 1 and 2.
Large models can fully exploit the vast number of CUDA …
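A representative gmx mdrun invocation along the lines described in the methodology above might look like the following sketch. The input/output file names, step count, and timer-reset flag are assumptions for a benchmark-style run, not the exact command used in this test; the GPU-offload and threading options mirror those described earlier.

```bash
gmx mdrun -s benchmark.tpr -deffnm benchmark \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntmpi 1 -ntomp 8 \
    -nsteps 100000 -resethway
```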

Molecular Simulation: GROMACS Benchmark on 30 GPUs on SaladCloud, 90+% Cost Savings Read More »

OpenMM benchmark on GPUs – Salad blog cover

Molecular Simulation: OpenMM Benchmark on 25 consumer GPUs, 95% less cost

Benchmarking OpenMM for Molecular Simulation on consumer GPUs

OpenMM is one of the most popular toolkits for molecular dynamics simulations, renowned for its high performance, flexibility, and extensibility. It enables users to easily incorporate new features, such as novel forces, integration algorithms, and simulation protocols, which can run efficiently on both CPUs and GPUs. In this analysis, we use typical biochemical systems to benchmark OpenMM on SaladCloud’s network of AI-enabled consumer GPUs. We analyze simulation speed and cost-effectiveness in each case and discuss how to build high-performance and reliable molecular simulation workloads on SaladCloud. This approach supports unlimited throughput and offers over 95% cost savings compared to solutions based on data center GPUs.

Why run Molecular Simulations on GPUs?

GPUs have a high degree of parallelism, which means they can perform many calculations simultaneously. This is particularly useful for molecular simulations, which involve a large number of repetitive calculations, such as evaluating forces between atoms. Using GPUs can significantly accelerate molecular simulations, offering nearly real-time feedback and allowing researchers to run more simulations in less time. This enhanced efficiency accelerates the pace of discovery and lowers computational costs.

OpenMM benchmark methodology

The OpenMM team provides benchmarking code in Python, along with benchmarks of simulation speed for typical biochemical systems based on OpenMM 8.0. To conduct the benchmarking test, you can run the test scripts listed below on the target environment (a consolidated sketch follows this excerpt). Following the OpenMM benchmarks, we used OpenMM 8.0 with CUDA 11.8 to build the container image. When running on SaladCloud, the container first executes the benchmarking code, reports the test data to an AWS DynamoDB table, and then exits. Finally, the data is downloaded and analyzed using Pandas on JupyterLab.

We primarily focused on two key performance indicators across three scenarios:
ns/day stands for nanoseconds per day. It measures simulation speed, indicating how many nanoseconds of simulated time can be computed in one day of real time.
ns/dollar stands for nanoseconds per dollar. It measures cost-effectiveness, showing how many nanoseconds of simulated time can be computed for one dollar.
Molecular simulations often operate on the timescale of nanoseconds to microseconds, as molecular motions and interactions occur very rapidly.

Below are the three scenarios and the methods used to collect data and calculate the final results:

CPUs (16 vCPUs, 8 GB RAM)
Simulation speed (ns/day): Create a container group with 10 CPU-only instances on SaladCloud and run it for a few hours. Once code execution finishes on an instance, SaladCloud allocates a new node and continues running the instance. Collect test data from tens of unique Salad nodes to calculate the average performance.
Cost-effectiveness (ns/dollar): Pricing from the SaladCloud Price Calculator: $0.040/hour for 8 vCPUs, 8 GB RAM; $0.072/hour for 16 vCPUs, 8 GB RAM; $0.02–$0.30/hour for different GPU types. https://salad.com/pricing

Consumer GPUs (8 vCPUs, 8 GB RAM; 20+ GPU types)
Simulation speed (ns/day): Create a container group with 100 instances across all GPU types on SaladCloud and run it for a few hours. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type, and calculate the average performance for each GPU type.
Cost-effectiveness (ns/dollar): Pricing from the SaladCloud Price Calculator: $0.040/hour for 8 vCPUs, 8 GB RAM; $0.072/hour for 16 vCPUs, 8 GB RAM; $0.02–$0.30/hour for different GPU types. https://salad.com/pricing

Data center GPUs (A100, H100)
Simulation speed (ns/day): Use the test data in the OpenMM benchmarks.
Cost-effectiveness (ns/dollar): Pricing from AWS EC2 Capacity Blocks: $1.844/hour for 1 A100; $4.916/hour for 1 H100. https://aws.amazon.com/ec2/capacityblocks/pricing/

It is worth mentioning that performance can be influenced by many factors, such as operating systems (Windows, Linux, or WSL) and their versions, CPU models, GPU models and driver versions, CUDA framework versions, OpenMM versions, and additional features enabled in the runtime environment. It is very common to see different results between our benchmarks and those of others.

Benchmark Results

Here are five typical biochemical systems used to benchmark OpenMM 8.0, along with the corresponding test scripts:

1. Dihydrofolate Reductase (DHFR), Explicit-PME – A 159-residue protein with 2,489 atoms. The version used for explicit solvent simulations included 7,023 TIP3P water molecules, giving a total of 23,558 atoms. All simulations used the AMBER99SB force field and a Langevin integrator. Test script: python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=pme
2. Apolipoprotein A1 (ApoA1), PME – Consists of 392 protein residues, 160 POPC lipids, and 21,458 water molecules, for a total of 92,224 atoms. All simulations used the AMBER14 force field. Test script: python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=apoa1pme
3. Cellulose, PME – A set of cellulose molecules (91,044 atoms) solvated with 105,855 water molecules, for a total of 408,609 atoms. Test script: python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=amber20-cellulose
4. Satellite Tobacco Mosaic Virus (STMV), PME – Consists of 8,820 protein residues, 949 RNA bases, 300,053 water molecules, and 649 sodium ions, for a total of 1,067,095 atoms. Test script: python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=amber20-stmv
5. AMOEBA DHFR, PME – Full mutual polarization was used, with induced dipoles iterated until they converged to a tolerance of 1e-5. Test script: python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=amoebapme

Model 1: Dihydrofolate Reductase (DHFR), Explicit-PME
Model 2: Apolipoprotein A1 (ApoA1), PME
Model 3: Cellulose, PME
Model 4: Satellite Tobacco Mosaic Virus (STMV), PME
Model 5: AMOEBA DHFR, PME

Observations from the OpenMM GPU benchmarks

Here are some interesting observations from the OpenMM GPU benchmarks:
The VRAM usage for all simulations is only 1–2 GB, which means nearly all platforms (CPU-only or GPU) and all GPU types can theoretically be utilized to run these models.
For all models, the simulation speed of GPUs is significantly higher than that of CPUs, ranging from nearly a hundredfold in Model 1 to more than ten-thousandfold in Model 5.
In general, high-end GPUs outperform low-end GPUs in terms of simulation speed. However, the flagship model of a given GPU family often surpasses the low-end models of the next family.
As models become more complex with additional molecules and atoms, the performance differences between low-end and high-end GPUs become more pronounced, ranging from a few times to tens of times. For example, in Model …
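For reference, the test scripts listed above can be run back to back on a node. The sketch below assumes benchmark.py sits in the examples/ directory of the OpenMM source tree; substitute --platform=CPU for the CPU-only baseline.

```bash
# Assumed location of OpenMM's benchmarking script; adjust the path as needed.
git clone https://github.com/openmm/openmm.git
cd openmm/examples

# One run per model on the CUDA platform, 60 seconds each.
for test in pme apoa1pme amber20-cellulose amber20-stmv amoebapme; do
    python benchmark.py --platform=CUDA --seconds=60 --test="$test"
done
```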

Molecular Simulation: OpenMM Benchmark on 25 consumer GPUs, 95% less cost Read More »

AI transcription - Parakeet TDT 1.1B batch transcription compared against APIs

AI Transcription Benchmark: 1 Million Hours of YouTube Videos with Parakeet TDT 1.1B for Just $1260, a 1000-fold cost reduction

Building upon the inference benchmark of Parakeet TDT 1.1B on SaladCloud and our ongoing efforts to enhance the system architecture and implementation for batch jobs, we have achieved a 1000-fold cost reduction for AI transcription with SaladCloud. This incredible cost-performance comes while maintaining the same level of accuracy as other managed transcription services.

YouTube is the world’s most widely used video-sharing platform, featuring a wealth of public content, including talks, news, courses, and more. There might be instances where you need to quickly understand updates on a global event or summarize a topic, but you may not be able to watch the videos individually. In addition, the millions of YouTube videos are a gold mine of training data for many AI applications. Many companies need to do large-scale, batch AI transcription today, but cost is a prohibitive factor.

In this deep dive, we use publicly available YouTube videos as datasets along with the high-speed ASR (Automatic Speech Recognition) model Parakeet TDT 1.1B, and explore methods for constructing a batch-processing system for large-scale AI transcription of videos, using the substantial computational power of SaladCloud’s massive network of consumer GPUs across a global, high-speed distributed network.

How to download YouTube videos for batch AI transcription

The Python library pytube is a lightweight tool for handling YouTube videos that can simplify our tasks significantly. First, pytube offers APIs for interacting with YouTube playlists, which are collections of videos usually organized around specific themes. Using these APIs, we can retrieve all the video URLs within a specific playlist. Second, before downloading a video, we can access its metadata, including details such as the title, video resolution, frames per second (fps), video codec, audio bit rate (abr), and audio codec. If a YouTube video offers an audio-only stream, we can enhance efficiency by downloading only its audio. This approach not only reduces bandwidth requirements but also results in substantial time savings, given that the video is typically ten times larger than its corresponding audio. A sketch of the download step appears at the end of this section.

The audio files downloaded from YouTube primarily use the MPEG-4 audio (Mp4a) format, commonly employed for streaming large audio tracks. We can convert these audio files from Mp4a to MP3, a format universally accepted by ASR models. Additionally, the duration of audio files sourced from YouTube varies considerably, ranging from a few minutes to tens of hours. To leverage massive and cost-effective GPU types, as well as to optimize GPU resource utilization, it is essential to segment all lengthy audio into fixed-length clips before feeding them into the model. The results can then be aggregated before returning the final transcription.

Advanced system architecture for massive video transcription

We can reuse our existing system architecture for audio transcription with a few enhancements. In a long-running batch-job system, implementing auto-scaling becomes crucial. By continuously monitoring the job count in the message queue, we can dynamically adjust the number of Salad nodes or groups. This adaptive approach allows us to respond effectively to variations in system load, providing the flexibility to efficiently manage costs during lower-demand periods or enhance throughput during peak loads.
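Here is a minimal sketch of that download step with pytube; the playlist URL, output path, and file naming are placeholders for illustration.

```python
from pytube import Playlist, YouTube

# Placeholder playlist URL; a real run would iterate over many playlists.
playlist = Playlist("https://www.youtube.com/playlist?list=EXAMPLE")

for url in playlist.video_urls:
    yt = YouTube(url)
    # Prefer the highest-bitrate audio-only stream to avoid downloading video.
    stream = yt.streams.filter(only_audio=True).order_by("abr").desc().first()
    if stream is not None:
        stream.download(output_path="audio", filename=f"{yt.video_id}.mp4")
```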
Enhanced node implementation for both video and audio AI transcription

Modifications have been made to the node implementation, enabling it to handle both video and audio for AI transcription. The inference process remains unchanged, running on a single thread dedicated to GPU-based transcription. We introduced additional features in the benchmark worker process, which is specifically designed to handle I/O- and CPU-bound tasks and runs multiple threads.

Running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks provides the flexibility to update each component independently. Introducing multiple threads in the benchmark worker process to handle different tasks eliminates waiting periods by fetching and preparing the next audio clips in advance while the current one is still being transcribed. Consequently, as soon as one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time and increases system throughput but also results in more significant cost savings.

Massive YouTube video transcription tests on SaladCloud

We created a container group with 100 replicas (2 vCPU and 12 GB RAM, with 20+ different GPU types) on SaladCloud. The group was operational for approximately 10 hours, from 10:00 pm to 8:00 am PST during weekdays, successfully downloading and transcribing a total of 68,393 YouTube videos. The cumulative length of these videos amounted to 66,786 hours, with an average duration of 3,515 seconds. Hundreds of Salad nodes from networks worldwide actively engaged in the tasks. They are all positioned on high-speed networks, near the edges of the YouTube global CDN (with an average latency of 33 ms). This setup guarantees local access and ensures optimal system throughput when downloading content from YouTube.

According to the AWS DynamoDB metrics, specifically writes per second, which serve as a monitoring tool for transcription jobs, the system reached its maximum capacity roughly one hour after the container group was launched, processing approximately 2 videos (totaling 7,500 seconds) per second. The selected YouTube videos for this test vary widely in length, ranging from a few minutes to over 10 hours, causing notable fluctuations in the processing curve.

Let’s compare the results of the two benchmark tests conducted on Parakeet TDT 1.1B for audio and video:

Datasets – Parakeet Audio: English CommonVoice and Spoken Wikipedia Corpus English. Parakeet Video: English YouTube videos, including public talks, news, and courses.
Average input length (s) – Audio: 12. Video: 3,515.
Cost on SaladCloud (GPU resource pool and global distribution network) – Audio: around $100 (100 replicas, 2 vCPU, 12 GB RAM, 20+ GPU types, for 10 hours). Video: around $100 (100 replicas, 2 vCPU, 12 GB RAM, 20+ GPU types, for 10 hours).
Cost on AWS and Cloudflare (job queue/recording system and cloud storage) – Audio: around $20. Video: around $2.
Node implementation – Audio: 3 downloader threads; segmentation of long audio; merging texts. Video: download audio from YouTube playlists and videos; 3 downloader threads; segmentation of long audio; format conversion from Mp4a to MP3; merging texts.
Number of transcriptions – Audio: 5,209,130. Video: 68,393.
Total …

AI Transcription Benchmark: 1 Million Hours of YouTube Videos with Parakeet TDT 1.1B for Just $1260, a 1000-fold cost reduction Read More »

MetaVoice Text-to-Speech gpu benchmark on SaladCloud

MetaVoice AI Text-to-Speech (TTS) Benchmark: Narrate 100,000 words for only $4.29 on SaladCloud

Note: Do not miss out on listening to voice clones of 10 different celebrities reading Harry Potter and the Sorcerer’s Stone toward the end of the blog.

Introduction to MetaVoice-1B

MetaVoice-1B is an advanced text-to-speech (TTS) model boasting 1.2 billion parameters, meticulously trained on a vast corpus of 100,000 hours of speech. Engineered with a focus on producing emotionally resonant English speech rhythms and tones, MetaVoice-1B stands out for its accuracy and realistic voice synthesis. One standout feature of MetaVoice-1B is its ability to perform zero-shot voice cloning: it requires only a 30-second audio snippet to accurately replicate American and British voices. It also includes cross-lingual cloning capabilities, demonstrated with as little as one minute of training data for Indian accents. A versatile tool released under the permissive Apache 2.0 license, MetaVoice-1B is designed for long-form synthesis.

The architecture of MetaVoice-1B

MetaVoice-1B’s architecture is a testament to its innovative design. Combining causal GPT structures and non-causal transformers, it predicts a series of hierarchical EnCodec tokens from text and speaker information. This intricate process includes condition-free sampling, enhancing the model’s cloning proficiency. The text is processed through a custom-trained BPE tokenizer, optimizing the model’s linguistic capabilities without the need for predicting semantic tokens, a step often deemed necessary in similar technologies.

MetaVoice cloning benchmark methodology on SaladCloud GPUs

Encountered Limitations and Adaptations

During the evaluation, we encountered limitations with the maximum length of text that MetaVoice could process in one go. The default limit is set to 2,048 tokens per batch, but we noticed that even with a smaller number of tokens, the model starts to behave unexpectedly. To work around the limit, we preprocessed our data by dividing the text into smaller segments, specifically two-sentence pieces, to accommodate the model’s capabilities. To break the text into sentences, we used the Punkt Sentence Tokenizer (a minimal sketch follows the workflow description below). The text source remained consistent with previous benchmarks, utilizing Isaac Asimov’s “Robots and Empire,” available from the Internet Archive’s digital library. For the voice cloning component, we utilized a one-minute sample of Benedict Cumberbatch’s narration. The synthesized output very closely mirrored the distinctive qualities of Cumberbatch’s narration, demonstrating MetaVoice’s cloning capabilities, which are the best we have seen yet. Here is a voice cloning example featuring Benedict Cumberbatch:

GPU Specifications and Selection

The MetaVoice documentation specifies the need for GPUs with 12 GB of VRAM or more. Despite this, our trials included GPUs with lower VRAM, which still performed adequately, though this required a careful selection process from SaladCloud’s GPU fleet to ensure compatibility. We standardized each node with 1 vCPU and 8 GB of RAM to maintain a consistent testing environment.

Benchmarking Workflow

The benchmarking procedure incorporated multi-threaded operations to enhance efficiency. The process involved downloading parts of the text and the voice reference sample from Azure in parallel and processing the text through the MetaVoice model. Completing the cycle, the resulting audio was then uploaded back to Azure.
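As a rough illustration of the two-sentence segmentation described above, here is a minimal sketch using NLTK's Punkt tokenizer; the chunk size of two sentences mirrors the description, while the function name and text handling are assumptions.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # fetch the Punkt sentence tokenizer models once

def two_sentence_segments(text: str) -> list[str]:
    """Split long text into two-sentence pieces for MetaVoice-sized batches."""
    sentences = sent_tokenize(text)
    return [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
```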
This comprehensive workflow was designed to simulate a typical application scenario, providing a realistic assessment of MetaVoice’s operational performance on SaladCloud GPUs.

Benchmark Findings: Cost-Performance and Inference Speed

Words per Dollar Efficiency

Our benchmarking results reveal that the RTX 3080 GPU leads in terms of cost-efficiency for MetaVoice, achieving an impressive 23,300 words per dollar. The RTX 3080 Ti follows closely with 15,400 words per dollar. These figures highlight the resource-intensive nature of MetaVoice, which requires powerful GPUs to operate efficiently.

Speed Analysis and GPU Requirements

Our speed analysis revealed that GPUs with 10 GB or more of VRAM performed consistently, processing approximately 0.8 to 1.2 words per second. In contrast, GPUs with lower VRAM demonstrated significantly reduced performance, rendering them unsuitable for running MetaVoice. This aligns with the developers’ recommendation of using GPUs with at least 12 GB of VRAM to ensure optimal functionality.

Cost Analysis for an Average Book

To provide a practical perspective, let’s consider the cost of converting an average book into speech using MetaVoice on SaladCloud GPUs, assuming an average book contains approximately 100,000 words. Creating a narration of “Harry Potter and the Sorcerer’s Stone” by Benedict Cumberbatch would cost around $3.30 with an RTX 3080 and $5.00 with an RTX 3080 Ti. Here is an example of a voice clone of Benedict Cumberbatch reading Harry Potter:

Notice that we did not change any model parameters or add business logic; we only added batch processing, sentence by sentence. We also cloned other celebrity voices to read out the first page of Harry Potter and the Sorcerer’s Stone. Here’s a collection of different voice clones reading Harry Potter using MetaVoice.

MetaVoice GPU Benchmark on SaladCloud – Conclusion

In conclusion, the combination of MetaVoice and SaladCloud GPUs presents a cost-effective and high-quality solution for text-to-speech and voice cloning projects. Whether for large-scale audiobook production or specialized projects like celebrity-narrated books, this technology offers a new level of accessibility and affordability in voice synthesis. As we move forward, it will be exciting to see how these advancements continue to shape the landscape of digital content creation.

SaladCloud suggests: If you are just looking to generate AI voices, give Veed.io’s AI voice generator a try. With AI voices and AI avatars, Veed.io will generate ultra-realistic text-to-speech audio/video for personal and commercial use.

MetaVoice AI Text-to-Speech (TTS) Benchmark: Narrate 100,000 words for only $4.29 on SaladCloud Read More »

Whisper large v3 - Automatic speech - recognition - gpu benchmark

Whisper Large V3 Speech Recognition Benchmark: 1 Million hours of audio transcription for just $5110

Save over 99.8% on audio transcription using Whisper Large V3 and consumer GPUs

A 99.8% cost savings for automatic speech recognition sounds unreal. But with the right choice of GPUs and models, this is very much possible, and it highlights the needless overspending on managed transcription services today. In this deep dive, we benchmark the latest Whisper Large V3 model from OpenAI for inference against the extensive English CommonVoice and Spoken Wikipedia Corpus English (Part 1, Part 2) datasets, delving into how we accomplished an exceptional 99.8% cost reduction compared to other public cloud providers. Building upon the inference benchmark of Whisper Large V2 and our continued effort to enhance the system architecture and implementation for batch jobs, we have achieved substantial reductions in both audio transcription cost and time while maintaining the same level of accuracy as managed transcription services.

Behind The Scenes: Advanced System Architecture for Batch Jobs

Our batch processing framework comprises the following:

We aimed to keep the framework components fully managed and serverless to closely simulate the experience of using managed transcription services. A decoupled architecture provides the flexibility to choose the best and most cost-effective solution for each component from the industry. Within each node in the GPU resource pool on SaladCloud, two processes are utilized, following best practices: one dedicated to GPU inference and another focused on I/O- and CPU-bound tasks, such as downloading/uploading, preprocessing, and post-processing.

1) Inference Process

The inference process operates on a single thread. It begins by loading the Whisper Large V3 model and warming up the GPU, and then listens on a TCP port by running a Python/FastAPI app in a Uvicorn server (a minimal endpoint sketch follows below). Upon receiving a request, it calls the transcription inference and returns the generated assets. The chunking algorithm is configured for batch processing, where long audio files are segmented into 30-second clips and these clips are fed into the model simultaneously. The batch inference significantly enhances performance by effectively leveraging the GPU cache and parallel processing capabilities.

2) Benchmark Worker Process

The benchmark worker process primarily handles various I/O tasks, as well as pre- and post-processing. Multiple threads concurrently perform different tasks: one thread pulls jobs and downloads audio clips; another thread calls the inference; and the remaining threads manage tasks such as uploading generated assets, reporting job results, and cleaning the environment. Several queues are created to facilitate information exchange among these threads. Running two processes to segregate GPU-bound tasks from I/O- and CPU-bound tasks, and fetching the next audio clips while the current one is still being transcribed, allows us to eliminate any waiting period. After one audio clip is completed, the next is immediately ready for transcription. This approach not only reduces the overall processing time for batch jobs but also leads to even more significant cost savings.

Deployment on SaladCloud

We created a container group with 100 replicas (2 vCPU and 12 GB RAM, with 20 different GPU types) on SaladCloud and ran it for approximately 10 hours. In this period, we successfully transcribed over 2 million audio files, totaling nearly 8,000 hours in length. The test incurred around $100 in SaladCloud costs and less than $10 on both AWS and Cloudflare.
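As a rough illustration of the inference process described above, a minimal FastAPI endpoint built on the Hugging Face transformers pipeline might look like the sketch below. The route name, batch size, and loading options are assumptions rather than the exact production code; the 30-second chunking matches the description.

```python
import torch
from fastapi import FastAPI, UploadFile  # file uploads also require the python-multipart package
from transformers import pipeline

app = FastAPI()

# Load the model once at startup; chunk long audio into 30-second clips.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    chunk_length_s=30,
)

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio_bytes = await file.read()
    result = asr(audio_bytes, batch_size=16)  # run the 30 s chunks through the GPU in batches
    return {"text": result["text"]}

# Run with, e.g.: uvicorn app:app --host 0.0.0.0 --port 8000
```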
Results from the Whisper Large V3 benchmark

Among the 20 GPU types, based on the current datasets, the RTX 3060 stands out as the most cost-effective GPU type for long audio files exceeding 30 seconds. Priced at $0.10 per hour on SaladCloud, it can transcribe nearly 200 hours of audio per dollar. For short audio files lasting less than 30 seconds, several GPU types exhibit similar performance, transcribing approximately 47 hours of audio per dollar. On the other hand, the RTX 4080 outperforms the others as the best-performing GPU type for long audio files exceeding 30 seconds, boasting an average real-time factor of 40. This implies that the system can transcribe 40 seconds of audio per second. For short audio files lasting less than 30 seconds, the best average real-time factor is approximately 8, achieved by a couple of GPU types, indicating the ability to transcribe 8 seconds of audio in just 1 second.

Analysis of the benchmark results

Unlike results obtained in local tests with several machines on a LAN, all of these numbers were achieved in a global, distributed cloud environment that provides transcription at large scale, covering the entire process from receiving requests to transcribing and sending the responses. There are various ways to optimize these results; depending on whether you aim for reduced cost, improved performance, or both, different approaches may yield distinct outcomes. The Whisper models come in five configurations of varying model sizes: tiny, base, small, medium, and large (v1/v2/v3). The large versions are multilingual and offer better accuracy, but they demand more powerful GPUs and run relatively slowly. The smaller versions support only English with slightly lower accuracy, but they require less powerful GPUs and run very fast. Choosing more cost-effective GPU types in the resource pool will result in additional cost savings. If performance is the priority, selecting higher-performing GPU types is advisable, while still remaining significantly less expensive than managed transcription services. Additionally, audio length plays a crucial role in both performance and cost, and it’s essential to optimize the resource configuration based on your specific use cases and business goals.

Discover our open-source code for a deeper dive:
Implementation of Inference and Benchmark Worker
Docker Images
Data Exploration Tool

Performance Comparison across Different Clouds

The results indicate that AI transcription companies are massively overpaying for cloud today. With the open-source automatic speech recognition model Whisper Large V3 and the advanced batch-processing architecture leveraging hundreds of consumer GPUs on SaladCloud, we can deliver transcription services at massive scale and at an exceptionally low cost, while maintaining the same level of accuracy as managed transcription services. With the most cost-effective GPU type for Whisper Large V3 inference on SaladCloud, $1 can transcribe 11,736 minutes of audio (nearly 200 hours), showcasing a 500-fold …

Whisper Large V3 Speech Recognition Benchmark: 1 Million hours of audio transcription for just $5110 Read More »

Recognize anything model++ gpu benchmark

Tag 309K Images/$ with Recognize Anything Model++ (RAM++) On Consumer GPUs

What is the Recognize Anything Model++?

The Recognize Anything Model++ (RAM++) is a state-of-the-art image tagging foundation model released last year, with pre-trained model weights available on the Hugging Face Hub. It significantly outperforms other open models like CLIP and BLIP in both the scope of recognized categories and accuracy. But how much does it cost to run RAM++ on consumer GPUs?

In this benchmark, we tag 144,485 images from the COCO 2017 and AVA image datasets, evaluating inference speed and cost-performance. The evaluation was done across 167 nodes on SaladCloud representing 19 different consumer GPU classes. To do this, we created a container group targeting a capacity of 100 nodes, with the “Stable Diffusion Compatible” GPU class. All nodes were assigned 2 vCPU and 8 GB RAM. Here’s what we found.

Up to 309k images tagged per dollar on RTX 2080

In keeping with a trend we often see here, the best cost-performance comes from the lower-end GPUs: the RTX 20- and 30-series cards. In general, we find that the smallest/cheapest GPU that can do the job you need is likely to have the best cost-performance in terms of inferences per dollar (a back-of-the-envelope check of this figure follows below). RAM++ is a fairly small, lightweight model (3 GB), and it achieved its best performance on the RTX 2080, with just over 309k inferences per dollar.

Average Inference Time Is <300ms Across All GPUs

We see relatively quick inference times across all GPU types, but we also see a fairly wide distribution of performance, even within a single GPU type. Zooming in, we can see this wide distribution is also present within a single node. Further, we see no significant correlation between inference time and the number of tags generated:

RTX 2080: 0.04255
RTX 2080 SUPER: -0.02209
RTX 2080 Ti: -0.03439
RTX 3060: 0.00074
RTX 3060 Ti: 0.00455
RTX 3070: 0.00138
RTX 3070 Laptop GPU: -0.00326
RTX 3070 Ti: -0.01494
RTX 3080: -0.00041
RTX 3080 Laptop GPU: -0.09197
RTX 3080 Ti: 0.02748
RTX 3090: -0.00146
RTX 4060: 0.03447
RTX 4060 Laptop GPU: -0.08151
RTX 4060 Ti: 0.04153
RTX 4070: 0.01393
RTX 4070 Laptop GPU: -0.05811
RTX 4070 Ti: 0.00359
RTX 4080: 0.02090
RTX 4090: -0.03002

Based on this, you should expect to see fairly wide variation in inference time in production, regardless of your GPU selection or image properties.

Results from the Recognize Anything Model++ (RAM++) benchmark

Consumer GPUs offer a highly cost-effective solution for batch image tagging, coming in at 60x–300x the cost efficiency of managed services like Azure AI Computer Vision. The Recognize Anything paper and code repository offer guides to train and fine-tune this model on your own data, so even if you have unusual categories, you should consider RAM++ instead of commercially available managed services.

Resources

Shawn Rushefsky
Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
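As a back-of-the-envelope check on the images-per-dollar figure reported above, the arithmetic is simply throughput divided by hourly price. The per-image time and node price below are hypothetical values chosen only to land near the reported RTX 2080 number, not measured results.

```python
def images_per_dollar(seconds_per_image: float, node_price_per_hour: float) -> float:
    """Images tagged per dollar, assuming the GPU is kept fully busy."""
    images_per_hour = 3600.0 / seconds_per_image
    return images_per_hour / node_price_per_hour

# Hypothetical: ~0.29 s per image on a node priced at $0.04/hour.
print(f"{images_per_dollar(0.29, 0.04):,.0f} images per dollar")  # ≈ 310,000
```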

Tag 309K Images/$ with Recognize Anything Model++ (RAM++) On Consumer GPUs Read More »

Segment anything model (SAM) benchmark on consumer GPUs on SaladCloud

Segment Anything Model (SAM) Benchmark: 50K Images/$ on Consumer GPUs

What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a foundational image segmentation model released by Meta AI Research last year, with pre-trained model weights available through the GitHub repository. It can be prompted with a point or a bounding box, and it performs well on a variety of segmentation tasks. More importantly, it carries the permissive Apache 2.0 license, allowing commercial use. As companies deploy this model for use cases ranging from image labeling and background removal to inpainting and more, the cost of running SAM in production is a primary concern.

Benchmarking the Segment Anything Model (SAM) on SaladCloud

In this benchmark, we perform unprompted full-image segmentation on 152,848 images from the COCO 2017 and AVA image datasets (a minimal sketch of this kind of segmentation follows below). We evaluate inference speed and cost-performance across 302 nodes on SaladCloud representing 22 different consumer GPU classes. To do this, we created a container group targeting a capacity of 100 nodes, with the “Stable Diffusion Compatible” GPU class. All nodes were assigned 2 vCPU and 8 GB RAM. Here’s what we found.

50K+ images segmented per dollar on RTX 3060 Ti & RTX 3070 Ti

As is nearly always the case with smaller models, the best cost-performance comes from the lower-end GPUs, mostly the RTX 30-series cards. In this case, we see a significant bump in cost-performance on the Ti cards. This makes sense, since they are priced the same as their non-Ti counterparts but have more CUDA cores. The standout performers here are the RTX 3060 Ti and the RTX 3070 Ti, each offering at least 50k inferences per dollar.

Inference time is fairly consistent within a particular node

Zooming into performance within a single GPU class – the RTX 3070 Ti – we see that the bulk of inference times fall within a narrow range on any particular node, with some significant outliers. We do see some variability across different nodes, with one standing out as particularly bad. We often see a small amount of variability in performance across nodes on SaladCloud, since each one is an individual residential gaming PC with a variety of different CPUs, RAM speeds, motherboard configurations, etc. Our one outlier node (31b6, circled above) is indicative of something anomalous with that machine. We’re always working to get better at detecting these scenarios before your workloads reach a bad machine, but the best practice is to monitor the performance of your application and terminate nodes that display anomalous behavior.

The range of inference times on one of our nodes (67acdb6b) may look concerning at first, but if we zoom in, we see those outlier times are exceedingly uncommon, with the vast majority of inferences clustered within a narrow range. And indeed, if we filter out the outliers, we see a much tighter grouping within each individual node. However, we also start to see two distinct groupings of machines. It is a little concerning that some machines are 35–40% faster than others, so this gets sent to our engineering team for further investigation. The cost-performance numbers above include all of these outliers and variability, so I suspect it is possible to beat those numbers.

Results from the Segment Anything Model (SAM) benchmark

The RTX 3060 Ti and RTX 3070 Ti running the Segment Anything Model (SAM) offer a highly cost-effective solution for batch image segmentation, coming in at 50x the cost efficiency of managed services like Azure AI Computer Vision.
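For reference, unprompted whole-image segmentation with Meta's segment-anything package follows the general pattern sketched below; the checkpoint variant, file paths, and device choice are assumptions rather than the benchmark's exact configuration.

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# The ViT-H checkpoint published in the facebookresearch/segment-anything repo;
# the local path is illustrative.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")

mask_generator = SamAutomaticMaskGenerator(sam)

# Load an image as RGB and generate masks for the whole image, no prompts needed.
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)
print(f"{len(masks)} masks generated")
```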
Shawn Rushefsky
Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.

Segment Anything Model (SAM) Benchmark: 50K Images/$ on Consumer GPUs Read More »

Deploy your own ChatGPT with Ollama, Huggingface Chat UI and Salad

Your own ChatGPT for just $0.04/hr – with Ollama, ChatUI and SaladCloud

Deploy your own LLM with Ollama & Huggingface Chat UI on SaladCloud

How much does it cost to build and deploy a ChatGPT-like product today? The cost could be anywhere from thousands to millions – depending on the model, infrastructure, and use case. Even the same task could cost anywhere from $1,000 to $100,000. But with the advancement of open-source models and open infrastructure, there has been tremendous interest in building a cost-efficient ChatGPT-like tool for various real-life applications. In this article, we explore how tools like Ollama and Huggingface Chat UI can simplify this process, particularly when deployed on Salad’s distributed cloud infrastructure.

The challenges in hosting & implementing LLMs

In today’s digital ecosystem, Large Language Models (LLMs) have revolutionized various sectors, including technology, healthcare, education, and customer service. Their ability to understand and generate human-like text has made them immensely popular, driving innovations in chatbots, content creation, and more. These models, with their vast knowledge bases and sophisticated algorithms, can converse, comprehend complex topics, write code, and even compose poetry. This makes them highly versatile tools for many enterprise and everyday use cases. However, hosting and implementing these LLMs poses significant challenges. Despite these challenges, the integration of LLMs into platforms continues to grow, driven by their vast potential and the continuous advancements in the field. As solutions like Hugging Face’s Chat UI and SaladCloud offer more accessible and efficient ways to deploy these models, we’re likely to see even greater adoption and innovation across industries.

What is Ollama?

Ollama is a tool that enables the local execution of open-source large language models like Llama 2 and Mistral 7B on various operating systems, including macOS, Linux, and soon Windows. It simplifies the process of running LLMs by allowing users to execute models with a simple terminal command or an API call (an example API call is sketched below). Ollama optimizes setup and configuration, specifically tailoring GPU usage for efficient performance. It supports a variety of models and variants, all accessible through the Ollama model library, making it a versatile and user-friendly solution for running powerful language models locally. Here is a list of supported models (model – parameters – size – download command):

Llama2 – 7B – 3.8 GB – ollama run llama2
Mistral – 7B – 4.1 GB – ollama run mistral
Dolphin Phi – 2.7B – 1.6 GB – ollama run dolphin-phi
Phi-2 – 2.7B – 1.7 GB – ollama run phi
Neural Chat – 7B – 4.1 GB – ollama run neural-chat
Starling – 7B – 4.1 GB – ollama run starling-lm
Code Llama – 7B – 3.8 GB – ollama run codellama
Llama 2 Uncensored – 7B – 3.8 GB – ollama run llama2-uncensored
Llama 2 13B – 13B – 7.3 GB – ollama run llama2:13b
Llama 2 70B – 70B – 39 GB – ollama run llama2:70b
Orca Mini – 3B – 1.9 GB – ollama run orca-mini
Vicuna – 7B – 3.8 GB – ollama run vicuna
LLaVA – 7B – 4.5 GB – ollama run llava

What is Huggingface Chat UI?

Huggingface Chat UI is a powerful tool for practitioners in the Large Language Model (LLM) space looking to deploy a ChatGPT-like conversational interface. It enables interaction with models hosted on Huggingface, leveraging its text generation inference or any custom API powered by an LLM. Chat UI offers capabilities such as conversation history, memory, authentication, and theming. Huggingface Chat UI is an ideal choice for those looking to create a more engaging and robust conversational agent.
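For example, once the Ollama server is running (it listens on port 11434 by default), a model can be pulled and then queried over the local REST API; the prompt and model choice here are arbitrary.

```bash
ollama pull llama2

curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```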
Integrating Ollama and Huggingface Chat UI for deployment on SaladCloud

The main goal of our project is to integrate Ollama with Huggingface Chat UI and deploy them to SaladCloud. The final version of the code can be found here: GitHub – SaladTechnologies/ollama-chatui. To achieve our goal, we did the following:

1. Clone the Ollama repository

We start by cloning the Ollama repository from the Ollama Git repo. This repository serves as the base of the project. Ollama is a user-friendly tool and can be operated via the terminal or as a REST API. In this project, the intention is to run Ollama in a Docker container and connect it to Chat UI. The Dockerfile from the Ollama repository shows that it runs on host 0.0.0.0 and port 11434. However, since Ollama will be accessed through the UI rather than directly, this configuration will be modified later.

2. Setting up Huggingface Chat UI

Chat UI Git repo: GitHub – huggingface/chat-ui: Open source codebase powering the HuggingChat app. From the Chat UI readme, we can see that we need to follow a few steps to make it work in our custom solution. Notice that the path to Ollama is specified as http://127.0.0.1:11434.

3. Connecting Ollama and Chat UI

We now need to connect Ollama and Chat UI. This involves ensuring that Chat UI can communicate with the Ollama instance, typically by setting the appropriate port and host settings in the UI configuration to match the Ollama Docker deployment. First, we clone the Chat UI repo in our Dockerfile and replace the host that Ollama uses with 127.0.0.1. Next, we expose port 3000, which is used by Chat UI. We also replace the entrypoint with our custom shell script. With this script, we establish the necessary .env.local file and populate it with configurations. Next, we initiate the Ollama server in a separate tmux session to download the desired model. Chat UI is then activated on port 3000. For any adjustments in model settings, refer to the models_config/model.local file. We have also converted the MongoDB URL, Huggingface token, and model name into environment variables to facilitate seamless alterations during deployment to SaladCloud. Additionally, a DOWNLOAD_TIME variable is defined. Since Ollama runs in a tmux session, subsequent commands can execute even if the server isn’t fully operational; to ensure that Ollama is fully active before initiating Chat UI, we incorporate a sleep duration. This duration is model-dependent; for instance, downloading llama2 might take around 8 minutes.

4. Deploying to SaladCloud

After setting up and connecting Ollama and Chat UI, the complete system is ready for deployment to Salad’s cloud infrastructure. The integrated solution will be hosted on Salad’s robust cloud platform. Detailed deployment instructions and necessary files are accessible through the Salad Technologies Ollama Chat UI GitHub repository or by pulling the image from the Salad Docker registry: saladtechnologies/ollama-chatui-salad:1.0.0. To deploy our solution, we need to follow the instructions: Deploy a Container Group with …

Your own ChatGPT for just $0.04/hr – with Ollama, ChatUI and SaladCloud Read More »
