SaladCloud Blog

Save up to 77% on Zero-Knowledge Proofs with SaladCloud

Use decentralized SaladCloud compute as the most cost-effective power for zero-knowledge proof calculations. Introduction Over the past decade, digital technology has accelerated at a breathtaking pace. What used to be a world of simple 0s and 1s is now fully integrated into all aspects of daily life—social interactions, business workflows, even politics. Among the foundational technologies that impact society at large is blockchain. Blockchain emphasizes transparency—anyone can verify what happened on the ledger and when. But people and organizations still need privacy. How can those ideas coexist? With zero-knowledge proofs (ZKPs), it’s possible. What is ZKP?  A zero-knowledge proof lets one party—the prover—convince another—the verifier—that a statement is true without revealing the secret behind it. The verifier issues challenges that only someone with the hidden information can answer correctly—a guesser will fail with high probability. The Ali Baba cave parable is a popular analogy to explain how ZKPs work. Imagine a circular tunnel with two entrances that meet at a locked magic door opened only by a secret word. Peggy, who claims to know the word, walks into the cave while Victor waits outside; he then calls out a random entrance for her to reappear from. If Peggy truly knows the word, she can open the door inside and emerge from whichever entrance Victor names; if she doesn’t, she can only come out from the one she originally took, matching Victor’s request only by luck. Repeating this many times takes luck out of the picture, so Victor becomes convinced she knows the secret. He learns nothing about the word itself—only that Peggy consistently succeeds—capturing the essence of zero knowledge. ZKPs were first formalized in a 1985 paper by Shafi Goldwasser, Silvio Micali, and Charles Rackoff. They showed a prover can convince a verifier of a fact without revealing anything else about the data. 
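The claim that repetition takes luck out of the picture can be made concrete. A prover who does not know the word passes a round only when Victor happens to name the entrance she took, so with two equally likely entrances her chance of surviving n independent rounds is (1/2)^n. The sketch below is an illustration only; the function names are made up for this example. It computes that bound and compares it with a small simulation.

```python
import random


def cheating_success_probability(rounds: int) -> float:
    """Chance that a prover WITHOUT the secret passes every round by luck."""
    return 0.5 ** rounds


def simulate_cheater(rounds: int, trials: int, seed: int = 0) -> float:
    """Estimate the same probability empirically.

    In each round the cheater enters by a random entrance and Victor names
    a random entrance; the cheater passes the round only if they match.
    """
    rng = random.Random(seed)
    passed_all = 0
    for _ in range(trials):
        if all(rng.choice("AB") == rng.choice("AB") for _ in range(rounds)):
            passed_all += 1
    return passed_all / trials


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, cheating_success_probability(n), simulate_cheater(n, 100_000))
```

The bound halves with every extra round, which is why a modest number of repetitions is enough to convince Victor.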
ZKPs come in two modes: interactive (back-and-forth with a specific verifier) and non-interactive (a single proof anyone can verify). Two widely used non-interactive families today are zk-STARKs (zero-knowledge scalable transparent argument of knowledge) and zk-SNARKs (zero-knowledge succinct non-interactive argument of knowledge). Basic Architecture of ZKP Although the underlying math is complex, it’s useful to keep a high-level view of the workflow in mind. A zero-knowledge proof involves two parties: a prover, who possesses a secret (the witness), and a verifier, who needs to be convinced that a statement about that secret is true without learning the secret itself. The statement is first written in a precise, mathematical form—often as a circuit or a small program that encodes the rules. Typical claims might be “I control this account without revealing the private key,” or “this confidential transaction balances correctly without showing the amounts,” or even “this output truly came from running that program on hidden input.” The process often begins with setup, which ties the intended computation to the proof system. Not every scheme needs a per-circuit setup: many SNARKs do, while STARKs and some modern SNARK variants are transparent and rely on public randomness or universal parameters instead of a special ceremony. When setup is required, it produces two artifacts: a proving key, used by whoever will build proofs, and a verification key, used by anyone to check them. Who runs or validates this step is part of the trust model; running it yourself—or at least verifying how it was produced—ensures the circuit being proved is exactly the one you expect. The heart of the workflow is proving. The prover combines public inputs with the private witness and the proving key to construct a compact proof. This is the compute-intensive phase. 
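To make the prover, verifier, and witness roles tangible, here is a minimal sketch of the "I control this account without revealing the private key" claim using a Schnorr-style proof of knowledge of a discrete logarithm, made non-interactive by hashing the transcript. This is not a SNARK or a STARK and has no circuit or setup phase. The group parameters are toy values chosen for readability and offer no security, and every function name is an assumption of this example rather than any library's API.

```python
import hashlib
import secrets

# Toy parameters (assumption: illustration only, far too small for real use).
P = 23  # prime modulus
Q = 11  # prime order of the subgroup generated by G
G = 2   # generator of the order-Q subgroup of the integers modulo P


def challenge(public_key: int, commitment: int) -> int:
    """Derive the challenge from a transcript hash instead of a live verifier."""
    data = f"{P}|{Q}|{G}|{public_key}|{commitment}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def keygen() -> tuple[int, int]:
    """Return (witness, public key): a secret x and y = G^x mod P."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def prove(x: int, y: int) -> tuple[int, int]:
    """Prover: commit to a random nonce, then answer the hashed challenge."""
    k = secrets.randbelow(Q - 1) + 1
    r = pow(G, k, P)
    c = challenge(y, r)
    s = (k + c * x) % Q
    return r, s


def verify(y: int, proof: tuple[int, int]) -> bool:
    """Verifier: check G^s == r * y^c (mod P) using public values only."""
    r, s = proof
    c = challenge(y, r)
    return pow(G, s, P) == (r * pow(y, c, P)) % P


if __name__ == "__main__":
    secret, public = keygen()
    print(verify(public, prove(secret, public)))
```

The verifier only ever sees the public key and the pair (r, s). The fresh random nonce k masks the witness in s, which is the property production systems achieve at far larger parameter sizes.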
In SNARK systems, most time and memory go into large polynomial transforms (FFT/NTT, i.e., fast Fourier transforms / number-theoretic transforms) and big elliptic-curve multi-scalar multiplications (MSMs), with a transcript hash to make the proof non-interactive. In STARK systems, the weight shifts to generating a large execution trace, extending it to low degree via FFT, committing with Merkle trees, and running hash-heavy FRI checks (Fast Reed–Solomon Interactive Oracle Proofs of Proximity). All of this parallelizes well, which is why the prover is the part you accelerate on GPUs—exactly the workload that benefits from running on SaladCloud. Verification is deliberately lightweight. A verifier takes the proof, the public inputs, and the verification key and performs a small amount of work—typically a few pairings or an inner-product check in SNARKs, or a handful of hash lookups and FRI checks in STARKs. On a server this is quick; on-chain it’s still modest in compute terms, though gas costs matter. Throughout, the verifier learns nothing about the witness itself—only that the statement is true. Diagram from https://docs.midnight.network/ In simple terms: setup is the first step (when your scheme requires it) and establishes the keys; proving is the heavy lift that turns secrets and inputs into a succinct proof; verification is the fast final check that makes the result easy to trust. Performance of Zero-Knowledge Proofs on GPUs Zero-knowledge proving is dominated by a handful of algebraic kernels that are both highly parallel and memory-hungry. In SNARK stacks those are large NTT/FFT stages and MSMs on elliptic curves; in STARK stacks, you also see big hash-driven stages (Merkle/FRI) around an execution trace. When MSM is well-optimized on the device, studies find NTT often becomes the main bottleneck—accounting for a large share of wall-clock time—and end-to-end GPU implementations can deliver order-of-magnitude speedups versus CPU baselines. 
That mix of wide parallelism and streaming access patterns is exactly what modern GPUs are built for (arXiv).

You don’t need datacenter-only hardware to benefit. Several open zkVM (zero-knowledge virtual machine) and prover stacks officially support single, consumer-grade GPUs: RISC Zero exposes a one-flag CUDA path for its prover, zkSync’s Airbender advertises single-GPU block proving from RTX 4090 up to H100, and SP1 documents GPU proving with a practical floor around 24 GB of VRAM (video RAM) and Compute Capability ≥ 8.6.

Many provers are already open source and optimized for running on GPUs. Airbender has public releases and CUDA components; SP1 provides GPU requirements and container-runtime notes; RISC Zero’s prover can be toggled to CUDA; and acceleration libraries like ICICLE make MSM/NTT/hash kernels pluggable in your own images. Some of these solutions are already containerized or come with straightforward instructions for containerizing them. That means you can run your own provers on compute of your

Hunyuan3D 2.1 Image-to-3D: Achieve $0.009/generation on SaladCloud and save 94% over FAL

Once upon a time, in 2019, it was considered very impressive to generate an approximate depth map from a single image, a technique known as monocular depth estimation. Today, AI models that can run on your laptop can generate fully textured 3D assets ready to import into game engines and modeling tools. We benchmarked one such model, Hunyuan3D 2.1 from Tencent, on SaladCloud using RTX 4090 GPUs with 4 vCPUs and 38 GB RAM. For inference we used ComfyUI and ComfyUI API, along with the Hunyuan3D 2.1 Custom Node. We found a median generation time of 139.2 seconds across more than 900 generations, coming out to $0.0148 / generation on High priority, and an impressive $0.009 / generation on Batch priority. This is more than 90% cheaper than FAL’s Hunyuan 3D 2.0 endpoint, even on High priority. If you read our other benchmarks, this will come as no surprise to you. ComfyUI + SaladCloud is an easy, cost-effective way to serve diffusion models at scale, including Image-to-3D models like this one.

Example Outputs

The results aren’t Pixar quality or anything, but overall very impressive for something that took 2 minutes and little-to-no skill. Our input images were AI generated as well, so the full pipeline is Text-to-Image-to-3D.

Potion Bottle Rabbit Astronaut Dog Spaceship
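The per-generation figures quoted above follow from simple arithmetic: a generation costs its run time multiplied by the hourly price of the node. The sketch below shows that relationship and back-calculates the hourly rate each quoted price implies. It assumes one generation at a time per node and billing proportional to run time. The implied rates are derived values, not prices quoted in the benchmark, and the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600.0


def cost_per_generation(seconds: float, hourly_rate_usd: float) -> float:
    """Cost of one generation on a node billed at a flat hourly rate."""
    return seconds / SECONDS_PER_HOUR * hourly_rate_usd


def implied_hourly_rate(seconds: float, cost_usd: float) -> float:
    """Invert the formula: the hourly rate that yields a given per-generation cost."""
    return cost_usd * SECONDS_PER_HOUR / seconds


MEDIAN_SECONDS = 139.2  # median generation time reported in the benchmark

high_rate = implied_hourly_rate(MEDIAN_SECONDS, 0.0148)   # High priority figure
batch_rate = implied_hourly_rate(MEDIAN_SECONDS, 0.009)   # Batch priority figure

if __name__ == "__main__":
    print(f"implied High priority rate:  ${high_rate:.3f}/hour")
    print(f"implied Batch priority rate: ${batch_rate:.3f}/hour")
```

The same two functions work for any other workload: plug in a measured run time and a price from the pricing page to estimate per-item cost before running a full benchmark.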

Dirt Cheap Image Captioning With Qwen Vision-Language Models: Up to 98.4% Cheaper than OpenAI


Image captioning with Qwen Vision Language Model

Image captioning and labeling play an important role in many AI and ML training workloads, and until fairly recently, have been limited in effectiveness both by available technology and cost. Enter open-source vision-language models like Alibaba’s Apache 2.0-licensed Qwen 2.5 VL, available in 3B and 7B sizes. Vision-language models provide substantial improvements over previous-generation solutions based on CLIP and BLIP. The ability to include a text prompt along with your image gives you a great deal of control as to the style and content of the returned captions. Additionally, self-hosting language models gets easier and more performant all the time thanks to projects like Hugging Face’s Text Generation Inference (TGI).

Comparing Qwen 2.5 VL against OpenAI’s GPT 4o mini

We deployed both the 3B and 7B versions of Qwen 2.5 VL using TGI on SaladCloud, and benchmarked their cost-performance using the COCO2017 image dataset. We compared this to using OpenAI’s GPT 4o mini, their least expensive vision-language model. We also used the “low” detail setting with 4o mini, to try to achieve the least expensive possible use of OpenAI’s services. We found that OpenAI measured roughly twice as many input tokens per image as our TGI/Qwen setup, and cost quite a lot more per token as well. In the best cost-performance result of the benchmark, Qwen 2.5 VL 3B achieved 126,650 images captioned per dollar on an RTX 4080, a savings of 98.4% over running the same workload on OpenAI. The rest of the numbers are detailed below.

High quality results from all models

All tested models provided very good captions, even correctly identifying text. You can see how the different prompts greatly affect the generated captions.

Prompt: “What is in this image? Include details.”

Qwen2.5 VL 3B

The image depicts a street scene in a town with signs for a hotel, shops, and possibly a restaurant.
Notable objects in the image include: – A “Peace Way Hotel” sign prominently displayed on the left side of the street. This sign is likely nearby the hotel.– Multiple closed storefronts indicating a quiet day and/or area where businesses are closed.– A clear road with cars, potentially indicating moderate traffic volume.– Buildings with various signs and storefronts, suggesting a commercial area.– Smaller signs and billboards, some of which are partially obscured.– Raised traffic lights or signals, signaling that the road is a busy one.– A large hill or mountain visible in the background, adding to the scenic backdrop.– Some electrical or utility poles and wires extending overhead. Given the setting with sporadic closed shops and hotels, it might suggest a time frame like early morning or low season. Qwen2.5 VL 7B The image depicts a quiet street scene in a town, likely in a Middle Eastern region given the signage and architecture. Key details include: – Hotel Signage: Prominently featured is a sign for the “PEACE WAY HOTEL,” indicating hospitality services in the area.– Restaurant/Cafe: There is another sign labeled “NO NAME RESTAURANT & CAFE,” suggesting local dining options.– Buildings: The landscape includes low-rise buildings with various signs in different languages (likely Arabic and English), displaying shops and businesses.– Traffic: A few vehicles are visible, including a car on the road, suggesting minimal traffic activity.– Surroundings: The background displays a mountainous area, suggesting a hilly or dry environment.– Electrical Wires: The presence of utility poles and wires, indicative of urban infrastructure, adds to the character of the streetscape. Overall, the setting appears to be a small, possibly rural town, characterized by local commerce and quiet surroundings. GPT 4o mini The image shows a street scene that appears to be in a small town or city. 
There are several buildings lining the road, with a variety of signs in different languages, suggesting a diverse environment. Notably, there is a large sign that reads “PEACE WAY HOTEL,” which stands out prominently. The street is largely empty, indicating a calm atmosphere, and a few vehicles are parked or driving in the area. In the background, hills or mountains rise up, suggesting a rugged landscape surrounding the settlement. The overall mood of the scene is quiet and somewhat subdued, possibly indicating a less busy time of day or a less populated area. Prompt: “Describe the image in detail using comma-separated descriptor tags, with more prominent features toward the front of the list“ Qwen2.5 VL 3B hut, building, window, wood, blue, red, streetlamp, car, truck, sign, mountain, white, beige, asphalt, metal, pole, railing, water, fence, rock, Notify, post office, neuron, electricity, phone, Hotel, Peace, Way Hotel, peaces, signboard, route, mountain, facade, lights, road, run, windows, road, power lines, accommodation, ambulance, driveway, fire escape, handrail, post, craft, sign, go, power, electricity, wine, stop sign Qwen2.5 VL 7B desert-like building facades, narrow street, two signboards, Bavarian architecture style, arid landscape, ridges, single yellow car, white truck on right, administrative architecture, urban setting, signage, blue doors, underground barriers, Hill in background GPT 4o mini street scene, urban landscape, mountainous background, deserted road, hotel signage, small shops, muted colors, peaceful atmosphere, buildings with balconies, clear sky, parked vehicles, roadside signage, Mediterranean architecture, signage in Arabic and English, early morning light, light traffic, hilly terrain, quiet town atmosphere Prompt: “What is in this image? Include details.” Qwen2.5 VL 3B The image depicts a landscape scene with a vast expanse of snow-covered ground in the foreground. 
The middle ground includes trees and rock formations, suggesting a mountainous terrain. The background features rolling hills and more visible mountain ranges in the distance, under a partly cloudy sky. There is a signpost with blue and white labeling in the right foreground, but the details on the sign are not clearly visible. Qwen2.5 VL 7B The image shows a snowy landscape, likely taken on a mountaintop or hillside. The ground is covered with patches of snow, with some bare soil and vegetation visible where the snow has melted or been pushed away. In the distance, a hazy horizon stretches toward what appears to be valleys and mountains. There is dense evergreen forest

How Undetectable.ai saves over $50,000 a month with SaladCloud’s consumer GPUs


A surge of users for Undetectable.ai

Undetectable.ai is on a mission to solve one of the most pressing challenges in the era of large language models (LLMs): accurately detecting AI-generated content and seamlessly “humanizing” it so that it remains indistinguishable from natural human writing. In just a matter of months, Undetectable.ai skyrocketed from an idea around accurate AI detection to 14 million+ signups, serving both individuals and enterprises worldwide. Students worried about getting flagged for AI content turned to the Humanizer for rewriting their legitimate essays. Marketing agencies, churning out blogs and ad copy, used it to keep that “authentic human tone.” Even entire content-writing businesses popped up around the Undetectable.ai API. What began with a single product vision, an AI detector, quickly evolved into a two-pronged solution: the AI detector and the Humanizer. Both tools are powered by a highly custom, manually created dataset with tens of thousands of samples in multiple languages, delivering premium accuracy and humanization.

Scaling Compute – But at what cost?

As Undetectable.ai grew in popularity, Ben Miller, COO of Undetectable.ai, was wrestling with a problem that could make or break the startup: These queries weren’t trivial text in/out requests; they involved inference on specialized models requiring significant VRAM. To achieve this at scale, Undetectable.ai needed fast, cost-efficient, and highly adaptable GPU infrastructure – enter SaladCloud.

The problem with hyperscalers and high-end GPUs

Ben and team looked into the usual suspects: big cloud providers offering A100 or H100 GPUs with impressive performance – but equally jaw-dropping price tags. If they stayed on that path, Undetectable.ai would pay tens of thousands of dollars a month, maybe hundreds of thousands as traffic kept soaring. As a lean startup, they couldn’t sink all their resources into GPU fees. “Some of the A100 providers wanted a committed contract.
Our business being cyclical, this was not ideal”, adds Ben.  Meanwhile, the need for a flexible infrastructure kept growing. The usage spiked dramatically before midterm exams at universities and peaked again when marketing campaigns ramped up at end-of-quarter cycles. One day they’d need 20 GPUs, the next day maybe 300. “As we scaled, we had to find the most cost effective, scalable way for us to get good GPUs. That’s how I found SaladCloud” – Ben Miller Testing deployment on SaladCloud’s consumer GPUs “We started with a test of our custom model on an A100 and a consumer GPU. The A100 on another cloud could run the queries 3x faster than a consumer card on SaladCloud, but the price was 10 to 40 times higher. And so the math in favor of SaladCloud was very attractive.” – Ben Miller, COO, Undetectable.ai Instead of paying for pricey, high-end datacenter GPUs, Undetectable tapped into thousands of consumer GPUs on SaladCloud. There were immediate benefits. Ben adds, “I was attracted to this idea of a massive cloud with thousands of GPUs while we were struggling to get a single A100 on. It takes a week to get response from support on the hyperscalers while SaladCloud’s team was incredibly responsive”.  Saving $50k-$80k a month on SaladCloud As the numbers made sense, Undetectable.ai switched to SaladCloud. Almost overnight, they spun up the capacity to handle hundreds of thousands of queries per day – with compute nodes peppered across the entire planet. “We’re saving roughly $50,000–$80,000 a month by using SaladCloud instead of an enterprise GPU cloud or a high-end GPU. And that’s before we even refine our autoscaling to handle weekend vs. weekday usage more precisely.” – Ben Miller, COO, Undetectable.ai Ensuring AI content stays human At its core, Undetectable.ai’s story is about pushing the boundaries of LLM usage. 
This year alone, the team is introducing 20-30 innovative products, including the AI Essay Writer, which helps students refine their essays, and the AI Job Application Bot, designed to automate the job search process, helping professionals save time and increase their chances of securing their next position. Alongside these advancements, Undetectable.ai is also onboarding a growing number of enterprise customers, solidifying its role as a leader in humanizing AI-generated content. With AI enabling humans to create massive amounts of content in minutes, tools like Undetectable.ai are ensuring the digital world doesn’t become polluted by robotic, emotionless content.

Undetectable.ai’s partnership with SaladCloud showcases the power of leveraging consumer-grade GPU clouds for demanding AI workloads. By placing cost efficiency and scalability at the forefront, Undetectable.ai now processes hundreds of thousands of queries daily, meeting the needs of marketers, students, educators, and content creators, all without compromising model accuracy or user experience. For AI companies deploying large language models (LLMs) at scale, Undetectable.ai’s story is a testament to thinking beyond the conventional (and often prohibitively expensive) approach of enterprise GPUs. With SaladCloud, they’ve unlocked a massive, globally distributed network of GPUs capable of powering advanced AI solutions at a fraction of the cost.

SaladCloud

SaladCloud is the world’s largest distributed cloud computing network with 11,000+ daily GPUs and 450,000 GPUs contributing compute, all at the lowest cost in the market.

Flux.1-Dev benchmark: 992 images per dollar on SaladCloud


Flux.1-Dev: An Introduction

Flux.1 is a new series of text-to-image models from Black Forest Labs that has set the new standard in quality and prompt adherence, and it can even render legible text. The Flux.1-Dev version of the model – a 12 billion parameter rectified flow transformer – generates high quality images in about 20 steps, and is released under a non-commercial license. In this benchmark, we measure the speed and cost performance of this new model on SaladCloud.

* An earlier benchmark of Flux.1-Schnell delivered 5243 images per dollar on SaladCloud.

Benchmark Design

The benchmark was conducted using k6, a modern load testing tool from Grafana Labs, to simulate a gradually increasing load from 7 to 12 virtual users over approximately 1.9 hours. See the exact configuration in GitHub. The test environment consisted of a container group on SaladCloud with 8-10 replicas (most commonly running 9 replicas). Each virtual user submitted continuous consecutive image generation requests to the container group, and response time and failures were measured. Image generation requests consisted of 20 steps at a resolution of 1024×1024. Each node was configured with

Deploying Flux.1-Dev on SaladCloud

To reproduce this benchmark, deploy the “Flux.1-Dev (FP8) – ComfyUI API” recipe from the “Create container group” page in the SaladCloud portal. Set the priority to “Batch” to optimize for cost-effectiveness.

Benchmark results

Conclusion

Deploying Flux.1-Dev on RTX 4090 (24 GB) GPUs on SaladCloud (Batch priority) delivers 992 images per dollar. As with other text-to-image models on SaladCloud, deploying Flux.1-Dev in production results in more inferences per dollar and significant cost savings. Cost per image is just $0.00101. The Flux.1-Dev model demonstrates impressive stability and efficiency in this benchmark. With a 99.78% success rate and consistent response times averaging under 18 seconds, the model can readily be used for production deployments.
The system showed moderate scalability, maintaining performance as virtual users increased from 7 to 12, with peak throughput achieved at 10 VUs. Under sudden spikes of traffic, an increased timeout and error rate should be expected. Interested in free credits to try SaladCloud for Image Generation? Contact our support team today.
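The two headline figures in this benchmark, 992 images per dollar and $0.00101 per image, are reciprocals of each other. The short sketch below shows the conversion and uses it to compare against the Flux.1-Schnell figure mentioned earlier. The helper names are illustrative, and the comparison is plain arithmetic on the published numbers, not an additional measurement.

```python
def cost_per_image(images_per_dollar: float) -> float:
    """Convert a throughput-per-dollar figure into a per-image cost."""
    return 1.0 / images_per_dollar


def relative_cost(images_per_dollar_a: float, images_per_dollar_b: float) -> float:
    """How many times more expensive per image model A is than model B."""
    return cost_per_image(images_per_dollar_a) / cost_per_image(images_per_dollar_b)


FLUX_DEV_IMAGES_PER_DOLLAR = 992       # this benchmark
FLUX_SCHNELL_IMAGES_PER_DOLLAR = 5243  # earlier benchmark cited in the article

if __name__ == "__main__":
    print(f"Flux.1-Dev cost per image: ${cost_per_image(FLUX_DEV_IMAGES_PER_DOLLAR):.5f}")
    ratio = relative_cost(FLUX_DEV_IMAGES_PER_DOLLAR, FLUX_SCHNELL_IMAGES_PER_DOLLAR)
    print(f"Flux.1-Dev costs about {ratio:.1f}x as much per image as Flux.1-Schnell")
```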

Stable diffusion 1.5 benchmark: 14,000+ images per dollar on SaladCloud


Stable Diffusion 1.5 benchmark on consumer GPUs

Since our last Stable Diffusion benchmark nearly a year ago, a lot has changed. While we previously used SD.Next for inference, ComfyUI has become the de facto image generation inference server for most professional use, owing to its high degree of flexibility, its best-in-class performance, and the fact that it is nearly always first to support new models and technologies. SaladCloud has introduced new priority pricing levels, offering significantly lower prices on all GPUs, including top-of-the-line models like RTX 4090. These factors combine to yield much lower per-image inference costs than we achieved previously, and with a much simpler build. In this Stable Diffusion 1.5 benchmark, we evaluate the performance of SD 1.5 on 3 consumer GPUs on SaladCloud: RTX 4090, RTX 3090 & RTX 3060 Ti.

Stable Diffusion 1.5 benchmark design

We deployed the “Dreamshaper 8 – ComfyUI” recipe on SaladCloud, using the default configuration, but setting priority to “Batch”, and requesting 10 replicas. We started the benchmark when we had at least 8/10 replicas running. We used Postman’s collection runner feature to simulate load, first from 10 concurrent users, then ramping up to 25 concurrent users. Each test ran for 1 hour. Our virtual users submit requests to generate 1 image like this:

We duplicated this setup to test RTX 3060 Ti (8gb vram), RTX 3090 (24gb vram), and RTX 4090 (24gb vram).

What we measured:

Deployment on SaladCloud

Click through the Dreamshaper 8 recipe, available from the Container Groups interface, and set replica count to 10. Optionally, set a non-default priority, and/or enable authentication. For our benchmark, we used “Batch” priority, and did not enable authentication.

Results from the Stable Diffusion 1.5 benchmark

While the RTX 4090 unsurprisingly had the best raw performance, the RTX 3090 came in very close, at better cost-performance.
The default configuration with the RTX 3060 Ti showed remarkably good response times, and the best cost-performance. Across all tests we can see that as load increases, average round-trip time increases for requests. We did not always have the maximum requested replicas running, which is expected. SaladCloud only bills for the running instances, so this really just means we’d want to set our desired replica count to a marginally higher number than what we actually think we need. We saw a small number of failed requests that coincided with node reallocations. This is expected, and you should handle this case in your application via retries. Interested in deploying on SaladCloud? Contact our support team today. RTX 4090 (24gb vram) RTX 3090 (24gb vram) RTX 3060 Ti (8gb vram) [default] Interested in free credits to try SaladCloud for Image Generation? Contact our support team today.
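The advice above to handle node reallocations via retries can be sketched as a small wrapper with exponential backoff. This is a minimal illustration, not SaladCloud's recommended client: `request_fn` stands in for whatever function your application uses to call the image generation endpoint, and the exception types, attempt count, and delays are assumptions you would tune for your own stack.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    request_fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request_fn, retrying transient failures with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts, and re-raises the
    last error once max_attempts calls have failed.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

A request that fails because its node was reallocated mid-generation is simply submitted again, and the container group routes it to a replica that is still running. Keeping the `sleep` function injectable makes the wrapper easy to exercise in tests without real delays.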

Stable Diffusion XL (SDXL) benchmark: 3405 images per dollar on SaladCloud


Stable Diffusion XL (SDXL) benchmark on 3 RTX GPUs

Since our last SDXL benchmark nearly a year ago, a lot has changed. Community adoption of SDXL has increased significantly, and along with that comes better tooling, performance increases, and better understanding of how to get good results from the model. While we previously used SD.Next for inference, ComfyUI has become the de facto image generation inference server for most professional use, owing to its high degree of flexibility, its best-in-class performance, and the fact that it is nearly always first to support new models and technologies. SaladCloud has also introduced new priority pricing levels, offering significantly lower prices on all GPUs, including top-of-the-line models like RTX 4090. These factors combine to yield much lower per-image inference costs than we achieved previously, and with a much simpler build. With ComfyUI and new GPU pricing, we benchmark SDXL on 3 different consumer GPUs – RTX 4090, RTX 4080 and RTX 3090.

SDXL Benchmark Design

We deployed the “SDXL with Refiner – ComfyUI” recipe on SaladCloud, using the default configuration, but setting priority to “Batch”, and requesting 10 replicas. We started the benchmark when we had at least 8/10 replicas running. We used Postman’s collection runner feature to simulate load, first from 10 concurrent users, then ramping up to 18 concurrent users. Each test ran for 1 hour. Our virtual users submit requests to generate 1 image like this:

We duplicated this setup to test RTX 3090 (24gb vram), RTX 4080 (16gb vram), and RTX 4090 (24gb vram).

What we measured:

Deployment on SaladCloud

Click through the SDXL with Refiner recipe, available from the Container Groups interface, and set replica count to 10. Optionally, set a non-default priority, and/or enable authentication. For our benchmark, we used “Batch” priority, and did not enable authentication.
Results from the SDXL benchmark The RTX 4090 achieved the best performance, both in terms of inference time and cost-per-image, returning images in as little as 6.2s / Image, and at a cost as low as 3405 images / $. Across all tests we can see that as load increases, average round-trip time increases for requests. We did not always have the maximum requested replicas running, which is expected. SaladCloud only bills for the running instances, so this really just means we’d want to set our desired replica count to a marginally higher number than what we actually think we need. We saw a small number of failed requests that coincided with node reallocations. This is expected, and you should handle this case in your application via retries. Interested in deploying on SaladCloud? Contact our support team today. RTX 4090 (24gb vram) RTX 4080 (16gb vram) RTX 3090 (24gb vram) Interested in free credits to try SaladCloud for Image Generation? Contact our support team today.

Molecular Simulation: GROMACS Benchmark on 30 GPUs on SaladCloud, 90+% Cost Savings


Note: Prices have fallen considerably since this benchmark was conducted, so actual costs will be even lower!

Benchmarking GROMACS for Molecular Simulation on Consumer GPUs

In this deep dive, we will benchmark GROMACS on SaladCloud, analyzing simulation speed and cost-effectiveness across a spectrum of small, medium, and large molecular systems. Additionally, we will provide recommendations for selecting the most appropriate resource types for various workloads on SaladCloud. Building on the OpenMM benchmark on SaladCloud and our continuous efforts to optimize system architecture and batch job implementation, we have achieved a 90% cost savings by using consumer GPUs for molecular simulations with GROMACS, compared to CPUs and data center GPUs.

GROMACS is a highly optimized, open-source software package for molecular dynamics simulations. Researchers in fields like biochemistry, biophysics, and materials science widely use it to study the physical movements of atoms and molecules over time. GROMACS stands out for its exceptional performance compared to other programs, efficiently leveraging both CPU and GPU resources. This capability enables effective static and dynamic load balancing across the system’s various components.

Are you running more than $250K/yr in MDS compute? Migrate to the lowest cost GPU cloud with free, white-glove engineering support.

GROMACS benchmark methodology

gmx mdrun is the main computational chemistry engine within GROMACS. The following command performs molecular dynamics simulations in the target environment:

The mdrun program reads the input TPR file (-s), which contains the initial molecular topology and parameters, and produces several output files (-deffnm) with different extension names for logs, trajectories, structures and energies.
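As an illustration of the kind of invocation described here, the sketch below assembles a `gmx mdrun` command line in Python. It is not the exact command used in the benchmark: the file names are placeholders, the choice of a single thread-MPI rank is an assumption, and whether every calculation can be offloaded with `gpu` depends on the simulation setup. The flags themselves (`-s`, `-deffnm`, `-nb`, `-pme`, `-bonded`, `-update`, `-ntmpi`, `-ntomp`) are standard mdrun options.

```python
import shlex
from typing import List


def build_mdrun_command(
    tpr_file: str,
    output_prefix: str,
    omp_threads: int = 8,
    offload: str = "gpu",
) -> List[str]:
    """Assemble a gmx mdrun argument list (the command is not executed here)."""
    return [
        "gmx", "mdrun",
        "-s", tpr_file,            # input run file: topology and parameters
        "-deffnm", output_prefix,  # common prefix for log, trajectory, energy files
        "-nb", offload,            # non-bonded interactions
        "-pme", offload,           # long-range electrostatics (PME)
        "-bonded", offload,        # bonded interactions
        "-update", offload,        # integration and constraints
        "-ntmpi", "1",             # assumption: one thread-MPI rank per node
        "-ntomp", str(omp_threads),  # OpenMP threads per rank
    ]


# Placeholder file names, chosen only for this example.
command = build_mdrun_command("benchmark.tpr", "benchmark_run")

if __name__ == "__main__":
    print(shlex.join(command))
```

Building the argument list rather than a single string makes it straightforward to pass to `subprocess.run` inside a container entrypoint and to vary the thread count per node.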
GROMACS relies on close collaboration between the CPU and GPU to achieve optimal performance. Although many calculations can be offloaded to the GPU using the options (-nb, -pme, -bonded, -update), the program still demands considerable CPU processing power and multiple threads for task management, communication, and I/O operations. To fully utilize a powerful GPU, GROMACS also depends on robust CPU performance. Running more OpenMP threads than the number of physical cores can be beneficial for GROMACS in certain situations, but for our benchmark test we only selected Salad nodes with CPUs that have 8 or more cores and configured each node to run 8 OpenMP threads (-ntmpi, -ntomp).

We used GROMACS 2024.1 with CUDA 11.8 to build the container image. When running on SaladCloud, the container first runs the simulations against typical molecular systems, reports the test data to an AWS DynamoDB table, and then exits. Finally, the data is downloaded and analyzed using Pandas on JupyterLab.

Two key performance indicators are collected and analyzed during the test:

ns/day stands for nanoseconds per day. It measures simulation speed, indicating how many nanoseconds of simulated time can be computed in one day of real time.

ns/dollar stands for nanoseconds per dollar. It measures cost-effectiveness, showing how many nanoseconds of simulated time can be computed for one dollar.

Below are the two scenarios and the methods used to collect data and calculate the final results:

| Scenario | Resource | Simulation Speed (ns/day) | Cost Effectiveness (ns/dollar) |
| --- | --- | --- | --- |
| Consumer GPUs | 8 cores for 8 OpenMP threads; 30 GPU types | Create a container group with 100 instances with all GPU types on SaladCloud, and run it for a few hours. Once the code execution is finished on an instance, SaladCloud will allocate a new node and continuously run the instance. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type. Calculate the average performance for each GPU type. | Pricing from the SaladCloud Price Calculator: $0.072/hour for 16 vCPUs, 8GB RAM; $0.015 ~ $0.18/hour for different GPU types (Priority: Batch). https://salad.com/pricing |
| Data Center GPUs | 16 cores for 16 OpenMP threads; A40 48GB, A100 40GB, H100 80GB | Use the test data in the GROMACS benchmarks by NHR@FAU. | The lowest prices in the data center GPU market that closely match the resource requirements were selected: $1.86/hour for A40 (24 vCPUs); $1.29/hour for A100 (30 vCPUs); $2.99/hour for H100 (30 vCPUs). https://getdeploying.com/reference/cloud-gpu |

It is worth mentioning that performance can be influenced by many factors, such as operating systems (Windows, Linux, or WSL) and their versions, CPU models, GPU models and driver versions, CUDA framework versions, GROMACS versions, and additional features enabled in the runtime environment. It is very common to see different results between our benchmarks and those of others.

Benchmark Results

Here are six typical biochemical systems used to benchmark GROMACS:

| No | Model Description | Size |
| --- | --- | --- |
| 1 | R-143a in hexane (20,248 atoms) with very high output rate | Small |
| 2 | A short RNA piece with explicit water (31,889 atoms) | Small |
| 3 | A protein inside a membrane surrounded by explicit water (80,289 atoms) | Medium |
| 4 | A protein in explicit water (170,320 atoms) | Medium |
| 5 | A protein membrane channel with explicit water (615,924 atoms) | Large |
| 6 | A huge virus protein (1,066,628 atoms) | Large |

Model 1: R-143a in hexane (20,248 atoms) with very high output rate
Model 2: A short RNA piece with explicit water (31,889 atoms)
Model 3: A protein inside a membrane surrounded by explicit water (80,289 atoms)
Model 4: A protein in explicit water (170,320 atoms)
Model 5: A protein membrane channel with explicit water (615,924 atoms)
Model 6: A huge virus protein (1,066,628 atoms)

Observations from the GROMACS benchmark

Here are some interesting observations from the GROMACS benchmarks: The VRAM usage for all simulations is
only 1-2 GB, which means nearly all GPU types can theoretically be utilized to run these models. GROMACS primarily utilizes the CUDA Cores of GPUs (not Tensor Cores), and typically operates in single-precision (FP32). High-end GPUs generally outperform low-end models in simulation speed due to their greater number of CUDA cores and higher memory bandwidth. However, the flagship model of a GPU generation often surpasses the low-end models of the following generation. For smaller models, GPUs are often underutilized, and communication between the CPU and GPU can become a bottleneck, making CPU performance a critical factor in overall system performance. On nodes with GPUs of similar performance, higher CPU clock speeds and more physical cores usually lead to better performance. Data center GPUs are

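The two indicators used in the GROMACS benchmark above are linked by the hourly price: ns/dollar is ns/day divided by what it costs to run a node for 24 hours. The following is a minimal sketch of that arithmetic together with the per-GPU averaging step. It assumes the hourly cost is the CPU/RAM price plus the GPU price; the GPU labels, sample values, and the $0.10/hour GPU price are hypothetical, not benchmark results.

```python
from collections import defaultdict
from statistics import mean


def ns_per_dollar(ns_per_day, hourly_cost_usd):
    """Convert simulation speed (ns/day) into cost-effectiveness (ns/dollar)."""
    if hourly_cost_usd <= 0:
        raise ValueError("hourly cost must be positive")
    return ns_per_day / (24.0 * hourly_cost_usd)


def average_by_gpu(samples):
    """Average ns/day per GPU type from (gpu_type, ns_per_day) samples."""
    grouped = defaultdict(list)
    for gpu_type, ns_day in samples:
        grouped[gpu_type].append(ns_day)
    return {gpu_type: mean(values) for gpu_type, values in grouped.items()}


# Hypothetical samples reported by individual nodes (not real benchmark data).
samples = [("GPU-A", 100.0), ("GPU-A", 120.0), ("GPU-B", 300.0)]

# Assumption: node price = CPU/RAM price + GPU price, both per hour.
hourly_cost = 0.072 + 0.10

for gpu_type, avg_speed in average_by_gpu(samples).items():
    print(gpu_type, round(avg_speed, 1), "ns/day,",
          round(ns_per_dollar(avg_speed, hourly_cost), 1), "ns/dollar")
```

Averaging before converting keeps the cost figure tied to the typical node of each GPU type rather than to a single fast or slow outlier.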
Molecular Simulation: OpenMM Benchmark on 25 Consumer GPUs, 95% Less Cost

Note: Prices have fallen considerably since this benchmark was conducted, so actual costs will be even lower!

Benchmarking OpenMM for Molecular Simulation on Consumer GPUs

OpenMM is one of the most popular toolkits for molecular dynamics simulations, renowned for its high performance, flexibility, and extensibility. It enables users to easily incorporate new features, such as novel forces, integration algorithms, and simulation protocols, which can run efficiently on both CPUs and GPUs. This analysis uses typical biochemical systems to benchmark OpenMM on SaladCloud’s network of AI-enabled consumer GPUs. We will analyze simulation speed and cost-effectiveness in each case and discuss how to build high-performance and reliable molecular simulation workloads on SaladCloud. This approach supports unlimited throughput and offers over 95% cost savings compared to solutions based on data center GPUs.

Why run Molecular Simulations on GPUs?

GPUs have a high degree of parallelism, which means they can perform many calculations simultaneously. This is particularly useful for molecular simulations, which involve many repetitive calculations, such as evaluating forces between atoms. Using GPUs can significantly accelerate molecular simulations, offering nearly real-time feedback and allowing researchers to run more simulations in less time. This enhanced efficiency accelerates the pace of discovery and lowers computational costs.

OpenMM benchmark methodology

The OpenMM team has provided benchmarking code in Python, along with benchmarks of simulation speed for typical biochemical systems based on OpenMM 8.0. To conduct the benchmarking test, you can run the benchmark script for each test system on the target environment. Following the OpenMM benchmarks, we used OpenMM 8.0 with CUDA 11.8 to build the container image.
When running on SaladCloud, the container first executes the benchmarking code, reports the test data to an AWS DynamoDB table, and then exits. Finally, the data is downloaded and analyzed using Pandas on JupyterLab.

We primarily focused on two key performance indicators across three scenarios:

ns/day stands for nanoseconds per day. It measures simulation speed, indicating how many nanoseconds of simulated time can be computed in one day of real time.

ns/dollar stands for nanoseconds per dollar. It measures cost-effectiveness, showing how many nanoseconds of simulated time can be computed for one dollar.

Molecular simulations often operate on the timescale of nanoseconds to microseconds, as molecular motions and interactions occur very rapidly.

Below are the three scenarios and the methods used to collect data and calculate the final results:

| Scenario | Resource | Simulation Speed (ns/day) | Cost Effectiveness (ns/dollar) |
| --- | --- | --- | --- |
| CPUs | 16 vCPUs, 8GB RAM | Create a container group with 100 instances with all GPU types on SaladCloud and run it for a few hours. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type. Calculate the average performance for each GPU type. | Pricing from the SaladCloud Price Calculator: $0.040/hour for 8 vCPUs, 8GB RAM; $0.072/hour for 16 vCPUs, 8GB RAM; $0.02 ~ $0.30/hour for different GPU types. https://salad.com/pricing |
| Consumer GPUs | 8 vCPUs, 8GB RAM; 20+ GPU types | Create a container group with 100 instances with all GPU types on SaladCloud and run it for a few hours. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type. Calculate the average performance for each GPU type. | Pricing from the SaladCloud Price Calculator: $0.040/hour for 8 vCPUs, 8GB RAM; $0.072/hour for 16 vCPUs, 8GB RAM; $0.02 ~ $0.30/hour for different GPU types. https://salad.com/pricing |
| Datacenter GPUs | A100, H100 | Use the test data in the OpenMM benchmarks. | Pricing from the AWS EC2 Capacity Blocks: $1.844/hour for 1 A100; $4.916/hour for 1 H100. https://aws.amazon.com/ec2/capacityblocks/pricing/ |

It is worth mentioning that performance can be influenced by many factors, such as operating systems (Windows, Linux, or WSL) and their versions, CPU models, GPU models and driver versions, CUDA framework versions, OpenMM versions, and additional features enabled in the runtime environment. It is very common to see different results between our benchmarks and those of others.

Benchmark Results

Here are five typical biochemical systems used to benchmark OpenMM 8.0, along with the corresponding test scripts:

| Model | Description | Test script |
| --- | --- | --- |
| 1: Dihydrofolate Reductase (DHFR), Explicit-PME | This is a 159 residue protein with 2489 atoms. The version used for explicit solvent simulations included 7023 TIP3P water molecules, giving a total of 23,558 atoms. All simulations used the AMBER99SB force field and a Langevin integrator. | python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=pme |
| 2: Apolipoprotein A1 (ApoA1), PME | This consists of 392 protein residues, 160 POPC lipids, and 21,458 water molecules, for a total of 92,224 atoms. All simulations used the AMBER14 force field. | python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=apoa1pme |
| 3: Cellulose, PME | It consists of a set of cellulose molecules (91,044 atoms) solvated with 105,855 water molecules, for a total of 408,609 atoms. | python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=amber20-cellulose |
| 4: Satellite Tobacco Mosaic Virus (STMV), PME | It consists of 8820 protein residues, 949 RNA bases, 300,053 water molecules, and 649 sodium ions, for a total of 1,067,095 atoms. | python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=amber20-stmv |
| 5: AMOEBA DHFR, PME | Full mutual polarization was used, with induced dipoles iterated until they converged to a tolerance of 1e-5. | python benchmark.py --platform=CUDA (or CPU) --seconds=60 --test=amoebapme |

Model 1: Dihydrofolate Reductase (DHFR), Explicit-PME
Model 2: Apolipoprotein A1 (ApoA1), PME
Model 3: Cellulose, PME
Model 4: Satellite Tobacco Mosaic Virus (STMV), PME
Model 5: AMOEBA DHFR, PME

Observations from the OpenMM GPU benchmarks

Here are some interesting observations from the OpenMM GPU benchmarks:

The VRAM usage for all simulations is only 1-2 GB, which means nearly all platforms (CPU-only or GPU) and all GPU types can theoretically be utilized to run these models.

For all models, the simulation speed of GPUs is significantly higher than that of CPUs, ranging from nearly hundreds of times faster in Model 1 to more than tens of thousands of times faster in Model 5.

In general, high-end GPUs outperform low-end GPUs in terms of simulation speed. However, the flagship model of a given GPU family often surpasses the low-end models of the next family.

As models become more complex with additional molecules and atoms, the performance differences between low-end

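For the OpenMM benchmark above, the five test scripts differ only in the `--test` value and the platform. A small sketch like the one below can generate them consistently. The dictionary and function names are hypothetical, and the platform is restricted to the two values used in this article.

```python
import shlex

# Test identifiers as listed with each model in the benchmark results.
OPENMM_TESTS = {
    "DHFR, Explicit-PME": "pme",
    "ApoA1, PME": "apoa1pme",
    "Cellulose, PME": "amber20-cellulose",
    "STMV, PME": "amber20-stmv",
    "AMOEBA DHFR, PME": "amoebapme",
}


def benchmark_command(test, platform="CUDA", seconds=60):
    """Build the argument list for one run of the OpenMM benchmark script."""
    if platform not in ("CUDA", "CPU"):
        raise ValueError("this sketch only covers the CUDA and CPU platforms")
    return [
        "python", "benchmark.py",
        f"--platform={platform}",
        f"--seconds={seconds}",
        f"--test={test}",
    ]


for model, test in OPENMM_TESTS.items():
    print(f"{model}: {shlex.join(benchmark_command(test))}")
```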
AI Transcription Benchmark: 1 Million Hours of YouTube Videos with Parakeet TDT 1.1B for Just $1260, a 1000-fold cost reduction

AI transcription - Parakeet TDT 1.1B batch transcription compared against APIs

Building upon the inference benchmark of Parakeet TDT 1.1B on SaladCloud and with our ongoing efforts to enhance the system architecture and implementation for batch jobs, we have achieved a 1000-fold cost reduction for AI transcription with SaladCloud. This cost performance comes while maintaining the same level of accuracy as other managed transcription services.

YouTube is the world’s most widely used video-sharing platform, featuring a wealth of public content, including talks, news, courses, and more. There might be instances where you need to quickly understand updates on a global event or summarize a topic, but cannot watch the videos individually. In addition, the millions of YouTube videos are a gold mine of training data for many AI applications. Many companies need to run large-scale AI transcription in batch today, but cost is a prohibitive factor.

In this deep dive, we will use publicly available YouTube videos as datasets together with the high-speed ASR (Automatic Speech Recognition) model Parakeet TDT 1.1B, and explore methods for constructing a batch-processing system for large-scale AI transcription of videos, using the substantial computational power of SaladCloud’s massive network of consumer GPUs across a global, high-speed distributed network.

How to download YouTube videos for batch AI transcription

The Python library pytube is a lightweight tool for handling YouTube videos that can simplify our tasks significantly. First, pytube offers APIs for interacting with YouTube playlists, which are collections of videos usually organized around specific themes. Using these APIs, we can retrieve all the video URLs within a specific playlist. Second, prior to downloading a video, we can access its metadata, including details such as the title, video resolution, frames per second (fps), video codec, audio bit rate (abr), and audio codec.
If a YouTube video offers an audio-only stream, we can improve efficiency by downloading only its audio. This approach not only reduces bandwidth requirements but also saves substantial time, given that a video is typically ten times larger than its corresponding audio.

The audio files downloaded from YouTube primarily use the MPEG-4 audio (Mp4a) file format, commonly employed for streaming large audio tracks. We can convert these audio files from MP4A to MP3, a format universally accepted by ASR models.

Additionally, the duration of audio files sourced from YouTube varies considerably, ranging from a few minutes to tens of hours. To make use of a wide range of cost-effective GPU types and to optimize GPU resource utilization, it is essential to segment all lengthy audio into fixed-length clips before inputting them into the model. The results can then be aggregated before returning the final transcription.

Advanced system architecture for massive video transcription

We can reuse our existing system architecture for audio transcription with a few enhancements:

In a long-running batch-job system, implementing auto scaling becomes crucial. By continuously monitoring the job count in the message queue, we can dynamically adjust the number of Salad nodes or groups. This adaptive approach allows us to respond effectively to variations in system load, providing the flexibility to manage costs during lower demand periods or enhance throughput during peak loads.

Enhanced node implementation for both video and audio AI transcription

Modifications have been made to the node implementation, enabling it to handle both video and audio for AI transcription. The inference process remains unchanged, running on a single thread and dedicated to GPU-based transcription.
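The auto scaling idea from the architecture section, sizing the node count from the number of queued jobs, can be reduced to a small decision function. The sketch below is only an illustration: the function name, the jobs-per-replica target, and the replica limits are hypothetical. Reading the queue depth and applying the new replica count are left to whatever queue and SaladCloud management tooling is in use.

```python
import math


def desired_replicas(queued_jobs, jobs_per_replica, min_replicas=0, max_replicas=100):
    """Pick a replica count proportional to the queue backlog, clamped to a range."""
    if jobs_per_replica <= 0:
        raise ValueError("jobs_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    needed = math.ceil(queued_jobs / jobs_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical backlog sizes: scale down when idle, cap at the group limit under load.
for backlog in (0, 1050, 100000):
    print(backlog, "queued jobs ->", desired_replicas(backlog, jobs_per_replica=50), "replicas")
```

Clamping to a maximum keeps costs bounded during spikes, while a non-zero minimum can be used when some capacity should always stay warm.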
We have introduced additional features in the benchmark worker process, which handles I/O and CPU-bound tasks and runs multiple threads. Running two processes to segregate GPU-bound tasks from I/O and CPU-bound tasks provides the flexibility to update each component independently. Introducing multiple threads in the benchmark worker process to handle different tasks eliminates waiting periods by fetching and preparing the next audio clips in advance while the current one is still being transcribed. Consequently, as soon as one audio clip is completed, the next is immediately ready for transcription. This approach reduces the overall processing time, increases system throughput, and results in more significant cost savings.

Massive YouTube video transcription tests on SaladCloud

We created a container group with 100 replicas (2 vCPUs and 12 GB RAM with 20+ different GPU types) in SaladCloud. The group was operational for approximately 10 hours, from 10:00 pm to 8:00 am PST during weekdays, successfully downloading and transcribing a total of 68,393 YouTube videos. The cumulative length of these videos amounted to 66,786 hours, with an average duration of 3,515 seconds.

Hundreds of Salad nodes from networks worldwide actively engaged in the tasks. They are all positioned in high-speed networks near the edges of the YouTube Global CDN (with an average latency of 33 ms). This setup provides local access and supports high system throughput for downloading content from YouTube.

According to the AWS DynamoDB metrics, specifically writes per second, which serve as a monitoring tool for transcription jobs, the system reached its maximum capacity, processing approximately 2 videos (totaling 7500 seconds) per second, roughly one hour after the container group was launched. The selected YouTube videos for this test vary widely in length, ranging from a few minutes to over 10 hours, causing notable fluctuations in the processing curve.
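The prefetching behaviour of the worker described above, preparing the next clips while the current one is transcribed, can be sketched with a bounded queue between a producer thread and the transcription loop. This is a simplified single-process illustration, not the actual node implementation: the 30-second clip length, the function names, and the stand-in `transcribe` callable are assumptions, and the real system keeps inference in a separate process.

```python
import queue
import threading


def split_into_clips(duration_s, clip_s=30.0):
    """Return (start, end) boundaries covering the audio in fixed-length clips."""
    if duration_s <= 0 or clip_s <= 0:
        raise ValueError("durations must be positive")
    clips, start = [], 0.0
    while start < duration_s:
        end = min(start + clip_s, duration_s)
        clips.append((start, end))
        start = end
    return clips


def run_pipeline(durations, transcribe, clip_s=30.0, prefetch=4):
    """Prepare clips in a background thread while the caller's thread transcribes."""
    ready = queue.Queue(maxsize=prefetch)  # bounded buffer of prepared clips
    done = object()                        # sentinel marking the end of work

    def producer():
        # Stands in for downloading, converting, and segmenting each audio file.
        for job_id, duration in enumerate(durations):
            for clip in split_into_clips(duration, clip_s):
                ready.put((job_id, clip))  # blocks while the buffer is full
        ready.put(done)

    threading.Thread(target=producer, daemon=True).start()

    parts = {}
    while True:
        item = ready.get()
        if item is done:
            break
        job_id, clip = item
        parts.setdefault(job_id, []).append(transcribe(clip))
    # Merge per-clip texts back into one transcript per input.
    return {job_id: " ".join(texts) for job_id, texts in parts.items()}


# Stand-in "model" that just labels each clip with its time range.
print(run_pipeline([75, 20], lambda clip: f"[{clip[0]:.0f}-{clip[1]:.0f}s]"))
```

The bounded queue is the key design choice: it lets preparation run ahead of transcription without letting prepared clips pile up in memory when downloads are faster than inference.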
Let’s compare the results of the two benchmark tests conducted on Parakeet TDT 1.1B for audio and video:

| | Parakeet Audio | Parakeet Video |
| --- | --- | --- |
| Datasets | English CommonVoice and Spoken Wikipedia Corpus | English YouTube videos, including public talks, news and courses |
| Average Input Length (s) | 12 | 3515 |
| Cost on SaladCloud (GPU Resource Pool and Global Distribution Network) | Around $100; 100 replicas (2 vCPUs, 12GB RAM, 20+ GPU types) for 10 hours | Around $100; 100 replicas (2 vCPUs, 12GB RAM, 20+ GPU types) for 10 hours |
| Cost on AWS and Cloudflare (Job Queue/Recording System and Cloud Storage) | Around $20 | Around $2 |
| Node Implementation | 3 downloader threads; Segmentation of long audio; Merging texts. | Download audio from YouTube playlists and videos; 3 downloader threads; Segmentation of long audio; Format conversion from Mp4a to MP3; Merging texts. |
| Number of Transcription | 5,209,130 | 68,393 |

Total Input Length