SaladCloud Blog

Cost-effective Stable Diffusion fine tuning on Salad


Stable Diffusion XL (SDXL) fine tuning as a service

I recently wrote a blog about fine tuning Stable Diffusion XL (SDXL) on interruptible GPUs at low cost, starring my dog Timber. The strong results and exceptional cost performance got me wondering: what would it take to turn that into a fully managed Stable Diffusion training platform? I built one on SaladCloud to find out. The result? Costs as low as $0.00016179 per training step while successfully completing 1000/1000 training jobs.

Challenges in developing a Stable Diffusion API

There are a number of challenges involved in developing and deploying what is essentially a Stable Diffusion XL training API on a distributed cloud like Salad. Salad is a distributed cloud with more than a million individual PCs around the world connected to our network, and the GPUs on Salad are idle consumer Nvidia RTX/GTX-series cards. Our goal is for the service to be resilient at any kind of scale.

Architecture

To handle node interruptions and concurrent training, we built a simple orchestration API, with training compute handled by GPU worker nodes. Additionally, we set up a simple autoscaler using a scheduled Cloudflare Worker. Except for the pool of training nodes, the entire platform uses Cloudflare serverless services. Heavily leveraging serverless technologies for the platform layer greatly reduces operational labor, makes the platform nearly free at rest, and will comfortably scale to handle significantly more load. Given sufficient continuous load, serverless applications do tend to be more expensive than alternatives, so feel free to swap out components as desired. This design doesn’t rely on any provider-specific features, so any SQL database and any key-value store would work just as well.

API components

GPU worker node components

Distributing work

To get work, worker nodes make a GET request to the API, including their machine ID as a query parameter. The API prioritizes handing out jobs that are in the running state but have stalled, as measured by a heartbeat timeout. It will also never hand a job out to a node where that job has previously failed.

Marking a job failed

Handling bad nodes

If a particular node has failed too many jobs, we want to reallocate it. Our first implementation did not take this into account, and one bad node marked 85% of the benchmark failed, just pulling and failing one job after another. We now run a scheduled Cloudflare Worker every 5 minutes to handle reallocating any nodes with more than the allowed number of failures.

Autoscaling the worker pool

Our scheduled Cloudflare Worker also handles scaling the worker cluster. It essentially attempts to keep the number of replicas equal to the number of running and pending jobs, with configurable limits.

Observing a training run

The training script we used from diffusers has a built-in integration with Weights and Biases, a popular ML/AI training dashboard platform. It lets you qualitatively observe the training progress, tracks your training arguments, monitors system usage, and more.

Deployment on Salad

Deploying on Salad is simple. The worker pattern means we don’t need to enable inbound networking or configure any probes. The only environment configuration needed is a URL for the orchestration API, a key for the orchestration API, and an API key for Weights and Biases (optional).

Seeding the benchmark

To get a baseline idea of performance, we ran 1000 identical training jobs, each 1400 steps, with text encoder training. We skipped reporting samples to Weights and Biases for this benchmark.
We let the autoscaler run between 0 and 40 nodes, each with 2 vCPU, 16GB RAM, and an RTX 4090 GPU.

Visualizing a training run

Here’s an example training job that got interrupted twice, and was able to resume and complete training on a different node each time. The smaller marks are heartbeat events emitted by the worker every 30s, color coded by machine ID. We can see for this run that it sat in the queue for 5.4 hours before a worker picked it up, and ran for 54:00 of billable time, calculated as the number of heartbeats * 30s. Plugging that into the Pricing Calculator, we see a cost of $0.324/hour, so a total cost of $0.2916 to train the model and the text encoder for 1400 steps. This comes out to $0.000208/step.

The amount of time taken, and therefore the cost, varies greatly based on the parameters you use for training. Training the text encoder slows down training. Using prior preservation also slows down training. More steps take longer. It’s interesting to note that although the run was interrupted multiple times, these interruptions cost less than 4 minutes of clock time, and the run still finished in the median amount of time.

Results from the Stable Diffusion XL fine tuning

Tips and Observations

Future Improvements

Conclusions

Our exploration into fine-tuning Stable Diffusion XL on interruptible GPUs has demonstrated the feasibility and efficiency of our approach, despite the significant challenges posed by training interruptions, capacity limitations, and cost management. Leveraging Cloudflare’s serverless technologies alongside our custom orchestration and autoscaling solutions, we’ve created a resilient and manageable system capable of handling large-scale operations with notable cost efficiency and operational simplicity. The successes of our deployment, underscored by the seamless completion of 1000/1000 benchmark jobs, highlight the system’s robustness and the potential for further improvements. Future enhancements, such as asynchronous validation and refined node performance assessments, promise to elevate the performance and cost-effectiveness of our service.

Given the extensive amount of experimentation required to get good results, a platform like this can be useful for individuals as well as those seeking to build commercial offerings. Once deployed, a person could submit many different combinations of parameters, prompts, and training data, and run many experiments in parallel.

Resources

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
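To make the work-distribution and autoscaling rules above concrete, here is a minimal sketch of the selection logic, shown in plain Python rather than the actual orchestration API or Cloudflare Worker. The job fields and heartbeat timeout are assumptions for illustration only.

```python
# Sketch of the job-selection rule: prefer "running" jobs whose heartbeat has
# gone stale, then "pending" jobs, and never hand a machine a job it already failed.
import time
from typing import Optional

HEARTBEAT_TIMEOUT_S = 120  # assumption: roughly four missed 30s heartbeats

def select_job(jobs: list[dict], machine_id: str, now: Optional[float] = None) -> Optional[dict]:
    now = time.time() if now is None else now
    stalled = [
        j for j in jobs
        if j["status"] == "running"
        and now - j["last_heartbeat"] > HEARTBEAT_TIMEOUT_S
        and machine_id not in j["failed_machine_ids"]
    ]
    if stalled:
        return stalled[0]
    pending = [
        j for j in jobs
        if j["status"] == "pending" and machine_id not in j["failed_machine_ids"]
    ]
    return pending[0] if pending else None

def desired_replicas(running: int, pending: int, min_nodes: int = 0, max_nodes: int = 40) -> int:
    # The autoscaler rule: keep replicas equal to running + pending jobs, within limits.
    return max(min_nodes, min(max_nodes, running + pending))
```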

Fine tuning Stable Diffusion XL (SDXL) with interruptible GPUs and LoRA for low cost


It’s no secret that training image generation models like Stable Diffusion XL (SDXL) doesn’t come cheap. The original Stable Diffusion model cost $600,000 USD to train using hundreds of enterprise-grade A100 GPUs for more than 100,000 combined hours. Fast forward to today, and techniques like Parameter-Efficient Fine Tuning (PEFT) and Low-Rank Adaptation (LoRA) allow us to fine tune state-of-the-art image generation models like Stable Diffusion XL in minutes on a single consumer GPU. Using spot instances or community clouds like Salad reduces the cost even further. In this tutorial, we fine tune SDXL on custom images of Timber, my playful Siberian Husky.

Benefits of spot instances to fine tune SDXL

Spot instances allow cloud providers to sell unused capacity at lower prices, usually in an auction format where users bid on that capacity. Salad’s community cloud comprises tens of thousands of idle residential gaming PCs around the world. On AWS, using an Nvidia A10G GPU with the g5.xlarge instance type costs $1.006/hr for on-demand pricing, but as low as $0.5389/hr for “spot” pricing. On Salad, an RTX 4090 (24GB) GPU with 8 vCPUs and 16GB of RAM costs only $0.348/hr. In our tests, we were able to train a LoRA for Stable Diffusion XL in 13 minutes on an RTX 4090, at a cost of just $0.0754. These low costs open the door for increasingly customized and sophisticated AI image generation applications.

Challenges of fine tuning Stable Diffusion XL with spot instances

There is one major catch, though: both spot instances and community cloud instances have the potential to be interrupted without warning, potentially wasting expensive training time. Additionally, both are subject to supply constraints. If AWS is able to sell all of their GPU instances at on-demand pricing, there will be no spot instances available. Since Salad’s network is residential, and owned by individuals around the world, instances come on and offline throughout the day as people use their PCs. But with a few extra steps, we can take advantage of the huge cost savings of using interruptible GPUs for fine tuning.

Solutions to mitigate the impact of interrupted nodes

The #1 thing you can do to mitigate the impact of interrupted nodes is to periodically save checkpoints of the training progress to cloud storage, like Cloudflare R2 or AWS S3. This ensures that your training job can pick up where it left off in the event it gets terminated prematurely. This periodic checkpointing functionality is often offered out-of-the-box by frameworks such as 🤗 Accelerate, and simply needs to be enabled via launch arguments. For example, using the Dreambooth LoRA SDXL script with accelerate, as we did, you might end up with arguments like --max_train_steps=500 --checkpointing_steps=50. This indicates that we want to train for 500 steps, and save checkpoints every 50 steps. This ensures that, at most, we lose 49 steps of progress if a node gets interrupted. On an RTX 4090, that amounts to about 73 seconds of lost work. You may want to checkpoint more or less frequently than this, depending on how often your nodes get interrupted, storage costs, and other factors.

Once you’ve enabled checkpointing with these launch arguments, you need another process monitoring for the creation of these checkpoints, and automatically syncing them to your preferred cloud storage. We’ve provided an example Python script that does this by launching accelerate in one thread, and using another thread to monitor the filesystem with watchdog and push files to S3-compatible storage using boto3.
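Here is a minimal sketch of that watcher pattern using watchdog and boto3. The bucket name, output directory, and environment variable are assumptions; the full example script linked in the post handles the accelerate thread, retries, and webhooks.

```python
# Sketch: watch the training output directory and mirror new checkpoint files
# to S3-compatible storage. Assumes credentials are configured in the environment.
import os
import boto3
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

OUTPUT_DIR = "./output"       # where accelerate writes checkpoint-<step>/ directories
BUCKET = "sdxl-training"      # hypothetical bucket name

s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT_URL"))

class CheckpointUploader(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory:
            return
        # Simplification: a real script should wait until the file is fully written.
        key = os.path.relpath(event.src_path, OUTPUT_DIR)
        s3.upload_file(event.src_path, BUCKET, key)

if __name__ == "__main__":
    observer = Observer()
    observer.schedule(CheckpointUploader(), OUTPUT_DIR, recursive=True)
    observer.start()
    observer.join()  # in the real script, a second thread runs accelerate
```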
In our case, we used R2 instead of S3, because R2 does not charge egress fees.

Other considerations for SDXL fine tuning

The biggest callout here is to automate clean up of your old checkpoints from storage. Our example script saves a checkpoint every 10% progress, each of which is 66MB compressed. Even though the final LoRA we end up with is only 23MB, the total storage used during the process is 683MB. It’s easy to see how storage costs could get out of hand if this was neglected for long enough. Our example script fires a webhook at each checkpoint, and another at completion. We set up a Cloudflare Worker to receive these webhooks and clean up resources as needed.

Additionally, while the open source tools are powerful and relatively easy to use, they are still quite complex and the documentation is often very minimal. I relied on YouTube videos and reading the code to figure out the various options for the SDXL LoRA training script. However, these open source projects are improving at an increasingly quick pace as they see wider and wider adoption, so the documentation will likely improve. At the time of writing, the 🤗 Diffusers library had merged 47 pull requests from 26 authors, just in the last 7 days.

Conclusions

Modern training techniques and interruptible hardware combine to offer extremely cost effective fine tuning of Stable Diffusion XL. Open source training frameworks make the process approachable, although documentation could be improved. You can train a model of yourself, your pets, or any other subject in just a few minutes, at a cost of pennies. Training costs have plummeted over the last year, thanks in large part to the rapidly expanding open source AI community. The range of hardware capable of running these training tasks has greatly expanded as well. Many recent consumer GPUs are capable of training an SDXL LoRA model in well under an hour, with the fastest taking just over 10 minutes.

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
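As a companion to the clean-up advice above, here is a rough sketch of pruning superseded checkpoints from an S3-compatible bucket. Our actual clean-up runs in a Cloudflare Worker; Python and boto3 are shown only for illustration, and the bucket name and key layout are hypothetical.

```python
# Sketch: when a webhook reports a new checkpoint, delete older checkpoint
# objects for that job. Pagination is omitted for brevity.
import os
import boto3

s3 = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT_URL"))
BUCKET = "sdxl-training"  # hypothetical bucket name

def prune_old_checkpoints(job_id: str, latest_step: int) -> None:
    prefix = f"{job_id}/checkpoint-"
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    stale = [
        {"Key": obj["Key"]}
        for obj in resp.get("Contents", [])
        # keep only objects that belong to the newest checkpoint directory
        if f"checkpoint-{latest_step}/" not in obj["Key"]
    ]
    if stale:
        s3.delete_objects(Bucket=BUCKET, Delete={"Objects": stale})
```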

Tag 309K Images/$ with Recognize Anything Model++ (RAM++) On Consumer GPUs


What is the Recognize Anything Model++?

The Recognize Anything Model++ (RAM++) is a state-of-the-art image tagging foundation model released last year, with pre-trained model weights available on the Hugging Face Hub. It significantly outperforms other open models like CLIP and BLIP in both the scope of recognized categories and accuracy. But how much does it cost to run RAM++ on consumer GPUs? In this benchmark, we tag 144,485 images from the COCO 2017 and AVA image datasets, evaluating inference speed and cost-performance. The evaluation was done across 167 nodes on SaladCloud representing 19 different consumer GPU classes. To do this, we created a container group targeting a capacity of 100 nodes, with the “Stable Diffusion Compatible” GPU class. All nodes were assigned 2 vCPU and 8GB RAM. Here’s what we found.

Up to 309k images tagged per dollar on RTX 2080

In keeping with a trend we often see here, the best cost-performance comes from the lower-end GPUs, the RTX 20- and 30-series cards. In general, we find that the smallest/cheapest GPU that can do the job you need is likely to have the best cost-performance, in terms of inferences per dollar. RAM++ is a fairly small, lightweight model (3GB), and achieved its best performance on the RTX 2080, with just over 309k inferences per dollar.

Average Inference Time Is <300ms Across All GPUs

We see relatively quick inference times across all GPU types, but we also see a pretty wide distribution of performance, even within a single GPU type. Zooming in, we can see this wide distribution is also present within a single node. Further, we see no significant correlation between inference time and number of tags generated.

GPU                   Correlation between inference time and number of tags
RTX 2080               0.04255
RTX 2080 SUPER        -0.02209
RTX 2080 Ti           -0.03439
RTX 3060               0.00074
RTX 3060 Ti            0.00455
RTX 3070               0.00138
RTX 3070 Laptop GPU   -0.00326
RTX 3070 Ti           -0.01494
RTX 3080              -0.00041
RTX 3080 Laptop GPU   -0.09197
RTX 3080 Ti            0.02748
RTX 3090              -0.00146
RTX 4060               0.03447
RTX 4060 Laptop GPU   -0.08151
RTX 4060 Ti            0.04153
RTX 4070               0.01393
RTX 4070 Laptop GPU   -0.05811
RTX 4070 Ti            0.00359
RTX 4080               0.02090
RTX 4090              -0.03002

Based on this, you should expect to see fairly wide variation in inference time in production regardless of your GPU selection or image properties.

Results from the Recognize Anything Model++ (RAM++) benchmark

Consumer GPUs offer a highly cost-effective solution for batch image tagging, coming in at 60x-300x the cost efficiency of managed services like Azure AI Computer Vision. The Recognize Anything paper and code repository offer guides to train and fine-tune this model on your own data, so even if you have unusual categories, you should consider RAM++ instead of commercially available managed services.

Resources

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
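For readers who want to reproduce the per-GPU correlations in the table above, a simple pandas aggregation is enough. The column names and CSV export below are assumptions, not the benchmark’s actual schema.

```python
# Sketch: Pearson correlation of inference time vs. number of tags, per GPU class.
import pandas as pd

df = pd.read_csv("ram_plus_results.csv")  # hypothetical export of the benchmark results

corr_by_gpu = (
    df.groupby("gpu_class")
      .apply(lambda g: g["inference_time_s"].corr(g["num_tags"]))  # Pearson r per GPU
      .sort_values()
)
print(corr_by_gpu)
```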

Segment Anything Model (SAM) Benchmark: 50K Images/$ on Consumer GPUs


What is the Segment Anything Model (SAM)?

The Segment Anything Model (SAM) is a foundational image segmentation model released by Meta AI Research last year, with pre-trained model weights available through the GitHub repository. It can be prompted with a point or a bounding box, and performs well on a variety of segmentation tasks. More importantly, it carries the permissive Apache 2.0 license, allowing commercial use. As companies deploy this model for use cases ranging from image labeling and background removal to inpainting and more, the cost of running SAM in production is a primary concern.

Benchmarking the Segment Anything Model (SAM) on Salad

In this benchmark, we do an unprompted full-image segmentation on 152,848 images from the COCO 2017 and AVA image datasets. We evaluate inference speed and cost-performance across 302 nodes on SaladCloud representing 22 different consumer GPU classes. To do this, we created a container group targeting a capacity of 100 nodes, with the “Stable Diffusion Compatible” GPU class. All nodes were assigned 2 vCPU and 8GB RAM. Here’s what we found.

50K+ images segmented per dollar on RTX 3060 Ti & RTX 3070 Ti

As is nearly always the case with smaller models, the best cost-performance comes from the lower-end GPUs, mostly the RTX 30-series cards. In this case, we see a significant bump in cost-performance on the Ti cards. This makes sense, since they are priced the same as their non-Ti counterparts but have more CUDA cores. The stand-out performers here are the RTX 3060 Ti and the RTX 3070 Ti, each offering at least 50k inferences per dollar.

Inference time is fairly consistent within a particular node

Zooming into performance within a single GPU class – the RTX 3070 Ti – we see that the bulk of inference times fall within a narrow range on any particular node, with some significant outliers. We do see some variability across different nodes, with one standing out as particularly bad. We often see a small amount of variability in performance across nodes on Salad, since each one is an individual residential gaming PC, with a variety of different CPUs, RAM speeds, motherboard configurations, etc. Our one outlier node (31b6, circled above) is indicative of something anomalous with that machine. We’re always working to get better at detecting these scenarios before your workloads get to a bad machine. But the best practice is to monitor the performance of your application, and terminate nodes that display anomalous behavior.

The range of inference time on one of our nodes (67acdb6b) may look concerning at first. But if we zoom in, we see those outlier times are exceedingly uncommon, with the vast majority of inferences clustered within a narrow range. And indeed, if we filter out the outliers, we see a much tighter grouping within each individual node. But we also start to see two distinct groupings of machines. It is a little concerning that some machines are 35-40% faster than others, so this gets sent to our engineering team for further investigation. The above cost-performance numbers include all these outliers and variability, so I suspect that it is possible to beat those numbers.

Results from the Segment Anything Model (SAM) benchmark

The RTX 3060 Ti and RTX 3070 Ti running the Segment Anything Model (SAM) offer a highly cost-effective solution for batch image segmentation, coming in at 50x the cost efficiency of managed services like Azure AI Computer Vision.
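For reference, unprompted full-image segmentation with SAM looks roughly like the sketch below, using Meta’s segment-anything package. The checkpoint file and ViT-H model size are common public defaults, not necessarily the exact benchmark configuration.

```python
# Sketch: automatic (unprompted) mask generation over an entire image.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: segmentation, area, bbox, ...
print(len(masks), "masks found")
```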
Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.

Stable Diffusion v1.5 Benchmark On Consumer GPUs


Benchmarking Stable Diffusion v1.5 across 23 consumer GPUs

What’s the best way to run inference at scale for Stable Diffusion? It depends on many factors. In this Stable Diffusion (SD) benchmark, we used SD v1.5 with a controlnet to generate over 460,000 fancy QR codes. The benchmark was run across 23 different consumer GPUs on SaladCloud. Here, we share some of the key learnings for serving Stable Diffusion inference at scale on consumer GPUs.

The Evaluation

For each GPU type, we compared 4 different backends, 3 batch sizes (1, 2, 4), and 2 resolutions (512×512, 768×768), generating images at 15 steps and at 50 steps. Our time measurements include the time taken to generate the image and return it to another process running on localhost. But we do not include the time taken to generate the base QR code, upload images, or fetch new work from the queue. We recommend handling these tasks asynchronously in order to maximize GPU utilization. Our cost numbers are derived from the Salad Pricing Calculator, using 2 vCPU and 12 GB of RAM. Costs do not include storage, data transfer, queueing, database, etc. However, these things only cost $2 total for the entire project. We used DreamShaper 8 along with the QR Code Monster controlnet to generate the images, with the Euler Ancestral scheduler/sampler.

Cold Start Time

We also evaluated cold start time for the various backends, which measures the time from when a container starts to when it is ready to serve inference. However, it does not include the time required to download the image to the host. For each backend, we chose the average cold start time from the GPU on which it had the best cold start time. For the stable-fast backend, with the models included in the container, the RTX 4090 has the best average cold start time, while the GTX 1660 has the worst. The empty spot for GTX 1660 Super indicates that no nodes successfully started.

Architecture

We used our standard batch processing architecture that we’ve used for many other benchmarks.

The Backends

stable-fast-qr-code

This is the only custom backend we used for this benchmark. It uses 🤗 Diffusers with stable-fast. You’ll see in the results that it performed extremely well, almost always taking the top spot for performance and cost-performance. However, there are important caveats to consider before choosing to deploy this or any other custom backend. stable-fast adds a compilation step on start, which can add several minutes to startup compared to the other backends. Additionally, it achieves the best performance by locking the image size at start. For many image generation use cases, dynamic sizing is too important, so this would not be feasible. For other use cases, such as this one where we bulk generated fancy QR codes, it’s ideal. Other build-vs-buy factors should also be taken into consideration.

Automatic1111

While designed and built as a user interface for running Stable Diffusion on your own PC, Automatic1111 is also a very popular inference backend for many commercial SD-powered applications. It boasts wide model and workflow compatibility, is very extensible, and shows strong performance in most categories.

ComfyUI

ComfyUI is another popular user interface for Stable Diffusion, but with a node-and-link based interface that mimics the underlying components of a workflow. It is the most customizable of the backends, and it has some caching features that are beneficial when not all parameters change between generations.
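To illustrate the model combination described in the evaluation above (DreamShaper 8, the QR Code Monster controlnet, and the Euler Ancestral sampler), here is a minimal 🤗 Diffusers sketch. The Hugging Face repo IDs, prompt, and conditioning scale are reasonable public defaults and assumptions, not the exact benchmark configuration, which ran through the backends listed above.

```python
# Sketch: SD 1.5 + ControlNet QR code generation with Euler Ancestral sampling.
import torch
from diffusers import (
    ControlNetModel,
    EulerAncestralDiscreteScheduler,
    StableDiffusionControlNetPipeline,
)
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "Lykon/dreamshaper-8", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

qr = load_image("qr_code.png")  # pre-generated base QR code (created asynchronously)
image = pipe(
    prompt="a cozy salad garden, highly detailed",  # hypothetical prompt
    image=qr,
    num_inference_steps=15,
    controlnet_conditioning_scale=1.3,
    height=768,
    width=768,
).images[0]
image.save("fancy_qr.png")
```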
Stable Diffusion v1.5 Benchmark: Results

Stable Fast is the clear winner here, both in terms of speed and cost. However, while the performance is impressive, building and maintaining a custom backend comes with a lot of additional challenges vs using one of the highly flexible, community-maintained options. In particular, if you’ve already built your solution using one of these off-the-shelf options, you likely do not want to refactor your entire codebase around a new backend. We’ve included some results that exclude Stable Fast for those of you in this situation.

Best Inference Time (15 Steps)

With an impressive 27.3 steps/second, Stable Fast achieved outstanding performance on the RTX 4090, generating batches of 4 512×512 images.

Best Inference Time (50 Steps)

With a 50-step generation, Stable Fast performed even better, achieving 37.6 steps per second on batches of 4 512×512 images.

Best Cost Performance – 15 Steps

This measures performance for a given combination of backend and GPU on all 15-step image generation tasks. This includes all batch sizes and image sizes.

Best Cost Performance – 50 Steps

This measures performance for a given combination of backend and GPU on all 50-step image generation tasks. This includes all batch sizes and image sizes.

Best Inference Time in Each Task – 15 Steps

This measures average inference time at each resolution and batch size, with 15 steps.

Best Cost Performance in Each Task – 15 Steps

While Stable Fast offered the best overall performance and the best overall cost performance, it was not the absolute best in all tasks for 15-step generations, sharing that honor with Automatic1111. It’s worth noting that A1111 achieved its impressive cost-performance results on much lower-end hardware, which may be significantly easier to source.

Best Inference Time in Each Task – 50 Steps

This measures average inference time at each resolution and batch size, with 50 steps.

Best Cost Performance in Each Task – 50 Steps

Stable Fast absolutely dominated the 50-step generation tasks, taking a comfortable first place in all categories.

Best Cost Performance in Each Task (no stable-fast) – 15 Steps

Here we pull Stable Fast out of the results to compare the rest.

Best Cost Performance in Each Task (no stable-fast) – 50 Steps

A1111 – Best Inference Time by GPU

A1111 – Best Cost Performance by GPU

This measures the cost performance of Automatic1111 across all image generation tasks, for each GPU.

SD.Next – Best Inference Time by GPU

SD.Next – Best Cost Performance by GPU

This measures the cost performance of SD.Next across all image generation tasks, for each GPU.

ComfyUI – Best Inference Time by GPU

Comparing Price-Performance of 22 GPUs for AI Image Tagging (GTX vs RTX)


Older Consumer GPUs: A Perfect Fit for AI Image Tagging

In the current AI boom, there’s a palpable excitement around sophisticated image generation models like Stable Diffusion XL (SDXL) and the cutting-edge GPUs that power them. These models often require more powerful GPUs with larger amounts of vRAM. However, while the industry is abuzz with these advancements, we shouldn’t overlook the potential of older GPUs, especially for tasks like image tagging and search embedding generation. These processes, employed by image generation platforms like Civit.ai and Midjourney, play a crucial role in enhancing search capabilities and overall user experience. We leveraged Salad’s distributed GPU cloud to evaluate the cost-performance of this task across a wide range of hardware configurations.

What is AI Image Tagging?

AI image tagging is a technology that can automatically identify and label the content of images, such as objects, people, places, colors, and more. This helps users to organize, search, and discover their images more easily and efficiently, and it can be used for a wide variety of purposes and applications.

Benchmarking 22 Consumer-Grade GPUs for AI Image Tagging

In designing the benchmark, our primary objective was to ensure a comprehensive and unbiased evaluation. We selected a range of GPUs on SaladCloud, starting from the GTX 1050 and extending up to the RTX 4090, to capture a broad spectrum of performance capabilities. Each node in our setup was equipped with 16 vCPUs and 7 GB of RAM, ensuring a standardized environment for all tests. For the datasets, we chose two prominent collections from Kaggle: the AVA Aesthetic Visual Assessment and the COCO 2017 Dataset. These datasets offer a mix of aesthetic visuals and diverse object categories, providing a robust testbed for our image tagging and search embedding generation tasks.

We used ConvNextV2 Tagger V2 to generate tags and ratings for images, and CLIP to generate embedding vectors. The tagger model used the ONNX runtime, while CLIP used Transformers with PyTorch. ONNX’s GPU capabilities are not a great fit for Salad, because of inconsistent Nvidia driver versions across the network, so we chose to go with the CPU runtime and to allocate 16 vCPUs for each node. PyTorch with Transformers works quite well across a large range of GPUs and driver versions with no additional configuration, so CLIP was run on GPU.

Benchmark Results: GTX 1650 is the Surprising Winner

As expected, our nodes with higher-end GPUs took less time per image, with the flagship RTX 4090 offering the best performance. What is interesting, though, is that the median time per image is actually very similar for the GTX 1650 and the RTX 4090: 1 second. The best-case and worst-case performance of the 4090 is notably better. Weighting our findings by cost, we can confirm our intuition that the 1650 is a much better value at $0.02/hr than is the 4090 at $0.30/hr. While older GPUs like the GTX 1650 have worse absolute performance compared to the RTX 4090, the great difference in price makes the older GPUs the best value, as long as your use case can withstand the additional latency. In fact, we see all GTX-series GPUs outperforming all RTX GPUs in terms of images tagged per dollar.
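For context, the GPU-side embedding step looks roughly like the sketch below, using Transformers with PyTorch as described above. The specific CLIP checkpoint we used isn’t stated here, so the openai/clip-vit-large-patch14 repo ID is an assumption.

```python
# Sketch: generate a CLIP image embedding for search, on GPU when available.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt").to(device)
with torch.no_grad():
    embedding = model.get_image_features(**inputs)  # shape: (1, 768) for ViT-L/14
print(embedding.shape)
```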
GTX Series: The Cost-Effective Option for AI Image Tagging with 3x More Images Tagged per Dollar

In the ever-advancing realm of AI and GPU technology, the allure of the latest hardware often overshadows the nuanced decisions that drive optimal performance. Our analysis not only emphasizes the balance between raw performance and cost-effectiveness but also resonates with broader cloud best practices. Just as it’s pivotal not to oversubscribe to compute resources in cloud environments, it’s equally essential to avoid overcommitting to high-end GPUs when more cost-effective options can meet the requirements. The GTX 1650’s value proposition, especially for tasks with flexible latency needs, serves as a testament to this principle, delivering 3x as many images tagged per dollar as the RTX 4090. As we navigate the expanding AI applications landscape, making judicious hardware choices based on comprehensive real-world benchmarks becomes paramount. It’s a reminder that the goal isn’t always about harnessing the most powerful tools, but rather the most appropriate ones for the task and budget at hand.

Run Your Image Tagging on Salad’s Distributed Cloud

If you are running AI image tagging or any AI inference at scale, Salad’s distributed cloud has 10,000+ GPUs at the lowest price in the market. Sign up for a demo with our team to discuss your specific use case.

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.

Bark Benchmark: Reading 144K Recipes with Text-to-Speech on SaladCloud


Speech Synthesis with suno-ai/bark

When you think of speech synthesis, you might think of a very robotic sounding voice, like this one from 1979. Maybe you think of more modern voice assistants, like Siri or the Google Assistant. While these are certainly improvements over what we had in the 1970s, they still wouldn’t be mistaken for recordings of actual humans. Enter Bark text-to-speech, a generative AI model like Stable Diffusion or ChatGPT, developed by Suno AI. Like these other generative models, Bark takes a text prompt and creates something new. However, it doesn’t produce images, or more text. From their GitHub page: “Bark can generate highly realistic, multilingual speech as well as other audio – including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying.”

This is a fundamental departure from previous generations of speech synthesis. Bark does not try to break down text into phonemes for recreation by a recorded voice. Rather, it “predicts” what an audio recording might be like, based on the text it’s given. The result is much more natural sounding speech and other conversational sounds. Bark is also an important generative AI model because it is freely available for commercial use, and can run on very modest hardware, including consumer GPUs with minimal vRAM. We set out to benchmark Bark across a range of consumer hardware configurations, using Salad’s GPU Cloud.

Benchmarking the Bark text-to-speech model on Consumer GPUs

You know we like to keep things food related here at Salad, so we selected this Food.com Recipe Dataset from Kaggle, a collection of a couple hundred thousand recipes, along with reviews of those recipes. We’re going to have Bark read these recipes out for us. If you’d like to follow along, we’ll be working with Python 3.10 throughout this project. Unlike some of our other benchmarks, our goal here is not to demonstrate that Salad is the most cost-effective platform for AI inference. Rather, we want to leverage some unique capabilities of Salad’s distributed cloud to evaluate Bark’s performance across a wide range of consumer GPUs. And, if I’m being totally honest, I just thought this would be a fun project. You can skip straight to the outputs if that’s what you’re here for.

Architecture

We’ll be using our standard batch processing framework for this, the same one we’ve used for many other benchmarks, including Whisper Large and SDXL.

Data Preparation

First, we need to download our dataset. Kaggle is free, but does require an account. Once you have an account, you’ll need to grab your API token from your account settings. Clicking the “Create New Token” button will initiate a download of a file called kaggle.json. Place the file in your home directory at ~/.kaggle/kaggle.json. This will allow you to make authenticated requests with the Kaggle CLI. Now we have a folder called food-com-recipes-and-user-interactions containing the dataset files.

Our first step is to load up our recipes and interactions in a pandas DataFrame; this step may take several minutes. Let’s take a peek and see what we’re working with. OK, so we have 231,637 recipes, with fields like “id”, “name”, “description”, and “steps”. There are some other fields as well, but we won’t be using them for this project. Let’s check out our review data. In our review data, we have 1,132,367 reviews, each of which has a “recipe_id” and a “rating”.
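A minimal sketch of that download-and-load step is below. The Kaggle dataset slug and CSV file names follow the public Food.com dataset layout, but treat them as assumptions rather than a guarantee of the exact files used here.

```python
# Sketch: pull the dataset with the Kaggle CLI, then load it with pandas.
import subprocess
import pandas as pd

subprocess.run(
    ["kaggle", "datasets", "download",
     "-d", "shuyangli94/food-com-recipes-and-user-interactions",  # assumed slug
     "-p", "food-com-recipes-and-user-interactions", "--unzip"],
    check=True,
)

recipes = pd.read_csv("food-com-recipes-and-user-interactions/RAW_recipes.csv")
interactions = pd.read_csv("food-com-recipes-and-user-interactions/RAW_interactions.csv")
print(len(recipes), "recipes,", len(interactions), "reviews")
```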
Let’s see our top recipes by average review. Interestingly, we see a lot of recipes with an average rating of 0.0. Maybe we should filter this down to only recipes with “good” reviews, over 4.5. OK, now we’ve got 144,177 recipes that have received an average rating of at least 4.5. Now we can merge this table into the recipe table, and get a collection of recipe data, but only for recipes with a rating of at least 4.5. One thing to note here is that although steps looks like a list of strings, it is in fact just a string. Since our goal is to write a “script” for Bark to read, we’re going to want these strings parsed into lists. We’re going to use the ast module to safely evaluate these strings into Python lists.

OK, now we need to turn this data into a “script”: something that will sound a little more natural when Bark reads it. I’ll admit, I was tempted to use a Large Language Model (LLM) like Llama 2 for this, and the results would have likely been better and more natural sounding. However, for the sake of expediency, I’m just going to use a simple Python function to stitch each row into a script. Let’s test it on our first row. This will be good enough for this project. We can see there are some typos in the original data, and it’ll be interesting to see how Bark handles those.

However, we have a new problem now, which is that Bark works best with about 13 seconds of spoken text. Our script is quite a bit longer than that, so we’re going to have to chop it up into smaller chunks. According to a quick Google search, the average speaking rate is 2.5 words per second, which would translate to a maximum of 32.5 words that Bark will happily do in one clip. Let’s round that down to 30, just to be safe. However, we don’t just want to split the script every 30 words. Ideally, we would only include whole sentences in each segment, so that Bark can do a better job with tone and cadence. There are Natural Language Processing (NLP) techniques to do this with greater accuracy, but again, for expediency, we’re going to do this the simple way. Let’s see how that works. OK, that’s pretty good. Let’s move forward with this solution. Bark includes a large number of voice presets, but since our data is all English, we’re going to use just the English-language voices. There are 10 of those, numbered 0-9.
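Here is a minimal sketch of the script-building and chunking approach described above: stitch the recipe fields into text, then split it into roughly 30-word chunks on sentence boundaries. The exact wording of the real script differs; the example row is hypothetical.

```python
# Sketch: turn a recipe row into a Bark "script" and chunk it for ~13s clips.
import ast

MAX_WORDS = 30  # ~2.5 words/second * 13 seconds, rounded down

def build_script(row: dict) -> str:
    steps = ast.literal_eval(row["steps"])  # the steps column is a stringified list
    lines = [f"Here is the recipe for {row['name']}.", row["description"]]
    lines += [f"Step {i + 1}: {step}." for i, step in enumerate(steps)]
    return " ".join(lines)

def chunk_script(script: str, max_words: int = MAX_WORDS) -> list[str]:
    chunks, current = [], []
    for sentence in script.split(". "):
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sentence)
    if current:
        chunks.append(". ".join(current))
    return chunks

example = {
    "name": "garlic butter pasta",
    "description": "a quick weeknight dinner",
    "steps": "['boil the pasta', 'melt the butter with garlic', 'toss and serve']",
}
for chunk in chunk_script(build_script(example)):
    print(chunk)
```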

The AI GPU Shortage: How Gaming PCs Offer a Solution and a Challenge


Reliability in Times of AI GPU Shortage

In the world of cloud computing, leading providers have traditionally utilized expansive, state-of-the-art data centers to ensure top-tier reliability. These data centers, boasting redundant power supplies, cooling systems, and vast network infrastructures, often promise uptime figures ranging from 99.9% to 99.9999% – terms you might have heard as “Three Nines” to “Six Nines.” For those who have engaged with prominent cloud providers, these figures are seen as a gold standard of reliability. However, the cloud computing horizon is expanding. Harnessing the untapped potential of idle gaming PCs is not only a revolutionary departure from conventional models but also a timely response to the massive compute demands of burgeoning AI businesses. The “AI GPU shortage” is everywhere today as GPU-hungry businesses fight for affordable, scalable computational power. Leveraging gaming PCs, which are often equipped with high-performance GPUs, provides an innovative solution to meet these growing demands. While this fresh approach offers unparalleled GPU inference rates and wider accessibility, it also presents a unique set of reliability factors to consider.

The decentralized nature of a system built on individual gaming PCs does introduce variability. A single gaming PC might typically offer reliability figures between 90-95% (1 to 1.5 nines). At first glance, this might seem significantly different from the high “nines” many are familiar with. However, it’s crucial to recognize that we’re comparing two different models. While an individual gaming PC might occasionally face challenges, from software issues to local power outages, the collective strength of the distributed system ensures redundancy and robustness on a larger scale. When exploring our cloud solutions, it’s essential to view reliability from a broader perspective. Instead of concentrating solely on the performance of individual nodes, we highlight the overall resilience of our distributed system. This approach offers a deeper insight into our next-generation cloud infrastructure, blending cost-efficiency with reliability in a transformative way, perfectly suited for the computational needs of modern AI-driven businesses and to solve the ongoing AI GPU shortage.

Exploring the New Cloud Landscape

Embracing Distributed Systems

Unlike traditional centralized systems, distributed clouds, particularly those harnessing the power of gaming PCs, operate on a unique paradigm. Each node in this setup is a personal computer, potentially scattered across the globe, rather than being clustered in a singular data center.

Navigating Reliability Differences

Nodes based on gaming PCs might individually present a reliability range of 90-95%, influenced by a variety of elements.

Unpacking the Benefits of Distributed Systems

Global Redundancy Amidst Climate Change

The diverse geographical distribution of nodes (geo-redundancy) offers an inherent safeguard against the increasing unpredictability of climate change. As extreme weather events, natural disasters, and environmental challenges become more frequent, centralized data centers in vulnerable regions are at heightened risk of disruptions. However, with nodes spread across various parts of the world, the distributed cloud system ensures that if one region faces climate-induced challenges or outages, the remaining global network can compensate, maintaining continuous availability.
This decentralized approach not only ensures business continuity in the face of environmental uncertainties but also underscores the importance of forward-thinking infrastructure planning in our changing world.

Seamless Scalability

Distributed systems are designed for effortless horizontal scaling. Integrating more nodes into a group is a straightforward process.

Fortifying Against Localized Disruptions

Understanding the resilience against localized disruptions is pivotal when appreciating the strengths of distributed systems. This is especially evident when juxtaposed against potential vulnerabilities of a centralized model, like relying solely on a specific AWS region such as US-East-1.

Catering to AI’s Growing Demands

Harnessing idle gaming PCs is not just innovative but also a strategic response to the escalating computational needs of emerging AI enterprises. As AI technologies advance, the quest for affordable, scalable computational power intensifies. Gaming PCs, often equipped with high-end GPUs, present an ingenious solution to this challenge.

Achieving Lower Latency

The vast geographic distribution of nodes means data can be processed or stored closer to end-users, potentially offering reduced latency for specific applications.

Cost-Effective Solutions

Tapping into the dormant resources of idle gaming PCs can lead to substantial cost savings compared to the expenses of maintaining dedicated data centers.

The Collective Reliability Factor

While individual nodes might have a reliability rate of 90-95%, the combined reliability of the entire system can be significantly higher, thanks to redundancy and the sheer number of nodes. Consider this analogy: flipping a coin has a 50% chance of landing tails. But flipping two coins simultaneously reduces the probability of both landing tails to 25%. For three coins, it’s 12.5%, and so on. Applying this to our nodes, if each node has a 10% chance of being offline, the probability of two nodes being offline simultaneously is just 1%. As the number of nodes increases, the likelihood of all of them being offline simultaneously diminishes exponentially. Thus, as the size of a network grows, the chances of the entire system experiencing downtime decrease dramatically. Even if individual nodes occasionally falter, the distributed nature of the system ensures its overall availability remains impressively high.

Here is a real example: 24 hours sampled from a production AI image generation workload with 100 requested nodes. As we would expect, it’s fairly uncommon for all 100 to be running at the same time, but 100% of the time we have at least 82 live nodes. For this customer, 82 simultaneous nodes offered plenty of throughput to keep up with their own internal SLOs, and provided a zero-downtime experience.

Gaming PCs as a Robust, High-Availability Solution for the AI GPU Shortage

While gaming PC nodes might seem to offer modest reliability compared to enterprise servers, when viewed as part of a distributed system, they present a robust, high-availability solution. This system, with its inherent benefits of redundancy, scalability, and resilience, can be expertly managed to provide a formidable alternative to traditional centralized systems. By leveraging the untapped potential of gaming PCs, we not only address the growing computational demands of industries like AI but also pave the way for a more resilient, cost-effective, and globally distributed cloud.
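The coin analogy above generalizes to a simple binomial calculation. Here is a rough sketch that estimates the chance of having at least a given number of nodes online, assuming each node is independently up with some probability; real nodes are not perfectly independent, so treat this as an illustration rather than an SLA.

```python
# Sketch: probability that at least k of n requested nodes are online,
# if each node is independently up with probability p.
from math import comb

def p_at_least(n: int, k: int, p: float) -> float:
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# With 100 requested nodes at 90% individual reliability, having at least
# 82 online at any moment comes out around 99.8%.
print(p_at_least(100, 82, 0.90))
```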

Stable Diffusion XL (SDXL) Benchmark – 769 Images Per Dollar on Salad


Stable Diffusion XL (SDXL) Benchmark

A couple months back, we showed you how to get almost 5000 images per dollar with Stable Diffusion 1.5. Now, with the release of Stable Diffusion XL, we’re fielding a lot of questions regarding the potential of consumer GPUs for serving SDXL inference at scale. The answer from our Stable Diffusion XL (SDXL) Benchmark: a resounding yes. In this benchmark, we generated 60.6k hi-res images with randomized prompts, on 39 nodes equipped with RTX 3090 and RTX 4090 GPUs. We saw an average image generation time of 15.60s, at a per-image cost of $0.0013. At 769 SDXL images per dollar, consumer GPUs on Salad’s distributed cloud are still the best bang for your buck for AI image generation, even when enabling no optimizations on Salad and all optimizations on AWS.

Architecture

We used an inference container based on SDNext, along with a custom worker written in TypeScript that implemented the job processing pipeline. The worker used HTTP to communicate with both the SDNext container and with our batch framework. Our simple batch processing framework comprises:

Discover our open-source code for a deeper dive:

Deployment on Salad

We set up a container group targeting nodes with 4 vCPUs, 32GB of RAM, and GPUs with 24GB of VRAM, which includes the RTX 3090, 3090 Ti, and 4090. We filled a queue with randomized prompts in the following format:

We used ChatGPT to generate roughly 100 options for each variable in the prompt, and queued up jobs with 4 images per prompt. SDXL is composed of two models, a base and a refiner. We generated each image at 1216 x 896 resolution, using the base model for 20 steps, and the refiner model for 15 steps. You can see the exact settings we sent to the SDNext API.

Results – 60,600 Images for $79

Over the benchmark period, we generated more than 60k images, uploading more than 90GB of content to our S3 bucket, incurring only $79 in charges from Salad, which is far less expensive than using an A10G on AWS, and orders of magnitude cheaper than fully managed services like the Stability API. We did see slower image generation times on consumer GPUs than on datacenter GPUs, but the cost differences give Salad the edge. While an optimized model on an A100 did provide the best image generation time, it was by far the most expensive per image of all methods evaluated. Grab a fork and see all the salads we made here on our GitHub page.

Future Improvements

For comparison with AWS, we gave them several advantages that we did not implement in the container we ran on Salad. In particular, torch.compile isn’t practical on Salad, because it adds 40+ minutes to the container’s start time, and Salad’s nodes are ephemeral. However, such a long start time might be an acceptable tradeoff in a datacenter context with dedicated nodes that can be expected to stay up for a very long time, so we did use torch.compile on AWS. Additionally, we used the default fp32 variational autoencoder (VAE) in our Salad worker, and an fp16 VAE in our AWS worker, giving another performance edge to the legacy cloud provider. Unlike re-compiling the model at start time, including an alternate VAE is something that would be practical to do on Salad, and is an optimization we would pursue in future projects.

Salad Cloud – Still The Best Value for AI/ML Inference at Scale

SaladCloud remains the most cost-effective platform for AI/ML inference at scale.
The recent benchmarking of Stable Diffusion XL further highlights the competitive edge this distributed cloud platform offers, even as models get larger and more demanding.

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.
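For readers who want to reproduce the base-plus-refiner settings described in this benchmark (20 base steps, 15 refiner steps, 1216 x 896 output) outside of SD.Next, here is a minimal 🤗 Diffusers sketch. The benchmark itself drove the SDNext HTTP API, so this is only an illustrative equivalent; the prompt is hypothetical.

```python
# Sketch: SDXL base + refiner two-stage generation with Diffusers.
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline, StableDiffusionXLPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photorealistic salad in a sunlit kitchen"  # hypothetical prompt
latents = base(
    prompt=prompt,
    width=1216,
    height=896,
    num_inference_steps=20,
    output_type="latent",  # hand the latents to the refiner instead of decoding
).images
image = refiner(prompt=prompt, image=latents, num_inference_steps=15).images[0]
image.save("sdxl_example.png")
```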

Whisper Large Inference Benchmark: 137 Days of Audio Transcribed in 15 Hours for Just $117


Save Over 99% On Audio Transcription Using Whisper-Large-v2 and Consumer GPUs

Harnessing the power of OpenAI’s Whisper Large V2, an automatic speech recognition model, we’ve dramatically reduced audio transcription costs and time. Here’s a deep dive into our benchmark against the substantial English CommonVoice dataset and how we achieved a 99.1% cost reduction.

A Costly Comparison

Traditionally, utilizing a managed service like AWS Transcribe would set you back about $10,500 for transcribing the entirety of the English CommonVoice dataset. Using a custom model? That’s an even steeper $13,134. In contrast, our approach using Whisper on SaladCloud incurred just $117, achieving the same result.

Behind The Scenes: Our Architecture

Our simple batch processing framework comprises:

We wanted to keep the framework components fully managed and serverless, to provide as close of an analogue as possible to using managed transcription services. The framework itself incurred a cost of $28 during transcription, mainly due to S3 costs associated with uploading and downloading millions of files. This amount does not include any costs from the node pool. Discover our open-source code for a deeper dive:

Deployment on SaladCloud

With our inference container and services ready, we leveraged SaladCloud’s Public API. We used the API to deploy 2 identical container groups with 100 replicas each, all using the modest RTX 3060 with only 12GB of vRAM. We filled the job queue with URLs to the 2.2 million audio clips included in the dataset, and hit start on our container groups. Our tasks were completed in a mere 15 hours, incurring $89 in costs from Salad, and $28 in costs from our batch framework.

Performance Comparison of Whisper-Large-v2 Across Different Clouds

The result? An average transcription rate of one hour of audio every 16.47 seconds, translating to an impressive $0.00059 per audio minute. Notably, SaladCloud’s cost-performance ratio dramatically outshined major competitors, even when deploying custom models. It’s worth noting that AWS Transcribe’s billing structure can greatly inflate costs for shorter audio clips (which comprise most of the CommonVoice corpus), a setback not encountered on per-second billing platforms, and their cost-performance would likely improve somewhat when transcribing longer content. We tried to set up an apples-to-apples comparison by running our same batch inference architecture on AWS ECS… but we couldn’t get any GPUs. The GPU shortage strikes again.

Optimizing Further

While our benchmark results are already quite compelling, there are areas we’ve identified for potential performance enhancements. By integrating these process improvements, we anticipate that the overall transcription throughput could see an enhancement of 20-50% on this dataset. This would not only reduce processing time but also lead to even more significant cost savings, maximizing the efficiency of this approach.

SaladCloud: The Most Affordable GPU Cloud for AI Audio Transcription

For startups and developers eyeing cost-effective, powerful GPU solutions, SaladCloud is a game changer. Boasting the market’s most competitive GPU prices, it offers a solution to sky-high cloud bills and limited GPU availability. In an era where cost-efficiency and performance are paramount, leveraging the right tools and architecture can make all the difference. Our Whisper Large Inference Benchmark is a testament to the savings and efficiency achievable with innovative approaches.
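As a quick sanity check on the headline numbers, the per-minute cost and fleet-wide transcription rate follow directly from the totals above; small differences from the quoted 16.47s come from rounding the inputs.

```python
# Sketch: back-of-the-envelope check of the benchmark totals.
audio_minutes = 137 * 24 * 60        # 137 days of audio = 197,280 minutes
total_cost = 89 + 28                 # $89 from Salad + $28 from the batch framework
print(total_cost / audio_minutes)    # ~$0.00059 per audio minute

audio_hours = 137 * 24
wall_seconds = 15 * 3600             # 15 hours of wall-clock time across the fleet
print(wall_seconds / audio_hours)    # ~16.4s of wall-clock per hour of audio
```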
We invite developers and startups to explore our open-source resources and discover the potential for themselves.

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.