
Stable Diffusion v1.5 Benchmark On Consumer GPUs

Shawn Rushefsky

Benchmarking Stable Diffusion v1.5 across 23 consumer GPUs

What’s the best way to run Stable Diffusion inference at scale? It depends on many factors. In this Stable Diffusion (SD) benchmark, we used SD v1.5 with a ControlNet to generate over 460,000 fancy QR codes. The benchmark was run across 23 different consumer GPUs on SaladCloud. Here, we share some of the key learnings for serving Stable Diffusion inference at scale on consumer GPUs.

The Evaluation

For each GPU type, we compared 4 different backends, 3 batch sizes (1, 2, 4), and 2 resolutions (512×512, 768×768), generating images at 15 steps and at 50 steps. Our time measurements include the time taken to generate the image and return it to another process running on localhost. However, we do not include the time taken to generate the base QR code, upload images, or fetch new work from the queue. We recommend handling these tasks asynchronously in order to maximize GPU utilization.
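To make that asynchronous handling concrete, here is a minimal sketch of the worker pattern, assuming placeholder functions fetch_job, generate_batch, and upload_images (not the actual benchmark code): the GPU thread generates continuously while a small thread pool overlaps queue fetches and uploads with generation.

```python
# Minimal sketch of overlapping I/O with generation. fetch_job,
# generate_batch, and upload_images are assumed placeholders.
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=4)

def run_worker(queue_url: str) -> None:
    # Prefetch the first job off the GPU thread.
    next_job = io_pool.submit(fetch_job, queue_url)
    while True:
        job = next_job.result()
        if job is None:
            break
        # Start fetching the next job before the GPU gets busy.
        next_job = io_pool.submit(fetch_job, queue_url)
        images = generate_batch(job)  # GPU-bound; runs on this thread
        # Upload without blocking generation. In production you would
        # keep these futures and check them for errors.
        io_pool.submit(upload_images, job, images)
```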

Our cost numbers are derived from the SaladCloud Pricing Calculator, using 2 vCPU and 12 GB of RAM. Costs do not include storage, data transfer, queueing, database, etc., but those services added up to only about $2 total for the entire project. We used DreamShaper 8 along with the QR Code Monster ControlNet to generate the images, with the Euler Ancestral scheduler/sampler.
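For readers who want to reproduce the setup, a minimal sketch with 🤗 Diffusers might look like the following. The Hugging Face repo IDs, prompt, and conditioning scale are our assumptions for illustration, not taken from the benchmark code.

```python
# Sketch of the model setup described above (repo IDs and parameter
# values are illustrative assumptions).
import torch
from PIL import Image
from diffusers import (
    StableDiffusionControlNetPipeline,
    ControlNetModel,
    EulerAncestralDiscreteScheduler,
)

controlnet = ControlNetModel.from_pretrained(
    "monster-labs/control_v1p_sd15_qrcode_monster", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "Lykon/dreamshaper-8", controlnet=controlnet, torch_dtype=torch.float16
)
# The benchmark used the Euler Ancestral scheduler/sampler.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

qr_code_image = Image.open("qr.png")  # hypothetical pre-generated base QR code

image = pipe(
    prompt="a cozy autumn village, detailed illustration",  # example prompt
    image=qr_code_image,
    num_inference_steps=15,
    controlnet_conditioning_scale=1.3,  # tune per prompt; affects scannability
).images[0]
```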

Cold Start Time

We also evaluated cold start time for the various backends, which measures the time from when a container starts to when it is ready to serve inference. It does not include the time required to download the container image to the host. For each backend, we report the average cold start time on the GPU where that backend started fastest.
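As a rough illustration of how such a measurement can be taken, the snippet below times readiness by polling the backend’s HTTP endpoint; the URL and port are assumptions that vary by backend, and the timer would need to start as the container starts.

```python
# Time from process start to a successful response from the backend
# (URL/port are illustrative; each backend exposes a different route).
import time
import requests

start = time.monotonic()  # captured as early as possible after container start
while True:
    try:
        if requests.get("http://127.0.0.1:7860/", timeout=2).ok:
            break
    except requests.RequestException:
        pass  # backend not up yet
    time.sleep(1)
print(f"cold start: {time.monotonic() - start:.1f}s")
```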

Average cold start time for GPUs on SaladCloud – Stable Diffusion benchmark

For the stable-fast backend, with the models included in the container, the RTX 4090 has the best average cold start time, while the GTX 1660 has the worst. The empty spot for GTX 1660 Super indicates that no nodes successfully started.

Architecture

We used the same standard batch processing architecture as many of our other benchmarks.

SaladCloud architecture for deploying Stable Diffusion v1.5

The Backends

stable-fast-qr-code

Sample QR code generated with the stable-fast backend
  • Size without models: 2.88 GB
  • Size with models: 6.04 GB
  • Best Avg Cold Start Time: 124.2s on RTX 4090

This is the only custom backend we used for this benchmark. It uses 🤗 Diffusers with stable-fast. You’ll see in the results that it performed extremely well, almost always taking the top spot for performance and cost performance. However, there are important caveats to consider before choosing to deploy this or any other custom backend.

stable-fast adds a compilation step on startup, which can take several minutes longer than the other backends take to start. It also achieves its best performance by locking the image size at startup. For many image generation use cases, dynamic sizing is too important to give up, so this would not be feasible. For other use cases, such as bulk-generating fancy QR codes as we did here, it’s ideal. Other build-vs-buy factors should also be taken into consideration.
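For reference, the compilation step looks roughly like this with stable-fast’s pipeline compiler, applied to the Diffusers pipeline from the earlier sketch; treat the exact flags as assumptions based on the project’s documented interface.

```python
# Hedged sketch of stable-fast compilation (flags per the sfast docs).
# `pipe` and `qr_code_image` come from the earlier Diffusers sketch.
from sfast.compilers.diffusion_pipeline_compiler import (
    compile as sfast_compile,
    CompilationConfig,
)

config = CompilationConfig.Default()
config.enable_xformers = True    # if xformers is installed
config.enable_triton = True      # if triton is installed
config.enable_cuda_graph = True  # fixes shapes, hence the locked image size

pipe = sfast_compile(pipe, config)

# The first call traces and compiles, which is the multi-minute cold-start
# cost; later calls at the same resolution/batch size run much faster.
_ = pipe(prompt="warmup", image=qr_code_image, num_inference_steps=15).images
```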

Automatic1111

Sample QR code generated with the Automatic1111 backend
  • Size without models: 3.13 GB
  • Size with models: 5.61 GB
  • Best Avg Cold Start Time: 40.7s on RTX 3090 Ti

While designed and built as a user interface for running Stable Diffusion on your own PC, Automatic1111 is also a very popular inference backend for many commercial SD-powered applications. It boasts wide model and workflow compatibility, is very extensible, and shows strong performance in most categories.
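When Automatic1111 is launched with its API enabled (the --api flag), it serves HTTP endpoints that a worker can call directly; a minimal txt2img request against a local instance might look like this (payload fields per the A1111 API, default port assumed).

```python
# Minimal txt2img call against a local A1111 instance started with --api.
import base64
import requests

resp = requests.post(
    "http://127.0.0.1:7860/sdapi/v1/txt2img",
    json={
        "prompt": "a cozy autumn village, detailed illustration",
        "steps": 15,
        "width": 512,
        "height": 512,
        "batch_size": 4,
    },
    timeout=300,
)
resp.raise_for_status()
for i, img_b64 in enumerate(resp.json()["images"]):  # base64-encoded PNGs
    with open(f"out_{i}.png", "wb") as f:
        f.write(base64.b64decode(img_b64))
```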

ComfyUI

Sample QR code generated with the ComfyUI backend
  • Size without models: 4.03 GB
  • Size with models: 6.48 GB
  • Best Avg Cold Start Time: 15.5s on RTX 4090

ComfyUI is another popular user interface for Stable Diffusion, but it has a node-and-link-based interface that mirrors the underlying components of a workflow. It is the most customizable of the backends, and it has some caching features that are beneficial when not all parameters change between generations.
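ComfyUI’s server accepts those node graphs as JSON over HTTP, which is how a headless worker drives it. A hedged sketch of queueing one workflow follows; the workflow file is a hypothetical export from ComfyUI’s "Save (API Format)" option, and the default port is assumed.

```python
# Queue a workflow on a local ComfyUI instance via its HTTP API.
import json
import uuid
import requests

with open("qr_workflow_api.json") as f:  # hypothetical exported workflow
    workflow = json.load(f)

resp = requests.post(
    "http://127.0.0.1:8188/prompt",
    json={"prompt": workflow, "client_id": str(uuid.uuid4())},
    timeout=30,
)
resp.raise_for_status()
prompt_id = resp.json()["prompt_id"]
# Poll /history/<prompt_id> (or listen on the websocket) for the outputs.
```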

Stable Diffusion v1.5 Benchmark: Results

Stable Fast is the clear winner here, both in terms of speed and cost. However, while the performance is impressive, building and maintaining a custom backend comes with a lot of additional challenges vs using one of the highly flexible, community-maintained options. In particular, if you’ve already built your solution using one of these off-the-shelf options, you likely do not want to refactor your entire codebase around a new backend. We’ve included some results that exclude Stable Fast for those of you in this situation.

Best Inference Time (15 steps)

With an impressive 27.3 steps/second, Stable Fast achieved outstanding performance on the RTX 4090, generating batches of 4 512×512 images.

Best Inference Time (15 steps) - stable diffusion

Best Inference Time (50 Steps)

With a 50-step generation, Stable Fast performed even better, achieving 37.6 steps per second on batches of 4 512×512 images.

Best Inference Time (50 Steps) - stable diffusion benchmark
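As a sanity check on those rates, and assuming the steps/second figure counts scheduler steps for the whole batch of 4 (the post does not state this explicitly), the per-batch and per-image latencies work out as follows:

```python
# Back-of-the-envelope latency from the reported rates (assumes the
# steps/second figure is per batch of 4).
for steps, rate in [(15, 27.3), (50, 37.6)]:
    batch_s = steps / rate
    print(f"{steps} steps: {batch_s:.2f}s per batch, {batch_s / 4:.2f}s per image")
# 15 steps: ~0.55s per batch (~0.14s/image)
# 50 steps: ~1.33s per batch (~0.33s/image)
```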

Best Cost Performance – 15 Steps

This measures cost performance for a given combination of backend and GPU across all 15-step image generation tasks, including all batch sizes and image sizes.

Best cost performance - 15 steps - stable diffusion benchmark
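The post doesn’t spell out the cost-performance unit here and in the sections below; reading it as images per dollar, the calculation is straightforward, where the hourly rate is whatever the pricing calculator gives for the node configuration (2 vCPU, 12 GB RAM, plus the GPU).

```python
# Cost performance read as images per dollar (our assumption).
def images_per_dollar(seconds_per_image: float, hourly_rate_usd: float) -> float:
    images_per_hour = 3600.0 / seconds_per_image
    return images_per_hour / hourly_rate_usd

# e.g. images_per_dollar(0.14, 0.35) with a hypothetical $0.35/hr node
```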

Best Cost Performance – 50 Steps

This measures cost performance for a given combination of backend and GPU across all 50-step image generation tasks, including all batch sizes and image sizes.

Best Cost Performance - 50 Steps

Best Inference Time in Each Task – 15 Steps

This measures the average inference time at each resolution and batch size, with 15 steps.

Best Cost Performance in Each Task – 15 Steps

While Stable Fast offered the best overall performance and the best overall cost performance, it was not the absolute best in all tasks for 15-step generations, sharing that honor with Automatic1111. It’s worth noting that A1111 achieved impressive cost-performance results on much lower-end hardware, which may be significantly easier to source.

Best Inference Time in Each Task – 50 Steps

This measures the average inference time at each resolution and batch size, with 50 steps.

Best Cost Performance in Each Task – 50 Steps

Stable Fast absolutely dominated the 50-step generation tasks, taking a comfortable first place in all categories.

Best Cost Performance in Each Task (no stable-fast) – 15 Steps

Here, we pull Stable Fast out of the results to compare the rest.

Best Cost Performance in Each Task (no stable-fast) – 50 Steps

A1111 – Best Inference Time by GPU

A1111 – Best Inference Time by GPU

A1111 – Best Cost Performance by GPU

This measures the cost performance of Automatic1111 across all image generation tasks for each GPU.

A1111 – Best Cost Performance by GPU (higher is better)

SD.Next – Best Inference Time By GPU

SD.Next – Best Inference Time by GPU (lower is better)

SD.Next – Best Cost Performance by GPU

This measures the cost performance of SD.Next across all image generation tasks, for each GPU.

SD.Next – Best Cost Performance by GPU

ComfyUI – Best Inference Time By GPU

ComfyUI – Best Inference Time by GPU

ComfyUI – Best Cost Performance by GPU

This measures the cost performance of ComfyUI across all image generation tasks for each GPU.

ComfyUI – Best Cost Performance by GPU (higher is better)

Stable Fast – Best Inference Time by GPU

stable-fast-qr-code – Best Inference Time by GPU

Stable Fast – Best Cost Performance by GPU

This measures the cost performance of Stable Fast across all image generation tasks for each GPU.

stable-fast-qr-code – Best Cost Performance by GPU

Observations

  1. Do not use GTX-series GPUs for production Stable Diffusion inference. Absolute performance and cost performance are dismal on the GTX series, and in many cases the benchmark could not be fully completed, with jobs repeatedly running out of CUDA memory. Additionally, many images generated on these GPUs came out all black instead of the fancy QR codes we wanted.
  2. There are very few surprises regarding which GPU is the fastest for each backend. Newer GPUs with higher model numbers are faster in nearly all situations.
  3. Batching saves time and money. In most situations, you can expect anywhere from 5–30% savings using batch size 4 vs. batch size 1 (see the sketch after this list).
  4. Generation time scales close to linearly with the number of pixels. A 768×768 image has 2.25x the pixels of a 512×512 image and typically takes around 2x as long to generate.
  5. You can get surprisingly good cost performance out of the 20-series and 30-series RTX GPUs, regardless of the backend you choose.
  6. If you have a use-case that allows you to take advantage of the optimizations offered by Stable Fast, and the engineering availability to build and maintain an in-house solution, this is a great option that could save you a bunch of money while providing a fast and reliable image generation experience for your users.
  7. Many factors influence the scannability of these stable diffusion QR codes, and consistently getting good results is no simple task. Shorter URLs lead to better results, as there is less data to encode. Using QR codes with lighter backgrounds leads to easier scanning but less interesting images. Some prompts work much better than others, and some prompts can sustain much higher guidance than others. In addition, iOS and Android phones use different QR scanning implementations, so some codes scan fine on one platform but not the other.
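The batching savings in observation 3 come almost for free with Diffusers: using the pipeline and base QR image from the earlier sketch, one call produces the whole batch.

```python
# Batch of 4 in a single call; per-step overhead is amortized across the
# batch (`pipe` and `qr_code_image` come from the earlier sketch).
images = pipe(
    prompt="a cozy autumn village, detailed illustration",
    image=qr_code_image,
    num_inference_steps=15,
    num_images_per_prompt=4,
).images
```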

Other Technology Choices

  • R2 – S3-compatible blob storage from Cloudflare with no bandwidth charges. We generated about 130 GB of images, which has a monthly storage cost of $1.80. The write operations required fit within the free tier of usage. To handle secure uploads, we used pre-signed upload URLs included with each job (see the sketch after this list).
  • SQS – Fully managed message queue service from AWS. This whole benchmark fits within the free tier of usage.
  • DynamoDB – Serverless NoSQL database from AWS. The entire benchmark fits within the free tier of usage.
  • Lambda – Serverless functions from AWS, used to provide HTTP endpoints that gate access to SQS and DynamoDB. The entire benchmark fits within the free tier of usage.
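Uploading through a pre-signed URL keeps cloud credentials out of the worker entirely. A minimal sketch, assuming each job carries a hypothetical upload_url field:

```python
# PUT a generated image to a pre-signed R2 URL (job["upload_url"] is a
# hypothetical field included with each job).
import io
import requests

buf = io.BytesIO()
image.save(buf, format="PNG")  # `image` is a PIL image from the pipeline
resp = requests.put(
    job["upload_url"],
    data=buf.getvalue(),
    headers={"Content-Type": "image/png"},
    timeout=60,
)
resp.raise_for_status()
```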
