SaladCloud Blog

Stable Diffusion v1.4 Inference Benchmark – GPUs & Clouds Compared

Daniel Sarfati

Stable Diffusion v1.4 GPU Benchmark – Inference

Stable Diffusion v1.4 is a text-to-image diffusion model developed by CompVis in collaboration with Stability AI. Using a diffusion process, it produces visually appealing and coherent images that depict the given input text. That makes it a valuable asset for applications such as visual storytelling, content creation, and artistic expression. In this benchmark, we evaluate the inference performance of Stable Diffusion v1.4 on different compute clouds and GPUs.

Our goal is to answer a few key questions that developers ask when deploying a Stable Diffusion model to production:

  • Do you really need a high-end GPU like the A100/H100 for inference?
    • Answer: Not always, but it is necessary when image generation time is important
  • What matters more: image generation time or cost?
    • Answer: It depends on your growth, user expectations, and monetization options
  • Are consumer-grade GPUs a cost-effective alternative for inference?
    • Answer: Definitely, if profitability and reducing cloud costs are important

Benchmark Parameters

For the benchmark, we compared consumer-grade, mid-range GPUs on two community clouds (SaladCloud and RunPod) with higher-end GPUs on three big-box cloud providers. To deploy on SaladCloud, we used the 1-click deployment for Stable Diffusion (SD) v1.4 on the Salad Portal via pre-built recipes.

Cloud providers considered:

Google Cloud Platform (GCP), Amazon Web Services (AWS), Microsoft Azure Cloud, RunPod, and SaladCloud.

GPUs considered:

RTX 3060, RTX 3090, A100, V100, T4, RTX A5000

Link to model: https://huggingface.co/CompVis/stable-diffusion-v1-4

Prompt: ‘a bowl of salad in front of a computer’

The benchmark uses a text prompt as input. Outputs were 512×512 images generated with 50 inference steps, as recommended in this HuggingFace blog.
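As a rough illustration of how seconds per image could be measured for this setup, here is a minimal sketch using the Hugging Face diffusers library. It is not the harness used for this benchmark: the half-precision setting, the number of timed runs, the warm-up run, and the helper names (`load_pipeline`, `generate_once`, `seconds_per_image`) are all assumptions made for the example.

```python
import statistics
import time
from typing import Callable, List

MODEL_ID = "CompVis/stable-diffusion-v1-4"
PROMPT = "a bowl of salad in front of a computer"


def load_pipeline(device: str = "cuda"):
    """Load Stable Diffusion v1.4 with diffusers.

    Imports are deferred so the timing helper below can be used on a
    machine without torch/diffusers installed.
    """
    import torch
    from diffusers import StableDiffusionPipeline

    # Assumption: half precision. The article does not state the precision used.
    pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
    return pipe.to(device)


def generate_once(pipe):
    """Generate one 512x512 image with 50 inference steps."""
    return pipe(PROMPT, height=512, width=512, num_inference_steps=50).images[0]


def seconds_per_image(generate: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Return the mean wall-clock seconds per call, excluding warm-up calls."""
    for _ in range(warmup):
        generate()
    timings: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Usage on a GPU machine (not executed here):
#   pipe = load_pipeline()
#   print(seconds_per_image(lambda: generate_once(pipe)))
```

The warm-up call is excluded because the first generation typically includes one-time setup costs that would skew the average.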

Image: A bowl of Salad in front of a computer – generated from the benchmark

For the comparison, we focused on two main criteria:

Images Per Dollar (Img/$)

Training Stable Diffusion requires high-end GPUs with large amounts of VRAM. For inference, however, the more relevant metric is images per dollar. There have been multiple instances of rapid user growth on a text-to-image platform causing either skyrocketing cloud bills or a mad scramble for GPUs. A high number of images generated per dollar means lower cloud costs, so generative AI companies can grow at scale profitably.
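One plausible way to compute images per dollar is from the generation time and the hourly GPU price. The sketch below assumes a GPU that generates one image at a time and is billed only while busy; the article does not spell out its exact formula, and the input values in the example are hypothetical, not benchmark results.

```python
def images_per_dollar(sec_per_img: float, price_per_hour: float) -> float:
    """Images generated per dollar, assuming one image at a time at full utilization."""
    if sec_per_img <= 0 or price_per_hour <= 0:
        raise ValueError("sec_per_img and price_per_hour must be positive")
    images_per_hour = 3600.0 / sec_per_img
    return images_per_hour / price_per_hour


# Hypothetical inputs for illustration only (not taken from the benchmark):
# a GPU that takes 6 seconds per image and costs $0.10 per hour.
example = images_per_dollar(sec_per_img=6.0, price_per_hour=0.10)
print(f"{example:.0f} images per dollar")
```

Under this formula, halving either the generation time or the hourly price doubles the images per dollar, which is why a slower but much cheaper GPU can come out ahead.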

Seconds Per Image (sec/img)

Expectations for image generation time vary widely across SD-based image generation tools. In some cases, end users expect images in under 5 seconds (DALL-E, Canva, Picfinder, etc.). In others, like Secta.ai, users expect results in a few minutes to hours. Image generation times can also vary by pricing tier: free-tier users can expect to wait a couple of seconds longer than users paying the highest price for access.

Stable Diffusion GPU Benchmark – Results


Image: Stable Diffusion benchmark results showing a comparison of images per dollar for different GPUs and clouds

The benchmark results show the consumer-grade GPUs outperforming the high-end GPUs, giving more images per dollar with a comparable image generation time. For generative AI companies serving inference at scale, more images per dollar puts them on the path to profitable, scalable growth.


Image: Stable Diffusion benchmark results showing a comparison of image generation time

Some interesting observations from the benchmark:

  • For AI/ML inference at scale, the consumer-grade GPUs on community clouds outperformed the high-end GPUs on major cloud providers.
  • The A100s and H100s get all the hype, but for inference at scale, the RTX series from NVIDIA is the clear winner, delivering at least 4x more images per dollar than the other high-end GPUs.
  • On the images per dollar front, the RTX 3060 is the surprising winner, racking up 5,244 images per dollar (or around $191 for a million images generated) with an image generation time of 6.6 seconds. While the A100 and RTX 3090 are faster in image generation, you get more images per dollar with the RTX 3060, which is crucial when you are serving inference at scale.
  • On-demand pricing was used for both RunPod and Salad. Both clouds also offer cheaper options: spot instances on RunPod and enterprise bulk pricing on Salad.
  • Image generation time for the consumer-grade GPUs is as good as, and for the RTX 3090 better than, that of the A100, everyone’s favorite Stable Diffusion GPU. With the A100 on AWS costing 10x more per hour, you reduce your cloud cost significantly with consumer-grade GPUs.
  • The A100, one of the most powerful and expensive GPUs, is the crowd favorite for training Stable Diffusion. But for inference at scale, it is no match for the consumer-grade GPUs. The RTX 3090 gives 12x more images per dollar and the RTX 3060 delivers a whopping 17x more images per dollar.
  • The A5000 had the fastest image generation time at 3.15 seconds, with the RTX 3090 taking just 0.25 seconds more to generate an image. Image generation on the T4 is the slowest, with the V100 not far behind. But the T4 redeems itself with a healthy number of images per dollar, while the V100 shows really low numbers for Stable Diffusion v1.4.
  • Choosing the right GPU/cloud is a matter of finding the best compromise. The RTX 3060 is just 2 seconds slower than the A100 but generates almost 17x more images per dollar, making it a clear winner when it comes to the best compromise.
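The RTX 3060 figures quoted above can be checked with a little arithmetic. The sketch below takes the two reported numbers (5,244 images per dollar and 6.6 seconds per image) and derives the cost per million images. It also back-calculates an implied hourly price, which is an estimate that assumes one image at a time at full utilization, not a price quoted in the article.

```python
# Figures reported in the article for the RTX 3060.
RTX3060_IMG_PER_DOLLAR = 5244
RTX3060_SEC_PER_IMG = 6.6


def cost_per_million_images(img_per_dollar: float) -> float:
    """Dollars needed to generate one million images."""
    return 1_000_000 / img_per_dollar


def implied_price_per_hour(sec_per_img: float, img_per_dollar: float) -> float:
    """Hourly price implied by the two metrics.

    Assumption: the GPU generates one image at a time and is fully utilized.
    """
    images_per_hour = 3600.0 / sec_per_img
    return images_per_hour / img_per_dollar


cost = cost_per_million_images(RTX3060_IMG_PER_DOLLAR)  # about $190.69
price = implied_price_per_hour(RTX3060_SEC_PER_IMG, RTX3060_IMG_PER_DOLLAR)  # about $0.104/hr

print(f"Cost per million images: ${cost:.2f}")
print(f"Implied hourly price: ${price:.3f}")
```

The first result matches the "around $191 for a million images" figure in the list above.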

Deploying Stable Diffusion v1.4 on SaladCloud

Stable Diffusion v1.4 is available for 1-click deployment as a ‘Recipe’ on Salad Portal, accessible at https://portal.salad.com/.

This recipe is served by an HTTP server. Once the recipe has been deployed to Salad, you will be provided with a unique URL that can be used to access the model. To secure your recipe, all requests must include the Salad-Api-Key header with your individual Salad API Token, which can be found in your account settings.

Example API Request


Parameters required

  • prompt: Your prompt for Stable Diffusion to generate
  • negative_prompt: What the generated image should not contain
  • num_inference_steps: The number of steps used to generate each image
  • guidance_scale: How closely the final image should follow the prompt
  • width: Width in pixels of your final image
  • height: Height in pixels of your final image
  • seed: The seed to generate your images from
  • num_images_per_prompt: The number of images to generate for your prompt
  • PIPELINE: Which pipeline to use
  • SCHEDULER: Which scheduler to use
  • safety_checker: Enable or disable the NSFW filter on models. Note that some models may force this to be enabled anyway
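As a rough sketch of what a request to a deployed recipe might look like, the example below builds (but does not send) an HTTP request carrying the Salad-Api-Key header. Only the header name and the parameter meanings come from this article. The URL is a placeholder, and the POST method, the JSON body, and the underscore spelling of the parameter names (borrowed from the Hugging Face diffusers pipeline arguments) are assumptions, so check the recipe's own documentation for the actual request format.

```python
import json
import urllib.request


def build_request(base_url: str, api_key: str, prompt: str, **params) -> urllib.request.Request:
    """Build a request for a deployed recipe without sending it.

    Assumptions: POST with a JSON body. The article only specifies the
    Salad-Api-Key header and the parameter meanings.
    """
    payload = {"prompt": prompt, **params}
    return urllib.request.Request(
        url=base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Salad-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_request(
    "https://example.invalid/generate",  # placeholder; use the unique URL from your deployment
    "YOUR_SALAD_API_KEY",                # placeholder; use the token from your account settings
    "a bowl of salad in front of a computer",
    num_inference_steps=50,              # parameter spellings are assumed
    width=512,
    height=512,
)

# To actually send it: urllib.request.urlopen(req)
```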

Example API Response


Stable Diffusion XL 0.9 on consumer-grade GPUs

The pace of development in the generative AI space has been tremendous. Stability.ai just announced SDXL 0.9, the most advanced development in the Stable Diffusion text-to-image suite of models. SDXL 0.9 produces massively improved image and composition detail over its predecessor.

In the announcement, Stability.ai noted that SDXL 0.9 can be run on a modern consumer GPU with just 16GB RAM and a minimum of 8GB of vRAM. Chalk it up as another win for consumer-grade GPUs in the race to serve inference at scale.

