SaladCloud Blog


Optimizing AI Image Generation: Streamlining Stable Diffusion with ControlNet in Containerized Environments

Shawn Rushefsky

Implementing Stable Diffusion with ControlNet in any containerized environment comes with a plethora of challenges, mainly due to the sizable amount of additional model weights required to be incorporated in the image. In this blog, we discuss these challenges and how they can be mitigated.

What is ControlNet for Stable Diffusion?

ControlNet is a network structure that empowers users to manage diffusion models by setting extra conditions. It gives users immense control over the images they generate using Stable Diffusion, using a single reference image without noticeably inflating VRAM requirements. ControlNet has revolutionized AI image generation, demonstrating a key advantage of open models like Stable Diffusion against proprietary competitors such as Midjourney.

ControlNet for Stable Diffusion: Reference Image (left), Depth-Midas (middle), Output Image (right)

Note how the image composition remains the same between the reference image and the final image. This is accomplished by using the depth map as a control image. Images from using the Dreamshaper modelMiDas Depth Estimation, and the depth controlnet.

Challenges in Implementing ControlNet for Stable Diffusion

However, this remarkable feature presents challenges. There are (at the time of this writing) 14 distinct controlnets compatible with stable diffusion, each offering unique control over the output, necessitating a different “control image.”

All these models, freely accessible on Huggingface, in separate repositories, amount to roughly 4.3GB each, totaling up to an additional storage need of 60.8GB. This represents a near tenfold increase compared to a minimal Stable Diffusion image.

Furthermore, each ControlNet model comes with one or more image preprocessors used to decipher the “control image.” For instance, generating a depth map from an image is a prerequisite for one of the ControlNets. These additional model weights further bloat the total VRAM requirement, exceeding the capacity of commonly used graphics cards like the RTX3060 12GB.

The basic approach to implementing ControlNet in a containerized environment.

Optimizing ControlNet Implementation in a Containerized Environment on a Distributed Cloud

Building a container image for continuous ControlNet stable diffusion inference, if approached without optimization in mind, can rapidly inflate in size. This results in prohibitively long cold-start times as the image is downloaded onto the server running it. This problem is prominent in data centers and becomes even more pronounced in a distributed cloud such as Salad, which depends on residential internet connections of varying speed and reliability.

A two-pronged strategy can effectively address this issue. Firstly, isolate ControlNet annotation as a separate service or leverage a pre-built service like this one from This division of labor not only reduces VRAM requirements and model storage in the stable diffusion service but also enhances efficiency when creating numerous output images from a single input image and prompt.

Secondly, rather than cloning entire repositories for each model, perform a shallow clone of the model repository without git lfs. Then, use wget to selectively download only the necessary weights, as exemplified by this Dockerfile from the Stable Diffusion Service. This tactic alone can save more than 40GB of storage.

Stable Diffusion-split-service-controlnet

The end result? Two services, both with manageable container image sizes. The ControlNet Preprocessor Service, inclusive of all annotator models, sizes up to 7.5GB and operates seamlessly on an RTX3060 12GB.

The Stable Diffusion Service, even with every ControlNet packaged in, comes up to 21.1GB, and also runs smoothly on an RTX3060 12GB. Further reductions can be achieved by tailoring what ControlNets to support. For instance, excludes MLSD, Shuffle, and Segmentation ControlNets in their production image, thereby saving about 4GB of storage.

Have questions about SaladCloud for your workload?

Book a 15 min call with our team. Get $50 in testing credits.

Related Blog Posts

Openvoice text to speech gpu benchmark on SaladCloud

OpenVoice Text-to-Speech (TTS) Benchmark: 6 Million+ Words/$ Using Salad

What is OpenVoice? OpenVoice is an open-source, instant voice cloning technology that enables the creation of realistic and customizable speech from just a short audio clip of a reference speaker....
Read More
Whisper large v3 - Automatic speech - recognition - gpu benchmark

Whisper Large V3 Speech Recognition Benchmark: 1 Million hours of audio transcription for just $5110

Save over 99.8% on audio transcription using Whisper Large V3 and consumer GPUs A 99.8% cost-savings for automatic speech recognition sounds unreal. But with the right choice of GPUs and...
Read More
Recognize anything model++ gpu benchmark

Tag 309K Images/$ with Recognize Anything Model++ (RAM++) On Consumer GPUs

What is the Recognize Anything Model++? The Recognize Anything Model++ (RAM++) is a state of the art image tagging foundational model released last year, with pre-trained model weights available on...
Read More

Don’t miss anything!

Subscribe To SaladCloud Newsletter & Stay Updated.