SaladCloud Blog


Stable Diffusion XL: 3405 Images per Dollar On SaladCloud

Since our last SDXL benchmark nearly a year ago, a lot has changed. Community adoption of SDXL has increased significantly, bringing better tooling, performance improvements, and a better understanding of how to get good results from the model. While we previously used SD.Next for inference, ComfyUI has become the de facto image generation inference server for most professional use, owing to its high degree of flexibility, best-in-class performance, and consistently early support for new models and technologies. SaladCloud has introduced new priority pricing levels, offering significantly lower prices on all GPUs, including top-of-the-line models like the RTX 4090. These factors combine to yield much lower per-image inference costs than we achieved previously, with a much simpler build.

Benchmark Design

We deployed the “SDXL with Refiner – ComfyUI” recipe on Salad, using the default configuration but setting priority to “Batch” and requesting 10 replicas. We started the benchmark once at least 8 of 10 replicas were running. We used Postman’s collection runner feature to simulate load, first from 10 concurrent users, then ramping up to 18 concurrent users. Each test ran for 1 hour. Each virtual user submits requests to generate one image at a time.

We duplicated this setup to test the RTX 3090 (24GB VRAM), RTX 4080 (16GB VRAM), and RTX 4090 (24GB VRAM). We measured average round-trip time per request, cost per image, the number of running replicas, and failed requests throughout each test.

Deployment on SaladCloud

Click through the SDXL with Refiner recipe, available from the Container Groups interface, and set the replica count to 10. Optionally, set a non-default priority and/or enable authentication. For our benchmark, we used “Batch” priority and did not enable authentication.

Results

The RTX 4090 achieved the best performance in both inference time and cost per image, returning images in as little as 6.2s per image, at a cost as low as 3405 images per dollar. Across all tests, we can see that average round-trip time increases as load increases.
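As a concrete illustration of the request pattern used in this benchmark, a single image-generation call against a replica might look like the sketch below. This is a hedged sketch, assuming the recipe exposes ComfyUI’s standard `/prompt` endpoint; the abbreviated workflow graph and all names here are hypothetical, and the real recipe ships a full SDXL base + refiner graph.

```python
def build_prompt_payload(positive: str, negative: str) -> dict:
    """Build a minimal ComfyUI-style /prompt payload (abbreviated sketch)."""
    workflow = {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": positive, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["1", 1]}},
        # A real graph also wires in the sampler, the refiner pass, and an
        # image-save node -- omitted here for brevity.
    }
    return {"prompt": workflow}


def generate_image(base_url: str, positive: str, negative: str) -> bytes:
    """POST the payload to a running replica. Requires the `requests` package."""
    import requests  # third-party; pip install requests

    resp = requests.post(f"{base_url}/prompt",
                         json=build_prompt_payload(positive, negative),
                         timeout=120)
    resp.raise_for_status()
    return resp.content
```

Each of Postman’s virtual users effectively loops this single-image call for the duration of the one-hour test.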
We did not always have the maximum requested replicas running, which is expected. SaladCloud only bills for running instances, so this simply means you should set your desired replica count marginally higher than what you actually need. We saw a small number of failed requests that coincided with node reallocations. This is expected, and you should handle this case in your application via retries.

RTX 4090 (24GB VRAM)

RTX 4080 (16GB VRAM)

RTX 3090 (24GB VRAM)

Shawn Rushefsky

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.



Fine tuning Stable Diffusion XL (SDXL) with interruptible GPUs and LoRA for low cost

It’s no secret that training image generation models like Stable Diffusion XL (SDXL) doesn’t come cheaply. The original Stable Diffusion model cost $600,000 USD to train, using hundreds of enterprise-grade A100 GPUs for more than 100,000 combined hours. Fast forward to today, and techniques like Parameter-Efficient Fine Tuning (PEFT) and Low-Rank Adaptation (LoRA) allow us to fine tune state-of-the-art image generation models like Stable Diffusion XL in minutes on a single consumer GPU. Using spot instances or community clouds like Salad reduces the cost even further. In this tutorial, we fine tune SDXL on a custom image of Timber, my playful Siberian Husky.

Benefits of spot instances to fine tune SDXL

Spot instances allow cloud providers to sell unused capacity at lower prices, usually in an auction format where users bid on unused capacity. Salad’s community cloud comprises tens of thousands of idle residential gaming PCs around the world. On AWS, an Nvidia A10G GPU with the g5.xlarge instance type costs $1.006/hr at on-demand pricing, but as low as $0.5389/hr at spot pricing. On Salad, an RTX 4090 (24GB) GPU with 8 vCPUs and 16GB of RAM costs only $0.348/hr. In our tests, we were able to train a LoRA for Stable Diffusion XL in 13 minutes on an RTX 4090, at a cost of just $0.0754. These low costs open the door for increasingly customized and sophisticated AI image generation applications.

Challenges of fine tuning Stable Diffusion XL with spot instances

There is one major catch, though: both spot instances and community cloud instances can be interrupted without warning, potentially wasting expensive training time. Additionally, both are subject to supply constraints. If AWS is able to sell all of its GPU instances at on-demand pricing, there will be no spot instances available. Since Salad’s network is residential, owned by individuals around the world, instances come online and offline throughout the day as people use their PCs.
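The training cost quoted above follows directly from the hourly rate; a quick back-of-the-envelope check:

```python
# Sanity check of the fine-tuning cost quoted above.
salad_4090_hourly = 0.348   # $/hr: RTX 4090 (24GB), 8 vCPUs, 16GB RAM on Salad
train_minutes = 13          # observed SDXL LoRA training time

cost = salad_4090_hourly * train_minutes / 60
print(f"${cost:.4f}")  # → $0.0754
```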
But with a few extra steps, we can take advantage of the huge cost savings of using interruptible GPUs for fine tuning.

Solutions to mitigate the impact of interrupted nodes

The #1 thing you can do to mitigate the impact of interrupted nodes is to periodically save checkpoints of the training progress to cloud storage, like Cloudflare R2 or AWS S3. This ensures that your training job can pick up where it left off if it gets terminated prematurely. This periodic checkpointing functionality is often offered out of the box by frameworks such as 🤗 Accelerate, and simply needs to be enabled via launch arguments. For example, using the Dreambooth LoRA SDXL script with accelerate, as we did, you might launch with arguments such as `--max_train_steps=500 --checkpointing_steps=50`. This indicates that we want to train for 500 steps and save a checkpoint every 50 steps, ensuring that we lose at most 49 steps of progress if a node gets interrupted. On an RTX 4090, that amounts to about 73 seconds of lost work. You may want to checkpoint more or less frequently, depending on how often your nodes get interrupted, storage costs, and other factors.

Once you’ve enabled checkpointing with these launch arguments, you need another process monitoring for the creation of these checkpoints and automatically syncing them to your preferred cloud storage. We’ve provided an example Python script that does this by launching accelerate in one thread, while another thread monitors the filesystem with watchdog and pushes files to S3-compatible storage using boto3. In our case, we used R2 instead of S3, because R2 does not charge egress fees.

Other considerations for SDXL fine tuning

The biggest callout here is to automate cleanup of your old checkpoints from storage. Our example script saves a checkpoint every 10% of progress, each of which is 66MB compressed. Even though the final LoRA we end up with is only 23MB, the total storage used during the process is 683MB.
It’s easy to see how storage costs could get out of hand if this were neglected for long enough. Our example script fires a webhook at each checkpoint, and another at completion. We set up a Cloudflare Worker to receive these webhooks and clean up resources as needed.

Additionally, while the open source tools are powerful and relatively easy to use, they are still quite complex, and the documentation is often minimal. I relied on YouTube videos and reading the code to figure out the various options for the SDXL LoRA training script. However, these open source projects are improving at an increasingly quick pace as they see wider adoption, so the documentation will likely improve. At the time of writing, the 🤗 Diffusers library had merged 47 pull requests from 26 authors in just the last 7 days.

Conclusions

Modern training techniques and interruptible hardware combine to offer extremely cost-effective fine tuning of Stable Diffusion XL. Open source training frameworks make the process approachable, although documentation could be improved. You can train a model of yourself, your pets, or any other subject in just a few minutes, at a cost of pennies. Training costs have plummeted over the last year, thanks in large part to the rapidly expanding open source AI community. The range of hardware capable of running these training tasks has greatly expanded as well. Many recent consumer GPUs can train an SDXL LoRA model in well under an hour, with the fastest taking just over 10 minutes.

