It’s no secret that training image generation models like Stable Diffusion XL (SDXL) doesn’t come cheaply. The original Stable Diffusion model cost $600,000 to train, using hundreds of enterprise-grade A100 GPUs for more than 100,000 combined hours. Fast forward to today, and parameter-efficient fine-tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) let us fine-tune state-of-the-art image generation models like SDXL in minutes on a single consumer GPU. Using spot instances or community clouds like SaladCloud reduces the cost even further.
In this tutorial, we fine-tune SDXL on custom images of Timber, my playful Siberian Husky.
Benefits of spot instances for fine-tuning SDXL
Spot instances allow cloud providers to sell unused capacity at lower prices, usually in an auction format where users bid on that spare capacity. Salad’s community cloud comprises tens of thousands of idle residential gaming PCs around the world.
On AWS, an Nvidia A10G GPU on the g5.xlarge instance type costs $1.006/hr at on-demand pricing, but as low as $0.5389/hr at “spot” pricing.
On SaladCloud, an RTX 4090 (24GB) GPU with 8 vCPUs and 16GB of RAM costs only $0.348/hr. In our tests, we were able to train a LoRA for Stable Diffusion XL in 13 minutes on an RTX 4090 at a cost of just $0.0754. These low costs open the door for increasingly customized and sophisticated AI image generation applications.
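(For reference, that figure follows directly from the hourly rate: 13 minutes is roughly 0.217 hours, and 0.217 hr × $0.348/hr ≈ $0.0754.)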
Challenges of fine-tuning Stable Diffusion XL with spot instances
There is one major catch, though: both spot instances and community cloud instances can be interrupted without warning, potentially wasting expensive training time. Additionally, both are subject to supply constraints. If AWS sells all of its GPU instances at on-demand pricing, there will be no spot instances available. And since SaladCloud’s network is made up of residential PCs owned by individuals around the world, instances come online and go offline throughout the day as people use their machines. But with a few extra steps, we can take advantage of the huge cost savings of interruptible GPUs for fine-tuning.
Solutions to mitigate the impact of interrupted nodes
The #1 thing you can do to mitigate the impact of interrupted nodes is to periodically save checkpoints of the training progress to cloud storage, like Cloudflare R2 or AWS S3. This ensures that your training job can pick up where it left off in the event it gets terminated prematurely. This periodic checkpointing functionality is often offered out-of-the-box by frameworks such as 🤗 Accelerate, and simply needs to be enabled via launch arguments. For example, using the Dreambooth LoRA SDXL script with accelerate, as we did, you might end up with arguments like this:
accelerate launch train_dreambooth_lora_sdxl.py \
  --max_train_steps=500 \
  --checkpointing_steps=50 \
  ...args
This indicates that we want to train for 500 steps and save a checkpoint every 50 steps, ensuring that we lose at most 49 steps of progress if a node gets interrupted. On an RTX 4090, that amounts to about 73 seconds of lost work. You may want to checkpoint more or less frequently, depending on how often your nodes get interrupted, your storage costs, and other factors.
Once you’ve enabled checkpointing with these launch arguments, you need another process to monitor the creation of these checkpoints and automatically sync them to your preferred cloud storage. We’ve provided an example Python script that does this by launching accelerate in one thread and using another thread to monitor the filesystem with watchdog, pushing new files to S3-compatible storage with boto3. In our case, we used R2 instead of S3, because R2 does not charge egress fees.
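The sketch below shows the general shape of that pattern, not the exact script we used: the output directory, bucket name, and credential environment variable names are placeholders, and a production version would need to make sure each checkpoint file has finished writing before uploading it. It also passes --resume_from_checkpoint=latest so a replacement node can pick up from the most recent checkpoint, assuming any previously synced checkpoints have been downloaded back into the output directory first.

"""
Minimal sketch, not the exact script we used: run the Dreambooth LoRA SDXL
training script with accelerate in one thread, and watch the output
directory for new checkpoint files in another, uploading them to
S3-compatible storage (R2 in our case) with boto3. The bucket name,
output directory, and environment variable names are placeholders.
"""
import os
import subprocess
import threading

import boto3
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

OUTPUT_DIR = "output"          # the --output_dir passed to the training script
BUCKET = "my-training-bucket"  # placeholder bucket name

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],  # e.g. your R2 endpoint
    aws_access_key_id=os.environ["S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["S3_SECRET_ACCESS_KEY"],
)


class CheckpointUploader(FileSystemEventHandler):
    """Upload any file written under a checkpoint-* directory."""

    def on_created(self, event):
        if event.is_directory or "checkpoint-" not in event.src_path:
            return
        # NOTE: a robust version should wait until the file is fully written
        # (e.g. debounce, or upload on the next checkpoint) before pushing it.
        key = os.path.relpath(event.src_path, OUTPUT_DIR)
        s3.upload_file(event.src_path, BUCKET, key)


def train():
    # --resume_from_checkpoint=latest lets a replacement node pick up where
    # the last one left off, assuming previously synced checkpoints have been
    # restored into OUTPUT_DIR before launch.
    subprocess.run(
        [
            "accelerate", "launch", "train_dreambooth_lora_sdxl.py",
            f"--output_dir={OUTPUT_DIR}",
            "--max_train_steps=500",
            "--checkpointing_steps=50",
            "--resume_from_checkpoint=latest",
            # ...other training args
        ],
        check=True,
    )


if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    observer = Observer()
    observer.schedule(CheckpointUploader(), OUTPUT_DIR, recursive=True)
    observer.start()

    trainer = threading.Thread(target=train)
    trainer.start()
    trainer.join()

    observer.stop()
    observer.join()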
Other considerations for SDXL fine-tuning
The biggest callout here is to automate cleanup of old checkpoints from storage. Our example script saves a checkpoint at every 10% of progress, each of which is 66MB compressed. Even though the final LoRA we end up with is only 23MB, the total storage used during the process is 683MB.
It’s easy to see how storage costs could get out of hand if this were neglected for long enough. Our example script fires a webhook at each checkpoint and another at completion. We set up a Cloudflare Worker to receive these webhooks and clean up resources as needed.
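Our actual handler runs as a Cloudflare Worker (JavaScript), but the cleanup logic itself is simple. The Python/boto3 sketch below shows roughly equivalent behavior, assuming a job's checkpoints are stored under a per-job prefix; the bucket and prefix names are placeholders.

"""
Illustration only: equivalent cleanup logic in Python with boto3 (our real
handler is a Cloudflare Worker). It deletes the intermediate checkpoint
objects for a finished job, keeping everything else (such as the final
LoRA weights). Bucket and prefix names are placeholders.
"""
import os

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["S3_SECRET_ACCESS_KEY"],
)


def delete_intermediate_checkpoints(bucket: str, job_prefix: str) -> None:
    """Remove every object under the job's checkpoint-* folders."""
    paginator = s3.get_paginator("list_objects_v2")
    doomed = []
    for page in paginator.paginate(Bucket=bucket, Prefix=job_prefix):
        for obj in page.get("Contents", []):
            if "checkpoint-" in obj["Key"]:
                doomed.append({"Key": obj["Key"]})

    # delete_objects accepts at most 1,000 keys per request
    for i in range(0, len(doomed), 1000):
        s3.delete_objects(Bucket=bucket, Delete={"Objects": doomed[i:i + 1000]})


if __name__ == "__main__":
    # e.g. invoked when the "training complete" webhook arrives
    delete_intermediate_checkpoints("my-training-bucket", "jobs/example-job/")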
Additionally, while the open-source tools are powerful and relatively easy to use, they are still quite complex, and the documentation is often minimal. I relied on YouTube videos and reading the code to figure out the various options for the SDXL LoRA training script. However, these open-source projects are improving rapidly as they see wider adoption, so the documentation will likely improve. At the time of writing, the 🤗 Diffusers library had merged 47 pull requests from 26 authors in the previous 7 days alone.
Conclusions
Modern training techniques and interruptible hardware combine to offer extremely cost-effective fine-tuning of Stable Diffusion XL. Open-source training frameworks make the process approachable, although their documentation could be improved.
You can train a model of yourself, your pets, or any other subject in just a few minutes at the cost of pennies. Training costs have plummeted over the last year, thanks in large part to the rapidly expanding open-source AI community. The range of hardware capable of running these training tasks has greatly expanded as well. Many recent consumer GPUs are capable of training an SDXL LoRA model in well under an hour, with the fastest taking just over 10 minutes.
Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its proceeds to artists.