Cost-effective Stable Diffusion fine tuning on Salad

Stable Diffusion XL (SDXL) fine tuning as a service

I recently wrote a blog about fine tuning Stable Diffusion XL (SDXL) on interruptible GPUs at low cost, starring my dog Timber. The strong results and exceptional cost performance got me wondering: what would it take to turn that into a fully managed Stable Diffusion training platform? I built one on SaladCloud to find out. The result? Costs as low as $0.00016179 per training step while successfully completing 1000/1000 training jobs.

Challenges in developing a Stable Diffusion API

There are a number of challenges involved in developing and deploying what is essentially a Stable Diffusion XL training API on a distributed cloud like Salad. Salad is a distributed cloud with more than a million individual PCs around the world connected to our network, and the GPUs on Salad are latent (otherwise idle) consumer Nvidia RTX/GTX-series cards. Our goal is for the service to be resilient at any kind of scale.

Architecture

To handle node interruptions and concurrent training, we built a simple orchestration API, with training compute handled by GPU worker nodes. Additionally, we set up a simple autoscaler using a scheduled Cloudflare Worker. Except for the pool of training nodes, the entire platform runs on Cloudflare serverless services. Heavily leveraging serverless technologies for the platform layer greatly reduces operational labor, makes the platform nearly free at rest, and lets it comfortably scale to handle significantly more load. Given sufficient continuous load, serverless applications do tend to be more expensive than the alternatives, so feel free to swap out components as desired. The design doesn't rely on any provider-specific features, so any SQL database and any key-value store would work just as well.

API components

GPU worker node components

Distributing work

To get work, worker nodes make a GET request to the API, including their machine ID as a query parameter. The API prioritizes handing out jobs that are in the running state but have stalled, as measured by a heartbeat timeout. It will also never hand a job out to a node where that job has previously failed. (A minimal sketch of this worker loop appears below, after the autoscaling section.)

Marking a job failed

Handling bad nodes

If a particular node has failed too many jobs, we want to reallocate it. Our first implementation did not take this into account, and one bad node marked 85% of the benchmark jobs as failed, just pulling and failing one job after another. We now run a scheduled Cloudflare Worker every 5 minutes to reallocate any node with more than the allowed number of failures.

Autoscaling the worker pool

Our scheduled Cloudflare Worker also handles scaling the worker cluster. It essentially tries to keep the number of replicas equal to the number of running and pending jobs, within configurable limits.
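To make the worker side of this concrete, here is a minimal sketch of the loop a GPU worker node runs: pull a job with your machine ID, emit heartbeats every 30 seconds, and report completion or failure. The endpoint paths, header and environment variable names, and the `run_training` stub are hypothetical placeholders for illustration, not the platform's actual API.

```python
import os
import threading
import time

import requests

API_URL = os.environ["TRAINING_API_URL"]     # orchestration API base URL (assumed variable name)
API_KEY = os.environ["TRAINING_API_KEY"]     # orchestration API key (assumed variable name)
MACHINE_ID = os.environ["SALAD_MACHINE_ID"]  # this node's machine ID (variable name assumed)
HEADERS = {"X-Api-Key": API_KEY}             # hypothetical auth header
HEARTBEAT_SECONDS = 30


def run_training(job: dict) -> None:
    """Placeholder: launch the diffusers SDXL training script with this job's
    parameters, resuming from the latest checkpoint if one exists."""
    raise NotImplementedError("wire this up to your training entrypoint")


def get_job() -> dict | None:
    """Ask the orchestration API for work, identifying this machine so it is
    never handed a job it has already failed."""
    resp = requests.get(f"{API_URL}/jobs",
                        params={"machine_id": MACHINE_ID}, headers=HEADERS)
    if resp.status_code == 204:
        return None  # nothing queued right now
    resp.raise_for_status()
    return resp.json()


def heartbeat_loop(job_id: str, stop: threading.Event) -> None:
    """Emit a heartbeat every 30s so the API can detect stalled jobs and
    hand them to another node."""
    while not stop.wait(HEARTBEAT_SECONDS):
        requests.post(f"{API_URL}/jobs/{job_id}/heartbeat",
                      params={"machine_id": MACHINE_ID}, headers=HEADERS)


def main() -> None:
    while True:
        job = get_job()
        if job is None:
            time.sleep(10)
            continue
        stop = threading.Event()
        threading.Thread(target=heartbeat_loop, args=(job["id"], stop),
                         daemon=True).start()
        try:
            run_training(job)
            outcome = "complete"
        except Exception:
            outcome = "fail"  # the API records which machine failed the job
        finally:
            stop.set()
        requests.post(f"{API_URL}/jobs/{job['id']}/{outcome}",
                      params={"machine_id": MACHINE_ID}, headers=HEADERS)


if __name__ == "__main__":
    main()
```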
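The scheduled job itself boils down to two small pieces of logic: reallocate any node that has exceeded the failure threshold, and set the replica count to the number of running plus pending jobs, clamped to configured limits. The real implementation is a scheduled Cloudflare Worker; the sketch below restates that logic in Python for consistency with the worker example above, with `reallocate` and `set_replicas` standing in for calls to the SaladCloud API and the threshold and bounds as assumed values.

```python
from dataclasses import dataclass

# Assumed configuration; the real values live in the scheduled Worker's config.
MAX_FAILURES_PER_NODE = 3
MIN_REPLICAS = 0
MAX_REPLICAS = 40


@dataclass
class QueueStats:
    """Snapshot of the job queue pulled from the job database."""
    running_jobs: int
    pending_jobs: int
    failures_by_machine: dict[str, int]  # machine_id -> count of failed jobs


def desired_replicas(stats: QueueStats) -> int:
    """Keep the worker pool roughly equal to the amount of outstanding work,
    within configurable limits."""
    wanted = stats.running_jobs + stats.pending_jobs
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))


def nodes_to_reallocate(stats: QueueStats) -> list[str]:
    """Any node that has failed too many jobs gets reallocated instead of
    being allowed to keep pulling and failing work."""
    return [
        machine_id
        for machine_id, failures in stats.failures_by_machine.items()
        if failures > MAX_FAILURES_PER_NODE
    ]


def run_scheduled_job(stats: QueueStats, reallocate, set_replicas) -> None:
    """One tick of the 5-minute scheduled job. `reallocate` and `set_replicas`
    are placeholders for SaladCloud API calls (reallocate a node, set the
    container group's replica count)."""
    for machine_id in nodes_to_reallocate(stats):
        reallocate(machine_id)
    set_replicas(desired_replicas(stats))
```

Keeping the policy as pure functions over a queue snapshot makes it easy to test independently of the database and the SaladCloud API.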
Observing a training run

The training script we used from diffusers has a built-in integration with Weights and Biases, a popular ML/AI training dashboard platform. It lets you qualitatively observe training progress, tracks your training arguments, monitors system usage, and more.

Deployment on Salad

Deploying on Salad is simple. The worker pattern means we don't need to enable inbound networking or configure any probes. The only environment configuration needed is a URL for the orchestration API, a key for the orchestration API, and (optionally) an API key for Weights and Biases.

Seeding the benchmark

To get a baseline idea of performance, we ran 1000 identical training jobs, each 1400 steps, with text encoder training. We skipped reporting samples to Weights and Biases for this benchmark. We let the autoscaler run between 0 and 40 nodes, each with 2 vCPUs, 16GB RAM, and an RTX 4090 GPU.

Visualizing a training run

Here's an example training job that got interrupted twice and was able to resume and complete training on a different node each time. The smaller marks are heartbeat events emitted by the worker every 30s, color coded by machine ID. We can see that this run sat in the queue for 5.4 hours before a worker picked it up, then ran for 54:00 of billable time, calculated as the number of heartbeats × 30s. Plugging that into the pricing calculator, we see a cost of $0.324/hour, so a total cost of 0.9 hours × $0.324/hour = $0.2916 to train the model and the text encoder for 1400 steps, which comes out to roughly $0.000208/step. The amount of time taken, and therefore the cost, varies greatly based on the parameters you use for training: training the text encoder slows things down, prior preservation slows things down, and more steps simply take longer. It's interesting to note that although the run was interrupted multiple times, those interruptions cost less than 4 minutes of clock time, and the run still finished in the median amount of time.

Results from the Stable Diffusion XL fine tuning

Tips and Observations

Future Improvements

Conclusions

Our exploration into fine-tuning Stable Diffusion XL on interruptible GPUs has demonstrated the feasibility and efficiency of our approach, despite the significant challenges posed by training interruptions, capacity limitations, and cost management. Leveraging Cloudflare's serverless technologies alongside our custom orchestration and autoscaling solutions, we've created a resilient and manageable system capable of handling large-scale operations with notable cost efficiency and operational simplicity.

The successes of our deployment, underscored by the seamless completion of 1000/1000 benchmark jobs, highlight the system's robustness and the potential for further improvement. Future enhancements, such as asynchronous validation and refined node performance assessments, promise to improve the performance and cost-effectiveness of the service.

Given the extensive amount of experimentation required to get good results, a platform like this can be useful for individuals as well as for those looking to build commercial offerings. Once deployed, a person could submit many different combinations of parameters, prompts, and training data, and run many experiments in parallel.

Resources

Shawn Rushefsky

Shawn Rushefsky is a passionate technologist and systems thinker with deep experience across a number of stacks. As Generative AI Solutions Architect at Salad, Shawn designs resilient and scalable generative AI systems to run on our distributed GPU cloud. He is also the founder of Dreamup.ai, an AI image generation tool that donates 30% of its