SaladCloud Blog


Exploring Distributed Data Parallel (DDP) Training on Consumer GPUs

Salad Technologies

Our team is currently developing utility applications for Salad Container Engine (SCE), a fully managed container orchestration platform designed to leverage the heterogeneous hardware that makes up our uniquely decentralized network. It’s distributed computing in its finest form.

One of the most exciting use cases we’re exploring is distributed data parallelism (DDP), a deep-learning paradigm in which networked GPUs train models by performing discrete operations on distributed subsets of larger data sets.

What Is Distributed Data Parallel (DDP) Training?

Unlike single-process data parallelism, where training operations occur sequentially on a single accelerator, distributed data parallelism allows developers to instantiate and coordinate processes across a cohort of networked machines. Each of those devices is then responsible for performing operations on some volume of data as a discrete worker process.

The main advantage of a DDP training configuration is a reduction in the length of a typical epoch. By distributing discrete segments of input data to multiple hardware accelerators (such as GPUs or TPUs), developers can effectively increase the batch size for a given operational phase (which in fact encompasses numerous identical processes hosted on different workers).
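As a minimal sketch of the idea in plain Python (the names below are illustrative, not any framework’s API): each worker receives a disjoint shard of the dataset, and the effective global batch size of one step is the per-worker batch size multiplied by the number of workers.

```python
# Illustrative sketch only: shard a dataset across workers the way a
# distributed sampler would, and compute the effective global batch size.
# All function names here are hypothetical.

def shard_dataset(dataset, world_size, rank):
    """Return the subset of `dataset` assigned to worker `rank` (strided split)."""
    return dataset[rank::world_size]

def effective_batch_size(per_worker_batch, world_size):
    """Each step processes one batch per worker, all in parallel."""
    return per_worker_batch * world_size

dataset = list(range(100))
world_size = 4  # e.g., four networked GPUs

shards = [shard_dataset(dataset, world_size, rank) for rank in range(world_size)]
assert sum(len(s) for s in shards) == len(dataset)  # every sample assigned exactly once
print(effective_batch_size(per_worker_batch=8, world_size=world_size))  # 32
```

In PyTorch, the same sharding role is played by `torch.utils.data.distributed.DistributedSampler`, which hands each rank its own strided slice of the dataset.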

DDP training modules can also be configured for use with model parallelized systems in which distributed workers perform different operations on the same data subsets, or to execute more complex computational graphs that incorporate aspects of both approaches. Though the latter design pattern is less common, distributing both data and operations across distinct worker layers can be ideal for advanced applications requiring sophisticated parallel computation.
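The contrast between the two paradigms can be sketched in a few lines of plain Python (hypothetical names throughout): in data parallelism every worker runs the same model on a different data shard, while in model parallelism each worker hosts a different stage of the model and the same data flows through them in sequence.

```python
# Toy contrast of data parallelism vs. model parallelism.
# Every name here is illustrative, not a framework API.

def stage1(x):          # e.g., an embedding layer hosted on worker A
    return x * 2

def stage2(x):          # e.g., a classifier head hosted on worker B
    return x + 1

# Model parallelism: one sample flows through both workers in sequence.
def model_parallel_forward(x):
    return stage2(stage1(x))

# Data parallelism: every worker holds the full model; each gets its own shard.
def data_parallel_forward(shards):
    full_model = lambda x: stage2(stage1(x))
    return [[full_model(x) for x in shard] for shard in shards]

print(model_parallel_forward(5))                 # 11
print(data_parallel_forward([[1, 2], [3, 4]]))   # [[3, 5], [7, 9]]
```

A hybrid graph of the kind described above would compose these: shards of data fanned out across replicas, each replica itself split into stages on separate workers.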

Constraints of DDP Training

No matter whether the training modules are constructed in PyTorch, TensorFlow, or some other machine learning framework, DDP configurations generally improve iteration efficiency, reduce per-process overhead, and shorten epochs. However, both data parallelism and model parallelism introduce unique performance constraints that cannot be overcome paradigmatically.

In data-parallel applications, workers must reconcile their gradients at each step, so training advances at the uniform pace of the slowest worker on the network. By contrast, computational overhead can be significantly greater in model parallelism and complex graphs because they require persistent communication between nodes.
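That reconciliation step is typically an all-reduce: each worker computes gradients on its own shard, the gradients are averaged across all workers, and every worker applies the same averaged update, keeping the replicas identical. A toy simulation in plain Python (all names hypothetical, workers simulated in one process):

```python
# Toy simulation of data-parallel gradient synchronization for a
# one-parameter model w fitting y = 3x by mean squared error.
# This is an illustrative sketch, not a framework API.

def local_gradient(w, shard):
    """Gradient of mean((w*x - y)^2) over this worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the collective all-reduce: average across workers."""
    return sum(values) / len(values)

data = [(x, 3.0 * x) for x in range(1, 9)]    # ground truth: y = 3x
world_size = 2
shards = [data[r::world_size] for r in range(world_size)]

weights = [0.0] * world_size                  # one model replica per worker
lr = 0.01
for _ in range(100):
    grads = [local_gradient(weights[r], shards[r]) for r in range(world_size)]
    g = all_reduce_mean(grads)                # every worker receives the same value
    weights = [w - lr * g for w in weights]   # identical update on every replica

assert all(w == weights[0] for w in weights)  # replicas stay perfectly in sync
assert abs(weights[0] - 3.0) < 1e-3           # converges to the true slope
```

Because every replica applies the same averaged gradient, no replica can drift ahead, which is exactly why the slowest worker sets the pace of the group.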

We’re investigating ways to leverage Salad’s intelligent workload-distribution plane to optimize DDP workflows executed in Salad Container Engine.

SaladCloud Architecture

For four years, our team has been perfecting a platform that turns latent consumer compute resources into available cloud infrastructure. As the world’s largest decentralized computing cluster, the Salad network has achieved twice the peak processing rate of the recently unveiled Polaris supercomputer. We owe this success to our voluntary compute-sharing community.¹

Salad’s control plane was designed to access, index, and distribute compute tasks² to target node cohorts. Salad Container Engine works in tandem with these systems to instantiate container replicas, secure and manage simultaneous processes, and dynamically access machine-level GPU resources when required by a given workload.

With our current systems (and an ever-growing corpus of heterogeneous hardware), operations can be distributed to optimal hardware workers based on the anticipated degree of computational expense, while Salad’s node selection matrix can minimize latency and timeouts between worker processes by isolating geographically adjacent hosts. Salad Container Engine will soon be able to distribute tensors to the most trustworthy nodes and maintain healthy training populations at runtime.
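A purely illustrative sketch of node selection in the spirit described above (none of this reflects Salad’s actual selection matrix; every name and weight is hypothetical): nodes that cannot host the workload are excluded outright, and the remainder are ranked by trust weighted by geographic proximity.

```python
# Hypothetical node-scoring pass. Illustrative only; not Salad's real logic.

def score(node, required_vram_gb, target_region):
    """Higher is better: prefer capable, trusted, nearby nodes."""
    if node["vram_gb"] < required_vram_gb:
        return float("-inf")                 # cannot host the workload at all
    proximity = 1.0 if node["region"] == target_region else 0.2
    return node["trust"] * proximity

nodes = [
    {"id": "a", "vram_gb": 8,  "region": "us-east", "trust": 0.9},
    {"id": "b", "vram_gb": 24, "region": "eu-west", "trust": 0.95},
    {"id": "c", "vram_gb": 24, "region": "us-east", "trust": 0.8},
]

cohort = sorted(nodes, key=lambda n: score(n, 16, "us-east"), reverse=True)
print([n["id"] for n in cohort])   # ['c', 'b', 'a']; 'a' lacks the VRAM
```

Grouping the top-ranked nodes by region is one simple way a scheduler could keep worker processes geographically adjacent and minimize inter-node latency.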

Our team is now investigating how to synchronize buffers and gradients to facilitate DDP modeling across a worldwide cloud network. The next step is designing deep-learning containers to efficiently manage backward computations between Salad nodes and our customers’ APIs. In time, we hope to teach Salad to asynchronously reconcile fast and slow learning processes for a variety of multi-GPU applications using PyTorch and TensorFlow.

We look forward to sharing more with our developer community in the months ahead. To stay up to date with the latest SaladCloud features and product updates, please consider subscribing to our newsletter.

  1. The Salad desktop application rewards voluntary participants for their contributions to a set of baseline processing workloads. Internal trust rating systems further reward “virtuous” nodes (i.e., those with performant hardware, reliable connections, longer availability durations, or ideally all three) by elevating their priority ranking in queues for more profitable workloads.
  2. Our baseline workloads include cryptomining libraries and other open-source executables that rely on trustless blockchain protocols or cryptographic equations. By providing built-in participation incentives, these seminal workloads allowed us to grow and persist our network while designing our state-of-the-art orchestration systems.


