
Molecular Simulation: OpenMM Benchmark on 25 consumer GPUs, 95% less cost

Salad Technologies

Benchmarking OpenMM for Molecular Simulation on consumer GPUs

OpenMM is one of the most popular toolkits for molecular dynamics simulations, renowned for its high performance, flexibility, and extensibility. It lets users easily incorporate new features, such as novel forces, integration algorithms, and simulation protocols, and it runs efficiently on both CPUs and GPUs. In this analysis, we benchmark OpenMM on SaladCloud’s network of AI-enabled consumer GPUs using typical biochemical systems. We analyze simulation speed and cost-effectiveness in each case and discuss how to build high-performance, reliable molecular simulation workloads on SaladCloud. This approach supports unlimited throughput and offers over 95% cost savings compared to solutions based on data center GPUs.

Why run Molecular Simulations on GPUs?

GPUs have a high degree of parallelism, which means they can perform many calculations simultaneously. This is particularly useful for molecular simulations, which involve a large number of repetitive calculations, such as evaluating forces between atoms. Using GPUs can significantly accelerate molecular simulations, offering nearly real-time feedback and allowing researchers to run more simulations in less time. This enhanced efficiency accelerates the pace of discovery and lowers computational costs.

OpenMM benchmark methodology

The OpenMM team has provided benchmarking code in Python, along with benchmarks of simulation speed for typical biochemical systems based on OpenMM 8.0. To conduct the benchmarking test, you can run the following scripts on the target environment:

# CPU test using typical biochemical systems

python benchmark.py --platform=CPU --seconds=60 --style=table --test=pme,apoa1pme,amber20-cellulose,amber20-stmv,amoebapme

# GPU test using typical biochemical systems

python benchmark.py --platform=CUDA --seconds=60 --style=table --test=pme,apoa1pme,amber20-cellulose,amber20-stmv,amoebapme

Following the OpenMM benchmarks, we used OpenMM 8.0 with CUDA 11.8 to build the container image. When the container runs on SaladCloud, it executes the benchmarking code, reports the test data to an AWS DynamoDB table, and then exits. The data is later downloaded and analyzed with Pandas in JupyterLab.
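The reporting code itself is not shown in this post; the following is a minimal sketch of how a single measurement could be written to DynamoDB with boto3. The table name and key schema (node_id, test) are assumptions for illustration, not the actual schema we used.

# Hypothetical reporting sketch: write one benchmark measurement to DynamoDB.
# The table name and attribute names are illustrative assumptions.
import boto3
from decimal import Decimal

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("openmm-benchmark-results")  # assumed table name

def report_result(node_id: str, gpu_type: str, test: str, ns_per_day: float) -> None:
    """Record one benchmark result for the current Salad node."""
    table.put_item(Item={
        "node_id": node_id,                       # assumed partition key
        "test": test,                             # assumed sort key, e.g. "pme"
        "gpu_type": gpu_type,
        "ns_per_day": Decimal(str(ns_per_day)),   # DynamoDB numbers must be Decimal, not float
    })

report_result("node-abc123", "RTX 4090", "pme", 1234.5)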

We primarily focused on two key performance indicators across three scenarios:

ns/day stands for nanoseconds per day. It measures simulation speed, indicating how many nanoseconds of simulated time can be computed in one day of real time. 

ns/dollar stands for nanoseconds per dollar. It is a measure of cost-effectiveness, showing how many nanoseconds of simulated time can be computed for one dollar.

Molecular simulations often operate on the timescale of nanoseconds to microseconds, as molecular motions and interactions occur very rapidly.
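The two metrics are linked through the hourly price of the resource. As a quick illustration (the numbers below are made up, not benchmark results):

# Illustrative only: deriving ns/dollar from ns/day and the hourly price.
def ns_per_dollar(ns_per_day: float, price_per_hour: float) -> float:
    """Nanoseconds of simulated time obtained per dollar spent."""
    return ns_per_day / (price_per_hour * 24)

# A GPU simulating 120 ns/day at $0.10/hour yields 50 ns per dollar.
print(ns_per_dollar(120, 0.10))  # 50.0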

Below are the three scenarios and the methods used to collect data and calculate the final results:

Scenario: CPUs
Resource: 16 vCPUs, 8 GB RAM
Simulation speed (ns/day): Create a container group with 10 CPU-only instances on SaladCloud and run it for a few hours. Once the code execution finishes on an instance, SaladCloud allocates a new node and continues running the instance. Collect test data from tens of unique Salad nodes to calculate the average performance.
Cost effectiveness (ns/dollar): Pricing from the Salad Price Calculator (https://salad.com/pricing): $0.040/hour for 8 vCPUs, 8 GB RAM; $0.072/hour for 16 vCPUs, 8 GB RAM; $0.02 ~ $0.30/hour for different GPU types.

Scenario: Consumer GPUs
Resource: 8 vCPUs, 8 GB RAM, 20+ GPU types
Simulation speed (ns/day): Create a container group with 100 instances across all GPU types on SaladCloud and run it for a few hours. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type, and calculate the average performance for each GPU type.
Cost effectiveness (ns/dollar): Pricing from the Salad Price Calculator (https://salad.com/pricing): $0.040/hour for 8 vCPUs, 8 GB RAM; $0.072/hour for 16 vCPUs, 8 GB RAM; $0.02 ~ $0.30/hour for different GPU types.

Scenario: Datacenter GPUs
Resource: A100, H100
Simulation speed (ns/day): Use the test data published in the OpenMM benchmarks.
Cost effectiveness (ns/dollar): Pricing from AWS EC2 Capacity Blocks (https://aws.amazon.com/ec2/capacityblocks/pricing/): $1.844/hour for one A100; $4.916/hour for one H100.
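For the analysis step, a short Pandas snippet along these lines can aggregate the per-node measurements; the file and column names here are assumptions rather than the exact schema used for this post.

# Hypothetical analysis sketch: average ns/day per GPU type and test, then ns/dollar.
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # assumed columns: gpu_type, test, ns_per_day, price_per_hour

summary = (df.groupby(["gpu_type", "test"], as_index=False)
             .agg(ns_per_day=("ns_per_day", "mean"),
                  price_per_hour=("price_per_hour", "first")))
summary["ns_per_dollar"] = summary["ns_per_day"] / (summary["price_per_hour"] * 24)

print(summary.sort_values("ns_per_dollar", ascending=False).head(10))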

It is worth mentioning that performance can be influenced by many factors, such as operating systems (Windows, Linux, or WSL) and their versions, CPU models, GPU models and driver versions, CUDA framework versions, OpenMM versions, and additional features enabled in the runtime environment. Differences between our results and benchmarks published elsewhere are therefore expected.

Benchmark Results

Here are five typical biochemical systems used to benchmark OpenMM 8.0, along with the corresponding test scripts:

Model 1: Dihydrofolate Reductase (DHFR), Explicit-PME
Description: A 159-residue protein with 2489 atoms. The version used for explicit solvent simulations includes 7023 TIP3P water molecules, giving a total of 23,558 atoms. All simulations used the AMBER99SB force field and a Langevin integrator.
Test script: python benchmark.py --platform=CUDA (or --platform=CPU) --seconds=60 --test=pme

Model 2: Apolipoprotein A1 (ApoA1), PME
Description: 392 protein residues, 160 POPC lipids, and 21,458 water molecules, for a total of 92,224 atoms. All simulations used the AMBER14 force field.
Test script: python benchmark.py --platform=CUDA (or --platform=CPU) --seconds=60 --test=apoa1pme

Model 3: Cellulose, PME
Description: A set of cellulose molecules (91,044 atoms) solvated with 105,855 water molecules, for a total of 408,609 atoms.
Test script: python benchmark.py --platform=CUDA (or --platform=CPU) --seconds=60 --test=amber20-cellulose

Model 4: Satellite Tobacco Mosaic Virus (STMV), PME
Description: 8820 protein residues, 949 RNA bases, 300,053 water molecules, and 649 sodium ions, for a total of 1,067,095 atoms.
Test script: python benchmark.py --platform=CUDA (or --platform=CPU) --seconds=60 --test=amber20-stmv

Model 5: AMOEBA DHFR, PME
Description: Full mutual polarization was used, with induced dipoles iterated until they converged to a tolerance of 1e-5.
Test script: python benchmark.py --platform=CUDA (or --platform=CPU) --seconds=60 --test=amoebapme

Model 1: Dihydrofolate Reductase (DHFR), Explicit-PME

Model 2: Apolipoprotein A1 (ApoA1), PME

Model 3: Cellulose, PME

Model 4: Satellite Tobacco Mosaic Virus (STMV), PME

Model 5: AMOEBA DHFR, PME

Observations from the OpenMM GPU benchmarks

Here are some interesting observations from the benchmark results:

The VRAM usage for all simulations is only 1-2 GB, which means nearly all platforms (CPU-only or GPU) and all GPU types can theoretically be utilized to run these models.

For all models, the simulation speed on GPUs is dramatically higher than on CPUs, ranging from roughly a hundred times faster in Model 1 to more than ten thousand times faster in Model 5.

In general, high-end GPUs outperform low-end GPUs in terms of simulation speed. However, the flagship model of a given GPU family often surpasses the low-end models of the next family.

As models become more complex with additional molecules and atoms, the performance differences between low-end GPUs and high-end GPUs become more pronounced, ranging from a few times to tens of times. For example, in Model 1, the H100 outperforms the 1650 by 530%, while in Model 3, the 4090 surpasses the 1650 by 2000%.

In 3 out of 5 models, the consumer GPUs (4080, 4090) outperform the data center GPUs (A100, H100).

CPU-only platforms are not suitable for molecular simulations, not only because of their extremely low performance but also due to their high cost relative to that performance.

The cost effectiveness of data center GPUs (A100, H100) is also very low for molecular simulation jobs, close to that of CPU-only platforms. In Models 1 and 5, the 1080 Ti surpasses the H100 by 2500% in terms of cost effectiveness.

The fact that molecular simulation jobs require little VRAM doesn’t mean low-end GPUs are always the better choice for cost savings compared to high-end GPUs. For example, the most cost-effective GPU in Models 1, 2, and 5 is the 1080 Ti, while the 4090 ranks as the top performer in Models 3 and 4.

Given the complexity and variety of these biochemical systems, the best-performing GPU and the most cost-effective GPU for each model might differ. Before deploying your workloads extensively on SaladCloud, we recommend conducting a quick benchmarking test to identify the optimal resource types for your models. This will allow you to optimize your workloads for reduced costs, improved performance, or both.

Reference Design for Molecular Simulation Workloads on SaladCloud

Molecular simulations typically involve running thousands of jobs across various models and parameters. Each job can take several hours or more, depending on the complexity and the selected resource types. To enhance the system throughput, reliability and efficiency, the proposed batch-processing architecture for molecular simulations comprises the following components:

  • Cloud Storage: The initial input, intermediate checkpoint, and final output of each job are stored in Cloudflare R2, which is AWS S3 compatible and incurs zero egress fees, organized either as three separate folders or as a single folder with different files.
  • Job Queue System: A job queue, such as Salad Job Queue, Salad Kelpie, or AWS SQS, manages and holds the jobs for the Salad nodes to retrieve and process.
  • Job Creator: After creating the folder(s) in Cloud Storage and uploading the initial input (such as an OpenMM PDB file that defines the molecular structure of a biochemical system), it wraps this information into a job and submits the job to the Job Queue (see the sketch after this list). The Job Creator can provide a webhook for a callback when the job is done, or check the job status and result directly. Once the job is completed, the Job Creator can download the job output for further analysis and utilization.
  • GPU Resource Pool: Hundreds of Salad nodes equipped with dedicated GPUs are used to pull a job, download its input and checkpoint, run the simulation, upload the checkpoint regularly, and upload the final result when the job is done.
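To make the Job Creator concrete, here is a minimal sketch that uploads the initial input and enqueues a job message with AWS SQS. The bucket name, queue URL, folder layout, and message fields are illustrative assumptions.

# Hypothetical Job Creator sketch: upload the input PDB and enqueue a job message.
import json
import boto3

s3 = boto3.client("s3")    # point endpoint_url at Cloudflare R2 if using R2
sqs = boto3.client("sqs")

BUCKET = "simulation-jobs"                                                   # assumed bucket
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/openmm-jobs"   # assumed queue

def submit_job(job_id: str, pdb_path: str) -> None:
    """Upload the initial input and publish a job for the Salad nodes to pick up."""
    s3.upload_file(pdb_path, BUCKET, f"{job_id}/input/input.pdb")
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "job_id": job_id,
            "input": f"s3://{BUCKET}/{job_id}/input/input.pdb",
            "checkpoint_prefix": f"s3://{BUCKET}/{job_id}/checkpoint/",
            "output_prefix": f"s3://{BUCKET}/{job_id}/output/",
            "webhook": "https://example.com/jobs/callback",  # optional completion callback
        }),
    )

submit_job("job-0001", "dhfr.pdb")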

SaladCloud operates on a foundation of shared nodes, where a node might go down while a job is running. In such cases, a new node will be allocated, and the unfinished job will be re-queued in the Job Queue. To prevent a job from being processed from scratch after node reallocation, the code needs to include some logic:

Step 1: When retrieving a job, first download both the job input and checkpoint from the cloud; if there is no checkpoint, run the simulation from scratch; otherwise, continue the job execution from the last checkpoint.  

Step 2: After completing a batch of simulation steps or running for a certain period (such as 10 minutes), save the latest checkpoint and upload it to the cloud.

Step 3: When the job is finished, upload the final result to the cloud and return the status to the Job Queue.
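
Below is a minimal sketch of Steps 1 to 3 using OpenMM’s checkpoint API and boto3. The bucket, job ID, force field, and integrator settings are assumptions for illustration, not the exact configuration of the benchmark models.

# Hypothetical resume-from-checkpoint sketch for one job on a Salad node.
import boto3
from botocore.exceptions import ClientError
from openmm import LangevinMiddleIntegrator
from openmm.app import PDBFile, ForceField, Simulation, PME, HBonds
from openmm.unit import kelvin, picosecond, picoseconds, nanometer

BUCKET, JOB_ID, CHECKPOINT = "simulation-jobs", "job-0001", "state.chk"  # assumed names
s3 = boto3.client("s3")

# Build the system from the job input.
pdb = PDBFile("input.pdb")
forcefield = ForceField("amber14-all.xml", "amber14/tip3p.xml")
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME,
                                 nonbondedCutoff=1.0 * nanometer, constraints=HBonds)
integrator = LangevinMiddleIntegrator(300 * kelvin, 1 / picosecond, 0.004 * picoseconds)
simulation = Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)

# Step 1: resume from the last checkpoint if one exists in Cloud Storage.
try:
    s3.download_file(BUCKET, f"{JOB_ID}/checkpoint/{CHECKPOINT}", CHECKPOINT)
    simulation.loadCheckpoint(CHECKPOINT)
except ClientError:
    simulation.minimizeEnergy()  # no checkpoint yet: start from scratch

# Step 2: run in chunks and checkpoint to the cloud after each chunk.
for _ in range(100):
    simulation.step(50_000)
    simulation.saveCheckpoint(CHECKPOINT)
    s3.upload_file(CHECKPOINT, BUCKET, f"{JOB_ID}/checkpoint/{CHECKPOINT}")

# Step 3: upload the final result and report completion to the Job Queue (not shown).
s3.upload_file(CHECKPOINT, BUCKET, f"{JOB_ID}/output/final_{CHECKPOINT}")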

Molecular simulations are typically compute-bound tasks, and the sizes of the input, checkpoint, and output files are small (a few MB), resulting in minimal upload and download times. Implementing a dedicated I/O thread to regularly upload the checkpoint can help eliminate any impact on simulation throughput.
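The dedicated I/O thread can be as simple as a queue feeding a background uploader, as sketched below; in practice you would also rotate checkpoint filenames so an in-flight upload never races with the next save.

# Sketch of a background uploader so checkpoint uploads do not block the simulation loop.
import queue
import threading
import boto3

s3 = boto3.client("s3")
upload_queue: queue.Queue = queue.Queue()

def uploader(bucket: str) -> None:
    while True:
        local_path, key = upload_queue.get()
        s3.upload_file(local_path, bucket, key)
        upload_queue.task_done()

threading.Thread(target=uploader, args=("simulation-jobs",), daemon=True).start()

# In the main loop, enqueue the checkpoint instead of uploading it inline:
# upload_queue.put(("state.chk", "job-0001/checkpoint/state.chk"))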

You can implement the proposed solution using Salad Job Queue or AWS SQS. 

If you don’t want to deal with data synchronization in your code (such as downloading the input and checkpoint, or uploading the checkpoint and output), Salad Kelpie can be a good choice. It provides a job queue service where you can submit a job using its APIs. The Kelpie worker, integrated with your workload running on Salad nodes, retrieves a job, executes your code, and handles all data synchronization between local folders/files and S3-compatible Cloud Storage in the background for you.

Here is an example of an OpenMM-based simulation application integrated with the Kelpie worker:

https://github.com/SaladTechnologies/openmm

SaladCloud: The right choice for Molecular Dynamics Simulations

The OpenMM GPU benchmark shows that consumer GPUs are the right choice for running molecular simulation jobs, offering performance comparable to data center GPUs while delivering over 95% cost savings. With the right design and implementation, we can build a high-throughput, reliable, and cost-effective system on SaladCloud to run simulation jobs at very large scale.

If your needs involve running millions of simulation jobs in a short time or require access to hundreds of GPUs or various GPU types within minutes for testing or research purposes, SaladCloud is the ideal platform.

Want to try SaladCloud for Molecular Simulation?

Deploy on Salad

Have questions about enterprise pricing for SaladCloud?

Book a 15 min call with our team.

