Molecular Simulation: GROMACS Benchmark on 30 GPUs on SaladCloud, 90+% Cost Savings
Benchmarking GROMACS for Molecular Simulation on consumer GPUs

In this deep dive, we benchmark GROMACS on SaladCloud, analyzing simulation speed and cost-effectiveness across a spectrum of molecular systems: small, medium, and large. We also provide recommendations for selecting the most appropriate resource types for various workloads on SaladCloud.

Building on the OpenMM benchmark on SaladCloud and our continuous efforts to optimize system architecture and batch job implementation, we have achieved a 90% cost savings by using consumer GPUs for molecular simulations with GROMACS, compared to CPUs and data center GPUs.

GROMACS is a highly optimized, open-source software package for molecular dynamics simulations. Researchers in fields such as biochemistry, biophysics, and materials science use it widely to study the physical movements of atoms and molecules over time. GROMACS stands out for its exceptional performance compared to other programs, efficiently leveraging both CPU and GPU resources. This capability enables effective static and dynamic load balancing across the system's various components.

GROMACS benchmark methodology

gmx mdrun is the main computational chemistry engine within GROMACS; it performs the molecular dynamics simulations in the target environment. The mdrun program reads an input TPR file (-s), which contains the initial molecular topology and parameters, and produces several output files (-deffnm) with different extensions for logs, trajectories, structures, and energies. GROMACS relies on close collaboration between the CPU and GPU to achieve optimal performance.
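A representative mdrun invocation consistent with the offload and threading flags used in this benchmark would look as follows; the file names are placeholders, and the exact command line from our test harness is not reproduced here:

```shell
# Run a simulation from a prepared TPR file, offloading nonbonded, PME,
# bonded, and coordinate-update work to the GPU, with 1 thread-MPI rank
# and 8 OpenMP threads ("benchmark.tpr"/"benchmark" are placeholder names).
gmx mdrun -s benchmark.tpr -deffnm benchmark \
    -nb gpu -pme gpu -bonded gpu -update gpu \
    -ntmpi 1 -ntomp 8
```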
Although many calculations can be offloaded to the GPU using the options -nb, -pme, -bonded, and -update, the program still demands considerable CPU processing power and multiple threads for task management, communication, and I/O operations. To fully utilize a powerful GPU, GROMACS also depends on robust CPU performance. While running more OpenMP threads than the number of physical cores can be beneficial in certain situations, for our benchmark we selected only Salad nodes with CPUs that have 8 or more cores and configured each node to run 8 OpenMP threads (-ntmpi, -ntomp).

We used GROMACS 2024.1 with CUDA 11.8 to build the container image. When running on SaladCloud, the container first runs the simulations against typical molecular systems, reports the test data to an AWS DynamoDB table, and then exits. Finally, the data is downloaded and analyzed using Pandas on JupyterLab.

Two key performance indicators are collected and analyzed during the test:

- ns/day (nanoseconds per day) measures simulation speed: how many nanoseconds of simulated time can be computed in one day of real time.
- ns/dollar (nanoseconds per dollar) measures cost-effectiveness: how many nanoseconds of simulated time can be computed for one dollar.

Below are the two scenarios and the methods used to collect data and calculate the final results:

Scenario 1: Consumer GPUs on SaladCloud
- Resources: 8 CPU cores for 8 OpenMP threads; 30 GPU types.
- Method: Create a container group with 100 instances spanning all GPU types on SaladCloud and run it for a few hours. Once code execution finishes on an instance, SaladCloud allocates a new node and continues running the instance. Collect test data from thousands of unique Salad nodes, ensuring sufficient samples for each GPU type, then calculate the average performance for each GPU type.
- Pricing, from the Salad Price Calculator (https://salad.com/pricing): $0.072/hour for 16 vCPUs with 8 GB RAM, plus $0.015-$0.18/hour depending on GPU type (Priority: Batch).

Scenario 2: Data center GPUs
- Resources: 16 CPU cores for 16 OpenMP threads; A40 48GB, A100 40GB, and H100 80GB.
- Method: Use the test data in the GROMACS benchmarks by NHR@FAU.
- Pricing: the lowest prices from the data center GPU market (https://getdeploying.com/reference/cloud-gpu) that closely match the resource requirements: $1.86/hour for A40 (24 vCPUs), $1.29/hour for A100 (30 vCPUs), and $2.99/hour for H100 (30 vCPUs).

It is worth mentioning that performance can be influenced by many factors, such as operating systems (Windows, Linux, or WSL) and their versions, CPU models, GPU models and driver versions, CUDA framework versions, GROMACS versions, and additional features enabled in the runtime environment. It is common to see different results between our benchmarks and those of others.

Benchmark Results

Six typical biochemical systems were used to benchmark GROMACS:

1. R-143a in hexane (20,248 atoms) with very high output rate - Small
2. A short RNA piece with explicit water (31,889 atoms) - Small
3. A protein inside a membrane surrounded by explicit water (80,289 atoms) - Medium
4. A protein in explicit water (170,320 atoms) - Medium
5. A protein membrane channel with explicit water (615,924 atoms) - Large
6. A huge virus protein (1,066,628 atoms) - Large

Model 1: R-143a in hexane (20,248 atoms) with very high output rate

Model 2: A short RNA piece with explicit water (31,889 atoms)

Model 3: A protein inside a membrane surrounded by explicit water (80,289 atoms)

Model 4: A protein in explicit water (170,320 atoms)

Model 5: A protein membrane channel with explicit water (615,924 atoms)

Model 6: A huge virus protein (1,066,628 atoms)

Observations from the GROMACS benchmark

Here are some interesting observations from the GROMACS benchmarks:

- The VRAM usage for all simulations is only 1-2 GB, which means nearly all GPU types can
theoretically be utilized to run these models.
- GROMACS primarily utilizes the CUDA cores of GPUs (not Tensor Cores) and typically operates in single precision (FP32).
- High-end GPUs generally outperform low-end models in simulation speed due to their greater number of CUDA cores and higher memory bandwidth. However, the flagship model of a GPU generation often surpasses the low-end models of the following generation.
- For smaller models, GPUs are often underutilized, and communication between the CPU and GPU can become a bottleneck, making CPU performance a critical factor in overall system performance. On nodes with GPUs of similar performance, higher CPU clock speeds and more physical cores usually lead to better performance. Data center GPUs are typically paired with more powerful CPUs that have additional cores, allowing them to run GROMACS significantly faster than consumer GPUs in Models 1 and 2.
- Large models can fully exploit the vast number of CUDA cores on high-end GPUs.
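The final analysis step described in the methodology (averaging results per GPU type and deriving ns/dollar from ns/day and the hourly price) can be sketched with Pandas. The records, GPU names, and prices below are illustrative placeholders, not measured benchmark data:

```python
# Sketch of the Pandas analysis step: aggregate per-GPU-type results
# reported from Salad nodes and derive cost-effectiveness (ns/dollar).
# All numbers here are illustrative placeholders, not benchmark results.
import pandas as pd

# Each record represents one benchmark run reported from a Salad node.
records = [
    {"gpu_type": "RTX 3060", "model": "Model 4", "ns_per_day": 60.0, "price_per_hour": 0.10},
    {"gpu_type": "RTX 3060", "model": "Model 4", "ns_per_day": 64.0, "price_per_hour": 0.10},
    {"gpu_type": "RTX 4090", "model": "Model 4", "ns_per_day": 150.0, "price_per_hour": 0.18},
]
df = pd.DataFrame(records)

# Total hourly cost = GPU price + CPU/RAM price ($0.072/hour for 16 vCPUs, 8 GB RAM).
CPU_RAM_PRICE = 0.072
df["cost_per_hour"] = df["price_per_hour"] + CPU_RAM_PRICE

# ns/dollar: nanoseconds of simulated time obtained for one dollar spent.
df["ns_per_dollar"] = df["ns_per_day"] / (24 * df["cost_per_hour"])

# Average both metrics across all samples of each GPU type.
summary = df.groupby("gpu_type")[["ns_per_day", "ns_per_dollar"]].mean().round(2)
print(summary)
```

With enough samples per GPU type (thousands of unique nodes in our run), this per-type average smooths out node-to-node variation in CPU models and driver versions.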