Physical motivation
Running molecular dynamics (MD) simulations at elevated temperatures or pressures is one of the most demanding practical uses of an MLIP. A model that looks good on static benchmarks can fail catastrophically in MD: producing unbounded forces, violating energy conservation, or crashing within picoseconds of simulation time. This benchmark quantifies the simulation survival rate, that is, the fraction of MD runs a model completes without failing. It also measures inference speed (steps per second as a function of system size), which determines the practical cost of using a model for long-timescale simulations.

Two protocols are tested:
- Heating (NVT): isochoric-isothermal molecular dynamics with a temperature ramp from 300 K to 3000 K over 10 ps
- Compression (NPT): isothermal-isobaric molecular dynamics with simultaneous temperature (300 K → 3000 K) and pressure (0 GPa → 500 GPa) ramps over 10 ps
Structures tested
Simulations are run on structures from the RM24 dataset, which contains a diverse set of inorganic crystal structures spanning multiple chemical families.
Temperature and pressure ranges
| Protocol | Ensemble | Temperature range | Pressure range | Duration |
|---|---|---|---|---|
| Heating | NVT (Nosé-Hoover) | 300 K → 3000 K | N/A | 10 ps |
| Compression | NPT (Nosé-Hoover) | 300 K → 3000 K | 0 → 500 GPa | 10 ps |
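
To make the ramps in the table concrete, here is a minimal sketch of the target temperature and pressure as a function of simulation time. It assumes the ramps are linear in time, which the page does not state explicitly; the helper names `linear_ramp`, `heating_target`, and `compression_target` are hypothetical and are not part of the benchmark code.

```python
def linear_ramp(start: float, end: float, t_ps: float, duration_ps: float = 10.0) -> float:
    """Target value at time t_ps for a ramp from start to end.

    Assumption: the ramp is linear in time over duration_ps.
    """
    # Clamp the elapsed fraction to [0, 1] so queries outside the run stay bounded.
    frac = min(max(t_ps / duration_ps, 0.0), 1.0)
    return start + (end - start) * frac


def heating_target(t_ps: float) -> float:
    """Target temperature (K) for the NVT heating protocol: 300 K to 3000 K."""
    return linear_ramp(300.0, 3000.0, t_ps)


def compression_target(t_ps: float) -> tuple[float, float]:
    """Target (temperature in K, pressure in GPa) for the NPT compression protocol."""
    return linear_ramp(300.0, 3000.0, t_ps), linear_ramp(0.0, 500.0, t_ps)
```

Under this linear assumption, the midpoint of a heating run (5 ps) targets 1650 K, and the midpoint of a compression run targets 1650 K and 250 GPa.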
Metrics
| Metric | Description |
|---|---|
| Valid runs (%) | Percentage of simulations that complete the full duration without crashing |
| Normalized final step | Final simulation step divided by total target steps — a continuous survival proxy |
| Steps per second | MD throughput as a function of number of atoms (log-log scale) |
| Power-law scaling exponent | Fitted exponent n where steps/s ∝ N⁻ⁿ — measures how throughput degrades with system size |
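
The two survival metrics in the table can be sketched as follows. This is an illustration of the definitions only, not the benchmark's own analysis code: the `RunRecord` type and its field names are hypothetical, and a run is assumed to be valid when it reaches the target step count.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical summary of one MD run (field names are assumptions)."""
    final_step: int    # last step reached before the run ended
    target_steps: int  # number of steps the run was asked to complete


def normalized_final_step(run: RunRecord) -> float:
    """Final step divided by target steps, capped at 1.0."""
    return min(run.final_step / run.target_steps, 1.0)


def valid_run_percentage(runs: list[RunRecord]) -> float:
    """Percentage of runs that reached the full target duration."""
    if not runs:
        raise ValueError("at least one run is required")
    completed = sum(1 for r in runs if r.final_step >= r.target_steps)
    return 100.0 * completed / len(runs)


# Made-up example values, not benchmark results.
example_runs = [
    RunRecord(10_000, 10_000),
    RunRecord(2_500, 10_000),
    RunRecord(10_000, 10_000),
    RunRecord(5_000, 10_000),
]
```

For these made-up records, two of four runs finish, so the valid-run percentage is 50%, while the normalized final steps (1.0, 0.25, 1.0, 0.5) show how far the failed runs progressed.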
Inference speed is measured and plotted on a log-log scale with power-law fits. The exponent n reflects the model’s scaling with system size — lower n means better scalability to large systems.
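
One way to obtain the exponent n is an ordinary least-squares line fit in log-log space, since steps/s = A · N⁻ⁿ implies log(steps/s) = log A − n · log N. The sketch below assumes that fitting approach, which may differ from the procedure the benchmark actually uses; the function name `fit_power_law` is hypothetical and the data are synthetic.

```python
import math


def fit_power_law(n_atoms: list[float], steps_per_second: list[float]) -> tuple[float, float]:
    """Fit steps/s = A * N**(-n) by least squares in log-log space.

    Returns (A, n). Assumption: all inputs are strictly positive.
    """
    if len(n_atoms) != len(steps_per_second) or len(n_atoms) < 2:
        raise ValueError("need at least two (N, steps/s) pairs of equal length")
    xs = [math.log(n) for n in n_atoms]
    ys = [math.log(s) for s in steps_per_second]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("system sizes must not all be identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line equals -n
    intercept = y_mean - slope * x_mean    # intercept equals log(A)
    return math.exp(intercept), -slope


# Synthetic data generated from steps/s = 500 / N (not benchmark results).
sizes = [10.0, 100.0, 1000.0]
speeds = [500.0 * n ** -1.0 for n in sizes]
prefactor, exponent = fit_power_law(sizes, speeds)
```

On this synthetic input the fit recovers an exponent close to 1, meaning throughput falls in proportion to the number of atoms.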
Model support
The following models have results for the stability benchmark. Support requires the gpu-tasks: stability entry and, for NPT, npt: true in the model registry.
| Model | NVT support | NPT support | Training data |
|---|---|---|---|
| MACE-MP(M) | Yes | Yes | MPTrj |
| MACE-MPA | Yes | Yes | MPTrj, Alexandria |
| CHGNet | Yes | Yes | MPTrj |
| M3GNet | Yes | Yes | MPF |
| MatterSim | Yes | Yes | MPTrj, Alexandria |
| ORBv2 | Yes | Yes | MPTrj, Alexandria |
| ORB | Yes | Yes | MPTrj, Alexandria |
| SevenNet | Yes | Yes | MPTrj |
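
The registry requirement stated above the table can be expressed as a simple check. This sketch assumes the registry has been loaded into a dictionary keyed by model name and that `gpu-tasks` holds a list of task names; the layout and the example entries are assumptions for illustration, not the contents of the real registry.

```python
# Hypothetical registry contents; real entries carry more fields.
registry = {
    "example-model-a": {"gpu-tasks": ["stability"], "npt": True},
    "example-model-b": {"gpu-tasks": ["stability"], "npt": False},
    "example-model-c": {"gpu-tasks": [], "npt": True},
}


def supports_nvt(entry: dict) -> bool:
    """A model takes part in the stability benchmark if 'stability' is a GPU task."""
    return "stability" in (entry.get("gpu-tasks") or [])


def supports_npt(entry: dict) -> bool:
    """NPT compression additionally requires npt: true."""
    return supports_nvt(entry) and bool(entry.get("npt", False))


nvt_models = sorted(name for name, entry in registry.items() if supports_nvt(entry))
npt_models = sorted(name for name, entry in registry.items() if supports_npt(entry))
```

With these placeholder entries, two models qualify for NVT heating and only one qualifies for NPT compression.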
How to run
Two Jupyter notebooks orchestrate the benchmark runs:
- benchmarks/stability/temperature.ipynb: NVT heating runs
- benchmarks/stability/pressure.ipynb: NPT compression runs
Configure SLURM
Edit the cluster settings in benchmarks/stability/run.py. The default allocates 4 GPUs per node with a 4-hour wall time.
Run heating simulations
Open and run benchmarks/stability/temperature.ipynb. Results are saved as <model>-heating.parquet files in benchmarks/stability/<family>/.
Run compression simulations
Open and run benchmarks/stability/pressure.ipynb. Results are saved as <model>-compression.parquet files.
Interpreting results
Survival rate is the primary metric. Surviving 100% of NVT heating runs is a prerequisite for using a model in production MD simulations at elevated temperatures. Models with survival rates below 50% should not be used for dynamics without careful per-system validation.

Inference speed determines practical usability. The log-log plot of speed versus number of atoms reveals:
- The absolute throughput at a given system size
- How throughput degrades as system size grows (the power-law exponent)
- Which models have message-passing architectures that scale sub-quadratically with atom count