Parallel performance

MHD - Weak scaling measured on OLCF/TITAN

We measured weak-scaling performance of the MHD scheme (without shear border or dissipative terms) on the OLCF/TITAN system (the world's largest GPU cluster) using up to 4096 GPUs. The sub-domain size per GPU is \(256^3\).

Performance is measured in millions of cell updates per second. Each configuration uses one GPU per MPI process.

Note the good scaling up to 4096 GPUs, corresponding to a global resolution of \(4096^3\).

Figure: Ramses-GPU weak-scaling performance measured on the OLCF/TITAN cluster.
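
On TITAN each compute node hosts a single K20X GPU, so the one-GPU-per-MPI-process configuration amounts to binding every rank to its node-local device. Below is a minimal sketch of such a binding (illustrative only, not RamsesGPU's actual initialization code):

    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* GPUs visible from this process (one K20X per TITAN node). */
        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        /* Bind this MPI rank to one node-local GPU. */
        int dev = rank % ndev;
        cudaSetDevice(dev);

        printf("MPI rank %d bound to GPU %d (of %d visible)\n", rank, dev, ndev);

        MPI_Finalize();
        return 0;
    }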

Parallel IO performance using PnetCDF

We measured Ramses-GPU parallel IO performance on the OLCF/TITAN Lustre filesystem using up to 4096 GPUs. All MPI processes write into the same file using the PnetCDF library.
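
A minimal sketch of such a collective write with the PnetCDF C API follows; the grid sizes, the variable name "rho" and the 1D slab decomposition are illustrative assumptions, not RamsesGPU's actual output routine:

    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Global grid: nproc sub-domains of 8^3 stacked along z (illustrative sizes). */
        const MPI_Offset nx = 8, ny = 8, nz_local = 8;
        const MPI_Offset nz_global = nz_local * nproc;

        int ncid, dimids[3], varid;
        /* CDF-5 format (NC_64BIT_DATA) is required for variables larger than a few GB. */
        ncmpi_create(MPI_COMM_WORLD, "density.nc", NC_CLOBBER | NC_64BIT_DATA,
                     MPI_INFO_NULL, &ncid);

        ncmpi_def_dim(ncid, "z", nz_global, &dimids[0]);
        ncmpi_def_dim(ncid, "y", ny,        &dimids[1]);
        ncmpi_def_dim(ncid, "x", nx,        &dimids[2]);
        ncmpi_def_var(ncid, "rho", NC_DOUBLE, 3, dimids, &varid);
        ncmpi_enddef(ncid);

        /* Each rank owns one z-slab of the global array. */
        double *rho = malloc(nz_local * ny * nx * sizeof(double));
        for (MPI_Offset i = 0; i < nz_local * ny * nx; i++) rho[i] = (double)rank;

        MPI_Offset start[3] = { rank * nz_local, 0, 0 };
        MPI_Offset count[3] = { nz_local, ny, nx };

        /* Collective write: all ranks contribute to the same variable in the same file. */
        ncmpi_put_vara_double_all(ncid, varid, start, count, rho);

        ncmpi_close(ncid);
        free(rho);
        MPI_Finalize();
        return 0;
    }

The CDF-5 flag is what allows a single variable to exceed the size limits of the classic netCDF format, which matters for the multi-TB files reported below.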

The table below shows the very good effective bandwidth obtained: up to almost 30 GBytes/s when writing a 4.4 TB file in parallel collective mode with 4096 MPI tasks, each contributing 1.07 GB. It takes about 2.5 minutes to write more than 4 TB of data to disk!
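
As a quick consistency check, the effective bandwidth is simply the total output size divided by the write time; for the largest run (4096 MPI tasks, 128-way striping):

\[
\text{effective bandwidth} \simeq \frac{4417\ \mathrm{GB}}{158\ \mathrm{s}} \approx 28\ \mathrm{GB/s}.
\]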

Notice also the very strong impact of the Lustre stripe count on performance. When writing such large files, you should always tune this parameter, increasing it from the default value (often 2 or 4).
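
There are two usual ways to do this: set the striping on the output directory before the run (e.g. lfs setstripe -c 128 on that directory), or pass MPI-IO hints when the file is created. The fragment below extends the PnetCDF sketch above; it uses the standard ROMIO hint names, and the stripe count and stripe size values are only examples:

    /* Request a wider Lustre stripe at file-creation time via MPI-IO hints. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "128");     /* Lustre stripe count */
    MPI_Info_set(info, "striping_unit",   "4194304"); /* 4 MiB stripe size   */

    /* Pass 'info' instead of MPI_INFO_NULL:
       ncmpi_create(MPI_COMM_WORLD, "density.nc",
                    NC_CLOBBER | NC_64BIT_DATA, info, &ncid); */

    MPI_Info_free(&info);

Striping is fixed when the file is created, so changing it afterwards has no effect on an existing file.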

MPI processes    Global size    Total output    Local size per   Lustre stripe   Time    Effective
(1 GPU each)     (cells)        size (GBytes)   GPU (GBytes)     count           (sec)   bandwidth (GB/s)
512              \(2048^3\)     554             1.07             64              45      12.6
512              \(2048^3\)     554             1.07             128             25      22.9
4096             \(4096^3\)     4417            1.07             64              291     15.5
4096             \(4096^3\)     4417            1.07             128             158     28.8