This example shows a NumPy-based ping-pong benchmark using the mpi4py library on a cluster. Two MPI ranks exchange a NumPy array for several message sizes and measure round-trip time, one-way latency, and effective bandwidth.
To run this example, create the Python script (pingpong_mpi4py.py) and the Slurm batch script (slurm_pingpong.job) provided below.
The Python script uses the mpi4py library to measure MPI communication performance with the classic ping-pong benchmark: rank 0 sends a buffer to rank 1, which sends it straight back. The Slurm script loads the required module and launches the benchmark with one MPI rank on each of 2 nodes.
Optional interactive testing:
srun --partition=short --ntasks=2 --gres=gpu:1 --time=02:00:00 --pty bash
This command requests an interactive shell with two tasks for experimentation. As written, it does not pin both tasks to one node; add --nodes=1 if a single node is required.
#!/usr/bin/env python3
from mpi4py import MPI
import numpy as np
import sys

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The ping-pong pattern needs exactly two ranks.
if size != 2:
    if rank == 0:
        print("Need exactly 2 processes!")
    sys.exit(1)

partner = 1 - rank
nrounds = 100
msg_sizes = [1, 8, 64, 512, 1024, 4096, 16384, 65536, 262144]

for nbytes in msg_sizes:
    # uint8 has an item size of 1 byte, so nelems equals nbytes.
    nelems = max(1, nbytes // np.dtype(np.uint8).itemsize)
    sendbuf = np.zeros(nelems, dtype=np.uint8)
    recvbuf = np.empty(nelems, dtype=np.uint8)

    # Synchronize both ranks before starting the timer.
    comm.Barrier()
    t0 = MPI.Wtime()

    for i in range(nrounds):
        if rank == 0:
            # Ping: send to the partner, then wait for the reply.
            comm.Send(sendbuf, dest=partner, tag=100)
            comm.Recv(recvbuf, source=partner, tag=200)
        else:
            # Pong: receive the message and send it straight back.
            comm.Recv(recvbuf, source=partner, tag=100)
            comm.Send(recvbuf, dest=partner, tag=200)

    t1 = MPI.Wtime()

    if rank == 0:
        total_time = t1 - t0
        avg_rtt = total_time / nrounds
        # One-way latency is half of the average round-trip time.
        latency_us = (avg_rtt / 2.0) * 1.0e6
        # Bytes per microsecond equals MB/s (1 MB = 10^6 bytes).
        bandwidth_mb_s = nbytes / latency_us
        print(
            f"size={nbytes:8d} bytes | "
            f"RTT={avg_rtt*1.0e6:10.2f} us | "
            f"latency={latency_us:10.2f} us | "
            f"bandwidth={bandwidth_mb_s:10.2f} MB/s"
        )

if rank == 0:
    print("Ping-pong benchmark completed.")
#!/bin/bash
#SBATCH --job-name=pingpong_mpi4py
#SBATCH --partition=unite
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00
#SBATCH --output=pingpong_%j.out

module purge

# Load necessary modules
module add unite/python/3.14/mpi4py

# Start the example
mpirun -np $SLURM_NTASKS python3 ./pingpong_mpi4py.py
Submit the job:
sbatch slurm_pingpong.job
Example output, written to pingpong_<jobid>.out:
Loading unite/python/3.14/mpi4py
  Loading requirement: unite/python/3.14/python-3.14.0 unite/mpi/4.1
size=       1 bytes | RTT=     10.08 us | latency=      5.04 us | bandwidth=      0.20 MB/s
size=       8 bytes | RTT=      3.80 us | latency=      1.90 us | bandwidth=      4.21 MB/s
size=      64 bytes | RTT=      4.49 us | latency=      2.24 us | bandwidth=     28.54 MB/s
size=     512 bytes | RTT=     27.33 us | latency=     13.66 us | bandwidth=     37.47 MB/s
size=    1024 bytes | RTT=      6.46 us | latency=      3.23 us | bandwidth=    317.16 MB/s
size=    4096 bytes | RTT=      8.90 us | latency=      4.45 us | bandwidth=    920.86 MB/s
size=   16384 bytes | RTT=     15.53 us | latency=      7.77 us | bandwidth=   2109.61 MB/s
size=   65536 bytes | RTT=     29.86 us | latency=     14.93 us | bandwidth=   4390.07 MB/s
size=  262144 bytes | RTT=     72.79 us | latency=     36.40 us | bandwidth=   7202.47 MB/s
Ping-pong benchmark completed.
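The three reported metrics all derive from a single timing per message size. The sketch below repeats that arithmetic without MPI so the numbers can be checked by hand. The helper name pingpong_metrics is hypothetical and not part of the benchmark script, and the total time is reconstructed from the rounded RTT printed for the 262144-byte message, so the bandwidth it prints may differ from the benchmark output in the last digits.

```python
def pingpong_metrics(total_time_s, nrounds, nbytes):
    """Return (rtt_us, latency_us, bandwidth_mb_s) for one message size.

    Hypothetical helper that mirrors the arithmetic in the benchmark script.
    """
    avg_rtt_s = total_time_s / nrounds        # average round-trip time in seconds
    rtt_us = avg_rtt_s * 1.0e6                # round-trip time in microseconds
    latency_us = (avg_rtt_s / 2.0) * 1.0e6    # one-way latency is half the RTT
    bandwidth_mb_s = nbytes / latency_us      # bytes per microsecond equals MB/s
    return rtt_us, latency_us, bandwidth_mb_s


# Assumption: 100 rounds at the rounded RTT of 72.79 us reported for 262144 bytes.
rtt_us, latency_us, bandwidth = pingpong_metrics(72.79e-6 * 100, 100, 262144)
print(f"RTT={rtt_us:.2f} us | latency={latency_us:.2f} us | bandwidth={bandwidth:.2f} MB/s")
```

Because bandwidth is computed from the one-way latency, small messages report low bandwidth even on a fast interconnect: the fixed per-message overhead dominates the transfer time until the messages become large.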