====== MPI4py Ping-Pong Example for Cluster ======
This example shows a NumPy-based ping-pong benchmark using the ''mpi4py'' library on a cluster. Two MPI ranks exchange a NumPy array for several message sizes and measure round-trip time, one-way latency, and effective bandwidth.
===== Instructions =====
To run this example, create the Python script and Slurm batch script provided below.
The Python script uses the mpi4py library to measure MPI point-to-point communication performance with the classic ping-pong benchmark. The Slurm script loads the required modules and runs the benchmark on 2 nodes with one MPI rank per node.
Optional interactive testing:
srun --partition=short --ntasks=2 --gres=gpu:1 --time=02:00:00 --pty bash
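Use this command to request an interactive session with two tasks for experimentation. Add ''--nodes=1'' if both tasks must share a single node. The ''--gres=gpu:1'' option requests a GPU, which the benchmark itself does not use because it exchanges NumPy arrays in host memory.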
==== Python Script (pingpong_mpi4py.py) ====
#!/usr/bin/env python3
from mpi4py import MPI
import numpy as np
import sys

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The benchmark is defined for exactly two ranks.
if size != 2:
    if rank == 0:
        print("Need exactly 2 processes!")
    sys.exit(1)

partner = 1 - rank
nrounds = 100
msg_sizes = [1, 8, 64, 512, 1024, 4096, 16384, 65536, 262144]

for nbytes in msg_sizes:
    # uint8 has an item size of 1 byte, so nelems equals nbytes.
    nelems = max(1, nbytes // np.dtype(np.uint8).itemsize)
    sendbuf = np.zeros(nelems, dtype=np.uint8)
    recvbuf = np.empty(nelems, dtype=np.uint8)

    # Synchronize both ranks before starting the timer.
    comm.Barrier()
    t0 = MPI.Wtime()
    for i in range(nrounds):
        if rank == 0:
            # Ping: send to the partner, then wait for the echo.
            comm.Send(sendbuf, dest=partner, tag=100)
            comm.Recv(recvbuf, source=partner, tag=200)
        else:
            # Pong: receive the message and send it straight back.
            comm.Recv(recvbuf, source=partner, tag=100)
            comm.Send(recvbuf, dest=partner, tag=200)
    t1 = MPI.Wtime()

    if rank == 0:
        total_time = t1 - t0
        avg_rtt = total_time / nrounds
        # One-way latency is half the average round-trip time.
        latency_us = (avg_rtt / 2.0) * 1.0e6
        # Bytes per microsecond equals MB/s (1 MB = 10^6 bytes).
        bandwidth_mb_s = nbytes / latency_us
        print(
            f"size={nbytes:8d} bytes | "
            f"RTT={avg_rtt*1.0e6:10.2f} us | "
            f"latency={latency_us:10.2f} us | "
            f"bandwidth={bandwidth_mb_s:10.2f} MB/s"
        )

if rank == 0:
    print("Ping-pong benchmark completed.")
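
The timing arithmetic in the script can be checked in isolation, without MPI. The sketch below assumes a hypothetical helper, ''pingpong_metrics'' (not part of the script or of mpi4py), that applies the same formulas: one-way latency is half the average round-trip time, and bandwidth is the message size divided by that latency. Because one byte per microsecond equals 10^6 bytes per second, the result is in MB/s with 1 MB = 10^6 bytes.

```python
def pingpong_metrics(total_time_s, nrounds, nbytes):
    """Return (avg_rtt_us, latency_us, bandwidth_mb_s) for one message size.

    Hypothetical helper mirroring the formulas in pingpong_mpi4py.py.
    """
    avg_rtt_s = total_time_s / nrounds        # one full ping plus pong
    latency_us = (avg_rtt_s / 2.0) * 1.0e6    # one-way time in microseconds
    bandwidth_mb_s = nbytes / latency_us      # bytes/us == 10^6 bytes/s
    return avg_rtt_s * 1.0e6, latency_us, bandwidth_mb_s


if __name__ == "__main__":
    # Illustrative input: 100 round trips of 262144 bytes taking 7.279 ms in total.
    rtt_us, lat_us, bw = pingpong_metrics(7.279e-3, 100, 262144)
    print(f"RTT={rtt_us:.2f} us | latency={lat_us:.2f} us | bandwidth={bw:.2f} MB/s")
```

With these illustrative inputs the round-trip time is about 72.79 us and the bandwidth roughly 7200 MB/s, in line with the 262144-byte row of the example output.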
==== Slurm Job Script (slurm_pingpong.job) ====
#!/bin/bash
#SBATCH --job-name=pingpong_mpi4py
#SBATCH --partition=unite
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00
#SBATCH --output=pingpong_%j.out
module purge
#Load necessary modules
module add unite/python/3.14/mpi4py
#Start the example
mpirun -np $SLURM_NTASKS python3 ./pingpong_mpi4py.py
===== Run =====
sbatch slurm_pingpong.job
===== Example output =====
Loading unite/python/3.14/mpi4py
Loading requirement: unite/python/3.14/python-3.14.0 unite/mpi/4.1
size=       1 bytes | RTT=     10.08 us | latency=      5.04 us | bandwidth=      0.20 MB/s
size=       8 bytes | RTT=      3.80 us | latency=      1.90 us | bandwidth=      4.21 MB/s
size=      64 bytes | RTT=      4.49 us | latency=      2.24 us | bandwidth=     28.54 MB/s
size=     512 bytes | RTT=     27.33 us | latency=     13.66 us | bandwidth=     37.47 MB/s
size=    1024 bytes | RTT=      6.46 us | latency=      3.23 us | bandwidth=    317.16 MB/s
size=    4096 bytes | RTT=      8.90 us | latency=      4.45 us | bandwidth=    920.86 MB/s
size=   16384 bytes | RTT=     15.53 us | latency=      7.77 us | bandwidth=   2109.61 MB/s
size=   65536 bytes | RTT=     29.86 us | latency=     14.93 us | bandwidth=   4390.07 MB/s
size=  262144 bytes | RTT=     72.79 us | latency=     36.40 us | bandwidth=   7202.47 MB/s
Ping-pong benchmark completed.
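
In the example output the 1-byte and 512-byte rows are slower than their larger neighbours. One possible explanation, assumed here rather than verified on this cluster, is one-time setup cost or system noise falling inside the timed loop. A minimal sketch of a common mitigation is to run a few untimed warm-up rounds before starting the timer. The helper ''time_with_warmup'' and its parameters are hypothetical; in the benchmark, ''exchange'' would wrap the Send/Recv pair and ''clock'' would be ''MPI.Wtime''.

```python
import time


def time_with_warmup(exchange, nrounds, nwarmup=10, clock=time.perf_counter):
    """Return the average time of one exchange() call, excluding warm-up calls.

    Hypothetical helper; not part of mpi4py.
    """
    for _ in range(nwarmup):    # untimed rounds absorb one-time setup cost
        exchange()
    t0 = clock()
    for _ in range(nrounds):    # timed rounds
        exchange()
    t1 = clock()
    return (t1 - t0) / nrounds


if __name__ == "__main__":
    # Stand-in for the MPI exchange: count how often the callable runs.
    calls = []
    avg = time_with_warmup(lambda: calls.append(1), nrounds=100, nwarmup=10)
    print(f"{len(calls)} calls in total, average {avg * 1.0e6:.3f} us per timed call")
```

If this is applied to the MPI script, both ranks must run the same number of warm-up rounds, otherwise the Send and Recv calls no longer match.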