MPI4py Ping-Pong Example for Cluster

This example shows a NumPy-based ping-pong benchmark using the mpi4py library on a cluster. Two MPI ranks exchange a NumPy array for several message sizes and measure round-trip time, one-way latency, and effective bandwidth.
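
The three reported quantities all follow from one timed loop of round trips. The following is a minimal sketch of that arithmetic, assuming the same definitions the script below uses; the helper name `pingpong_metrics` and the input numbers are hypothetical and chosen only for illustration.

```python
def pingpong_metrics(total_time_s, nrounds, nbytes):
    """Derive per-message metrics from the total time of nrounds round trips."""
    avg_rtt_s = total_time_s / nrounds        # one round trip = ping + pong
    latency_us = (avg_rtt_s / 2.0) * 1.0e6    # one-way time in microseconds
    bandwidth_mb_s = nbytes / latency_us      # bytes per microsecond = 10^6 bytes/s
    return avg_rtt_s * 1.0e6, latency_us, bandwidth_mb_s


# Hypothetical input: 100 round trips of a 4096-byte message taking 1 ms in total.
rtt_us, latency_us, bandwidth_mb_s = pingpong_metrics(1.0e-3, 100, 4096)
print(
    f"RTT={rtt_us:.2f} us | "
    f"latency={latency_us:.2f} us | "
    f"bandwidth={bandwidth_mb_s:.2f} MB/s"
)
```

Here MB/s means 10^6 bytes per second, because dividing bytes by microseconds gives that unit directly.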

Instructions

To run this example, create the Python script and Slurm batch script provided below.

The Python script uses the mpi4py library to time the classic ping-pong exchange between two MPI ranks. The Slurm script loads the required modules and launches the benchmark on 2 nodes with one rank per node.

Optional interactive testing:

srun --partition=short --ntasks=2 --gres=gpu:1 --time=02:00:00 --pty bash

Use this command to request an interactive session with two tasks for experimentation. Slurm may place the tasks on one or two nodes unless --nodes is given.

Python Script (pingpong_mpi4py.py)

#!/usr/bin/env python3
from mpi4py import MPI
import numpy as np
import sys
 
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
 
if size != 2:
    if rank == 0:
        print("Need exactly 2 processes!")
    sys.exit(1)
 
partner = 1 - rank
 
nrounds = 100
msg_sizes = [1, 8, 64, 512, 1024, 4096, 16384, 65536, 262144]
 
for nbytes in msg_sizes:
    # uint8 has an itemsize of 1, so the element count equals the byte count.
    nelems = max(1, nbytes // np.dtype(np.uint8).itemsize)
 
    sendbuf = np.zeros(nelems, dtype=np.uint8)
    recvbuf = np.empty(nelems, dtype=np.uint8)
 
    comm.Barrier()
 
    t0 = MPI.Wtime()
 
    for _ in range(nrounds):
        if rank == 0:
            # Rank 0 sends the ping, then waits for the pong.
            comm.Send(sendbuf, dest=partner, tag=100)
            comm.Recv(recvbuf, source=partner, tag=200)
        else:
            # Rank 1 waits for the ping, then echoes it back.
            comm.Recv(recvbuf, source=partner, tag=100)
            comm.Send(recvbuf, dest=partner, tag=200)
 
    t1 = MPI.Wtime()
 
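    if rank == 0:
        total_time = t1 - t0
        avg_rtt = total_time / nrounds          # seconds per round trip
        latency_us = (avg_rtt / 2.0) * 1.0e6    # one-way time in microseconds
        # bytes per microsecond equals 10^6 bytes/s, reported as MB/s
        bandwidth_mb_s = nbytes / latency_us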
 
        print(
            f"size={nbytes:8d} bytes | "
            f"RTT={avg_rtt*1.0e6:10.2f} us | "
            f"latency={latency_us:10.2f} us | "
            f"bandwidth={bandwidth_mb_s:10.2f} MB/s"
        )
 
if rank == 0:
    print("Ping-pong benchmark completed.")

Slurm Job Script (slurm_pingpong.job)

#!/bin/bash
#SBATCH --job-name=pingpong_mpi4py
#SBATCH --partition=unite
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:01:00
#SBATCH --output=pingpong_%j.out
 
module purge
# Load the required modules
module add unite/python/3.14/mpi4py
 
# Launch the benchmark with one MPI rank per Slurm task
mpirun -np $SLURM_NTASKS python3 ./pingpong_mpi4py.py

Run

sbatch slurm_pingpong.job

Example output

Loading unite/python/3.14/mpi4py
  Loading requirement: unite/python/3.14/python-3.14.0 unite/mpi/4.1
 
size=       1 bytes | RTT=     10.08 us | latency=      5.04 us | bandwidth=      0.20 MB/s
size=       8 bytes | RTT=      3.80 us | latency=      1.90 us | bandwidth=      4.21 MB/s
size=      64 bytes | RTT=      4.49 us | latency=      2.24 us | bandwidth=     28.54 MB/s
size=     512 bytes | RTT=     27.33 us | latency=     13.66 us | bandwidth=     37.47 MB/s
size=    1024 bytes | RTT=      6.46 us | latency=      3.23 us | bandwidth=    317.16 MB/s
size=    4096 bytes | RTT=      8.90 us | latency=      4.45 us | bandwidth=    920.86 MB/s
size=   16384 bytes | RTT=     15.53 us | latency=      7.77 us | bandwidth=   2109.61 MB/s
size=   65536 bytes | RTT=     29.86 us | latency=     14.93 us | bandwidth=   4390.07 MB/s
size=  262144 bytes | RTT=     72.79 us | latency=     36.40 us | bandwidth=   7202.47 MB/s
Ping-pong benchmark completed.
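
The result lines can be cross-checked after a run by parsing them and recomputing the bandwidth from the printed RTT. The following is a minimal sketch, assuming the output format produced by the script above; the names `pattern`, `parse_results`, and `sample` are hypothetical, and the two sample lines are copied from the example output.

```python
import re

# Two result lines copied from the example output.
sample = """\
size=    4096 bytes | RTT=      8.90 us | latency=      4.45 us | bandwidth=    920.86 MB/s
size=  262144 bytes | RTT=     72.79 us | latency=     36.40 us | bandwidth=   7202.47 MB/s
"""

# Matches one result line as printed by the benchmark script.
pattern = re.compile(
    r"size=\s*(\d+) bytes \| RTT=\s*([\d.]+) us \| "
    r"latency=\s*([\d.]+) us \| bandwidth=\s*([\d.]+) MB/s"
)


def parse_results(text):
    """Return (nbytes, rtt_us, latency_us, bandwidth_mb_s) for each result line."""
    rows = []
    for m in pattern.finditer(text):
        rows.append(
            (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        )
    return rows


rows = parse_results(sample)
rel_errors = []
for nbytes, rtt_us, latency_us, reported in rows:
    # Bandwidth is the message size divided by the one-way time (RTT / 2).
    recomputed = nbytes / (rtt_us / 2.0)
    rel_errors.append(abs(recomputed - reported) / reported)
    print(f"size={nbytes} reported={reported:.2f} recomputed={recomputed:.2f} MB/s")
```

Small differences between the reported and recomputed values are expected, because the printed RTT is rounded to two decimals while the script computes bandwidth from the unrounded timing.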