Differences

This shows you the differences between two versions of the page.

--- unite_python_mpi_4_py [2026/04/10 12:55] – nshegunov
+++ unite_python_mpi_4_py [2026/04/10 13:49] (current) – nshegunov
@@ Line 1: / Line 1: @@
-== MPI4py Ping-Pong Example for Cluster ==
+====== MPI4py Ping-Pong Example for Cluster ======
-MPI4py ping-pong demonstrates point-to-point communication between rank 0 (master) and rank 1 (worker). Rank 0 sends increasingly larger messages to rank 1, which echoes them back; timing measures bandwidth.
+This example shows a NumPy-based ping-pong benchmark using the ''mpi4py'' library on a cluster. Two MPI ranks exchange a NumPy array for several message sizes and measure round-trip time, one-way latency, and effective bandwidth.
+===== Instructions =====
+To run this example, create the Python script and Slurm batch script provided below.
+The Python script uses the ''mpi4py'' library to measure MPI communication performance with the classic ping-pong benchmark. The Slurm script loads the required modules and runs the benchmark across 2 nodes.
+Optional interactive testing:
+<code bash>
+srun --partition=short --ntasks=2 --gres=gpu:1 --time=02:00:00 --pty bash
+</code>
+This command requests an interactive session with 2 tasks for experimentation. Add ''--nodes=1'' if both tasks must run on a single node.
 ===== Python Script (pingpong_mpi4py.py) =====
 <code python pingpong_mpi4py.py>
+#!/usr/bin/env python3
 from mpi4py import MPI
 import numpy as np
-import time
+import sys
 comm = MPI.COMM_WORLD
@@ Line 14: / Line 28: @@
 size = comm.Get_size()
-if size < 2:
+if size != 2:
     if rank == 0:
-        print("Need at least 2 processes")
+        print("Need exactly 2 processes!")
-    exit()
+    sys.exit(1)
-N = 1000  # max message size
+partner = 1 - rank
-extent = 100  # message size steps
-sbuf = np.zeros(1, dtype='d')
+nrounds = 100
-rbuf = np.zeros(1, dtype='d')
+msg_sizes = [1, 8, 64, 512, 1024, 4096, 16384, 65536, 262144]
+for nbytes in msg_sizes:
+    nelems = max(1, nbytes // np.dtype(np.uint8).itemsize)
+    sendbuf = np.zeros(nelems, dtype=np.uint8)
+    recvbuf = np.empty(nelems, dtype=np.uint8)
+    comm.Barrier()
+    t0 = MPI.Wtime()
+    for i in range(nrounds):
+        if rank == 0:
+            comm.Send(sendbuf, dest=partner, tag=100)
+            comm.Recv(recvbuf, source=partner, tag=200)
+        else:
+            comm.Recv(recvbuf, source=partner, tag=100)
+            comm.Send(recvbuf, dest=partner, tag=200)
+    t1 = MPI.Wtime()
+    if rank == 0:
+        total_time = t1 - t0
+        avg_rtt = total_time / nrounds
+        latency_us = (avg_rtt / 2.0) * 1.0e6
+        # bytes per microsecond is numerically equal to MB/s (1 MB = 10^6 bytes)
+        bandwidth_mb_s = nbytes / latency_us
+        print(
+            f"size={nbytes:8d} bytes | "
+            f"RTT={avg_rtt*1.0e6:10.2f} us | "
+            f"latency={latency_us:10.2f} us | "
+            f"bandwidth={bandwidth_mb_s:10.2f} MB/s"
+        )
 if rank == 0:
-    t0 = time.time()
+    print("Ping-pong benchmark completed.")
-    for i in range(extent):
-        size = (i + 1) * N
-        sbuf = np.zeros(size, dtype='d') + rank
-        t1 = time.time()
-        comm.send(sbuf, dest=1, tag=i)
-        comm.Recv(rbuf, source=1, tag=i)
-        t2 = time.time()
-        latency = (t2 - t1) * 1000  # ms
-        bandwidth = (size * 8) / (t2 - t1) / 1e6  # MB/s
-        print(f"Size {size}: latency {latency:.2f}ms, BW {bandwidth:.1f} MB/s")
-    total_time = time.time() - t0
-    print(f"Total time: {total_time:.2f}s")
-elif rank == 1:
-    for i in range(extent):
-        rbuf = comm.recv(source=0, tag=i)
-        comm.send(rbuf, dest=0, tag=i)
 </code>
@@ Line 60: / Line 91: @@
 #Start the example
-mpirun -np $SLURM_NTASKS python pingpong_mpi4py.py
+mpirun -np $SLURM_NTASKS python3 ./pingpong_mpi4py.py
 </code>
+===== Run =====
+<code bash>
+sbatch slurm_pingpong.job
+</code>
+===== Example output =====
+<code>
+Loading unite/python/3.14/mpi4py
+  Loading requirement: unite/python/3.14/python-3.14.0 unite/mpi/4.1
+size=       1 bytes | RTT=     10.08 us | latency=      5.04 us | bandwidth=      0.20 MB/s
+size=       8 bytes | RTT=      3.80 us | latency=      1.90 us | bandwidth=      4.21 MB/s
+size=      64 bytes | RTT=      4.49 us | latency=      2.24 us | bandwidth=     28.54 MB/s
+size=     512 bytes | RTT=     27.33 us | latency=     13.66 us | bandwidth=     37.47 MB/s
+size=    1024 bytes | RTT=      6.46 us | latency=      3.23 us | bandwidth=    317.16 MB/s
+size=    4096 bytes | RTT=      8.90 us | latency=      4.45 us | bandwidth=    920.86 MB/s
+size=   16384 bytes | RTT=     15.53 us | latency=      7.77 us | bandwidth=   2109.61 MB/s
+size=   65536 bytes | RTT=     29.86 us | latency=     14.93 us | bandwidth=   4390.07 MB/s
+size=  262144 bytes | RTT=     72.79 us | latency=     36.40 us | bandwidth=   7202.47 MB/s
+Ping-pong benchmark completed.
+</code>
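The arithmetic behind the reported columns can be checked without MPI. The following is a minimal sketch that repeats the formulas from the new script in a standalone function; the name ''pingpong_metrics'' is hypothetical and not part of the wiki page, and the input is reconstructed from the rounded RTT in the last output row, so the result only approximately matches the printed bandwidth.

```python
def pingpong_metrics(total_time_s, nrounds, nbytes):
    """Derive ping-pong metrics from the wall time of a timed loop.

    Hypothetical helper: it mirrors the formulas used in pingpong_mpi4py.py
    but is not part of that script.
    """
    avg_rtt_s = total_time_s / nrounds          # one send plus one echo
    latency_us = (avg_rtt_s / 2.0) * 1.0e6      # one-way time in microseconds
    # bytes per microsecond is numerically equal to MB/s (1 MB = 10^6 bytes)
    bandwidth_mb_s = nbytes / latency_us
    return avg_rtt_s * 1.0e6, latency_us, bandwidth_mb_s


# Rebuild the total time from the rounded RTT of the 262144-byte row
# (72.79 us per round trip, 100 rounds).
rtt_us, latency_us, bandwidth_mb_s = pingpong_metrics(72.79e-6 * 100, 100, 262144)
print(
    f"RTT={rtt_us:.2f} us | "
    f"latency={latency_us:.2f} us | "
    f"bandwidth={bandwidth_mb_s:.2f} MB/s"
)
```

Because latency is taken as half the round-trip time, the bandwidth figure is a one-way estimate. It stays low for small messages, where the fixed per-message cost dominates, and grows with message size.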