examples
====Description====

This page provides examples of how to use the cluster. There are language-specific examples for **C/C++** and **Python**, which show how to compile and run applications written in those languages on the cluster. Additionally, there are examples which demonstrate how to use multiple **threads**, **MPI**, and the **GPU**s on the cluster.

----
====PyTorch====
Consider the following simple Python test script ("pytorch_test.py"):

<code python>
import torch

def test_pytorch():
    # Report the installed PyTorch version and whether a GPU is visible
    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())

    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    # Simple tensor operation
    x = torch.tensor([1.0, 2.0, 3.0], device=device)
    y = torch.tensor([4.0, 5.0, 6.0], device=device)
    z = x + y
    print("x + y =", z)

test_pytorch()
</code>

To test it on the unite cluster you can use the following sbatch script to run it:
<code bash>
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --output=pytorch_test.out
#SBATCH --error=pytorch_test.err
#SBATCH --time=00:10:00
#SBATCH --partition=a40
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2

# Load necessary modules (modify based on your system)
module load python/3.13

# Activate your virtual environment if needed
# source ~/venv/bin/activate

# Run the PyTorch script
python3.13 pytorch_test.py
</code>
----
====Pandas====
Consider the following simple Python test script ("pandas_test.py"):
<code python>
import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Test basic operations
print("Column sums:")
print(df.sum())

print("Column means:")
print(df.mean())

# Adding a new column
df['D'] = df['A'] + df['B']
print("DataFrame with new column 'D':")
print(df)

# Filtering rows
filtered_df = df[df['A'] > 1]
print("Rows where 'A' > 1:")
print(filtered_df)

# Check if NaN values exist
print("NaN values per column:")
print(df.isna().sum())
</code>

You can use the following sbatch script to run it:
<code bash>
#!/bin/bash
#SBATCH --job-name=pandas_test
#SBATCH --output=pandas_test.out
#SBATCH --error=pandas_test.err
#SBATCH --time=00:10:00
#SBATCH --partition=a40
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2

# Load necessary modules (modify based on your system)
module load python/3.13

# Activate your virtual environment if needed
# source ~/venv/bin/activate

# Run the pandas script
python3.13 pandas_test.py
</code>
----
====Simple C/C++ program====
The following is a simple **C/C++** program which performs element-wise addition of 2 vectors. It does **not** use any dependent libraries:
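A minimal sketch of such a program (the exact listing, variable names, and vector size are illustrative and may differ) could look like this:

<code C++>
#include <iostream>
#include <vector>

#define VECTOR_SIZE 100000

int main() {
    // Allocate and initialize the two input vectors and the result vector
    std::vector<int> a(VECTOR_SIZE), b(VECTOR_SIZE), c(VECTOR_SIZE);
    for (int i = 0; i < VECTOR_SIZE; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    // Element-wise addition
    for (int i = 0; i < VECTOR_SIZE; i++) {
        c[i] = a[i] + b[i];
    }

    // Print the first few results as a sanity check
    std::cout << "First 5 results: ";
    for (int i = 0; i < 5; i++) {
        std::cout << c[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
</code>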
----
====C++ program which uses MPI====
The following is an example **C/C++** application which uses **MPI** to perform element-wise addition of two vectors. Each **MPI** task computes the addition of its local region and then sends it back to the leader. Using **MPI** with **Python** is similar, assuming you know how to manage **Python** dependencies on the cluster, which is described in a previous section. What is important here is to understand how to manage the resources of the system.
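A minimal sketch of such a program (the exact listing may differ; this sketch assumes the vector size is divisible by the number of **MPI** tasks and uses MPI_Scatter/MPI_Gather to distribute the work and collect the results):

<code C++>
#include <cstdio>
#include <vector>
#include <mpi.h>

#define VECTOR_SIZE 100000  // assumed to be divisible by the number of MPI tasks

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local_n = VECTOR_SIZE / size;

    // The leader (rank 0) owns the full vectors; the other ranks only work on their local regions
    std::vector<int> a, b, c;
    if (rank == 0) {
        a.resize(VECTOR_SIZE);
        b.resize(VECTOR_SIZE);
        c.resize(VECTOR_SIZE);
        for (int i = 0; i < VECTOR_SIZE; i++) {
            a[i] = i + 1;
            b[i] = (i + 1) * 2;
        }
    }

    std::vector<int> local_a(local_n), local_b(local_n), local_c(local_n);

    // Distribute the local regions of both input vectors to all tasks
    MPI_Scatter(a.data(), local_n, MPI_INT, local_a.data(), local_n, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(b.data(), local_n, MPI_INT, local_b.data(), local_n, MPI_INT, 0, MPI_COMM_WORLD);

    // Each task computes the addition of its local region
    for (int i = 0; i < local_n; i++) {
        local_c[i] = local_a[i] + local_b[i];
    }

    // Send the partial results back to the leader
    MPI_Gather(local_c.data(), local_n, MPI_INT, c.data(), local_n, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("First 5 results: ");
        for (int i = 0; i < 5; i++) {
            std::printf("%d ", c[i]);
        }
        std::printf("\n");
    }

    MPI_Finalize();
    return 0;
}
</code>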
The corresponding batch script compiles the program, launches it with the requested number of **MPI** tasks, and prints "Job completed!" when the job finishes.
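A sketch of such a batch script, assuming the source file is named //vector_sum_mpi.cpp// and that an MPI module (here called "openmpi") is available (the actual file name, module name, partition, and resource values may differ):

<code bash>
#!/bin/bash
#SBATCH --job-name=vector_sum_mpi
#SBATCH --output=vector_sum_mpi_%j.out
#SBATCH --error=vector_sum_mpi_%j.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
#SBATCH --partition=unite

echo "Job started at: $(date)"
echo "Nodes: $SLURM_NODELIST"
echo "Total MPI tasks: $SLURM_NTASKS"

# Assumed module name; load whichever MPI module the cluster provides
module load openmpi

echo "Compiling..."
mpicxx -O3 vector_sum_mpi.cpp -o vector_sum_mpi

if [ $? -ne 0 ]; then
    echo "Compilation failed"
    exit 1
fi

echo "Running..."
# srun starts one process per allocated MPI task (2 nodes x 4 tasks = 8 tasks here)
srun ./vector_sum_mpi

echo "Job completed!"
echo "Job finished at: $(date)"
</code>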
----

====C++ program which uses multiple threads====

The following is a simple **C++** program which computes the sum of 2 vectors. It uses multiple **threads**. Each **thread** computes the sum for its respective region.
<code C++>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

#define VECTOR_SIZE 100000

void vector_add_worker(int thread_id, int start_idx, int end_idx,
                       const int* a, const int* b, int* c) {
    int elements = end_idx - start_idx;
    std::cout << "Thread " << thread_id << " processing " << elements
              << " elements" << std::endl;

    for (int i = start_idx; i < end_idx; i++) {
        c[i] = a[i] + b[i];
    }
}

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <num_threads>" << std::endl;
        return 1;
    }

    int num_threads = std::stoi(argv[1]);
    if (num_threads <= 0) {
        std::cerr << "Number of threads must be positive" << std::endl;
        return 1;
    }

    std::cout << "Using " << num_threads << " threads" << std::endl;

    std::vector<int> a(VECTOR_SIZE);
    std::vector<int> b(VECTOR_SIZE);
    std::vector<int> c(VECTOR_SIZE);

    for (int i = 0; i < VECTOR_SIZE; i++) {
        a[i] = i + 1;
        b[i] = (i + 1) * 2;
    }

    int elements_per_thread = VECTOR_SIZE / num_threads;

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++) {
        int start_idx = t * elements_per_thread;
        // The last thread also takes any remainder elements
        int end_idx = (t == num_threads - 1) ? VECTOR_SIZE : (t + 1) * elements_per_thread;

        threads.emplace_back(vector_add_worker, t, start_idx, end_idx,
                             a.data(), b.data(), c.data());
    }

    for (auto& thread : threads) {
        thread.join();
    }

    std::cout << "First 5 results: ";
    for (int i = 0; i < 5; i++) {
        std::cout << c[i] << " ";
    }
    std::cout << std::endl;

    return 0;
}
</code>
The following is the respective batch script for compiling and running the program. You can see the output of the program in the generated //vector_sum_threads_<job_id>.out// file.

<code bash>
#!/bin/bash
#SBATCH --job-name=vector_sum_threads
#SBATCH --output=vector_sum_threads_%j.out
#SBATCH --error=vector_sum_threads_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:10:00
#SBATCH --partition=unite

echo "Job started at: $(date)"
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "CPUs per task: $SLURM_CPUS_PER_TASK"

module load gcc

echo "Compiling..."
g++ -std=c++11 -pthread -O3 vector_sum_threads.cpp -o vector_sum_threads

if [ $? -eq 0 ]; then
    echo "Compilation successful"
    echo ""

    echo "Running with $SLURM_CPUS_PER_TASK threads..."
    ./vector_sum_threads "$SLURM_CPUS_PER_TASK"

    echo "Job completed!"
    echo "Job finished at: $(date)"
else
    echo "Compilation failed"
    exit 1
fi
</code>
----

====C++ program which uses GPU====

The following is an example **CUDA** application which uses an **Nvidia GPU** to perform element-wise addition of two vectors. Using **CUDA** with **Python** is similar, assuming you know how to manage **Python** dependencies on the cluster, which is described in a previous section. What is important here is to understand how to manage the resources of the system.
<code C++>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA error at %s:%d - %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(error)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

/*
 * CUDA kernel for vector addition
 * Each thread computes one element of the result vector
 *
 * Parameters:
 *   a - first input vector (device memory)
 *   b - second input vector (device memory)
 *   c - output vector (device memory)
 *   n - number of elements
 */
__global__ void vectorAddKernel(const float *a, const float *b, float *c, int n) {
    // Calculate global thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Check if thread is within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    const int N = 50'000'000;
    const size_t bytes = N * sizeof(float);

    printf("CUDA Vector Addition\n");
    printf("Vector size: %d elements\n", N);
    printf("Memory per vector: %.2f MB\n", bytes / (1024.0 * 1024.0));
    printf("\n");

    int deviceId;
    cudaDeviceProp props;
    CUDA_CHECK(cudaGetDevice(&deviceId));
    CUDA_CHECK(cudaGetDeviceProperties(&props, deviceId));

    printf("GPU Information:\n");
    printf("  Device name: %s\n", props.name);
    printf("  Compute capability: %d.%d\n", props.major, props.minor);
    printf("  Multiprocessors: %d\n", props.multiProcessorCount);
    printf("  Global memory: %.1f GB\n",
           props.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("  Max threads per block: %d\n", props.maxThreadsPerBlock);
    printf("  Warp size: %d\n", props.warpSize);
    printf("\n");

    printf("Allocating host memory...\n");
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c_gpu = (float *)malloc(bytes);
    float *h_c_cpu = (float *)malloc(bytes);

    if (!h_a || !h_b || !h_c_gpu || !h_c_cpu) {
        fprintf(stderr, "Failed to allocate host memory\n");
        return 1;
    }

    for (int i = 0; i < N; i++) {
        h_a[i] = (float)rand() / RAND_MAX;
        h_b[i] = (float)rand() / RAND_MAX;
    }

    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    printf("Launching kernel...\n");
    printf("  Threads per block: %d\n", threadsPerBlock);
    printf("  Blocks per grid: %d\n", blocksPerGrid);
    printf("\n");

    vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h_c_gpu, d_c, bytes, cudaMemcpyDeviceToHost));

    printf("First 5 results:\n");
    for (int i = 0; i < 5; i++) {
        printf("  %f + %f = %f\n", h_a[i], h_b[i], h_c_gpu[i]);
    }

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    free(h_a);
    free(h_b);
    free(h_c_gpu);
    free(h_c_cpu);

    return 0;
}
</code>
The following is the respective batch script for compiling and running the program. You can see the output of the program in the generated //vector_sum_cuda_<job_id>.out// file.

<code bash>
#!/bin/bash
#SBATCH --job-name=vector_sum_cuda
#SBATCH --output=vector_sum_cuda_%j.out
#SBATCH --error=vector_sum_cuda_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --partition=unite

echo "========================================"
echo "SLURM Job Information"
echo "========================================"
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "Partition: $SLURM_JOB_PARTITION"
echo "Job started at: $(date)"
echo ""

# Load the CUDA toolchain module (exact module version depends on the cluster)
module load nvidia/

echo "Compiling CUDA program..."
nvcc -O3 -o vector_sum_cuda vector_sum_cuda.cu

if [ $? -ne 0 ]; then
    echo "Compilation failed"
    exit 1
fi

echo "Compilation successful"
echo ""

echo "Running vector_sum_cuda..."
./vector_sum_cuda

echo ""
echo "Job finished at: $(date)"
</code>
----