====Description====

This page provides examples of how to use the cluster. There are language-specific examples for **C/C++** and **Python**, which show how you can compile and run applications written in those languages on the cluster. Additionally, there are examples of how to leverage the different resources of the cluster. These examples are written in **C++**, but the concepts apply to a program written in any language.

----
====PyTorch====
Consider the following simple Python test script ("pytorch_test.py"):

<code python>
import torch

def test_pytorch():
    print("PyTorch version:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())

    if torch.cuda.is_available():
        print("CUDA device:", torch.cuda.get_device_name(0))
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    # Simple tensor operation
    x = torch.tensor([1.0, 2.0, 3.0], device=device)
    y = torch.tensor([4.0, 5.0, 6.0], device=device)
    z = x + y
    print("Tensor operation result:", z)

test_pytorch()
</code>

To test it on the unite cluster, you can use the following sbatch script to run it:
<code bash>
#!/bin/bash
#SBATCH --job-name=pytorch_test
#SBATCH --output=pytorch_test.out
#SBATCH --error=pytorch_test.err
#SBATCH --time=00:10:00
#SBATCH --partition=a40
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2

# Load necessary modules (modify based on your system)
module load python/pytorch-2.5.1-llvm-cuda-12.3-python-3.13.1-llvm

# Activate your virtual environment if needed
# source ~/your_env/bin/activate

# Run the PyTorch script
python3.13 pytorch_test.py
</code>
----
====Pandas====
Consider the following simple Python test script ("pandas_test.py"):
<code python>
import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Test basic operations
print("\nSum of each column:")
print(df.sum())

print("\nMean of each column:")
print(df.mean())

# Adding a new column
df['D'] = df['A'] + df['B']
print("\nDataFrame after adding new column D (A + B):")
print(df)

# Filtering rows
filtered_df = df[df['A'] > 2]
print("\nFiltered DataFrame (A > 2):")
print(filtered_df)

# Check if NaN values exist
print("\nCheck for NaN values:")
print(df.isna().sum())
</code>

You can use the following sbatch script to run it:
<code bash>
#!/bin/bash
#SBATCH --job-name=pandas_test
#SBATCH --output=pandas_test.out
#SBATCH --error=pandas_test.err
#SBATCH --time=00:10:00
#SBATCH --partition=a40
#SBATCH --gres=gpu:1
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2

# Load necessary modules (modify based on your system)
module load python/3.13.1-llvm
module load python/3.13/pandas/2.2.3

# Activate your virtual environment if needed
# source ~/your_env/bin/activate

# Run the Pandas script
python3.13 pandas_test.py
</code>
----
====Simple C/C++ program====
The following is a simple **C/C++** program which performs element-wise addition of two vectors. It does **not** use any external libraries:
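A minimal sketch of such a program (a hypothetical listing, not necessarily the exact one used on the cluster; it assumes plain **C++** with only the standard library) could look like this:

<code C++>
#include <cstdio>
#include <vector>

int main() {
    const int n = 10;

    // Initialize the two input vectors
    std::vector<double> a(n), b(n), c(n);
    for (int i = 0; i < n; i++) {
        a[i] = static_cast<double>(i);
        b[i] = 2.0 * i;
    }

    // Element-wise addition: c[i] = a[i] + b[i]
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }

    // Print the result
    for (int i = 0; i < n; i++) {
        printf("c[%d] = %.2f\n", i, c[i]);
    }
    return 0;
}
</code>

A program like this can be compiled with a standard compiler (e.g. g++) and submitted with an sbatch script analogous to the ones shown above, without any GPU or extra module requirements.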
  
----
====C++ program which uses MPI====

The following is an example **C/C++** application which uses **MPI** to perform element-wise addition of two vectors. Each **MPI** task computes the addition of its local region and then sends the result back to the leader. Using **MPI** with **Python** is similar, assuming you know how to manage **Python** dependencies on the cluster, which is described in the previous section. What is important here is to understand how to manage the resources of the system.
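As a minimal sketch of the scheme described above (a hypothetical listing using MPI_Scatter and MPI_Gather, not necessarily the exact program used on the cluster; it assumes the vector length is divisible by the number of tasks), it could look like this:

<code C++>
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;         // total number of elements
    const int local_n = n / size;  // elements handled by each task (assumes size divides n)

    std::vector<double> a, b, c;
    if (rank == 0) {
        // The leader initializes the full input vectors
        a.resize(n); b.resize(n); c.resize(n);
        for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0 * i; }
    }

    // Distribute one chunk of each input vector to every task
    std::vector<double> local_a(local_n), local_b(local_n), local_c(local_n);
    MPI_Scatter(a.data(), local_n, MPI_DOUBLE, local_a.data(), local_n,
                MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(b.data(), local_n, MPI_DOUBLE, local_b.data(), local_n,
                MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Each task adds its local region
    for (int i = 0; i < local_n; i++) {
        local_c[i] = local_a[i] + local_b[i];
    }

    // Collect the partial results back on the leader
    MPI_Gather(local_c.data(), local_n, MPI_DOUBLE, c.data(), local_n,
               MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("c[0] = %.2f, c[%d] = %.2f\n", c[0], n - 1, c[n - 1]);
    }

    MPI_Finalize();
    return 0;
}
</code>

On most clusters such a program is compiled with the MPI compiler wrapper (e.g. mpicxx) and launched with one task per rank via srun, with ''--ntasks'' in the sbatch script controlling the number of MPI tasks.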
----
  
====C++ program which uses GPU====

The following is an example **CUDA** application which uses an **Nvidia GPU** to perform element-wise addition of two vectors. Using **CUDA** with **Python** is similar, assuming you know how to manage **Python** dependencies on the cluster, which is described in a previous section. What is important here is to understand how to manage the resources of the system.
  
<code C++>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

// Helper macro which checks the result of every CUDA API call
#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            fprintf(stderr, "CUDA Error: %s:%d, %s\n", __FILE__, __LINE__, \
                    cudaGetErrorString(error)); \
            exit(EXIT_FAILURE); \
        } \
    } while(0)

/*
 * CUDA kernel for vector addition
 * Each thread computes one element of the result vector
 *
 * Parameters:
 *   a: First input vector
 *   b: Second input vector
 *   c: Output vector (result)
 *   n: Number of elements
 */
__global__ void vectorAddKernel(const float *a, const float *b, float *c, int n) {
    // Calculate global thread ID
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Check if thread is within bounds
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    const int N = 50'000'000;
    const size_t bytes = N * sizeof(float);

    printf("========================================\n");
    printf("CUDA Vector Addition Example\n");
    printf("========================================\n");
    printf("Vector size: %d elements\n", N);

    // Query and print information about the GPU assigned to this job
    int deviceId;
    cudaDeviceProp props;
    CUDA_CHECK(cudaGetDevice(&deviceId));
    CUDA_CHECK(cudaGetDeviceProperties(&props, deviceId));

    printf("\nGPU Information:\n");
    printf("  Device: %s\n", props.name);
    printf("  Compute Capability: %d.%d\n", props.major, props.minor);
    printf("  Total Global Memory: %.2f GB\n",
           props.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("  Multiprocessors: %d\n", props.multiProcessorCount);
    printf("  Max Threads per Block: %d\n", props.maxThreadsPerBlock);
    printf("  Warp Size: %d\n", props.warpSize);

    printf("\nAllocating host memory...\n");
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c_gpu = (float *)malloc(bytes);
    float *h_c_cpu = (float *)malloc(bytes);

    if (!h_a || !h_b || !h_c_gpu || !h_c_cpu) {
        fprintf(stderr, "Error: Host memory allocation failed!\n");
        return 1;
    }

    // Initialize the input vectors with random values in [0, 1]
    for (int i = 0; i < N; i++) {
        h_a[i] = (float)rand() / RAND_MAX;
        h_b[i] = (float)rand() / RAND_MAX;
    }

    // Allocate device memory for the input and output vectors
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    // Copy the input vectors from host to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    // One thread per element, rounded up to a whole number of blocks
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    printf("\nKernel Configuration:\n");
    printf("  Threads per block: %d\n", threadsPerBlock);
    printf("  Blocks per grid: %d\n", blocksPerGrid);
    printf("  Total threads: %d\n", blocksPerGrid * threadsPerBlock);

    // Launch the kernel and check for launch errors
    vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
    CUDA_CHECK(cudaGetLastError());

    // Copy the result back to the host (this also synchronizes with the kernel)
    CUDA_CHECK(cudaMemcpy(h_c_gpu, d_c, bytes, cudaMemcpyDeviceToHost));

    printf("\nFirst 5 elements of result:\n");
    for (int i = 0; i < 5; i++) {
        printf("  c[%d] = %.6f\n", i, h_c_gpu[i]);
    }

    // Verify the GPU result against a CPU reference computation
    int errors = 0;
    for (int i = 0; i < N; i++) {
        h_c_cpu[i] = h_a[i] + h_b[i];
        if (fabsf(h_c_gpu[i] - h_c_cpu[i]) > 1e-5f) {
            errors++;
        }
    }
    printf("\nVerification: %s (%d mismatches)\n",
           errors == 0 ? "PASSED" : "FAILED", errors);

    // Free device and host memory
    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    free(h_a);
    free(h_b);
    free(h_c_gpu);
    free(h_c_cpu);

    return 0;
}
</code>
  
You can use the following sbatch script to compile and run it on the cluster:
  
<code bash>
#!/bin/bash
#SBATCH --job-name=vector_sum_cuda
#SBATCH --output=vector_sum_cuda_%j.out
#SBATCH --error=vector_sum_cuda_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --partition=unite

echo "========================================="
echo "SLURM Job Information"
echo "========================================="
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $SLURM_NODELIST"
echo "GPU(s): $SLURM_GPUS_ON_NODE"
echo "Starting at: $(date)"
echo ""

module load nvidia/cuda/12-latest

echo "Compiling vector_sum_cuda.cu..."
nvcc -O3 -o vector_sum_cuda vector_sum_cuda.cu

if [ $? -ne 0 ]; then
    echo "Error: Compilation failed!"
    exit 1
fi

echo "Compilation successful!"
echo ""

echo "Running vector_sum_cuda..."
./vector_sum_cuda

echo ""
echo "Job finished at: $(date)"
</code>
  
----