CUDA

CUDA is NVIDIA's parallel computing platform and programming model for running general-purpose code on Graphics Processing Units (GPUs). CUDA can currently be used from C, C++, C#, Fortran, Java, and Python, either directly or through language bindings.

CUDA version 8.0 is installed in /opt/cuda-8.0/, and version 9.0 is installed in /opt/cuda-9.0/.

CUDA compiler: nvcc

CUDA file extension: .cu

CUDA Environment

CUDA binaries and libraries are installed in /opt/CUDA. To set up your environment for CUDA, use the module command:

module load cuda/8.0 (or simply module load cuda)

Example of Compiling a CUDA File

Please edit and compile on the management node, and run the program on a GPU node. To compile a CUDA file:

nvcc -arch=sm_35 filename.cu

This generates an executable named "a.out" in the current working directory. The -arch=sm_35 option selects the target GPU architecture (compute capability 3.5). To enable optimization and name the executable:

nvcc -arch=sm_35 -O3 filename.cu -o outCUDA

This applies optimization level 3 to the host (serial) part of the code and generates the executable "outCUDA".
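
One way to see what the -arch option does is to print the __CUDA_ARCH__ macro from device code. The following is a minimal sketch, not part of the installed samples; the file name arch_check.cu and the kernel name reportArch are placeholders. When compiled with -arch=sm_35 and run on a GPU node, it should report 350.

```cuda
// arch_check.cu: report the architecture the device code was compiled for.
// Illustrative sketch; file and kernel names are hypothetical.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reportArch() {
#ifdef __CUDA_ARCH__
    // __CUDA_ARCH__ is defined only while nvcc compiles device code.
    printf("Device code compiled for __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}

int main() {
    reportArch<<<1, 1>>>();

    // A launch error here usually means the binary does not match the GPU.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();  // wait so the device printf is flushed
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

It can be built the same way as any other CUDA file, for example: nvcc -arch=sm_35 arch_check.cu -o arch_check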

GPU/CUDA Tesla K20 Architecture

Compute Capability: 3.5
Max Threads per Thread Block: 1024
Max Threads per SM: 2048
Max Thread Blocks per SM: 16
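
To confirm these limits on the GPU node you are using, a short device query can help. The following is a minimal sketch built on the CUDA runtime call cudaGetDeviceProperties; the file name device_query.cu is a placeholder, and the program must run on a GPU node to see a device.

```cuda
// device_query.cu: print a few limits of each visible GPU (illustrative sketch).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        std::printf("Device %d: %s\n", dev, prop.name);
        std::printf("  Compute capability:    %d.%d\n", prop.major, prop.minor);
        std::printf("  Max threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("  Max threads per SM:    %d\n",
                    prop.maxThreadsPerMultiProcessor);
        std::printf("  Number of SMs:         %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```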

CUDA C Example Program

The simple vector addition sample program vectorAdd.cu, located in /opt/cuda-8.0/samples/0_Simple/vectorAdd/, is one of the official CUDA samples shipped with the CUDA Toolkit. It randomly generates two vectors of type float and uses the GPU to compute their element-wise sum. At the end, the GPU result is compared with the CPU result to verify that it is correct.
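
The steps described above can be sketched as follows. This is a simplified illustration written for this page, not the shipped vectorAdd.cu source; the file name vecadd_sketch.cu, the CHECK macro, the kernel name vecAdd, and the choice of 50000 elements with 256 threads per block are assumptions made for the example.

```cuda
// vecadd_sketch.cu: illustrative vector addition, not the shipped sample.
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "CUDA error at line %d: %s\n",      \
                         __LINE__, cudaGetErrorString(err_));        \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Each thread adds one pair of elements; the guard skips threads past the end.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 50000;  // element count (assumed for this sketch)
    const size_t bytes = n * sizeof(float);

    // Host vectors filled with random values in [0, 1].
    float *hA = static_cast<float *>(std::malloc(bytes));
    float *hB = static_cast<float *>(std::malloc(bytes));
    float *hC = static_cast<float *>(std::malloc(bytes));
    if (!hA || !hB || !hC) {
        std::fprintf(stderr, "Host allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i) {
        hA[i] = std::rand() / static_cast<float>(RAND_MAX);
        hB[i] = std::rand() / static_cast<float>(RAND_MAX);
    }

    // Device buffers and host-to-device copies.
    float *dA = NULL, *dB = NULL, *dC = NULL;
    CHECK(cudaMalloc((void **)&dA, bytes));
    CHECK(cudaMalloc((void **)&dB, bytes));
    CHECK(cudaMalloc((void **)&dC, bytes));
    CHECK(cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice));

    // Round the grid size up so every element gets a thread.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    CHECK(cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost));

    // Compare the GPU result with the same sum computed on the CPU.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (std::fabs(hA[i] + hB[i] - hC[i]) > 1e-5f) {
            ok = false;
            break;
        }
    }
    std::printf("%s\n", ok ? "Sketch PASSED" : "Sketch FAILED");

    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFree(dC));
    std::free(hA);
    std::free(hB);
    std::free(hC);
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The bounds check in the kernel matters because the grid is rounded up to a whole number of blocks, so the last block usually contains threads with no element to process.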

More sample programs can be found at /opt/cuda-8.0/samples/

Compile the Code

module load cuda

nvcc -arch=sm_35 vectorAdd.cu (copy the program to your home directory or give the full path)

The above command will create an executable file named ‘a.out’. Alternatively, you may specify your executable filename:

nvcc -arch=sm_35 vectorAdd.cu -o outCUDA

To run the CUDA program, you need to request a GPU node. A sample batch file for requesting the GPU node and running the above sample program is provided:

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l feature=gpunode
#PBS -N GPUJob
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
./a.out

Expected Output

[Vector addition of 50000 elements]

Copy input data from the host memory to the CUDA device

CUDA kernel launch with 196 blocks of 256 threads

Copy output data from the CUDA device to the host memory

Test PASSED

Done
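
The block count in the output above is consistent with rounding the element count up to a whole number of 256-thread blocks. The worked arithmetic below assumes that is how the sample sizes its grid:

```latex
\[
\text{blocks} = \left\lceil \frac{50000}{256} \right\rceil
             = \left\lceil 195.3125 \right\rceil = 196,
\qquad
196 \times 256 - 50000 = 176 \ \text{idle threads in the last block}
\]
```

A block size of 256 is also well within the K20 limit of 1024 threads per thread block.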

Running CUDA Programs

Once you have compiled your CUDA code, you will need to run it on one of the GPU nodes. Instructions on how to submit a job are detailed on the GPU Nodes page.

Questions?

If you have questions or want access to the HPC, please reach out to Jeff Braun at jbraun@mtech.edu or Bowen Deng at bdeng1@mtech.edu. For a new account, please complete the form.

New account creation form.