CUDA
CUDA is NVIDIA's parallel computing platform and programming model for running code at high performance on Graphics Processing Units (GPUs). CUDA can currently be used from C, C++, C#, Fortran, Java, and Python, either natively or through language bindings.
CUDA version 8.0 is installed in /opt/cuda-8.0/, and version 9.0 is installed in /opt/cuda-9.0/.
CUDA compiler: nvcc
CUDA file extension: .cu
CUDA Environment
CUDA binaries and libraries are installed in /opt/CUDA. To set the environment for using CUDA, use the module command:
module load cuda/8.0 (or simply module load cuda)
Example of Compiling a CUDA File
Please edit and compile your code on the management node, and run the program on a GPU node. To compile a CUDA file:
nvcc -arch=sm_35 filename.cu
This generates an executable named "a.out" in the current working directory. The -arch=sm_35 option tells the compiler which GPU architecture to target (compute capability 3.5). To enable optimization and name the executable:
nvcc -arch=sm_35 -O3 filename.cu -o outCUDA
This applies level 3 optimization to the host (serial) part of the code and generates an executable named "outCUDA".
GPU/CUDA Tesla K20 Architecture
Compute Capability: 3.5
Max Threads per Thread Block: 1024
Max Threads per SM: 2048
Max Thread Blocks per SM: 16
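The limits above bound how many thread blocks can be resident on one streaming multiprocessor (SM) at a time. The following C++ sketch works through that arithmetic on the host. The function names are hypothetical, and the model is deliberately simplified: it considers only the thread and block limits listed above and ignores register and shared-memory usage, which can reduce the real figure.

```cpp
#include <algorithm>

// Limits for compute capability 3.5 (Tesla K20), as listed in this section.
constexpr int kMaxThreadsPerBlock = 1024;
constexpr int kMaxThreadsPerSM    = 2048;
constexpr int kMaxBlocksPerSM     = 16;

// Upper bound on resident blocks per SM for a given block size.
// Hypothetical helper: ignores register and shared-memory limits.
int maxResidentBlocksPerSM(int threadsPerBlock) {
    if (threadsPerBlock <= 0 || threadsPerBlock > kMaxThreadsPerBlock) {
        return 0;  // not a launchable block size on this device
    }
    return std::min(kMaxBlocksPerSM, kMaxThreadsPerSM / threadsPerBlock);
}

// Threads resident on one SM under the same simplified model.
int maxResidentThreadsPerSM(int threadsPerBlock) {
    return maxResidentBlocksPerSM(threadsPerBlock) * threadsPerBlock;
}
```

Under this model, blocks of 256 threads allow 8 resident blocks per SM (2048 threads), while blocks of 64 threads hit the 16-block limit first and leave only 1024 threads resident.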
CUDA C Example Program
The simple vector addition sample program vectorAdd.cu, located in /opt/cuda-8.0/samples/0_Simple/vectorAdd/, is one of the official CUDA samples shipped with the CUDA Toolkit. It randomly generates two vectors of type float and uses the GPU to compute their element-wise sum. At the end, the GPU result is compared with the CPU result to verify that it is correct.
More sample programs can be found at /opt/cuda-8.0/samples/
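The host-side part of the vector addition workflow described above (random input generation and the final CPU check) can be sketched in plain C++ as follows. This is not the sample's source code: the function names are hypothetical, the tolerance of 1e-5 is an assumption, and the GPU computation itself is left out, so the CPU reference stands in for the device result.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Fill a vector with pseudo-random floats in [0, 1].
std::vector<float> randomVector(std::size_t n) {
    std::vector<float> v(n);
    for (float& x : v) {
        x = static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX);
    }
    return v;
}

// CPU reference: element-wise sum of a and b (assumes equal lengths).
std::vector<float> addOnHost(const std::vector<float>& a,
                             const std::vector<float>& b) {
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// True when every result[i] equals a[i] + b[i] within the tolerance.
// The default tolerance is an assumed value for illustration.
bool matchesHostSum(const std::vector<float>& a,
                    const std::vector<float>& b,
                    const std::vector<float>& result,
                    float tolerance = 1e-5f) {
    if (a.size() != b.size() || a.size() != result.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] + b[i] - result[i]) > tolerance) {
            return false;
        }
    }
    return true;
}
```

In a real CUDA program, the vector passed as result would be the data copied back from the device, and a true return value corresponds to a passing test.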
Compile the Code
module load cuda
nvcc -arch=sm_35 vectorAdd.cu (copy the program to your home directory or give the full path)
The above command will create an executable file named ‘a.out’. Alternatively, you may specify your executable filename:
nvcc -arch=sm_35 vectorAdd.cu -o outCUDA
To run the CUDA program, you need to request a GPU node. A sample batch file for requesting the GPU node and running the above sample program is provided:
#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l feature=gpunode
#PBS -N GPUJob
#PBS -l walltime=00:05:00
cd $PBS_O_WORKDIR
./a.out
Expected Output
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
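The launch configuration in the output above (196 blocks of 256 threads for 50000 elements) is consistent with rounding the block count up so that every element gets a thread. A minimal C++ sketch of that arithmetic, using hypothetical function names:

```cpp
// Number of blocks needed so that blocks * threadsPerBlock >= numElements
// (integer ceiling division).
int blocksPerGrid(int numElements, int threadsPerBlock) {
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

// Threads launched beyond the end of the data. A kernel needs a bounds
// check so that these threads do not touch memory.
int surplusThreads(int numElements, int threadsPerBlock) {
    return blocksPerGrid(numElements, threadsPerBlock) * threadsPerBlock
           - numElements;
}
```

For 50000 elements and 256 threads per block this gives 196 blocks, or 50176 threads in total, of which 176 have no element to process.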
Running CUDA Programs
Once you have compiled your CUDA code, you will need to run it on one of the GPU nodes. Instructions on how to submit a job are detailed on the GPU Nodes page.
If you have questions or want access to the HPC, please reach out to Jeff Braun at jbraun@mtech.edu or Bowen Deng at bdeng1@mtech.edu. For a new account, please complete the form.