# ECE 6397 - GPU Programming

### Course Description

Limitations in the advancement of high-end single-threaded processors have forced new paradigms in software and algorithm development. Research and high-performance applications often require massively parallel systems for simulation, data processing, and data analysis. Several architectures, including NVIDIA’s CUDA and Intel’s Xeon Phi, provide highly parallel performance at low cost. However, algorithms for massively parallel systems are difficult to design and optimize.

In this course, we will focus on the design and development of algorithms that take advantage of highly parallel co-processors, such as the NVIDIA GPU and the Xeon Phi, to solve research-related problems. This course will include an overview of GPU architectures and the principles of programming massively parallel systems. Topics covered will include designing and optimizing parallel algorithms, using available heterogeneous libraries, and case studies in linear systems, n-body problems, deep learning, and differential equations.

Syllabus [ PDF ]

### Lectures

[ PPTX ] [ PDF ] Lecture 00 - syllabus and goals, history of GPUs

[ PPTX ] [ PDF ] Lecture 01 - building cross-platform applications, working on a Linux cluster

[ PPTX ] [ PDF ] Lecture 02 - parallelism concepts and GPU architecture

[ PPTX ] [ BOOK ] Visual Studio tutorial, common C/C++ questions

[ PPTX ] [ PDF ] Lecture 03 - CUDA API, device queries, thread architecture

[ PPTX ] [ PDF ] Lecture 04 - algorithm design, multidimensional grids, efficiency

[ PPTX ] [ PDF ] Lecture 05 - warps, memory transactions, global memory optimization

[ PPTX ] [ PDF ] PA1 Discussion - implementing a 2D Gaussian image filter using global memory

[ PPTX ] [ PDF ] Lecture 06 - thread synchronization, shared memory

[ PPTX ] [ PDF ] Lecture 07 - profiling and debugging

[ PPTX ] [ PDF ] Lecture 08 - matrix multiplication, loop optimizations

[ PPTX ] [ PDF ] Exam 1 Review

[ PPTX ] [ PDF ] Lecture 09 - CUDA libraries, cuBLAS, cuFFT

[ PPTX ] [ PDF ] Lecture 10 - C++ templates, CUDA Thrust

[ PPTX ] [ PDF ] Lecture 11 - concurrent operations, task parallelism, threads

[ PPTX ] [ PDF ] Lecture 12 - CUDA programming in MATLAB

[ PPTX ] [ PDF ] Lecture 13 - C/C++ task parallelism

[ PPTX ] [ PDF ] Lecture 14 - fundamentals of computer graphics

[ FindGLEW ] [ FindGLFW ] - CMake files for GLEW (Windows) and GLFW

[ PPTX ] [ PDF ] Lecture 15 - OpenGL setup [ CMakeLists ] [ CPP ]

[ PPTX ] [ PDF ] Lecture 16 - OpenGL drawing [ CMakeLists ] [ CPP ]

[ PPTX ] [ PDF ] Exam 2 Review

### Homework

HW0 - Verify your ability to develop CUDA code on your platform: [ CMakeLists.txt ] [ main.cu ]

[ PDF ] HW1 - CUDA API, memory allocation, data transfers

[ PDF ] HW2 - Grid design, warps, and control divergence

[ PDF ] HW3 - Memory transactions, alignment, and efficiency

### Programming

[ PDF, PPTX, DOCX ] Final project rubric and templates

[ PPTX ] [ PDF ] Report and presentation guidelines

[ PDF ] PA1 - Separable convolution of large images

[ PDF ] PA2 - Shared memory general matrix multiplication

• [ A ] [ B ] [ I ] where $\mathbf{A} = \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9\end{bmatrix}$, $\mathbf{B} = \begin{bmatrix} 1 & 2 \\ 3 & 4\\ 5 & 6\end{bmatrix}$, and $\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{bmatrix} 22 & 28 \\ 49 & 64\\ 76 & 100\end{bmatrix}$
• [ A ] [ B ] [ I ] C = 32 x 32
• [ A ] [ B ] [ I ] C = 1024 x 1024
• [ A ] [ B ] [ I ] C = 2048 x 2048
• [ A ] [ B ]         C = 4096 x 4096