Fundamentals of Accelerated Computing with Modern CUDA C++

December 12, 2025

This tutorial teaches you the fundamentals of GPU programming and modern CUDA C++. You can watch lectures corresponding to this course on YouTube. You'll find the following content:

Brev Launchables for this tutorial should use:

  • L40S, L4, or T4 instances.
  • Crusoe or any other provider with Flexible Ports.

Notebooks

CUDA Made Easy: Accelerating Applications with Parallel Algorithms

Notebook
01.01.01 Introduction
01.02.01 Execution Spaces
01.02.02 Exercise Annotate Execution Spaces
01.02.03 Exercise Changing Execution Space
01.02.04 Exercise Compute Median Temperature
01.03.01 Extending Algorithms
01.03.02 Exercise Computing Variance
01.04.01 Vocabulary Types
01.04.02 Exercise Mdspan
01.05.01 Serial vs Parallel
01.05.02 Exercise Segmented Sum Optimization
01.05.03 Exercise Segmented Mean
01.06.01 Memory Spaces
01.06.02 Exercise Copy
01.07.01 Summary
01.08.01 Advanced

Unlocking the GPU’s Full Potential: Asynchrony and CUDA Streams

Notebook
02.01.01 Introduction
02.02.01 Asynchrony
02.02.02 Exercise Compute IO Overlap
02.02.03 Exercise Nsight
02.02.04 Exercise NVTX
02.03.01 Streams
02.03.02 Exercise Async Copy
02.04.01 Pinned
02.04.02 Exercise Copy Overlap

Implementing New Algorithms with CUDA Kernels

Notebook
03.01.01 Introduction
03.02.01 Kernels
03.02.02 Exercise Symmetry
03.02.03 Exercise Row Symmetry
03.02.04 Dev Tools
03.03.01 Histogram
03.03.02 Exercise Fix Histogram
03.04.01 Sync
03.04.02 Exercise Histogram
03.05.01 Shared
03.05.02 Exercise Optimize Histogram
03.06.01 Cooperative
03.06.02 Exercise Cooperative Histogram