Analyzing GPU codes
April 30, 2025 ยท View on GitHub
Here, you'll learn how you can analyze CUDA programs with Caliper. To follow along, you can build the tutorial example apps with the "cuda" build configuration:
$ . setup-env.sh cuda
We will use the XSBench app for this section.
Instrumenting XSBench
The XSBench CUDA version lets users choose between several different implementations of the main algorithm. Each version launches multiple different CUDA kernels, including both XSBench's own kernels and kernels from NVidia's thrust library. We mark the different CUDA kernel invocations with Caliper region markers so we can study them individually.
Here is an exerpt from Simulation.cu:
// [...]
CALI_MARK_BEGIN("sampling_kernel");
sampling_kernel<<<nblocks, nthreads>>>( in, GSD );
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
CALI_MARK_END("sampling_kernel");
CALI_MARK_BEGIN("count");
// Count the number of fuel material lookups that need to be performed (fuel id = 0)
int n_fuel_lookups = thrust::count(thrust::device, GSD.mat_samples, GSD.mat_samples + in.lookups, 0);
CALI_MARK_END("count");
CALI_MARK_BEGIN("partition");
// Partition fuel into the first part of the array
thrust::partition(thrust::device, GSD.mat_samples, GSD.mat_samples + in.lookups, GSD.p_energy_samples, is_mat_fuel());
CALI_MARK_END("partition");
CALI_MARK_BEGIN("lookup_kernel");
// Launch all material kernels individually (asynchronous is allowed)
nblocks = ceil( (double) n_fuel_lookups / (double) nthreads);
xs_lookup_kernel_optimization_5<<<nblocks, nthreads>>>( in, GSD, n_fuel_lookups, 0 );
nblocks = ceil( (double) (in.lookups - n_fuel_lookups) / (double) nthreads);
xs_lookup_kernel_optimization_5<<<nblocks, nthreads>>>( in, GSD, in.lookups-n_fuel_lookups, n_fuel_lookups );
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
CALI_MARK_END("lookup_kernel");
// [...]
Basic runtime profiling
Basic time profiling with runtime-report now reveals the time spent in the
different algorithmic phases. It also shows that most of the program runtime
is actually spent in initialization steps. Inside the main simulation routine,
the lookup_kernel phase takes by far the most time.
We can launch measurements via the ConfigManager, and modified XSBench to add
a new command-line parameter (-P), which lets us provide a Caliper
configuration string.
$ XSBench -k 4 -s small -m event -P runtime-report
[...]
Path Time (E) Time (I) Time % (E) Time % (I)
main 0.173986 2.665342 6.527252 99.992872
simulation 0.000023 0.128070 0.000863 4.804669
run_event_based_simulation_optimization_4 0.004479 0.128047 0.168034 4.803807
verification 0.000240 0.000240 0.009004 0.009004
lookup_kernel 0.109309 0.109309 4.100832 4.100832
sort 0.010845 0.010845 0.406861 0.406861
count 0.001737 0.001737 0.065165 0.065165
sampling_kernel 0.001437 0.001437 0.053910 0.053910
move_simulation_data_to_device 2.066417 2.066417 77.523624 77.523624
grid_init_do_not_profile 0.296869 0.296869 11.137326 11.137326
Profiling the CUDA host-side API
We can get more information with additional profiling options.
The profile.cuda option lets you profile the time spent in CUDA API functions
such as cudaMalloc, cudaMemcpy, cudaDeviceSynchronize, etc. You can use
it with the runtime-report or hatchet-region-profile Caliper
configurations. It shows that most program runtime - over 77% - is spent in
a cudaMalloc inside move_simulation_data_to_device.
$ XSBench -k 4 -s small -m event -P runtime-report,profile.cuda
[...]
Path Time (E) Time (I) Time % (E) Time % (I)
main 0.178092 2.830274 6.291905 99.992228
simulation 0.000024 0.127734 0.000848 4.512781
run_event_based_simulation_optimization_4 0.000047 0.127710 0.001660 4.511933
verification 0.000061 0.000342 0.002155 0.012083
cudaFree 0.000015 0.000015 0.000530 0.000530
cudaStreamSynchronize 0.000009 0.000009 0.000318 0.000318
cudaMemcpyAsync 0.000022 0.000022 0.000777 0.000777
cudaGetLastError 0.000003 0.000003 0.000106 0.000106
cudaDeviceSynchronize 0.000158 0.000158 0.005582 0.005582
cudaPeekAtLastError 0.000008 0.000008 0.000283 0.000283
cudaLaunchKernel 0.000023 0.000023 0.000813 0.000813
cudaMalloc 0.000014 0.000014 0.000495 0.000495
cudaOccupancyMaxAct~~iprocessorWithFlags 0.000007 0.000007 0.000247 0.000247
cudaDeviceGetAttribute 0.000005 0.000005 0.000177 0.000177
cudaGetDevice 0.000007 0.000007 0.000247 0.000247
cudaFuncGetAttributes 0.000010 0.000010 0.000353 0.000353
lookup_kernel 0.000043 0.107176 0.001519 3.786477
cudaDeviceSynchronize 0.107016 0.107016 3.780824 3.780824
cudaPeekAtLastError 0.000003 0.000003 0.000106 0.000106
cudaLaunchKernel 0.000114 0.000114 0.004028 0.004028
[...]
move_simulation_data_to_device 0.000113 2.226495 0.003992 78.661004
cudaDeviceSynchronize 0.000028 0.000028 0.000989 0.000989
cudaPeekAtLastError 0.000004 0.000004 0.000141 0.000141
cudaMemcpy 0.024325 0.024325 0.859391 0.859391
cudaMalloc 2.202025 2.202025 77.796491 77.796491
grid_init_do_not_profile 0.297325 0.297325 10.504350 10.504350
cudaGetDeviceProperties 0.000614 0.000614 0.021692 0.021692
cudaGetDevice 0.000014 0.000014 0.000495 0.000495
Profiling GPU activities
Another Caliper measurement recipe - cuda-activity-report - gives us more
detailed information about GPU activities like kernel executions and memory
copies. The report output shows both "Host Time" and "GPU Time": The "Host Time"
shows the time spent in host-side regions on the CPU, similar to runtime-report.
The "GPU Time" column shows the time spent on the GPU in activities like kernel
executions for any activities that were launched at or below this node in the
region hierarchy.
The output below shows that this run took a total of 2.71 seconds (from the
"Host Time" in main), and the GPU was executing 0.14 seconds worth of
activities total ("GPU Time" in main). Going down the region hierarchy we
can see that most of this GPU activity (0.1077 seconds) is from the lookup
kernel - it was launched from cudaLaunchKernel inside the lookup_kernel
Caliper region.
The "GPU %" column compares the GPU activity time with the time on the host,
giving us an idea of the overall GPU utilization. The GPU utilization for the
program as whole was low (5.17%), but in the simulation region itself we
achieved 90.7%.
Note that the cudaLaunchKernel calls are asynchronous. In the example, the
cudaLaunchKernel call under lookup_kernel spends very little time on the
host, but launches 0.1 seconds worth of GPU activities. Correspondingly, we
see that we spend 0.1 seconds on the host in cudaDeviceSynchronize, waiting
for the CUDA kernels to finish. Due to this asynchronous nature, the "GPU %"
metric only makes sense for regions above CUDA synchronization points, like
the lookup_kernel region.
$ XSBench -k 4 -s small -m event -P cuda-activity-report
[...]
Path Host Time GPU Time GPU %
main 2.714768 0.140372 5.170669
simulation 0.128210 0.116375 90.769106
run_event_based_simulation_optimization_4 0.128180 0.116375 90.790648
verification 0.000370 0.000169 45.672395
[...]
lookup_kernel 0.107829 0.107756 99.932302
cudaDeviceSynchronize 0.107646
cudaPeekAtLastError 0.000003
cudaLaunchKernel 0.000148 0.107756 73039.554399
[...]
We can use cuda-activity-profile instead of cuda-activity-report to produce
a JSON file instead of the text output that we can analyze in external tools.
$ XSBench -k 4 -s small -m event -P cuda-activity-profile
[...]
$ ls *.json
cuda_profile.json
Viewing host<->device memory copies
We can use the cuda.memcpy option for cuda-activity-report to print the
amount of data copied between host and device in explicit cudaMemcpy (and
similar) calls.
The XSBench example app does not copy much data from GPU to CPU but does copy
252 MB from CPU to GPU in the move_simulation_data_to_device region.
$ XSBench -k 4 -s small -m event -P cuda-activity-report,cuda.memcpy
[...]
Path Host Time GPU Time GPU % Copy CPU->GPU Copy GPU->CPU
main 2.724224 0.142380 5.226454
simulation 0.130136 0.118380 90.966589
run_event_based_simulation_optimization_4 0.130111 0.118380 90.984216
verification 0.000344 0.000171 49.608941
cudaFree 0.000015
cudaMemcpyAsync 0.000028 0.000003 9.393006 0.000004
[...]
count 0.002818 0.001108 39.301212
cudaFree 0.000175
cudaStreamSynchronize 0.000096
cudaMemcpyAsync 0.000293 0.000023 7.698713 0.000096
[...]
move_simulation_data_to_device 2.120833 0.024000 1.131626
cudaDeviceSynchronize 0.000027
cudaPeekAtLastError 0.000004
cudaMemcpy 0.024433 0.024000 98.226649 252.107056
cudaMalloc 2.096273
[...]
Looking at individual kernels
With the show_kernels option, we can see the individual CUDA __global__
kernel functions that were executed. This includes the CUDA kernels invoked
by libraries like NVidia's thrust library in the XSBench example.
The display is "inclusive": directly under main, we see all kernel functions
that were launched anywhere in the program and the total time spent in them.
As we go down the region hierarchy, we see the exact places where the kernel
functions where launched and their runtime.
Path Kernel Host Time GPU Time GPU %
main
|- 2.723571 0.024071 0.883819
|- sampling_kernel(Inputs, SimulationData) 0.001400
|- void thrust::cuda_cub::~~t>, thrust::plus<long>) 0.000999
|- void thrust::cuda_cub::~~rust::plus<long>, long) 0.000087
|- void thrust::cuda_cub::~~ub::GridEvenShare<int>) 0.000291
|- void thrust::cuda_cub::~~icy700, int>(int*, int) 0.000583
|- void thrust::cuda_cub::~~ub::GridEvenShare<int>) 0.002095
|- void thrust::cuda_cub::~~ub::GridEvenShare<int>) 0.000407
|- void thrust::cuda_cub::~~ub::GridEvenShare<int>) 0.002053
|- void thrust::cuda_cub::~~_true_predicate>, long) 0.000177
|- void thrust::cuda_cub::~~_true_predicate>, long) 0.000337
|- xs_lookup_kernel_optimi~~ionData, int, int, int) 0.108182
|- void thrust::cuda_cub::~~nt>, thrust::plus<int>) 0.000160
|- void thrust::cuda_cub::~~thrust::plus<int>, int) 0.000007
simulation
|- 0.129518 0.000025 0.019197
|- sampling_kernel(Inputs, SimulationData) 0.001400
[...]
|- void thrust::cuda_cub::~~thrust::plus<int>, int) 0.000007
run_event_based_simulation_optimization_4
[...]
verification
|- 0.000341 0.000002 0.693529
|- void thrust::cuda_cub::~~nt>, thrust::plus<int>) 0.000160
|- void thrust::cuda_cub::~~thrust::plus<int>, int) 0.000007
[...]
cudaLaunchKernel
|- 0.000031
|- void thrust::cuda_cub::~~nt>, thrust::plus<int>) 0.000160
|- void thrust::cuda_cub::~~thrust::plus<int>, int) 0.000007
[...]
lookup_kernel
|- 0.108254
|- xs_lookup_kernel_optimi~~ionData, int, int, int) 0.108182
cudaDeviceSynchronize 0.108075
cudaPeekAtLastError 0.000003
cudaLaunchKernel
|- 0.000144
|- xs_lookup_kernel_optimi~~ionData, int, int, int) 0.108182
[...]