📘 Using the Tracer Tool for Accel-Sim

July 16, 2025

This document explains how to use tracer_tool to generate instruction traces for GPU applications. The tool can trace full benchmark suites, individual applications, specific kernels or code regions, and can map traced instructions back to source lines.


🛠️ Setup and Installation

Before using the tracer, make sure to install and build the required tools:

# Install NVBit
./install_nvbit.sh

# Compile tracer tools
make

⚙️ Option 1: Trace a Full Benchmark Suite

Use this if you're tracing an entire benchmark suite (e.g., Rodinia):

./run_hw_trace.py -B rodinia-3.1 -D 0
  • -B: Benchmark suite name (app list can be found or defined in this file)
  • -D: Hardware device ID (e.g., GPU 0)

📁 Traces will be stored in:
../../hw_run/traces/device-0/

This script handles trace generation, post-processing, and cleanup automatically.
See the Trace File Structure section for details on the output.


⚙️ Option 2: Trace an Individual Application

Use this approach if you want to trace a specific application binary (e.g., vectoradd):

export CUDA_VISIBLE_DEVICES=0
LD_PRELOAD=./tracer_tool/tracer_tool.so ./nvbit_release/test-apps/vectoradd/vectoradd

📁 Traces will appear in the traces/ folder.
See the Trace File Structure section for contents.

🔄 Note: Unlike Option 1, you must run the post-processing step manually:

./tracer_tool/traces-processing/post-traces-processing ./traces/kernelslist

This will generate .traceg and kernelslist.g files that are ready for Accel-Sim.
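
The two manual steps above can also be scripted. The following is a minimal sketch, assuming the relative paths shown in this section and that it runs from the same directory as the commands above. The helper names tracer_env and trace_and_postprocess are hypothetical and are not part of the tracer.

```python
import os
import subprocess

# Paths copied from the commands above; adjust them if your layout differs.
TRACER_SO = "./tracer_tool/tracer_tool.so"
POST_PROCESS = "./tracer_tool/traces-processing/post-traces-processing"


def tracer_env(device_id=0, base_env=None):
    """Return a copy of the environment with the tracer preloaded (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(device_id)
    env["LD_PRELOAD"] = TRACER_SO
    return env


def trace_and_postprocess(app_cmd, device_id=0):
    """Run one application under the tracer, then post-process ./traces/kernelslist."""
    # check=True stops the script if either step exits with an error.
    subprocess.run(app_cmd, env=tracer_env(device_id), check=True)
    subprocess.run([POST_PROCESS, "./traces/kernelslist"], check=True)
```

For example, trace_and_postprocess(["./nvbit_release/test-apps/vectoradd/vectoradd"]) would reproduce the vectoradd run shown above, followed by post-processing.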


📦 Trace File Structure

Each trace folder contains:

  • kernel-*.trace: Raw trace files (one per kernel)
  • kernelslist: List of traced kernels and CUDA memcpy operations
  • stats.csv: Summary statistics (instruction counts, kernel IDs, etc.)

After post-processing:

  • .traceg: Trace files with instructions grouped by thread block
  • kernelslist.g: Final trace list for use with Accel-Sim
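
To check quickly whether a trace folder has already been post-processed, the file layout above can be inspected with a short script. This is a sketch that relies only on the file names listed in this section; the function names are hypothetical.

```python
from pathlib import Path


def inventory_traces(trace_dir):
    """Group the files in a trace folder by the roles listed above."""
    root = Path(trace_dir)
    return {
        "raw_traces": sorted(p.name for p in root.glob("kernel-*.trace")),
        "processed_traces": sorted(p.name for p in root.glob("*.traceg")),
        "has_kernelslist": (root / "kernelslist").is_file(),
        "has_kernelslist_g": (root / "kernelslist.g").is_file(),
        "has_stats": (root / "stats.csv").is_file(),
    }


def is_post_processed(trace_dir):
    """True when kernelslist.g and at least one .traceg file exist."""
    inv = inventory_traces(trace_dir)
    return inv["has_kernelslist_g"] and bool(inv["processed_traces"])
```

Calling is_post_processed("./traces") after Option 2 tells you whether the post-processing step still needs to be run.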

🎯 Selective Tracing (Kernel-Based Filtering)

You can trace specific kernels by setting the DYNAMIC_KERNEL_RANGE environment variable.

Usage

Set DYNAMIC_KERNEL_RANGE to specify which kernel IDs (and optionally names) to trace. Supported formats:

  • Single ID:

    export DYNAMIC_KERNEL_RANGE="3"
    

    Traces only kernel 3.

  • Range:

    export DYNAMIC_KERNEL_RANGE="5-8"
    

    Traces kernels 5 through 8 (inclusive).

  • Open-ended Range:

    export DYNAMIC_KERNEL_RANGE="10-"
    

    Traces from kernel 10 onward.

  • Multiple Ranges (space-separated):

    export DYNAMIC_KERNEL_RANGE="2 5-8 10-"

    Traces kernel 2, kernels 5 through 8, and every kernel from 10 onward.

  • With Name Filters (regex):

    export DYNAMIC_KERNEL_RANGE="5-8@kernel_a.*,kernel_b.*"
    

    Traces kernels 5–8 only if their names match kernel_a.* or kernel_b.*.
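
To make the formats above concrete, here is a small Python sketch of how such a string can be interpreted. It is an illustration of the semantics described in this section, not the tracer's actual parser. The function names are hypothetical, and two behaviours are assumptions: a name filter applies only to the range it is attached to, and a regex must match the whole kernel name.

```python
import re


def parse_kernel_ranges(spec):
    """Parse a DYNAMIC_KERNEL_RANGE-style string into (start, end, patterns) tuples.

    end is None for an open-ended range; patterns is a list of regex strings.
    """
    ranges = []
    for token in spec.split():
        id_part, _, name_part = token.partition("@")
        start_text, dash, end_text = id_part.partition("-")
        start = int(start_text)
        if not dash:
            end = start            # single ID, e.g. "3"
        elif end_text:
            end = int(end_text)    # closed range, e.g. "5-8"
        else:
            end = None             # open-ended range, e.g. "10-"
        patterns = [p for p in name_part.split(",") if p]
        ranges.append((start, end, patterns))
    return ranges


def should_trace(kernel_id, kernel_name, spec):
    """Return True if the kernel is in any range and passes that range's name filter."""
    for start, end, patterns in parse_kernel_ranges(spec):
        in_range = kernel_id >= start and (end is None or kernel_id <= end)
        # Assumption: a name filter must match the whole kernel name.
        name_ok = not patterns or any(re.fullmatch(p, kernel_name) for p in patterns)
        if in_range and name_ok:
            return True
    return False
```

With this sketch, should_trace(6, "kernel_a_init", "5-8@kernel_a.*,kernel_b.*") is true, while the same call with kernel ID 9 or with the name "other" is false.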

To Disable Tracing But Still List Kernels

To list kernel metadata in stats.csv without generating traces, set DYNAMIC_KERNEL_RANGE to a kernel ID larger than any the application launches, for example:

export DYNAMIC_KERNEL_RANGE="1000000"

This is useful for discovering kernel IDs and names without producing large trace files.


⏱️ Alternative: Trace Specific Code Regions with CUDA Profiling

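Wrap the region you want to trace in CUDA profiler API calls:

#include <cuda_profiler_api.h>  // declares cudaProfilerStart() and cudaProfilerStop()

cudaProfilerStart();
// region to trace
cudaProfilerStop();

Then disable tracing from program start so that only the wrapped region is traced: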

export ACTIVE_FROM_START=0

🔍 Trace Source Line Mapping

Enable source line information in your traces:

  1. Set the environment variable:
export TRACE_LINEINFO=1
  2. Rebuild the benchmark applications:
source ./gpu-app-collection/src/setup_environment
make -j -C ./gpu-app-collection/src rodinia_2.0-ft

Traces now include line numbers from the original CUDA source. This requires the applications to be compiled with NVCC's -lineinfo flag.


📄 Trace Format Explanation

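Each instruction line has at least 10 required columns. Raw trace lines start with four of them, the threadblock coordinates (x, y, z) and the warp ID within the threadblock, followed by: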

[line_num] PC mask dest_num [reg_dests] opcode src_num [reg_srcs] mem_width [addresscompress?] [mem_addresses]

Details:

  • Fields in [] are optional and appear only if applicable.
  • dest_num = 0 → no destination register field.
  • mem_width = 0 → no memory address info present.

🧾 Example:

31 0 0 3 0000 ffffffff 1 R1 IMAD.MOV.U32 2 R255 R255 0

This line represents:

  • Threadblock: (31, 0, 0)
  • Warp ID within the threadblock: 3
  • PC: 0000, Mask: ffffffff
  • One destination register: R1
  • Opcode: IMAD.MOV.U32
  • Two source registers: R255, R255
  • Not a memory instruction (mem_width = 0)
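
The field layout can be checked against the example with a small parser. The following Python sketch is not part of the tracer and makes a few assumptions: each raw trace line starts with the three threadblock coordinates and a warp ID (as in the example line), the optional line_num field sits immediately before the PC, and PC and mask are hexadecimal. The names TraceInstruction and parse_trace_line are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TraceInstruction:
    threadblock: Tuple[int, int, int]
    warp_id: int
    line_num: Optional[int]
    pc: int
    mask: int
    dests: List[str]
    opcode: str
    srcs: List[str]
    mem_width: int
    mem_fields: List[str]  # unparsed [addresscompress?] [mem_addresses] tokens


def parse_trace_line(line, has_lineinfo=False):
    """Parse one raw trace line into a TraceInstruction."""
    tok = line.split()
    pos = 0

    def take(n=1):
        # Consume the next n whitespace-separated fields.
        nonlocal pos
        chunk = tok[pos:pos + n]
        if len(chunk) < n:
            raise ValueError("trace line ended early: " + line)
        pos += n
        return chunk

    tb = tuple(int(t) for t in take(3))
    warp_id = int(take()[0])
    # Assumption: line_num (present with TRACE_LINEINFO=1) precedes the PC.
    line_num = int(take()[0]) if has_lineinfo else None
    pc = int(take()[0], 16)
    mask = int(take()[0], 16)
    dests = take(int(take()[0]))   # dest_num, then that many registers
    opcode = take()[0]
    srcs = take(int(take()[0]))    # src_num, then that many registers
    mem_width = int(take()[0])
    mem_fields = tok[pos:] if mem_width else []
    return TraceInstruction(tb, warp_id, line_num, pc, mask,
                            dests, opcode, srcs, mem_width, mem_fields)
```

Passing the example line above to parse_trace_line yields threadblock (31, 0, 0), destination R1, opcode IMAD.MOV.U32, sources R255 and R255, and mem_width 0.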