HIP CPU Runtime Performance Considerations

November 29, 2020 · View on GitHub

General Considerations

Prefer larger block sizes when not using barriers
Prefer doing more work per block element (HIP / CUDA thread)
- Avoid bijective mappings where each element in the execution domain performs only one computation on a single element from the data domain, as these do not amortise setup / tear-down costs;
- See the following for an argument in favour of this being desirable on the GPU as well:
TODO

Pitfalls

Barriers
- Due to their reliance on O(block_size) fiber switches, functions that have barrier semantics, such as __syncthreads(), induce significant slowdown;
- It is preferable to avoid using barriers if possible, especially since, unlike on GPUs, __shared__ memory does not provide performance benefits;
- If you must use barriers, prefer smaller block sizes such as 8 or 16.
FP16 i.e. __half
- The HIP CPU Runtime provides correct but low-performance support for FP16 computation, in order to ensure that code which uses __half or __half2 is portable;
- It is preferable to avoid using __half or __half2 arithmetic.
Passing large arguments by-value to __global__ functions
- This is a commonly encountered anti-pattern, stemming from the presence of dedicated memory for arguments on some (generally older) GPUs;
  - It is generally disadvantageous when targeting an AMD GPUs, thus the guidance below applies to them as well, and leads to performance portable code;
- If sizeof(T) > 32 for a type T that is the type of an argument passed to a function, strongly prefer pass-by-pointer / pass-by-reference.
Excessive unrolling via #pragma unroll
- This is a commonly encountered anti-pattern, stemming from historical weaknesses in GPU compiler optimisation, which are no longer present in modern toolchains;
- Can be extremely harmful due to trashing the I$
  - See mixbench for an example;
- Strongly prefer deferring to the compiler on matters of unrolling.
Composition with MPI
- The interaction between the underlying implementation of the C++ Standard Library and, more specifically, its Parallel Algorithms component, can and will interact in opaque ways with any MPI driven scheduling;
- Experiment with pinning / affinity of MPI tasks if performance is low in such cases.

Recommended Reading

General Considerations

Pitfalls