Change Log for HIP
June 2, 2026 · View on GitHub
Full documentation for HIP is available at rocm.docs.amd.com
HIP 7.13 for ROCm 7.13
Added
- New HIP APIs
cooperative_groups::reduce()allows calling reduce operators onthread_block_tileandcoalesced_threads. The implementation is based on the__reduce_*_syncoperations, so the macroHIP_ENABLE_EXTRA_WARP_SYNC_TYPESmay be needed to unlock some optimizations.
- New device attribute
hipDeviceAttributeGPUDirectRDMAWithHipVMMSupported, indicating support for GPU Direct RDMA when using HIP VMM. This attribute corresponds to CUDA’sCU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED.
Resolved issues
- A segmentation fault that occurred in child graphs during the graph‑launch phase. The issue originated from the entire graph being launched solely according to the parent graph’s scheduling logic. The HIP runtime now introduces a per‑graph segment‑scheduling control flag and propagates the parent graph’s scheduling mode to its child graphs, ensuring consistent scheduling behavior (classic vs. segment) and preventing failures when the parent falls back to classic scheduling.
- A segmentation fault caused by passing a null pointer to the hipMemGetAddressRange API. The function now handles null pointers correctly, matching the behavior of the corresponding CUDA API.
Changed
__reduce_and_sync(),__reduce_or_sync()and__reduce_xor_sync()now provide a consistent behavior for all masks values and with CUDA. Before, some masks would be translated to bitwise operations but other would not (like the ones containing "holes"). Now all masks cause bitwise instructions to be emitted. This is a change of behavior from previous versions.
Optimized
-
Improves HIP runtime error logging when an application's fat binary does not include a compatible code object for the detected GPU architecture, offering clearer guidance to rebuild with the appropriate
--offload-arch=gfxXXXXoption. -
Enables in‑memory and background‑thread asynchronous logging in the HIP runtime by default to improve overall logging capability. This behavior can be disabled by setting the environment variable
AMD_LOG_ASYNC=0.
HIP 7.12 for ROCm 7.12
Added
-
New HIP APIs
- Library Management
Support for the following APIs for parity with the corresponding CUDA APIs.
hipKernelSetAttributesets an attribute for a kernelhipKernelGetAttributereturns information about a kernelhipKernelGetFunctionreturns a function handle
- Memory Management
- Added support for
hipMipmappedArrayGetMemoryRequirements, which returns memory requirements for HIP mipmapped arrays and ensures parity with CUDA APIs.
- Added support for
- Cooperative Groups
- Support for
barrierAPIsbarrier_arriveandbarrier_waithas been added for bothgrid_groupandthread_blockto enable finer‑grained synchronization within cooperative groups. - Support for
block_rankin the classgrid_group, returns the rank of the block in the calling thread
- Support for
- Dynamic logging, no matching CUDA APIs exist
hipExtEnableLoggingenables HIP runtime logginghipExtDisableLoggingdisables HIP runtime logginghipExtSetLoggingParamssets HIP runtime logging parameters
- Library Management
Support for the following APIs for parity with the corresponding CUDA APIs.
-
New HIP device attributes
hipDeviceAttributeExpertSchedModehas been added to hipDeviceAttribute_t to indicate whether expert scheduling mode is supported on AMD GPUs.hipDeviceAttributeDmaBufSupportedis now supported, enabling buffer sharing.
Removed
- roc-obj* tools and Perl dependency.
Resolved issues
- An error that occurred during HIP graph stream capture in thread‑local capture mode has been fixed. The HIP runtime now updates its validation logic to ensure that captures running in other threads on different streams no longer invalidate or block the thread‑local capture in the current thread.
- A segmentation fault that occurred during HIP graph capture. The HIP runtime has updated its large‑graph handling mechanism to prevent stack overflow.
- Incorrect return codes from
hipEventQueryandhipEventSynchronizewhen invoked under mixed stream‑capture modes. The HIP runtime now correctly handles capture‑mode restrictions for event operations. - A segmentation fault that occurred when retrieving an allocation handle with
hipMemRetainAllocationHandle. The HIP runtime now correctly retains the generic allocation object to prevent memory‑management issues. - Resolved a graph node scheduling issue in multistream execution that, in some cases, led to unnecessary kernel‑execution stalls.
Optimized
- HIP log-level control capabilities HIP runtime adds dynamic logging functionalities, enabling applications to programmatically enable, disable, and configure logging at runtime without modifying environment variables or restarting the application. The result is more precise control over diagnostic output, making it easier to debug targeted code paths or minimize log noise during performance‑critical execution.
- HIP Graph Segmented Execution: Graph nodes are grouped into segments and dispatched across multiple GPU streams to enable parallel execution.
- Batching: Each stream receives a single
AccumulateCommandthat aggregates all kernel dispatches and submits them efficiently as one batch. - Synchronization: When a segment depends on work running on another stream, a hardware wait is inserted. At completion, all parallel streams synchronize back to the launch stream.
- Signaling: Segments emit hardware signals only when downstream segments require them—typically at fork points or when executing in parallel with other segments.
- Batching: Each stream receives a single
This approach reduces dispatch overhead and improves GPU utilization by overlapping independent graph work across streams while preserving correct execution order.
- Optimized graph stream synchronization by eliminating duplicate marker creation when syncing streams back to the launch stream. The runtime now tracks synchronized dependency segments to avoid redundant synchronization markers.
- Optimized
hipMemcpyBatchAsyncwith refactored code, new data structures, and an improved core implementation for better performance.
HIP 7.11 for ROCm 7.11
Added
- New HIP API
hipKernelGetParamInforeturns the offset and size of a kernel parameter.
- New HIP flag
HIP_POINTER_ATTRIBUTE_IS_LEGACY_HIP_IPC_CAPABLEis now supported in thehipPointerGetAttributeAPI, providing parity with the equivalent CUDA attribute.
Resolved issues
- A bug in inter‑GPU copy operations has been fixed by ensuring that the SDMA engine allocator is always queried for inter‑GPU transfers instead of reusing a previously cached engine. Because the allocator applies specialized logic to choose high‑bandwidth engines for each source–destination agent pair, reusing an engine selected for a different copy type could lead to reduced performance or incorrect behavior.
- An error in
hipMemRangeGetAttributethat occurred when memory was allocated withhipMallocAsynchas been resolved. The HIP runtime now correctly handles coherency‑range mode for memory‑pool pointers in the ROCm device implementation. - A race condition in the packet batch‑write logic has been fixed, where the Command Processor (CP) fetcher could read malformed packets. The update now invalidates all packet headers before writing packet bodies and then validates the headers in a defined order to prevent the fetcher from accessing incomplete packets.
- A deadlock that occurred when
hipMallocAsyncwas used after launching a persistent or long‑running kernel in another stream has been resolved. The HIP runtime now removes the default‑stream wait during mapping operations, preventing the stall. - An incorrect granularity value returned for device memory when requesting the recommended granularity through the
hipMemGetAllocationGranularityAPI has been fixed.
Optimized
- HIP runtime implemented a global SDMA engine allocator with per‑stream affinity to improve memory copy performance.
- Packet batch‑dispatch optimization: A new graph‑segment scheduling mechanism has been added to the HIP runtime to reduce CPU overhead during HIP graph launches. It uses hierarchical path discovery to construct execution segments that can be dispatched efficiently in parallel, replacing the traditional topological‑ordering approach.
- Improved
hipGraphLaunchparallelism for complex data‑parallel graphs. The HIP runtime now eliminates recursion, applies topological ordering, and removes an extra loop inhipGraphLaunchto streamline execution.
HIP 7.2.4 for ROCm 7.2.4
Resolved issues
- Fixed H2D memory copy latency regression in CPX mode. HIP runtime synchronization behavior has been corrected on AMD Instinct MI300 Series GPUs in CPX mode, restoring latency to previous levels for inference workloads that run multiple HIP streams with concurrent memory copies.
Optimized
- Reduced
hipGraphLaunchlatency for multi-list graphs. The HIP runtime’s graph dispatch mechanism has been optimized, reducing launch latency for workloads usinghipGraphLaunchwith multi-list graph topologies.
HIP 7.2.1 for ROCm 7.2.1
Resolved issues
- Corrected the validation of stream capture in global‑capture mode. It is no longer affected by any thread‑local capture‑mode sequences occurring in other threads.
- Corrected the return value of
hipEventQueryandhipEventSynchronize. The HIP runtime now properly handles and restricts stream capture within these APIs. - Corrected an issue in the batch-dispatch doorbell for AQL packets to avoid a potential CPU hang.
- To address potential delays in memory‑object destruction that could affect application logic, the HIP runtime disables memory‑object reference counting in direct‑dispatch mode.
Changed
- The
AMD_DIRECT_DISPATCHenvironment variable has been deprecated in the HIP runtime.
HIP 7.2 for ROCm 7.2
Added
- New HIP APIs
hipLibraryEnumerateKernelsreturns Kernel handles within a libraryhipKernelGetLibraryreturns Library handle for a hipKernel_t handlehipKernelGetNamereturns function name for a hipKernel_t handlehipLibraryLoadDatacreates library object from codehipLibraryLoadFromFilecreates library object from filehipLibraryUnloadunloads libraryhipLibraryGetKernelgets a kernel from libraryhipLibraryGetKernelCountgets kernel count in libraryhipStreamCopyAttributescopies attributes from source stream to destination streamhipOccupancyAvailableDynamicSMemPerBlockreturns dynamic shared memory available per block when launching numBlocks blocks on CU.hipMemSetMemPoolSets the current memory pool for a memory location and allocation typehipMemGetMemPoolGets the current memory pool for a memory location and of a particular allocation typehipMemPrefetchBatchAsyncPrefetches a batch of memory ranges to the specified locations
- New HIP flags
hipMemLocationTypeHost, enables handling virtual memory management in host memory location, in addition to device memory.- Support for flags in
hipGetProcAddress, enables searching for the per-thread version symbols.HIP_GET_PROC_ADDRESS_DEFAULTHIP_GET_PROC_ADDRESS_LEGACY_STREAMHIP_GET_PROC_ADDRESS_PER_THREAD_DEFAULT_STREAM
Resolved issues
- Corrected the calculation of the value of maximum shared memory per multiprocessor, in HIP device properties.
Optimized
- Graph node scaling: HIP runtime implements optimized doorbell ring mechanism for certain topologies of graph execution. It enables efficient batching of graph nodes. This enhancement provides better alignment with CUDA Graph optimizations. HIP also adds a new performance test for HIP graphs with programmable topologies to measure graph performance across different structures. The test evaluates graph instantiation time, first launch time, repeat launch times, and end-to-end execution for various graph topologies. The test implements comprehensive timing measurements including CPU overhead and device execution time.
- Back memory set (
memset) optimization: HIP runtime now implements a back memory set (memset) optimization to improve howmemsetnodes are processed during graph execution. This enhancement specifically handles varying number of AQL (Architected Queue Language) packets formemsetgraph node due to graph node set params for AQL batch submission approach. - Async handler performance improvement: HIP runtime has removed the lock contention in async handler enqueue path. This enhancement reduces runtime overhead and maximizes GPU throughput, for asynchronous kernel execution, especially in multi-threaded applications.
HIP 7.1.1 for ROCm 7.1.1
Added
- Support for the flag
hipHostRegisterIoMemoryinhipHostRegister, used to register I/O memory with HIP runtime so it can be accessed by the GPU.
Resolved issues
- Incorrect Compute Unit (CU) mask in logging. HIP runtime now correctly sets the field width for the output print operation. When logging is enabled via the environment variable
AMD_LOG_LEVEL, the runtime logs the accurate CU mask. - A segmentation fault occurred when dynamic queue management mechanism was enabled. HIP runtime now ensures GPU queues aren't NULL during marker submission, preventing crashes and improving robustness.
- An error encountered on hip tear-down after device reset in certain applications due to accessing stale memory objects. HIP runtime now properly releases memory associated with host calls, ensuring reliable device resets.
- A race condition occurred in certain graph-related applications when pending asynchronous signal handlers referenced device memory that had already been released, leading to memory corruption. HIP runtime now uses a reference counting strategy to manage access to device objects in asynchronous event handlers, ensuring safe and reliable memory usage.
HIP 7.1 for ROCm 7.1
Added
- New HIP APIs
hipModuleGetFunctionCountreturns the number of functions within a modulehipMemsetD2D8sets 2D memory range with specified 8-bit valueshipMemsetD2D8Asyncasynchronously sets 2D memory range with specified 8-bit valueshipMemsetD2D16sets 2D memory range with specified 16-bit valueshipMemsetD2D16Asyncasynchronously sets 2D memory range with specified 16-bit valueshipMemsetD2D32sets 2D memory range with specified 32-bit valueshipMemsetD2D32Asyncasynchronously sets 2D memory range with specified 32-bit valueshipStreamSetAttributesets attributes such as synchronization policy for a given streamhipStreamGetAttributereturns attributes such as priority for a given streamhipModuleLoadFatBinaryloads fatbin binary to a modulehipMemcpyBatchAsyncasynchronously performs a batch copy of 1D or 2D memoryhipMemcpy3DBatchAsyncasynchronously performs a batch copy of 3D memoryhipMemcpy3DPeercopies memory between deviceshipMemcpy3DPeerAsyncasynchronously copies memory between deviceshipMemsetD2D32Asyncasynchronously sets 2D memory range with specified 32-bit valueshipMemPrefetchAsync_v2prefetches memory to the specified locationhipMemAdvise_v2advises about the usage of a given memory rangehipGetDriverEntryPointgets function pointer of a HIP API.hipSetValidDevicessets a default list of devices that can be used by HIPhipStreamGetIdqueries the id of a stream
- Support for nested tile partitioning within cooperative groups, matching NVIDIA CUDA functionality.
Resolved issues
- A segmentation fault occurred in application when capturing the same HIP graph from multiple streams with cross-stream dependencies. HIP runtime fixed an issue where a forked stream joined to a parent stream which was not originally created with the API
hipStreamBeginCapture. - Different behavior of en-queuing command on a legacy stream during stream capture on AMD ROCM platform, compared with NVIDIA CUDA. HIP runtime now returns an error in this specific situation, to behave the same as CUDA.
- Failure of memory access fault occurred in rocm-examples test suite. When Heterogeneous Memory Management (HMM) is not supported in the driver,
hipMallocManagedwill only allocate system memory in HIP runtime.
Optimized
- Improved hip module loading latency.
- Optimized kernel metadata retrieval during module post load.
- Optimized doorbell ring in HIP runtime for the following performance improvements:
- Makes efficient packet batching for HIP graph launch,
- Dynamic packet copying based on defined maximum threshold or power-of-2 staggered copy pattern,
- If timestamps are not collected for a signal for reuse, creates a new signal. This can potentially increase signal footprint if the handler doesn't run fast enough.
Known issues
- SPIR-V-enabled applications may encounter an issue of segmentation fault. The problem disappears when SPIR-V is disabled. The issue will be fixed in the next ROCm release.
HIP 7.0.2 for ROCm 7.0.2
Added
- Support for the
hipMemAllocationTypeUncachedflag, enabling developers to allocate uncached memory. This flag is now supported in the following APIs:hipMemGetAllocationGranularitydetermines the recommended allocation granularity for uncached memory.hipMemCreateallocates memory with uncached properties.
Resolved issues
- A compilation failure affecting applications that compile kernels using
hiprtcwith the compiler optionstd=c++11. - A permission-related error occurred during the execution of hipLaunchHostFunc. This API is now supported and permitted to run during stream capture, aligning its behavior with CUDA.
- A numerical error during graph capture of kernels that rely on a remainder in
globalWorkSize, in frameworks like MIOpen and PyTorch, where the grid size is not a multiple of the block size. To ensure correct replay behavior, HIP runtime now stores this remainder inhip::GraphKernelNodeduringhipExtModuleLaunchKernelcapture, enabling accurate execution and preventing corruption. - A page fault occurred during viewport rendering while running the file undo.blend in Blender. The issue was resolved by the HIP runtime, which reused the same context during image creation.
- Resolved a segmentation fault in
gpu_metrics, which is used in threshold logic for command submission patches to GPU device(s) during CPU synchronization.
HIP 7.0 for ROCm 7.0
Added
- New HIP APIs
hipLaunchKernelExdispatches the provided kernel with the given launch configuration and forwards the kernel arguments.hipLaunchKernelExClaunches a HIP kernel using a generic function pointer and the specified configuration.hipDrvLaunchKernelExdispatches the device kernel represented by a HIP function object.hipMemGetHandleForAddressRangegets a handle for the address range requested.num_threadsTotal number of threads in the group. The legacy API size is alias.
- New support for Open Compute Project (OCP) floating-point
FP4/FP6/FP8as the following. For details, see Low precision floating point document.- Data types for
FP4/FP6/FP8. - HIP APIs for
FP4/FP6/FP8, which are compatible with corresponding CUDA APIs. - HIP Extensions APIs for microscaling formats, which are supported on AMD GPUs.
- Data types for
- New
wptrandrptrvalues inClPrint, for better logging in dispatch barrier methods. - The
_sync()version of crosslane builtins such asshfl_sync()are enabled by default. These can be disabled by setting the preprocessor macroHIP_DISABLE_WARP_SYNC_BUILTINS. - Added
constexproperators forfp16/bf16. - Added warp level primitives:
__syncwarpand reduce intrinsics (e.g.__reduce_add_sync()) - Support for the flags in APIs as following, now allows uncached memory allocation.
hipExtHostRegisterUncached, used inhipHostRegister.hipHostMallocUncachedandhipHostAllocUncached, used inhipHostMallocandhipHostAlloc.
num_threadstotal number of threads in the group. The legacy API size is alias.- Added PCI CHIP ID information as the device attribute.
- Added new tests applications for OCP data types
FP4/FP6/FP8. - A new attribute in HIP runtime was implemented which exposes a new device capability of how many compute dies (chiplets, xcc) are available on a given GPU. Developers can get this attribute via the API
hipDeviceGetAttribute, to make use of the best cache locality in a kernel, and optimize the Kernel launch grid layout, for performance improvement.
Changed
- Deprecated GPUs. Some unsupported GPUs such as gfx9, gfx8 and gfx7 are deprecated on Microsoft Windows.
- Removal of Beta warnings in HIP Graph APIs All Beta warnings in usage of HIP Graph APIs are removed, they are now officially and fully supported.
- Behavior changes
hipGetLastErrornow returns the error code which is the last actual error caught in the current thread during the application execution.- Cooperative groups in
hipLaunchCooperativeKernelMultiDeviceandhipLaunchCooperativeKernelfunctions, additional input parameter validation checks are added. hipPointerGetAttributesreturnshipSuccessinstead of an error with invalid valuehipErrorInvalidValue, in caseNULLhost or attribute pointer is passed as input parameter. It now matches the functionality ofcudaPointerGetAttributeswhich changed with CUDA 11 and above releases.hipFreepreviously there was an implicit wait which was applicable for all memory allocations, for synchronization purpose. This wait is now disabled for allocations made withhipMallocAsyncandhipMallocFromPoolAsync, to match the behavior of CUDA APIcudaFreehipFreeAsyncnow returnshipSuccesswhen the input pointer is NULL, instead ofhipErrorInvalidValue, to be consistent withhipFree.- Exceptions occurring during a kernel execution will not abort the process anymore but will return an error unless core dump is enabled.
- Changes in hipRTC.
- Removal of
hipRTCsymbols from HIP Runtime Library. Any application usinghipRTCAPIs should link explicitly with thehipRTClibrary. This makes the usage ofhipRTClibrary on Linux the same as on Windows and matches the behavior of CUDAnvRTC. hipRTCcompilation The device code compilation now uses namespace__hip_internal, instead of the standard headersstd, to avoid namespace collision.- Changes of datatypes from
hipRTC. Datatype definitions such asint64_t,uint64_t,int32_t, anduint32_t, etc. are removed to avoid any potential conflicts in some applications. HIP now uses internal datatypes instead, prefixed with__hip, for example,__hip_int64_t.
- Removal of
- HIP header clean up
- Usage of STD headers, HIP header files only include necessary STL headers.
- Deprecated structure
HIP_MEMSET_NODE_PARAMSis removed. Developers can use the definitionhipMemsetParamsinstead.
- API signature/struct changes
- API signatures are adjusted in some APIs to match corresponding CUDA APIs. Impacted APIs are as folloing:
hiprtcCreateProgramhiprtcCompileProgramhipMemcpyHtoDhipCtxGetApiVersion
- HIP struct change in
hipMemsetParams, it is updated and compatible with CUDA. - HIP vector constructor change in
hipComplexinitialization now generates correct values. The affected constructors will be small vector types such asfloat2,int4, etc.
- API signatures are adjusted in some APIs to match corresponding CUDA APIs. Impacted APIs are as folloing:
- Stream Capture updates
- Restricted stream capture mode, it is made in HIP APIs via adding the macro
CHECK_STREAM_CAPTURE_SUPPORTED (). In the previous HIP enumerationhipStreamCaptureMode, three capture modes were defined. With checking in the macro, the only supported stream capture mode is nowhipStreamCaptureModeRelaxed. The rest are not supported, and the macro will returnhipErrorStreamCaptureUnsupported. This update involves the following APIs, which is allowed only in relaxed stream capture mode,hipMallocManagedhipMemAdvise
- Checks stream capture mode, the following APIs check the stream capture mode and return error codes to match the behavior of CUDA.
hipLaunchCooperativeKernelMultiDevicehipEventQueryhipStreamAddCallback
- Returns error during stream capture. The following HIP APIs now returns specific error
hipErrorStreamCaptureUnsupportedon the AMD platform, but not alwayshipSuccess, to match behavior with CUDA.hipDeviceSetMemPoolhipMemPoolCreatehipMemPoolDestroyhipDeviceSetSharedMemConfighipDeviceSetCacheConfighipMemcpyWithStream
- Restricted stream capture mode, it is made in HIP APIs via adding the macro
- Error code update
Returned error/value codes are updated in the following HIP APIs to match the corresponding CUDA APIs.
- Module Management Related APIs
hipModuleLaunchKernelhipExtModuleLaunchKernelhipExtLaunchKernelhipDrvLaunchKernelExhipLaunchKernelhipLaunchKernelExChipModuleLaunchCooperativeKernelhipModuleLoad
- Texture Management Related APIs
The following APIs update the return codes to match the behavior with CUDA:
hipTexObjectCreate, supports zero width and height for 2D image. If either is zero, will not returnfalse.hipBindTexture2D, adds extra check, if pointer for texture reference or device is NULL, returnshipErrorNotFound.hipBindTextureToArray, if any NULL pointer is input for texture object, resource descriptor, or texture descriptor, returns errorhipErrorInvalidChannelDescriptor, instead ofhipErrorInvalidValue.hipGetTextureAlignmentOffset, adds a return codehipErrorInvalidTexturewhen the texture reference pointer is NULL.
- Cooperative Group Related APIs, more calidations are added in the following API implementation,
hipLaunchCooperativeKernelMultiDevicehipLaunchCooperativeKernel
- Module Management Related APIs
- Invalid stream input parameter handling
In order to match the CUDA runtime behavior more closely, HIP APIs with streams passed as input parameters no longer check the stream validity. Previously, the HIP runtime returned an error code
hipErrorContextIsDestroyedif the stream was invalid. In CUDA version 12 and later, the equivalent behavior is to raise a segmentation fault. HIP runtime now matches the CUDA by causing a segmentation fault. The list of APIs impacted by this change are as follows:- Stream Management Related APIs
hipStreamGetCaptureInfohipStreamGetPriorityhipStreamGetFlagshipStreamDestroyhipStreamAddCallbackhipStreamQueryhipLaunchHostFunc
- Graph Management Related APIs
hipGraphUploadhipGraphLaunchhipStreamBeginCaptureToGraphhipStreamBeginCapturehipStreamIsCapturinghipStreamGetCaptureInfohipGraphInstantiateWithParams
- Memory Management Related APIs
hipMemcpyPeerAsynchipMemcpy2DValidateParamshipMallocFromPoolAsynchipFreeAsynchipMallocAsynchipMemcpyAsynchipMemcpyToSymbolAsynchipStreamAttachMemAsynchipMemPrefetchAsynchipDrvMemcpy3DhipDrvMemcpy3DAsynchipDrvMemcpy2DUnalignedhipMemcpyParam2DhipMemcpyParam2DAsynchipMemcpy2DArrayToArrayhipMemcpy2DhipMemcpy2DAsynchipDrvMemcpy2DUnalignedhipMemcpy3D
- Event Management Related APIs
hipEventRecordhipEventRecordWithFlags
- Stream Management Related APIs
warpSizeChange In order to match the CUDA specification, thewarpSizevariable is no longerconstexpr. In general, this should be a transparent change; however, if an application was usingwarpSizeas a compile-time constant, it will have to be updated to handle the new definition. For more information, see either the discussion ofwarpSizewithin the HIP C++ language extensions.
Optimized
HIP runtime has the following functional improvements which greatly improve runtime performance and user experience.
- Reduced usage of the lock scope in events and kernel handling.
- Switches to
shared_mutexfor event validation, usesstd::unique_lockin HIP runtime to create/destroy event, instead ofscopedLock. - Reduces the
scopedLockin handling of kernel execution. HIP runtime now callsscopedLockduring kernel binary creation/initialization, doesn't call it again during kernel vector iteration before launch.
- Switches to
- Implementation of unifying managed buffer and kernel argument buffer so HIP runtime doesn't need to create/load a separate kernel argument buffer.
- Refactored memory validation, creates a unique function to validate a variety of memory copy operations.
- Improved kernel logging using demangling shader names.
- Advanced support for SPIRV, now kernel compilation caching is enabled by default. This feature is controlled by the environment variable
AMD_COMGR_CACHE, for details, see hip_rtc document. - Programmatic support for scratch limits on MI300 and MI350 series up GPU devices. More enumeration values were added in
hipLimit_tas following,hipExtLimitScratchMin, minimum allowed value in bytes for scratch limit on the device.hipExtLimitScratchMax, maximum allowed value in bytes for scratch limit on the device.hipExtLimitScratchCurrent, current scratch limit threshold in bytes on the device. Must be between the valuehipExtLimitScratchMinandhipExtLimitScratchMax. Developers can now use the environment variableHSA_SCRATCH_SINGLE_LIMIT_ASYNCto change the default allocation size with expected scratch limit in ROCR runtime. On top of it, this value can also be overwritten programmatically in the application using the HIP APIhipDeviceSetLimit(hipExtLimitScratchCurrent, value)to reset the scratch limit value.
- HIP runtime now enables peer-to-peer (P2P) memory copies to utilize all available SDMA engines, rather than being limited to a single engine. It also selects the best engine first to give optimal bandwidth.
- Improved launch latency for
D2Dcopies andmemseton MI300 series. - Introduced a threshold to handle the command submission patch to the GPU device(s), considering the synchronization with CPU, for performance improvement.
Resolved issues
- Error of "unable to find modules" in HIP clean up for code object module.
- The issue of incorrect return error
hipErrorNoDevice, when a crash occurred on GPU device due to illegal operation or memory violation. HIP runtime now handles the failure on the GPU side properly and reports the precise error code based on the last error seen on the GPU. - Failures in some framework test applications, HIP runtime fixed the bug in retrieving a memory object from the IPC memory handle.
- A crash in TensorFlow related application. HIP runtime now combines multiple definitions of
callbackQueueinto a single function, in case of an exception, passes its handler to the application and provides corresponding error code. - Fixed issue of handling the kernel parameters for the graph launch.
- Failures in roc-obj tools. HIP runtime now makes
DEPRECATEDmessage in roc-obj tools asSTDERR. - Support of
hipDeviceMallocContiguousflags inhipExtMallocWithFlags(). It now enablesHSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAGin the memory pool allocation on GPU device. - Compilation failure, HIP runtime refactored the vector type alignment with
__hip_vec_align_v - A numerical error/corruption found in Pytorch during graph replay. HIP runtime fixed the input sizes of kernel launch dimensions in hipExtModuleLaunchKernel for the execution of hipGraph capture.
- A crash during kernel execution in a customer application. The structure of kernel arguments was updated via adding the size of kernel arguments, and HIP runtime does validation before launch kernel with the structured arguments.
- Compilation error when using bfloat16 functions. HIP runtime removed the anonymous namespace from FP16 functions to resolve this issue.
HIP 6.4.2 for ROCm 6.4.2
Added
- HIP API implementation for
hipEventRecordWithFlags, records an event in the specified stream with flags. - Support for the pointer attribute
HIP_POINTER_ATTRIBUTE_CONTEXT. - Support for the flags
hipEventWaitDefaultandhipEventWaitExternal.
Optimized
- Improved implementation in
hipEventSynchronize, HIP runtime now makes internal callbacks as non-blocking operations to improve performance.
Resolved issues
- Issue of dependency on
libgcc-s1during rocm-dev install on Debian Buster. HIP runtime removed this Debian package dependency, and useslibgcc1instead for this distros. - Building issue for
COMGRdynamic load on Fedora and other Distros. HIP runtime now doesn't link againstlibamd_comgr.so. - Failure in the API
hipStreamDestroy, when stream type ishipStreamLegacy. The API now returns error codehipErrorInvalidResourceHandleon this condition. - Kernel launch errors, such as
shared object initialization failed,invalid device functionorkernel execution failure. HIP runtime now loadsCOMGRproperly considering the file with its name and mapped image. - Memory access fault in some applications. HIP runtime fixed offset accumulation in memory address.
- The memory leak in virtual memory management (VMM). HIP runtime now uses the size of handle for allocated memory range instead of actual size for physical memory, which fixed the issue of address clash with VMM.
- Large memory allocation issue. HIP runtime now checks GPU video RAM and system RAM properly and sets size limits during memory allocation either on the host or the GPU device.
- Support of
hipDeviceMallocContiguousflags inhipExtMallocWithFlags(). It now enablesHSA_AMD_MEMORY_POOL_CONTIGUOUS_FLAGin the memory pool allocation on GPU device. - Radom memory segmentation fault in handling
GraphExecobject release andhipDeviceSyncronization. HIP runtime now uses internal device synchronize function in__hipUnregisterFatBinary.
HIP 6.4.1 for ROCm 6.4.1
Added
- New log mask enumeration
LOG_COMGRenables logging precise code object information.
Changed
- HIP runtime uses device bitcode before SPIRV.
- The implementation of preventing
hipLaunchKernellatency degradation with number of idle streams is reverted/disabled by default. - Stop using
__AMDGCN_WAVEFRONT_SIZEandwarpSizeas compile-time constants. ThewarpSizevariable is no longerconstexpr, in order to match the CUDA specification. See more details of thewarpSizechange within the ROCm 6.4.1 deprecation notice.
Optimized
- Improved kernel logging includes de-mangling shader names.
- Refined implementation in HIP APIs
hipEventRecordsandhipStreamWaitEventfor performance improvement.
Resolved issues
- Stale state during the graph capture. The return error was fixed, HIP runtime now always uses the latest dependent nodes during
hipEventRecordcapture. - Segmentation fault during kernel execution. HIP runtime now allows maximum stack size as per ISA on the GPU device.
HIP 6.4 (For ROCm 6.4)
Added
- New HIP APIs
hipDeviceGetTexture1DLinearMaxWidthreturns the maximum width of elements in a 1D linear texture, that can be allocated on the specified device.hipStreamBatchMemOpenqueues an array of batch memory operations in the stream, for stream synchronization.hipGraphAddBatchMemOpNodecreates a batch memory operation node and adds it to a graph.hipGraphBatchMemOpNodeGetParamsreturns the pointer of parameters from the batch memory operation node.hipGraphBatchMemOpNodeSetParamssets parameters for the batch memory operation node.hipGraphExecBatchMemOpNodeSetParamssets the parameters for a batch memory operation node in the given executable graph.hipLinkAddDataadds SPIRV code object data to linker instance with options.hipLinkAddFileadds SPIRV code object file to linker instance with options.hipLinkCreatecreates linker instance at runtime with options.hipLinkCompletecompletes linking of program and output linker binary to use with hipModuleLoadData.hipLinkDestroydeletes linker instance.
Changed
- roc-obj* tools are being deprecated, and will be removed in an upcoming release.
- Perl package dependencies are now RECOMMENDS or SUGGESTS. Users will need to install these themselves.
- Support for ROCm Object tooling has moved into llvm-objdump provided by package rocm-llvm.
- SDMA retainer logic is removed for engine selection in operation of runtime buffer copy.
Optimized
hipGraphLaunchparallelism is improved for complex data-parallel graphs.- Round-robin queue mechanism is updated for command scheduling. For multi-streams execution, HSA queue from null stream lock is freed and won't occupy the queue ID after the kernel in the stream is finished.
- The HIP runtime doesn't free bitcode object before code generation. It adds a cache, which allows compiled code objects to be reused instead of recompiling. This improves performance on multi-GPU systems.
- Runtime uses unified copy approach
- Unpinned
H2Dcopies are no longer blocking until the size of 1MB. - Kernel copy path is enabled for unpinned
H2D/D2Hmethods. - The default environment variable
GPU_FORCE_BLIT_COPY_SIZEis set to16, which limits the kernel copy to sizes less than 16 KB, while copies about that would be handled bySDMAengine. - Blit code is refactored and ASAN instrumentation is cleaned up.
- Unpinned
- HIP runtime uses signals without interrupts.
- In active wait mode, uses signals without interrupts by default.
- Only when a callback is required, switches to the interrupts.
Resolved issues
- Out of memory error on Windows. When the user calls
hipMallocfor device memory allocation while specifying a size larger than the available device memory, the HIP runtime fixes the error in the API implementation, allocating the available device memory plus system memory (shared virtual memory). - Error of dependency on libgcc-s1 during rocm-dev install on Debian Buster. HIP runtime now uses libgcc1 for this distros.
- Stack corruption during kernel execution. HIP runtime now adds maximum stack size limit based on the GPU device feature.
Upcoming changes
The following are the list of backwards incompatible changes planned for the upcoming major ROCm release.
-
Signature changes in APIs to match corresponding CUDA APIs,
hiprtcCreateProgramhiprtcCompileProgramhipCtxGetApiVersion
-
Behavior of
hipPointerGetAttributesis changed to match corresponding CUDA API in version 11 and later releases. -
Behavior of
hipFreeis changed to match corresponding CUDA APIcudaFree. -
HIP vector constructor changes for
hipComplex. -
Return error/value codes update in the following hip APIs, they now match the corresponding CUDA APIs,
hipModuleLaunchKernelhipExtModuleLaunchKernelhipModuleLaunchCooperativeKernelhipGetTextureAlignmentOffsethipTexObjectCreatehipBindTexture2DhipBindTextureToArrayhipModuleLoadhipLaunchCooperativeKernelMultiDevicehipExtLaunchCooperativeKernelMultiDevice
-
HIPRTC implementation, the compilation of hiprtc now uses namespace
__hip_internal, instead of the standard headersstd. -
Stream capture mode update in the following hip APIs. Stream can only be captured in relax mode, to match the behavior of the corresponding CUDA APIs,
hipMallocManagedhipMemAdvisehipLaunchCooperativeKernelMultiDevicehipDeviceSetCacheConfighipDeviceSetSharedMemConfighipMemPoolCreatehipMemPoolDestoryhipDeviceSetMemPoolhipEventQuery
-
The implementation of
hipStreamAddCallbackis updated, to match the behavior of CUDA. -
Removal of hiprtc symbols from hip library.
- hiprtc will be a independent library, all symbols supported in hip library are removed.
- Any application using hiprtc APIs should link explicitly with hiprtc library.
- This change makes the usage of hiprtc library on Linux the same as on Windows, and matches the behavior of CUDA nvrtc.
-
Removal of deprecated struct
HIP_MEMSET_NODE_PARAMS, developers can use definitionhipMemsetParamsinstead. -
warpSizechange. Usages of__AMDGCN_WAVEFRONT_SIZEandwarpSizeas compile-time constants will be removed in HIP header files. In order to match the CUDA specification, thewarpSizevariable is no longerconstexpr. If an application was usingwarpSizeas a compile-time constant, it will have to be updated to handle the new definition. For details usage of thewarpSize, see the best practice for warpSize handling.
HIP 6.3.2 for ROCm 6.3.2
Added
- Tracking of Heterogeneous System Architecture (HSA) handlers:
- Adds an atomic counter to track the outstanding HSA handlers.
- Waits on CPU for the callbacks if the number exceeds the defined value.
- Codes to capture Architected Queueing Language (AQL) packets for HIP graph memory copy node between host and device. HIP enqueues AQL packets during graph launch.
- Control to use system pool implementation in runtime commands handling. By default, it is disabled.
- A new path to avoid
WaitAnycalls inAsyncEventsLoop. The new path is selected by default. - Runtime control on decrement counter only if event is popped. There is a new way to restore dead signals cleanup for the old path.
- A new logic in runtime to track the age of events from the kernel mode driver.
Optimized
- HSA callback performance. The HIP runtime creates and submits commands in the queue and interacts with HSA through a callback function. HIP waits for the CPU status from HSA to optimize handling of events, profiling, commands, and HSA signals for higher performance.
- Runtime optimisation which combines all logic of
WaitAnyin a single processing loop and avoids extra memory allocations or reference counting. The runtime won't spin on the CPU if all events are busy. - Multi-threaded dispatches for performance improvement.
- Command submissions and processing between CPU and GPU by introducing a way to limit the software batch size.
- Switch to
std::shared_mutexin book/keep logic in streams from multiple threads simultaneously, for performance improvement in specific customer applications. std::shared_mutexis used in memory object mapping, for performance improvement.
Resolved issues
- Race condition in multi-threaded producer/consumer scenario with
hipMallocFromPoolAsync. - Segmentation fault with
hipStreamLegacywhile using the APIhipStreamWaitEvent. - Usage of
hipStreamLegacyin HIP event record. - A soft hang in graph execution process from HIP user object. The fix handles the release of graph execution object properly considering synchronization on the device/stream. The user application now behaves the same with hipUserObject on both the AMD ROCm and NVIDIA CUDA platforms.
HIP 6.3.1 for ROCm 6.3.1
Added
- An activeQueues set that tracks only the queues that have a command submitted to them, which allows fast iteration in
waitActiveStreams.
Optimized
- Mechanism of preventing
hipLaunchKernellatency degradation with number of idle streams is implemented for performance improvement.
HIP 6.3 for ROCm 6.3
Added
- New HIP APIs
hipGraphExecGetFlagsreturns the flags on executable graph.hipGraphNodeSetParamsupdates parameters of a created node.hipGraphExecNodeSetParamsupdates parameters of a created node on executable graph.hipDrvGraphMemcpyNodeGetParamsgets a memcpy node's parameters.hipDrvGraphMemcpyNodeSetParamssets a memcpy node's parameters.hipDrvGraphAddMemFreeNodecreates a memory free node and adds it to a graph.hipDrvGraphExecMemcpyNodeSetParamssets the parameters for a memcpy node in the given graphExec.hipDrvGraphExecMemsetNodeSetParamssets the parameters for a memset node in the given graphExec.
Changed
- Un-deprecated HIP APIs
hipHostAllochipFreeHost
Optimized
- Disabled CPU wait in device synchronize to avoid idle time in applications such as Hugging Face models and PyTorch.
- Optimized multi-threaded dispatches to improve performance.
- Limited the software batch size to control the number of command submissions for runtime to handle efficiently.
- Optimizes HSA callback performance when a large number of events are recorded by multiple threads and submitted to multiple GPUs.
- HIP graph execution perfomance improvement.
- Added the optimized multistream path in graph execution. It uses a fixed number of async streams in the execution
- Optimized the launch latency, where commands creation and execution is done at the same time
- Optimized the scheduling to use less barriers and waiting signals if the same queue can be detected
- The new path is controlled by a new environment variable, with the options either to use the original path, or to force the number of asynchronous queues for execution.
Resolved issues
- Soft hang in runtime wait event when run TensorFlow.
- Memory leak in the API
hipGraphInstantiatewhen kernel is launched usinghipExtLaunchKernelGGLwith event. - Memory leak when the API
hipGraphAddMemAllocNodeis called. - The
_sync()version of crosslane builtins such asshfl_sync(),__all_sync()and__any_sync(), continue to be hidden behind the preprocessor macroHIP_ENABLE_WARP_SYNC_BUILTINS, and will be enabled unconditionally in the next ROCm release.
HIP 6.2.41134 for ROCm 6.2.1
Resolved issues
- Soft hang when use AMD_SERIALIZE_KERNEL.
- Memory leak in hipIpcCloseMemHandle.
HIP 6.2 (For ROCm 6.2)
Added
-
Introduced the
_sync()version of crosslane builtins such asshfl_sync(),__all_sync()and__any_sync(). These take a 64-bit integer as an explicit mask argument.- In HIP 6.2, these are hidden behind the preprocessor macro
HIP_ENABLE_WARP_SYNC_BUILTINS, and will be enabled unconditionally in HIP 6.3.
- In HIP 6.2, these are hidden behind the preprocessor macro
-
Added new HIP APIs
hipGetProcAddressreturns the pointer to driver function, corresponding to the defined driver function symbol.hipGetFuncBySymbolreturns the pointer to device entry function that matches entry function symbolPtr.hipStreamBeginCaptureToGraphbegins graph capture on a stream to an existing graph.hipGraphInstantiateWithParamscreates an executable graph from a graph.hipMemcpyAtoAcopies from one 1D array to another.hipMemcpyDtoAcopies from device memory to a 1D array.hipMemcpyAtoDcopies from one 1D array to device memory.hipMemcpyAtoHAsynccopies from one 1D array to host memory.hipMemcpyHtoAAsynccopies from host memory to a 1D array.hipMemcpy2DArrayToArraycopies data between host and device.
-
Added a new flag
integratedsupport in device propertyThe
integratedflag is added in the structhipDeviceProp_t. On the integratedAPUsystem, the runtime driver detects and sets this flag to1, in which case the APIhipDeviceGetAttributereturns enumhipDeviceAttribute_tfor hipDeviceAttributeIntegrated as value1, for integrated GPU device.The enum value
hipDeviceAttributeIntegratedcorresponds tocudaDevAttrIntegratedon CUDA platform. -
Added initial support for 8-bit floating point datatype in
amd_hip_fp8.h. These are accessible via#include <hip/hip_fp8.h> -
Add UUID support for environment variable
HIP_VISIBLE_DEVICES.
Resolved issues
- Stream capture support in HIP graph. Prohibited and unhandled operations are fixed during stream capture in HIP runtime.
- Fix undefined symbol error for hipTexRefGetArray & hipTexRefGetBorderColor.
HIP 6.1 (For ROCm 6.1)
Added
- New environment variable HIP_LAUNCH_BLOCKING It is used for serialization on kernel execution. The default value is 0 (disable), kernel will execute normally as defined in the queue. When this environment variable is set as 1 (enable), HIP runtime will serialize kernel enqueue, behaves the same as AMD_SERIALIZE_KERNEL.
- Added HIPRTC support for hip headers driver_types, math_functions, library_types, math_functions, hip_math_constants, channel_descriptor, device_functions, hip_complex, surface_types, texture_types.
Changed
- HIPRTC now assumes WGP mode for gfx10+. CU mode can be enabled by passing
-mcumodeto the compile options fromhiprtcCompileProgram.
Resolved issues
-
HIP complex vector type multiplication and division operations. On AMD platform, some duplicated complex operators are removed to avoid compilation failures. In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types, typedef float2 hipFloatComplex; typedef double2 hipDoubleComplex; Any application uses complex multiplication and division operations, need to replace '*' and '/' operators with the following,
- hipCmulf() and hipCdivf() for hipFloatComplex
- hipCmul() and hipCdiv() for hipDoubleComplex
Note: These complex operations are equivalent to corresponding types/functions on NVIDIA platform.
HIP 6.0 (For ROCm 6.0)
Added
-
Addition of hipExtGetLastError
- AMD backend specific API, to return error code from last HIP API called from the active host thread
-
New fields for external resource interoperability,
- Structs
- hipExternalMemoryHandleDesc_st
- hipExternalMemoryBufferDesc_st
- hipExternalSemaphoreHandleDesc_st
- hipExternalSemaphoreSignalParams_st
- hipExternalSemaphoreWaitParams_st
- Enumerations
- hipExternalMemoryHandleType_enum
- hipExternalSemaphoreHandleType_enum
- hipExternalMemoryHandleType_enum
- Structs
-
New members are added in HIP struct hipDeviceProp_t, for new feature capabilities including,
-
Texture
- int maxTexture1DMipmap;
- int maxTexture2DMipmap[2];
- int maxTexture2DLinear[3];
- int maxTexture2DGather[2];
- int maxTexture3DAlt[3];
- int maxTextureCubemap;
- int maxTexture1DLayered[2];
- int maxTexture2DLayered[3];
- int maxTextureCubemapLayered[2];
-
Surface
- int maxSurface1D;
- int maxSurface2D[2];
- int maxSurface3D[3];
- int maxSurface1DLayered[2];
- int maxSurface2DLayered[3];
- int maxSurfaceCubemap;
- int maxSurfaceCubemapLayered[2];
-
Device
- hipUUID uuid;
- char luid[8]; -- this is 8-byte unique identifier. Only valid on windows -- LUID (Locally Unique Identifier) is supported for interoperability between devices.
- unsigned int luidDeviceNodeMask; \
Note: HIP supports LUID only on Windows OS.
-
-
Added
amd_hip_bf16.hwhich addsbfloat16type. These definitions are accessible via#include <hip/hip_bf16.h>This header exists alongside the older bfloat16 header inamd_hip_bfloat16.hwhich is included viahip/hip_bfloat16.h. Users are recommended to use<hip/hip_bf16.h>instead of<hip/hip_bfloat16.h>.
Changed
- Some OpenGL Interop HIP APIs are moved from the hip_runtime_api header to a new header file hip_gl_interop.h for the AMD platform, as following,
- hipGLGetDevices
- hipGraphicsGLRegisterBuffer
- hipGraphicsGLRegisterImage
- With ROCm 6.0, the HIP version is 6.0. As the HIP runtime binary suffix is updated in every major ROCm release, in ROCm 6.0, the new filename is libamdhip64.so.6. Furthermore, in ROCm 6.0 release, the libamdhip64.so.5 binary from ROCm 5.7 is made available to maintain binary backward compatibility with ROCm 5.x.
Changed Impacting Backward Compatibility
- Data types for members in HIP_MEMCPY3D structure are changed from "unsigned int" to "size_t".
- The value of the flag hipIpcMemLazyEnablePeerAccess is changed to “0x01”, which was previously defined as “0”.
- Some device property attributes are not currently support in HIP runtime, in order to maintain consistency, the following related enumeration names are changed in hipDeviceAttribute_t
- hipDeviceAttributeName is changed to hipDeviceAttributeUnused1
- hipDeviceAttributeUuid is changed to hipDeviceAttributeUnused2
- hipDeviceAttributeArch is changed to hipDeviceAttributeUnused3
- hipDeviceAttributeGcnArch is changed to hipDeviceAttributeUnused4
- hipDeviceAttributeGcnArchName is changed to hipDeviceAttributeUnused5
- HIP struct hipArray is removed from driver type header to be complying with cuda
- hipArray_t replaces hipArray*, as the pointer to array.
- This allows hipMemcpyAtoH and hipMemcpyHtoA to have the correct array type which is equivalent to coresponding CUDA driver APIs.
Removed
- Deprecated Heterogeneous Compute (HCC) symbols and flags are removed from the HIP source code, including,
- Build options on obsolete HCC_OPTIONS was removed from cmake.
- Micro definitions are removed. HIP_INCLUDE_HIP_HCC_DETAIL_DRIVER_TYPES_H HIP_INCLUDE_HIP_HCC_DETAIL_HOST_DEFINES_H
- Compilation flags for the platform definitions, AMD platform, HIP_PLATFORM_HCC HCC HIP_ROCclr NVIDIA platform, HIP_PLATFORM_NVCC
- File directories in the clr repository are removed, https://github.com/ROCm/clr/blob/develop/hipamd/include/hip/hcc_detail https://github.com/ROCm/clr/blob/develop/hipamd/include/hip/nvcc_detail
- Deprecated gcnArch is removed from hip device struct hipDeviceProp_t.
- Deprecated "enum hipMemoryType memoryType;" is removed from HIP struct hipPointerAttribute_t union.
- Deprecated HIT based tests are removed from HIP project
- Catch tests are available [hip-tests] (https://github.com/ROCm/hip-tests) project
Resolved issues
- Kernel launch maximum dimension validation is added specifically on gridY and gridZ in the HIP API hipModule-LaunchKernel. As a result,when hipGetDeviceAttribute is called for the value of hipDeviceAttributeMaxGrid-Dim, the behavior on the AMD platform is equivalent to NVIDIA.
- The HIP stream synchronisation behavior is changed in internal stream functions, in which a flag "wait" is added and set when the current stream is null pointer while executing stream synchronisation on other explicitly created streams. This change avoids blocking of execution on null/default stream. The change won't affect usage of applications, and makes them behave the same on the AMD platform as NVIDIA.
- Error handling behavior on unsupported GPU is fixed, HIP runtime will log out error message, instead of creating signal abortion error which is invisible to developers but continued kernel execution process. This is for the case when developers compile any application via hipcc, setting the option --offload-arch with GPU ID which is different from the one on the system.
Known Issues
- Dynamically loaded HIP runtime library references incorrect version of hipDeviceGetProperties and hipChooseDevice APIs
When an application dynamically loads the HIP runtime library from ROCm 6.0 and attempts to get the hipDeviceGetProperties and/or hipChooseDevice entry-points using dlsym, the application gets the older version (ROCm 5.7) of those entry-points.
As a workaround, while compiling with ROCm 6.0, use the string "hipDeviceGetPropertiesR0600", and "hipChooseDeviceR0600" respectively for hipDeviceGetProperties and hipChooseDevice APIs.
HIP 5.7.1 (For ROCm 5.7.1)
Resolved issues
- hipPointerGetAttributes API returns the correct HIP memory type as hipMemoryTypeManaged for managed memory.
HIP 5.7 (For ROCm 5.7)
Added
-
Added meta_group_size/rank for getting the number of tiles and rank of a tile in the partition
-
Added new APIs supporting Windows only, under development on Linux
-
hipMallocMipmappedArray for allocating a mipmapped array on the device
-
hipFreeMipmappedArray for freeing a mipmapped array on the device
-
hipGetMipmappedArrayLevel for getting a mipmap level of a HIP mipmapped array
-
hipMipmappedArrayCreate for creating a mipmapped array
-
hipMipmappedArrayDestroy for destroy a mipmapped array
-
hipMipmappedArrayGetLevel for getting a mipmapped array on a mipmapped level
-
Known Issues
- HIP memory type enum values currently don't support equivalent value to cudaMemoryTypeUnregistered, due to HIP functionality backward compatibility.
- HIP API hipPointerGetAttributes could return invalid value in case the input memory pointer was not allocated through any HIP API on device or host.
Upcoming changes
- Removal of gcnarch from hipDeviceProp_t structure
- Addition of new fields in hipDeviceProp_t structure
- maxTexture1D
- maxTexture2D
- maxTexture1DLayered
- maxTexture2DLayered
- sharedMemPerMultiprocessor
- deviceOverlap
- asyncEngineCount
- surfaceAlignment
- unifiedAddressing
- computePreemptionSupported
- hostRegisterSupported
- uuid
- Removal of deprecated code -hip-hcc codes from hip code tree
- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
- HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
- Correct hipGetLastError to return the last error instead of last API call's return code
- Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
- Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
- Remove hiparray* and make it opaque with hipArray_t
HIP 5.6.1 (For ROCm 5.6.1)
Resolved issues
- Enabled xnack+ check in HIP catch2 tests hang while tests execution
- Memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs
- Resolved an issue of crash while using hipGraphAddMemFreeNode
HIP 5.6 (For ROCm 5.6)
Added
- Added hipRTC support for amd_hip_fp16
- Added hipStreamGetDevice implementation to get the device assocaited with the stream
- Added HIP_AD_FORMAT_SIGNED_INT16 in hipArray formats
- hipArrayGetInfo for getting information about the specified array
- hipArrayGetDescriptor for getting 1D or 2D array descriptor
- hipArray3DGetDescriptor to get 3D array descriptor
Changed
- hipMallocAsync to return success for zero size allocation to match hipMalloc
- Separation of hipcc perl binaries from HIP project to hipcc project. hip-devel package depends on newly added hipcc package
- Consolidation of hipamd, ROCclr, and OpenCL repositories into a single repository called clr. Instructions are updated to build HIP from sources in the HIP Installation guide
- Removed hipBusBandwidth and hipCommander samples from hip-tests
Optimized
- Consolidation of hipamd, rocclr and OpenCL projects in clr
- Optimized lock for graph global capture mode
Resolved issues
- Fixed regression in hipMemCpyParam3D when offset is applied
Known Issues
- Limited testing on xnack+ configuration
- Multiple HIP tests failures (gpuvm fault or hangs)
- hipSetDevice and hipSetDeviceFlags APIs return hipErrorInvalidDevice instead of hipErrorNoDevice, on a system without GPU
- Known memory leak when code object files are loaded/unloaded via hipModuleLoad/hipModuleUnload APIs. Issue will be fixed in future release
Upcoming changes
- Removal of gcnarch from hipDeviceProp_t structure
- Addition of new fields in hipDeviceProp_t structure
- maxTexture1D
- maxTexture2D
- maxTexture1DLayered
- maxTexture2DLayered
- sharedMemPerMultiprocessor
- deviceOverlap
- asyncEngineCount
- surfaceAlignment
- unifiedAddressing
- computePreemptionSupported
- hostRegisterSupported
- uuid
- Removal of deprecated code -hip-hcc codes from HIP code tree
- Correct hipArray usage in HIP APIs such as hipMemcpyAtoH and hipMemcpyHtoA
- HIPMEMCPY_3D fields correction to avoid truncation of "size_t" to "unsigned int" inside hipMemcpy3D()
- Renaming of 'memoryType' in hipPointerAttribute_t structure to 'type'
- Correct hipGetLastError to return the last error instead of last API call's return code
- Update hipExternalSemaphoreHandleDesc to add "unsigned int reserved[16]"
- Correct handling of flag values in hipIpcOpenMemHandle for hipIpcMemLazyEnablePeerAccess
- Remove hiparray* and make it opaque with hipArray_t