YAKL: YAKL is A Kokkos Layer
May 8, 2026
A Simple Kokkos-based C++ Framework for Performance Portability and Fortran Code Porting
April, 2026 Changes: `yakl::Array` is now derived from `Kokkos::View`, `yakl::DeviceSpace` is now a `Kokkos::MemorySpace`, and YAKL now uses C++20
- This update reduced duplication of Kokkos functionality, reducing the YAKL code volume to about half of its previous size.
- YAKL in its current form works with Kokkos versions 4.3 - 5.1 and will continue to work with any version > 5.1 as long as the Kokkos `KOKKOS_IMPL_SHARED_ALLOCATION_SPECIALIZATION`, `KOKKOS_IMPL_HOST_INACCESSIBLE_SHARED_ALLOCATION_SPECIALIZATION`, `KOKKOS_IMPL_SHARED_ALLOCATION_RECORD_EXPLICIT_INSTANTIATION`, and `KOKKOS_IMPL_HOST_INACCESSIBLE_SHARED_ALLOCATION_RECORD_EXPLICIT_INSTANTIATION` macro functions are defined and behave as they have from 2024-2026.
- Most importantly, `yakl::Array` (formerly `yakl::CArray`) and `yakl::Array_F` (formerly `yakl::FArray`) are now derived from `Kokkos::View` and, by default, use `View` member functions, data, constructors, and destructors. Further, `yakl::DeviceSpace` is now formally a `Kokkos::MemorySpace` and can be used as such for `yakl::Array`, `yakl::Array_F`, and `Kokkos::View` class objects.
  - There are now just two `yakl::Array` and `yakl::Array_F` template parameters: (1) the Kokkos-like `data_type`, e.g., `float **`, which replaces the previous template parameters for the value type, `T`, and the integer rank, `N`; and (2) the memory space, e.g., `yakl::DeviceSpace` or `Kokkos::HostSpace`. All YAKL objects are assumed to take on one of these two memory spaces, meaning each object is host or device resident just like before.
    - You can construct a Kokkos View `data_type` (e.g., `int ***`) using the `template <class T, int N> yakl::ViewType` struct. E.g., `using my_view_type = typename yakl::ViewType<float,4>::type`, which sets `my_view_type` to `float ****`.
    - `template <class KT, class MemSpace = yakl::DeviceSpace> yakl::Array[_F]`: The `yakl::Array` class assumes `Kokkos::LayoutRight` (row-major indexing where the last index varies the fastest), and the `yakl::Array_F` class assumes `Kokkos::LayoutLeft` (column-major indexing where the first index varies the fastest).
    - All ctors, copy ctors, move ctors, and copy and move assignment operators for `yakl::Array` are derived directly from `Kokkos::View`. Because `yakl::Array` inherits ctors, dtors, assignment `operator=`, and `operator()` directly from `Kokkos::View`, it should for all intents and purposes behave just like a `View` in the Kokkos ecosystem with the traditional YAKL bells and whistles like: `Array::deep_copy_to`, `operator=(T rhs) requires std::is_arithmetic_v<T>`, `slice`, `subset_slowest_dimension`, `reshape`, `collapse`, `createDeviceObject`, `createHostObject`, `createDeviceCopy`, `createHostCopy`, `as`, `extents`, `ubounds`, `lbounds`, `begin`, `end`, and `operator<<`. The same is not guaranteed to be true for `Array_F` because it was prudent to not include the `View` ctors and copy and move assignment operators to avoid accidentally creating an `Array_F` object without the lower bounds properly specified.
    - For `yakl::Array_F` (the Fortran-style `Array` class) you need to use the constructors `Array_F(std::string label, {lower1,upper1}[, ...])` and `Array_F( value_type * ptr, {lower1,upper1}[, ...])`, where the initializer lists populate `yakl::Array_F::AB` struct objects that contain a lower (inclusive) and upper (inclusive) bound for each dimension of the array.
  - `yakl::Array` and `yakl::Array_F` objects now use Kokkos's `SharedAllocationRecord` internals, which changes behavior in two ways:
    - You can no longer name or debug-by-name a non-owning `Array` or `Array_F` object.
    - When you call `slice`, `reshape`, `collapse`, or `subset_slowest_dimension` on the host, the result is no longer reference counted with the creating object like it was before. It is instead a non-owning `View`, so the user must make sure the owning `Array` object that created it does not fall out of scope and deallocate its memory before the non-owning object is finished being used.
  - The YAKL `Style` template parameter for `yakl::Array` objects no longer exists. Rather, the user needs to explicitly create an `Array` or `Array_F` object for a C-style or Fortran-style `Array` object, respectively.
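The `data_type` and layout conventions described in the list above can be sketched without YAKL or Kokkos. The block below is a minimal standalone illustration, not YAKL's implementation: `ViewTypeSketch`, `offset_right`, and `offset_left` are hypothetical names invented here. It shows how a value type and a rank can be combined into a pointer-style `data_type`, and how a 2-D index maps to a memory offset under row-major (C-style) indexing and under column-major (Fortran-style) indexing with inclusive lower bounds. It assumes densely packed storage with no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for yakl::ViewType<T,N>: append N pointer levels to T.
template <class T, int N>
struct ViewTypeSketch {
  using type = typename ViewTypeSketch<T, N - 1>::type *;
};
template <class T>
struct ViewTypeSketch<T, 0> {
  using type = T;
};

// Compile-time checks of the mapping described in the text.
static_assert(std::is_same_v<ViewTypeSketch<float, 4>::type, float ****>);
static_assert(std::is_same_v<ViewTypeSketch<double const, 3>::type, double const ***>);

// Row-major (LayoutRight-like) offset for a 2-D array with extents (n0, n1)
// and zero-based indices: the last index varies the fastest.
constexpr std::size_t offset_right(std::size_t i0, std::size_t i1,
                                   std::size_t /*n0*/, std::size_t n1) {
  return i0 * n1 + i1;
}

// Column-major (LayoutLeft-like) offset for a 2-D array whose first dimension
// has inclusive bounds [l0,u0] and whose second dimension starts at l1:
// the first index varies the fastest.
constexpr std::size_t offset_left(std::ptrdiff_t i0, std::ptrdiff_t i1,
                                  std::ptrdiff_t l0, std::ptrdiff_t u0,
                                  std::ptrdiff_t l1) {
  std::ptrdiff_t n0 = u0 - l0 + 1;  // extent of the first dimension
  return static_cast<std::size_t>((i0 - l0) + (i1 - l1) * n0);
}
```

Under these assumptions, element (1,2) of a 3 x 4 C-style array sits at offset 6, while element (2,1) of a Fortran-style array whose first dimension has bounds {1,3} sits at offset 1.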
- In general, a `_F` suffix appended to a class or function name now indicates that it is Fortran-style, and the omission of that suffix indicates that it is C-style. This is true for `Array[_F]`, `SArray[_F]`, `parallel_for[_F]`, `Bounds[_F]`, and `SimpleBounds[_F]`. The `Style` template parameter is removed in favor of this clearer syntax.
- YAKL now defines its own `yakl::DeviceSpace`, which satisfies the concepts of a `Kokkos::MemorySpace`. It is assumed that, when compiling for a GPU device target, the host space is not accessible. This may be relaxed in the future, but since paging memory automatically to and from device memory with CUDA / HIP Managed Memory or Linux kernel memory paging is so much less performant than user-managed memory, it is being disabled for now so that users can more easily see when they are accidentally accessing host memory on the device or device memory on the host. The core `yakl::DeviceSpace` class is extremely simple and merely calls YAKL's `alloc_device` and `free_device` routines, which automatically use the YAKL memory pool when it is enabled. You can `deep_copy` between `yakl::DeviceSpace` and `Kokkos::HostSpace` as well as between `yakl::DeviceSpace` and `Kokkos::DefaultExecutionSpace::memory_space`. In fact, `yakl::DeviceSpace` uses `Kokkos::DefaultExecutionSpace::memory_space` behind the scenes.
- The `yakl::SArray` class (formerly aliased from `yakl::CSArray`) for small, local stack-based arrays in the memory space of the execution space being used in a `parallel_for` call has changed its template parameters. You no longer specify the rank explicitly but rather just specify the type and the dimensions directly. The class is now declared as `template <class T, std::integral auto... dimensions> SArray`, so a 2-D `SArray` object is declared as something like `yakl::SArray<float,nx,ny> my_local_stack_array;`, and YAKL infers the rank of the array from the number of dimension template parameters. The same is true for the `yakl::SArray_F` class (previously `yakl::FSArray`), except that the user needs to use the `yakl::Bnds` class to specify the lower (inclusive) and upper (inclusive) bounds of the array. E.g., `using yakl::Bnds; yakl::SArray_F<double,Bnds{1,nx},Bnds{1,nz}> my_fortran_stack_array;`. I tried to get rid of the need for specifying `Bnds` in that syntax, but not all compilers play nicely with Non-Type Template Parameters (NTTPs) resulting from the initializer list syntax. So, we're stuck with it. You can obtain the rank of an `SArray` or `SArray_F` object with the `static constexpr int rank` class data member.
  - The `SArray` and `SArray_F` classes both define the `Kokkos::View` types `value_type`, `const_value_type`, and `non_const_value_type`. They also define `is_SArray=true`, `rank`, `is_cstyle`, and `is_fstyle` as `static constexpr` class data members. As before, they both define `operator=(T rhs) requires std::is_arithmetic_v<T>`, `operator() const`, `data()`, `begin()`, `end()`, `size()`, `span_is_contiguous() {return true;}`, `is_allocated() {return true;}`, `extent(int i)`, `extent<I>()`, `operator<<`, `extents()`, `lbounds()`, and `ubounds()`.
- The recommended way to pass a YAKL `Array`, `Array_F`, `SArray`, or `SArray_F` object to a function is to declare a generic template parameter, `template <class ArrayLike>`, and add a `requires` clause to constrain the properties of the object, such as `requires yakl::is_Array<ArrayLike> && (ArrayLike::rank()==3) && ArrayLike::is_cstyle` or `requires yakl::is_SArray<ArrayLike> && (ArrayLike::rank==2) && ArrayLike::is_fstyle && std::is_integral_v<typename ArrayLike::value_type>`.
- The `parallel_for` launcher is largely the same as before with a slightly more performant implementation under the hood, which still ultimately uses `Kokkos::parallel_for` for all launches but still avoids using `MDRangePolicy` due to some performance issues. The biggest change is that if you want Fortran-style `parallel_for` and `Bounds`, you need to append the `_F` suffix to their names rather than relying on a `Style` parameter.
  - WARNING: Nvidia's CUDA compilers have an issue with namespace resolution when you code `using yakl::parallel_for` and pass a `size_t` argument as the bounds to a `parallel_for` call without an explicit namespace. One compiler pass will resolve `yakl::parallel_for`, and another will resolve `Kokkos::parallel_for`, and you will get an invalid device kernel error at runtime. To avoid this, please always code `yakl::parallel_for` explicitly rather than coding `using yakl::parallel_for` followed by a call to `parallel_for` without a namespace. The `size()` member function of Kokkos `View` objects (and therefore of YAKL `Array` objects) returns a `size_t` type and will trigger this issue as well. Please see this documentation for more information on this Argument Dependent Lookup (ADL) issue with CUDA.
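The text above says the launcher avoids `MDRangePolicy` but does not say how multi-dimensional loops reach a 1-D `Kokkos::parallel_for`. One common approach, assumed here purely for illustration and not taken from YAKL's source, is to flatten the loop bounds into a single index range and recover the per-dimension indices inside the kernel. The helpers `unpack_c` and `unpack_f` below are hypothetical names. They show the difference between C-style bounds (zero-based, last index fastest) and Fortran-style bounds (inclusive lower and upper bounds, first index fastest) for a 2-D loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// C-style: zero-based indices over extents (n0, n1); the last index varies fastest.
constexpr std::array<std::size_t, 2> unpack_c(std::size_t flat,
                                              std::size_t /*n0*/,
                                              std::size_t n1) {
  return {flat / n1, flat % n1};
}

// Fortran-style: first dimension has inclusive bounds [l0,u0], second starts at l1;
// the first index varies fastest.
constexpr std::array<std::ptrdiff_t, 2> unpack_f(std::size_t flat,
                                                 std::ptrdiff_t l0,
                                                 std::ptrdiff_t u0,
                                                 std::ptrdiff_t l1) {
  std::size_t n0 = static_cast<std::size_t>(u0 - l0 + 1);  // extent of dimension 0
  return {l0 + static_cast<std::ptrdiff_t>(flat % n0),
          l1 + static_cast<std::ptrdiff_t>(flat / n0)};
}
```

With extents (3,4), flat index 6 unpacks to the C-style pair (1,2). With first-dimension bounds {1,3} and a second dimension starting at 1, flat index 3 unpacks to the Fortran-style pair (1,2).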
- The `yakl::componentwise` namespace now has functions for any combination of `[arithmetic_type || SArray[_F]]` or `[arithmetic_type || Array[_F]]` parameters for binary operators: `operator[+-/*<>]`, `operator<=`, `operator>=`, `operator==`, `operator!=`, `operator&&`, `operator||`. It also has functions for `Array[_F]` or `SArray[_F]` inputs to unary operators: `operator[!+-]`, `abs`, `sqrt`, `cbrt`, `pow(arr,arithmetic)`, `sin`, `cos`, `tan`, `asin`, `acos`, `atan`, `exp`, `log`, `log10`, `log2`, `floor`, `ceil`, `round`, `isnan`, `isinf`. Each of these will launch a kernel to run the operation componentwise on the input(s) and return the result in a new `Array` or `SArray` object of the correct operation result type depending on the input(s). These are intended mainly to help users write debug and unit testing code more quickly, since each function/operator launches a separate kernel.
- The `yakl::intrinsics` namespace is the same as before except that `pack` has been removed because it was never running on the device anyway, and `matinv_ge` has been renamed to `matinv` and now includes partial pivoting. Toney the timer is still in there for quick and easy timer values that are queryable and given to you via `std::cout` upon `yakl::finalize()`. It's been simplified, and you now have access to the following control/lookup functions in the `yakl::` namespace based on the timer's `std::string label`: `timer_start(label)`, `timer_stop(label)`, `timer_get_last_duration(label)`, `timer_get_accumulated_duration(label)`, `timer_get_min_duration(label)`, `timer_get_max_duration(label)`, and `timer_get_count(label)`. You can also call the `timer_print()` function at any point to print the current timer count, min, max, and accumulated values. All timers must still be perfectly nested so the class can identify parent and child timers and format the output appropriately.
- YAKL's `Array[_F]` classes now accept any integral type for ctors and `operator()`, and the `Array_F` class accepts `ptrdiff_t` types for the dimension bounds. The `parallel_for[_F]` functions and `[Simple]Bounds[_F]` classes use `size_t` and `ptrdiff_t` types. So indexing arrays larger than 2B indices and looping over more than 2B indices is technically allowed now. However, be aware that individual GPU backends often limit these sizes to `unsigned int` types, so I still do not advise this.
- To summarize the changes a bit more succinctly:
  - `int yakl::memHost` --> `class Kokkos::HostSpace`
  - `int yakl::memDevice` --> `class yakl::DeviceSpace`
  - `yakl::c::[parallel_for|Bounds|SimpleBounds]` --> `yakl::[parallel_for|Bounds|SimpleBounds]`
  - `yakl::fortran::[parallel_for|Bounds|SimpleBounds]` --> `yakl::[parallel_for_F|Bounds_F|SimpleBounds_F]`
  - `yakl::SArray<float,3,nx,ny,nz>` --> `yakl::SArray<float,nx,ny,nz>`
  - `yakl::FSArray<double,2,SB<1,nx>,SB<1,ny>>` --> `yakl::SArray_F<double,Bnds{1,nx},Bnds{1,ny}>`
  - `template <int memSpace = yakl::memHost>` --> `template <class MemSpace = Kokkos::HostSpace>`
  - `yakl::Array<float,2,yakl::memHost,yakl::styleC>` --> `yakl::Array<float **,Kokkos::HostSpace>`
  - `yakl::Array<double const,3,yakl::memDevice,yakl::styleFortran>` --> `yakl::Array_F<double const ***,yakl::DeviceSpace>`
  - `yakl::Array<float,3,yakl::memHost> arr("label",other_arr.data(),nz,ny,nx)` --> `yakl::Array<float ***,Kokkos::HostSpace> arr(other_arr.data(),nz,ny,nx)`
  - `template <class T, int N> requires std::is_arithmetic_v<T> func(yakl::Array<T,N,memDevice,styleC> const &arr)` --> `template <class ViewLike> requires is_Array<ViewLike> && std::is_arithmetic_v<typename ViewLike::value_type> && ViewLike::is_cstyle && ViewLike::on_device func(ViewLike const &arr)`
  - `yakl::intrinsic::matinv_ge` --> `yakl::intrinsics::matinv` (Only for `SArray` objects)
  - Is the `Array[_F]` object allocated / initialized? `arr.is_allocated()`
  - Total number of elements in the `[S]Array[_F]` object: `arr.size()`
IMPORTANT: YAKL is now built entirely on Kokkos, and Streams, FFTs, and hierarchical parallelism have been removed. Please note the following modifications to YAKL documentation:
- Streams and hierarchical parallelism are now gone. Now that YAKL is Kokkos, just use Kokkos if you want those things.
- Array classes are still in play as well as the pool allocator and the Fortran-style Array classes.
- The pool allocator no longer has multiple pools. This was meant to be a benefit, but it only seemed to cause confusion for users. So just one pool now. If it runs out of memory, simply make it bigger. It can be changed with `GATOR_DISABLE=1` and `GATOR_INITIAL_MB=...` just like before, as well as with parameters to `yakl::init()`.
- YAKL FFTs, Tridiagonal, and Pentadiagonal solves are now gone. They require hardware-specific logic, which YAKL no longer has.
- All previous YAKL code is in the YAKL/deprecated directory now.
- YAKL is now a header-only layer that extends Kokkos, meaning the build system will use typical Kokkos linking in CMake, and you can set YAKL flags the same way you set Kokkos flags (likely with generator expressions).
- All `parallel_for` routines use Kokkos `parallel_for`, and all memory allocations, copies, and frees use Kokkos-supported API routines.
- All reductions use Kokkos reductions, nothing backend specific anymore like cub, hipcub, or MKL.
- Stack array / static array classes are still in here: `CSArray` (aka, `SArray`) and `FSArray`
- Intrinsics are still in here.
- `yakl::fence()` should now be `Kokkos::fence()`
- `yakl::yakl_throw()` should now be `Kokkos::abort()`
- `YAKL_INLINE` should now be `KOKKOS_INLINE_FUNCTION`
- `YAKL_LAMBDA` should now be `KOKKOS_LAMBDA`
- `YAKL_EXECUTE_ON_HOST_ONLY(...)` should now be `KOKKOS_IF_ON_HOST(...)`
- `YAKL_EXECUTE_ON_DEVICE_ONLY(...)` should now be `KOKKOS_IF_ON_DEVICE(...)`
- `yakl::atomicAdd(var,rhs)` should now be `Kokkos::atomic_add(&var,rhs)`
  - Don't forget the ampersand. Kokkos accepts a pointer where YAKL used to accept a reference.
Example compilation approach
add_subdirectory(${KOKKOS_HOME} ${KOKKOS_BIN})
add_subdirectory(${YAKL_HOME} ${YAKL_BIN})
target_link_libraries(yakl PUBLIC Kokkos::kokkos)
add_executable(my_target_name ${MY_SOURCE_FILES})
target_compile_options(yakl PUBLIC $<$<COMPILE_LANGUAGE:CXX>:${ADDED_CXX_FLAGS}>)
target_compile_options(my_target_name PUBLIC $<$<COMPILE_LANGUAGE:CXX>:${ADDED_CXX_FLAGS}>)
target_link_libraries(my_target_name yakl [${ADDED_LINK_FLAGS}])
# OLCF Frontier Example
export MPICH_GPU_SUPPORT_ENABLED=1
export MY_BACKEND="Kokkos_ENABLE_HIP"
export MY_ARCH="Kokkos_ARCH_AMD_GFX90A"
export ADDED_CXX_FLAGS="-DUSE_GPU_AWARE_MPI;-munsafe-fp-atomics;-O3;-ffast-math;-I${ROCM_PATH}/include;-D__HIP_ROCclr__;-D__HIP_ARCH_GFX90A__=1;--rocm-path=${ROCM_PATH};--offload-arch=gfx90a;-Wno-unused-result;-Wno-macro-redefined"
export ADDED_LINK_FLAGS="--rocm-path=${ROCM_PATH};-L${ROCM_PATH}/lib;-lamdhip64"
# Create the CMake command
CMAKE_COMMAND=(cmake)
CMAKE_COMMAND+=(-DADDED_CXX_FLAGS="$ADDED_CXX_FLAGS")
CMAKE_COMMAND+=(-DADDED_LINK_FLAGS="$ADDED_LINK_FLAGS")
[[ ! "$MY_BACKEND" == "" ]] && CMAKE_COMMAND+=(-D${MY_BACKEND}=ON)
[[ ! "$MY_ARCH" == "" ]] && CMAKE_COMMAND+=(-D${MY_ARCH}=ON)
[[ "$MY_BACKEND" == "Kokkos_ENABLE_CUDA" ]] && CMAKE_COMMAND+=(-DKokkos_ENABLE_CUDA_CONSTEXPR=ON)
CMAKE_COMMAND+=($CMAKE_DIRECTORY_LOC)
# Run the CMake command
"${CMAKE_COMMAND[@]}"
Documentation: https://github.com/mrnorman/YAKL/wiki (WARNING: Documentation is subject to the above changes)
API Documentation: https://mrnorman.github.io/yakl_api/html/index.html (WARNING: API is subject to the above changes)
Cite YAKL: https://link.springer.com/article/10.1007/s10766-022-00739-0
Primary Developer: Matt Norman (Oak Ridge National Laboratory) - mrnorman.github.io
Contributors: https://github.com/mrnorman/YAKL/wiki#contributors
Example YAKL Usage
For a self-contained example of how to use YAKL, please check out the cpp/ folder of the miniWeather repo.