TileLang-Ascend Workspace Reduction Tutorial
June 23, 2026 · View on GitHub
1. Design Goals
In Ascend architecture, Cube and Vector cores cannot directly exchange data. When Cube computation results need to be processed by Vector, the traditional approach requires:
- Cube writes results to Global Memory (GM)
- Vector reads data from GM
- Potential synchronization overhead between cores
This traditional approach introduces Programming complexity: Users must manually write two-stage copy scripts with workspace management, offset calculation, and must explicitly declare workspace buffers and manage their lifecycle, significantly increasing frontend script writing difficulty and reducing code maintainability
Workspace Reduction aims to:
- Automate workspace allocation: Compiler automatically creates GM workspace buffers for cross-core data transfer
- Transform copy statements: Convert direct
copy_l0c_to_ubandcopy_ub_to_l1into two-stage GM-based copies - Support multi-core scenarios: Each AI Core gets independent workspace region via
cidoffset - Handle vector core splitting: When
threads=2, workspace stores complete data while UB is split byvid - Simplify programming model: Users write single copy statement, compiler handles workspace management
2. Usage Guide
Workspace Reduction is automatically enabled in the compilation pipeline. No manual configuration required.
2.1 Basic Example
Original Code (Expert Mode):
@T.prim_func
def kernel(...):
with T.Kernel(..., threads=2, is_npu=True) as (cid):
# Cube computation
T.gemm_v0(q_l1, k_l1, acc_s_l0c, ...)
# Direct copy: L0C → UB (requires cross-core transfer)
T.copy(acc_s_l0c, acc_s_ub) # ← Workspace Reduction triggered
# Vector computation
T.tile.exp(acc_s_ub, acc_s_ub)
2.2 Practical Example: FlashAttention
See complete example: flash_attn.py
@T.prim_func
def flash_attn(...):
with T.Kernel(..., threads=2, is_npu=True) as (cid):
# Attention score computation (Cube)
T.gemm_v0(q_l1, k_l1, acc_s_l0c, transpose_B=True, init=True)
# L0C → UB transfer (triggers Workspace Reduction)
T.copy(acc_s_l0c, acc_s_ub)
# Softmax computation (Vector)
T.tile.exp(acc_s_ub, acc_s_ub)
T.reduce_max(acc_s_ub, m_i, dim=-1)
T.tile.sub(acc_s_ub, acc_s_ub, m_i)
...
3. Important Limitations
3.1 Current Constraints
- Static shape requirement: Workspace shapes must be statically determinable at compile time
- Vid Reduction dependency: Requires coordination with Vid Reduction, i.e.,
threadsparameter must be set to 1 or 2
3.2 Known Limitations
- Multi-dimensional workspace: Currently supports 2D shapes
[M, N], future extension to higher dimensions if needed. - Dynamic shapes: Not supported yet, requires compile-time constants
- Skip UB scenario: Currently only supports indices array transfer in Sparse Flash Attention (SFA) operator
3.3 Future Enhancements
- Scoped copy-back insertion: Support region-based copy insertion strategy where Stage 2 copy statements are inserted at region boundaries before all data consumers, reducing interleaving of copy and compute operations, and avoiding efficiency degradation from redundant cross-core synchronization overhead