Dispenso
June 12, 2026
A high-performance C++ thread pool and parallel algorithms library
Dispenso is a modern C++ parallel computing library that provides work-stealing thread pools, parallel for loops, futures, task graphs, and concurrent containers. It serves as a powerful alternative to OpenMP and Intel TBB, offering better nested parallelism, sanitizer-clean code, and explicit thread pool control. Dispenso is used in hundreds of projects at Meta (formerly Facebook) and has been heavily tested and iterated on in production.
Key advantages over OpenMP and TBB:
- No thread explosion with nested parallel loops - dispenso's work-stealing prevents deadlocks and oversubscription
- Clean with ASAN/TSAN - fully sanitizer-compatible, unlike many TBB versions
- Thread-safe shared futures - std::experimental::shared_future-like API that TBB lacks, safe for multiple concurrent waiters, with much better performance than std::future
- Portable - C++14 compatible with no compiler-specific pragmas or extensions; C++20 builds gain concept constraints for clearer error messages
Table of Contents
- Choose Dispenso If...
- Features
- Quick Start
- Comparison vs Other Libraries
- Migration Guides
- Documentation and Examples
- Benchmark Results
- Installing
- Building
- Known Issues
- License
Choose Dispenso If...
- You need nested parallelism without thread explosion
- You want sanitizer-clean (ASAN/TSAN) concurrent code
- You want explicit control over thread pools rather than implicit global state
- You need compute-bound futures, not I/O-bound async
- You want stable APIs and minimal dependencies
- You need cross-platform portability from a C++14 baseline
- You have multiple independent parallel loops that can overlap (cascading parallel_for)
Features
Dispenso provides a comprehensive set of parallel programming primitives:
Core runtime:
- ThreadPool - work-stealing thread pool backing all dispenso parallelism
- TaskSet / ConcurrentTaskSet - task grouping with wait, cancellation, and recursive scheduling
Parallel algorithms:
- parallel_for - parallel loops over indices, blocking or non-blocking (cascaded); cascading parallel_for enables overlapping independent loops without oversubscription
- parallel_invoke - fork-join invocation of heterogeneous tasks; composes naturally with recursive divide-and-conquer
- for_each - parallel std::for_each / std::for_each_n
- Future - high-performance thread-safe shared futures with then(), when_all(), and an API matching std::experimental::shared_future
- Graph - task graph execution with subgraph support and incremental re-evaluation
- pipeline - parallel pipelining of streaming workloads
Concurrent containers and synchronization:
- ConcurrentVector - concurrent growable vector, superset of the TBB concurrent_vector API
- ChaseLevDeque - lock-free SPMC work-stealing deque
- MpmcRingBuffer - bounded multi-producer multi-consumer ring buffer
- SPSCRingBuffer - lock-free single-producer single-consumer ring buffer
- Latch - one-shot barrier for thread synchronization
- RWLock - reader-writer spin lock, outperforms std::shared_mutex under low write contention
General-purpose utilities:
- CpuSet - portable CPU affinity, NUMA topology, and cache-aware thread group building
- SmallVector - inline-storage vector (not thread-aware; similar to folly::small_vector)
- OnceFunction - lightweight move-only void() callable
- PoolAllocator - pool allocator with pluggable backing allocation (e.g. CUDA)
- SmallBufferAllocator - fast concurrent allocation for temporary objects
- ResourcePool - semaphore-like guard around pooled resources
- CompletionEvent - notifiable event with wait and timed wait
- AsyncRequest - lightweight constrained message passing
- ConcurrentObjectArena - fast same-type object arena
Quick Start
Parallel for loop - the most common use case:
#include <dispenso/parallel_for.h>
// Sequential
for (size_t i = 0; i < N; ++i) {
  process(data[i]);
}
// Parallel with dispenso - just wrap it!
dispenso::parallel_for(0, N, [&](size_t i) {
  process(data[i]);
});
Install via your favorite package manager:
# Conda
conda install -c conda-forge dispenso
# Fedora/RHEL
sudo dnf install dispenso-devel
# Or build from source (see below)
Comparison vs Other Libraries
TBB (Intel Threading Building Blocks)
TBB has more functionality overall, but we built dispenso for three reasons:
- Sanitizer compatibility - TBB doesn't work well with ASAN/TSAN
- Thread-safe shared futures - TBB lacks a futures interface; dispenso provides std::experimental::shared_future-like futures safe for multiple concurrent waiters
- Non-Intel hardware - we needed to control performance on diverse platforms
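To make "safe for multiple concurrent waiters" concrete, here is a minimal sketch that uses only the C++ standard library (std::promise and std::shared_future). It does not use dispenso's Future type, and the helper name sumSeenByWaiters is hypothetical; it only illustrates the usage pattern in which several threads block on one shared result.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Standard-library illustration only; dispenso's Future is not used here.
// Hypothetical helper: numWaiters threads wait on one shared result and each
// records the value it observed. Returns the sum of the observed values.
inline int sumSeenByWaiters(int value, std::size_t numWaiters) {
  assert(numWaiters > 0);
  std::promise<int> producer;
  std::shared_future<int> shared = producer.get_future().share();

  std::vector<int> seen(numWaiters, 0);
  std::vector<std::thread> waiters;
  for (std::size_t i = 0; i < numWaiters; ++i) {
    // Each waiter holds its own copy of the shared_future and writes to its
    // own slot of `seen`, so there is no data race between waiters.
    waiters.emplace_back([shared, &seen, i] { seen[i] = shared.get(); });
  }

  producer.set_value(value);  // Unblocks every waiter.

  for (auto& t : waiters) {
    t.join();
  }
  int total = 0;
  for (int v : seen) {
    total += v;
  }
  return total;
}
```

Calling sumSeenByWaiters(7, 4) starts four waiters that all read the same value once it is set.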
Performance: Dispenso tends to be faster for small and medium parallel loops, and on par for large ones. When many loops run independently, dispenso's cascading parallel_for avoids oversubscription and has delivered 32-50% speedups in production workloads after porting from TBB at Meta. TBB lacks an equivalent mechanism.
See Migrating from TBB for a step-by-step porting guide.
OpenMP
OpenMP has simple syntax for basic loops but grows complex for advanced constructs. Nested #pragma omp parallel for inside threaded code risks thread explosion and machine exhaustion. Dispenso outperforms OpenMP for medium and large loops. OpenMP has an advantage for very small loops due to direct compiler support, though dispenso's minItemsPerChunk option can close this gap by tuning the parallelism threshold for small/fast loops.
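The idea behind a minimum-items-per-chunk threshold can be sketched in a few lines. The function below is a hypothetical illustration, not dispenso's actual chunking logic: it assumes that a loop is never split into more chunks than there are threads, and that every chunk must contain at least minItemsPerChunk items, so very small loops collapse to a single chunk and avoid scheduling overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of dispenso: number of chunks to split a loop
// of numItems iterations into, given a thread count and a minimum chunk size.
inline std::size_t chunkCount(
    std::size_t numItems, std::size_t numThreads, std::size_t minItemsPerChunk) {
  assert(numThreads > 0 && minItemsPerChunk > 0);
  if (numItems == 0) {
    return 0;
  }
  // The threshold caps how many chunks are worth creating; at least one chunk
  // is always needed when there is any work.
  const std::size_t byThreshold =
      std::max<std::size_t>(numItems / minItemsPerChunk, 1);
  return std::min(numThreads, byThreshold);
}
```

Under these assumptions, raising the threshold trades parallelism for lower per-chunk overhead, which is the trade-off that matters for small or fast loops.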
See Migrating from OpenMP for a step-by-step porting guide.
Folly
Folly excels at asynchronous I/O with coroutine support. Dispenso is designed for compute-bound work. Dispenso's futures are lighter-weight and faster for compute workloads; Folly is the better choice for I/O-heavy applications.
TaskFlow
TaskFlow focuses on task graph execution. Dispenso has faster graph construction, faster full and partial graph execution, much lower parallel_for overhead (10-100x in benchmarks), and simpler/faster pipeline construction. TaskFlow does offer CUDA graph mappings, which dispenso does not currently provide.
Others (GCD, C++ std parallelism)
GCD is primarily an Apple technology, though ports to other platforms exist. C++ parallel algorithms are still evolving; we are interested in enabling dispenso as a backend for std::execution and C++ coroutines. Contributions and benchmarks are welcome.
Migration Guides
- Migrating from TBB - API mappings, thread pool differences, and common porting patterns
- Migrating from OpenMP - replacing #pragma omp with dispenso equivalents, handling reductions and nested parallelism
Documentation and Examples
- Getting Started — tutorials with compilable examples for parallel_for, tasks, futures, graphs, pipelines, containers, and more
- API Reference — full Doxygen documentation
- FAQ — common questions about performance, exception behavior, and when to use dispenso vs alternatives
Benchmark Results
Dispenso is benchmarked across Linux (x64), macOS (ARM64), Windows (x64), and Android (ARM64),
comparing against OpenMP, TBB, TaskFlow, folly, and std::async across thread pools, parallel
loops, futures, graphs, concurrent containers, and more.
Interactive Benchmark Dashboard — explore all results with platform switching, dark/light theme, and detailed per-benchmark charts.
Installing
Binary builds of Dispenso are available through several package managers:
- Conda: conda install -c conda-forge dispenso
- Conan: conan install --requires=dispenso/1.5.0
- vcpkg: vcpkg install dispenso
- Homebrew: brew install dispenso
- MacPorts: sudo port install dispenso
- Fedora/RHEL: sudo dnf install dispenso-devel
If your platform is not on the list, see the next section for instructions to build from source.
Building
Linux and macOS:
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
make -j
Windows (from Developer Command Prompt):
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
cmake --build . --config Release
For detailed instructions including CMake prerequisites, installation, testing, and benchmarking, see docs/building.md.
Known Issues and Limitations
- Parallel reduction is not a first-class operation. Use parallel_for with per-thread state accumulation (see Getting Started). A dedicated reduction API is planned.
- macOS CPU affinity: CpuSet::bindCurrentThread() is a no-op on macOS because the OS does not support explicit CPU pinning. Topology queries work.
- See GitHub Issues for the full list.
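The per-thread accumulation pattern mentioned in the reduction item can be sketched with the standard library alone. This is a generic illustration using std::thread, not dispenso's API (see Getting Started for the dispenso form); the helper name parallelSum and the static chunking scheme are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper, not part of dispenso: sums `data` by giving each worker
// its own partial sum, then combining the partials after all workers finish.
inline long long parallelSum(const std::vector<int>& data, std::size_t numThreads) {
  assert(numThreads > 0);
  std::vector<long long> partial(numThreads, 0);
  std::vector<std::thread> workers;
  // Ceiling division so every element lands in exactly one chunk.
  const std::size_t chunk = (data.size() + numThreads - 1) / numThreads;
  for (std::size_t t = 0; t < numThreads; ++t) {
    const std::size_t begin = std::min(t * chunk, data.size());
    const std::size_t end = std::min(begin + chunk, data.size());
    workers.emplace_back([&data, &partial, t, begin, end] {
      long long local = 0;  // Accumulate locally, then publish once.
      for (std::size_t i = begin; i < end; ++i) {
        local += data[i];
      }
      partial[t] = local;  // Each worker writes only its own slot.
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  // Sequential combine step over the per-thread results.
  return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

No locks or atomics are needed because each worker owns one slot and the combine step runs only after every worker has been joined.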
License
The library is released under the MIT license, but also relies on the (excellent) moodycamel concurrentqueue library, which is released under the Simplified BSD and Zlib licenses. See the top of the source at dispenso/third-party/moodycamel/*.h for details.