Dispenso

June 12, 2026

A high-performance C++ thread pool and parallel algorithms library

Dispenso is a modern C++ parallel computing library that provides work-stealing thread pools, parallel for loops, futures, task graphs, and concurrent containers. It serves as a powerful alternative to OpenMP and Intel TBB, offering better nested parallelism, sanitizer-clean code, and explicit thread pool control. Dispenso is used in hundreds of projects at Meta (formerly Facebook) and has been heavily tested and iterated on in production.

Key advantages over OpenMP and TBB:

No thread explosion with nested parallel loops - dispenso's work-stealing prevents deadlocks and oversubscription
Clean with ASAN/TSAN - fully sanitizer-compatible, unlike many TBB versions
Thread-safe shared futures - std::experimental::shared_future-like API that TBB lacks, safe for multiple concurrent waiters, with much better performance than std::future
Portable - C++14 compatible with no compiler-specific pragmas or extensions; C++20 builds gain concept constraints for clearer error messages

Choose Dispenso If...
Features
Quick Start
Comparison vs Other Libraries
Migration Guides
Documentation and Examples
Benchmark Results
Installing
Building
Known Issues and Limitations
License

Choose Dispenso If...

You need nested parallelism without thread explosion
You want sanitizer-clean (ASAN/TSAN) concurrent code
You want explicit control over thread pools rather than implicit global state
You need compute-bound futures, not I/O-bound async
You want stable APIs and minimal dependencies
You need cross-platform portability from a C++14 baseline
You have multiple independent parallel loops that can overlap (cascading parallel_for)

Features

Dispenso provides a comprehensive set of parallel programming primitives:

Core runtime:

ThreadPool — work-stealing thread pool backing all dispenso parallelism
TaskSet / ConcurrentTaskSet — task grouping with wait, cancellation, and recursive scheduling

Parallel algorithms:

parallel_for — parallel loops over indices, blocking or non-blocking (cascaded); cascading parallel_for enables overlapping independent loops without oversubscription
parallel_invoke — fork-join invocation of heterogeneous tasks; composes naturally with recursive divide-and-conquer
for_each — parallel std::for_each / std::for_each_n
Future — high-performance thread-safe shared futures with then(), when_all(), and an API matching std::experimental::shared_future
Graph — task graph execution with subgraph support and incremental re-evaluation
pipeline — parallel pipelining of streaming workloads

Concurrent containers and synchronization:

ConcurrentVector — concurrent growable vector, superset of TBB concurrent_vector API
ChaseLevDeque — lock-free SPMC work-stealing deque
MpmcRingBuffer — bounded multi-producer multi-consumer ring buffer
SPSCRingBuffer — lock-free single-producer single-consumer ring buffer
Latch — one-shot barrier for thread synchronization
RWLock — reader-writer spin lock, outperforms std::shared_mutex under low write contention

General-purpose utilities:

CpuSet — portable CPU affinity, NUMA topology, and cache-aware thread group building
SmallVector — inline-storage vector (not thread-aware; similar to folly::small_vector)
OnceFunction — lightweight move-only void() callable
PoolAllocator — pool allocator with pluggable backing allocation (e.g. CUDA)
SmallBufferAllocator — fast concurrent allocation for temporary objects
ResourcePool — semaphore-like guard around pooled resources
CompletionEvent — notifiable event with wait and timed wait
AsyncRequest — lightweight constrained message passing
ConcurrentObjectArena — fast same-type object arena

Quick Start

Parallel for loop - the most common use case:

#include <dispenso/parallel_for.h>

// N, data, and process() are placeholders supplied by your own code.

// Sequential
for (size_t i = 0; i < N; ++i) {
    process(data[i]);
}

// Parallel with dispenso - just wrap it!
dispenso::parallel_for(0, N, [&](size_t i) {
    process(data[i]);
});

Install via your favorite package manager:

# Conda
conda install -c conda-forge dispenso

# Fedora/RHEL
sudo dnf install dispenso-devel

# Or build from source (see below)

Comparison vs Other Libraries

TBB (Intel Threading Building Blocks)

TBB has more functionality overall, but we built dispenso for three reasons:

Sanitizer compatibility — TBB doesn't work well with ASAN/TSAN
Thread-safe shared futures — TBB lacks a futures interface; dispenso provides std::experimental::shared_future-like futures safe for multiple concurrent waiters
Non-Intel hardware — we needed to control performance on diverse platforms

Performance: Dispenso tends to be faster for small and medium parallel loops, and on par for large ones. When many loops run independently, dispenso's cascading parallel_for avoids oversubscription and has delivered 32-50% speedups in production workloads after porting from TBB at Meta. TBB lacks an equivalent mechanism.

See Migrating from TBB for a step-by-step porting guide.
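
To illustrate what "safe for multiple concurrent waiters" means in the shared-futures point above, here is a minimal sketch using the standard library's std::shared_future as a stand-in. It does not use dispenso's Future type. The helper name sumSeenByWaiters is hypothetical, and each waiter holds its own copy of the shared future, which is the usage the standard guarantees to be safe.

```cpp
#include <cassert>
#include <future>
#include <thread>
#include <vector>

// Hypothetical helper (standard library only, not dispenso's API):
// several threads block on the same result and each reads it once.
int sumSeenByWaiters(int value, int numWaiters) {
  assert(numWaiters >= 0);
  std::promise<int> producer;
  // share() converts the one-shot std::future into a copyable shared_future.
  std::shared_future<int> shared = producer.get_future().share();

  std::vector<int> seen(static_cast<std::size_t>(numWaiters), 0);
  std::vector<std::thread> waiters;
  for (int w = 0; w < numWaiters; ++w) {
    // Each waiter captures its own copy of the shared_future by value.
    waiters.emplace_back([shared, &seen, w] {
      seen[static_cast<std::size_t>(w)] = shared.get();  // blocks until set
    });
  }

  producer.set_value(value);  // wakes every waiter
  for (auto& t : waiters) t.join();

  int total = 0;
  for (int s : seen) total += s;
  return total;
}
```

Calling sumSeenByWaiters(7, 4) starts four waiters that all observe the value 7. A plain std::future could not be used this way, because its get() may be called only once.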

OpenMP

OpenMP has simple syntax for basic loops but grows complex for advanced constructs. Nested #pragma omp parallel for inside threaded code risks thread explosion and machine exhaustion. Dispenso outperforms OpenMP for medium and large loops. OpenMP has an advantage for very small loops due to direct compiler support, though dispenso's minItemsPerChunk option can close this gap by tuning the parallelism threshold for small/fast loops.

See Migrating from OpenMP for a step-by-step porting guide.
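
To make the parallelism-threshold idea concrete, the sketch below shows how a minimum-items-per-chunk setting can bound the number of chunks a loop is split into. The function estimateChunkCount and its formula are assumptions for illustration only, not dispenso's actual chunking logic. Consult the API reference for how minItemsPerChunk is applied in practice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not dispenso's implementation: estimates how many
// chunks a loop of numItems iterations is split into when every chunk must
// hold at least minItemsPerChunk items and at most maxThreads chunks are used.
std::size_t estimateChunkCount(std::size_t numItems, std::size_t maxThreads,
                               std::size_t minItemsPerChunk) {
  assert(maxThreads > 0 && minItemsPerChunk > 0);
  if (numItems == 0) {
    return 0;  // nothing to schedule
  }
  // Small loops fall below the threshold and collapse to a single chunk.
  const std::size_t byThreshold =
      std::max<std::size_t>(1, numItems / minItemsPerChunk);
  return std::min(maxThreads, byThreshold);
}
```

Under these assumptions, with a threshold of 64 and 16 threads, a 100-iteration loop runs as a single chunk and so pays no scheduling overhead, while a 1,000,000-iteration loop still fans out to all 16 threads.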

Folly

Folly excels at asynchronous I/O with coroutine support. Dispenso is designed for compute-bound work. Dispenso's futures are lighter-weight and faster for compute workloads; Folly is the better choice for I/O-heavy applications.

TaskFlow

TaskFlow focuses on task graph execution. Dispenso has faster graph construction, faster full and partial graph execution, much lower parallel_for overhead (10-100x in benchmarks), and simpler/faster pipeline construction. TaskFlow does offer CUDA graph mappings, which dispenso does not currently provide.

Others (GCD, C++ std parallelism)

GCD (Grand Central Dispatch) is primarily an Apple technology, though ports to other platforms exist. C++ parallel algorithms are still evolving, and we are interested in enabling dispenso as a backend for std::execution and C++ coroutines. Contributions and benchmarks are welcome.

Migration Guides

Migrating from TBB — API mappings, thread pool differences, and common porting patterns
Migrating from OpenMP — Replacing #pragma omp with dispenso equivalents, handling reductions and nested parallelism

Documentation and Examples

Getting Started — tutorials with compilable examples for parallel_for, tasks, futures, graphs, pipelines, containers, and more
API Reference — full Doxygen documentation
FAQ — common questions about performance, exception behavior, and when to use dispenso vs alternatives

Benchmark Results

Dispenso is benchmarked across Linux (x64), macOS (ARM64), Windows (x64), and Android (ARM64), comparing against OpenMP, TBB, TaskFlow, folly, and std::async across thread pools, parallel loops, futures, graphs, concurrent containers, and more.

Interactive Benchmark Dashboard — explore all results with platform switching, dark/light theme, and detailed per-benchmark charts.

Installing

Binary builds of Dispenso are available through several package managers:

Conda: conda install -c conda-forge dispenso
Conan: conan install --requires=dispenso/1.5.0
vcpkg: vcpkg install dispenso
Homebrew: brew install dispenso
MacPorts: sudo port install dispenso
Fedora/RHEL: sudo dnf install dispenso-devel

If your platform is not on the list, see the next section for instructions to build from source.

Building

Linux and macOS:

mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
make -j

Windows (from Developer Command Prompt):

mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
cmake --build . --config Release

For detailed instructions including CMake prerequisites, installation, testing, and benchmarking, see docs/building.md.

Known Issues and Limitations

Parallel reduction is not a first-class operation. Use parallel_for with per-thread state accumulation (see Getting Started). A dedicated reduction API is planned.
macOS CPU affinity: CpuSet::bindCurrentThread() is a no-op on macOS — the OS does not support explicit CPU pinning. Topology queries work.
See GitHub Issues for the full list.
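
The per-thread state accumulation pattern mentioned in the first item can be sketched with the standard library alone. This is not dispenso's API (see Getting Started for the real one). The helper name sumWithPerThreadState and the use of raw std::thread workers are assumptions chosen to keep the example self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (standard library only): sums data by giving each
// worker its own accumulator slot, then combining the partial results.
long long sumWithPerThreadState(const std::vector<int>& data,
                                std::size_t numThreads) {
  assert(numThreads > 0);
  std::vector<long long> partials(numThreads, 0);  // one slot per worker
  std::vector<std::thread> workers;
  // Ceiling division so every element lands in exactly one range.
  const std::size_t chunk = (data.size() + numThreads - 1) / numThreads;

  for (std::size_t t = 0; t < numThreads; ++t) {
    const std::size_t begin = std::min(data.size(), t * chunk);
    const std::size_t end = std::min(data.size(), begin + chunk);
    workers.emplace_back([&data, &partials, t, begin, end] {
      long long local = 0;  // accumulate locally, write the slot once
      for (std::size_t i = begin; i < end; ++i) {
        local += data[i];
      }
      partials[t] = local;
    });
  }
  for (auto& w : workers) w.join();

  // Final serial combine step over the per-thread results.
  return std::accumulate(partials.begin(), partials.end(), 0LL);
}
```

No locks or atomics are needed because each worker writes only its own slot, and the combine step runs after all workers have joined. The same shape applies when the per-thread state is something richer than a sum, such as a histogram or a min/max pair.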

License

The library is released under the MIT license, but also relies on the (excellent) moodycamel concurrentqueue library, which is released under the Simplified BSD and Zlib licenses. See the top of the source at dispenso/third-party/moodycamel/*.h for details.