CHANGELOG.md
April 2, 2026
Release History
-
[2026/03] Released v0.11:
- Enables kernel-based communication on heterogeneous platforms, including NVIDIA and Hygon.
- Adds support for both host-side and device-side one-sided communication semantics.
- Introduces adaptor plugin support, enabling dynamic loading of user-defined Device, CCL, and Net adaptor implementations.
-
[2026/02] Released v0.10:
- Implements 11 chip-decoupled collective communication algorithms in uniRunner mode.
- Refactors Device Intra-/Inter-node API and integrates NCCL Device API support on NVIDIA platforms.
- Enhances usability with pip install support for FlagCX and an NCCL wrapper plugin for seamless adoption on NVIDIA platforms.
-
[2026/01] Released v0.9:
- Adds support for Enflame, including `topsAdaptor` and `ecclAdaptor`.
- Extends flagcxCCLAdaptor to support symmetric operations.
- Introduces the NCCL Device API in ncclAdaptor to enable customized AllReduce operations.
- Refactors `glooAdaptor` to support both TCP and IB transports, with automatic NIC detection.
-
[2025/12] Released v0.8:
- Enables intra-node zero-copy to improve data transfer efficiency for small messages.
- Supports a naive AllReduce implementation in uniRunner mode using a CPU-centric, device-assisted algorithm.
- Adds one-sided communication primitives via the new APIs flagcxHeteroPut and flagcxHeteroPutSignal.
-
[Unreleased] Test infrastructure restructure and bug fixes (PR #413):
- Fixed NCCL group imbalance in `ncclAdaptorGather`/`ncclAdaptorScatter`: errors inside `ncclGroupStart()`/`ncclGroupEnd()` no longer skip `ncclGroupEnd()`, preventing deadlocks.
- Reduced unit-test buffer allocation from 1GB to 4MB per buffer, cutting memory from 32GB to 128MB for 8-rank runs.
- Improved collective test correctness by using rank-dependent data patterns, catching rank-ordering and single-rank-copy bugs.
- Added infinite-loop guard in `perfBenchmarkLoop` for `stepFactor <= 1`.
- Wired `PERF_COMMON_SRC` into `test/perf/host_api/Makefile` build.
- Removed TRACE-level debug logging from CI workflow.
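
As a rough illustration of why rank-dependent data patterns catch bugs that uniform data hides, the Python sketch below simulates an AllGather over in-memory lists. It is not taken from the FlagCX test suite: every function name here is hypothetical, and the two "buggy" implementations are invented only to mimic a single-rank-copy bug and a rank-ordering bug.

```python
# Hypothetical simulation; none of these names come from FlagCX.

def uniform_input(rank, count):
    # Every rank contributes identical values, so buffers are indistinguishable.
    return [1.0] * count


def rank_dependent_input(rank, count):
    # Each element encodes both the owning rank and its index.
    return [rank * 1000.0 + i for i in range(count)]


def correct_allgather(inputs):
    # Concatenate every rank's buffer in rank order.
    return [x for buf in inputs for x in buf]


def single_rank_copy_allgather(inputs):
    # Simulated bug: rank 0's buffer is copied into every output slot.
    return [x for _ in inputs for x in inputs[0]]


def reversed_order_allgather(inputs):
    # Simulated bug: buffers are concatenated in reverse rank order.
    return [x for buf in reversed(inputs) for x in buf]


def check(allgather, make_input, nranks=4, count=8):
    # Build per-rank inputs, then compare against the expected rank-ordered result.
    inputs = [make_input(r, count) for r in range(nranks)]
    expected = [x for r in range(nranks) for x in make_input(r, count)]
    return allgather(inputs) == expected
```

With `uniform_input`, `check` passes for all three implementations, including the two buggy ones. With `rank_dependent_input`, only `correct_allgather` passes, because each output slot must contain values that identify the rank it came from.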
-
[2025/11] Released v0.7:
- Added support for TsingMicro, including device adaptor `tsmicroAdaptor` and CCL adaptor `tcclAdaptor`.
- Implemented experimental kernel-free non-reduce collective communication (SendRecv, AlltoAll, AlltoAllv, Broadcast, Gather, Scatter, AllGather) using device-buffer IPC/RDMA.
- Enabled auto-tuning on NVIDIA, MetaX, and Hygon platforms, achieving 1.02×–1.26× speedups for AllReduce, AllGather, ReduceScatter, and AlltoAll.
- Enhanced `flagcxNetAdaptor` with one-sided primitives (`put`, `putSignal`, `waitValue`) and added retransmission support to improve reliability.
-
[2025/10] Released v0.6:
- Implemented device-buffer IPC communication to support intra-node SendRecv operations.
- Introduced device-initiated, host-launched device-side primitives, enabling kernel-based communication directly from devices.
- Enhanced auto-tuning, yielding a 50% performance improvement for AllReduce operations on MetaX platforms.
-
[2025/09] Released v0.5:
- Added support for AMD GPUs, including a device adaptor `hipAdaptor` and a CCL adaptor `rcclAdaptor`.
- Introduced `flagcxNetAdaptor` to unify network backends, currently supporting socket, IBRC, UCX and IBUC (experimental).
- Enabled zero-copy device-buffer RDMA (user-buffer RDMA) to boost performance for small messages.
- Supported auto-tuning in homogeneous scenarios via `flagcxTuner`.
- Added test automation in CI/CD for PyTorch APIs.
-
[2025/08] Released v0.4:
- Supported heterogeneous training of ERNIE4.5 (Baidu) on NVIDIA and Iluvatar GPUs with Paddle + FlagCX.
- Improved heterogeneous communication across arbitrary NIC configurations, enabling more robust and flexible deployments.
- Introduced an experimental network plugin interface with extended support for IBRC and SOCKET. Device buffer registration can now be done via DMA-BUF.
- Added an InterOp-level DSL to enable customized C2C algorithm design.
- Provided user documentation under `docs/`.
-
[2025/07] Released v0.3:
- Integrated three additional native communication libraries: HCCL (Huawei), MUSACCL (Moore Threads) and MPI.
- Enhanced heterogeneous collective communication operations with pipeline optimizations.
- Introduced device-side functions to enable device-buffer RDMA, complementing the existing host-side functions.
- Delivered a full-stack open-source solution, FlagScale + FlagCX, for efficient heterogeneous prefilling-decoding disaggregation.
-
[2025/05] Released v0.2:
- Integrated 3 additional native communication libraries, including MCCL (Moore Threads), XCCL (Mellanox) and DUCCL (BAAI).
- Improved 11 heterogeneous collective communication operations with automatic topology detection and full support for single-NIC and multi-NIC environments.
-
[2025/04] Released v0.1:
- Added 5 native communication libraries: CCL adaptors for NCCL (NVIDIA), IXCCL (Iluvatar), and CNCL (Cambricon), and host CCL adaptors GLOO and Bootstrap.
- Supported 11 heterogeneous collective communication operations using the C2C (Cluster-to-Cluster) algorithm.
- Provided a full-stack open-source solution, FlagScale + FlagCX, for efficient heterogeneous training.
- Natively integrated into PaddlePaddle v3.0.0, with support for both dynamic and static graphs.