# Efficient Latent Communication in Multi-Agent Systems

December 10, 2025 · View on GitHub
Multi-agent systems (MAS) typically communicate through text, forcing the extra step of decoding latent representations into tokens before passing them to the next agent. The LatentMAS framework, proposed by Zou et al., 2025 [1], instead shares the transformer's key-value (KV) caches directly, providing significant speed-ups, accuracy gains, and up to 80% less token usage. However, this framework introduces a new challenge: KV caches grow linearly with the number of agents in the system. This work explores k-nearest-neighbor (KNN) retrieval over cached keys, from the Memorizing Transformers paper by Wu et al., 2022 [6], as a mechanism to limit KV cache size. In the end, we trim KV cache memory by 40% and speed up answer generation by 29% while maintaining near-full LatentMAS accuracy. We describe the full experimentation process and provide intuition for our results. Ultimately, these findings suggest that latent communication carries structured, layer-dependent information that can be selectively compressed without significantly compromising performance, opening an avenue for further development of efficient latent MAS design.
Read the blog post here for more details
## 💡 Introduction
This repository is based on the LatentMAS framework (Zou et al., 2025), a multi-agent reasoning framework that moves agent collaboration from token space into the model's latent space.
Key Features:
- Efficient multi-step reasoning with drastically fewer tokens
- Training-free latent-space alignment for stable generation
- KNN-based KV cache filtering for memory-efficient agent communication
- Three selection strategies: top-k similarity, bottom-k diversity, and random baseline
This implementation extends the original LatentMAS with experimental KNN filtering capabilities for the KV cache, enabling more efficient memory usage during multi-agent collaboration.
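The core idea behind KNN filtering of the KV cache can be sketched as follows. This is an illustrative NumPy sketch, not the repository's implementation (which lives in `methods/latent_mas.py`); the function name `knn_filter_kv` and the choice of the recent-key centroid as the retrieval query are assumptions made for the example.

```python
import numpy as np

def knn_filter_kv(keys, values, keep_frac=0.8, min_keep=5, strategy="top"):
    """Filter one attention head's KV cache, keeping a fraction of positions.

    keys, values: arrays of shape (seq_len, head_dim).
    Similarity is measured against the mean of the last `min_keep` keys,
    which stand in for the most recent context.
    """
    seq_len = keys.shape[0]
    n_keep = max(min_keep, int(round(seq_len * keep_frac)))
    if n_keep >= seq_len:
        return keys, values

    # Always preserve the most recent `min_keep` positions.
    recent = np.arange(seq_len - min_keep, seq_len)
    candidates = np.arange(seq_len - min_keep)

    # Cosine similarity of each candidate key to the recent-context centroid.
    query = keys[recent].mean(axis=0)
    sims = keys[candidates] @ query
    sims /= np.linalg.norm(keys[candidates], axis=1) * np.linalg.norm(query) + 1e-8

    n_extra = n_keep - min_keep
    if strategy == "top":        # keep keys most similar to recent context
        chosen = candidates[np.argsort(-sims)[:n_extra]]
    elif strategy == "bottom":   # keep least similar (diversity baseline)
        chosen = candidates[np.argsort(sims)[:n_extra]]
    else:                        # random baseline
        chosen = np.random.default_rng(0).choice(candidates, n_extra, replace=False)

    idx = np.sort(np.concatenate([chosen, recent]))  # preserve original order
    return keys[idx], values[idx]
```

In the actual framework the same selection would be applied per layer and per head before the cache is handed to the next agent; the sketch above shows only the selection step.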
## 📊 Supported Datasets
This implementation supports the following datasets:
- GSM8K: Grade school math problems
- GPQA (Diamond): Graduate-level science questions
- MedQA: Medical question answering
## 🛠️ Getting Started
### ⚙️ Setup Environment Variables
We recommend setting your HF cache directory to avoid repeated downloads:
```bash
export HF_HOME=/path/to/huggingface
export TRANSFORMERS_CACHE=$HF_HOME
export HF_DATASETS_CACHE=$HF_HOME
```
Models and datasets will automatically be downloaded into $HF_HOME.
### 📦 Install Packages
```bash
conda create -n latentmas python=3.10 -y
conda activate latentmas
pip install -r requirements.txt
```
## 🚀 Quick Start
1. Clone the repo
```bash
git clone https://github.com/YourRepo/LatentMAS.git
cd LatentMAS
```
2. Repository Structure
```
LatentMAS/
├── run.py              # Main entry for experiments
├── models.py           # Wrapper for HF models + latent realignment
├── methods/
│   ├── baseline.py     # Single-agent baseline
│   ├── text_mas.py     # Token-space multi-agent method
│   └── latent_mas.py   # Latent-space multi-agent (with KNN filtering)
├── prompts.py          # Prompt constructors
├── prompts v2.py       # Updated prompt constructors for bottom-KNN
├── data.py             # Dataset loaders (GSM8K, GPQA, MedQA)
├── data/               # Provided data + figures
├── utils.py            # Answer parsing / timeout / helpers
├── example_logs/       # Example logs from LatentMAS
└── requirements.txt
```
## 🧪 Running Experiments
### 🔹 Baseline (single model)
```bash
python run.py --method baseline --model_name Qwen/Qwen3-4B --task gsm8k --max_samples 100
```
### 🔹 TextMAS (text-based multi-agent system)
```bash
python run.py --method text_mas --model_name Qwen/Qwen3-4B --task gsm8k --prompt sequential --max_samples 100
```
### 🔹 LatentMAS (latent multi-agent system)
```bash
python run.py --method latent_mas --model_name Qwen/Qwen3-4B --task gsm8k --latent_steps 10 --prompt sequential --max_samples 100
```
Key Parameters:

- `--latent_steps` ∈ [0, 80]: Number of latent reasoning steps per agent. Typically 10–20 works well.
- `--latent_space_realign`: Enables latent→embedding alignment for better generation stability.

```bash
python run.py --method latent_mas --model_name Qwen/Qwen3-4B --task gsm8k --latent_steps 10 --latent_space_realign --max_samples 100
```

- `--prompt` ∈ {sequential, hierarchical}: Prompt structure for agent collaboration.
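To give intuition for what latent→embedding realignment does: a raw hidden state can drift off the token-embedding manifold, and realignment pulls it back toward inputs the model was trained to consume. The sketch below shows one way such an alignment can be done, as a softmax-weighted mixture of token embeddings; the actual procedure is defined in `models.py`, and the function `realign_latent` and its weighting scheme are illustrative assumptions, not the repository's code.

```python
import numpy as np

def realign_latent(h, embed_matrix, temperature=1.0):
    """Pull a latent hidden state back toward the token-embedding manifold.

    h: hidden state, shape (d,)
    embed_matrix: token embedding table, shape (vocab, d)
    Returns a convex combination of token embeddings, so the result lies
    in the region of states the model expects as inputs.
    """
    logits = embed_matrix @ h / temperature   # similarity to each token embedding
    w = np.exp(logits - logits.max())
    w = w / w.sum()                           # softmax weights over the vocabulary
    return w @ embed_matrix
```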
## 🔬 KNN Cache Filtering (Experimental)
This implementation includes experimental KNN-based filtering of the KV cache to reduce memory usage during agent-to-agent communication.
Key KNN Parameters:

- `--knn_filter`: Enable KNN filtering of the KV cache.
- `--knn_percentage` (default: 0.8): Fraction of tokens to keep (0.0–1.0). E.g., 0.8 keeps 80% of the cache.
- `--knn_min_keep` (default: 5): Minimum number of recent tokens to always preserve, regardless of similarity.
- `--knn_strategy` ∈ {top, bottom, random} (default: top):
  - `top`: keep the most similar tokens
  - `bottom`: keep the least similar tokens
  - `random`: keep random tokens
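How `--knn_percentage` and `--knn_min_keep` interact can be summarized by a small helper; this is a hypothetical illustration of the parameter semantics described above, and the exact rounding in the repository's code may differ.

```python
def tokens_kept(seq_len, knn_percentage=0.8, knn_min_keep=5):
    """Number of KV-cache positions that survive filtering: a fraction of
    the sequence, floored by the always-preserved recent tokens and capped
    at the sequence length itself."""
    return min(seq_len, max(knn_min_keep, int(round(seq_len * knn_percentage))))
```

With 100 cached positions and the defaults, 80 survive; for very short caches, `knn_min_keep` dominates and filtering becomes a no-op.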
### 🧬 KNN Filtering Examples
1. Standard KNN: Keep 80% most similar tokens
```bash
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task gsm8k \
    --latent_steps 10 \
    --max_samples 10 \
    --knn_filter \
    --knn_percentage 0.8 \
    --knn_strategy top
```
2. Aggressive filtering: Keep only 50% most similar
```bash
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task gpqa \
    --latent_steps 10 \
    --max_samples 10 \
    --knn_filter \
    --knn_percentage 0.5 \
    --knn_strategy top
```
3. Diversity baseline: Keep 80% least similar tokens
```bash
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task medqa \
    --latent_steps 10 \
    --max_samples 10 \
    --knn_filter \
    --knn_percentage 0.8 \
    --knn_strategy bottom
```
4. Full experiment with all features
```bash
python run.py \
    --method latent_mas \
    --model_name Qwen/Qwen3-4B \
    --task gsm8k \
    --prompt hierarchical \
    --latent_steps 20 \
    --max_samples 100 \
    --latent_space_realign \
    --knn_filter \
    --knn_percentage 0.7 \
    --knn_min_keep 5 \
    --knn_strategy top \
    --temperature 0.6 \
    --seed 42
```
## 📖 Citation
This implementation is based on the LatentMAS paper. If you find this work helpful, please cite:
```bibtex
@article{zou2025latentmas,
  title={Latent Collaboration in Multi-Agent Systems},
  author={Zou, Jiaru and Yang, Xiyuan and Qiu, Ruizhong and Li, Gaotang and Tieu, Katherine and Lu, Pan and Shen, Ke and Tong, Hanghang and Choi, Yejin and He, Jingrui and Zou, James and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2511.20639},
  year={2025}
}
```
## 🤝 Acknowledgement
This code is based on the LatentMAS framework by Zou et al., 2025. The KNN cache filtering extension was developed independently for research purposes.