SWE-Manager
January 23, 2026
An 8B Model Trained via Reinforcement Learning for Intelligent Proposal Selection and Synthesis
SWE-Manager is an 8B model trained via reinforcement learning (RL) to compare proposals, justify its choice, and synthesize a golden proposal for implementation.
Overview
SWE-Manager is a specialized 8B-parameter language model fine-tuned from Qwen3-8B. The model excels at:
- Comparing Multiple Proposals: Analyzing technical tradeoffs across different solution approaches
- Reasoning & Justification: Providing detailed explanations for proposal selection decisions
- Golden Proposal Synthesis: Generating optimized, actionable implementation plans by combining the best aspects of multiple proposals
The model is trained using a combination of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), specifically DAPO.
Results
SWE-Lancer Manager Benchmark Performance
SWE-Lancer IC Benchmark Performance
Model Architecture
Base Model
- Architecture: Qwen3-8B
- Parameters: ~8 billion
- Context Window: 8K tokens
- Vocabulary: 152K tokens
Training Stages
graph LR
A[Qwen3-8B Base] --> B[SFT Stage]
B --> C[GRPO Stage]
C --> D[SWE-Manager 8B]
style A fill:#f9f,stroke:#333
style D fill:#9f9,stroke:#333
- Supervised Fine-Tuning (SFT)
  - Dataset: data/sft_data.jsonl
  - Format: Issue + Multiple Proposals → Selection + Reasoning
  - Epochs: 3-5 epochs with learning rate warmup
- Reinforcement Learning (GRPO)
  - Technique: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a variant of GRPO
  - Reward Function: reward_func/proposal_reward.py
  - Dataset: data/grpo_data.jsonl
  - Training: Multi-node distributed training with ZeRO-3
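As a rough illustration of what the RL stage optimizes, the sketch below shows the two ingredients DAPO is named for in plain Python: a clipped surrogate with separate lower and upper clip bounds, and dynamic sampling that skips prompt groups whose rewards are all identical. The function names and the clip values (0.2 and 0.28) are illustrative assumptions; they are not taken from this repository's training script or from reward_func/proposal_reward.py.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic sampling: drop groups where every sample got the same reward,
    because their normalized advantages are all zero and give no gradient."""
    return max(rewards) > min(rewards)


def clipped_term(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token surrogate with decoupled (asymmetric) clip bounds."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical rewards for four sampled answers to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
if keep_group(rewards):
    print([round(a, 3) for a in group_advantages(rewards)])
```

The wider upper bound lets low-probability tokens with positive advantage grow more before clipping takes effect, which is the motivation usually given for decoupling the two bounds.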
Training Pipeline
Data Preparation
The training data consists of:
- SFT Data (data/sft_data.jsonl): Supervised learning examples
- GRPO Data (data/grpo_data.jsonl): Preference-based training pairs
Data format:
{
  "messages": [
    {
      "role": "system",
      "content": "system prompt here..."
    },
    {
      "role": "user",
      "content": "Issue + Proposals here..."
    }
  ],
  "best_proposal_id": 0,
  "golden_proposal": "Synthesized golden proposal text here...",
  "explanation": "Justification for the best proposal selection here...",
  "cot": "Chain of Thought synthesis here...",
  "price": 0
}
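Before training, it can help to check that each JSONL line matches this layout. The following is a minimal sketch, assuming every record carries all the fields shown above; `validate_record` and `load_jsonl` are hypothetical helpers and are not part of this repository.

```python
import json

REQUIRED_KEYS = {
    "messages", "best_proposal_id", "golden_proposal",
    "explanation", "cot", "price",
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    roles = [m.get("role") for m in record.get("messages", [])]
    if roles[:2] != ["system", "user"]:
        problems.append(f"expected system then user message, got {roles}")
    if not isinstance(record.get("best_proposal_id"), int):
        problems.append("best_proposal_id should be an integer")
    return problems


def load_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# A tiny in-memory record with the same shape as the example above.
example = json.loads(
    '{"messages": [{"role": "system", "content": "s"},'
    ' {"role": "user", "content": "u"}],'
    ' "best_proposal_id": 0, "golden_proposal": "g",'
    ' "explanation": "e", "cot": "c", "price": 0}'
)
print(validate_record(example))
```

To scan a whole file, iterate over `load_jsonl("data/sft_data.jsonl")` and report any record for which `validate_record` returns a non-empty list.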
Training Scripts
SFT Training
bash train_script/sft_qwen_full_multi_node.sh
DAPO/GRPO Training
bash train_script/dapo_qwen_zero_3.sh
Training Infrastructure
- Multi-Node Support: Distributed training across multiple GPUs/nodes
- Mixed Precision: Automatic mixed precision (AMP) for faster training
- Checkpointing: Regular model checkpoints with best model selection
- Monitoring: TensorBoard logging and metrics tracking
Inference
python infer/batch_local_llm_request.py \
--input_file data/benchmark/swe-manager_benchmark.jsonl \
--output_file res/inference_results.jsonl \
--model_path path_to_trained_model \
--batch_size 4
P2A Framework
The Proposal-to-Action (P2A) framework includes:
- Proposal Agent: Generate multiple candidate proposals
- Technical Manager: Compare and select the best proposal with justification, and synthesize a golden proposal
- Implementation Agent: Implement the synthesized golden proposal into code
See P2A/README.md for detailed documentation.
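The flow between the three roles can be sketched as a simple function chain. This is only an illustration of the data flow described above: `run_p2a`, `ManagerDecision`, and the stub agents are hypothetical names, and the real interfaces are documented in P2A/README.md.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ManagerDecision:
    best_proposal_id: int   # index of the selected proposal
    explanation: str        # justification for the selection
    golden_proposal: str    # synthesized plan handed to the implementer


def run_p2a(
    issue: str,
    propose: Callable[[str], List[str]],
    manage: Callable[[str, List[str]], ManagerDecision],
    implement: Callable[[str, str], str],
) -> str:
    """Chain the three roles: propose, select and synthesize, implement."""
    proposals = propose(issue)
    if not proposals:
        raise ValueError("proposal agent returned no candidates")
    decision = manage(issue, proposals)
    if not 0 <= decision.best_proposal_id < len(proposals):
        raise ValueError("manager selected an out-of-range proposal id")
    return implement(issue, decision.golden_proposal)


# Stub agents standing in for real model calls.
def stub_propose(issue):
    return [f"Patch the handler for: {issue}", f"Rewrite the module for: {issue}"]


def stub_manage(issue, proposals):
    return ManagerDecision(0, "Smallest change that fixes the bug.", proposals[0])


def stub_implement(issue, plan):
    return f"# TODO: implement -> {plan}"


print(run_p2a("login crash", stub_propose, stub_manage, stub_implement))
```

Passing the agents in as callables keeps the manager step swappable, so the same loop could be run with SWE-Manager or with another model in the Technical Manager role.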
Project Structure
swe-manager/
├── README.md                          # This file
├── data/                              # Training data
│   ├── sft_data.jsonl                 # Supervised fine-tuning data
│   ├── grpo_data.jsonl                # GRPO training data
│   ├── benchmark/                     # Evaluation benchmarks
│   └── ablation_study/                # Ablation study data
├── train_script/                      # Training scripts
│   ├── sft_qwen_full_multi_node.sh    # SFT training
│   └── dapo_qwen_zero_3.sh            # DAPO/GRPO training
├── infer/                             # Inference scripts
│   ├── batch_local_llm_request.py     # Batch inference
│   └── swift_infer.sh                 # Swift-based inference
├── P2A/                               # Proposal-to-Action framework
│   └── README.md
├── reward_func/                       # Reward function implementations
├── res/                               # Results and outputs
└── img/                               # Documentation images
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Qwen Team for the excellent base model
- ModelScope Swift for training infrastructure
- DeepSpeed for distributed training support
- All contributors and researchers in the software engineering AI community
⭐ If you find this project helpful, please consider giving it a star! ⭐
Made with ❤️ by the SWE-Manager Team