SWE-Manager

January 23, 2026

An 8B Model Trained via Reinforcement Learning for Intelligent Proposal Selection and Synthesis

SWE-Manager is an 8B model trained via reinforcement learning (RL) to compare candidate proposals, justify its choice, and synthesize a golden proposal for implementation.

🌟 Overview

SWE-Manager is a specialized 8B-parameter language model fine-tuned from Qwen3-8B. The model excels at:

📊 Comparing Multiple Proposals: Analyzing technical tradeoffs across different solution approaches
💭 Reasoning & Justification: Providing detailed explanations for proposal selection decisions
🎨 Golden Proposal Synthesis: Generating optimized, actionable implementation plans by combining the best aspects of multiple proposals

The model is trained using a combination of Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), specifically DAPO.

🌟 Results

SWE-Lancer Manager Benchmark Performance

[Figure: SWE-Manager Benchmark Results]

SWE-Lancer IC Benchmark Performance

[Figure: P2A Framework Results]

🏗️ Model Architecture

Base Model

Architecture: Qwen3-8B
Parameters: ~8 billion
Context Window: 8K tokens
Vocabulary: 152K tokens

Training Stages

graph LR
    A[Qwen3-8B Base] --> B[SFT Stage]
    B --> C[GRPO Stage]
    C --> D[SWE-Manager 8B]
    
    style A fill:#f9f,stroke:#333
    style D fill:#9f9,stroke:#333

Supervised Fine-Tuning (SFT)
- Dataset: data/sft_data.jsonl
- Format: Issue + Multiple Proposals → Selection + Reasoning
- Epochs: 3-5, with learning rate warmup

Reinforcement Learning (GRPO)
- Technique: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a variant of GRPO
- Reward Function: reward_func/proposal_reward.py
- Dataset: data/grpo_data.jsonl
- Training: Multi-node distributed training with ZeRO-3
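
To make the RL stage more concrete, here is a minimal sketch of the three ideas that distinguish DAPO from plain GRPO: group-relative advantages, dynamic sampling (dropping groups whose rewards are all identical), and a decoupled clip range with a larger upper bound. This is not the code in this repository. The function names are hypothetical, and the clip values `eps_low=0.2` and `eps_high=0.28` are illustrative defaults, not settings read from train_script/dapo_qwen_zero_3.sh.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one prompt's group of sampled responses.

    Returns None when every reward is identical: such a group carries no
    learning signal, so dynamic sampling would discard it and resample.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return None
    return [(r - mu) / sigma for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=0.2, eps_high=0.28):
    """Per-token surrogate objective with decoupled clip bounds (assumed values)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_loss(per_sample_token_objectives):
    """Average over all tokens in the batch rather than per sample first."""
    tokens = [obj for sample in per_sample_token_objectives for obj in sample]
    return -sum(tokens) / len(tokens)
```

Averaging at the token level means long responses contribute in proportion to their length, instead of every response counting equally regardless of how many tokens it contains.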

🎓 Training Pipeline

Data Preparation

The training data consists of:

SFT Data (data/sft_data.jsonl): Supervised learning examples
GRPO Data (data/grpo_data.jsonl): Preference-based training pairs

Data format:

{
  "messages": [
    {
      "role": "system",
      "content": "system prompt here..."
    },
    {
      "role": "user",
      "content": "Issue + Proposals here..."
    }
  ],
  "best_proposal_id": 0,
  "golden_proposal": "Synthesized golden proposal text here...",
  "explanation": "Justification for the best proposal selection here...",
  "cot": "Chain of Thought synthesis here...",
  "price": 0
}
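
Before launching a training run, it can help to check that each JSONL line matches the record shape shown above. The sketch below is a hypothetical helper, not part of this repository. The expected types are inferred from the example record only, and it assumes every field is present in every record, which may not hold for both the SFT and GRPO files.

```python
import json

# Field names come from the example record; the types are an assumption.
EXPECTED_FIELDS = {
    "messages": list,
    "best_proposal_id": int,
    "golden_proposal": str,
    "explanation": str,
    "cot": str,
    "price": (int, float),
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"unexpected type for field: {name}")
    messages = record.get("messages")
    if isinstance(messages, list):
        for i, message in enumerate(messages):
            if not isinstance(message, dict) or not {"role", "content"} <= set(message):
                problems.append(f"messages[{i}] needs 'role' and 'content'")
    return problems


def iter_records(lines):
    """Yield (line_number, record, problems) for each non-blank JSONL line."""
    for line_number, line in enumerate(lines, start=1):
        if line.strip():
            record = json.loads(line)
            yield line_number, record, validate_record(record)


if __name__ == "__main__":
    with open("data/sft_data.jsonl", encoding="utf-8") as handle:
        for line_number, _, problems in iter_records(handle):
            if problems:
                print(f"line {line_number}: {problems}")
```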

Training Scripts

SFT Training

bash train_script/sft_qwen_full_multi_node.sh

DAPO/GRPO Training

bash train_script/dapo_qwen_zero_3.sh

Training Infrastructure

Multi-Node Support: Distributed training across multiple GPUs/nodes
Mixed Precision: Automatic mixed precision (AMP) for faster training
Checkpointing: Regular model checkpoints with best model selection
Monitoring: TensorBoard logging and metrics tracking

🔮 Inference

python infer/batch_local_llm_request.py \
    --input_file data/benchmark/swe-manager_benchmark.jsonl \
    --output_file res/inference_results.jsonl \
    --model_path path_to_trained_model \
    --batch_size 4

P2A Framework

The Proposal-to-Action (P2A) framework includes:

Proposal Agent: Generate multiple candidate proposals
Technical Manager: Compare and select the best proposal with justification, and synthesize a golden proposal
Implementation Agent: Implement the synthesized golden proposal into code

See P2A/README.md for detailed documentation.
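
The three roles listed above can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the P2A implementation. `ManagerDecision`, `run_p2a`, and the callable signatures are hypothetical names chosen for this example, and each agent is passed in as a plain function so the sketch stays independent of any model backend.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ManagerDecision:
    """Output of the Technical Manager step (field names mirror the data format)."""
    best_proposal_id: int
    explanation: str
    golden_proposal: str


def run_p2a(
    issue: str,
    propose: Callable[[str, int], str],
    manage: Callable[[str, List[str]], ManagerDecision],
    implement: Callable[[str, str], str],
    num_proposals: int = 3,
) -> str:
    """Run Proposal Agent -> Technical Manager -> Implementation Agent."""
    # 1. Proposal Agent: generate several candidate proposals for the issue.
    proposals = [propose(issue, index) for index in range(num_proposals)]

    # 2. Technical Manager: pick the best one, justify it, synthesize a golden proposal.
    decision = manage(issue, proposals)
    if not 0 <= decision.best_proposal_id < len(proposals):
        raise ValueError("manager selected a proposal id that does not exist")

    # 3. Implementation Agent: turn the golden proposal into code.
    return implement(issue, decision.golden_proposal)
```

Passing the agents as callables makes it easy to swap a stub in for any stage while testing the others.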

📁 Project Structure

swe-manager/
├── README.md                 # This file
├── data/                     # Training data
│   ├── sft_data.jsonl       # Supervised fine-tuning data
│   ├── grpo_data.jsonl      # GRPO training data
│   ├── benchmark/           # Evaluation benchmarks
│   └── ablation_study/      # Ablation study data
├── train_script/            # Training scripts
│   ├── sft_qwen_full_multi_node.sh    # SFT training
│   └── dapo_qwen_zero_3.sh            # DAPO/GRPO training
├── infer/                   # Inference scripts
│   ├── batch_local_llm_request.py     # Batch inference
│   └── swift_infer.sh                 # Swift-based inference
├── P2A/                     # Proposal-to-Action framework
│   └── README.md
├── reward_func/             # Reward function implementations
├── res/                     # Results and outputs
└── img/                     # Documentation images

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Qwen Team for the excellent base model
ModelScope Swift for training infrastructure
DeepSpeed for distributed training support
All contributors and researchers in the software engineering AI community

⭐ If you find this project helpful, please consider giving it a star! ⭐

Made with ❤️ by the SWE-Manager Team