SWE-Manager

January 23, 2026


An 8B Model Trained via Reinforcement Learning for Intelligent Proposal Selection and Synthesis

License: MIT | Python 3.8+

SWE-Manager is an 8B model trained via reinforcement learning (RL) to compare candidate proposals, justify its choice, and synthesize a golden proposal for implementation.


🌟 Overview

SWE-Manager is a specialized 8B-parameter language model fine-tuned from Qwen3-8B. The model excels at:

  • 📊 Comparing Multiple Proposals: Analyzing technical tradeoffs across different solution approaches
  • 💭 Reasoning & Justification: Providing detailed explanations for proposal selection decisions
  • 🎨 Golden Proposal Synthesis: Generating optimized, actionable implementation plans by combining the best aspects of multiple proposals

The model is trained with Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), specifically DAPO.

🌟 Results

SWE-Lancer Manager Benchmark Performance

[Figure: SWE-Manager Benchmark Results]

SWE-Lancer IC Benchmark Performance

[Figure: P2A Framework Results]

🏗️ Model Architecture

Base Model

  • Architecture: Qwen3-8B
  • Parameters: ~8 billion
  • Context Window: 8K tokens
  • Vocabulary: 152K tokens

Training Stages

graph LR
    A[Qwen3-8B Base] --> B[SFT Stage]
    B --> C[GRPO Stage]
    C --> D[SWE-Manager 8B]
    
    style A fill:#f9f,stroke:#333
    style D fill:#9f9,stroke:#333
  1. Supervised Fine-Tuning (SFT)

    • Dataset: data/sft_data.jsonl
    • Format: Issue + Multiple Proposals → Selection + Reasoning
    • Epochs: 3-5, with learning-rate warmup
  2. Reinforcement Learning (GRPO)

    • Technique: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a variant of GRPO
    • Reward Function: reward_func/proposal_reward.py
    • Dataset: data/grpo_data.jsonl
    • Training: Multi-node distributed training with ZeRO-3
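
As a rough illustration of the RL stage, the sketch below shows the ideas DAPO is usually described as layering on GRPO-style group-relative advantages: dropping groups with identical rewards (dynamic sampling), clipping the importance ratio with separate lower and upper bounds (decoupled clip), and averaging over tokens rather than over samples. This is not the code run by train_script/dapo_qwen_zero_3.sh. The function names, the use of population standard deviation, and the clip values 0.2 and 0.28 are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group.

    Assumption: population standard deviation; real trainers may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group(rewards):
    """Dynamic sampling: drop groups whose rewards are all identical,
    because every advantage in such a group is zero."""
    return max(rewards) != min(rewards)


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Decoupled clip: separate lower and upper bounds on the importance ratio."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def token_level_objective(ratios_per_sample, advantages):
    """Average the clipped term over every token in the group.

    ratios_per_sample[i] holds the per-token importance ratios of sample i;
    advantages[i] is that sample's advantage, shared by all of its tokens.
    """
    terms = [
        clipped_term(ratio, adv)
        for ratios, adv in zip(ratios_per_sample, advantages)
        for ratio in ratios
    ]
    return sum(terms) / len(terms)
```

With a group of four responses scored `[1.0, 0.0, 0.0, 1.0]`, `keep_group` returns `True` and the advantages come out close to `+1` and `-1`. A group scored `[1.0, 1.0, 1.0, 1.0]` would be filtered out before the update.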

🎓 Training Pipeline

Data Preparation

The training data consists of JSONL records in the following format:

{
  "messages": [
    {
      "role": "system",
      "content": "system prompt here..."
    },
    {
      "role": "user",
      "content": "Issue + Proposals here..."
    }
  ],
  "best_proposal_id": 0,
  "golden_proposal": "Synthesized golden proposal text here...",
  "explanation": "Justification for the best proposal selection here...",
  "cot": "Chain of Thought synthesis here...",
  "price": 0
}
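
A small loader can catch malformed records before a training run starts. The sketch below checks each JSONL line against the fields shown above. It assumes every field is required and that the messages are exactly one system turn followed by one user turn, as in the example; `parse_record` and `load_records` are hypothetical helper names, not part of this repository.

```python
import json

# Fields taken from the example record above; treating all of them as
# required is an assumption.
REQUIRED_KEYS = {
    "messages",
    "best_proposal_id",
    "golden_proposal",
    "explanation",
    "cot",
    "price",
}


def parse_record(line):
    """Parse one JSONL line and check that it matches the documented shape."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    roles = [message["role"] for message in record["messages"]]
    if roles != ["system", "user"]:
        raise ValueError(f"unexpected message roles: {roles}")
    if not isinstance(record["best_proposal_id"], int):
        raise ValueError("best_proposal_id must be an integer")
    return record


def load_records(path):
    """Read a JSONL file, skipping blank lines, and validate every record."""
    with open(path, encoding="utf-8") as f:
        return [parse_record(line) for line in f if line.strip()]
```

For example, `load_records("data/sft_data.jsonl")` would return a list of dictionaries or raise `ValueError` on the first record that does not fit.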

Training Scripts

SFT Training

bash train_script/sft_qwen_full_multi_node.sh

DAPO/GRPO Training

bash train_script/dapo_qwen_zero_3.sh

Training Infrastructure

  • Multi-Node Support: Distributed training across multiple GPUs/nodes
  • Mixed Precision: Automatic mixed precision (AMP) for faster training
  • Checkpointing: Regular model checkpoints with best model selection
  • Monitoring: TensorBoard logging and metrics tracking

🔮 Inference

python infer/batch_local_llm_request.py \
    --input_file data/benchmark/swe-manager_benchmark.jsonl \
    --output_file res/inference_results.jsonl \
    --model_path path_to_trained_model \
    --batch_size 4
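
One plausible reading of `--batch_size` is that input records are grouped into fixed-size chunks before being sent to the model. The sketch below shows only that batching pattern, not the actual contents of infer/batch_local_llm_request.py. The `generate` callable stands in for whatever model call the script makes, and all function names here are hypothetical.

```python
import json
from itertools import islice


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_batches(records, generate, batch_size=4):
    """Call generate on each batch and collect one output per record, in order.

    generate is a placeholder: it takes a list of records and returns a list
    of outputs of the same length.
    """
    outputs = []
    for batch in batched(records, batch_size):
        outputs.extend(generate(batch))
    return outputs
```

Keeping outputs in input order makes it straightforward to write one result line per benchmark record.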

P2A Framework

The Proposal-to-Action (P2A) framework includes:

  1. Proposal Agent: Generate multiple candidate proposals
  2. Technical Manager: Compare and select the best proposal with justification, and synthesize a golden proposal
  3. Implementation Agent: Implement the synthesized golden proposal into code

[Figure: P2A Framework]

See P2A/README.md for detailed documentation.
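
The three roles listed in this section can be wired together as a simple pipeline. The sketch below is a minimal illustration, not the implementation in P2A/. The `ManagerDecision` fields mirror the training data fields (`best_proposal_id`, `explanation`, `golden_proposal`), while `run_p2a`, the callable signatures, and the default of three proposals are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ManagerDecision:
    """What the Technical Manager returns for one issue."""
    best_proposal_id: int
    explanation: str
    golden_proposal: str


def run_p2a(
    issue: str,
    propose: Callable[[str], str],
    manage: Callable[[str, List[str]], ManagerDecision],
    implement: Callable[[str, str], str],
    num_proposals: int = 3,
) -> str:
    """Run the Proposal Agent, the Technical Manager, then the Implementation Agent."""
    # 1. Proposal Agent: generate several candidate proposals.
    proposals = [propose(issue) for _ in range(num_proposals)]
    # 2. Technical Manager: select, justify, and synthesize a golden proposal.
    decision = manage(issue, proposals)
    if not 0 <= decision.best_proposal_id < len(proposals):
        raise ValueError("manager selected a proposal id that does not exist")
    # 3. Implementation Agent: turn the golden proposal into code.
    return implement(issue, decision.golden_proposal)
```

Passing the agents in as callables keeps the sketch independent of any particular model backend, so each stage can be stubbed out when testing the others.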

📁 Project Structure

swe-manager/
├── README.md                 # This file
├── data/                     # Training data
│   ├── sft_data.jsonl        # Supervised fine-tuning data
│   ├── grpo_data.jsonl       # GRPO training data
│   ├── benchmark/            # Evaluation benchmarks
│   └── ablation_study/       # Ablation study data
├── train_script/             # Training scripts
│   ├── sft_qwen_full_multi_node.sh    # SFT training
│   └── dapo_qwen_zero_3.sh            # DAPO/GRPO training
├── infer/                    # Inference scripts
│   ├── batch_local_llm_request.py     # Batch inference
│   └── swift_infer.sh                 # Swift-based inference
├── P2A/                      # Proposal-to-Action framework
│   └── README.md
├── reward_func/              # Reward function implementations
├── res/                      # Results and outputs
└── img/                      # Documentation images

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Qwen Team for the excellent base model
  • ModelScope Swift for training infrastructure
  • DeepSpeed for distributed training support
  • All contributors and researchers in the software engineering AI community

⭐ If you find this project helpful, please consider giving it a star! ⭐

Made with ❤️ by the SWE-Manager Team