SWAP: Deliberate Reasoning in Language Models as Structure-Aware Planning with an Accurate World Model
April 15, 2026
This repository contains the code for the ACL 2025 (main conference) paper "Deliberate Reasoning in Language Models as Structure-Aware Planning with an Accurate World Model".
Overview
SWAP introduces a structure-aware planning framework for deliberate multi-step reasoning in language models. The framework consists of two core components:
- Generator
- Discriminator
Within this framework, the generator is repurposed to serve three roles:
- Policy model
- World model
- Controller
Given a goal and an initial state, SWAP operates as follows:
- Planning: The policy model generates an optimized plan.
- Action generation: Using the goal, the plan, and the current state, the policy model proposes the next action through deliberate planning.
- State prediction: The world model predicts the next state and updates the entailment graph.
- Control: Based on the goal and the updated state, the controller decides whether to continue the reasoning process or output the final answer.
During this process, the generator explores multiple candidate actions, and the discriminator evaluates the resulting partial trajectories to determine which trajectory should be continued.
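The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `src/`: the names `Trajectory`, `swap_loop`, and all callback parameters are hypothetical, the models are replaced by plain callables, and the search is simplified to keeping only the single best-scoring trajectory at each step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Trajectory:
    """A partial reasoning trajectory: visited states and the actions between them."""
    states: List[Any]
    actions: List[Any] = field(default_factory=list)
    finished: bool = False


def swap_loop(
    goal: Any,
    initial_state: Any,
    make_plan: Callable[[Any, Any], Any],                    # policy model (planning)
    propose_actions: Callable[[Any, Any, Any], List[Any]],   # policy model (actions)
    predict_state: Callable[[Any, Any], Any],                # world model
    should_stop: Callable[[Any, Any], bool],                 # controller
    score: Callable[[Trajectory], float],                    # discriminator
    max_steps: int = 8,
) -> Trajectory:
    plan = make_plan(goal, initial_state)
    best = Trajectory(states=[initial_state])
    for _ in range(max_steps):
        current = best.states[-1]
        # Expand: build one candidate trajectory per proposed action.
        candidates = [
            Trajectory(best.states + [predict_state(current, a)], best.actions + [a])
            for a in propose_actions(goal, plan, current)
        ]
        if not candidates:
            break
        # Select: continue the candidate that the discriminator scores highest.
        best = max(candidates, key=score)
        if should_stop(goal, best.states[-1]):
            best.finished = True
            break
    return best


# Toy usage (assumed task): reach the number 10 from 0 using steps of 1, 2, or 3.
goal = 10
result = swap_loop(
    goal=goal,
    initial_state=0,
    make_plan=lambda g, s: "add steps until the goal is reached",
    propose_actions=lambda g, plan, s: [1, 2, 3],
    predict_state=lambda s, a: s + a,
    should_stop=lambda g, s: s >= g,
    score=lambda t: -abs(goal - t.states[-1]),
)
print(result.actions, result.states[-1], result.finished)  # [3, 3, 3, 1] 10 True
```

In the toy run, the scoring function plays the discriminator's role by preferring the candidate closest to the goal; in SWAP, the discriminator is a trained model that compares partial trajectories.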
Figure: SWAP performs multi-step reasoning through structure-aware planning in tasks such as FOLIO (left) and MATH (right). At each step, given the current state (represented as a graph) and an action, the world model predicts the next state as an updated graph. The policy model is guided by this graph to propose the next action.
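To make the graph-shaped state more concrete, here is one possible way to represent it, assuming nodes are statements and a directed edge points from a premise to a conclusion derived from it. The `EntailmentGraph` class and its methods are hypothetical and exist only for illustration; the repository may store the graph differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class EntailmentGraph:
    """Statements as nodes; an edge (premise, conclusion) records a derivation."""
    statements: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add_statement(self, text: str, premises: Iterable[int] = ()) -> int:
        """Add a statement and link it to the existing statements it follows from."""
        premises = list(premises)
        missing = [p for p in premises if p not in self.statements]
        if missing:
            raise KeyError(f"unknown premise ids: {missing}")
        node_id = len(self.statements)
        self.statements[node_id] = text
        self.edges.update((p, node_id) for p in premises)
        return node_id

    def support(self, node_id: int) -> Set[int]:
        """Return every statement that node_id depends on, directly or transitively."""
        found: Set[int] = set()
        frontier = [node_id]
        while frontier:
            child = frontier.pop()
            for premise, conclusion in self.edges:
                if conclusion == child and premise not in found:
                    found.add(premise)
                    frontier.append(premise)
        return found


# Each call with premises mimics one state update after an action.
graph = EntailmentGraph()
p0 = graph.add_statement("All squares are rectangles.")
p1 = graph.add_statement("All rectangles have four sides.")
p2 = graph.add_statement("Shape A is a square.")
c0 = graph.add_statement("Shape A is a rectangle.", premises=[p0, p2])
c1 = graph.add_statement("Shape A has four sides.", premises=[p1, c0])
print(sorted(graph.support(c1)))  # [0, 1, 2, 3]
```

Keeping the derivation edges explicit means any conclusion can be traced back to the statements it rests on, which is the kind of structure a policy model could use when choosing the next action.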
Quick Start
Directory structure
SWAP/
├── model_weights/
├── output/
├── script/
└── src/
Setup
git clone https://github.com/xiongsiheng/SWAP.git
cd SWAP
# Create and activate the training environment
conda create -n swap_train python=3.10 -y
conda activate swap_train
# Install training dependencies
pip install -r requirements_train.txt
# Create and activate the evaluation environment
# vLLM is used to substantially accelerate evaluation
conda create -n swap_eval python=3.10 -y
conda activate swap_eval
# Install evaluation dependencies
pip install -r requirements_eval.txt
Training
# Train the generator
bash script/train_sft_generator_gsm8k.sh
# Train the discriminator
bash script/train_sft_discriminator_gsm8k.sh
# Optional: distributed training
bash script/train_sft_discriminator_gsm8k_dist.sh
# Optional: DPO training
bash script/train_dpo_discrimintor_gsm8k.sh
Evaluation
# Evaluate the generator (without planning)
bash script/eval_generator_gsm8k.sh
# Evaluate the full system
bash script/eval_system_gsm8k.sh
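# Optional: distributed evaluation, one shard per GPU
# (launch each command in its own terminal, or append & to run it in the background)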
CUDA_VISIBLE_DEVICES=0 NUM_SHARDS=4 SHARD_INDEX=0 bash script/eval_system_gsm8k.sh
CUDA_VISIBLE_DEVICES=1 NUM_SHARDS=4 SHARD_INDEX=1 bash script/eval_system_gsm8k.sh
CUDA_VISIBLE_DEVICES=2 NUM_SHARDS=4 SHARD_INDEX=2 bash script/eval_system_gsm8k.sh
CUDA_VISIBLE_DEVICES=3 NUM_SHARDS=4 SHARD_INDEX=3 bash script/eval_system_gsm8k.sh
# Optional: download our checkpoints
For detailed descriptions of the available arguments and configuration options, please refer to the source code.
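The `NUM_SHARDS` and `SHARD_INDEX` variables in the distributed evaluation commands suggest that each process handles a disjoint slice of the evaluation set. How the script actually splits the data is not documented here, so the following is only a sketch of one common scheme (a strided split); `select_shard` and `shard_from_env` are hypothetical helpers, not functions from this repository.

```python
import os
from typing import List, Mapping, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_shard(items: Sequence[T], num_shards: int, shard_index: int) -> List[T]:
    """Return every num_shards-th item, starting at shard_index (strided split)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must satisfy 0 <= shard_index < num_shards")
    return list(items[shard_index::num_shards])


def shard_from_env(items: Sequence[T], env: Optional[Mapping[str, str]] = None) -> List[T]:
    """Read NUM_SHARDS and SHARD_INDEX (assumed defaults: 1 and 0) and pick the slice."""
    env = os.environ if env is None else env
    num_shards = int(env.get("NUM_SHARDS", "1"))
    shard_index = int(env.get("SHARD_INDEX", "0"))
    return select_shard(items, num_shards, shard_index)


examples = list(range(10))
print(select_shard(examples, num_shards=4, shard_index=1))  # [1, 5, 9]
```

With a strided split, the four shards together cover every example exactly once, and shard sizes differ by at most one, so the per-GPU workloads stay roughly balanced.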
Datasets & Checkpoints
All datasets used in SWAP (GSM8K, MATH, FOLIO, ReClor, HumanEval, MBPP), with trajectory and process supervision, are available on the Hugging Face Hub under sxiong/SWAP:
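from datasets import load_dataset
# Load the GSM8K trajectory configuration of the SWAP dataset from the Hugging Face Hub
dataset = load_dataset("sxiong/SWAP", "gsm8k_trajectory")
# Printing the DatasetDict lists its splits with their columns and row counts
print(dataset)
# Select the training split
split = dataset["train"]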
We also provide the corresponding checkpoints.
In addition, we release an updated version of the datasets, together with the corresponding checkpoints.
Citation
@inproceedings{xiong2025deliberate,
  title={Deliberate reasoning in language models as structure-aware planning with an accurate world model},
  author={Xiong, Siheng and Payani, Ali and Yang, Yuan and Fekri, Faramarz},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={31900--31931},
  year={2025}
}