RAM Projects

April 10, 2026

Here we list projects undertaken in the RAM framework that are shared publicly, in the form of papers, public tasks, and/or shared model code. This directory also contains subfolders for some of the projects that are housed in the RAM repo; others are maintained via external websites.

Reasoning

CoT and RL

Principia [paper] [blog] [tweet] Reasoning over Mathematical Objects.
ParaGator [paper] [blog] [tweet] Train generation with pass@k, and aggregation with pass@1 on-policy, end-to-end for better results.
AggLM [paper] [tweets] Uses RL to train an LLM solution aggregator, with strong results.
RESTRAIN [paper] [tweets] Self-training RL method that improves over other label-free / test-time training methods.
StepWiser [paper] Stepwise Generative Judge trained with RL. SOTA on ProcessBench; gains when used at train/test time.
OptimalThinkingBench [project] [paper]. New benchmark measuring overthinking & underthinking of LLMs.
Reasoning for Factuality [paper]. Shows how to learn CoTs that improve factuality via a new reward function.
ASTRO [paper]. Teaching LLMs to reason by reflecting and backtracking in-context.
NaturalThoughts [paper]. Creates better CoT distillation emphasizing difficult and diverse reasoning.
Bridging Online and Offline RL [paper]. Mix verifiable & non-verifiable tasks, comparing semi-online DPO & GRPO (similar results).
Thinking LLMs [paper]. Train LLMs to write down their internal thoughts for general instructions (non-verifiable tasks).
Iterative Reasoning Preference Optimization [paper] Shows how to use iterative optimization to train CoTs on verifiable tasks.

other algorithms

Coconut (Continuous Chain-of-Thought) [project]. Training LLMs to reason in continuous latent space (rather than using language).
Backtracking Improves Generation Safety [paper]. Trains LLMs to generate a RESET token if the partial-generation is bad.
System 2 Distillation [paper]. Distilling reasoning traces (System 2) back into the Transformer (System 1).
Beyond A* [paper]. Better Planning with Transformers via Search Dynamics Bootstrapping.
SWEET-RL [project]. Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks.

inference

From decoding to meta-generation [paper]. Survey paper on reasoning methods.
System 2 Attention [paper]. Make LLM plan what it attends to as a generative process, decreasing bias & increasing factuality.
Beyond A* [paper]. Better Planning with Transformers via Search Dynamics Bootstrapping.
Chain-of-Verification Reduces Hallucination [paper]. Reduces hallucination by LLM self-identifying and verifying generated facts.
Ask, Refine, Trust [paper]. Technique that uses critical questions to determine if an LLM generation needs refinement.
ToolVerifier [paper]. Generalization to New Tools via Self-Verification.

Reward Models & Evaluation

RLLM [paper] [blog] [tweet] Unified Post-Training via On-Policy-Trained LM-as-RM.
HERO [paper] [tweets] Combines sparse verifiable and dense RMs into a hybrid reward to give better results.
StepWiser [paper] Stepwise Generative Judge trained with RL. SOTA on ProcessBench; gains when used at train/test time.
DARLING [paper] Method to optimize quality+diversity reward to give gains on each over conventional GRPO RL.
Reasoning for Factuality [paper]. Shows how to learn CoTs that improve factuality via a new reward function.
J1 [paper]. Learns CoTs for LLM-as-a-Judge via GRPO, outperforms EvalPlanner & Distilled R1 models at 8B and 70B scale.
Eval-Planner [paper]. Learning powerful plan+execution CoTs for LLM-as-a-Judge critics, SOTA on RewardBench.
Self-Taught Evaluators [project]. Improving LLM-as-a-Judge using iteratively generated synthetic data only (no human annotation).
Branch-Solve-Merge [paper]. Reasoning method to improve LLM Evaluation and Generation.
Self-Rewarding LLMs [paper] Shows LLMs can judge themselves to self-improve without human feedback.

Agents

Experience Synthesis [paper] [tweets]. Scaling training environments for RL by simulating them with reasoning LLMs.
Early Experience [paper] [tweets]. SFT is sparse; RL on long horizons is hard. EE provides new mid-training signals that help.
Self-Challenging LLM Agents [paper]. LLM creates own challenging agentic tool-use tasks, resulting in better agentic performance.
SWEET-RL [project]. Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks.
ToolVerifier [paper]. Generalization to New Tools via Self-Verification.

Pre- and Mid-Training

Thinking Mid-training: RL of Interleaved Reasoning [paper] [blog] [tweet] An intermediate SFT+RL mid-training phase to teach models how to think.
Self-Improving Pretraining [paper] [tweet] Reinvents pretraining with sequence-based RL to improve safety, quality and factuality.
Recycling the Web [paper] A method to create more high quality pretraining data via rewriting low quality documents.

Synthetic Data

synthetic data & data quality

CoT-Self-Instruct [paper] Create synthetic data using reasoning followed by filtering for high quality, for large gains.
RIP [paper] A method to curate high quality data, or create high quality synthetic data. Gives large improvements.
Recycling the Web [paper] A method to create more high quality pretraining data via rewriting low quality documents.
Instruction Back-and-Forth Translation [paper] Improves Instruction Backtranslation by rewriting the web document.
Instruction Backtranslation [paper] Self-Alignment method by predicting instructions for web documents.

synthetic data for complex reasoning & tools

SPICE [paper] [tweets]. Challenger creates tasks grounded on documents, Reasoner solves them in self-play, both trained by RL.
Self-Challenging LLM Agents [paper]. LLM creates own challenging agentic tool-use tasks, resulting in better agentic performance.
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions [paper]. Scaling reasoning capabilities with diverse and high-quality questions.
Source2Synth [paper]. Generating synthetic data from real sources to improve LLMs on complex reasoning tasks.
ToolVerifier [paper]. Generalization to New Tools via Self-Verification.

(Self-)Alignment

(self-)alignment optimization techniques

CharacterFlywheel [paper] Continual learning in production (with humans-in-the-loop).
SPICE [paper] [tweets]. Challenger creates tasks grounded on documents, Reasoner solves them in self-play, both trained by RL.
WaltzRL [paper] [tweets] Method to improve safety alignment through multi-agent RL.
RLHI [paper] [tweets] Method to train with RL from organic Human Interaction.
DARLING [paper] Method to optimize quality+diversity reward to give gains on each over conventional GRPO RL.
Self-Challenging LLM Agents [paper]. LLM creates own challenging agentic tool-use tasks, resulting in better agentic performance.
Solve & Verify [paper]. A self-play framework for LLMs to learn how to code by writing code & unit tests.
Bridging Online and Offline RL [paper]. Mix verifiable & non-verifiable tasks, comparing semi-online DPO & GRPO (similar results).
Diversity Preference Optimization [paper] SOTA LLMs have model collapse. DivPO training improves diversity with similar quality.
Self-Consistency Preference Optimization [paper] Self-training without human labels that matches supervised training performance.
Thinking LLMs [paper]. Train LLMs to write down their internal thoughts before responding to general instructions.
Meta-Rewarding LLMs [paper] LLMs that can judge their own judgments to self-improve both acting & evaluating actions.
Iterative Reasoning Preference Optimization [paper] Shows how to improve reasoning tasks with iterative DPO.
Length Following [project] Method to make LLMs follow length instructions much better & remove length bias in evaluations.
Self-Rewarding LLMs [paper] Shows LLMs can judge themselves to self-improve without human feedback.
Iterative DPO & Cringe Loss [paper] Shows iterative learning improves alignment.

(self-)alignment via other methods

Instruction Back-and-Forth Translation [paper] Improves Instruction Backtranslation by rewriting the web document.
Instruction Backtranslation [paper] Self-Alignment method by predicting instructions for web documents.
Leveraging Implicit Feedback [paper] Method to learn from human feedback in dialogue deployment data to improve LLMs.

data curation

RIP [paper] A method to curate high quality data, or create high quality synthetic data. Gives large improvements.

Memory & Architectures

memory

Reverse Training [paper] Method for pretraining that mitigates the reversal curse & improves performance.
MemWalker [paper] Novel memory architecture: builds & navigates a tree (structured long-term memory) via LLM prompting.
Self-Notes [project] LLMs generate internal thoughts as they read text, enabling reasoning & memorization.

architectures

Stochastic activations [paper] [tweets] Select between several non-linear functions in the feed-forward layers of an LLM.
Multi-token attention [project] [paper] Attention mechanism that can focus on multiple tokens simultaneously.
Byte Latent Transformer [paper] New Byte-level LLM architecture that matches tokenization-based LLM performance at scale.
Adaptive Decoding via Latent Preference Optimization [paper] New layer that selects decoding params automatically per token.
Contextual Position Encoding [project] New attention mechanism that fixes problems in copying & counting for Transformers.
Branch-Train-MiX [paper] Novel MoE architecture that is very efficient during training.