RAGEN Evaluation Guide
April 14, 2026
This guide explains how to evaluate trained RAGEN models and configure output formats.
Quick Start
Evaluate a model using the default configuration:
python -m ragen.llm_agent.agent_proxy --config-name eval
Or use a specific config:
python -m ragen.llm_agent.agent_proxy --config-name _2_sokoban
Configuration File
Evaluation settings are configured in config/eval.yaml. Key sections:
Model Configuration
model_path: Qwen/Qwen2.5-3B-Instruct
lora:
rank: 0 # Set to 0 to disable LoRA; set to > 0 for LoRA-finetuned models
alpha: 64
target_modules: all-linear
Rollout Settings
actor_rollout_ref:
rollout:
max_model_len: 3600 # Max context length
response_length: 400 # Max tokens per response
val_kwargs:
do_sample: True # Enable sampling
temperature: 0.5 # Sampling temperature
top_p: 1.0 # Nucleus sampling
top_k: -1 # Top-k sampling (-1 = disabled)
Agent Proxy Settings
agent_proxy:
context_window_mode: "full" # "full" | "limited_multi_turn" | "single_turn"
max_context_window: -1 # Number of previous turns to retain (-1 = unlimited)
max_turn: 5 # Maximum interaction turns
enable_think: True # Enable <think>...</think> reasoning
Context Window Modes:
- full: Keep all previous turns in context
- limited_multi_turn: Keep only the last max_context_window turns
- single_turn: Only current state, no history
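To make the three modes concrete, here is an illustrative sketch (not RAGEN's actual implementation) of how a turn history could be trimmed under each context_window_mode. The function name build_context and the shape of history are assumptions for illustration; max_context_window mirrors the config key of the same name.

```python
def build_context(history, mode, max_context_window=-1):
    """Return the slice of `history` (a list of per-turn messages)
    that would be kept in context under the given mode. Sketch only."""
    if mode == "full":
        return history                        # keep every previous turn
    if mode == "limited_multi_turn":
        if max_context_window < 0:
            return history                    # -1 = unlimited
        return history[-max_context_window:]  # keep only the last N turns
    if mode == "single_turn":
        return history[-1:]                   # current state only, no history
    raise ValueError(f"unknown context_window_mode: {mode}")

turns = [{"turn": i} for i in range(5)]
assert build_context(turns, "full") == turns
assert build_context(turns, "limited_multi_turn", 2) == turns[-2:]
assert build_context(turns, "single_turn") == turns[-1:]
```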
Environment Settings
es_manager:
  val:
    env_groups: 32              # Number of environment groups
    group_size: 16              # Environments per group (total = groups × size)
    env_configs:
      tags: ["CoordSokoban"]    # Environment type(s)
      n_groups: [32]            # Groups per environment type
Available environment tags are defined in config/envs.yaml under custom_envs.
Output Configuration
output:
dir: results/eval # Output directory
filename: val_rollouts.pkl # Output filename
format: pkl # pkl | jsonl
append_timestamp: true # Add timestamp to filename
save_jsonl_backup: false # Save JSONL backup when format=pkl
save_pkl_backup: false # Save PKL backup when format=jsonl
keep_batch_keys: null # Filter batch keys (null = keep all)
keep_non_tensor_keys: null # Filter non-tensor keys (null = keep all)
keep_meta_info: true # Include metadata
Output Formats
PKL Format (Default)
Binary format containing the full DataProto object with tensors, metadata, and trajectories.
output:
format: pkl
filename: val_rollouts.pkl
Visualization:
python scripts/visualize.py --rollout_path results/eval/
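If you want to inspect a rollout file programmatically rather than through the visualizer, a minimal loader can be sketched as below, assuming the .pkl file is a standard pickle of the DataProto object described above (the helper name load_rollouts is ours, not part of RAGEN).

```python
import pickle

def load_rollouts(path):
    """Load an evaluation rollout .pkl file.

    Sketch only: assumes the file is a plain pickle of the
    DataProto object (tensors, metadata, trajectories)."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Only unpickle files you produced yourself; pickle can execute arbitrary code on load.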
JSONL Format (OpenAI-Compatible)
Human-readable JSONL where each line is a trajectory in OpenAI message format.
output:
format: jsonl
filename: trajectories.jsonl
JSONL structure:
{
"custom_id": "traj_0",
"messages": [
{"role": "user", "content": "Initial state..."},
{"role": "assistant", "content": "<think>...</think><ans>action</ans>"},
{"role": "user", "content": "Next state... (reward: 1.0)"},
...
],
"metadata": {
"env_id": 0,
"group_id": 0,
"success": true,
"total_reward": 5.0,
"num_turns": 3,
"entropy": 2.45,
"n_tokens": 128
}
}
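Because each line is standalone JSON, the file is easy to aggregate with the standard library. The sketch below assumes every line follows the structure shown above, with a metadata dict carrying success and total_reward; the function summarize_jsonl is a hypothetical helper, not part of RAGEN.

```python
import json

def summarize_jsonl(path):
    """Compute simple aggregates from a trajectories.jsonl file.
    Sketch only; assumes the OpenAI-compatible line format above."""
    n = 0
    successes = 0
    reward_sum = 0.0
    with open(path) as f:
        for line in f:
            meta = json.loads(line)["metadata"]
            n += 1
            successes += bool(meta["success"])
            reward_sum += meta["total_reward"]
    return {
        "trajectories": n,
        "success_rate": successes / n,
        "mean_reward": reward_sum / n,
    }
```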
Dual Output
Save both formats simultaneously:
output:
format: pkl
save_jsonl_backup: true # Also save JSONL
Or:
output:
format: jsonl
save_pkl_backup: true # Also save PKL
Converting Existing PKL Files
Convert existing PKL rollouts to JSONL:
python scripts/convert_to_jsonl.py \
--input results/eval/val_rollouts_20260413_123456.pkl \
--output trajectories.jsonl
Auto-generate output filename:
python scripts/convert_to_jsonl.py --input results/eval/val_rollouts_*.pkl
# Creates: val_rollouts_*.jsonl in the same directory
Advanced Usage
Override Config from Command Line
python -m ragen.llm_agent.agent_proxy --config-name eval \
model_path=path/to/checkpoint \
actor_rollout_ref.rollout.temperature=0.7 \
output.format=jsonl \
es_manager.val.env_groups=64
Custom Evaluation Seeds
Control randomness for reproducibility:
seed:
val: 123 # Validation seed
GPU Configuration
system:
CUDA_VISIBLE_DEVICES: "0" # GPU device(s)
actor_rollout_ref:
rollout:
tensor_model_parallel_size: 1 # Number of GPUs for tensor parallelism
gpu_memory_utilization: 0.9 # Max GPU memory fraction
Filtering Output Data
Reduce file size by filtering keys:
output:
keep_batch_keys: ["rm_scores", "responses"] # Only keep these tensor keys
keep_non_tensor_keys: ["history", "metrics"] # Only keep these non-tensor keys
Set to null to keep all keys.
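The "null = keep all" semantics can be pictured with a small dictionary filter. This is a sketch of the behavior described by the config, not RAGEN's internal code; the batch keys shown are examples.

```python
def filter_keys(data, keep):
    """Keep only the listed keys; keep=None keeps everything,
    mirroring the `null = keep all` config semantics. Sketch only."""
    if keep is None:
        return dict(data)
    return {k: v for k, v in data.items() if k in keep}

batch = {"rm_scores": [0.1], "responses": ["a"], "attention_mask": [1]}
assert filter_keys(batch, ["rm_scores", "responses"]) == {
    "rm_scores": [0.1], "responses": ["a"]
}
assert filter_keys(batch, None) == batch
```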
Metrics
After evaluation, metrics are displayed in the terminal:
rollout rewards: 0.85
metrics:
CoordSokoban/success: 0.78
CoordSokoban/num_actions: 4.2
CoordSokoban/pass@16: 0.92
Common metrics:
- {env}/success: Success rate (0-1)
- {env}/num_actions: Average actions per trajectory
- {env}/pass@k: At least one success in a group of k rollouts
- episodic_return: Cumulative reward
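The pass@k metric follows directly from its definition: split rollouts into groups of k and count the groups with at least one success. The sketch below computes it empirically from a flat list of per-rollout success flags grouped consecutively (as rollouts of the same environment group are); it illustrates the metric's definition, not RAGEN's implementation.

```python
def pass_at_k(successes, k):
    """Empirical pass@k: fraction of consecutive groups of size k
    containing at least one success. Sketch of the definition only."""
    groups = [successes[i:i + k] for i in range(0, len(successes), k)]
    return sum(any(g) for g in groups) / len(groups)

# Two groups of k=4: one group has a success, the other does not.
flags = [False, True, False, False, False, False, False, False]
assert pass_at_k(flags, 4) == 0.5
```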
Troubleshooting
Out of memory:
actor_rollout_ref:
rollout:
max_model_len: 2048 # Reduce context length
response_length: 128 # Reduce response length
gpu_memory_utilization: 0.7 # Lower memory usage
Evaluation too slow:
- Reduce es_manager.val.env_groups or group_size
- Use temperature: 0 for greedy decoding (faster)
- Set enforce_eager: False to enable compiled mode (if compatible)
JSONL parsing errors:
- Ensure history data is serializable
- Check for special characters in state/response strings
- Use save_pkl_backup: true to preserve original data
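When hunting down JSONL parsing errors, a quick validator can point at the offending lines. The helper below (find_bad_lines is our name, not a RAGEN script) reports every line that fails to parse as JSON.

```python
import json

def find_bad_lines(path):
    """Return (line_number, error) pairs for lines that fail to
    parse as JSON, to help track down serialization problems in
    `history` data. Sketch only."""
    bad = []
    with open(path) as f:
        for i, line in enumerate(f, 1):
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                bad.append((i, str(e)))
    return bad
```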
Related Documentation
- Main README - General RAGEN overview
- Rollout Filtering Guide - Training-time filtering
- V1 README - Legacy evaluation instructions
- WebShop Evaluation - WebShop-specific setup