Experiments
June 20, 2026 · View on GitHub
This directory contains the benchmark adapters, reproduction entrypoints, and benchmark-specific documentation for the current LightMem2 runtime path.
Large benchmark assets and experiment outputs are stored outside Git so that repository clones stay lightweight.
What stays in the repository
The repository keeps:
- benchmark task definitions
- grader code
- experiment-side scripts
- runtime adapters
- environment notes
- reproduction READMEs
The repository does not keep the full public asset bundles or large experiment result dumps.
External experiment storage
Shared experiment storage lives on Google Drive:
Recommended top-level layout on Drive:
LightMem2/
└── TokenPilot/
├── experiment-data/
│ ├── pinchbench/
│ │ └── assets/
│ └── claw-eval/
│ ├── general/
│ └── tasks/
└── experiment-results/
├── pinchbench/
│ ├── isolated/
│ └── continuous/
└── claw-eval/
├── isolated/
└── continuous/
Use experiment-data/ for benchmark inputs and experiment-results/ for
produced outputs, logs, tables, and merged result artifacts.
Public benchmark index
- experiments/tokenpilot/README.md
- experiments/tokenpilot/pinchbench/README.md
- experiments/tokenpilot/claw-eval/README.md
Before you reproduce anything
Complete the root-level runtime setup first:
- follow the installation instructions in the repository README.md
- make sure
openclawis already runnable in your shell - confirm that the current TokenPilot runtime component is installed and usable
- download the required benchmark data bundle(s) from the shared Google Drive
- place the bundle(s) into the benchmark-local directories described below
The benchmark directories assume the runtime path is already working before you start reproduction.
Recommended workflow
- Choose the benchmark you want to reproduce.
- Open the benchmark-specific README.
- Download the matching
experiment-databundle from Google Drive. - Copy the bundle into the local benchmark directory expected by the scripts.
- Run the official baseline or method runner from that subtree.
- Save large outputs under your local run directories, then upload the final
result bundle to the matching
experiment-resultsfolder on Drive.
The canonical public entrypoints are:
experiments/tokenpilot/pinchbench/scripts/run_baseline.shexperiments/tokenpilot/pinchbench/scripts/run_method.shexperiments/tokenpilot/claw-eval/scripts/run_baseline.shexperiments/tokenpilot/claw-eval/scripts/run_method.sh
Local mount points
PinchBench
- benchmark README:
- local asset mount point:
experiments/tokenpilot/pinchbench/dataset/assets/
- recommended Drive source:
LightMem2/TokenPilot/experiment-data/pinchbench/assets/
- recommended Drive result sink:
LightMem2/TokenPilot/experiment-results/pinchbench/
Claw-Eval
- benchmark README:
- local asset mount points:
experiments/tokenpilot/claw-eval/dataset/general/experiments/tokenpilot/claw-eval/dataset/tasks/
- recommended Drive source:
LightMem2/TokenPilot/experiment-data/claw-eval/general/LightMem2/TokenPilot/experiment-data/claw-eval/tasks/
- recommended Drive result sink:
LightMem2/TokenPilot/experiment-results/claw-eval/
Notes
- The root README gives the public project overview.
- This directory is the top-level entry for experiment reproduction.
- Exact benchmark commands, asset requirements, and pitfalls remain documented in each benchmark subtree.