Experiments

June 20, 2026 · View on GitHub

This directory contains the benchmark adapters, reproduction entrypoints, and benchmark-specific documentation for the current LightMem2 runtime path.

Large benchmark assets and experiment outputs are stored outside Git so that repository clones stay lightweight.

What stays in the repository

The repository keeps:

  • benchmark task definitions
  • grader code
  • experiment-side scripts
  • runtime adapters
  • environment notes
  • reproduction READMEs

The repository does not keep the full public asset bundles or large experiment result dumps.

External experiment storage

Shared experiment storage lives on Google Drive:

Recommended top-level layout on Drive:

LightMem2/
└── TokenPilot/
    ├── experiment-data/
    │   ├── pinchbench/
    │   │   └── assets/
    │   └── claw-eval/
    │       ├── general/
    │       └── tasks/
    └── experiment-results/
        ├── pinchbench/
        │   ├── isolated/
        │   └── continuous/
        └── claw-eval/
            ├── isolated/
            └── continuous/

Use experiment-data/ for benchmark inputs and experiment-results/ for produced outputs, logs, tables, and merged result artifacts.

Public benchmark index

Before you reproduce anything

Complete the root-level runtime setup first:

  1. follow the installation instructions in the repository README.md
  2. make sure openclaw is already runnable in your shell
  3. confirm that the current TokenPilot runtime component is installed and usable
  4. download the required benchmark data bundle(s) from the shared Google Drive
  5. place the bundle(s) into the benchmark-local directories described below

The benchmark directories assume the runtime path is already working before you start reproduction.

  1. Choose the benchmark you want to reproduce.
  2. Open the benchmark-specific README.
  3. Download the matching experiment-data bundle from Google Drive.
  4. Copy the bundle into the local benchmark directory expected by the scripts.
  5. Run the official baseline or method runner from that subtree.
  6. Save large outputs under your local run directories, then upload the final result bundle to the matching experiment-results folder on Drive.

The canonical public entrypoints are:

  • experiments/tokenpilot/pinchbench/scripts/run_baseline.sh
  • experiments/tokenpilot/pinchbench/scripts/run_method.sh
  • experiments/tokenpilot/claw-eval/scripts/run_baseline.sh
  • experiments/tokenpilot/claw-eval/scripts/run_method.sh

Local mount points

PinchBench

  • benchmark README:
  • local asset mount point:
    • experiments/tokenpilot/pinchbench/dataset/assets/
  • recommended Drive source:
    • LightMem2/TokenPilot/experiment-data/pinchbench/assets/
  • recommended Drive result sink:
    • LightMem2/TokenPilot/experiment-results/pinchbench/

Claw-Eval

  • benchmark README:
  • local asset mount points:
    • experiments/tokenpilot/claw-eval/dataset/general/
    • experiments/tokenpilot/claw-eval/dataset/tasks/
  • recommended Drive source:
    • LightMem2/TokenPilot/experiment-data/claw-eval/general/
    • LightMem2/TokenPilot/experiment-data/claw-eval/tasks/
  • recommended Drive result sink:
    • LightMem2/TokenPilot/experiment-results/claw-eval/

Notes

  • The root README gives the public project overview.
  • This directory is the top-level entry for experiment reproduction.
  • Exact benchmark commands, asset requirements, and pitfalls remain documented in each benchmark subtree.