RE-experiment

April 1, 2026

An authoritative skills repository for experiment design, implementation, validation, and analysis in Research-Equality.

This repository focuses on the empirical workflow: planning research programs, formalizing methods, designing experiments, implementing experiment code, debugging failures, analyzing results, generating figures and tables, and tracing reported numbers back to executed code. It is not a general end-to-end academic research toolkit. It is a curated home for skills directly related to experimental design and implementation.

Positioning

Keep only skills related to experiment planning, method formalization, implementation, debugging, validation, and result analysis
Normalize everything into a portable repository layout that does not depend on any one author's local machine
Serve as the authoritative home for future experiment-oriented skills in Research-Equality

Included Skills

The skill collection lives under skills/.

Planning and Formalization

research-planning: turn a topic or method idea into an executable research and experiment roadmap
hypothesis-discrimination: turn observations into falsifiable and competing hypotheses
atomic-decomposition: break complex methods into atomic components with math-code mappings
algorithm-design: generate pseudocode and system diagrams before implementation
math-reasoning: support derivations, proof sketches, notation discipline, and statistical-test selection
dataset-intake-profiling: profile dataset structure, coverage, imbalance, and leakage risk before finalizing the plan
statistical-analysis-plan: pre-specify estimands, tests, power assumptions, and reporting rules
critical-experiment-review: audit design validity, baseline fairness, confounds, and claim scope before trusting a result
experiment-memory-evolution: distill repeated outcomes into reusable strategy memory and failed-direction memory

Experiment Execution

experiment-orchestration: manage multi-hypothesis experiment campaigns and state across many runs
experiment-template-design: build reusable experiment sandboxes with stable entrypoints and seed ideas
experiment-design: define staged experiment plans, baselines, datasets, metrics, and ablations
resource-aware-experiment-design: scope conditions against real hardware and runtime budgets
execution-preflight: verify hardware, packages, commands, output paths, and smoke-test strategy before long runs
bounded-experiment-loop: execute a small run budget against a fixed baseline and command contract
experiment-stage-reflection: revise the remaining plan after a meaningful stage using metrics, artifacts, and unmet success signals
modular-training-stack: structure training code into model, data, trainer, callback, and logging layers
experiment-code: implement and iteratively improve training and evaluation pipelines
experiment-code-validation: catch syntax, policy, import, and fake-metric issues before execution
code-debugging: diagnose runtime, logic, and output failures in experiment code
experiment-repair-loop: classify deficient runs and generate targeted repairs beyond basic debugging
experiment-output-contract: standardize run_i/ outputs, summaries, plots, and notes for handoff

LLM Implementation Stack

llm-data-pipeline: build scalable ingestion, curation, deduplication, and sharding pipelines
llm-fine-tuning: route LoRA- and QLoRA-style supervised fine-tuning to the right stack
llm-post-training: handle SFT-, DPO-, PPO-, and GRPO-style post-training workflows
distributed-training: choose between Accelerate, FSDP2, DeepSpeed, and Ray Train
llm-benchmarking: benchmark language and code models with standard harnesses
experiment-tracking: standardize observability on W&B, MLflow, or TensorBoard

Validation and Reporting

data-analysis: compute descriptive statistics, significance tests, and result summaries
evidence-sufficiency-gate: decide whether current evidence is strong enough to stop iterating and write up
figure-generation: generate scientific plots for curves, ablations, comparisons, and diagnostics
table-generation: convert structured results into publication-ready LaTeX tables
backward-traceability: link reported numbers back to the exact code output that produced them
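
As an illustration of what backward traceability can mean in practice, the following is a minimal sketch and not the bundled skill. It assumes the producing run wrote a flat JSON file of metric values; the function names, the record fields, and the `tol` parameter are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def trace_record(claim: str, reported: float, source: Path, key: str,
                 tol: float = 1e-9) -> dict:
    """Tie one reported number to the output file that produced it.

    Assumption: `source` is a JSON file shaped like {"metric_name": number}.
    `tol` absorbs rounding applied when the number was written up.
    """
    produced = json.loads(source.read_text())[key]
    return {
        "claim": claim,
        "reported": reported,
        "produced": produced,
        "matches": abs(produced - reported) <= tol,
        "source": str(source),
        "sha256": file_sha256(source),  # pins the exact file contents
    }
```

Storing the digest next to the claim makes it possible to detect later that the output file changed after the number was reported.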

See skills/README.md for the skill catalog.

Skill Routing

The repository uses a short default path plus optional overlays, so projects do not have to pass through every skill.

Default path for most projects:

research-planning
dataset-intake-profiling if the data contract is still unclear
experiment-design
resource-aware-experiment-design and statistical-analysis-plan
experiment-code
experiment-code-validation
execution-preflight
bounded-experiment-loop
experiment-stage-reflection
data-analysis
critical-experiment-review
evidence-sufficiency-gate
experiment-output-contract, figure-generation, table-generation, backward-traceability

Optional overlays:

hypothesis-discrimination, atomic-decomposition, algorithm-design, and math-reasoning only when the method still needs formalization
experiment-orchestration and experiment-memory-evolution only for long-running or multi-branch experiment programs
experiment-template-design only when packaging a reusable sandbox
modular-training-stack only when architecture cleanup is the bottleneck
llm-data-pipeline, llm-fine-tuning, llm-post-training, distributed-training, and llm-benchmarking only for LLM-specific implementations
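
The routing above can be sketched as a small lookup. This is only an illustration of the default path and overlay conditions as listed; the function name `route` and its boolean flags are hypothetical, and the overlays are returned separately because the text does not say where they slot into the default order.

```python
def route(
    data_contract_clear: bool = True,
    needs_formalization: bool = False,
    multi_branch: bool = False,
    reusable_sandbox: bool = False,
    architecture_cleanup: bool = False,
    llm_specific: bool = False,
) -> dict[str, list[str]]:
    """Return the default skill path and any overlays that apply."""
    default = [
        "research-planning",
        "experiment-design",
        "resource-aware-experiment-design",
        "statistical-analysis-plan",
        "experiment-code",
        "experiment-code-validation",
        "execution-preflight",
        "bounded-experiment-loop",
        "experiment-stage-reflection",
        "data-analysis",
        "critical-experiment-review",
        "evidence-sufficiency-gate",
        "experiment-output-contract",
        "figure-generation",
        "table-generation",
        "backward-traceability",
    ]
    if not data_contract_clear:
        # Profiling comes right after planning when the data contract is unclear.
        default.insert(1, "dataset-intake-profiling")

    overlays: list[str] = []
    if needs_formalization:
        overlays += ["hypothesis-discrimination", "atomic-decomposition",
                     "algorithm-design", "math-reasoning"]
    if multi_branch:
        overlays += ["experiment-orchestration", "experiment-memory-evolution"]
    if reusable_sandbox:
        overlays.append("experiment-template-design")
    if architecture_cleanup:
        overlays.append("modular-training-stack")
    if llm_specific:
        overlays += ["llm-data-pipeline", "llm-fine-tuning", "llm-post-training",
                     "distributed-training", "llm-benchmarking"]
    return {"default": default, "overlays": overlays}
```

For example, `route(data_contract_clear=False, llm_specific=True)` adds dataset-intake-profiling to the default path and returns the five LLM skills as overlays.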

Key boundaries:

experiment-design defines what to test; resource-aware-experiment-design trims it to the real budget
experiment-code-validation checks code statically; execution-preflight checks the runtime environment and launch contract
code-debugging fixes bugs; experiment-repair-loop repairs broader failed runs with scientific or budget deficiencies
critical-experiment-review audits methodological validity; evidence-sufficiency-gate decides whether the evidence is already enough to stop iterating
bounded-experiment-loop is for one idea under a fixed budget; experiment-orchestration is for campaigns with multiple active branches
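
To make the static-versus-runtime boundary concrete, here is a minimal sketch of the two kinds of check. It is not the code of either skill: the function names are hypothetical, the static check covers syntax only, and the preflight check covers only package availability and a writable output directory.

```python
import ast
import importlib.util
import os
from pathlib import Path


def static_check(source: str) -> list[str]:
    """Static validation: parse the code without executing it."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]
    return []


def preflight_check(packages: list[str], output_dir: Path) -> list[str]:
    """Runtime preflight: inspect the environment the run will launch in.

    `packages` should be top-level import names (no dots).
    """
    problems = []
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    if not output_dir.is_dir() or not os.access(output_dir, os.W_OK):
        problems.append(f"output path not writable: {output_dir}")
    return problems
```

The first function needs only the source text, while the second needs the actual machine, which is why the two concerns are kept in separate skills.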

Shared artifact conventions:

outputs/<topic-slug>/plan/ for planning JSON or Markdown artifacts
outputs/<topic-slug>/runs/ for raw experiment runs, logs, and checkpoints
outputs/<topic-slug>/analysis/ for summaries, significance tests, and error analyses
outputs/<topic-slug>/figures/ and outputs/<topic-slug>/tables/ for generated assets
outputs/<topic-slug>/report/ for manuscript snippets, traceability checks, and release-ready deliverables
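
A minimal sketch of creating this layout for a new topic follows. The helper name `init_topic_outputs` and its `root` parameter are hypothetical and not part of the repository; only the subdirectory names come from the conventions above.

```python
from pathlib import Path

# Subdirectories named in the shared artifact conventions.
SUBDIRS = ("plan", "runs", "analysis", "figures", "tables", "report")


def init_topic_outputs(topic_slug: str, root: Path = Path("outputs")) -> dict[str, Path]:
    """Create outputs/<topic-slug>/<subdir>/ for every subdirectory and return the paths."""
    base = root / topic_slug
    paths = {}
    for name in SUBDIRS:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        paths[name] = path
    return paths
```

Calling `init_topic_outputs("topic-x")` from the repository root would create `outputs/topic-x/plan/` through `outputs/topic-x/report/`.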

Repository Layout

skills/
  research-planning/
  hypothesis-discrimination/
  atomic-decomposition/
  algorithm-design/
  math-reasoning/
  dataset-intake-profiling/
  statistical-analysis-plan/
  critical-experiment-review/
  experiment-memory-evolution/
  experiment-orchestration/
  experiment-template-design/
  experiment-design/
  resource-aware-experiment-design/
  execution-preflight/
  bounded-experiment-loop/
  experiment-stage-reflection/
  modular-training-stack/
  experiment-code/
  experiment-code-validation/
  code-debugging/
  experiment-repair-loop/
  experiment-output-contract/
  llm-data-pipeline/
  llm-fine-tuning/
  llm-post-training/
  distributed-training/
  llm-benchmarking/
  experiment-tracking/
  data-analysis/
  evidence-sufficiency-gate/
  figure-generation/
  table-generation/
  backward-traceability/

Usage

Command examples assume you run them from the repository root.

python skills/experiment-design/scripts/design_experiments.py \
  --method "contrastive learning" \
  --task classification \
  --format markdown

python skills/data-analysis/scripts/stat_summary.py \
  --input outputs/topic-x/runs/results.csv \
  --compare method \
  --metric accuracy \
  --output outputs/topic-x/analysis/summary.json

python skills/table-generation/scripts/results_to_table.py \
  --input outputs/topic-x/analysis/main_results.json \
  --type comparison \
  --caption "Main results on benchmark datasets" \
  --label tab:main-results
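
For readers who want to see the kind of per-group summary the second command asks for, here is a stand-in sketch using only the standard library. It is not the bundled `stat_summary.py`, whose actual behavior and output format are not described here. The sketch assumes the results CSV has a `method` column and an `accuracy` column, inferred from the `--compare` and `--metric` flags.

```python
import csv
import statistics
from collections import defaultdict
from pathlib import Path


def load_rows(path: Path) -> list[dict]:
    """Read a CSV file with a header row into a list of dicts."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def summarize(rows: list[dict], compare: str = "method",
              metric: str = "accuracy") -> dict[str, dict]:
    """Group rows by `compare` and report n, mean, and sample stdev of `metric`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[compare]].append(float(row[metric]))
    return {
        name: {
            "n": len(values),
            "mean": statistics.fmean(values),
            # Sample stdev needs at least two values; report 0.0 otherwise.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for name, values in groups.items()
    }
```

A call such as `summarize(load_rows(Path("outputs/topic-x/runs/results.csv")))` would return one entry per method under those assumptions.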

Recommended environment:

Python 3.10+
pip install -r requirements-optional.txt for the bundled analysis and plotting scripts
Optional: numpy, scipy, pandas, scikit-learn for data-analysis
Optional: matplotlib, seaborn for figure-generation

Generated artifacts should live under outputs/<topic-slug>/ so the skill directories remain clean and reusable.

Curation Rules

A skill must directly support experiment planning, implementation, debugging, validation, or result reporting
Skills for literature discovery, citation work, paper formatting, slide generation, or general repository mining should not live here
Prefer skills that are scriptable, reusable, and auditable
The normalized authoritative version is the one under skills/

Provenance

The initial curated skill set in this repository was refactored from the local source snapshot agent-research-skills/.

An additional implementation-focused layer was later distilled from AI-research-SKILLs/, but rewritten here as repository-native generic skills instead of mirroring upstream framework names one-to-one.

Only the experiment-related skills are retained. Overlapping capabilities outside the experiment workflow are intentionally excluded instead of copied one-to-one. Examples of excluded scope:

literature-search, literature-review, and deep-research: literature discovery belongs in a literature-focused repository
citation-management, paper-compilation, and latex-formatting: paper-production utilities are outside this repository's scope
slide-generation and github-research: presentation and repo-mining workflows are not part of the core experiment pipeline

Retained scope, by upstream source:

framework-specific knowledge from AI-research-SKILLs is curated here only when it supports experiment execution directly, for example fine-tuning, distributed training, benchmarking, or experiment tracking
ai-scientist contributes template-based experimentation patterns here, especially baseline-first templates, bounded run loops, and standardized output contracts
AutoResearchClaw contributes the control-plane side of experiment execution here: falsifiable hypothesis shaping, resource-aware scoping, pre-execution validation, and structured repair loops
claude-scientific-skills contributes experiment-rigor auditing, pre-registered statistical planning, dataset intake profiling, and modular training architecture patterns, distilled from scientific-critical-thinking, statistical-analysis, exploratory-data-analysis, and pytorch-lightning
EvoScientist contributes reusable experiment operations patterns here: execution preflight, post-stage reflection, durable experiment memory, and explicit evidence-sufficiency gates distilled from its experimental agent prompts and sub-agent contracts