RE-experiment

April 1, 2026

An authoritative skills repository for experiment design, implementation, validation, and analysis in Research-Equality.

This repository focuses on the empirical workflow: planning research programs, formalizing methods, designing experiments, implementing experiment code, debugging failures, analyzing results, generating figures and tables, and tracing reported numbers back to executed code. It is not a general end-to-end academic research toolkit. It is a curated home for skills directly related to experimental design and implementation.

Positioning

  • Keep only skills related to experiment planning, method formalization, implementation, debugging, validation, and result analysis
  • Normalize everything into a portable repository layout that does not depend on any one author's local machine
  • Serve as the authoritative home for future experiment-oriented skills in Research-Equality

Included Skills

The skill collection lives under skills/.

Planning and Formalization

  • research-planning: turn a topic or method idea into an executable research and experiment roadmap
  • hypothesis-discrimination: turn observations into falsifiable and competing hypotheses
  • atomic-decomposition: break complex methods into atomic components with math-code mappings
  • algorithm-design: generate pseudocode and system diagrams before implementation
  • math-reasoning: support derivations, proof sketches, notation discipline, and statistical-test selection
  • dataset-intake-profiling: profile dataset structure, coverage, imbalance, and leakage risk before finalizing the plan
  • statistical-analysis-plan: pre-specify estimands, tests, power assumptions, and reporting rules
  • critical-experiment-review: audit design validity, baseline fairness, confounds, and claim scope before trusting a result
  • experiment-memory-evolution: distill repeated outcomes into reusable strategy memory and failed-direction memory

Experiment Execution

  • experiment-orchestration: manage multi-hypothesis experiment campaigns and state across many runs
  • experiment-template-design: build reusable experiment sandboxes with stable entrypoints and seed ideas
  • experiment-design: define staged experiment plans, baselines, datasets, metrics, and ablations
  • resource-aware-experiment-design: scope conditions against real hardware and runtime budgets
  • execution-preflight: verify hardware, packages, commands, output paths, and smoke-test strategy before long runs
  • bounded-experiment-loop: execute a small run budget against a fixed baseline and command contract
  • experiment-stage-reflection: revise the remaining plan after a meaningful stage using metrics, artifacts, and unmet success signals
  • modular-training-stack: structure training code into model, data, trainer, callback, and logging layers
  • experiment-code: implement and iteratively improve training and evaluation pipelines
  • experiment-code-validation: catch syntax, policy, import, and fake-metric issues before execution
  • code-debugging: diagnose runtime, logic, and output failures in experiment code
  • experiment-repair-loop: classify deficient runs and generate targeted repairs beyond basic debugging
  • experiment-output-contract: standardize run_i/ outputs, summaries, plots, and notes for handoff

LLM Implementation Stack

  • llm-data-pipeline: build scalable ingestion, curation, deduplication, and sharding pipelines
  • llm-fine-tuning: route LoRA- and QLoRA-style supervised fine-tuning to the right stack
  • llm-post-training: handle SFT-, DPO-, PPO-, and GRPO-style post-training workflows
  • distributed-training: choose between Accelerate, FSDP2, DeepSpeed, and Ray Train
  • llm-benchmarking: benchmark language and code models with standard harnesses
  • experiment-tracking: standardize observability based on W&B, MLflow, or TensorBoard

Validation and Reporting

  • data-analysis: compute descriptive statistics, significance tests, and result summaries
  • evidence-sufficiency-gate: decide whether current evidence is strong enough to stop iterating and write up
  • figure-generation: generate scientific plots for curves, ablations, comparisons, and diagnostics
  • table-generation: convert structured results into publication-ready LaTeX tables
  • backward-traceability: link reported numbers back to the exact code output that produced them

See skills/README.md for the skill catalog.
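
As an illustration of the backward-traceability idea listed above, the following is a minimal sketch of tying one reported number to the artifact that produced it. The record fields, the function names `fingerprint` and `trace_record`, and the assumption that results are stored as a flat JSON object are all hypothetical; the actual skill may use a different format.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def trace_record(claim: str, value: float, source: Path, key: str) -> dict:
    """Tie one reported number to the JSON artifact and key it came from.

    Hypothetical schema: `source` is assumed to hold a flat JSON object
    whose `key` entry is the number quoted in the report.
    """
    stored = json.loads(source.read_text())[key]
    if abs(stored - value) > 1e-9:
        # The report and the executed output disagree, so refuse to record it.
        raise ValueError(f"{claim}: reported {value}, artifact has {stored}")
    return {
        "claim": claim,
        "value": value,
        "source": str(source),
        "key": key,
        "sha256": fingerprint(source),
    }
```

Storing the digest alongside the claim means a later edit to the artifact can be detected by recomputing `fingerprint` and comparing.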

Skill Routing

The repository uses a short default path plus optional overlays, so a project does not have to pass through every skill.

Default path for most projects:

  1. research-planning
  2. dataset-intake-profiling if the data contract is still unclear
  3. experiment-design
  4. resource-aware-experiment-design and statistical-analysis-plan
  5. experiment-code
  6. experiment-code-validation
  7. execution-preflight
  8. bounded-experiment-loop
  9. experiment-stage-reflection
  10. data-analysis
  11. critical-experiment-review
  12. evidence-sufficiency-gate
  13. experiment-output-contract, figure-generation, table-generation, backward-traceability

Optional overlays:

  • hypothesis-discrimination, atomic-decomposition, algorithm-design, and math-reasoning only when the method still needs formalization
  • experiment-orchestration and experiment-memory-evolution only for long-running or multi-branch experiment programs
  • experiment-template-design only when packaging a reusable sandbox
  • modular-training-stack only when architecture cleanup is the bottleneck
  • llm-data-pipeline, llm-fine-tuning, llm-post-training, distributed-training, and llm-benchmarking only for LLM-specific implementations
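
The default path and overlays above can be written down as data, which makes the routing easy to inspect or test. This is only a sketch: the function `build_route` and its flag names are hypothetical, and because the list above does not say where an overlay slots into the default path, overlays are returned separately rather than merged.

```python
# Default path, flattened in the order listed above.
DEFAULT_PATH = [
    "research-planning",
    "dataset-intake-profiling",  # conditional: only if the data contract is unclear
    "experiment-design",
    "resource-aware-experiment-design",
    "statistical-analysis-plan",
    "experiment-code",
    "experiment-code-validation",
    "execution-preflight",
    "bounded-experiment-loop",
    "experiment-stage-reflection",
    "data-analysis",
    "critical-experiment-review",
    "evidence-sufficiency-gate",
    "experiment-output-contract",
    "figure-generation",
    "table-generation",
    "backward-traceability",
]

# Overlay groups keyed by hypothetical flag names (not part of the repository).
OVERLAYS = {
    "needs_formalization": [
        "hypothesis-discrimination", "atomic-decomposition",
        "algorithm-design", "math-reasoning",
    ],
    "multi_branch": ["experiment-orchestration", "experiment-memory-evolution"],
    "reusable_sandbox": ["experiment-template-design"],
    "architecture_cleanup": ["modular-training-stack"],
    "llm_specific": [
        "llm-data-pipeline", "llm-fine-tuning", "llm-post-training",
        "distributed-training", "llm-benchmarking",
    ],
}


def build_route(data_contract_unclear: bool = False, **flags: bool):
    """Return (default_path, overlays) for a project described by boolean flags."""
    unknown = set(flags) - set(OVERLAYS)
    if unknown:
        raise KeyError(f"unknown overlay flags: {sorted(unknown)}")
    path = [
        skill for skill in DEFAULT_PATH
        if skill != "dataset-intake-profiling" or data_contract_unclear
    ]
    overlays = [
        skill
        for flag, skills in OVERLAYS.items() if flags.get(flag)
        for skill in skills
    ]
    return path, overlays
```

For example, `build_route(llm_specific=True)` yields the default path without dataset profiling, plus the five LLM-specific overlay skills.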

Key boundaries:

  • experiment-design defines what to test; resource-aware-experiment-design trims it to the real budget
  • experiment-code-validation checks code statically; execution-preflight checks the runtime environment and launch contract
  • code-debugging fixes bugs; experiment-repair-loop repairs broader failed runs with scientific or budget deficiencies
  • critical-experiment-review audits methodological validity; evidence-sufficiency-gate decides whether the evidence is already enough to stop iterating
  • bounded-experiment-loop is for one idea under a fixed budget; experiment-orchestration is for campaigns with multiple active branches
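
To make the static-versus-runtime boundary above concrete, here is a rough sketch of the two kinds of check. It does not reproduce the checks in experiment-code-validation or execution-preflight; the function names and the "no eval/exec" policy are assumptions chosen only to illustrate the split.

```python
import ast
import importlib.util
import shutil


def static_check(source: str) -> list[str]:
    """Static pass: parse the code without executing it."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]
    issues = []
    for node in ast.walk(tree):
        # Hypothetical policy for this sketch: flag direct eval()/exec() calls.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in {"eval", "exec"}
        ):
            issues.append(f"line {node.lineno}: call to {node.func.id}()")
    return issues


def preflight(packages: list[str], commands: list[str]) -> list[str]:
    """Runtime pass: inspect the environment the run would launch in."""
    issues = []
    for name in packages:
        # Top-level module names only; find_spec returns None when absent.
        if importlib.util.find_spec(name) is None:
            issues.append(f"missing package: {name}")
    for cmd in commands:
        if shutil.which(cmd) is None:
            issues.append(f"command not on PATH: {cmd}")
    return issues
```

The first function needs only the source text, so it can run anywhere; the second depends on the machine it runs on, which is why the two concerns are kept in separate skills.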

Shared artifact conventions:

  • outputs/<topic-slug>/plan/ for planning JSON or Markdown artifacts
  • outputs/<topic-slug>/runs/ for raw experiment runs, logs, and checkpoints
  • outputs/<topic-slug>/analysis/ for summaries, significance tests, and error analyses
  • outputs/<topic-slug>/figures/ and outputs/<topic-slug>/tables/ for generated assets
  • outputs/<topic-slug>/report/ for manuscript snippets, traceability checks, and release-ready deliverables
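
The directory conventions above can be scaffolded in a few lines. This is a minimal sketch; the helper name `init_topic` is hypothetical and is not a script shipped with the repository.

```python
from pathlib import Path

# Subdirectories named in the shared artifact conventions.
SUBDIRS = ("plan", "runs", "analysis", "figures", "tables", "report")


def init_topic(topic_slug: str, root: Path = Path("outputs")) -> dict[str, Path]:
    """Create outputs/<topic-slug>/<subdir>/ for each convention and return the paths."""
    base = root / topic_slug
    paths = {name: base / name for name in SUBDIRS}
    for path in paths.values():
        # exist_ok keeps the call idempotent across repeated runs.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `init_topic("topic-x")` from the repository root would create `outputs/topic-x/plan/`, `outputs/topic-x/runs/`, and so on, leaving existing directories untouched.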

Repository Layout

skills/
  research-planning/
  hypothesis-discrimination/
  atomic-decomposition/
  algorithm-design/
  math-reasoning/
  dataset-intake-profiling/
  statistical-analysis-plan/
  critical-experiment-review/
  experiment-memory-evolution/
  experiment-orchestration/
  experiment-template-design/
  experiment-design/
  resource-aware-experiment-design/
  execution-preflight/
  bounded-experiment-loop/
  experiment-stage-reflection/
  modular-training-stack/
  experiment-code/
  experiment-code-validation/
  code-debugging/
  experiment-repair-loop/
  experiment-output-contract/
  llm-data-pipeline/
  llm-fine-tuning/
  llm-post-training/
  distributed-training/
  llm-benchmarking/
  experiment-tracking/
  data-analysis/
  evidence-sufficiency-gate/
  figure-generation/
  table-generation/
  backward-traceability/

Usage

Command examples assume you run them from the repository root.

python skills/experiment-design/scripts/design_experiments.py \
  --method "contrastive learning" \
  --task classification \
  --format markdown

python skills/data-analysis/scripts/stat_summary.py \
  --input outputs/topic-x/runs/results.csv \
  --compare method \
  --metric accuracy \
  --output outputs/topic-x/analysis/summary.json

python skills/table-generation/scripts/results_to_table.py \
  --input outputs/topic-x/analysis/main_results.json \
  --type comparison \
  --caption "Main results on benchmark datasets" \
  --label tab:main-results
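
The input schema for stat_summary.py is not documented here. As a hedged illustration of what a `--compare method --metric accuracy` summary could involve, the sketch below assumes each row of the results file carries a `method` column and an `accuracy` column, and groups one by the other. It uses only the standard library and is not the bundled script's implementation.

```python
import statistics
from collections import defaultdict


def summarize(rows: list[dict], compare: str, metric: str) -> dict:
    """Group rows by `compare` and report n, mean, and sample std of `metric`.

    Assumed row shape (hypothetical): dicts such as those produced by
    csv.DictReader, with string values that parse as floats for `metric`.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[compare]].append(float(row[metric]))
    return {
        name: {
            "n": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two observations.
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for name, values in groups.items()
    }
```

Rows read with `csv.DictReader` can be passed in directly, and the returned dictionary can be written with `json.dump` to a path under `outputs/<topic-slug>/analysis/`.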

Recommended environment:

  • Python 3.10+
  • pip install -r requirements-optional.txt for the bundled analysis and plotting scripts
  • Optional: numpy, scipy, pandas, scikit-learn for data-analysis
  • Optional: matplotlib, seaborn for figure-generation

Generated artifacts should live under outputs/<topic-slug>/ so the skill directories remain clean and reusable.

Curation Rules

  • A skill must directly support experiment planning, implementation, debugging, validation, or result reporting
  • Skills for literature discovery, citation work, paper formatting, slide generation, or general repository mining should not live here
  • Prefer skills that are scriptable, reusable, and auditable
  • The normalized authoritative version is the one under skills/

Provenance

The initial curated skill set in this repository was refactored from the local source snapshot agent-research-skills/.

An additional implementation-focused layer was later distilled from AI-research-SKILLs/, but rewritten here as repository-native generic skills instead of mirroring upstream framework names one-to-one.

Only the experiment-related skills are retained. Overlapping capabilities outside the experiment workflow are intentionally excluded instead of copied one-to-one. Examples of excluded scope:

  • literature-search, literature-review, and deep-research: literature discovery belongs in a literature-focused repository
  • citation-management, paper-compilation, and latex-formatting: paper-production utilities are outside this repository's scope
  • slide-generation and github-research: presentation and repo-mining workflows are not part of the core experiment pipeline

Retained scope, by upstream source:

  • AI-research-SKILLs contributes framework-specific knowledge only when it directly supports experiment execution, for example fine-tuning, distributed training, benchmarking, or experiment tracking
  • ai-scientist contributes template-based experimentation patterns here, especially baseline-first templates, bounded run loops, and standardized output contracts
  • AutoResearchClaw contributes the control-plane side of experiment execution here: falsifiable hypothesis shaping, resource-aware scoping, pre-execution validation, and structured repair loops
  • claude-scientific-skills contributes experiment-rigor auditing, pre-registered statistical planning, dataset intake profiling, and modular training architecture patterns distilled from scientific-critical-thinking, statistical-analysis, exploratory-data-analysis, and pytorch-lightning
  • EvoScientist contributes reusable experiment operations patterns here: execution preflight, post-stage reflection, durable experiment memory, and explicit evidence-sufficiency gates distilled from its experimental agent prompts and sub-agent contracts