RE-experiment
April 1, 2026
An authoritative skills repository for experiment design, implementation, validation, and analysis in Research-Equality.
This repository focuses on the empirical workflow: planning research programs, formalizing methods, designing experiments, implementing experiment code, debugging failures, analyzing results, generating figures and tables, and tracing reported numbers back to executed code. It is not a general end-to-end academic research toolkit. It is a curated home for skills directly related to experimental design and implementation.
Positioning
- Keep only skills related to experiment planning, method formalization, implementation, debugging, validation, and result analysis
- Normalize everything into a portable repository layout without depending on one author's local machine
- Serve as the authoritative home for future experiment-oriented skills in Research-Equality
Included Skills
The skill collection lives under skills/.
Planning and Formalization
- research-planning: turn a topic or method idea into an executable research and experiment roadmap
- hypothesis-discrimination: turn observations into falsifiable and competing hypotheses
- atomic-decomposition: break complex methods into atomic components with math-code mappings
- algorithm-design: generate pseudocode and system diagrams before implementation
- math-reasoning: support derivations, proof sketches, notation discipline, and statistical-test selection
- dataset-intake-profiling: profile dataset structure, coverage, imbalance, and leakage risk before finalizing the plan
- statistical-analysis-plan: pre-specify estimands, tests, power assumptions, and reporting rules
- critical-experiment-review: audit design validity, baseline fairness, confounds, and claim scope before trusting a result
- experiment-memory-evolution: distill repeated outcomes into reusable strategy memory and failed-direction memory
Experiment Execution
- experiment-orchestration: manage multi-hypothesis experiment campaigns and state across many runs
- experiment-template-design: build reusable experiment sandboxes with stable entrypoints and seed ideas
- experiment-design: define staged experiment plans, baselines, datasets, metrics, and ablations
- resource-aware-experiment-design: scope conditions against real hardware and runtime budgets
- execution-preflight: verify hardware, packages, commands, output paths, and smoke-test strategy before long runs
- bounded-experiment-loop: execute a small run budget against a fixed baseline and command contract
- experiment-stage-reflection: revise the remaining plan after a meaningful stage using metrics, artifacts, and unmet success signals
- modular-training-stack: structure training code into model, data, trainer, callback, and logging layers
- experiment-code: implement and iteratively improve training and evaluation pipelines
- experiment-code-validation: catch syntax, policy, import, and fake-metric issues before execution
- code-debugging: diagnose runtime, logic, and output failures in experiment code
- experiment-repair-loop: classify deficient runs and generate targeted repairs beyond basic debugging
- experiment-output-contract: standardize run_i/ outputs, summaries, plots, and notes for handoff
LLM Implementation Stack
- llm-data-pipeline: build scalable ingestion, curation, deduplication, and sharding pipelines
- llm-fine-tuning: route LoRA and QLoRA style supervised fine-tuning to the right stack
- llm-post-training: handle SFT, DPO, PPO, and GRPO style post-training workflows
- distributed-training: choose between Accelerate, FSDP2, DeepSpeed, and Ray Train
- llm-benchmarking: benchmark language and code models with standard harnesses
- experiment-tracking: standardize W&B, MLflow, or TensorBoard based observability
Validation and Reporting
- data-analysis: compute descriptive statistics, significance tests, and result summaries
- evidence-sufficiency-gate: decide whether current evidence is strong enough to stop iterating and write up
- figure-generation: generate scientific plots for curves, ablations, comparisons, and diagnostics
- table-generation: convert structured results into publication-ready LaTeX tables
- backward-traceability: link reported numbers back to the exact code output that produced them
See skills/README.md for the skill catalog.
Skill Routing
The repository now uses a shorter default path plus optional overlays, so users do not have to route every project through all skills.
Default path for most projects:
1. research-planning
2. dataset-intake-profiling if the data contract is still unclear
3. experiment-design
4. resource-aware-experiment-design and statistical-analysis-plan
5. experiment-code
6. experiment-code-validation
7. execution-preflight
8. bounded-experiment-loop
9. experiment-stage-reflection
10. data-analysis
11. critical-experiment-review
12. evidence-sufficiency-gate
13. experiment-output-contract, figure-generation, table-generation, backward-traceability
Optional overlays:
- hypothesis-discrimination, atomic-decomposition, algorithm-design, and math-reasoning only when the method still needs formalization
- experiment-orchestration and experiment-memory-evolution only for long-running or multi-branch experiment programs
- experiment-template-design only when packaging a reusable sandbox
- modular-training-stack only when architecture cleanup is the bottleneck
- llm-data-pipeline, llm-fine-tuning, llm-post-training, distributed-training, and llm-benchmarking only for LLM-specific implementations
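The default path and overlays can be thought of as a small routing table. The sketch below is illustrative only: `route_skills` and the flag names (`needs_formalization`, `multi_branch`, and so on) are hypothetical and not part of the repository; only the skill names come from the lists above.

```python
# Hypothetical routing sketch; function and flag names are assumptions.
DEFAULT_PATH = [
    "research-planning",
    "dataset-intake-profiling",  # skipped when the data contract is already clear
    "experiment-design",
    "resource-aware-experiment-design",
    "statistical-analysis-plan",
    "experiment-code",
    "experiment-code-validation",
    "execution-preflight",
    "bounded-experiment-loop",
    "experiment-stage-reflection",
    "data-analysis",
    "critical-experiment-review",
    "evidence-sufficiency-gate",
    "experiment-output-contract",
    "figure-generation",
    "table-generation",
    "backward-traceability",
]

OVERLAYS = {
    "needs_formalization": [
        "hypothesis-discrimination",
        "atomic-decomposition",
        "algorithm-design",
        "math-reasoning",
    ],
    "multi_branch": ["experiment-orchestration", "experiment-memory-evolution"],
    "reusable_sandbox": ["experiment-template-design"],
    "architecture_cleanup": ["modular-training-stack"],
    "llm_specific": [
        "llm-data-pipeline",
        "llm-fine-tuning",
        "llm-post-training",
        "distributed-training",
        "llm-benchmarking",
    ],
}


def route_skills(data_contract_clear=False, **flags):
    """Return (default_path, overlays) for a project described by boolean flags."""
    unknown = set(flags) - set(OVERLAYS)
    if unknown:
        raise ValueError(f"unknown overlay flags: {sorted(unknown)}")
    path = [
        skill
        for skill in DEFAULT_PATH
        if not (skill == "dataset-intake-profiling" and data_contract_clear)
    ]
    overlays = []
    for flag, skills in OVERLAYS.items():
        if flags.get(flag):
            overlays.extend(skills)
    return path, overlays
```

For example, `route_skills(data_contract_clear=True, llm_specific=True)` drops the dataset profiling step and adds the five LLM overlays.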
Key boundaries:
- experiment-design defines what to test; resource-aware-experiment-design trims it to the real budget
- experiment-code-validation checks code statically; execution-preflight checks the runtime environment and launch contract
- code-debugging fixes bugs; experiment-repair-loop repairs broader failed runs with scientific or budget deficiencies
- critical-experiment-review audits methodological validity; evidence-sufficiency-gate decides whether the evidence is already enough to stop iterating
- bounded-experiment-loop is for one idea under a fixed budget; experiment-orchestration is for campaigns with multiple active branches
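To make the static-versus-runtime boundary concrete, here is a minimal sketch of the two kinds of check. It is not the implementation of either skill: `static_check` and `preflight_check` are hypothetical helpers, and they cover only a small subset (syntax on one side, command availability and output-path writability on the other) of what the skills describe.

```python
import ast
import os
import shutil


def static_check(source):
    """Static side: parse the code without executing it."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]
    return []


def preflight_check(command, output_dir):
    """Runtime side: inspect the environment the run would launch in."""
    problems = []
    if shutil.which(command) is None:
        problems.append(f"command not found on PATH: {command}")
    if not os.path.isdir(output_dir) or not os.access(output_dir, os.W_OK):
        problems.append(f"output directory missing or not writable: {output_dir}")
    return problems
```

The first function never touches the machine state, while the second never looks at the code, which is the separation the boundary above describes.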
Shared artifact conventions:
- outputs/<topic-slug>/plan/ for planning JSON or Markdown artifacts
- outputs/<topic-slug>/runs/ for raw experiment runs, logs, and checkpoints
- outputs/<topic-slug>/analysis/ for summaries, significance tests, and error analyses
- outputs/<topic-slug>/figures/ and outputs/<topic-slug>/tables/ for generated assets
- outputs/<topic-slug>/report/ for manuscript snippets, traceability checks, and release-ready deliverables
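The conventions above amount to a fixed set of subdirectories per topic. A minimal sketch of creating that layout follows; `init_topic_outputs` is a hypothetical helper, not a script shipped in this repository.

```python
from pathlib import Path

# Subdirectory names taken from the shared artifact conventions.
ARTIFACT_SUBDIRS = ("plan", "runs", "analysis", "figures", "tables", "report")


def init_topic_outputs(topic_slug, root="outputs"):
    """Create <root>/<topic_slug>/<subdir>/ for each convention; return the paths."""
    base = Path(root) / topic_slug
    paths = {}
    for name in ARTIFACT_SUBDIRS:
        path = base / name
        path.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
        paths[name] = path
    return paths
```

Calling `init_topic_outputs("topic-x")` from the repository root would create `outputs/topic-x/plan/`, `outputs/topic-x/runs/`, and the remaining directories if they do not already exist.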
Repository Layout
skills/
  research-planning/
  hypothesis-discrimination/
  atomic-decomposition/
  algorithm-design/
  math-reasoning/
  dataset-intake-profiling/
  statistical-analysis-plan/
  critical-experiment-review/
  experiment-memory-evolution/
  experiment-orchestration/
  experiment-template-design/
  experiment-design/
  resource-aware-experiment-design/
  execution-preflight/
  bounded-experiment-loop/
  experiment-stage-reflection/
  modular-training-stack/
  experiment-code/
  experiment-code-validation/
  code-debugging/
  experiment-repair-loop/
  experiment-output-contract/
  llm-data-pipeline/
  llm-fine-tuning/
  llm-post-training/
  distributed-training/
  llm-benchmarking/
  experiment-tracking/
  data-analysis/
  evidence-sufficiency-gate/
  figure-generation/
  table-generation/
  backward-traceability/
Usage
Command examples assume you run them from the repository root.
python skills/experiment-design/scripts/design_experiments.py \
--method "contrastive learning" \
--task classification \
--format markdown
python skills/data-analysis/scripts/stat_summary.py \
--input outputs/topic-x/runs/results.csv \
--compare method \
--metric accuracy \
--output outputs/topic-x/analysis/summary.json
python skills/table-generation/scripts/results_to_table.py \
--input outputs/topic-x/analysis/main_results.json \
--type comparison \
--caption "Main results on benchmark datasets" \
--label tab:main-results
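The `--compare method --metric accuracy` flags suggest a results file with one row per run, a grouping column, and a metric column. Under that assumption, the sketch below shows the kind of per-group summary such a step might produce. It is not the code of stat_summary.py; the column layout, the `summarize_by_group` helper, and the sample rows (synthetic placeholders, not real results) are all illustrative.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_by_group(rows, group_key, metric_key):
    """Group rows by group_key and report n, mean, and sample stdev of metric_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(float(row[metric_key]))
    summary = {}
    for name, values in groups.items():
        summary[name] = {
            "n": len(values),
            "mean": statistics.mean(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Synthetic placeholder rows, assuming columns named "method" and "accuracy".
SAMPLE_CSV = """method,seed,accuracy
baseline,0,0.80
baseline,1,0.82
proposed,0,0.85
proposed,1,0.87
"""

summary = summarize_by_group(
    csv.DictReader(io.StringIO(SAMPLE_CSV)), "method", "accuracy"
)
```

Keeping the grouping and metric columns as parameters mirrors the command-line flags, so the same function would work for any other column pair in the file.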
Recommended environment:
- Python 3.10+
- pip install -r requirements-optional.txt for the bundled analysis and plotting scripts
- Optional: numpy, scipy, pandas, scikit-learn for data-analysis
- Optional: matplotlib, seaborn for figure-generation
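A quick way to see which of these recommendations the current interpreter satisfies is sketched below. `check_environment` is a hypothetical helper, not a repository script; the only assumption beyond the list above is that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util
import sys

# Import names for the optional packages, grouped by the skill that uses them.
OPTIONAL_PACKAGES = {
    "data-analysis": ["numpy", "scipy", "pandas", "sklearn"],  # scikit-learn -> sklearn
    "figure-generation": ["matplotlib", "seaborn"],
}


def check_environment(min_version=(3, 10)):
    """Report whether Python is new enough and which optional modules are missing."""
    report = {"python_ok": sys.version_info >= min_version, "missing": {}}
    for skill, modules in OPTIONAL_PACKAGES.items():
        # find_spec returns None when a top-level module cannot be located.
        missing = [m for m in modules if importlib.util.find_spec(m) is None]
        if missing:
            report["missing"][skill] = missing
    return report


if __name__ == "__main__":
    print(check_environment())
```

The check only locates modules and does not import them, so it stays fast even when the heavier packages are installed.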
Generated artifacts should live under outputs/<topic-slug>/ so the skill directories remain clean and reusable.
Curation Rules
- A skill must directly support experiment planning, implementation, debugging, validation, or result reporting
- Skills for literature discovery, citation work, paper formatting, slide generation, or general repository mining should not live here
- Prefer skills that are scriptable, reusable, and auditable
- The normalized authoritative version is the one under skills/
Provenance
The initial curated skill set in this repository was refactored from the local source snapshot agent-research-skills/.
An additional implementation-focused layer was later distilled from AI-research-SKILLs/, but rewritten here as repository-native generic skills instead of mirroring upstream framework names one-to-one.
Only the experiment-related skills are retained. Overlapping capabilities outside the experiment workflow are intentionally excluded instead of copied one-to-one. Examples of excluded scope:
- literature-search, literature-review, and deep-research: literature discovery belongs in a literature-focused repository
- citation-management, paper-compilation, and latex-formatting: paper-production utilities are outside this repository's scope
- slide-generation and github-research: presentation and repo-mining workflows are not part of the core experiment pipeline
- framework-specific knowledge from AI-research-SKILLs is curated here only when it supports experiment execution directly, for example fine-tuning, distributed training, benchmarking, or experiment tracking
- ai-scientist contributes template-based experimentation patterns here, especially baseline-first templates, bounded run loops, and standardized output contracts
- AutoResearchClaw contributes the control-plane side of experiment execution here: falsifiable hypothesis shaping, resource-aware scoping, pre-execution validation, and structured repair loops
- claude-scientific-skills contributes experiment-rigor auditing, pre-registered statistical planning, dataset intake profiling, and modular training architecture patterns distilled from scientific-critical-thinking, statistical-analysis, exploratory-data-analysis, and pytorch-lightning
- EvoScientist contributes reusable experiment operations patterns here: execution preflight, post-stage reflection, durable experiment memory, and explicit evidence-sufficiency gates distilled from its experimental agent prompts and sub-agent contracts