A/B Testing Experiments in Agentic Workflows
June 13, 2026
How Experiments Work
Per run:
- Restore: activation job loads experiment state from configured storage (git branch by default, or Actions cache).
- Pick: `pick_experiment.cjs` picks the variant with the lowest invocation count (ties broken by array order).
- Save: updated counter written back.
- Upload: state uploaded as workflow artifact `experiment` (30-day retention).
- Inject: variant available as `${{ experiments.<name> }}` and in `{{#if experiments.<name> }}` blocks.
Key properties:
- Every run gets one variant per experiment; no sampling.
- Assignment persists across runs automatically.
- Multiple experiments run simultaneously, each independently balanced.
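The Pick and Save steps can be sketched in a few lines of Python. This is only an illustration of the lowest-count rule described above, not the contents of `pick_experiment.cjs`; the function names and the in-memory `counts` dictionary are hypothetical stand-ins for the stored state.

```python
def pick_variant(variants, counts):
    """Return the variant with the lowest invocation count.

    min() returns the first minimal element, so ties go to the
    variant that appears earliest in the array.
    """
    return min(variants, key=lambda v: counts.get(v, 0))


def run_once(variants, counts):
    """Pick a variant, then record the invocation (the Save step)."""
    chosen = pick_variant(variants, counts)
    counts[chosen] = counts.get(chosen, 0) + 1
    return chosen


counts = {}  # hypothetical in-memory stand-in for the persisted state
history = [run_once(["concise", "detailed"], counts) for _ in range(4)]
print(history)  # ['concise', 'detailed', 'concise', 'detailed']
```

Because the counter is updated after every pick, the lowest-count rule behaves like round-robin and keeps the variants balanced without any random sampling.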
Basic Syntax
---
on:
  schedule: daily on weekdays
engine: copilot
experiments:
  prompt_style: [concise, detailed]
---
{{#if experiments.prompt_style == "concise" }}
Summarise the findings in ≤ 5 bullets.
{{#else}}
Provide a detailed analysis with reasoning for each finding.
{{#endif}}
Naming Rules
- Names must match `[a-zA-Z_][a-zA-Z0-9_]*`. Use `lowercase_with_underscores`.
- Non-matching names are silently skipped at compile time.
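Since invalid names are skipped silently, it may be worth checking them before compiling. A minimal sketch, assuming the pattern above must match the whole name; `is_valid_experiment_name` is a hypothetical helper, not part of the tooling.

```python
import re

# Pattern quoted from the naming rule above.
NAME_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_experiment_name(name: str) -> bool:
    """True when the whole name matches the documented pattern."""
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["prompt_style", "_draft", "2nd_try", "prompt-style"]:
    print(candidate, is_valid_experiment_name(candidate))
```

The last two candidates fail: one starts with a digit, the other contains a hyphen.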
Variant Rules
- At least 2 variants required.
- Plain strings, lowercase descriptive (`concise`, `detailed`, `step_by_step`).
- ~10 variants practical max: the number of runs needed to reach a usable sample per variant grows fast beyond that.
Object Form (Weighted Variants and Date Gating)
Object form supports non-uniform weights, date gating, and governance metadata:
experiments:
  prompt_style:
    variants: [concise, detailed, step_by_step]
    weight: [2, 1, 1]  # 50% concise, 25% detailed, 25% step_by_step
    description: "Verbosity A/B test"
    metric: "ai_credits"
    hypothesis: "H0: no change in ai_credits. H1: concise reduces by >=15%"
    guardrail_metrics:
      - name: success_rate
        threshold: ">=0.95"
      - name: empty_output_rate
        direction: min
        threshold: 0.0
    issue: "42"
    start_date: "2026-05-01"
    end_date: "2026-06-01"
Fields:
- `variants:` array of variant strings (required, ≥ 2 entries).
- `weight:` non-negative integers, same length as `variants`. Enables weighted-random selection. `[2, 1, 1]` = 50/25/25. All zeros → always returns control (first variant). Omit for round-robin.
- `start_date:` / `end_date:` ISO-8601 `YYYY-MM-DD`. Outside this window, the control variant is returned and counters do not increment.
- `description:`, `metric:`, `issue:`, `hypothesis:` governance metadata (no runtime effect).
- `guardrail_metrics:` array; if any guardrail fails for any variant, the experiment is auto-abandoned. Each entry:
  - `name` (required): metric identifier.
  - `threshold` (required): comparison string (`">=0.95"`, `"==0"`) or bare number paired with `direction`.
  - `direction` (optional, `"min"` / `"max"`): lower-better vs higher-better. With a bare numeric `threshold`: `min` → metric ≤ threshold; `max` → metric ≥ threshold.

Bare-array and object forms can be mixed in the same `experiments:` map.
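The weight and date-gating rules can be read as the following Python sketch. It is an approximation of the documented behaviour rather than the actual selection code: the function names are hypothetical, and treating `end_date` as inclusive is an assumption.

```python
import random
from datetime import date


def in_window(today, start_date=None, end_date=None):
    """True when today falls inside the optional ISO-8601 date window."""
    if start_date and today < date.fromisoformat(start_date):
        return False
    if end_date and today > date.fromisoformat(end_date):  # assumed inclusive
        return False
    return True


def pick_weighted(variants, weight, today, start_date=None, end_date=None, rng=random):
    """Weighted-random pick; falls back to the control (first) variant."""
    if not in_window(today, start_date, end_date):
        return variants[0]  # outside the window: control, counters untouched
    if sum(weight) == 0:
        return variants[0]  # all-zero weights: control
    return rng.choices(variants, weights=weight, k=1)[0]


variants = ["concise", "detailed", "step_by_step"]
weight = [2, 1, 1]
shares = [w / sum(weight) for w in weight]
print(shares)  # [0.5, 0.25, 0.25]
print(pick_weighted(variants, weight, date(2026, 7, 1), "2026-05-01", "2026-06-01"))  # concise
```

Each variant's expected share is its weight divided by the sum of all weights, which is how `[2, 1, 1]` becomes 50/25/25.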
Storage Configuration
experiments:
  storage: repo  # or: cache
  prompt_style: [concise, detailed]
| Value | Behaviour | When to use |
|---|---|---|
| `repo` (default) | Commits `state.json` to branch `experiments/{sanitizedWorkflowID}` (hyphens stripped, e.g. `my-workflow` → `experiments/myworkflow`). Adds a `push_experiments_state` job; needs `contents: write`. Durable. | Recommended for all experiments. |
| `cache` | GitHub Actions cache. No extra job/permission. May evict after 7 days of inactivity. | Use only when `contents: write` cannot be granted. |

The branch is created automatically on first run as an orphan branch containing `state.json` and `assignments.json`.
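To find the state branch for a given workflow, the naming rule from the table can be approximated as below. Only hyphen stripping is documented; `experiments_branch` is a hypothetical helper, and any additional sanitisation the compiler performs is not modelled.

```python
def experiments_branch(workflow_id: str) -> str:
    """Build the state branch name by stripping hyphens from the workflow ID.

    Only hyphen stripping is documented; other sanitisation rules,
    if any, are deliberately left out of this sketch.
    """
    return "experiments/" + workflow_id.replace("-", "")


print(experiments_branch("my-workflow"))  # experiments/myworkflow
```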
Referencing the Active Variant
Two forms, both resolved before the agent sees the prompt:
1 — Conditional blocks (most common)
{{#if experiments.tone == "formal" }}
Use formal, professional language throughout the report.
{{#else}}
Use a friendly, conversational tone.
{{#endif}}
2 — Direct interpolation
Use `${{ experiments.tone }}` tone when writing the issue body.
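As a mental model only, the effect of both forms on the final prompt can be imitated with a small Python renderer. This is not how the compiler resolves references (it uses its own internal expansion); the regular expressions and the `render` function are hypothetical and handle only non-nested blocks that compare against a quoted value.

```python
import re

IF_BLOCK = re.compile(
    r'\{\{#if\s+experiments\.(\w+)\s*==\s*"([^"]*)"\s*\}\}\n?'
    r'(.*?)'
    r'(?:\{\{#else\}\}\n?(.*?))?'
    r'\{\{#endif\}\}\n?',
    re.DOTALL,
)
INTERPOLATION = re.compile(r'\$\{\{\s*experiments\.(\w+)\s*\}\}')


def render(prompt: str, active: dict) -> str:
    """Resolve non-nested conditional blocks, then direct interpolations."""

    def choose(match):
        name, expected, if_body, else_body = match.groups()
        # Keep the if-branch on a match, otherwise the else-branch (or nothing).
        return if_body if active.get(name) == expected else (else_body or "")

    prompt = IF_BLOCK.sub(choose, prompt)
    return INTERPOLATION.sub(lambda m: active[m.group(1)], prompt)


template = (
    '{{#if experiments.tone == "formal" }}\n'
    'Use formal, professional language throughout the report.\n'
    '{{#else}}\n'
    'Use a friendly, conversational tone.\n'
    '{{#endif}}\n'
    'Use ${{ experiments.tone }} tone when writing the issue body.\n'
)
print(render(template, {"tone": "casual"}))
```

With `tone` set to `casual`, only the else-branch survives and the interpolation becomes the literal word `casual`, so the agent never sees the template syntax.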
Designing a Good Experiment
- One dimension per experiment.
- Falsifiable hypothesis.
- Primary metric measurable from workflow run data (artifacts, outputs, duration, tokens).
- Guardrail metrics: things that must not degrade. Use `direction: min` + bare number for lower-is-better rates, or `">=0.95"` for higher-is-better.
- Sample size estimate per variant.
Prefer high-frequency workflows for faster significance.
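The two guardrail threshold styles can be checked locally with a sketch like the one below, which follows the semantics given under Fields. The helper name is hypothetical, and support for operators other than `>=` and `==` is an assumption, since only those two appear in the documented examples.

```python
import operator
import re

COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,  # assumed; not shown in the documented examples
    "==": operator.eq,
    ">": operator.gt,   # assumed
    "<": operator.lt,   # assumed
}
COMPARISON = re.compile(r"\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*")


def guardrail_passes(value, threshold, direction=None):
    """Check one observed metric value against one guardrail entry."""
    if isinstance(threshold, str):
        match = COMPARISON.fullmatch(threshold)
        if match is None:
            raise ValueError(f"unsupported threshold: {threshold!r}")
        op, number = match.groups()
        return COMPARATORS[op](value, float(number))
    if direction == "min":  # lower is better: metric must stay <= threshold
        return value <= threshold
    if direction == "max":  # higher is better: metric must stay >= threshold
        return value >= threshold
    raise ValueError("a bare numeric threshold needs direction 'min' or 'max'")


print(guardrail_passes(0.97, ">=0.95"))                # True
print(guardrail_passes(0.02, 0.0, direction="min"))    # False
```

The second call fails because an `empty_output_rate` of 0.02 exceeds the lower-is-better threshold of 0.0.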
Dimensions Worth Experimenting On
Prompt Design
experiments:
  prompt_style: [concise, detailed]
  reasoning_depth: [shallow, deep]
  output_format: [bullets, prose, table]
  tone: [formal, casual]
Use `{{#if experiments.prompt_style == "concise" }}` blocks to swap prompt instructions. Always compare against a specific variant value.
⚠️ Never write the internal env-var form `__GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed`. The compiler expands `experiments.<name>` references automatically.
Typical metrics: output quality, AI credits, success rate, output length.
Engine & Model
experiments:
  engine_variant: [copilot, claude]
⚠️ Engine experiments require separate compiled files: the `engine:` key cannot be switched mid-run from a single file. Use two parallel workflow files and compare run metrics.
Typical metrics: run cost (tokens), duration, completion rate, error rate.
Tool Configuration
experiments:
  tool_scope: [narrow, broad]
{{#if experiments.tool_scope == "narrow" }}
Only use the `issues` and `pull_requests` toolsets.
{{#else}}
Use any available GitHub MCP tools.
{{#endif}}
Typical metrics: number of tool calls, run duration, output accuracy.
Skill Usage
experiments:
  skill_hint: [enabled, disabled]
{{#if experiments.skill_hint == "enabled" }}
Check `skills/` for SKILL.md files relevant to this task and apply their guidance.
{{#endif}}
Typical metrics: output quality, context token consumption, run duration.
Timeout & Pacing
experiments:
  timeout: [short, long]
Pair with a conditional step, or use two compiled files with different `timeout-minutes:` values.
Minimal Working Example
---
description: Daily PR summary — A/B test concise vs. detailed output
on:
  schedule: daily on weekdays
engine: copilot
permissions:
  pull-requests: read
tools:
  github:
    toolsets: [pull_requests]
safe-outputs:
  create-discussion:
    title-prefix: "[pr-summary] "
    close-older-discussions: true
timeout-minutes: 15
experiments:
  output_style: [concise, detailed]
---
Summarise the pull requests merged in ${{ github.repository }} today.
{{#if experiments.output_style == "concise" }}
Write a maximum of 5 bullet points. Each bullet is one sentence.
{{#else}}
Write a structured report with sections for: new features, bug fixes, refactors,
and documentation changes. Include a one-paragraph executive summary at the top.
{{#endif}}
Include links to each PR. Use ${{ github.server_url }}/${{ github.repository }}/pull/<number> format.
Compile and deploy:
gh aw compile pr-summary
The first run picks `concise` (both counts are 0, so array order breaks the tie), the second picks `detailed`, and the two keep alternating until you conclude the experiment.
Multiple Simultaneous Experiments
Independent assignment, all three injected into the prompt:
experiments:
  prompt_style: [concise, detailed]
  emoji_density: [heavy, minimal]
  skill_hint: [enabled, disabled]
⚠️ Interaction effects — limit to 2–3 simultaneous experiments unless you can run factorial analysis.
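To see why the limit matters, the number of variant combinations can be counted with a short Python sketch. The 20-runs-per-cell target is a hypothetical figure borrowed from the rule of thumb that fewer than about 20 runs per variant is too early to interpret.

```python
from itertools import product
from math import prod

experiments = {
    "prompt_style": ["concise", "detailed"],
    "emoji_density": ["heavy", "minimal"],
    "skill_hint": ["enabled", "disabled"],
}

# Every combination of variants is one cell of a full factorial design.
cells = list(product(*experiments.values()))
print(len(cells))  # 8
print(cells[0])    # ('concise', 'heavy', 'enabled')

runs_per_cell = 20  # hypothetical target, see lead-in
total_runs = runs_per_cell * prod(len(v) for v in experiments.values())
print(total_runs)  # 160
```

Adding a fourth two-variant experiment would double the cell count again, so a daily workflow would need most of a year to fill every cell.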
Lifecycle of an Experiment
- Design: hypothesis, dimension, primary + guardrail metrics.
- Instrument: add `experiments:` and `{{#if experiments.<name> == "<variant>" }}` blocks. Never use `__GH_AW_EXPERIMENTS__*`.
- Compile: `gh aw compile <workflow-name>`.
- Run: check the activation job step summary for variant assignment.
- Analyse: once the minimum sample size is reached, compare distributions.
- Conclude: rewrite baseline to the winning variant, remove `experiments:`, recompile.
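For the Design and Analyse steps, a rough minimum sample size can be estimated before the first run. The sketch below assumes the primary metric is a proportion (for example a success rate) and uses the standard normal-approximation formula for a two-sided comparison of two proportions; `runs_per_variant` and its default `alpha` and `power` values are illustrative choices, not part of the tooling.

```python
from math import ceil
from statistics import NormalDist


def runs_per_variant(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate runs needed per variant to detect p_control vs p_variant.

    Normal-approximation formula for comparing two proportions
    with a two-sided test at significance alpha and the given power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_variant) ** 2)


# Hypothetical example: success rate expected to move from 80% to 95%.
print(runs_per_variant(0.80, 0.95))
```

Smaller expected differences need many more runs, which is one reason to prefer high-frequency workflows.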
Anti-Patterns
- ❌ Multiple dimensions in one experiment: can't attribute the improvement.
- ❌ Removing `experiments:` before the sample size is reached: resets state, invalidates counts.
- ❌ Interpreting early results (<~20 runs/variant): chance variation dominates.
- ❌ Experiments as feature flags: use `features:` for deterministic switches.
- ❌ Engine experiments in one file: `engine:` cannot switch mid-run; use two parallel files.
- ❌ Conditional frontmatter imports: keep imports security-stable and use `{{#if experiments.<name> }}` with `{{#runtime-import? path}}` (optional form, not promoted to unconditional lock-file macros) for prompt experiments instead.
- ❌ Nesting `{{#if experiments.<name> }}` inside `{{#runtime-import? }}`: evaluation order is brittle across import boundaries. Prefer explicit branching in the main workflow prompt or separate workflow files per variant.
- ❌ Writing the internal env-var form `__GH_AW_EXPERIMENTS__*`: implementation detail, may change.