README.md

April 3, 2026

Rashomon

Know whether your skills actually improve agent behavior, not just make it look different.

Why rashomon?

Inspired by the Rashomon effect: the idea that the same event can yield different accounts depending on perspective. rashomon makes those differences explicit and comparable.

  • Built a skill but unsure if it actually changes agent behavior?
  • Iterating on skills and prompts by gut feel instead of evidence?
  • Want proof that your changes made things better, not just different?

rashomon evaluates skills and prompts through blind comparison — running tasks with and without your changes in isolated environments, then comparing real outputs without knowing which version produced which.
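The blind step can be pictured with a minimal Python sketch. It illustrates the general technique only, not rashomon's internal code: `blind_pair`, `blind_compare`, the `X`/`Y` labels, and the toy judge are all hypothetical names chosen for this example.

```python
import random
from typing import Callable, Dict, Optional, Tuple


def blind_pair(
    baseline: str, candidate: str, rng: random.Random
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hide which version produced which output behind neutral labels."""
    versions = [("baseline", baseline), ("candidate", candidate)]
    rng.shuffle(versions)  # random order, so a label carries no hint
    outputs = {"X": versions[0][1], "Y": versions[1][1]}
    key = {"X": versions[0][0], "Y": versions[1][0]}
    return outputs, key


def blind_compare(
    baseline: str,
    candidate: str,
    judge: Callable[[Dict[str, str]], str],
    rng: Optional[random.Random] = None,
) -> str:
    """Let the judge pick 'X' or 'Y', then map the pick back to a version."""
    outputs, key = blind_pair(baseline, candidate, rng or random.Random())
    return key[judge(outputs)]


# Toy judge for demonstration only: prefers the longer output.
def longer_wins(outputs: Dict[str, str]) -> str:
    return max(outputs, key=lambda label: len(outputs[label]))


print(blind_compare("short", "a much longer answer", longer_wins))
```

The judge only ever sees the labels `X` and `Y`; the mapping back to a version is applied after the verdict, which is what keeps the comparison blind.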

Who Is This For?

rashomon is designed for:

  • Skill authors who want evidence-based validation
  • Developers using Claude Code daily
  • Teams iterating on complex prompts (coding, analysis, writing)
  • Anyone who wants evidence, not vibes, when improving skills and prompts

Not ideal if:

  • You want one-shot prompt rewriting without comparison

Quick Example

Skill Evaluation

/recipe-eval-skill create

Creates a skill through interactive dialog, then evaluates effectiveness:

  1. Collects domain knowledge, project-specific rules, and trigger phrases
  2. Generates optimized skill content (graded A/B/C)
  3. Runs a test task with and without the skill in isolated environments, using blind A/B comparison

What the evaluation report looks like:

Skill Quality: Grade A
- Project-specific rules clearly encoded, no critical issues

Trigger Check: pass (discovered + invoked)

Execution Effectiveness:
- Winner: with-skill
- Assessment: structural improvement
- Key difference: 3-stage catch ordering and retry constraints
  applied correctly (attributed to skill Rules 3 and 6)

Recommendation: ship

/recipe-eval-skill api-error-handling skill's scope needs adjustment

Updates an existing skill, then evaluates old vs new version side by side.

See a real-world example: I Built a Skill Reviewer. Then I Ran It on Itself.

Prompt Evaluation

/recipe-eval-prompt Write a function to sort an array

Analyzes prompt issues, generates an improved version, runs both in isolated environments, and shows what actually changed.

Prompt Evaluation Details

What You Get

1. Detected Issues

- BP-002 (Vague Instructions): Sort order, language, and error handling not specified
- BP-003 (Missing Output Format): No expected output structure defined

2. Improved Prompt

Write a TypeScript function that sorts a number array in ascending order.
- Return empty array for empty input
- Include JSDoc comments
- Output: function code with example usage

3. Comparison Report

| Aspect | Original | Improved |
| --- | --- | --- |
| Type definitions | None | Included |
| Edge case handling | None | Included |
| Documentation | None | JSDoc added |

Result: Structural Improvement - The optimization made a meaningful difference.

Example: When rashomon finds no real improvement

/recipe-eval-prompt Summarize this article in 3 bullet points

Result: Variance - Prompt was already well-scoped; differences were stylistic only.

Installation

Requires Claude Code (this is a Claude Code plugin)

# 1. Start Claude Code
claude

# 2. Install the marketplace
/plugin marketplace add shinpr/rashomon

# 3. Install plugin
/plugin install rashomon@rashomon

# 4. Restart session (required)
# Exit and restart Claude Code

Usage

Skill Evaluation

/recipe-eval-skill create

Create a new skill and evaluate its effectiveness.

/recipe-eval-skill my-skill-name what to change

Update an existing skill and compare old vs new.

Prompt Evaluation

/recipe-eval-prompt Your prompt here

From a file:

/recipe-eval-prompt Generate code following this skill: ./prompts/my-skill.md

For complex tasks that need more time, just mention it in natural language:

/recipe-eval-prompt Refactor the entire authentication module. This might take a while.

How It Works

Skill Evaluation (/recipe-eval-skill)
├── skill-creator (generates/modifies skills)
├── skill-reviewer (grades quality A/B/C)
├── eval-executor × 2 (isolated worktrees)
└── skill-eval-reporter (blind A/B comparison)

Prompt Evaluation (/recipe-eval-prompt)
├── prompt-analyzer (analyzes and optimizes)
├── prompt-executor × 2 (isolated worktrees)
└── report-generator (compares results)

Technical Details

Isolated Execution

rashomon uses git worktrees to run both versions in completely separate environments. A worktree is a Git feature that creates independent working directories from the same repository—this ensures the two executions don't interfere with each other.
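As a rough illustration of the mechanism, the sketch below creates and removes a throwaway worktree from Python. It is not rashomon's implementation: the helper name `temporary_worktree` and the directory prefix are assumptions. It relies only on the standard `git worktree add` and `git worktree remove` commands; note that `git worktree remove` needs Git 2.17 or newer.

```python
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def temporary_worktree(repo: Path, prefix: str = "worktree-demo-") -> Iterator[Path]:
    """Check out HEAD into a throwaway directory and remove it afterwards."""
    parent = Path(tempfile.mkdtemp(prefix=prefix))
    path = parent / "wt"
    # --detach avoids creating a branch for the temporary checkout.
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "--detach", str(path)],
        check=True, capture_output=True,
    )
    try:
        yield path
    finally:
        # --force discards any changes the run left in the worktree.
        subprocess.run(
            ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)],
            check=True, capture_output=True,
        )
        parent.rmdir()
```

Opening two such worktrees on the same repository gives two independent checkouts of the same commit, so a file edited in one is untouched in the other.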

Improvement Classification

Not all differences are improvements. rashomon classifies results into four categories:

| Classification | Meaning | Recommendation |
| --- | --- | --- |
| Structural | Real improvement in accuracy, completeness, or quality | Use the new version |
| Context Addition | One version had more project-specific knowledge | Useful if the context is accurate |
| Expressive | Different wording, same substance | Either version is fine |
| Variance | Just normal LLM randomness | Original was already good |

Classification is based on:

  • Whether detected issues were resolved
  • Output completeness and constraint adherence
  • Agreement between blind quality assessment and observable output differences
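The four categories and their recommendations can be written down as a small lookup. The sketch below is a hypothetical encoding: the recommendation strings come from the classification table, while `should_adopt` is only one possible reading of it, not logic taken from rashomon.

```python
from enum import Enum


class Classification(Enum):
    STRUCTURAL = "Structural"
    CONTEXT_ADDITION = "Context Addition"
    EXPRESSIVE = "Expressive"
    VARIANCE = "Variance"


# Recommendation text as given in the classification table.
RECOMMENDATION = {
    Classification.STRUCTURAL: "Use the new version",
    Classification.CONTEXT_ADDITION: "Useful if the context is accurate",
    Classification.EXPRESSIVE: "Either version is fine",
    Classification.VARIANCE: "Original was already good",
}


def should_adopt(result: Classification, context_is_accurate: bool = False) -> bool:
    """One possible reading of the table: adopt only when there is real gain."""
    if result is Classification.STRUCTURAL:
        return True
    if result is Classification.CONTEXT_ADDITION:
        return context_is_accurate
    return False  # Expressive and Variance: nothing substantive changed


result = Classification("Variance")
print(RECOMMENDATION[result], "| adopt:", should_adopt(result))
```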

Quality Patterns (BP-001 through BP-008)

Both skill review and prompt analysis check against 8 common patterns:

| Priority | Issues |
| --- | --- |
| Critical | Negative instructions ("don't do X"), vague instructions, missing output format |
| High Impact | Unstructured prompts, missing context, complex tasks without breakdown |
| Enhancement | Biased examples, no permission for uncertainty |

P1: Critical (Must Fix)

| ID | Pattern | Problem | Fix |
| --- | --- | --- | --- |
| BP-001 | Negative Instructions | "Don't do X" often backfires because LLMs focus on what's mentioned | Reframe positively: "Don't include opinions" → "Include only factual information" |
| BP-002 | Vague Instructions | Missing specifics cause high output variance | Add explicit constraints: format, length, scope, tone |
| BP-003 | Missing Output Format | No format spec leads to inconsistent outputs | Define expected structure: JSON schema, section headers, etc. |

P2: High Impact (Should Fix)

| ID | Pattern | Problem | Fix |
| --- | --- | --- | --- |
| BP-004 | Unstructured Prompt | Wall of text obscures priorities | Apply 4-block pattern: Context / Task / Constraints / Output Format |
| BP-005 | Missing Context | No background leads to wrong assumptions | Add purpose, audience, relevant constraints |
| BP-006 | Complex Task | Undivided complex tasks have higher error rates | Break into steps with quality checkpoints |
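To make the 4-block pattern from BP-004 concrete, here is a small sketch that assembles a prompt from its four parts. `build_prompt` is a hypothetical helper, and the `Title:` layout is just one reasonable way to render the blocks. The task and constraints reuse the sorting example from earlier; the context line is made up for illustration.

```python
from typing import List


def build_prompt(context: str, task: str, constraints: List[str], output_format: str) -> str:
    """Lay a prompt out as Context / Task / Constraints / Output Format."""
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    blocks = [
        ("Context", context),
        ("Task", task),
        ("Constraints", constraint_lines),
        ("Output Format", output_format),
    ]
    # One blank line between blocks keeps each section easy to spot.
    return "\n\n".join(f"{title}:\n{body}" for title, body in blocks)


print(build_prompt(
    context="Utility library for a TypeScript project.",  # illustrative only
    task="Write a function that sorts a number array in ascending order.",
    constraints=["Return an empty array for empty input", "Include JSDoc comments"],
    output_format="Function code with example usage",
))
```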

P3: Enhancement (Could Fix)

| ID | Pattern | Problem | Fix |
| --- | --- | --- | --- |
| BP-007 | Biased Examples | Homogeneous examples cause overfitting | Diversify: include edge cases, different formats |
| BP-008 | No Uncertainty Permission | No "I don't know" option causes hallucination | Add: "If unsure, say so" |

Knowledge Base

rashomon learns from your project over time.

Location: .claude/.rashomon/prompt-knowledge.yaml

How it works:

  • Automatically enabled when the file exists
  • Stores project-specific patterns (not generic best practices)
  • Referenced during analysis, updated after comparisons
  • Max 20 entries, lowest-confidence ones removed first

Key principle: Old knowledge isn't automatically removed. Patterns that have worked for a long time are often the most valuable.
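The retention rule (cap at 20, drop by confidence rather than by age) can be sketched as follows. This is an assumption-laden illustration: the real schema of `prompt-knowledge.yaml` is not documented here, so the `pattern` and `confidence` fields and the `add_entry` helper are hypothetical, and the sketch works on an in-memory list rather than the file.

```python
from typing import Dict, List

MAX_ENTRIES = 20  # limit stated in the list above


def add_entry(entries: List[Dict], new_entry: Dict, limit: int = MAX_ENTRIES) -> List[Dict]:
    """Add an entry, then drop the lowest-confidence ones beyond the limit."""
    combined = entries + [new_entry]
    # Rank by confidence only. sorted() is stable, so on a tie the entries
    # that were already stored stay ahead of the newcomer; age by itself
    # never causes a removal.
    ranked = sorted(combined, key=lambda e: e["confidence"], reverse=True)
    return ranked[:limit]


# Hypothetical data: 20 stored patterns plus one new high-confidence pattern.
entries = [{"pattern": f"rule-{i}", "confidence": 0.5 + i / 100} for i in range(20)]
entries = add_entry(entries, {"pattern": "new-rule", "confidence": 0.9})
print(len(entries), [e["pattern"] for e in entries[:3]])
```

Here the entry that falls out is the one with the lowest confidence, regardless of how long it has been stored.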

Troubleshooting

Leftover worktrees

If rashomon exits unexpectedly, temporary directories might remain:

# Worktrees are stored in the system temp directory
# Clean up manually if needed:
rm -rf "${TMPDIR:-/tmp}"/worktree-rashomon-*
# Then clear Git's stale records of the deleted worktrees:
git worktree prune

Timeout issues

For complex prompts that need more time, mention it when invoking:

/recipe-eval-prompt Complex task here. This might take longer than usual.

"Not a git repository" error

rashomon requires a git repository. Initialize one with:

git init

Requirements

  • Git 2.5+
  • Python 3.9+
  • Claude Code
  • Must run inside a git repository

License

MIT