🔬 MarkScientist

May 14, 2026

Self-evolving Research Agent with Built-in Scientific Taste and Taste Learning

Challenger prepares → Solver researches → Judge reviews

License: MIT · Python 3.10+

MarkScientist is a higher-layer framework, built on top of ResearchHarness, for turning a user request into a research project workspace, executing that project, and reviewing both the project definition and the resulting report.

Unlike a standalone execution harness, this project is intentionally centered on:

  • Challenger, Solver, and Judge role separation
  • Taste Learning as a core Judge calibration feature
  • project-first research workflows
  • review-driven improvement loops
  • workflow-level traces layered on top of per-agent harness traces
  • higher-level orchestration and evaluation policies
  • a CLI that exposes the full research loop across multiple agents

The point is not to replace ResearchHarness. The point is to build a scientific workflow layer that reuses the lower-layer runtime while adding project setup, role structure, review pressure, and orchestration logic.



✨ Highlights

  • Built on ResearchHarness: ResearchHarness owns SDK calls, tool calling, and the ReAct loop; MarkScientist owns multi-agent roles and workflow orchestration.
  • Taste learning as a first-class feature: Judge standards can be calibrated from a visible workspace feedback log instead of hidden machine-local state.
  • Three-role research loop: Challenger prepares the project, Solver performs the research, and Judge scores both the project definition and the resulting report.
  • Project-first execution: The workflow is built around a concrete workspace with staged inputs, a public execution package, hidden judge criteria, code, outputs, and report/report.md.
  • Review-driven improvement: The workflow can iteratively improve outputs based on Judge feedback instead of stopping at one draft.
  • Conditional re-challenge: Judge can send the workflow back to Challenger, not only when the report is weak, but also when the project definition itself is too weak, too toy-like, or not grounded in the available inputs.
  • Workflow-level traces: MarkScientist preserves per-agent ResearchHarness traces and adds a higher-level workflow summary.
  • Checklist-based judging: Judge scores the project and report against an explicit INSTRUCTIONS.md task contract and a hidden judge checklist rather than vague style preferences.
  • Scenario-aware Judge policies: Judge uses explicit review policies that combine scenario, reviewer perspective, and scoring skill instead of one generic review prompt.
  • Judge skill library: The scoring skills are stored as standard markdown skills under markscientist/skills/*/SKILL.md, not hard-coded prompt blobs.
  • Multi-reviewer Judge panels: Judge simulates multiple specialized reviewers and aggregates them into one final benchmark decision.
  • Visible taste learning: task/target_study/feedback_history.jsonl keeps calibration inputs inside the project workspace, so score shifts are inspectable and reproducible.

At a Glance

| Area | What MarkScientist focuses on |
| --- | --- |
| Runtime dependency | Reuses ResearchHarness for execution |
| Roles | Challenger, Solver, Judge |
| Core artifact | A prepared research project workspace |
| Review model | Score, critique, and improve the report |
| Judge system | 15 scenarios × 12 perspectives × 5 skills |
| Skill storage | markscientist/skills/*/SKILL.md |
| Taste learning | Visible workspace feedback calibration |
| Trace model | Workflow summary plus per-agent traces |
| UX | Interactive multi-agent CLI |
| Scope | Scientific workflow layer, not execution harness |

🚀 Quick Start

git submodule update --init --recursive
pip install -e .
markscientist

MarkScientist currently assumes a source checkout with the ResearchHarness git submodule available. Wheel-only installs are not a supported standalone distribution mode.

🧠 How It Works

MarkScientist is not a second execution harness. It is a higher-layer framework built on top of ResearchHarness.

flowchart TD
    U[User]
    C[Challenger]
    P[Project]
    S[Solver]
    R[Report]
    J[Judge]
    SK[Skill]
    TL[Taste Learning]

    U -->|research request| C
    C -->|prepare project| P
    P -->|execution package| S
    S -->|write report| R
    R -->|submit for review| J
    J -->|solver revision| S
    J -->|rechallenge| C
    SK -->|review skill| J
    J -->|update taste signals| TL
    TL -->|apply learned calibration| J
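
The control flow in the diagram above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MarkScientist's actual scheduler: `Verdict`, `run_loop`, the action labels `"accept"`, `"revise"`, and `"rechallenge"`, and the stub roles are all hypothetical names introduced here, and the scores are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical Judge decision; not a MarkScientist type."""
    score: float
    action: str  # assumed labels: "accept", "revise", or "rechallenge"


def run_loop(
    prompt: str,
    challenger: Callable[[str, Optional[Verdict]], str],
    solver: Callable[[str, Optional[Verdict]], str],
    judge: Callable[[str, str], Verdict],
    max_iterations: int = 3,
) -> Tuple[str, Verdict, int]:
    """Run Challenger once, then alternate Solver and Judge until accepted."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    project = challenger(prompt, None)
    feedback: Optional[Verdict] = None
    for iteration in range(1, max_iterations + 1):
        report = solver(project, feedback)
        verdict = judge(project, report)
        if verdict.action == "accept":
            break
        if verdict.action == "rechallenge":
            # The project definition itself was judged too weak: rebuild it.
            project = challenger(prompt, verdict)
            feedback = None
        else:
            # The report was weak: send the critique back to the Solver.
            feedback = verdict
    return report, verdict, iteration


# Stub roles standing in for the real agents.
def demo_challenger(prompt: str, verdict: Optional[Verdict]) -> str:
    return f"project for: {prompt}"


def demo_solver(project: str, feedback: Optional[Verdict]) -> str:
    return "draft 2" if feedback is not None else "draft 1"


_verdicts = iter([Verdict(60.0, "revise"), Verdict(75.0, "accept")])


def demo_judge(project: str, report: str) -> Verdict:
    return next(_verdicts)


report, verdict, iterations = run_loop(
    "demo request", demo_challenger, demo_solver, demo_judge
)
print(report, verdict.score, iterations)
```

Passing the roles in as callables keeps the loop independent of how each agent is implemented, which mirrors the split between orchestration and execution described in this section.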

The lower-layer execution details live in ResearchHarness, and MarkScientist connects to them like this:

flowchart TD
    subgraph MS[MarkScientist]
        WF[Workflow / Scheduling] --> AG[Challenger / Solver / Judge]
        AG --> RP[Role Prompts]
        WF --> WR[Workflow Trajectory Wrapper]
    end

    subgraph RH[ResearchHarness]
        AB[BaseAgent / MultiTurnReactAgent]
        LOOP[ReAct Runtime]
        TOOLS[Tool Registry + Execution]
        TRACE[FlatTraceWriter]
    end

    AG --> AB
    AB --> LOOP
    LOOP --> TOOLS
    LOOP --> TRACE
    WR --> TRACE

🗂 Project Model

The workflow now separates the Solver-visible execution workspace from Judge-only evaluation materials.

Expected layout:

workspace_root/
  task/
    task_info.json     # private ResearchClawBench-style task contract
    data/              # canonical source data created/curated by Challenger
    related_work/      # canonical real source PDFs created/curated by Challenger
    target_study/
      paper.pdf        # hidden target-study anchor PDF
      checklist.json   # hidden judge rubric
      images/          # optional hidden reference images
      feedback_history.jsonl  # optional visible taste-learning log for judge calibration
  public/
    INSTRUCTIONS.md
    data/              # solver-visible staged subset of task/data/
    related_work/      # solver-visible staged subset of task/related_work/ (starts as PDFs; solver tools may later create local extracted sidecars)
    code/
    outputs/
    report/
      report.md
      images/

Role responsibilities:

  • Challenger works at the private task level and builds the project from scratch when needed. It creates or curates canonical source materials under task/data/ and task/related_work/, writes task/task_info.json, and writes the hidden task/target_study/* assets; the harness then exports the solver-visible subset into public/.
  • task/data/ is for canonical data artifacts only. It should contain datasets or data directories, not literature PDFs. Real PDF references belong under task/related_work/ or task/target_study/.
  • Solver-visible related work should come from real source PDFs in task/related_work/, or from genuinely downloaded PDFs that Challenger first saves under task/related_work/ and then stages into public/related_work/. Placeholder PDFs or fabricated paper files are not valid project inputs.
  • Solver works only inside public/, performs the research, and must finish with public/report/report.md.
  • Judge evaluates the public deliverables and may additionally read hidden materials under task/target_study/.

This separation is intentional: hidden scoring criteria or target answers should never be exposed through the public project files that the Solver can read, but the Challenger is still responsible for constructing the canonical source materials and packaging the full executable project.
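
The private/public split can be illustrated with a short standard-library sketch. It is not the project's `ensure_project_layout` or the harness export step: `create_layout`, `stage_public`, and `hidden_leaks` are hypothetical helpers, the sketch stages individual files only, and the leak check is a crude name-based scan.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List

PRIVATE_DIRS = ["task/data", "task/related_work", "task/target_study/images"]
PUBLIC_DIRS = [
    "public/data",
    "public/related_work",
    "public/code",
    "public/outputs",
    "public/report/images",
]
# Judge-only or private file names that should never appear under public/.
HIDDEN_NAMES = {"checklist.json", "task_info.json", "feedback_history.jsonl"}


def create_layout(root: Path) -> None:
    """Create the directory skeleton shown in the layout above."""
    for rel in PRIVATE_DIRS + PUBLIC_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)


def stage_public(
    root: Path, data_names: Iterable[str], related_names: Iterable[str]
) -> None:
    """Copy a chosen subset of canonical files into the solver-visible tree."""
    for sub, names in (("data", data_names), ("related_work", related_names)):
        for name in names:
            shutil.copy2(root / "task" / sub / name, root / "public" / sub / name)


def hidden_leaks(root: Path) -> List[Path]:
    """Return public files whose names match known hidden artifacts."""
    return [p for p in (root / "public").rglob("*") if p.name in HIDDEN_NAMES]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    create_layout(root)
    (root / "task/data/measurements.csv").write_text("x,y\n1,2\n", encoding="utf-8")
    (root / "task/target_study/checklist.json").write_text("[]", encoding="utf-8")
    stage_public(root, data_names=["measurements.csv"], related_names=[])
    staged = sorted(
        p.relative_to(root).as_posix()
        for p in (root / "public").rglob("*")
        if p.is_file()
    )
    leaks = hidden_leaks(root)

print(staged)
print(leaks)
```

Staging by explicit allow-list, rather than copying everything and deleting hidden files afterwards, means a forgotten exclusion cannot expose Judge-only material.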

🧪 Judge Model

The current Judge keeps the simple Challenger / Solver / Judge architecture, but its review logic is no longer one flat prompt. It now uses a lightweight policy model:

  • Scenario: what kind of thing is being judged
  • Perspective: which specialized reviewer viewpoint to emulate
  • Skill: which scoring style to emulate

The exact scoring skills are stored as standard markdown skill files:

  • markscientist/skills/judge-geval/SKILL.md
  • markscientist/skills/judge-prometheus/SKILL.md
  • markscientist/skills/judge-pairwise/SKILL.md
  • markscientist/skills/judge-pandalm/SKILL.md
  • markscientist/skills/judge-judgelm/SKILL.md

The policy system currently defines 15 built-in Judge scenarios:

| Scenario | What it emphasizes |
| --- | --- |
| idea_generation | early research idea quality before project commitment |
| novelty_check | differentiation from prior work |
| project_definition | grounding, scope, executability, scientific value, non-toy quality |
| experiment_design | methodology, controls, and reproducibility before execution |
| result_analysis | correctness, interpretation, and uncertainty handling |
| research_report | methodology, evidence, results, limitations, reproducibility |
| claim_validation | evidence support, claim scope, overclaim risk |
| ablation_review | ablation quality and variable isolation |
| paper_outline | paper structure and completeness |
| section_draft | section-level scientific writing quality |
| figure_table | scientific usefulness of figures and tables |
| rebuttal | rebuttal responsiveness and evidence use |
| revision | whether a revised artifact materially improved |
| code_review | code correctness and engineering quality |
| literature_review | literature coverage, synthesis, and recency |

The default workflow mainly uses project_definition and research_report, while the remaining scenarios stay available for stricter or more specialized review passes.

Built-in reviewer perspectives:

| Perspective | Focus |
| --- | --- |
| senior_reviewer | overall decision quality |
| novelty_critic | originality and overlap with prior work |
| methods_expert | design rigor and scope control |
| statistics_expert | quantitative validity and uncertainty handling |
| writing_expert | clarity, structure, and presentation |
| domain_expert | domain-specific technical correctness |
| literature_expert | prior work coverage and positioning |
| code_expert | implementation correctness and engineering quality |
| reproducibility_advocate | artifact completeness |
| skeptic | unsupported claims and overclaim detection |
| area_chair | balanced final judgment |
| visualization_expert | figure and table quality |

Current scoring skills:

| Skill | Style |
| --- | --- |
| geval | multi-dimensional rubric scoring |
| prometheus | strict criterion-by-criterion grading |
| pairwise | before-after comparison |
| pandalm | balanced full-artifact evaluation with calibrated tie handling |
| judgelm | evidence-heavy judgment and claim scrutiny |
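
To make the scenario, perspective, and skill combination concrete, here is a small sketch of a policy identity. `ReviewPolicy` and its `skill_path` property are hypothetical names for illustration; the package's own types (such as `JudgeScenario`) may be shaped differently. The path pattern simply follows the skill files listed earlier.

```python
from dataclasses import dataclass
from itertools import product
from pathlib import PurePosixPath

SCENARIOS = [
    "idea_generation", "novelty_check", "project_definition",
    "experiment_design", "result_analysis", "research_report",
    "claim_validation", "ablation_review", "paper_outline",
    "section_draft", "figure_table", "rebuttal", "revision",
    "code_review", "literature_review",
]
PERSPECTIVES = [
    "senior_reviewer", "novelty_critic", "methods_expert",
    "statistics_expert", "writing_expert", "domain_expert",
    "literature_expert", "code_expert", "reproducibility_advocate",
    "skeptic", "area_chair", "visualization_expert",
]
SKILLS = ["geval", "prometheus", "pairwise", "pandalm", "judgelm"]


@dataclass(frozen=True)
class ReviewPolicy:
    """Hypothetical policy identity: one scenario, perspective, and skill."""
    scenario: str
    perspective: str
    skill: str

    def __post_init__(self) -> None:
        # Reject names that are not in the built-in tables above.
        if self.scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {self.scenario}")
        if self.perspective not in PERSPECTIVES:
            raise ValueError(f"unknown perspective: {self.perspective}")
        if self.skill not in SKILLS:
            raise ValueError(f"unknown skill: {self.skill}")

    @property
    def skill_path(self) -> PurePosixPath:
        # Mirrors the listed layout markscientist/skills/judge-<skill>/SKILL.md.
        return PurePosixPath("markscientist/skills") / f"judge-{self.skill}" / "SKILL.md"


all_policies = [ReviewPolicy(*combo) for combo in product(SCENARIOS, PERSPECTIVES, SKILLS)]
example = ReviewPolicy("research_report", "skeptic", "geval")
print(len(all_policies), example.skill_path)
```

A frozen dataclass is hashable, so the same object can serve as a dictionary key wherever per-policy state needs to be stored.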

The public workflow currently uses reviewer panels internally:

  • project definition panel defaults to methods_expert × prometheus, literature_expert × geval, and area_chair × judgelm
  • report panel defaults to area_chair × judgelm, skeptic × geval, and reproducibility_advocate × prometheus
  • claim validation remains available as an explicit report-review scenario when a caller chooses it programmatically, and it uses its own panel composition
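
A panel can be pictured as a list of (perspective, skill) pairs whose individual scores are combined into one result. The sketch below uses the default panels named above, but the aggregation rule is an assumption: this README does not say how reviewer scores are merged, so an unweighted mean stands in. `aggregate_panel` and the placeholder scores are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Reviewer = Tuple[str, str]  # (perspective, skill)

PROJECT_PANEL: List[Reviewer] = [
    ("methods_expert", "prometheus"),
    ("literature_expert", "geval"),
    ("area_chair", "judgelm"),
]
REPORT_PANEL: List[Reviewer] = [
    ("area_chair", "judgelm"),
    ("skeptic", "geval"),
    ("reproducibility_advocate", "prometheus"),
]


def aggregate_panel(
    panel: List[Reviewer], score_fn: Callable[[str, str], float]
) -> Dict[str, object]:
    """Score with every reviewer, then combine (assumed: unweighted mean)."""
    individual = {reviewer: score_fn(*reviewer) for reviewer in panel}
    return {
        "individual": individual,
        "final_score": round(mean(individual.values()), 1),
    }


# Placeholder scores for illustration only; a real reviewer would call a model.
placeholder_scores = {
    ("area_chair", "judgelm"): 80.0,
    ("skeptic", "geval"): 70.0,
    ("reproducibility_advocate", "prometheus"): 75.0,
}
result = aggregate_panel(REPORT_PANEL, lambda p, s: placeholder_scores[(p, s)])
print(result["final_score"])
```

Keeping the individual scores next to the final one makes it possible to see which reviewer pulled the decision up or down.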

Taste learning is visible and optional. If task/target_study/feedback_history.jsonl exists inside the current project workspace, Judge can apply small score offsets derived from repeated user feedback. Calibrations are keyed by the full reviewer identity (scenario + perspective + skill), which keeps different judging modes from contaminating each other. This keeps taste learning inside the workspace instead of relying on hidden machine-local files, and makes every calibration source inspectable by the user.
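
The calibration idea can be sketched as follows. The record schema is an assumption: this README does not document the fields of feedback_history.jsonl, so `judge_score` and `user_score` are invented for illustration, as are `load_offsets`, `calibrate`, the two-sample minimum, and the 5-point cap that keeps offsets small.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

PolicyKey = Tuple[str, str, str]  # (scenario, perspective, skill)


def load_offsets(
    lines: Iterable[str], min_samples: int = 2, max_offset: float = 5.0
) -> Dict[PolicyKey, float]:
    """Derive one bounded offset per reviewer identity from JSONL feedback."""
    deltas = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        key = (entry["scenario"], entry["perspective"], entry["skill"])
        # Assumed fields: what the Judge gave vs. what the user thought fair.
        deltas[key].append(entry["user_score"] - entry["judge_score"])
    offsets = {}
    for key, values in deltas.items():
        if len(values) >= min_samples:  # only repeated feedback counts
            offsets[key] = max(-max_offset, min(max_offset, mean(values)))
    return offsets


def calibrate(score: float, key: PolicyKey, offsets: Dict[PolicyKey, float]) -> float:
    """Apply the offset for this exact identity and keep the score in 0-100."""
    return max(0.0, min(100.0, score + offsets.get(key, 0.0)))


strict_key = ("research_report", "skeptic", "geval")
other_key = ("research_report", "area_chair", "judgelm")
history = [
    json.dumps({"scenario": "research_report", "perspective": "skeptic",
                "skill": "geval", "judge_score": 80, "user_score": 70}),
    json.dumps({"scenario": "research_report", "perspective": "skeptic",
                "skill": "geval", "judge_score": 75, "user_score": 69}),
    json.dumps({"scenario": "research_report", "perspective": "area_chair",
                "skill": "judgelm", "judge_score": 60, "user_score": 90}),
]
offsets = load_offsets(history)
print(calibrate(80.0, strict_key, offsets), calibrate(80.0, other_key, offsets))
```

Because offsets are keyed by the full identity, the single area_chair entry neither reaches the sample threshold nor affects the skeptic reviewer's calibration.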

🧭 Architecture Boundary

  • ResearchHarness is the execution layer:
    • OpenAI-compatible SDK calls
    • native tool calling
    • ReAct loop
    • tool registry and execution
    • flat per-agent trace writing
  • MarkScientist is the orchestration layer:
    • Challenger / Solver / Judge roles
    • project preparation and workflow scheduling
    • solver/judge improvement loops
    • role-specific prompt addenda
    • workflow-level trajectory summaries

MarkScientist agents inherit the ResearchHarness agent base instead of reimplementing the lower-layer execution stack.

💬 Usage

Interactive REPL

markscientist

Default mode runs the full research workflow.

[workflow] > Analyze the attached dataset and produce a research report.

╭──────────────── Final Report ────────────────╮
│ # Research Report                            │
│ ...                                          │
╰──────────────────────────────────────────────╯

╭──────────── Workflow Summary ────────────────╮
│ Status      Success                          │
│ Score       75.0/100                         │
│ Iterations  2                                │
╰──────────────────────────────────────────────╯

Switch to a single role when needed:

[workflow] > /challenger
[challenger] > Prepare a project for reproducing the core claim.

[challenger] > /solver
[solver] > Execute the prepared project and write the report.

[solver] > /judge
[judge] > Score the current report against the hidden judge checklist.

CLI One-Shot Commands

# Full Challenger -> Solver -> Judge workflow
markscientist "Study whether the benchmark result is reproducible"

# Challenger only
markscientist "Prepare a project for evaluating the dataset" --agent challenger

# Solver only
markscientist "Execute the prepared project" --agent solver

# Judge only
markscientist "Review the current report" --agent judge

# JSON output
markscientist "Review the current report" --agent judge --json

Python API

from pathlib import Path

from markscientist.config import Config, set_config
from markscientist.judging import JudgeScenario
from markscientist.project import ensure_project_layout

config = Config.from_env()
# If omitted, MarkScientist will create a project under data/workspaces/<session-id>.
# Set an explicit repo-local workspace root only when you want a stable named project path.
config.workspace_root = Path("./data/workspaces/demo-project")
set_config(config)

from markscientist.agents import ChallengerAgent, JudgeAgent, SolverAgent
from markscientist.workflow import ResearchWorkflow

paths = ensure_project_layout(config.workspace_root)

challenger = ChallengerAgent(config=config, workspace_root=paths.project_root)
challenger.run("Prepare a research project for the current prompt.", workspace_root=paths.project_root)

solver = SolverAgent(config=config, workspace_root=paths.public_root)
solver_result = solver.run("Execute the prepared project.", workspace_root=paths.public_root)

judge = JudgeAgent(config=config, workspace_root=paths.project_root)
judge_result = judge.review_project_report(
    original_prompt="Review the current report strictly.",
    instructions_text=paths.instructions_path.read_text(encoding="utf-8"),
    checklist_text=paths.judge_checklist_path.read_text(encoding="utf-8"),
    judge_materials_text="",
    report_text=paths.report_path.read_text(encoding="utf-8"),
    report_scenario=JudgeScenario.RESEARCH_REPORT,
    workspace_root=paths.project_root,
)

workflow = ResearchWorkflow(config=config)
workflow_result = workflow.run("Write a research report", workspace_root=config.workspace_root)
print(workflow_result.final_score)
print(workflow_result.metadata["report_path"])

📋 Commands

/help        Show commands       /workflow    Full workflow
/challenger  Challenger mode     /solver      Solver mode
/judge       Judge mode          /model       Switch model
/config      Show config         /clear       New session
/exit        Exit

⚙️ Config

# .env
API_KEY=your-key
API_BASE=https://your-openai-compatible-endpoint/v1
MODEL_NAME=gpt-5.4
# SUMMARY_MODEL_NAME=gpt-5.4
SERPER_KEY_ID=your_serper_key
JINA_API_KEYS=your_jina_key
MINERU_TOKEN=your_mineru_token

MarkScientist reads API_KEY, API_BASE, and MODEL_NAME directly. The extra keys are included because the underlying ResearchHarness tool layer may need them when the workflow uses web search, web fetch, or PDF parsing.
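
As a rough picture of what reading these three variables involves, the sketch below validates them from a mapping. It is not the implementation of `Config.from_env`: `EnvSettings` and `settings_from_env` are hypothetical names, and how MarkScientist treats an unset SUMMARY_MODEL_NAME is not stated here, so the sketch simply leaves it as `None`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = ("API_KEY", "API_BASE", "MODEL_NAME")


@dataclass
class EnvSettings:
    """Hypothetical container; the real Config class may differ."""
    api_key: str
    api_base: str
    model_name: str
    summary_model_name: Optional[str] = None


def settings_from_env(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read the three required variables and fail loudly if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return EnvSettings(
        api_key=env["API_KEY"],
        api_base=env["API_BASE"],
        model_name=env["MODEL_NAME"],
        summary_model_name=env.get("SUMMARY_MODEL_NAME") or None,
    )


# Demo with an explicit mapping so the example does not depend on a real .env.
settings = settings_from_env({
    "API_KEY": "your-key",
    "API_BASE": "https://your-openai-compatible-endpoint/v1",
    "MODEL_NAME": "gpt-5.4",
})
print(settings.model_name, settings.summary_model_name)
```

Accepting a mapping instead of reading `os.environ` directly makes the function easy to exercise in tests without touching the process environment.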

Agent runtime defaults and trajectory defaults live in code. Override them programmatically on Config(...) when needed.

If you need a non-default workspace root, set config.workspace_root before creating agents.

🧪 Testing

PYTHONDONTWRITEBYTECODE=1 pytest -q -p no:cacheprovider tests

The test suite checks:

  • role agents inheriting the ResearchHarness base agent
  • the Challenger -> Solver -> Judge workflow loop
  • CLI JSON output and single-agent entry points

🪪 License

This project is released under the MIT License.