Potato: The Portable Annotation Tool

June 8, 2026 · View on GitHub

Potato is a free, self-hosted annotation platform for NLP, Agentic, and GenAI research. Annotate text, audio, video, images, documents, agent traces, and more — configured entirely through YAML. No coding required.

Try the live demo on HuggingFace Spaces — no installation needed.

Quick Start

pip install potato-annotation
# The examples/ folder ships with the source repo (see "run from source" below).
# After a PyPI install, clone the repo for the examples, or point `potato start`
# at your own config (see docs/quick-start.md).
potato start examples/classification/single-choice/config.yaml -p 8000

Or run from source (recommended to get the examples/):

git clone https://github.com/davidjurgens/potato.git
cd potato && pip install -r requirements.txt
python potato/flask_server.py start examples/classification/single-choice/config.yaml -p 8000

Open http://localhost:8000 and start annotating. Browse the examples/ directory for ready-to-use templates.

What Can You Annotate?

Potato handles the full spectrum of annotation tasks — from traditional NLP labeling to evaluating the latest AI agent systems.

Data Types

Modality	Capabilities
Text	Classification, span labeling, entity linking, coreference, pairwise comparison (docs)
Agent Traces	Step-by-step evaluation of LLM agents, tool calls, ReAct chains, and multi-agent systems (docs)
Web Agents	Screenshot-based review with SVG click/scroll overlays, or live browsing with automatic trace recording (docs)
RAG Pipelines	Retrieval relevance, answer faithfulness, citation accuracy, hallucination detection
Audio	Waveform visualization, segment labeling, ELAN-style tiered annotation (docs)
Video	Frame-by-frame labeling, temporal segments, playback sync (docs)
Images	Bounding boxes, polygons, landmarks, classification (docs)
Dialogue	Turn-level annotation, conversation trees, interactive chat evaluation
Documents	PDF, Word, Markdown, code, and spreadsheets with coordinate mapping (docs)

Annotation Schemes

Scheme	Use Case
Radio / Checkbox / Likert	Classification, multi-label, rating scales
Span annotation	NER, highlighting, hallucination marking
Pairwise comparison	A/B testing, best-worst scaling
Per-step ratings	Evaluate individual agent actions or dialogue turns
Free text	Open-ended responses with validation
Triage	Rapid accept/reject/skip curation (docs)
Conditional logic	Adaptive forms that respond to prior answers (docs)

Agent & LLM Evaluation

Potato provides purpose-built tooling for evaluating AI agents at every level of granularity.

Trace Formats

Import traces from any major agent framework with the built-in converter:

python -m potato.trace_converter --input traces.json --input-format openai --output data.jsonl

Supported formats: OpenAI, Anthropic/Claude, ReAct, LangChain, LangFuse, WebArena, SWE-bench, OpenTelemetry, CrewAI/AutoGen/LangGraph, MCP, Aider, Claude Code, ATIF, SWE-Agent, and Web Agent. Auto-detection is available with --auto-detect.

Evaluation Levels

Level	What You Annotate	Example
Trajectory	Overall task success, efficiency, safety	"Did the agent complete the task?"
Step	Individual action correctness, reasoning quality	Per-turn Likert ratings on each agent step
Span	Specific text segments within agent output	Highlight hallucinated claims, factual errors
Comparison	Side-by-side A/B agent evaluation	"Which agent performed better?"

Web Agent Viewer

An interactive viewer for GUI agent traces — navigate step-by-step through screenshots with SVG overlays showing clicks, bounding boxes, mouse paths, and scroll actions. Annotators rate each step with inline controls while a filmstrip bar provides quick navigation.

Ready-to-Use Agent Examples

Example	What It Evaluates
agent-trace-evaluation	Text agent traces with MAST error taxonomy + hallucination spans
visual-agent-evaluation	GUI agents with screenshot grounding accuracy
agent-comparison	Side-by-side A/B agent comparison
rag-evaluation	RAG retrieval relevance and citation accuracy
openai-evaluation	OpenAI Chat API traces with tool calls
anthropic-evaluation	Claude messages with tool_use blocks
swebench-evaluation	Coding agents with patch correctness ratings
multi-agent-evaluation	Multi-agent coordination (CrewAI, AutoGen, LangGraph)
web-agent-review	Pre-recorded web traces with step-by-step overlay viewer
web-agent-creation	Live web browsing with automatic trace recording

AI-Powered Annotation

LLM Label Suggestions

Integrate any LLM provider to pre-annotate instances and suggest labels. Annotators review and correct — dramatically faster than labeling from scratch.

Supported backends: OpenAI, Anthropic, Ollama, vLLM, Gemini, HuggingFace, OpenRouter

Active Learning

Potato reorders your annotation queue based on model uncertainty so annotators label the most informative instances first. Supports uncertainty sampling, BADGE, BALD, diversity, and hybrid strategies (docs).

Solo Mode

A human-LLM collaborative workflow where the system learns from annotator feedback and progressively transitions to autonomous LLM labeling as agreement improves (docs).

Chat Assistant

An LLM-powered sidebar where annotators can ask questions about difficult instances. The AI provides guidance informed by your task description and annotation guidelines — helping annotators think through decisions without auto-labeling (docs).

Quality Control & Workflows

Quality Assurance

Feature	Description
Attention checks	Automatically inserted known-answer items to verify engagement
Gold standards	Track annotator accuracy against expert labels
Inter-annotator agreement	Krippendorff's alpha (general) and Cohen's kappa (step-level agent evaluation)
Training phase	Practice annotations with feedback before the real task
Behavioral tracking	Timing, click patterns, and annotation change history

Annotation Workflows

Workflow	Description
Multi-annotator	Multiple annotators per item with overlap control and agreement metrics
Adjudication	Expert review of annotator disagreements to produce gold labels (docs)
Solo mode	Human-LLM collaboration with progressive automation (docs)
Crowdsourcing	Prolific and MTurk integration with platform-specific auth (docs)
Triage	Rapid accept/reject/skip for data curation (docs)

Authentication & Deployment

Potato supports multiple authentication methods, from passwordless quick-start to enterprise SSO:

Method	Use Case
In-memory	Local development, quick studies
Password + file persistence	Team annotation with shared credential files (docs)
Database	Production deployments with SQLite or PostgreSQL (docs)
OAuth / SSO	Google, GitHub, or institutional OIDC login (docs)
Clerk	Managed authentication via Clerk.com (docs)
Passwordless	Low-stakes tasks where ease of access matters (docs)

Passwords are hashed with per-user PBKDF2-SHA256 salts. Admins can reset passwords via CLI (potato reset-password) or REST API. Self-service token-based reset is also available.

Example Projects

Ready-to-use templates organized by type in examples/:

Category	Examples
Classification	Radio, checkbox, Likert, slider, pairwise comparison
Span	NER, span linking, coreference, entity linking
Agent Traces	LLM agents, web agents, RAG, multi-agent, code agents
Audio	Waveform annotation, classification, ELAN-style tiered
Video	Frame-level labeling, temporal segments
Image	Bounding boxes, PDF/document annotation
Advanced	Solo mode, adjudication, quality control, conditional logic
AI-Assisted	LLM suggestions, Ollama integration
Custom Layouts	Content moderation, dialogue QA, medical review

Research Showcase

The Potato Showcase contains annotation projects from published research — sentiment analysis, dialogue evaluation, summarization, and more.

Documentation

Topic	Link
Quick Start	docs/quick-start.md
Configuration Reference	docs/configuration/configuration.md
Schema Gallery	docs/annotation-types/schemas_and_templates.md
Agent Trace Evaluation	docs/agent-evaluation/agent_traces.md
Web Agent Annotation	docs/agent-evaluation/web_agent_annotation.md
AI Support	docs/ai-intelligence/ai_support.md
Active Learning	docs/ai-intelligence/active_learning_guide.md
Solo Mode	docs/solo-mode/solo_mode.md
Quality Control	docs/workflow/quality_control.md
Password Management	docs/auth-users/password_management.md
SSO & OAuth	docs/auth-users/sso_authentication.md
Admin Dashboard	docs/administration/admin_dashboard.md
Crowdsourcing	docs/deployment/crowdsourcing.md
Export Formats	docs/data-export/export_formats.md
Full Documentation Index	docs/index.md

Development

# Run tests
pytest tests/ -v

# By category
pytest tests/unit/ -v        # Unit tests (fast)
pytest tests/server/ -v      # Integration tests
pytest tests/selenium/ -v    # Browser tests

# With coverage
pytest --cov=potato --cov-report=html

Support

Issues: GitHub Issues
Questions: jurgens@umich.edu
Docs: potatoannotator.readthedocs.io

License

Potato is free software, licensed under the GNU General Public License v3.0 or later (GPLv3+). You are free to use, study, modify, and redistribute it — including for commercial purposes — provided that any distributed derivative works are also licensed under the GPLv3+ and made available with their source code. See the LICENSE file for the full terms.

Citation

@inproceedings{jurgens2026potato,
  title={POTATO 2.0: A Comprehensive Annotation Platform\\with Support for AI-in-the-Loop and Agentic Systems},
  author={Jurgens, David and Chen, Michael and Iyer, Lina},
  booktitle={Proceedings of the The 64th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year={2026}
}