Potato: The Portable Annotation Tool
June 8, 2026
Potato is a free, self-hosted annotation platform for NLP, agentic, and GenAI research. Annotate text, audio, video, images, documents, agent traces, and more, all configured through YAML. No coding required.
Try the live demo on HuggingFace Spaces — no installation needed.
Quick Start
pip install potato-annotation
# The examples/ folder ships with the source repo (see "run from source" below).
# After a PyPI install, clone the repo for the examples, or point `potato start`
# at your own config (see docs/quick-start.md).
potato start examples/classification/single-choice/config.yaml -p 8000
Or run from source (recommended to get the examples/):
git clone https://github.com/davidjurgens/potato.git
cd potato && pip install -r requirements.txt
python potato/flask_server.py start examples/classification/single-choice/config.yaml -p 8000
Open http://localhost:8000 and start annotating. Browse the examples/ directory for ready-to-use templates.
What Can You Annotate?
Potato handles the full spectrum of annotation tasks — from traditional NLP labeling to evaluating the latest AI agent systems.
Data Types
| Modality | Capabilities |
|---|---|
| Text | Classification, span labeling, entity linking, coreference, pairwise comparison (docs) |
| Agent Traces | Step-by-step evaluation of LLM agents, tool calls, ReAct chains, and multi-agent systems (docs) |
| Web Agents | Screenshot-based review with SVG click/scroll overlays, or live browsing with automatic trace recording (docs) |
| RAG Pipelines | Retrieval relevance, answer faithfulness, citation accuracy, hallucination detection |
| Audio | Waveform visualization, segment labeling, ELAN-style tiered annotation (docs) |
| Video | Frame-by-frame labeling, temporal segments, playback sync (docs) |
| Images | Bounding boxes, polygons, landmarks, classification (docs) |
| Dialogue | Turn-level annotation, conversation trees, interactive chat evaluation |
| Documents | PDF, Word, Markdown, code, and spreadsheets with coordinate mapping (docs) |
Annotation Schemes
| Scheme | Use Case |
|---|---|
| Radio / Checkbox / Likert | Classification, multi-label, rating scales |
| Span annotation | NER, highlighting, hallucination marking |
| Pairwise comparison | A/B testing, best-worst scaling |
| Per-step ratings | Evaluate individual agent actions or dialogue turns |
| Free text | Open-ended responses with validation |
| Triage | Rapid accept/reject/skip curation (docs) |
| Conditional logic | Adaptive forms that respond to prior answers (docs) |
Agent & LLM Evaluation
Potato provides purpose-built tooling for evaluating AI agents at every level of granularity.
Trace Formats
Import traces from any major agent framework with the built-in converter:
python -m potato.trace_converter --input traces.json --input-format openai --output data.jsonl
Supported formats: OpenAI, Anthropic/Claude, ReAct, LangChain, LangFuse, WebArena, SWE-bench, OpenTelemetry, CrewAI/AutoGen/LangGraph, MCP, Aider, Claude Code, ATIF, SWE-Agent, and Web Agent. Auto-detection is available with --auto-detect.
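Before loading converted traces into a task, it can help to sanity-check the output file. The sketch below assumes only that `data.jsonl` holds one JSON object per line (the JSON Lines convention implied by the file extension). It makes no assumption about which fields the converter emits, and `summarize_jsonl` is a hypothetical helper, not part of Potato.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path):
    """Return (record_count, Counter of top-level keys) for a JSON Lines file."""
    key_counts = Counter()
    n_records = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)  # raises ValueError on a malformed line
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
    return n_records, key_counts
```

Calling `summarize_jsonl("data.jsonl")` returns the number of records and how often each top-level key appears, which makes missing or inconsistently named fields easy to spot.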
Evaluation Levels
| Level | What You Annotate | Example |
|---|---|---|
| Trajectory | Overall task success, efficiency, safety | "Did the agent complete the task?" |
| Step | Individual action correctness, reasoning quality | Per-turn Likert ratings on each agent step |
| Span | Specific text segments within agent output | Highlight hallucinated claims, factual errors |
| Comparison | Side-by-side A/B agent evaluation | "Which agent performed better?" |
Web Agent Viewer
An interactive viewer for GUI agent traces — navigate step-by-step through screenshots with SVG overlays showing clicks, bounding boxes, mouse paths, and scroll actions. Annotators rate each step with inline controls while a filmstrip bar provides quick navigation.
Ready-to-Use Agent Examples
| Example | What It Evaluates |
|---|---|
| agent-trace-evaluation | Text agent traces with MAST error taxonomy + hallucination spans |
| visual-agent-evaluation | GUI agents with screenshot grounding accuracy |
| agent-comparison | Side-by-side A/B agent comparison |
| rag-evaluation | RAG retrieval relevance and citation accuracy |
| openai-evaluation | OpenAI Chat API traces with tool calls |
| anthropic-evaluation | Claude messages with tool_use blocks |
| swebench-evaluation | Coding agents with patch correctness ratings |
| multi-agent-evaluation | Multi-agent coordination (CrewAI, AutoGen, LangGraph) |
| web-agent-review | Pre-recorded web traces with step-by-step overlay viewer |
| web-agent-creation | Live web browsing with automatic trace recording |
AI-Powered Annotation
LLM Label Suggestions
Integrate any LLM provider to pre-annotate instances and suggest labels. Annotators review and correct the suggestions instead of labeling every instance from scratch.
Supported backends: OpenAI, Anthropic, Ollama, vLLM, Gemini, HuggingFace, OpenRouter
Active Learning
Potato reorders your annotation queue based on model uncertainty so annotators label the most informative instances first. Supports uncertainty sampling, BADGE, BALD, diversity, and hybrid strategies (docs).
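To illustrate the idea behind uncertainty sampling, here is a minimal, generic sketch that ranks instances by the entropy of a model's predicted label distribution. It is not Potato's implementation; the function names and the example probabilities are made up for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rank_by_uncertainty(predictions):
    """Order instance ids from most to least uncertain.

    `predictions` maps an instance id to the model's class probabilities.
    """
    return sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )


# Hypothetical model outputs for three unlabeled instances.
predictions = {
    "doc-1": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "doc-2": [0.40, 0.35, 0.25],  # near-uniform, high entropy
    "doc-3": [0.70, 0.20, 0.10],
}
queue = rank_by_uncertainty(predictions)
print(queue)  # ['doc-2', 'doc-3', 'doc-1']
```

The near-uniform prediction moves to the front of the queue because a label for it tells the model the most.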
Solo Mode
A human-LLM collaborative workflow where the system learns from annotator feedback and progressively transitions to autonomous LLM labeling as agreement improves (docs).
Chat Assistant
An LLM-powered sidebar where annotators can ask questions about difficult instances. The AI provides guidance informed by your task description and annotation guidelines — helping annotators think through decisions without auto-labeling (docs).
Quality Control & Workflows
Quality Assurance
| Feature | Description |
|---|---|
| Attention checks | Automatically inserted known-answer items to verify engagement |
| Gold standards | Track annotator accuracy against expert labels |
| Inter-annotator agreement | Krippendorff's alpha (general) and Cohen's kappa (step-level agent evaluation) |
| Training phase | Practice annotations with feedback before the real task |
| Behavioral tracking | Timing, click patterns, and annotation change history |
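As a reference point for the agreement metrics in the table above, the following sketch computes Cohen's kappa for two annotators from first principles. It follows the standard textbook definition (observed agreement corrected for chance agreement) and is not taken from Potato's code; the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["ok", "ok", "bad", "ok", "bad", "bad"]
annotator_b = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.333
```

Here the annotators agree on 4 of 6 items (0.667), chance agreement is 0.5, and kappa is (0.667 - 0.5) / (1 - 0.5) = 0.333.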
Annotation Workflows
| Workflow | Description |
|---|---|
| Multi-annotator | Multiple annotators per item with overlap control and agreement metrics |
| Adjudication | Expert review of annotator disagreements to produce gold labels (docs) |
| Solo mode | Human-LLM collaboration with progressive automation (docs) |
| Crowdsourcing | Prolific and MTurk integration with platform-specific auth (docs) |
| Triage | Rapid accept/reject/skip for data curation (docs) |
Authentication & Deployment
Potato supports multiple authentication methods, from passwordless quick-start to enterprise SSO:
| Method | Use Case |
|---|---|
| In-memory | Local development, quick studies |
| Password + file persistence | Team annotation with shared credential files (docs) |
| Database | Production deployments with SQLite or PostgreSQL (docs) |
| OAuth / SSO | Google, GitHub, or institutional OIDC login (docs) |
| Clerk | Managed authentication via Clerk.com (docs) |
| Passwordless | Low-stakes tasks where ease of access matters (docs) |
Passwords are hashed with PBKDF2-SHA256 using a unique salt per user. Admins can reset passwords via the CLI (potato reset-password) or the REST API. Self-service token-based reset is also available.
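For readers unfamiliar with salted PBKDF2-SHA256, this is a minimal sketch of the general technique using Python's standard library. The iteration count, salt length, and function names are assumptions chosen for illustration and do not reflect Potato's actual settings or storage format.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # assumed work factor, not Potato's setting
SALT_BYTES = 16       # assumed salt length


def hash_password(password, salt=None):
    """Return (salt, digest); a fresh random salt is drawn when none is given."""
    if salt is None:
        salt = secrets.token_bytes(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)
```

Because each user gets an independent salt, two users with the same password end up with different stored digests, so a precomputed table cannot be reused across accounts.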
Example Projects
Ready-to-use templates organized by type in examples/:
| Category | Examples |
|---|---|
| Classification | Radio, checkbox, Likert, slider, pairwise comparison |
| Span | NER, span linking, coreference, entity linking |
| Agent Traces | LLM agents, web agents, RAG, multi-agent, code agents |
| Audio | Waveform annotation, classification, ELAN-style tiered |
| Video | Frame-level labeling, temporal segments |
| Image | Bounding boxes, PDF/document annotation |
| Advanced | Solo mode, adjudication, quality control, conditional logic |
| AI-Assisted | LLM suggestions, Ollama integration |
| Custom Layouts | Content moderation, dialogue QA, medical review |
Research Showcase
The Potato Showcase contains annotation projects from published research — sentiment analysis, dialogue evaluation, summarization, and more.
Documentation
| Topic | Link |
|---|---|
| Quick Start | docs/quick-start.md |
| Configuration Reference | docs/configuration/configuration.md |
| Schema Gallery | docs/annotation-types/schemas_and_templates.md |
| Agent Trace Evaluation | docs/agent-evaluation/agent_traces.md |
| Web Agent Annotation | docs/agent-evaluation/web_agent_annotation.md |
| AI Support | docs/ai-intelligence/ai_support.md |
| Active Learning | docs/ai-intelligence/active_learning_guide.md |
| Solo Mode | docs/solo-mode/solo_mode.md |
| Quality Control | docs/workflow/quality_control.md |
| Password Management | docs/auth-users/password_management.md |
| SSO & OAuth | docs/auth-users/sso_authentication.md |
| Admin Dashboard | docs/administration/admin_dashboard.md |
| Crowdsourcing | docs/deployment/crowdsourcing.md |
| Export Formats | docs/data-export/export_formats.md |
| Full Documentation Index | docs/index.md |
Development
# Run tests
pytest tests/ -v
# By category
pytest tests/unit/ -v # Unit tests (fast)
pytest tests/server/ -v # Integration tests
pytest tests/selenium/ -v # Browser tests
# With coverage
pytest --cov=potato --cov-report=html
Support
- Issues: GitHub Issues
- Questions: jurgens@umich.edu
- Docs: potatoannotator.readthedocs.io
License
Potato is free software, licensed under the GNU General Public License v3.0 or later (GPLv3+). You are free to use, study, modify, and redistribute it — including for commercial purposes — provided that any distributed derivative works are also licensed under the GPLv3+ and made available with their source code. See the LICENSE file for the full terms.
Citation
@inproceedings{jurgens2026potato,
title={POTATO 2.0: A Comprehensive Annotation Platform with Support for AI-in-the-Loop and Agentic Systems},
author={Jurgens, David and Chen, Michael and Iyer, Lina},
booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
year={2026}
}