Potato: The Portable Annotation Tool

June 8, 2026


Potato is a free, self-hosted annotation platform for NLP, Agentic, and GenAI research. Annotate text, audio, video, images, documents, agent traces, and more — configured entirely through YAML. No coding required.

Try the live demo on HuggingFace Spaces — no installation needed.


Quick Start

pip install potato-annotation
# The examples/ folder ships with the source repo (see "run from source" below).
# After a PyPI install, clone the repo for the examples, or point `potato start`
# at your own config (see docs/quick-start.md).
potato start examples/classification/single-choice/config.yaml -p 8000

Or run from source (recommended to get the examples/):

git clone https://github.com/davidjurgens/potato.git
cd potato && pip install -r requirements.txt
python potato/flask_server.py start examples/classification/single-choice/config.yaml -p 8000

Open http://localhost:8000 and start annotating. Browse the examples/ directory for ready-to-use templates.


What Can You Annotate?

Potato handles the full spectrum of annotation tasks — from traditional NLP labeling to evaluating the latest AI agent systems.

Data Types

| Modality | Capabilities |
| --- | --- |
| Text | Classification, span labeling, entity linking, coreference, pairwise comparison (docs) |
| Agent Traces | Step-by-step evaluation of LLM agents, tool calls, ReAct chains, and multi-agent systems (docs) |
| Web Agents | Screenshot-based review with SVG click/scroll overlays, or live browsing with automatic trace recording (docs) |
| RAG Pipelines | Retrieval relevance, answer faithfulness, citation accuracy, hallucination detection |
| Audio | Waveform visualization, segment labeling, ELAN-style tiered annotation (docs) |
| Video | Frame-by-frame labeling, temporal segments, playback sync (docs) |
| Images | Bounding boxes, polygons, landmarks, classification (docs) |
| Dialogue | Turn-level annotation, conversation trees, interactive chat evaluation |
| Documents | PDF, Word, Markdown, code, and spreadsheets with coordinate mapping (docs) |

Annotation Schemes

| Scheme | Use Case |
| --- | --- |
| Radio / Checkbox / Likert | Classification, multi-label, rating scales |
| Span annotation | NER, highlighting, hallucination marking |
| Pairwise comparison | A/B testing, best-worst scaling |
| Per-step ratings | Evaluate individual agent actions or dialogue turns |
| Free text | Open-ended responses with validation |
| Triage | Rapid accept/reject/skip curation (docs) |
| Conditional logic | Adaptive forms that respond to prior answers (docs) |

Agent & LLM Evaluation

Potato provides purpose-built tooling for evaluating AI agents at every level of granularity.

Trace Formats

Import traces from any major agent framework with the built-in converter:

python -m potato.trace_converter --input traces.json --input-format openai --output data.jsonl

Supported formats: OpenAI, Anthropic/Claude, ReAct, LangChain, LangFuse, WebArena, SWE-bench, OpenTelemetry, CrewAI/AutoGen/LangGraph, MCP, Aider, Claude Code, ATIF, SWE-Agent, and Web Agent. Auto-detection is available with --auto-detect.

Evaluation Levels

| Level | What You Annotate | Example |
| --- | --- | --- |
| Trajectory | Overall task success, efficiency, safety | "Did the agent complete the task?" |
| Step | Individual action correctness, reasoning quality | Per-turn Likert ratings on each agent step |
| Span | Specific text segments within agent output | Highlight hallucinated claims, factual errors |
| Comparison | Side-by-side A/B agent evaluation | "Which agent performed better?" |

Web Agent Viewer

An interactive viewer for GUI agent traces — navigate step-by-step through screenshots with SVG overlays showing clicks, bounding boxes, mouse paths, and scroll actions. Annotators rate each step with inline controls while a filmstrip bar provides quick navigation.

Ready-to-Use Agent Examples

| Example | What It Evaluates |
| --- | --- |
| agent-trace-evaluation | Text agent traces with MAST error taxonomy + hallucination spans |
| visual-agent-evaluation | GUI agents with screenshot grounding accuracy |
| agent-comparison | Side-by-side A/B agent comparison |
| rag-evaluation | RAG retrieval relevance and citation accuracy |
| openai-evaluation | OpenAI Chat API traces with tool calls |
| anthropic-evaluation | Claude messages with tool_use blocks |
| swebench-evaluation | Coding agents with patch correctness ratings |
| multi-agent-evaluation | Multi-agent coordination (CrewAI, AutoGen, LangGraph) |
| web-agent-review | Pre-recorded web traces with step-by-step overlay viewer |
| web-agent-creation | Live web browsing with automatic trace recording |

AI-Powered Annotation

LLM Label Suggestions

Integrate any supported LLM provider to pre-annotate instances and suggest labels. Annotators review and correct the suggestions instead of labeling every instance from scratch.

Supported backends: OpenAI, Anthropic, Ollama, vLLM, Gemini, HuggingFace, OpenRouter

Active Learning

Potato reorders your annotation queue based on model uncertainty so annotators label the most informative instances first. Supports uncertainty sampling, BADGE, BALD, diversity, and hybrid strategies (docs).
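
The simplest of these strategies, uncertainty sampling, can be sketched in a few lines. This is an illustrative sketch only, not Potato's implementation: the function names, the item ids, and the choice of entropy as the uncertainty score are all assumptions made for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a class-probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def reorder_by_uncertainty(queue, predictions):
    """Return item ids sorted so the most uncertain predictions come first.

    queue: list of item ids (hypothetical).
    predictions: dict mapping item id -> list of class probabilities.
    """
    return sorted(queue, key=lambda item_id: entropy(predictions[item_id]), reverse=True)


# Hypothetical model outputs for three unlabeled items.
queue = ["a", "b", "c"]
predictions = {
    "a": [0.98, 0.02],  # model is confident
    "b": [0.55, 0.45],  # model is close to a coin flip
    "c": [0.80, 0.20],
}
ordered = reorder_by_uncertainty(queue, predictions)
print(ordered)  # ['b', 'c', 'a']
```

Item "b" moves to the front because its predicted distribution is closest to uniform, so a human label there is likely to tell the model the most.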

Solo Mode

A human-LLM collaborative workflow where the system learns from annotator feedback and progressively transitions to autonomous LLM labeling as agreement improves (docs).
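
One way to picture the hand-off is a rolling agreement check between human and LLM labels. The sketch below is a hypothetical illustration of that idea, not Potato's actual Solo Mode logic: the class name, window size, and threshold are all assumptions.

```python
from collections import deque


class AgreementTracker:
    """Track human/LLM agreement over the most recent comparisons (hypothetical helper)."""

    def __init__(self, window=20, threshold=0.9):
        self.window = window
        self.threshold = threshold
        # Only the last `window` comparisons are kept.
        self.recent = deque(maxlen=window)

    def record(self, human_label, llm_label):
        """Store whether the human and the LLM chose the same label."""
        self.recent.append(human_label == llm_label)

    def agreement(self):
        """Fraction of recent items on which both agreed (0.0 if nothing recorded)."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def llm_can_label_alone(self):
        """True once the window is full and agreement meets the threshold."""
        return len(self.recent) == self.window and self.agreement() >= self.threshold


tracker = AgreementTracker(window=5, threshold=0.75)
pairs = [
    ("pos", "pos"), ("neg", "pos"), ("neg", "neg"),
    ("pos", "pos"), ("neg", "neg"), ("pos", "pos"),
]
for human, llm in pairs:
    tracker.record(human, llm)

print(tracker.agreement(), tracker.llm_can_label_alone())
```

Requiring a full window before switching avoids handing control to the LLM after only a handful of lucky matches.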

Chat Assistant

An LLM-powered sidebar where annotators can ask questions about difficult instances. The AI provides guidance informed by your task description and annotation guidelines — helping annotators think through decisions without auto-labeling (docs).


Quality Control & Workflows

Quality Assurance

| Feature | Description |
| --- | --- |
| Attention checks | Automatically inserted known-answer items to verify engagement |
| Gold standards | Track annotator accuracy against expert labels |
| Inter-annotator agreement | Krippendorff's alpha (general) and Cohen's kappa (step-level agent evaluation) |
| Training phase | Practice annotations with feedback before the real task |
| Behavioral tracking | Timing, click patterns, and annotation change history |
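
To make the agreement metrics concrete, here is a minimal sketch of Cohen's kappa for two annotators, written from the standard formula (observed agreement corrected for chance agreement). It is not taken from Potato's code; the function name and the sample labels are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical step-level ratings from two annotators.
annotator_1 = ["good", "good", "bad", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.333
```

Here the annotators agree on 4 of 6 items (0.667), but half of that agreement is expected by chance, so kappa is only about 0.33.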

Annotation Workflows

| Workflow | Description |
| --- | --- |
| Multi-annotator | Multiple annotators per item with overlap control and agreement metrics |
| Adjudication | Expert review of annotator disagreements to produce gold labels (docs) |
| Solo mode | Human-LLM collaboration with progressive automation (docs) |
| Crowdsourcing | Prolific and MTurk integration with platform-specific auth (docs) |
| Triage | Rapid accept/reject/skip for data curation (docs) |

Authentication & Deployment

Potato supports multiple authentication methods, from passwordless quick-start to enterprise SSO:

| Method | Use Case |
| --- | --- |
| In-memory | Local development, quick studies |
| Password + file persistence | Team annotation with shared credential files (docs) |
| Database | Production deployments with SQLite or PostgreSQL (docs) |
| OAuth / SSO | Google, GitHub, or institutional OIDC login (docs) |
| Clerk | Managed authentication via Clerk.com (docs) |
| Passwordless | Low-stakes tasks where ease of access matters (docs) |

Passwords are hashed with PBKDF2-SHA256 using a separate salt for each user. Admins can reset passwords via the CLI (potato reset-password) or the REST API. Self-service token-based reset is also available.
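
For readers unfamiliar with salted PBKDF2 hashing, the general technique looks roughly like this using Python's standard library. This is a sketch of the technique only, not Potato's code: the function names, the 16-byte salt, and the iteration count are assumptions chosen for the example.

```python
import hashlib
import hmac
import os

# Assumed iteration count for illustration; a real deployment may use a different value.
ITERATIONS = 100_000


def hash_password(password, salt=None):
    """Derive a PBKDF2-SHA256 digest, generating a fresh random salt if none is given."""
    if salt is None:
        salt = os.urandom(16)  # a new salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong password", salt, digest))                # False
```

Because each user gets a different salt, two users with the same password end up with different stored digests.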


Example Projects

Ready-to-use templates organized by type in examples/:

| Category | Examples |
| --- | --- |
| Classification | Radio, checkbox, Likert, slider, pairwise comparison |
| Span | NER, span linking, coreference, entity linking |
| Agent Traces | LLM agents, web agents, RAG, multi-agent, code agents |
| Audio | Waveform annotation, classification, ELAN-style tiered |
| Video | Frame-level labeling, temporal segments |
| Image | Bounding boxes, PDF/document annotation |
| Advanced | Solo mode, adjudication, quality control, conditional logic |
| AI-Assisted | LLM suggestions, Ollama integration |
| Custom Layouts | Content moderation, dialogue QA, medical review |

Research Showcase

The Potato Showcase contains annotation projects from published research — sentiment analysis, dialogue evaluation, summarization, and more.


Documentation

| Topic | Link |
| --- | --- |
| Quick Start | docs/quick-start.md |
| Configuration Reference | docs/configuration/configuration.md |
| Schema Gallery | docs/annotation-types/schemas_and_templates.md |
| Agent Trace Evaluation | docs/agent-evaluation/agent_traces.md |
| Web Agent Annotation | docs/agent-evaluation/web_agent_annotation.md |
| AI Support | docs/ai-intelligence/ai_support.md |
| Active Learning | docs/ai-intelligence/active_learning_guide.md |
| Solo Mode | docs/solo-mode/solo_mode.md |
| Quality Control | docs/workflow/quality_control.md |
| Password Management | docs/auth-users/password_management.md |
| SSO & OAuth | docs/auth-users/sso_authentication.md |
| Admin Dashboard | docs/administration/admin_dashboard.md |
| Crowdsourcing | docs/deployment/crowdsourcing.md |
| Export Formats | docs/data-export/export_formats.md |
| Full Documentation Index | docs/index.md |

Development

# Run tests
pytest tests/ -v

# By category
pytest tests/unit/ -v        # Unit tests (fast)
pytest tests/server/ -v      # Integration tests
pytest tests/selenium/ -v    # Browser tests

# With coverage
pytest --cov=potato --cov-report=html

License

Potato is free software, licensed under the GNU General Public License v3.0 or later (GPLv3+). You are free to use, study, modify, and redistribute it — including for commercial purposes — provided that any distributed derivative works are also licensed under the GPLv3+ and made available with their source code. See the LICENSE file for the full terms.


Citation

@inproceedings{jurgens2026potato,
  title={POTATO 2.0: A Comprehensive Annotation Platform with Support for AI-in-the-Loop and Agentic Systems},
  author={Jurgens, David and Chen, Michael and Iyer, Lina},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year={2026}
}