Chapter 8: Production Deployment and Advanced Patterns
April 13, 2026 · View on GitHub
What Problem Does This Solve?
The make dev local setup is designed for a single developer with direct host access. Production deployments face different requirements:
- Multi-user access: multiple users submitting research tasks concurrently
- State persistence: thread history and memory must survive service restarts
- Security: untrusted user inputs must not escape the sandbox; the API must require authentication
- Observability: you need to know when agent runs fail, which LLM calls are expensive, and where latency is concentrated
- Resource control: long-running research jobs must not starve other users
- Reliability: the system must recover from sandbox container crashes, LLM API rate limits, and network failures
This chapter covers each of these production concerns with concrete configuration examples.
How it Works Under the Hood
Production Architecture
graph TB
subgraph "Load Balancer / Auth"
LB[HTTPS Load Balancer<br/>TLS termination<br/>Authentication gateway]
end
subgraph "Application Layer"
N[Nginx :2026<br/>Internal routing]
LG[LangGraph Server :2024<br/>AioSandboxProvider<br/>Postgres checkpointer]
GW[Gateway API :8001<br/>FastAPI]
FE[Next.js :3000]
end
subgraph "Sandbox Layer"
SB1[Docker Sandbox<br/>Thread 1]
SB2[Docker Sandbox<br/>Thread 2]
SBN[Docker Sandbox<br/>Thread N]
end
subgraph "Persistence Layer"
PG[PostgreSQL<br/>Thread state + checkpoints]
FS[Shared Filesystem / S3<br/>Artifacts, uploads, outputs]
end
subgraph "Observability"
LS[LangSmith / Langfuse<br/>LLM call tracing]
LOG[Centralized Logging<br/>stdout → aggregator]
end
LB --> N
N --> LG
N --> GW
N --> FE
LG --> SB1
LG --> SB2
LG --> SBN
LG --> PG
GW --> PG
LG --> FS
GW --> FS
LG --> LS
Deployment Sizing
ByteDance documents three deployment profiles:
| Profile | CPU | RAM | Disk | Use Case |
|---|---|---|---|---|
| Local eval | 4 vCPU | 8 GB | 20 GB SSD | Single developer testing |
| Docker dev | 4 vCPU | 8 GB | 25 GB SSD | Team dev environment |
| Production server | 8–16 vCPU | 16–32 GB | 40+ GB SSD | Multi-user production |
The large RAM requirement comes from:
- LangGraph server keeping active thread states in memory
- Docker sandbox containers (each uses ~256 MB–1 GB)
- LLM context windows being processed (long research contexts can use GB of memory during inference)
Production Docker Compose
A production-ready Docker Compose configuration:
# docker-compose.prod.yml
version: "3.9"
services:
nginx:
image: nginx:1.25-alpine
ports:
- "2026:2026"
volumes:
- ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
depends_on:
- langgraph
- gateway
- frontend
restart: unless-stopped
langgraph:
build:
context: ./backend
dockerfile: Dockerfile
environment:
- LANGGRAPH_POSTGRES_URI=postgresql://deerflow:${DB_PASSWORD}@postgres:5432/deerflow
- DEER_FLOW_CONFIG_PATH=/app/config.yaml
- LANGCHAIN_TRACING_V2=true
- LANGCHAIN_API_KEY=${LANGSMITH_API_KEY}
- LANGCHAIN_PROJECT=deer-flow-production
volumes:
- ./config.yaml:/app/config.yaml:ro
- user_data:/mnt/user-data
- ./skills:/mnt/skills:ro
depends_on:
postgres:
condition: service_healthy
restart: unless-stopped
deploy:
resources:
limits:
cpus: "8"
memory: 16G
gateway:
build:
context: ./backend
dockerfile: Dockerfile
target: gateway
environment:
- DEER_FLOW_CONFIG_PATH=/app/config.yaml
volumes:
- ./config.yaml:/app/config.yaml:ro
- user_data:/mnt/user-data
- ./extensions_config.json:/app/extensions_config.json
restart: unless-stopped
frontend:
build:
context: ./frontend
dockerfile: Dockerfile
environment:
- NEXT_PUBLIC_API_URL=http://nginx:2026
restart: unless-stopped
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: deerflow
POSTGRES_USER: deerflow
POSTGRES_PASSWORD: ${DB_PASSWORD}
volumes:
- postgres_data:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U deerflow"]
interval: 10s
timeout: 5s
retries: 5
restart: unless-stopped
sandbox:
image: bytedance/deer-flow-sandbox:latest
volumes:
- user_data:/mnt/user-data
deploy:
replicas: 4 # Pre-warm 4 sandbox containers
resources:
limits:
cpus: "2"
memory: 2G
volumes:
postgres_data:
user_data:
Postgres Checkpointer
The development setup uses SQLite for checkpointing. Production must use Postgres for:
- Persistence across service restarts
- Concurrent access from multiple LangGraph server instances
- Efficient state queries for thread history
# backend/packages/harness/deerflow/agents/checkpointer/async_provider.py
import os
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
def make_checkpointer():
postgres_uri = os.environ.get("LANGGRAPH_POSTGRES_URI")
if postgres_uri:
# Production: use Postgres
return AsyncPostgresSaver.from_conn_string(postgres_uri)
else:
# Development: fall back to SQLite
from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver
return AsyncSqliteSaver.from_conn_string("./checkpoints.db")
Gateway Mode: Eliminating LangGraph Platform Dependency
DeerFlow's "gateway mode" embeds the agent runtime inside the FastAPI Gateway API, reducing the process count from 4 to 3 and removing the LangGraph Platform server:
# backend/app/gateway/app.py (gateway mode)
# In gateway mode, the FastAPI app hosts the agent runtime directly
from deerflow.agents import make_lead_agent
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
app = FastAPI(title="DeerFlow Gateway + Agent Runtime")
@app.on_event("startup")
async def startup():
checkpointer = AsyncPostgresSaver.from_conn_string(settings.POSTGRES_URI)
app.state.agent = make_lead_agent()
@app.post("/threads/{thread_id}/runs")
async def run_agent(thread_id: str, request: RunRequest):
"""SSE streaming endpoint for agent execution."""
async def generate():
async for event in app.state.agent.astream_events(
request.input,
config={"configurable": {"thread_id": thread_id}},
version="v2",
):
yield f"data: {json.dumps(event)}\n\n"
return StreamingResponse(generate(), media_type="text/event-stream")
Enable gateway mode in the environment:
DEERFLOW_GATEWAY_MODE=true
Gateway mode is marked experimental but is the recommended approach for deployments that want to avoid the LangGraph Platform licensing and process overhead.
LangSmith Observability
LangSmith provides per-call tracing for all LLM invocations and tool calls:
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=ls__...
LANGCHAIN_PROJECT=deer-flow-production
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com # default
With tracing enabled, every research run creates a trace with:
- Full message history at each step
- Tool call inputs and outputs
- Token counts and costs per LLM call
- Latency at each node
- Error stacks for failed runs
graph LR
A[LangGraph Agent Run] --> B[LangSmith Trace]
B --> C[Run overview<br/>total tokens, cost, latency]
B --> D[Per-step spans<br/>each LLM call + tool call]
B --> E[Input/output at each step]
B --> F[Error details if failed]
Langfuse: Open-Source Alternative
For teams that cannot send data to LangSmith (data residency requirements), Langfuse is an open-source observability alternative:
# .env
LANGFUSE_HOST=https://cloud.langfuse.com # or self-hosted
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
# backend/packages/harness/deerflow/agents/lead_agent/agent.py
# Langfuse integration (conceptual — actual config is via env vars)
from langfuse.callback import CallbackHandler
langfuse_handler = CallbackHandler()
# Injected into agent config:
config = {
"callbacks": [langfuse_handler],
"configurable": {"thread_id": thread_id},
}
Security Hardening
1. Sandbox isolation — always use AioSandboxProvider (Docker) in production:
# config.yaml
sandbox:
use: deerflow.community.aio_sandbox:AioSandboxProvider
allow_host_bash: false # NEVER true in production
auto_start: true
container_prefix: deer-flow-sandbox
The sandbox provides:
- Filesystem isolation (agent code cannot reach host files outside
/mnt/user-data) - Network isolation (configurable — default allows outbound for web search)
- Resource limits per container (CPU and memory caps)
2. Authentication — DeerFlow defaults to no auth. Add authentication at Nginx:
# nginx.conf (production auth example using OAuth2 Proxy)
location / {
auth_request /oauth2/auth;
error_page 401 = /oauth2/sign_in;
proxy_pass http://frontend:3000;
}
location /oauth2/ {
proxy_pass http://oauth2-proxy:4180;
}
Or use the built-in better-auth integration in the frontend:
// frontend/src/server/better-auth/config.ts
import { betterAuth } from "better-auth";
import { prismaAdapter } from "better-auth/adapters/prisma";
export const auth = betterAuth({
database: prismaAdapter(prisma, { provider: "postgresql" }),
socialProviders: {
github: {
clientId: process.env.GITHUB_CLIENT_ID!,
clientSecret: process.env.GITHUB_CLIENT_SECRET!,
},
google: {
clientId: process.env.GOOGLE_CLIENT_ID!,
clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
},
},
});
3. Network exposure — DeerFlow's documentation states:
"DeerFlow is designed for local trusted deployments. Untrusted network exposure requires IP allowlists, authentication gateways, and network isolation to prevent unauthorized agent invocation and potential abuse."
Never expose the LangGraph server port (:2024) or Gateway API port (:8001) directly to the internet. Route all traffic through Nginx or a load balancer with authentication.
Health Checks and Monitoring
Configure health checks for each service:
# docker-compose.prod.yml health checks
services:
langgraph:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:2024/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
gateway:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8001/health"]
interval: 30s
timeout: 5s
retries: 3
The make doctor script can also be run as a scheduled health check:
# Cron: run doctor every 5 minutes, alert if it fails
*/5 * * * * cd /opt/deer-flow && make doctor >> /var/log/deerflow-health.log 2>&1
Recovery Patterns
LLM API Rate Limits:
DeerFlow's LLMErrorHandlingMiddleware handles rate limit responses from LLM providers with exponential backoff. No additional configuration is needed, but you should monitor LangSmith/Langfuse for runs that are slow due to rate limit retries.
Sandbox Container Crashes:
If a Docker sandbox container crashes mid-execution:
SandboxMiddlewaredetects the missing container ID inThreadState.sandbox- A new container is provisioned for the thread on the next invocation
- The agent resumes from the last checkpoint (previous messages are intact)
- Files in the workspace directory persist (they are on the shared volume, not inside the container)
LangGraph Server Restart:
With Postgres checkpointer:
- All thread states are persisted and immediately available after restart
- In-flight runs at the time of restart are marked as failed
- Users can resubmit their last message to resume from the last saved checkpoint
Disk Full:
Generated artifacts (MP3s, PDFs, slides) accumulate on the shared volume. Implement periodic cleanup:
# scripts/cleanup_old_threads.py
import os
import shutil
from datetime import datetime, timedelta
from pathlib import Path
OUTPUTS_DIR = Path("/mnt/user-data/outputs")
RETENTION_DAYS = 30
def cleanup_old_outputs():
cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
for thread_dir in OUTPUTS_DIR.iterdir():
if thread_dir.is_dir():
mtime = datetime.fromtimestamp(thread_dir.stat().st_mtime)
if mtime < cutoff:
shutil.rmtree(thread_dir)
print(f"Deleted: {thread_dir}")
if __name__ == "__main__":
cleanup_old_outputs()
Advanced Pattern: Agent Guardrails
For deployments where you need to restrict what the agent can do (e.g., corporate environments):
# backend/docs/GUARDRAILS.md describes this pattern
# Guardrails are implemented as middleware that intercepts tool calls
class ContentGuardrailMiddleware(BaseMiddleware):
"""Block tool calls to specific domains or with specific patterns."""
BLOCKED_DOMAINS = {"competitor.com", "internal.company.com"}
async def before_tool_call(
self,
tool_name: str,
tool_input: dict,
state: ThreadState,
) -> dict | None:
"""Return None to allow, return error dict to block."""
if tool_name in ("web_search", "web_fetch"):
url = tool_input.get("url", "")
query = tool_input.get("query", "")
for domain in self.BLOCKED_DOMAINS:
if domain in url or domain in query:
return {"error": f"Access to {domain} is restricted by policy."}
return None # Allow the tool call
Cost Management
LLM API costs scale with:
- Number of concurrent users
- Research depth (sub-agent count and search iterations)
- Model selection (o3 vs. gpt-4o-mini cost ratio can be 100x)
Production cost controls:
# config.yaml — cost-efficient defaults with premium model available
models:
- name: gpt-4o-mini
display_name: GPT-4o Mini (Default)
use: langchain_openai:ChatOpenAI
model: gpt-4o-mini
api_key: $OPENAI_API_KEY
- name: o3-mini
display_name: o3 Mini (Deep Research)
use: langchain_openai:ChatOpenAI
model: o3-mini
api_key: $OPENAI_API_KEY
supports_thinking: true
# Per-agent config: limit sub-agent parallelism to control costs
# workspace/agents/lead_agent/config.yaml
subagent:
max_concurrent: 2 # Reduce from default 3 to limit parallel LLM calls
Summary
Production DeerFlow deployment requires:
- Postgres checkpointer for persistent thread state
- AioSandboxProvider (Docker) for isolated code execution
- Authentication at Nginx or via better-auth
- LangSmith or Langfuse for observability
- Shared filesystem volume for artifacts persistence across service restarts
- Resource sizing based on concurrent user count and research depth
- Periodic cleanup for accumulated artifacts
- Network isolation preventing direct exposure of internal service ports
The gateway mode (experimental) simplifies the architecture by embedding the agent runtime in the FastAPI server, reducing the process count and removing the LangGraph Platform dependency.
Tutorial Complete
You have now covered the full DeerFlow system:
- Chapter 1: Installation, configuration, and first research query
- Chapter 2: LangGraph state machine, 14-stage middleware pipeline, async checkpointing
- Chapter 3: Research pipeline — CLARIFY → PLAN → ACT, deep research skill, citations
- Chapter 4: RAG and search tools — DuckDuckGo, Tavily, Exa, Firecrawl, sandbox REPL, MCP
- Chapter 5: Three-service architecture, SSE streaming, Gateway API, IM channels
- Chapter 6: Skills system, custom tools, MCP servers, per-agent config overrides
- Chapter 7: Podcast generation, PowerPoint, charts, image and video generation
- Chapter 8: Production deployment, Postgres checkpointer, security, observability, cost management
Further Resources
- DeerFlow GitHub Repository
- Architecture Documentation
- Configuration Reference
- Skills Library
- Contributing Guide