Trace Stack - Local Observability Infrastructure

December 22, 2025

Distributed tracing, metrics, and log aggregation stack for local development.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Refly API (tmux)                         │
│                                                                 │
│  OpenTelemetry SDK  →  Metrics + Traces                       │
│  stdout/stderr      →  Logs (via tmux pipe-pane)              │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ↓                     ↓                     ↓
┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ OTEL Collector│    │ OTEL Collector│    │     Alloy     │
│  (metrics)    │    │   (traces)    │    │  (log files)  │
│ Port 8889     │    └───────────────┘    └───────────────┘
└───────────────┘            │                     │
        │                     ↓                     ↓
        ↓            ┌───────────────┐    ┌───────────────┐
┌───────────────┐   │     Tempo     │    │     Loki      │
│  Prometheus   │←──│  (port 33200) │    │  (port 33100) │
│  (port 39090) │   └───────────────┘    └───────────────┘
└───────────────┘            │                     │
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              ↓
                      ┌───────────────┐
                      │    Grafana    │
                      │  (port 33000) │
                      └───────────────┘
                 Exemplars: Metrics → Traces

Components

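| Service        | Image                                       | Port                               | Purpose                       |
|----------------|---------------------------------------------|------------------------------------|-------------------------------|
| Grafana        | grafana/grafana:12.3.0                      | 33000                              | Visualization and exploration |
| Prometheus     | prom/prometheus:v2.48.0                     | 39090                              | Metrics storage and query     |
| Tempo          | grafana/tempo:2.3.0                         | 33200                              | Distributed tracing           |
| Loki           | grafana/loki:2.9.0                          | 33100                              | Log aggregation               |
| OTEL Collector | otel/opentelemetry-collector-contrib:0.91.0 | 34318 (OTLP), 38889 (Prometheus)   | OTLP receiver and exporters   |
| Alloy          | grafana/alloy:v1.5.1                        | 12345                              | Log collection from files     |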

Quick Start

1. Start the trace stack

cd deploy/docker/trace
docker-compose up -d

2. Enable log collection (one-time setup)

Configure tmux to pipe refly-api logs to a file:

# Create log directory
mkdir -p /tmp/alloy/logs/refly

# Enable tmux pipe-pane (appends to file)
tmux pipe-pane -o -t refly-api "cat >> /tmp/alloy/logs/refly/api.log"

# Verify pipe-pane is active
tmux display-message -p -t refly-api '#{session_name} (pipe-pane #{?#{pane_pipe},active,inactive})'
# Expected: "refly-api (pipe-pane active)"

Note: The pipe stays active for as long as the refly-api pane exists. Run this once per refly-api session, and again whenever the session is recreated.
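Because `-o` makes `pipe-pane` behave as a toggle (per the tmux manual, a new pipe is opened only when none exists), running the enable command a second time closes the pipe instead of keeping it open. A minimal sketch of a guarded version follows; the `pipe_action` and `enable_pipe` helper names are illustrative and not part of the repository.

```bash
# Sketch: open the log pipe only if the pane has none, so re-running is safe.
# SESSION and LOG_FILE default to the values used in this README.
SESSION="${SESSION:-refly-api}"
LOG_FILE="${LOG_FILE:-/tmp/alloy/logs/refly/api.log}"

# Map the value of tmux's #{pane_pipe} ("1" = pipe open) to an action.
pipe_action() {
  if [ "$1" = "1" ]; then echo "skip"; else echo "enable"; fi
}

# Enable the pipe when needed. Requires a running tmux session named $SESSION.
enable_pipe() {
  state="$(tmux display-message -p -t "$SESSION" '#{pane_pipe}')" || return 1
  if [ "$(pipe_action "$state")" = "enable" ]; then
    mkdir -p "$(dirname "$LOG_FILE")"
    # No -o here: the guard above already ensures no pipe is open.
    tmux pipe-pane -t "$SESSION" "cat >> '$LOG_FILE'"
  fi
}

# Usage (from a shell that can reach the tmux server):
#   enable_pipe
```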

3. Verify data flow

Wait 10-15 seconds for initial data collection, then check:

# Check Prometheus metrics
curl -s http://localhost:39090/api/v1/label/__name__/values | jq '.data[:5]'

# Check Tempo traces
curl -s http://localhost:33200/api/search | jq '.traces[:3]'

# Check Loki logs
curl -s http://localhost:33100/loki/api/v1/label | jq '.data'
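The fixed wait can be replaced by polling until a service answers. The following is a sketch of a generic retry helper (the `retry` name and its arguments are illustrative); the usage line assumes `curl` is installed and uses Tempo's `/ready` endpoint on the host port listed in this README.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs are used up.
# The ( ... ) body runs in a subshell, so the variables stay local.
retry() (
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
)

# Usage: wait up to roughly 30 seconds for Tempo to report ready.
#   retry 15 2 curl -sf -o /dev/null http://localhost:33200/ready
```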

4. Access Grafana

Open http://localhost:33000 in your browser.

Default credentials:

  • Username: admin
  • Password: admin

Or use anonymous access (enabled by default with Admin role).

Data Collection

Metrics (via OpenTelemetry)

The refly-api automatically exports metrics via OTLP.

Available metrics:

Business Metrics (Level 1 - Core):

  • llm.invocation.count{status, model_name} - LLM invocation count
  • llm.invocation.duration{model_name} - LLM invocation latency
  • llm.token.count{token_type, model_name} - Token consumption
  • tool.invocation.count{tool_name, toolset_key, status} - Tool invocation count
  • tool.execution.duration{tool_name, toolset_key} - Tool execution latency

Infrastructure Metrics (Level 0):

  • db.query.duration{operation, model} - Database query latency
  • db.slow_query.count{operation, model} - Slow query counter (>100ms)
  • db.query.count{operation, model} - Total query counter
  • http_server_duration{http_route, http_method} - HTTP request duration
  • process_cpu_seconds_total - CPU usage
  • process_resident_memory_bytes - Memory usage
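Several metric names above use OpenTelemetry's dotted form, while the PromQL examples in this README use underscores and a `_total` suffix on counters (for example, `llm.invocation.count` is queried as `llm_invocation_count_total`). A rough sketch of that mapping follows, with `otel_to_prom` as a hypothetical helper name. The real translation is performed by the OTEL Collector's Prometheus exporter and may also append unit suffixes depending on its configuration.

```bash
# otel_to_prom NAME [TYPE]
# Approximates the Prometheus name for an OpenTelemetry metric:
# dots become underscores, and counters get a _total suffix.
otel_to_prom() {
  name="$(printf '%s' "$1" | tr '.' '_')"
  if [ "${2:-}" = "counter" ]; then
    case "$name" in
      *_total) ;;                    # already suffixed
      *) name="${name}_total" ;;
    esac
  fi
  printf '%s\n' "$name"
}

otel_to_prom llm.invocation.count counter   # llm_invocation_count_total
otel_to_prom db.query.duration histogram    # db_query_duration
```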

Exemplars: Prometheus scrapes metrics with sampled traceIds. Click exemplar points in Grafana to jump to the corresponding trace in Tempo.

Traces (via OpenTelemetry)

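Distributed traces are collected automatically: the OpenTelemetry SDK in refly-api exports spans over OTLP to the OTEL Collector, which forwards them to Tempo.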

Logs (via Alloy)

Alloy collects logs from the refly-api tmux session:

Prerequisites:

  1. Log directory: /tmp/alloy/logs/refly/ must exist
  2. Tmux pipe-pane: Must be enabled for refly-api session
  3. Alloy container: Must be running

Data flow:

refly-api stdout → tmux pipe-pane → /tmp/alloy/logs/refly/api.log
                                            ↓
                                    Alloy container
                                  (mounts /tmp/alloy/logs)
                                            ↓
                                     Parses Pino JSON
                                     Extracts level label
                                            ↓
                                     Loki (7-day retention)

Log labels:

  • job: "refly-api" (fixed)
  • env: "local" (fixed)
  • level: trace/debug/info/warn/error/fatal (extracted from Pino JSON)
  • filename: "/tmp/alloy/logs/refly/api.log" (auto)
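Pino writes `level` as a number by default (10 through 60), so the named values of the `level` label imply a number-to-name mapping somewhere in the pipeline. Whether that happens in the refly-api logger configuration or in the Alloy pipeline is not stated in this README; the sketch below only illustrates the mapping itself, using Pino's default level values. `pino_level_name` is a hypothetical helper.

```bash
# pino_level_name LEVEL
# Translates Pino's default numeric levels to their names.
pino_level_name() {
  case "$1" in
    10) echo trace ;;
    20) echo debug ;;
    30) echo info ;;
    40) echo warn ;;
    50) echo error ;;
    60) echo fatal ;;
    *)  echo unknown ;;
  esac
}

# Usage with jq (assumes the most recent log line is Pino JSON):
#   pino_level_name "$(tail -n 1 /tmp/alloy/logs/refly/api.log | jq -r '.level')"
```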

Querying in Grafana

Metrics (Prometheus)

Navigate to Explore → Select Prometheus datasource:

# Database query rate by model (table)
sum by (model) (rate(db_query_count_total[5m]))

# P99 query latency
histogram_quantile(0.99, sum by (le) (rate(db_query_duration_bucket[5m])))

# HTTP request rate by route
sum by (http_route) (rate(http_server_duration_count[1m]))

# LLM invocation rate by model
sum by (model_name) (rate(llm_invocation_count_total[5m]))

# Tool success rate
sum(rate(tool_invocation_count_total{status="success"}[5m]))
  / sum(rate(tool_invocation_count_total[5m]))

Using Exemplars:

  1. Query metrics (e.g., rate(llm_invocation_count_total[5m]))
  2. Look for small circular points on the graph
  3. Click an exemplar point → Select "View Trace"
  4. Grafana opens the corresponding trace in Tempo
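Exemplars can also be inspected outside Grafana through Prometheus's `query_exemplars` HTTP endpoint. The following is a sketch assuming `curl` and `jq` are available; `window_start` and `fetch_exemplars` are illustrative helper names, and an empty result simply means no exemplars were stored in the chosen window.

```bash
PROM_URL="${PROM_URL:-http://localhost:39090}"

# window_start END_EPOCH MINUTES -> epoch seconds MINUTES before END_EPOCH
window_start() {
  echo $(( $1 - $2 * 60 ))
}

# fetch_exemplars METRIC [MINUTES]
# Lists exemplars for METRIC over the last MINUTES minutes (default 15).
fetch_exemplars() {
  end="$(date +%s)"
  start="$(window_start "$end" "${2:-15}")"
  curl -sG "$PROM_URL/api/v1/query_exemplars" \
    --data-urlencode "query=$1" \
    --data-urlencode "start=$start" \
    --data-urlencode "end=$end" | jq '.data'
}

# Usage:
#   fetch_exemplars llm_invocation_count_total 30
```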

Traces (Tempo)

Navigate to Explore → Select Tempo datasource:

  • Search by trace ID
  • Filter by service name: "reflyd"
  • View trace waterfall and span details
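The same lookups are available from the command line through Tempo's HTTP API. The following is a sketch assuming `curl` and `jq`; the helper names are illustrative. `is_trace_id` only checks for a 32-character hexadecimal ID, the usual OpenTelemetry form.

```bash
TEMPO_URL="${TEMPO_URL:-http://localhost:33200}"

# is_trace_id ID -> succeeds if ID is exactly 32 hexadecimal characters
is_trace_id() {
  printf '%s' "$1" | grep -Eq '^[0-9a-fA-F]{32}$'
}

# get_trace ID -> fetch one trace by ID
get_trace() {
  is_trace_id "$1" || { echo "not a 32-hex trace ID: $1" >&2; return 1; }
  curl -s "$TEMPO_URL/api/traces/$1"
}

# recent_traces SERVICE -> up to three recent traces for a service name
recent_traces() {
  curl -sG "$TEMPO_URL/api/search" \
    --data-urlencode "tags=service.name=$1" | jq '.traces[:3]'
}

# Usage:
#   recent_traces reflyd
#   get_trace 0123456789abcdef0123456789abcdef
```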

Logs (Loki)

Navigate to Explore → Select Loki datasource:

# All API logs
{job="refly-api"}

# Error logs only
{job="refly-api", level="error"}

# Search by traceId (JSON parsing)
{job="refly-api"} | json | traceId="abc123..."

# Find slow queries
{job="refly-api"} | json | msg =~ ".*Slow query detected.*"

# Filter by status code
{job="refly-api"} | json | statusCode="404"
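To go from a trace to its logs without opening Grafana, the traceId query can be sent to Loki's `query_range` endpoint. The following is a sketch assuming `curl` and `jq`; `logql_for_trace` and `logs_for_trace` are illustrative names. With no `start` or `end` parameter, Loki searches the last hour.

```bash
LOKI_URL="${LOKI_URL:-http://localhost:33100}"

# logql_for_trace TRACE_ID -> LogQL selecting refly-api lines for that trace
logql_for_trace() {
  printf '{job="refly-api"} | json | traceId="%s"\n' "$1"
}

# logs_for_trace TRACE_ID -> print up to 50 matching log lines
logs_for_trace() {
  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_for_trace "$1")" \
    --data-urlencode "limit=50" | jq -r '.data.result[].values[][1]'
}

# Usage:
#   logs_for_trace 0123456789abcdef0123456789abcdef
```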

Configuration Files

| File                       | Purpose                              |
|----------------------------|--------------------------------------|
| docker-compose.yml         | Container orchestration              |
| otel-collector-config.yaml | OTLP receiver and exporters          |
| prometheus-config.yaml     | Prometheus scrape and storage config |
| tempo-config.yaml          | Tempo trace storage config           |
| loki-config.yaml           | Loki log storage and retention       |
| alloy-local-config.alloy   | Alloy log collection pipeline        |
| grafana-datasources.yaml   | Pre-configured datasources           |
| dashboards/                | Provisioned Grafana dashboards       |

Troubleshooting

No logs in Loki

Check tmux pipe-pane:

tmux display-message -p -t refly-api '#{pane_pipe}'
# Should output: 1

If inactive, re-enable:

tmux pipe-pane -o -t refly-api "cat >> /tmp/alloy/logs/refly/api.log"

Check log file:

tail -f /tmp/alloy/logs/refly/api.log
# Should show refly-api logs

Check Alloy container:

docker logs --tail 20 refly_alloy
# Look for "tail routine: started" and no errors
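To tell an Alloy problem apart from a refly-api logging problem, one option is to append a synthetic line to the log file and then search for it in Loki. The following is a sketch that assumes the pipeline accepts Pino's default shape (numeric `level`, millisecond `time`, `msg`); `pino_line` is a hypothetical helper and does not escape quotes or backslashes in the message.

```bash
# pino_line LEVEL MESSAGE [EPOCH_MS]
# Prints one Pino-style JSON log line. MESSAGE must not contain " or \.
pino_line() {
  ts="${3:-$(date +%s)000}"   # default: current time in milliseconds
  printf '{"level":%s,"time":%s,"msg":"%s"}\n' "$1" "$ts" "$2"
}

# Usage: write a test line, then search for it in Grafana Explore with
#   {job="refly-api"} |= "alloy pipeline test"
#
#   pino_line 30 "alloy pipeline test" >> /tmp/alloy/logs/refly/api.log
```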

No metrics in Prometheus

Check OTLP endpoint:

curl http://localhost:34318/health
# Should return healthy status

Check Prometheus targets: Open http://localhost:39090/targets - OTEL Collector should be "UP"

Check refly-api OTLP export: Look for OTLP export logs in refly-api output.

No traces in Tempo

Check Tempo health:

curl http://localhost:33200/ready
# Should return 200 OK

Check OTEL Collector logs:

docker logs --tail 50 refly_otel_collector | grep -i trace
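A quick way to narrow down which component is failing is to probe every readiness endpoint in one pass. The following is a sketch assuming `curl`; the Tempo path comes from this README, while `/-/ready` (Prometheus), `/ready` (Loki), and `/api/health` (Grafana) are those services' standard health endpoints. `stack_endpoints` and `check_stack` are illustrative names.

```bash
# stack_endpoints -> "name url" pairs for the host ports used in this README
stack_endpoints() {
  cat <<'EOF'
prometheus http://localhost:39090/-/ready
tempo http://localhost:33200/ready
loki http://localhost:33100/ready
grafana http://localhost:33000/api/health
EOF
}

# check_stack -> print UP/DOWN per component; returns 1 if any is down
check_stack() {
  failed=0
  while read -r name url; do
    if curl -sf -o /dev/null --max-time 3 "$url"; then
      echo "UP    $name"
    else
      echo "DOWN  $name ($url)"
      failed=1
    fi
  done <<EOF
$(stack_endpoints)
EOF
  return "$failed"
}

# Usage:
#   check_stack
```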

Stopping and Cleanup

Stop all containers

cd deploy/docker/trace
docker-compose down

Remove volumes (deletes all data)

docker-compose down -v

Disable tmux log collection

# Stop pipe-pane (running it without a command closes the current pipe)
tmux pipe-pane -t refly-api

# Remove the log directory
rm -rf /tmp/alloy/logs/refly

Data Retention

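| Component  | Retention | Storage                                      |
|------------|-----------|----------------------------------------------|
| Prometheus | 30 days   | Docker volume prometheus_data                |
| Tempo      | 7 days    | Docker volume tempo_data                     |
| Loki       | 7 days    | Docker volume loki_data                      |
| Alloy      | N/A       | Docker volume alloy_data (position tracking) |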

Notes

  • Alloy requirement: Alloy runs as a Docker container; no local installation is needed
  • Log format: Alloy expects Pino JSON logs (auto-detected)
  • Network: All containers share the refly_trace_network bridge network
  • Security: Anonymous Grafana access is enabled for local development only