Vidify

June 26, 2026


Vidify is a video understanding agent. Give it a YouTube URL, HTTP video URL, or local video and get structured analysis, searchable indexes, Q&A, highlights, reports, and live-stream understanding.

What It Does

| Capability | Description |
| --- | --- |
| Analyze | Download media, extract subtitles/metadata, run ASR when needed, and build timelines |
| Understand | Caption frames, read OCR text, detect objects, analyze emotion, and translate transcripts |
| Search & Ask | Build a FAISS index over transcript, frames, and metadata for evidence-backed Q&A |
| Edit | Detect highlights, export clips, and optionally assemble reels |
| Stream | Process webcams or RTMP/HTTP streams with adaptive segmentation and live Q&A |
| Operate | Retry transient failures, degrade optional skills gracefully, emit progress events, and run hooks |

Vidify is ASR-first: subtitles and speech usually carry the main story, so visual model calls are skipped when transcript coverage is sufficient. See Project Overview for the full processing flow.
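The ASR-first decision can be pictured as a coverage check on the transcript. The following is a minimal sketch, not Vidify's actual implementation: `TranscriptSegment`, `transcript_coverage`, `needs_visual_analysis`, and the `0.6` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    """Hypothetical transcript unit; times are in seconds."""
    start: float
    end: float
    text: str


def transcript_coverage(segments, duration):
    """Return the fraction of the video covered by non-empty transcript text."""
    if duration <= 0:
        return 0.0
    # Clip each segment to the video bounds and drop empty ones.
    intervals = []
    for seg in segments:
        start, end = max(0.0, seg.start), min(duration, seg.end)
        if seg.text.strip() and end > start:
            intervals.append((start, end))
    intervals.sort()
    # Merge overlapping intervals so overlapping subtitles are counted once.
    covered = 0.0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / duration


def needs_visual_analysis(segments, duration, min_coverage=0.6):
    """Assumed policy: call visual models only when coverage is below the threshold."""
    return transcript_coverage(segments, duration) < min_coverage
```

With a rule like this, a talk with continuous speech would skip frame captioning, while a mostly silent screen recording would fall back to visual models.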

Quick Start

1. Install

pip install -e .

System requirements: Python 3.11+, ffmpeg, and yt-dlp.

Optional feature groups can be installed as extras or from the full requirements file:

pip install -e ".[asr,ocr,emotion,live,serving]"
pip install -r requirements-full.txt

2. Configure

cp .env.example .env

Edit .env when you need custom model endpoints, model names, cache paths, or web search credentials. Full details are in Configuration.

3. Start Model Serving

Vidify expects an OpenAI-compatible multimodal endpoint, typically served with vLLM:

# vLLM >= 0.19.0 is required for Qwen3.5 support.
pip install "vllm>=0.19.0"

bash scripts/serving_qwen3_5.sh

Manual example:

vllm serve Qwen/Qwen3.5-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --allowed-local-media-path "$(pwd)/cache"

See Deployment for GPU, Ascend/NPU, Docker, and validation commands.

4. Run

CLI:

python -m agent.main analyze youtube "https://www.youtube.com/watch?v=..." --mode detailed
python -m agent.main analyze local media/example.mp4 --mode brief
python -m agent.main analyze local media/example.mp4 --mode ask --question "What changed?"

REST API and web UI:

uvicorn server.app:app --host 0.0.0.0 --port 9000

curl -X POST http://localhost:9000/analyze \
  -H 'Content-Type: application/json' \
  -d '{"source_type":"youtube","uri":"https://www.youtube.com/watch?v=...","mode":"detailed"}'

Open http://localhost:9000 for the web interface.
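The same request can be sent from Python. This is a small sketch using only the standard library; the endpoint path and JSON fields mirror the curl command above, while the helper names `build_analyze_request` and `analyze` are hypothetical, and the assumption that the server replies with a JSON body is not confirmed here (see API Reference for the actual response schema).

```python
import json
import urllib.request


def build_analyze_request(uri, source_type="youtube", mode="detailed",
                          base_url="http://localhost:9000"):
    """Build the POST /analyze request with the same fields as the curl example."""
    payload = {"source_type": source_type, "uri": uri, "mode": mode}
    return urllib.request.Request(
        f"{base_url}/analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(uri, **kwargs):
    """Send the request; assumes (unverified) that the response body is JSON."""
    request = build_analyze_request(uri, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `analyze("https://www.youtube.com/watch?v=...", mode="brief")` requires the uvicorn server above to be running.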

Workflow Modes

brief is the canonical lightweight mode. quick is still accepted as a legacy alias in the CLI and API.

| Mode | Use It For | Example |
| --- | --- | --- |
| brief | Fast ASR-first summary | python -m agent.main analyze youtube URL --mode brief |
| detailed | OCR, object detection, emotion, translation, and richer timelines | python -m agent.main analyze youtube URL --mode detailed |
| ask | Question-answering over an indexed video | python -m agent.main analyze youtube URL --mode ask --question "What are the conclusions?" |
| highlights | Clip export and optional reels | python -m agent.main analyze youtube URL --mode highlights |
| report | Structured report generation, optionally with web search | python -m agent.main analyze youtube URL --mode report --include-web-search |
| live | Webcam, RTMP, or HTTP stream understanding | python -m agent.main analyze local webcam --mode live |
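Client code that forwards a user-supplied mode may want to resolve the legacy `quick` alias before calling the CLI or API. The sketch below is illustrative only: `normalize_mode`, `CANONICAL_MODES`, and `LEGACY_ALIASES` are hypothetical names, and the mode set covers just the six modes listed in this section, not necessarily everything Vidify accepts.

```python
# The six modes listed in this section (assumed, not exhaustive).
CANONICAL_MODES = {"brief", "detailed", "ask", "highlights", "report", "live"}

# Legacy names mapped to their canonical replacement.
LEGACY_ALIASES = {"quick": "brief"}


def normalize_mode(mode):
    """Map a user-supplied mode to its canonical name, resolving legacy aliases."""
    name = mode.strip().lower()
    name = LEGACY_ALIASES.get(name, name)
    if name not in CANONICAL_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    return name
```

For example, `normalize_mode("quick")` returns `"brief"`, so downstream code only has to handle canonical names.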

See Workflows and API Reference for complete parameters and request schemas.

Hermes

This repo ships a Hermes-native skill at .agents/skills/media/vidify.

python -m agent.main hermes install-skill

The installer symlinks the skill into ~/.hermes/skills/media/vidify by default. Use --strategy copy for a standalone copy. The legacy openclaw/ skill remains available for older setups.
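The difference between the two strategies can be sketched as follows. This is a hypothetical illustration of symlink versus copy installation, not the code behind `hermes install-skill`; the function name, its arguments, and the refusal to overwrite an existing target are assumptions.

```python
import shutil
from pathlib import Path


def install_skill(source, target, strategy="symlink"):
    """Install a skill directory by symlinking it or copying it to the target."""
    if strategy not in ("symlink", "copy"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    source = Path(source).resolve()
    target = Path(target).expanduser()
    if not source.is_dir():
        raise FileNotFoundError(f"skill directory not found: {source}")
    # Assumed behavior: never overwrite an existing install.
    if target.is_symlink() or target.exists():
        raise FileExistsError(f"target already exists: {target}")
    target.parent.mkdir(parents=True, exist_ok=True)
    if strategy == "symlink":
        # The link tracks the repository, so repo updates are visible immediately.
        target.symlink_to(source, target_is_directory=True)
    else:
        # A standalone snapshot that no longer depends on the repository path.
        shutil.copytree(source, target)
    return target
```

A symlink keeps the installed skill in sync with the checkout, while a copy survives moving or deleting the repository but has to be reinstalled to pick up changes.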

Testing

Run the fast test suite:

pytest tests/

Validate against an existing model endpoint:

bash scripts/run_test_gpu.sh --api-base http://localhost:8000/v1 --video media/my_video.mp4
python scripts/test_all.py --video-path media/my_video.mp4 --api-base http://localhost:8000/v1

See Testing Guide for focused tests, YouTube E2E validation, and hardware-specific notes.

Repository Layout

| Path | Purpose |
| --- | --- |
| agent/core/ | Orchestration, schemas, events, hooks, retries, segmenting, and parallel execution |
| agent/extensions/skills/ | Reusable video, audio, retrieval, and analysis units |
| agent/extensions/workflows/ | User-facing workflow composition |
| agent/extensions/models/ | Model adapters and direct-loading helpers |
| server/ | FastAPI app, SSE endpoints, and web routes |
| templates/ | Web UI templates |
| scripts/ | Serving, validation, and demo helpers |
| docs/ | Architecture, workflow, deployment, and API documentation |
| cache/ | Runtime artifacts; do not commit generated outputs |

Documentation

| Document | Contents |
| --- | --- |
| Project Overview | ASR-first design, capability map, and processing flow |
| Deployment | vLLM serving, GPU validation, Ascend/NPU helpers, and Docker |
| Live Streaming | Webcam/stream architecture, CLI/API usage, and config |
| Production Features | Retries, graceful degradation, parallelism, progress events, hooks, and logging |
| Architecture | Data models, cache structure, model interfaces, and orchestrator |
| Workflows | Brief, detailed, index, ask, highlights, report, and live modes |
| Skills Reference | Skill APIs and responsibilities |
| API Reference | REST endpoints, CLI arguments, examples, and schemas |
| Configuration | YAML files, environment variables, vLLM setup, and Docker |
| Testing Guide | Pytest, local E2E, GPU/Ascend endpoint validation, and YouTube E2E |
| Web Search | Google Custom Search and fallback search setup |