Vidify

June 26, 2026 · View on GitHub

Vidify is a video understanding agent. Give it a YouTube URL, HTTP video URL, or local video and get structured analysis, searchable indexes, Q&A, highlights, reports, and live-stream understanding.

What It Does

Capability	Description
Analyze	Download media, extract subtitles/metadata, run ASR when needed, and build timelines
Understand	Caption frames, read OCR text, detect objects, analyze emotion, and translate transcripts
Search & Ask	Build a FAISS index over transcript, frames, and metadata for evidence-backed Q&A
Edit	Detect highlights, export clips, and optionally assemble reels
Stream	Process webcams or RTMP/HTTP streams with adaptive segmentation and live Q&A
Operate	Retry transient failures, degrade optional skills gracefully, emit progress events, and run hooks

Vidify is ASR-first: subtitles and speech usually carry the main story, so visual model calls are skipped when transcript coverage is sufficient. See Project Overview for the full processing flow.

Quick Start

1. Install

pip install -e .

System requirements: Python 3.11+, ffmpeg, and yt-dlp.

Optional feature groups:

pip install -e ".[asr,ocr,emotion,live,serving]"
pip install -r requirements-full.txt

2. Configure

cp .env.example .env

Edit .env when you need custom model endpoints, model names, cache paths, or web search credentials. Full details are in Configuration.

3. Start Model Serving

Vidify expects an OpenAI-compatible multimodal endpoint, usually vLLM:

# vLLM >= 0.19.0 is required for Qwen3.5 support.
pip install "vllm>=0.19.0"

bash scripts/serving_qwen3_5.sh

Manual example:

vllm serve Qwen/Qwen3.5-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --allowed-local-media-path $(pwd)/cache

See Deployment for GPU, Ascend/NPU, Docker, and validation commands.

4. Run

CLI:

python -m agent.main analyze youtube "https://www.youtube.com/watch?v=..." --mode detailed
python -m agent.main analyze local media/example.mp4 --mode brief
python -m agent.main analyze local media/example.mp4 --mode ask --question "What changed?"

REST API and web UI:

uvicorn server.app:app --host 0.0.0.0 --port 9000

curl -X POST http://localhost:9000/analyze \
  -H 'Content-Type: application/json' \
  -d '{"source_type":"youtube","uri":"https://www.youtube.com/watch?v=...","mode":"detailed"}'

Open http://localhost:9000 for the web interface.

Workflow Modes

brief is the canonical lightweight mode. quick is still accepted as a legacy alias in the CLI and API.

Mode	Use It For	Example
`brief`	Fast ASR-first summary	`python -m agent.main analyze youtube URL --mode brief`
`detailed`	OCR, object detection, emotion, translation, and richer timelines	`python -m agent.main analyze youtube URL --mode detailed`
`ask`	Question-answering over an indexed video	`python -m agent.main analyze youtube URL --mode ask --question "What are the conclusions?"`
`highlights`	Clip export and optional reels	`python -m agent.main analyze youtube URL --mode highlights`
`report`	Structured report generation, optionally with web search	`python -m agent.main analyze youtube URL --mode report --include-web-search`
`live`	Webcam, RTMP, or HTTP stream understanding	`python -m agent.main analyze local webcam --mode live`

See Workflows and API Reference for complete parameters and request schemas.

Hermes

This repo ships a Hermes-native skill at .agents/skills/media/vidify.

python -m agent.main hermes install-skill

The installer symlinks the skill into ~/.hermes/skills/media/vidify by default. Use --strategy copy for a standalone copy. The legacy openclaw/ skill remains available for older setups.

Testing

Run the fast test suite:

pytest tests/

Validate against an existing model endpoint:

bash scripts/run_test_gpu.sh --api-base http://localhost:8000/v1 --video media/my_video.mp4
python scripts/test_all.py --video-path media/my_video.mp4 --api-base http://localhost:8000/v1

See Testing Guide for focused tests, YouTube E2E validation, and hardware-specific notes.

Repository Layout

Path	Purpose
`agent/core/`	Orchestration, schemas, events, hooks, retries, segmenting, and parallel execution
`agent/extensions/skills/`	Reusable video, audio, retrieval, and analysis units
`agent/extensions/workflows/`	User-facing workflow composition
`agent/extensions/models/`	Model adapters and direct-loading helpers
`server/`	FastAPI app, SSE endpoints, and web routes
`templates/`	Web UI templates
`scripts/`	Serving, validation, and demo helpers
`docs/`	Architecture, workflow, deployment, and API documentation
`cache/`	Runtime artifacts; do not commit generated outputs

Documentation

Document	Contents
Project Overview	ASR-first design, capability map, and processing flow
Deployment	vLLM serving, GPU validation, Ascend/NPU helpers, and Docker
Live Streaming	Webcam/stream architecture, CLI/API usage, and config
Production Features	Retries, graceful degradation, parallelism, progress events, hooks, and logging
Architecture	Data models, cache structure, model interfaces, and orchestrator
Workflows	Brief, detailed, index, ask, highlights, report, and live modes
Skills Reference	Skill APIs and responsibilities
API Reference	REST endpoints, CLI arguments, examples, and schemas
Configuration	YAML files, environment variables, vLLM setup, and Docker
Testing Guide	Pytest, local E2E, GPU/Ascend endpoint validation, and YouTube E2E
Web Search	Google Custom Search and fallback search setup