Project Overview

June 26, 2026

Vidify turns video sources into structured, searchable, and actionable analysis. It accepts YouTube URLs, HTTP video URLs, local files, webcams, and live streams.

Capability Map

Area	What Vidify Does
Analysis	Probes media, extracts subtitles and metadata, runs ASR fallback, and builds timelines
Visual understanding	Samples frames, captions visual content, extracts OCR text, detects objects, and analyzes emotion
Retrieval	Builds FAISS indexes over frames, transcript, metadata, and timeline chunks
Q&A	Answers natural-language questions with retrieved evidence and targeted visual lookup
Editing	Detects highlight segments and exports clips or reels
Reporting	Generates structured reports, optionally enriched with web search context
Streaming	Maintains live memory over webcams or RTMP/HTTP streams and supports mid-stream Q&A

ASR-First Design

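Most videos communicate their main information through speech, subtitles, titles, and descriptions. Vidify therefore treats text as the primary signal and makes visual model calls only when the transcript is insufficient or visual analysis is explicitly requested.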

Subtitles first - for YouTube and web videos, embedded manual or auto-generated subtitles are extracted with yt-dlp.
ASR fallback - if subtitles are unavailable, Whisper transcribes audio.
Metadata context - title, description, tags, and uploader information are included in downstream prompts.
Sufficiency check - a fast heuristic checks speech coverage and word count before expensive visual captioning.
Conditional visual processing - frame captioning runs for silent videos, music videos, sparse-speech media, or forced visual analysis.
Targeted visual lookup - visual Q&A captions only frames near relevant timestamps when possible.
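
As an illustration of targeted visual lookup, the sketch below picks only the sampled frames that fall close to timestamps returned by retrieval. The function name `frames_near` and the `window_s` parameter are hypothetical and not taken from the Vidify codebase; the actual selection logic may differ.

```python
from bisect import bisect_left, bisect_right


def frames_near(frame_times, hit_times, window_s=5.0):
    """Return sorted frame timestamps within window_s seconds of any hit.

    frame_times: timestamps (seconds) of sampled frames.
    hit_times:   timestamps (seconds) that retrieval flagged as relevant.
    window_s:    hypothetical half-width of the lookup window.
    """
    frame_times = sorted(frame_times)
    selected = set()
    for t in hit_times:
        # Binary search for the slice of frames inside [t - window, t + window].
        lo = bisect_left(frame_times, t - window_s)
        hi = bisect_right(frame_times, t + window_s)
        selected.update(frame_times[lo:hi])
    return sorted(selected)


# Frames sampled every 10 s; retrieval pointed at 22 s and 48 s.
frames = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(frames_near(frames, [22.0, 48.0]))  # [20.0, 50.0]
```

Only two of the seven frames would be sent to the captioning model here, which is the cost saving the targeted lookup is meant to provide.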

The default sufficiency thresholds are configured through workflows.yaml:

brief:
  asr_first: true
  min_coverage_ratio: 0.3
  min_word_count: 50
  force_visual: false
  prefer_subtitles_over_asr: true
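
A minimal sketch of how these thresholds might drive the sufficiency check. It assumes that coverage means the fraction of the video's duration covered by transcript segments and that both thresholds must pass; the `Segment` type and both function names are hypothetical, and Vidify's actual heuristic may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def speech_coverage(segments, duration_s):
    """Fraction of the video covered by transcript segments (overlaps counted once)."""
    if duration_s <= 0:
        return 0.0
    covered, cursor = 0.0, 0.0
    for seg in sorted(segments, key=lambda s: s.start):
        start = max(seg.start, cursor)   # skip time already counted
        end = min(seg.end, duration_s)   # clip to the video length
        if end > start:
            covered += end - start
            cursor = end
    return covered / duration_s


def needs_visual_captioning(segments, duration_s, min_coverage_ratio=0.3,
                            min_word_count=50, force_visual=False):
    """Return True when the transcript alone is judged insufficient.

    Assumption: both thresholds must be met for the transcript to count as sufficient.
    """
    if force_visual:
        return True
    words = sum(len(seg.text.split()) for seg in segments)
    coverage = speech_coverage(segments, duration_s)
    return not (words >= min_word_count and coverage >= min_coverage_ratio)


talk = [Segment(0, 20, "word " * 40), Segment(30, 50, "word " * 30)]
print(needs_visual_captioning(talk, duration_s=100))  # False: 70 words, 0.4 coverage
print(needs_visual_captioning([], duration_s=100))    # True: silent video
```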

Processing Flow

Source video
  -> download or load local media
  -> probe duration, fps, resolution, and metadata
  -> extract subtitles if present
  -> run ASR if subtitles are missing or insufficient
  -> check transcript sufficiency
  -> skip or run visual captioning
  -> build timeline from transcript, metadata, and selected frames
  -> persist analysis in cache
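
The branching in this flow can be sketched as a small function that takes each stage as a callable. This is only an outline under stated assumptions: the function and parameter names are hypothetical, the transcript is modeled as a plain string, and downloading, probing, timeline building, and caching are left out.

```python
def analyze(source, *, extract_subtitles, run_asr, is_sufficient, caption_frames):
    """Run the text-first stages and record which ones were executed."""
    steps = ["subtitles"]
    transcript = extract_subtitles(source)

    # Fall back to ASR when subtitles are missing or too thin.
    if not transcript or not is_sufficient(transcript):
        transcript = run_asr(source)
        steps.append("asr")

    # Visual captioning runs only if the transcript is still insufficient.
    captions = []
    if not is_sufficient(transcript):
        captions = caption_frames(source)
        steps.append("visual")

    return {"transcript": transcript, "captions": captions, "steps": steps}


# Stub stages standing in for the real subtitle, ASR, and captioning skills.
result = analyze(
    "talk.mp4",
    extract_subtitles=lambda src: "",                      # no subtitles found
    run_asr=lambda src: "welcome to the quarterly review",
    is_sufficient=lambda text: len(text.split()) >= 3,
    caption_frames=lambda src: ["a person at a podium"],
)
print(result["steps"])  # ['subtitles', 'asr']
```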

The `ask`, `highlights`, and `report` modes reuse cached analysis when possible, and `index` builds the retrieval layer used by Q&A.
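
One way to picture cache reuse is a lookup keyed by the source and the analysis options, with results stored as JSON. The sketch below is an assumption about how such a cache could work, not a description of Vidify's cache layout; `cache_key`, `load_or_analyze`, and the file naming scheme are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cache_key(source, options):
    """Stable key derived from the source and the analysis options."""
    payload = json.dumps({"source": source, "options": options}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_analyze(source, options, analyze, cache_dir="cache"):
    """Return cached analysis JSON if present; otherwise run analyze and store it."""
    path = Path(cache_dir) / f"{cache_key(source, options)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = analyze(source, options)  # must return JSON-serializable data
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Calling `load_or_analyze` twice with the same source and options runs the analysis once; changing an option such as `force_visual` produces a different key and triggers a fresh run.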

Project Structure

Path	Role
`agent/core/`	Shared contracts, orchestration, events, hooks, retries, segmenting, and parallel execution
`agent/extensions/skills/`	Standalone video/audio/retrieval/analysis units
`agent/extensions/workflows/`	End-to-end modes composed from skills
`agent/extensions/models/`	vLLM/OpenAI-compatible clients and direct model loading
`agent/integrations/`	External framework adapters, including Hermes
`agent/extensions/mra/`	Meta-Reflective Auditor implementation
`server/`	FastAPI app, REST endpoints, SSE progress streaming, and web UI routes
`scripts/`	Serving, validation, demos, and local workflow helpers
`docs/`	Architecture, API, workflow, deployment, and testing docs
`cache/`	Runtime outputs such as downloads, frames, audio, indexes, and analysis JSON

Runtime artifacts, model weights, extracted frames/audio, logs, FAISS indexes, and uploaded files should stay out of Git.