Deployment

June 26, 2026 · View on GitHub

Vidify runs against OpenAI-compatible model endpoints. The default local setup uses vLLM on port 8000, while the FastAPI app runs on port 9000.

Requirements

Component	Requirement
Python	3.11+
System tools	`ffmpeg`, `ffprobe`, `yt-dlp`
Model serving	vLLM-compatible GPU/NPU endpoint, or direct model loading
Default model	Qwen3.5, configurable in `models.yaml`
vLLM	`>=0.19.0` for Qwen3.5

Optional features may require PaddleOCR, Tesseract, YOLO weights, CUDA/NPU runtimes, or Google Custom Search credentials.

Local vLLM Serving

Recommended Qwen3.5 helper:

pip install "vllm>=0.19.0"
bash scripts/serving_qwen3_5.sh

Manual command:

vllm serve Qwen/Qwen3.5-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --allowed-local-media-path $(pwd)/cache

On multi-GPU hosts:

TP_SIZE=2 MAX_MODEL_LEN=131072 bash scripts/serving_qwen3_5.sh

Legacy Qwen3-VL helper:

bash scripts/serving_qwen3vl.sh

App Server

uvicorn server.app:app --host 0.0.0.0 --port 9000

The web UI is available at http://localhost:9000. Swagger docs are available at http://localhost:9000/docs.

GPU Endpoint Validation

Use run_test_gpu.sh when a GPU-backed OpenAI-compatible endpoint is already running:

bash scripts/run_test_gpu.sh --api-base http://localhost:8000/v1 --video media/my_video.mp4

bash scripts/run_test_gpu.sh --api-base http://localhost:8000/v1 \
  --video media/my_video.mp4 --tests "frame_caption video_qa highlights"

Vidify can run against Ascend-backed vLLM deployments through the same OpenAI-compatible API used for GPU serving. Keep provider-specific scheduler commands, internal registry URLs, mount paths, and credentials in local docs or .env files instead of committed public docs.

Generic helpers:

# Qwen3.5-9B
TP_SIZE=2 bash scripts/serving_qwen3_5_ascend.sh /models/Qwen3.5-9B

# Qwen2.5-VL fallback
TP_SIZE=2 bash scripts/serving_qwen2_5vl_ascend.sh /models/Qwen2.5-VL-7B-Instruct

# Validate against an existing Ascend/NPU endpoint
bash scripts/run_test_ascend.sh --api-base http://localhost:8000/v1 --video media/my_video.mp4

For Qwen3.5 on Ascend, the helper uses --enforce-eager and conservative MAX_MODEL_LEN=16384 defaults. Tune TP_SIZE, MAX_MODEL_LEN, PORT, and ALLOWED_LOCAL_MEDIA_PATH for your hardware and vLLM build.

Docker

Full stack:

docker-compose up

App only:

docker build -t vidify .
docker run -p 9000:9000 vidify

Local Runtime Environment

Copy .env.example when you need overrides:

cp .env.example .env

Common values:

Variable	Description	Example
`LLM_BASE_URL`	OpenAI-compatible chat/completions endpoint	`http://localhost:8000/v1`
`LLM_MODEL`	Default multimodal model name	`qwen3.5-9b`
`EMBED_BASE_URL`	OpenAI-compatible embeddings endpoint	`http://localhost:8000/v1`
`EMBED_MODEL`	Default embedding model name	`qwen-embed`
`CACHE_ROOT`	Runtime cache directory	`./cache`

See Configuration for YAML config, precedence, and web search environment variables.