Deployment

June 26, 2026 ยท View on GitHub

Vidify runs against OpenAI-compatible model endpoints. The default local setup uses vLLM on port 8000, while the FastAPI app runs on port 9000.

Requirements

ComponentRequirement
Python3.11+
System toolsffmpeg, ffprobe, yt-dlp
Model servingvLLM-compatible GPU/NPU endpoint, or direct model loading
Default modelQwen3.5, configurable in models.yaml
vLLM>=0.19.0 for Qwen3.5

Optional features may require PaddleOCR, Tesseract, YOLO weights, CUDA/NPU runtimes, or Google Custom Search credentials.

Local vLLM Serving

Recommended Qwen3.5 helper:

pip install "vllm>=0.19.0"
bash scripts/serving_qwen3_5.sh

Manual command:

vllm serve Qwen/Qwen3.5-9B \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 65536 \
  --reasoning-parser qwen3 \
  --allowed-local-media-path $(pwd)/cache

On multi-GPU hosts:

TP_SIZE=2 MAX_MODEL_LEN=131072 bash scripts/serving_qwen3_5.sh

Legacy Qwen3-VL helper:

bash scripts/serving_qwen3vl.sh

App Server

uvicorn server.app:app --host 0.0.0.0 --port 9000

The web UI is available at http://localhost:9000. Swagger docs are available at http://localhost:9000/docs.

GPU Endpoint Validation

Use run_test_gpu.sh when a GPU-backed OpenAI-compatible endpoint is already running:

bash scripts/run_test_gpu.sh --api-base http://localhost:8000/v1 --video media/my_video.mp4

bash scripts/run_test_gpu.sh --api-base http://localhost:8000/v1 \
  --video media/my_video.mp4 --tests "frame_caption video_qa highlights"

Ascend / NPU Serving

Vidify can run against Ascend-backed vLLM deployments through the same OpenAI-compatible API used for GPU serving. Keep provider-specific scheduler commands, internal registry URLs, mount paths, and credentials in local docs or .env files instead of committed public docs.

Generic helpers:

# Qwen3.5-9B
TP_SIZE=2 bash scripts/serving_qwen3_5_ascend.sh /models/Qwen3.5-9B

# Qwen2.5-VL fallback
TP_SIZE=2 bash scripts/serving_qwen2_5vl_ascend.sh /models/Qwen2.5-VL-7B-Instruct

# Validate against an existing Ascend/NPU endpoint
bash scripts/run_test_ascend.sh --api-base http://localhost:8000/v1 --video media/my_video.mp4

For Qwen3.5 on Ascend, the helper uses --enforce-eager and conservative MAX_MODEL_LEN=16384 defaults. Tune TP_SIZE, MAX_MODEL_LEN, PORT, and ALLOWED_LOCAL_MEDIA_PATH for your hardware and vLLM build.

Docker

Full stack:

docker-compose up

App only:

docker build -t vidify .
docker run -p 9000:9000 vidify

Local Runtime Environment

Copy .env.example when you need overrides:

cp .env.example .env

Common values:

VariableDescriptionExample
LLM_BASE_URLOpenAI-compatible chat/completions endpointhttp://localhost:8000/v1
LLM_MODELDefault multimodal model nameqwen3.5-9b
EMBED_BASE_URLOpenAI-compatible embeddings endpointhttp://localhost:8000/v1
EMBED_MODELDefault embedding model nameqwen-embed
CACHE_ROOTRuntime cache directory./cache

See Configuration for YAML config, precedence, and web search environment variables.