TypemovieInfer

January 26, 2026

Production-Ready Multi-GPU Inference Framework for Consumer GPUs

Overview
Key Features
Supported Models
Installation
Quick Start
Usage Guide
Performance Benchmarks
Configuration
Troubleshooting
Advanced Topics
Contributing
License

Overview

TypemovieInfer is a unified, production-ready inference framework designed for consumer-grade GPUs (RTX 4090/5090). It provides optimized multi-GPU parallelism for state-of-the-art image and video generation models.

Why TypemovieInfer?

🚀 High Performance: Multi-GPU parallelism with Para-Attention acceleration; the inference phase runs 27% faster on RTX 5090 than on RTX 4090
🔌 Unified API: Single, consistent API for all models - just 3 lines of code to run any model
💾 Memory Optimized: FP8 quantization and smart offloading strategies enable running 8B models on a single RTX 4090
🎛️ Flexible Configuration: YAML-based configuration with hot-reload support for rapid iteration
🔧 Extensible Architecture: Plugin-based system allows adding new models in minutes
📦 Production Ready: 5 of 7 models tested successfully (71% coverage), with performance benchmarks on RTX 4090/5090

Architecture Highlights

Model Registry System: Dynamic model discovery and loading with two-level caching
Handler Pattern: Unified interface for all models (inference(), model_info, __call__)
Multiple Parallel Strategies:
- ParaAttn (Para-Attention) for FLUX, Kontext, and WAN models
- DDP (Distributed Data Parallel) for Qwen models
- USP (Ulysses Sequence Parallel) support via xfuser
LoRA Management: Dynamic LoRA switching for video models with cache management
Memory Optimization: CPU offloading, FP8 quantization, and smart parameter management
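
The registry and handler pattern above can be pictured with a small self-contained sketch. It is an illustration only: `SketchRegistry`, `EchoHandler`, `fake_loader`, and the reading of "two-level caching" as a config cache plus a handler-instance cache are assumptions, not TypemovieInfer's actual implementation.

```python
from typing import Any, Callable, Dict, List


class EchoHandler:
    """Hypothetical handler exposing inference(), model_info, and __call__."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    def inference(self, **kwargs: Any) -> Dict[str, Any]:
        # A real handler would run the model; this one echoes its inputs.
        return {"model": self.config["name"], "inputs": kwargs}

    @property
    def model_info(self) -> Dict[str, Any]:
        return {"name": self.config["name"], "type": self.config["type"]}

    def __call__(self, **kwargs: Any) -> Dict[str, Any]:
        return self.inference(**kwargs)


class SketchRegistry:
    """Level 1 caches parsed configs; level 2 caches handler instances."""

    def __init__(self, config_loader: Callable[[str], Dict[str, Any]]) -> None:
        self._config_loader = config_loader
        self._configs: Dict[str, Dict[str, Any]] = {}
        self._handlers: Dict[str, EchoHandler] = {}

    def get_model_config(self, name: str) -> Dict[str, Any]:
        if name not in self._configs:
            self._configs[name] = self._config_loader(name)
        return self._configs[name]

    def load_handler(self, name: str) -> EchoHandler:
        if name not in self._handlers:
            self._handlers[name] = EchoHandler(self.get_model_config(name))
        return self._handlers[name]

    def clear_cache(self) -> None:
        self._configs.clear()
        self._handlers.clear()


load_calls: List[str] = []


def fake_loader(name: str) -> Dict[str, Any]:
    load_calls.append(name)  # record how often the "disk" is hit
    return {"name": name, "type": "image_generation"}


registry = SketchRegistry(fake_loader)
first = registry.load_handler("flux_paraattn")
second = registry.load_handler("flux_paraattn")  # served from the handler cache
result = first(prompt="A beautiful landscape")
```

In this sketch a second `load_handler` call returns the same object without re-reading the config, and `clear_cache()` resets both levels.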

Key Features

1. Multi-GPU Acceleration

TypemovieInfer supports multiple parallelization strategies:

Para-Attention (ParaAttn): Splits attention computation across GPUs for FLUX, Kontext, and WAN models
Distributed Data Parallel (DDP): Distributes batch processing for Qwen models
Sequence Parallel (USP): Advanced sequence parallelism via xfuser integration

2. Memory Optimization

Run large models on consumer GPUs:

FP8 Quantization: Roughly halves weight memory relative to 16-bit precision, with minimal quality loss
Smart Offloading: CPU offloading with configurable granularity (3 levels)
LoRA Cache Management: Efficient loading/unloading of LoRA weights
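
As a rough worked example of the FP8 saving, the arithmetic below estimates weight memory for an 8B-parameter model. It counts weights only; activations, text encoders, VAE, and framework overhead are deliberately left out, so real usage will be higher. The helper name `weight_memory_gib` is made up for this sketch.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


params_8b = 8_000_000_000
fp16_gib = weight_memory_gib(params_8b, 2)  # 16-bit weights: 2 bytes each
fp8_gib = weight_memory_gib(params_8b, 1)   # FP8 weights: 1 byte each

print(f"FP16: {fp16_gib:.1f} GiB, FP8: {fp8_gib:.1f} GiB")
# prints: FP16: 14.9 GiB, FP8: 7.5 GiB
```

On a 24GB card, the FP8 figure leaves considerably more headroom for everything that is not a weight.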

3. Flexible Configuration

YAML-based configuration system:

Model-specific settings in config.yaml
Hot-reload support for rapid development
Easy parameter tuning without code changes
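
Hot-reload can be implemented in several ways; one common approach is to re-parse the file whenever its modification time changes. The sketch below shows that idea and is not taken from the TypemovieInfer source. `HotReloadingConfig` is a hypothetical name, and the demo parses JSON only to stay within the standard library; with PyYAML installed, `yaml.safe_load` could be passed as the `parse` callable instead.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict


class HotReloadingConfig:
    """Re-parse a config file whenever its modification time changes."""

    def __init__(self, path: str, parse: Callable[[str], Dict[str, Any]]) -> None:
        self._path = path
        self._parse = parse
        self._mtime_ns = -1
        self._data: Dict[str, Any] = {}

    def get(self) -> Dict[str, Any]:
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:  # file changed since the last read
            with open(self._path, "r", encoding="utf-8") as fh:
                self._data = self._parse(fh.read())
            self._mtime_ns = mtime_ns
        return self._data


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "config.json")
    with open(cfg_path, "w", encoding="utf-8") as fh:
        json.dump({"num_inference_steps": 30}, fh)

    cfg = HotReloadingConfig(cfg_path, json.loads)
    steps_before = cfg.get()["num_inference_steps"]

    with open(cfg_path, "w", encoding="utf-8") as fh:
        json.dump({"num_inference_steps": 20}, fh)
    # Push the mtime forward in case both writes landed in the same clock tick.
    st = os.stat(cfg_path)
    os.utime(cfg_path, ns=(st.st_atime_ns, st.st_mtime_ns + 10**9))

    steps_after = cfg.get()["num_inference_steps"]
```

Polling the modification time on each `get()` keeps the sketch simple; a file-watcher would avoid the extra `stat` call at the cost of more machinery.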

4. Production Ready

Comprehensive testing and validation:

5/7 models tested successfully (71% coverage)
Detailed performance benchmarks on 4090/5090
Comprehensive error handling and diagnostics

Supported Models

Image Generation Models

FLUX (ParaAttn)

Model: flux_paraattn
Type: Text-to-Image
GPUs: 2× RTX 4090/5090
Features:
- High-quality image generation (1024×1024 default)
- Multiple LoRA styles (Film, Realistic, Cartoon, Anime, Sketch, Pixar, Ghibli, etc.)
- FP8 quantization enabled
- Trigger words automatically prepended

Usage:

model = load_model("flux_paraattn")
images = model.inference(prompt="A beautiful landscape", style_idx=0)

Kontext (ParaAttn)

Model: kontext_paraattn
Type: Context-aware Image Editing
GPUs: 2× RTX 4090/5090
Features:
- Text-to-image and image-to-image generation
- Auto-concatenate multiple input images
- Multiple aspect ratios (16:9, 9:16, 1:1, etc.)

Usage:

model = load_model("kontext_paraattn")
images = model.inference(prompt="Add sunset", image=input_img)

Qwen (DDP)

Models: qwen_ddp, qwen_edit_ddp
Type: Text-to-Image with Chinese Support
GPUs: Multiple GPUs via DDP
Features:
- Chinese language support
- Image editing variant available
- Various aspect ratios
Status: ⚠️ Requires QwenImagePipeline from diffusers

Usage:

model = load_model("qwen_ddp")
images = model.inference(prompt="一个美丽的风景", ratio="16:9")  # prompt: "A beautiful landscape"

Video Generation Models

WAN Video (4090/5090 variants)

Models: video_wan_lora_4090, video_wan_lora_5090
Type: Image-to-Video (I2V)
GPUs: 8× RTX 4090 or 8× RTX 5090
Features:
- 81 frames @ 15-16 fps
- Multiple LoRA categories:
  - Basic (0_x): General motion
  - Talk (1_x): Talking animations
  - Pet (2_x): Pet movements
  - Landscape (3_x): Landscape animations
  - Effects (5_x, 6_x): Special effects
- 4090 variant: Accelerated LoRA option (8 steps vs 30)
- 5090 variant: Enhanced stability and 27% faster inference
- FP8 quantization enabled
- Memory optimization with CPU offloading
Performance: RTX 5090 is 27% faster than RTX 4090

Usage:

model = load_model("video_wan_lora_5090")
video_path = model.inference(
    input_image=Image.open("input.jpg"),
    lora_id="0_1",  # Motion LoRA
    seed=42
)
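
The LoRA ID scheme lends itself to a small lookup helper. The sketch below is hypothetical (neither `LORA_CATEGORIES` nor `lora_category` is part of the documented API) and assumes that every ID has the form `<category>_<index>`, with `"-1"` meaning "no LoRA" as used in the Usage Guide.

```python
from typing import Optional

# Category prefixes from the WAN Video feature list; "-1" means no LoRA.
LORA_CATEGORIES = {
    "0": "Basic",
    "1": "Talk",
    "2": "Pet",
    "3": "Landscape",
    "5": "Effects",
    "6": "Effects",
}


def lora_category(lora_id: str) -> Optional[str]:
    """Return the category for an ID such as "0_1", or None for "-1"."""
    if lora_id == "-1":
        return None
    prefix, sep, _ = lora_id.partition("_")
    if not sep or prefix not in LORA_CATEGORIES:
        raise ValueError(f"Unrecognized LoRA ID: {lora_id!r}")
    return LORA_CATEGORIES[prefix]


category = lora_category("0_1")
```

Rejecting unknown prefixes early gives a clearer error than letting a bad ID reach the model.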

Installation

Prerequisites

Python: 3.8 or higher
CUDA: 11.8 or higher
GPU: RTX 4090/5090 or similar (24GB+ VRAM recommended)
OS: Linux (tested), Windows (untested, expected to work)

Basic Installation

# Clone repository
git clone https://github.com/your-org/TypemovieInfer.git
cd TypemovieInfer

# Install from source (recommended)
pip install -e .

Installation with Optional Features

# Install all features
pip install -e ".[all]"

# Install specific feature sets
pip install -e ".[image]"        # Image generation only
pip install -e ".[video]"        # Video generation only
pip install -e ".[distributed]"  # Distributed/parallel features
pip install -e ".[dev]"          # Development dependencies

Verify Installation

python -c "from typemovie_infer import load_model, list_models; print(list_models())"

Quick Start

3-Line Image Generation

from typemovie_infer import load_model

model = load_model("flux_paraattn")
images = model.inference(prompt="A beautiful sunset over mountains", seed=42)

Complete Example

from typemovie_infer import list_models, load_model

# List available models
print("Available models:", list_models())

# Load FLUX model
model = load_model("flux_paraattn")

# Check model info
print("Model info:", model.model_info)

# List available styles
print("Available styles:", model.list_styles())

# Generate images
images = model.inference(
    prompt="A serene Japanese garden in autumn, 4K, highly detailed",
    num_images=4,
    seed=42,
    style_idx=0  # Film style
)

# Save images
for i, img in enumerate(images):
    img.save(f"output_{i}.png")
    print(f"Saved: output_{i}.png")

Usage Guide

Image Generation (FLUX)

from typemovie_infer import load_model

# Load model
model = load_model("flux_paraattn")

# List available LoRA styles
styles = model.list_styles()
print(f"Available styles: {styles}")

# Generate with specific style
images = model.inference(
    prompt="A futuristic cityscape",
    num_images=4,
    seed=42,
    style_idx=1,  # Realistic style
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=30
)

# Save results
for i, img in enumerate(images):
    img.save(f"flux_output_{i}.png")

Image Editing (Kontext)

from typemovie_infer import load_model
from PIL import Image

# Load model
model = load_model("kontext_paraattn")

# Load input image
input_img = Image.open("input.jpg")

# Edit image
images = model.inference(
    prompt="Add a beautiful sunset in the background",
    image=input_img,
    num_images=3,
    seed=42
)

# Save results
for i, img in enumerate(images):
    img.save(f"kontext_output_{i}.png")

Video Generation (WAN Video)

from typemovie_infer import load_model
from PIL import Image

# Load model (5090 variant for best performance)
model = load_model("video_wan_lora_5090")

# List available LoRAs
loras = model.list_loras()
print("Available LoRAs:")
for lora_id, info in loras.items():
    print(f"  {lora_id}: {info['description']}")

# Load input image
input_img = Image.open("input.jpg")

# Generate video with LoRA
video_path = model.inference(
    input_image=input_img,
    lora_id="0_1",  # Basic motion LoRA
    seed=42,
    num_frames=81,
    fps=16
)

print(f"Video generated: {video_path}")

# Generate without LoRA (native inference)
video_path_native = model.inference(
    input_image=input_img,
    lora_id="-1",  # No LoRA
    seed=42
)

Chinese Image Generation (Qwen)

from typemovie_infer import load_model

# Load model
model = load_model("qwen_ddp")

# List supported aspect ratios
ratios = model.list_ratios()
print(f"Supported ratios: {ratios}")

# Generate images with Chinese prompt
images = model.inference(
    prompt="一个美丽的中国山水画，水墨风格，高清",  # "A beautiful Chinese landscape painting, ink-wash style, high definition"
    num_images=4,
    seed=42,
    ratio="16:9"
)

# Save results
for i, img in enumerate(images):
    img.save(f"qwen_output_{i}.png")

Performance Benchmarks

Hardware Tested

RTX 4090: 8× GPUs, 24GB VRAM each
RTX 5090: 8× GPUs, 32GB VRAM each
CPU: Intel Xeon / AMD EPYC
RAM: 256GB+

Benchmark Results

Image Generation (FLUX)

Configuration	Resolution	Images	GPU Memory
2× RTX 4090	1024×1024	4	1.5GB/GPU
2× RTX 5090	1024×1024	4	1.5GB/GPU

Video Generation (WAN Video)

Model	GPUs	Frames	FPS	GPU Memory
video_wan_lora_4090	8× 4090	81	15	1.64GB/GPU
video_wan_lora_5090	8× 5090	81	15	1.65GB/GPU

Performance Comparison (5090 vs 4090):

Overall Performance: 12% faster on RTX 5090
Inference Phase: 27% faster on RTX 5090
Memory efficiency: Both GPUs use similar memory (~1.65GB per GPU for 8-GPU setup)
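
The gap between the 12% overall and 27% inference-phase figures is what one would expect if only part of the end-to-end runtime gets faster. The sketch below works through that arithmetic under two assumptions that the benchmark does not state: that "X% faster" means X% less wall-clock time, and that every phase other than inference takes the same time on both GPUs. The resulting share is an implication of those assumptions, not a measured number.

```python
def overall_time_saved(phase_fraction: float, phase_time_saved: float) -> float:
    """Fraction of total runtime saved when only one phase gets faster."""
    return phase_fraction * phase_time_saved


def implied_phase_fraction(overall_saved: float, phase_time_saved: float) -> float:
    """Share of total runtime the faster phase must occupy to explain the overall saving."""
    return overall_saved / phase_time_saved


fraction = implied_phase_fraction(0.12, 0.27)
print(f"Implied inference share of runtime: {fraction:.0%}")
```

Under these assumptions, inference would account for a bit under half of the end-to-end time, with the remainder spent in phases that do not benefit from the faster GPU.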

Key Findings

RTX 5090 provides significant speedup: 27% faster in inference phase for video generation
Memory efficiency: Both GPUs use similar memory (~1.65GB per GPU for 8-GPU setup)
Scalability: Linear scaling with number of GPUs for ParaAttn models
LoRA support: Dynamic LoRA switching enables diverse motion styles

Configuration

Configuration Files

Each model has a config.yaml file located at:

typemovie_infer/models/<model_directory>/config.yaml

Example: typemovie_infer/models/image_flux_paraattn/config.yaml

Configuration Structure

model:
  name: "flux_paraattn"           # Model name (used in load_model())
  display_name: "FLUX ParaAttn"   # Display name
  type: "image_generation"        # Type: image_generation or video_generation
  description: "High-quality text-to-image generation with Para-Attention"

# Model-specific configuration
flux:
  model_path: "/path/to/flux/model"
  gpu_num: 2                      # Number of GPUs (ParaAttn)
  quant_type: "qfloat8_e4m3fn"   # FP8 quantization
  height: 1024
  width: 1024
  num_inference_steps: 30
  guidance_scale: 3.5

# Output settings
output_dir: "results/flux_paraattn"
save_image_dir: "results/flux_paraattn/images"
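
A quick structural check on a parsed config can catch typos before a model is loaded. The sketch below operates on a plain dictionary shaped like the example above. Which keys are treated as required is an assumption made for illustration, and `validate_config` is a hypothetical helper, not part of the package.

```python
from typing import Any, Dict, List

# Assumed for this sketch: these keys are required in the `model` section.
REQUIRED_MODEL_KEYS = ("name", "display_name", "type", "description")
KNOWN_TYPES = ("image_generation", "video_generation")


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config passed."""
    model = config.get("model")
    if not isinstance(model, dict):
        return ["missing top-level 'model' section"]
    problems: List[str] = []
    for key in REQUIRED_MODEL_KEYS:
        if key not in model:
            problems.append(f"model.{key} is missing")
    if model.get("type") not in KNOWN_TYPES:
        problems.append(f"model.type must be one of {KNOWN_TYPES}")
    if "output_dir" not in config:
        problems.append("output_dir is missing")
    return problems


example = {
    "model": {
        "name": "flux_paraattn",
        "display_name": "FLUX ParaAttn",
        "type": "image_generation",
        "description": "High-quality text-to-image generation with Para-Attention",
    },
    "flux": {"gpu_num": 2, "quant_type": "qfloat8_e4m3fn"},
    "output_dir": "results/flux_paraattn",
}
problems = validate_config(example)
```

Returning a list of messages instead of raising on the first issue lets all problems in a file be reported at once.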

Key Configuration Parameters

For ParaAttn Models (FLUX, Kontext, WAN)

gpu_num: Number of GPUs to use
quant_type: Quantization type (e.g., qfloat8_e4m3fn)
model_path: Absolute path to model weights

For DDP Models (Qwen)

world_size: Number of GPUs for distributed inference
model_path: Absolute path to model weights

For Video Models (WAN)

num_frames: Number of frames to generate (default: 81)
fps: Frames per second (default: 15-16)
high_cpu_memory: Enable CPU offloading (default: true)
parameters_level: Offload granularity (1-3)
keep_lora_gpu: Keep LoRA weights on GPU (default: false)
quant_model: Enable FP8 quantization (default: true)

LoRA Configuration (WAN Video)

loras:
  "0_1": "motion_basic_v1.safetensors"
  "1_1": "talking_animation_v1.safetensors"
  "2_1": "pet_movement_v1.safetensors"
  # ...
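
To illustrate what LoRA cache management could look like, here is a minimal least-recently-used cache keyed by the LoRA IDs above. It is a sketch under assumptions: `LoraCache` is a hypothetical class, the loader is a stand-in that returns a string instead of real weights, and the framework's own eviction policy may differ.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class LoraCache:
    """Keep at most `capacity` LoRA weight sets, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], Any], capacity: int = 2) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loader = loader
        self._capacity = capacity
        self._loaded: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, lora_id: str) -> Any:
        if lora_id in self._loaded:
            self._loaded.move_to_end(lora_id)  # mark as most recently used
            return self._loaded[lora_id]
        weights = self._loader(lora_id)
        self._loaded[lora_id] = weights
        if len(self._loaded) > self._capacity:
            self._loaded.popitem(last=False)  # drop the least recently used
        return weights

    def loaded_ids(self) -> List[str]:
        return list(self._loaded)


LORAS = {
    "0_1": "motion_basic_v1.safetensors",
    "1_1": "talking_animation_v1.safetensors",
    "2_1": "pet_movement_v1.safetensors",
}

cache = LoraCache(loader=lambda lora_id: f"<weights from {LORAS[lora_id]}>", capacity=2)
cache.get("0_1")
cache.get("1_1")
cache.get("0_1")  # refreshes "0_1"
cache.get("2_1")  # evicts "1_1", the least recently used entry
```

After these calls the cache holds `"0_1"` and `"2_1"`; a larger `capacity` trades memory for fewer reloads.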

Environment Variables

TypemovieInfer uses the following environment variables:

MASTER_ADDR: Master address for multi-GPU (default: '127.0.0.1')
MASTER_PORT: Master port for multi-GPU (default: '23456')
NCCL_DEBUG: NCCL debug level (set to 'INFO' for debugging)
NCCL_IB_DISABLE: Disable InfiniBand (set to '1' if needed)

⚠️ Port Conflict Warning: All handlers use the same default port (23456). To run multiple models concurrently, set a different MASTER_PORT for each:

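import os

from typemovie_infer import load_model

# Set MASTER_PORT before each load so every model gets its own port
os.environ['MASTER_PORT'] = '23457'
model1 = load_model("flux_paraattn")

os.environ['MASTER_PORT'] = '23458'
model2 = load_model("video_wan_lora_4090")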
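
Instead of hard-coding port numbers, a free port can be requested from the operating system by binding to port 0. The helper below is a generic sketch (`find_free_port` is not part of TypemovieInfer). Note that the port is released before it is used, so another process could in principle claim it in between.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = find_free_port()
os.environ["MASTER_PORT"] = str(port)  # set this before loading a model
```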

Troubleshooting

Common Issues

1. Model Not Found

Error: Model 'xxx' not found

Solution:

Check that config.yaml exists in the model directory
Verify the model name in config.yaml matches the name you're using
Use list_models() to see all available models

from typemovie_infer import list_models
print(list_models())

2. CUDA Out of Memory

Error: RuntimeError: CUDA out of memory

Solutions:

Reduce batch_size in config
Enable CPU offloading: high_cpu_memory: true
Increase parameters_level (1-3) for more aggressive offloading
Enable quantization: quant_model: true
Use fewer GPUs with smaller batch size

3. Multi-GPU Communication Failures

Error: RuntimeError: NCCL error or Address already in use

Solutions:

Verify NCCL installation:

import torch
print(torch.cuda.nccl.version())

Check MASTER_ADDR and MASTER_PORT are set correctly
Ensure gpu_num matches available GPUs
For port conflicts (EADDRINUSE), run models sequentially or set a different MASTER_PORT for each model

4. LoRA Not Loading (WAN Video)

Error: TypeError: LoraConfig.__init__() got an unexpected keyword argument 'lora_bias'

Cause: PEFT version incompatibility

Solutions:

Solution 1 (Recommended): Upgrade PEFT

pip install --upgrade "peft>=0.14.0"
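
To confirm whether the installed PEFT meets the minimum version, a check along these lines can help. It is a simplified sketch: the helpers are hypothetical, only the leading numeric components of the version are compared, and `packaging.version` would handle pre-release tags more rigorously.

```python
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '0.14.0' into (0, 14, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed: Optional[str], minimum: str) -> bool:
    return installed is not None and version_tuple(installed) >= version_tuple(minimum)


peft_version = installed_version("peft")
print(f"peft installed: {peft_version}, new enough: {meets_minimum(peft_version, '0.14.0')}")
```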

Solution 2: Apply patch to /opt/conda/lib/python3.11/site-packages/diffusers/utils/peft_utils.py

# In _create_lora_config function (around line 324)
try:
    lora_config_kwargs.pop("lora_bias", None)  # Remove unsupported parameter
    return LoraConfig(**lora_config_kwargs)
except TypeError as e:
    raise TypeError("`LoraConfig` class could not be instantiated.") from e

5. Qwen Models Import Error

Error: ImportError: cannot import name 'QwenImagePipeline' from 'diffusers'

Cause: Qwen support not available in current diffusers version

Solutions:

Update diffusers to a version with Qwen support
Check model-specific requirements.txt for compatible versions
Install model dependencies: pip install -r typemovie_infer/models/image_qwen_dist/requirements.txt

Debugging Tips

Enable verbose logging:

import os
os.environ['NCCL_DEBUG'] = 'INFO'

Check model info:

from typemovie_infer import load_model

model = load_model("flux_paraattn")
print(model.model_info)

Verify GPU availability:

import torch
print(f"GPUs available: {torch.cuda.device_count()}")

Check configuration:

from typemovie_infer import ModelRegistry
config = ModelRegistry.get_model_config("flux_paraattn")
print(config._raw_config)

Advanced Topics

Adding a New Model

Create model directory:
```
mkdir typemovie_infer/models/my_model
```

Create config.yaml:

model:
  name: "my_model"
  display_name: "My Model"
  type: "image_generation"
  description: "My custom model"

my_model:
  model_path: "/path/to/model"
  gpu_num: 2
  # ... other params

output_dir: "results/my_model"

Implement Handler in handler.py:

import os
from typemovie_infer.models.base import BaseHandler

class MyModelHandler(BaseHandler):
    def __init__(self, config):
        self.config = config
        os.environ['MASTER_ADDR'] = os.getenv('MASTER_ADDR', '127.0.0.1')
        os.environ['MASTER_PORT'] = os.getenv('MASTER_PORT', '23456')
        self._setup()

    def _setup(self):
        # Initialize model
        pass

    def inference(self, **kwargs):
        # Inference logic: replace this placeholder with the real model call
        outputs = []
        return outputs

    @property
    def model_info(self):
        return {
            'name': self.config.name,
            'type': self.config.type,
            'display_name': self.config.display_name,
        }

    def __call__(self, *args, **kwargs):
        return self.inference(*args, **kwargs)

Export Handler in __init__.py:

from .handler import MyModelHandler as Handler

Test your model:

from typemovie_infer import load_model
model = load_model("my_model")
result = model.inference(...)

Model Registry API

from typemovie_infer import ModelRegistry

# List all models
models = ModelRegistry.discover_models()

# Get model config
config = ModelRegistry.get_model_config("flux_paraattn")

# Load handler
handler = ModelRegistry.load_handler("flux_paraattn")

# Clear cache
ModelRegistry.clear_cache()

Performance Tuning

Optimize GPU memory:
- Use FP8 quantization: quant_type: "qfloat8_e4m3fn"
- Enable CPU offloading: high_cpu_memory: true
- Increase offload level: parameters_level: 3
Optimize speed:
- Use more GPUs (linear scaling with ParaAttn)
- Reduce num_inference_steps (trade-off with quality)
- Use RTX 5090 for 27% faster inference
Optimize quality:
- Increase num_inference_steps
- Tune guidance_scale
- Use appropriate LoRA for specific styles

Contributing

We welcome contributions! Please follow these guidelines:

Reporting Issues

Use GitHub Issues for bug reports and feature requests
Provide detailed information: OS, GPU, Python version, error messages
Include minimal reproducible code

Pull Requests

Fork the repository
Create a feature branch: git checkout -b feature/my-feature
Make your changes with clear commit messages
Add tests if applicable
Run code quality checks:
```
black .
flake8 .
isort .
pytest
```
Submit a pull request with a detailed description

Development Setup

# Clone repository
git clone https://github.com/your-org/TypemovieInfer.git
cd TypemovieInfer

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
isort .

# Lint code
flake8 .

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Built with PyTorch, Diffusers, and Transformers
Para-Attention optimization inspired by xDiT
Thanks to all contributors and the open-source community

Contact & Support

GitHub Issues: Report bugs or request features
Email: leeguandon@gamil.com
Documentation: Full documentation
Developer Guide: CLAUDE.md - Architecture and development
Test Reports: FINAL_TEST_REPORT.md - Comprehensive benchmarks

Made with ❤️ by TypeMovie Team

⭐ Star us on GitHub if this project helps you!

Chinese Documentation (中文文档) | Developer Guide | Test Report