TypemovieInfer

January 26, 2026


Production-Ready Multi-GPU Inference Framework for Consumer GPUs

Chinese Documentation | Developer Guide | Test Report



Overview

TypemovieInfer is a unified, production-ready inference framework designed for consumer-grade GPUs (RTX 4090/5090). It provides optimized multi-GPU parallelism for state-of-the-art image and video generation models.

Why TypemovieInfer?

  • 🚀 High Performance: Multi-GPU parallelism with Para-Attention acceleration; the inference phase of WAN video generation runs 27% faster on RTX 5090 than on RTX 4090
  • 🔌 Unified API: Single, consistent API for all models - just 3 lines of code to run any model
  • 💾 Memory Optimized: FP8 quantization and smart offloading strategies enable running 8B models on a single RTX 4090
  • 🎛️ Flexible Configuration: YAML-based configuration with hot-reload support for rapid iteration
  • 🔧 Extensible Architecture: Plugin-based system allows adding new models in minutes
  • 📦 Production Ready: 5 of 7 models tested successfully (71% coverage), with comprehensive testing and benchmarks

Architecture Highlights

  • Model Registry System: Dynamic model discovery and loading with two-level caching
  • Handler Pattern: Unified interface for all models (inference(), model_info, __call__)
  • Multiple Parallel Strategies:
    • ParaAttn (Para-Attention) for FLUX, Kontext, and WAN models
    • DDP (Distributed Data Parallel) for Qwen models
    • USP (Ulysses Sequence Parallel) support via xfuser
  • LoRA Management: Dynamic LoRA switching for video models with cache management
  • Memory Optimization: CPU offloading, FP8 quantization, and smart parameter management
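
The handler pattern listed above can be pictured with a small stand-in. This is a minimal sketch, not TypemovieInfer's actual base class: `SketchHandler` and `EchoHandler` are hypothetical names, and only the three members (`inference()`, `model_info`, `__call__`) come from this README.

```python
from abc import ABC, abstractmethod


class SketchHandler(ABC):
    """Hypothetical stand-in for a unified handler interface."""

    def __init__(self, name: str, model_type: str):
        self.name = name
        self.model_type = model_type

    @abstractmethod
    def inference(self, **kwargs):
        """Run the model; each concrete handler implements this."""

    @property
    def model_info(self) -> dict:
        return {"name": self.name, "type": self.model_type}

    def __call__(self, *args, **kwargs):
        # Calling the handler is shorthand for calling inference().
        return self.inference(*args, **kwargs)


class EchoHandler(SketchHandler):
    """Toy handler that returns its arguments instead of running a model."""

    def inference(self, **kwargs):
        return dict(kwargs)


handler = EchoHandler("echo", "image_generation")
result = handler(prompt="A beautiful landscape", seed=42)
print(handler.model_info, result)
```

Because every handler exposes the same three members, calling code does not need to know which model sits behind it.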

Key Features

1. Multi-GPU Acceleration

TypemovieInfer supports multiple parallelization strategies:

  • Para-Attention (ParaAttn): Splits attention computation across GPUs for FLUX, Kontext, and WAN models
  • Distributed Data Parallel (DDP): Distributes batch processing for Qwen models
  • Sequence Parallel (USP): Advanced sequence parallelism via xfuser integration
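
Sequence-style parallelism divides the token sequence among GPUs so that each rank handles one slice. The sketch below only illustrates that bookkeeping with a hypothetical helper, `shard_bounds`. The real partitioning inside ParaAttn or xfuser is more involved and is not reproduced here.

```python
from typing import List, Tuple


def shard_bounds(seq_len: int, world_size: int) -> List[Tuple[int, int]]:
    """Split range(seq_len) into world_size contiguous [start, end) shards."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    base, extra = divmod(seq_len, world_size)
    bounds = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token each.
        end = start + base + (1 if rank < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


print(shard_bounds(4096, 2))
print(shard_bounds(10, 4))
```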

2. Memory Optimization

Run large models on consumer GPUs:

  • FP8 Quantization: Reduce weight memory by about 50% relative to 16-bit (FP16/BF16) weights, with minimal quality loss
  • Smart Offloading: CPU offloading with configurable granularity (3 levels)
  • LoRA Cache Management: Efficient loading/unloading of LoRA weights
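
A rough calculation shows why FP8 matters on a 24GB card. The sketch below counts weight storage only. Activations, text encoders, and other components are ignored, and the 8B parameter count is taken from the overview rather than from any specific model.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


params = 8e9                           # an 8B-parameter model
bf16 = weight_memory_gib(params, 2)    # 16-bit weights: 2 bytes each
fp8 = weight_memory_gib(params, 1)     # FP8 weights: 1 byte each
print(f"BF16: {bf16:.1f} GiB, FP8: {fp8:.1f} GiB")
```

At 16 bits the weights alone take most of a 24GB card, while FP8 leaves room for activations and offloading buffers.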

3. Flexible Configuration

YAML-based configuration system:

  • Model-specific settings in config.yaml
  • Hot-reload support for rapid development
  • Easy parameter tuning without code changes
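
One common way to implement hot-reload is to re-read the file whenever its modification time changes. The sketch below is an assumption about the mechanism, not TypemovieInfer's implementation. `ReloadingConfig` is a hypothetical class, and JSON stands in for YAML so the example needs only the standard library.

```python
import json
import os
import tempfile


class ReloadingConfig:
    """Re-read a config file whenever its modification time changes."""

    def __init__(self, path, parse=json.loads):
        self.path = path
        self.parse = parse      # swap in a YAML parser for config.yaml
        self._mtime_ns = None
        self._data = None

    def get(self):
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as fh:
                self._data = self.parse(fh.read())
            self._mtime_ns = mtime_ns
        return self._data


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "config.json")
    with open(cfg_path, "w", encoding="utf-8") as fh:
        json.dump({"num_inference_steps": 30}, fh)
    cfg = ReloadingConfig(cfg_path)
    first = cfg.get()["num_inference_steps"]

    with open(cfg_path, "w", encoding="utf-8") as fh:
        json.dump({"num_inference_steps": 8}, fh)
    # Force a distinct timestamp so the demo does not depend on clock resolution.
    st = os.stat(cfg_path)
    os.utime(cfg_path, ns=(st.st_atime_ns, st.st_mtime_ns + 10_000_000_000))
    second = cfg.get()["num_inference_steps"]

print(first, second)
```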

4. Production Ready

Comprehensive testing and validation:

  • 5/7 models tested successfully (71% coverage)
  • Detailed performance benchmarks on 4090/5090
  • Comprehensive error handling and diagnostics

Supported Models

Image Generation Models

FLUX (ParaAttn)

  • Model: flux_paraattn
  • Type: Text-to-Image
  • GPUs: 2× RTX 4090/5090
  • Features:
    • High-quality image generation (1024×1024 default)
    • Multiple LoRA styles (Film, Realistic, Cartoon, Anime, Sketch, Pixar, Ghibli, etc.)
    • FP8 quantization enabled
    • Trigger words automatically prepended
  • Usage:
    model = load_model("flux_paraattn")
    images = model.inference(prompt="A beautiful landscape", style_idx=0)
    

Kontext (ParaAttn)

  • Model: kontext_paraattn
  • Type: Context-aware Image Editing
  • GPUs: 2× RTX 4090/5090
  • Features:
    • Text-to-image and image-to-image generation
    • Auto-concatenate multiple input images
    • Multiple aspect ratios (16:9, 9:16, 1:1, etc.)
  • Usage:
    model = load_model("kontext_paraattn")
    images = model.inference(prompt="Add sunset", image=input_img)
    

Qwen (DDP)

  • Models: qwen_ddp, qwen_edit_ddp
  • Type: Text-to-Image with Chinese Support
  • GPUs: Multiple GPUs via DDP
  • Features:
    • Chinese language support
    • Image editing variant available
    • Various aspect ratios
  • Status: ⚠️ Requires QwenImagePipeline from diffusers
  • Usage:
    model = load_model("qwen_ddp")
    images = model.inference(prompt="一个美丽的风景", ratio="16:9")  # prompt means "A beautiful landscape"
    

Video Generation Models

WAN Video (4090/5090 variants)

  • Models: video_wan_lora_4090, video_wan_lora_5090
  • Type: Image-to-Video (I2V)
  • GPUs: 8× RTX 4090 or 8× RTX 5090
  • Features:
    • 81 frames @ 15-16 fps
    • Multiple LoRA categories:
      • Basic (0_x): General motion
      • Talk (1_x): Talking animations
      • Pet (2_x): Pet movements
      • Landscape (3_x): Landscape animations
      • Effects (5_x, 6_x): Special effects
    • 4090 variant: Accelerated LoRA option (8 steps vs 30)
    • 5090 variant: Enhanced stability and 27% faster inference
    • FP8 quantization enabled
    • Memory optimization with CPU offloading
  • Performance: RTX 5090 is 27% faster than RTX 4090
  • Usage:
    model = load_model("video_wan_lora_5090")
    video_path = model.inference(
        input_image=Image.open("input.jpg"),
        lora_id="0_1",  # Motion LoRA
        seed=42
    )
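
The LoRA IDs encode their category in the part before the underscore. As a minimal sketch, a hypothetical helper `lora_category` (not part of the package) can map an ID back to the categories listed above. The `5` and `6` prefixes for effects and the `"-1"` value for native inference are read off this README and should be treated as assumptions about the naming scheme.

```python
# Prefix-to-category table transcribed from the feature list (assumed scheme).
LORA_CATEGORIES = {
    "0": "Basic",
    "1": "Talk",
    "2": "Pet",
    "3": "Landscape",
    "5": "Effects",
    "6": "Effects",
}


def lora_category(lora_id: str) -> str:
    """Return the category name for a LoRA ID such as '0_1'."""
    if lora_id == "-1":
        return "None (native inference)"
    prefix = lora_id.split("_", 1)[0]
    return LORA_CATEGORIES.get(prefix, "Unknown")


for example_id in ("0_1", "2_3", "-1", "9_9"):
    print(example_id, "->", lora_category(example_id))
```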

Installation

Prerequisites

  • Python: 3.8 or higher
  • CUDA: 11.8 or higher
  • GPU: RTX 4090/5090 or similar (24GB+ VRAM recommended)
  • OS: Linux (tested), Windows (untested, expected to work)

Basic Installation

# Clone repository
git clone https://github.com/your-org/TypemovieInfer.git
cd TypemovieInfer

# Install from source (recommended)
pip install -e .

Installation with Optional Features

# Install all features
pip install -e ".[all]"

# Install specific feature sets
pip install -e ".[image]"        # Image generation only
pip install -e ".[video]"        # Video generation only
pip install -e ".[distributed]"  # Distributed/parallel features
pip install -e ".[dev]"          # Development dependencies

Verify Installation

python -c "from typemovie_infer import load_model, list_models; print(list_models())"

Quick Start

3-Line Image Generation

from typemovie_infer import load_model

model = load_model("flux_paraattn")
images = model.inference(prompt="A beautiful sunset over mountains", seed=42)

Complete Example

from typemovie_infer import list_models, load_model

# List available models
print("Available models:", list_models())

# Load FLUX model
model = load_model("flux_paraattn")

# Check model info
print("Model info:", model.model_info)

# List available styles
print("Available styles:", model.list_styles())

# Generate images
images = model.inference(
    prompt="A serene Japanese garden in autumn, 4K, highly detailed",
    num_images=4,
    seed=42,
    style_idx=0  # Film style
)

# Save images
for i, img in enumerate(images):
    img.save(f"output_{i}.png")
    print(f"Saved: output_{i}.png")

Usage Guide

Image Generation (FLUX)

from typemovie_infer import load_model

# Load model
model = load_model("flux_paraattn")

# List available LoRA styles
styles = model.list_styles()
print(f"Available styles: {styles}")

# Generate with specific style
images = model.inference(
    prompt="A futuristic cityscape",
    num_images=4,
    seed=42,
    style_idx=1,  # Realistic style
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=30
)

# Save results
for i, img in enumerate(images):
    img.save(f"flux_output_{i}.png")

Image Editing (Kontext)

from typemovie_infer import load_model
from PIL import Image

# Load model
model = load_model("kontext_paraattn")

# Load input image
input_img = Image.open("input.jpg")

# Edit image
images = model.inference(
    prompt="Add a beautiful sunset in the background",
    image=input_img,
    num_images=3,
    seed=42
)

# Save results
for i, img in enumerate(images):
    img.save(f"kontext_output_{i}.png")

Video Generation (WAN Video)

from typemovie_infer import load_model
from PIL import Image

# Load model (5090 variant for best performance)
model = load_model("video_wan_lora_5090")

# List available LoRAs
loras = model.list_loras()
print("Available LoRAs:")
for lora_id, info in loras.items():
    print(f"  {lora_id}: {info['description']}")

# Load input image
input_img = Image.open("input.jpg")

# Generate video with LoRA
video_path = model.inference(
    input_image=input_img,
    lora_id="0_1",  # Basic motion LoRA
    seed=42,
    num_frames=81,
    fps=16
)

print(f"Video generated: {video_path}")

# Generate without LoRA (native inference)
video_path_native = model.inference(
    input_image=input_img,
    lora_id="-1",  # No LoRA
    seed=42
)

Chinese Image Generation (Qwen)

from typemovie_infer import load_model

# Load model
model = load_model("qwen_ddp")

# List supported aspect ratios
ratios = model.list_ratios()
print(f"Supported ratios: {ratios}")

# Generate images with a Chinese prompt
# (prompt means "A beautiful Chinese landscape painting, ink-wash style, high definition")
images = model.inference(
    prompt="一个美丽的中国山水画,水墨风格,高清",
    num_images=4,
    seed=42,
    ratio="16:9"
)

# Save results
for i, img in enumerate(images):
    img.save(f"qwen_output_{i}.png")

Performance Benchmarks

Hardware Tested

  • RTX 4090: 8× GPUs, 24GB VRAM each
  • RTX 5090: 8× GPUs, 32GB VRAM each
  • CPU: Intel Xeon / AMD EPYC
  • RAM: 256GB+

Benchmark Results

Image Generation (FLUX)

| Configuration | Resolution | Images | GPU Memory |
|---------------|------------|--------|------------|
| 2× RTX 4090   | 1024×1024  | 4      | 1.5GB/GPU  |
| 2× RTX 5090   | 1024×1024  | 4      | 1.5GB/GPU  |

Video Generation (WAN Video)

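| Model               | GPUs    | Frames | FPS | GPU Memory |
|---------------------|---------|--------|-----|------------|
| video_wan_lora_4090 | 8× 4090 | 81     | 15  | 1.64GB/GPU |
| video_wan_lora_5090 | 8× 5090 | 81     | 15  | 1.65GB/GPU |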

Performance Comparison (5090 vs 4090):

  • Overall Performance: 12% faster on RTX 5090
  • Inference Phase: 27% faster on RTX 5090
  • Memory Usage: Nearly identical on both (1.64GB vs 1.65GB per GPU in the 8-GPU setup)
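
The gap between the 12% overall figure and the 27% inference-phase figure is what one would expect if only part of the end-to-end time is spent in the inference phase. The arithmetic below is a sketch under two assumptions that this README does not state: "X% faster" means wall-clock time drops by X%, and all other phases take the same time on both GPUs. The function name is hypothetical.

```python
def implied_phase_fraction(overall_gain: float, phase_gain: float) -> float:
    """Fraction of baseline runtime spent in the phase that got faster.

    With T = T_phase + T_other and only T_phase shrinking by phase_gain,
    the overall saving is phase_gain * T_phase / T.
    """
    return overall_gain / phase_gain


fraction = implied_phase_fraction(0.12, 0.27)
print(f"Inference phase is about {fraction:.0%} of the 4090 end-to-end time")
```

Under those assumptions, a bit under half of the RTX 4090 run is the inference phase, with loading, offloading, and encoding making up the rest.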

Key Findings

  1. RTX 5090 provides significant speedup: 27% faster in inference phase for video generation
  2. Memory efficiency: Both GPUs use similar memory (~1.65GB per GPU for 8-GPU setup)
  3. Scalability: Linear scaling with number of GPUs for ParaAttn models
  4. LoRA support: Dynamic LoRA switching enables diverse motion styles

Configuration

Configuration Files

Each model has a config.yaml file located at:

typemovie_infer/models/<model_directory>/config.yaml

Example: typemovie_infer/models/image_flux_paraattn/config.yaml

Configuration Structure

model:
  name: "flux_paraattn"           # Model name (used in load_model())
  display_name: "FLUX ParaAttn"   # Display name
  type: "image_generation"        # Type: image_generation or video_generation
  description: "High-quality text-to-image generation with Para-Attention"

# Model-specific configuration
flux:
  model_path: "/path/to/flux/model"
  gpu_num: 2                      # Number of GPUs (ParaAttn)
  quant_type: "qfloat8_e4m3fn"   # FP8 quantization
  height: 1024
  width: 1024
  num_inference_steps: 30
  guidance_scale: 3.5

# Output settings
output_dir: "results/flux_paraattn"
save_image_dir: "results/flux_paraattn/images"
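
Before loading a model it can help to sanity-check the parsed configuration. The sketch below works on an already-parsed dictionary and uses a hypothetical `validate_config` helper. Which keys are actually mandatory is an assumption based on the example above, not a documented schema.

```python
ALLOWED_TYPES = {"image_generation", "video_generation"}


def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config passed."""
    model = cfg.get("model")
    if not isinstance(model, dict):
        return ["missing 'model' section"]
    problems = []
    for key in ("name", "display_name", "type", "description"):
        if not model.get(key):
            problems.append(f"model.{key} is missing or empty")
    if model.get("type") and model["type"] not in ALLOWED_TYPES:
        problems.append(f"model.type must be one of {sorted(ALLOWED_TYPES)}")
    if "output_dir" not in cfg:
        problems.append("output_dir is missing")
    return problems


good = {
    "model": {
        "name": "flux_paraattn",
        "display_name": "FLUX ParaAttn",
        "type": "image_generation",
        "description": "High-quality text-to-image generation",
    },
    "output_dir": "results/flux_paraattn",
}
bad = {"model": {"name": "my_model", "type": "audio_generation"}}

print(validate_config(good))
print(validate_config(bad))
```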

Key Configuration Parameters

For ParaAttn Models (FLUX, Kontext, WAN)

  • gpu_num: Number of GPUs to use
  • quant_type: Quantization type (e.g., qfloat8_e4m3fn)
  • model_path: Absolute path to model weights

For DDP Models (Qwen)

  • world_size: Number of GPUs (processes) for distributed inference
  • model_path: Absolute path to model weights

For Video Models (WAN)

  • num_frames: Number of frames to generate (default: 81)
  • fps: Frames per second (default: 15-16)
  • high_cpu_memory: Enable CPU offloading (default: true)
  • parameters_level: Offload granularity (1-3)
  • keep_lora_gpu: Keep LoRA weights on GPU (default: false)
  • quant_model: Enable FP8 quantization (default: true)
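
The video parameters above can be collected in one place with their documented defaults. `WanVideoSettings` is a hypothetical container, not a class from the package. The `fps` default of 16 and the `parameters_level` default of 1 are assumptions, since the README gives "15-16" and only a 1-3 range.

```python
from dataclasses import dataclass


@dataclass
class WanVideoSettings:
    num_frames: int = 81
    fps: int = 16                 # README lists 15-16; 16 is assumed here
    high_cpu_memory: bool = True
    parameters_level: int = 1     # assumed default; only the 1-3 range is documented
    keep_lora_gpu: bool = False
    quant_model: bool = True

    def __post_init__(self):
        if self.parameters_level not in (1, 2, 3):
            raise ValueError("parameters_level must be 1, 2, or 3")
        if self.num_frames < 1 or self.fps < 1:
            raise ValueError("num_frames and fps must be positive")

    @property
    def duration_seconds(self) -> float:
        """Clip length implied by the frame count and frame rate."""
        return self.num_frames / self.fps


settings = WanVideoSettings()
print(f"{settings.num_frames} frames at {settings.fps} fps = {settings.duration_seconds:.2f} s")
```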

LoRA Configuration (WAN Video)

loras:
  "0_1": "motion_basic_v1.safetensors"
  "1_1": "talking_animation_v1.safetensors"
  "2_1": "pet_movement_v1.safetensors"
  # ...

Environment Variables

TypemovieInfer uses the following environment variables:

  • MASTER_ADDR: Master address for multi-GPU (default: '127.0.0.1')
  • MASTER_PORT: Master port for multi-GPU (default: '23456')
  • NCCL_DEBUG: NCCL debug level (set to 'INFO' for debugging)
  • NCCL_IB_DISABLE: Disable InfiniBand (set to '1' if needed)

⚠️ Port Conflict Warning: All handlers use the same default port (23456). To run multiple models concurrently, set a different MASTER_PORT before loading each one:

import os

from typemovie_infer import load_model

os.environ['MASTER_PORT'] = '23457'
model1 = load_model("flux_paraattn")

os.environ['MASTER_PORT'] = '23458'
model2 = load_model("video_wan_lora_4090")
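
Instead of hard-coding port numbers, the operating system can be asked for a free one. This is a general-purpose sketch using only the standard library. `find_free_port` is a hypothetical helper, and a small race remains because another process could claim the port between this check and the model load.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


port = find_free_port()
os.environ["MASTER_PORT"] = str(port)  # set before loading the next model
print("MASTER_PORT =", os.environ["MASTER_PORT"])
```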

Troubleshooting

Common Issues

1. Model Not Found

Error: Model 'xxx' not found

Solution:

  • Check that config.yaml exists in the model directory
  • Verify the model name in config.yaml matches the name you're using
  • Use list_models() to see all available models
from typemovie_infer import list_models
print(list_models())
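
If a model is still missing from that list, a quick scan of the models directory shows which folders lack a config.yaml. The helper below, `find_model_dirs`, is a hypothetical diagnostic, and the demo runs on a throwaway directory tree rather than a real installation.

```python
import tempfile
from pathlib import Path


def find_model_dirs(models_root):
    """Map each subdirectory name to whether it contains config.yaml."""
    root = Path(models_root)
    return {
        child.name: (child / "config.yaml").is_file()
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


with tempfile.TemporaryDirectory() as tmp:
    flux_dir = Path(tmp) / "image_flux_paraattn"
    flux_dir.mkdir()
    (flux_dir / "config.yaml").write_text("model:\n  name: flux_paraattn\n")
    (Path(tmp) / "my_model").mkdir()  # config.yaml forgotten
    report = find_model_dirs(tmp)

print(report)
```

On a real checkout, point `find_model_dirs` at the `typemovie_infer/models` directory.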

2. CUDA Out of Memory

Error: RuntimeError: CUDA out of memory

Solutions:

  • Reduce batch_size in config
  • Enable CPU offloading: high_cpu_memory: true
  • Increase parameters_level (1-3) for more aggressive offloading
  • Enable quantization: quant_model: true
  • Use fewer GPUs with smaller batch size

3. Multi-GPU Communication Failures

Error: RuntimeError: NCCL error or Address already in use

Solutions:

  • Verify NCCL installation:
    import torch
    print(torch.cuda.nccl.version())
    
  • Check MASTER_ADDR and MASTER_PORT are set correctly
  • Ensure gpu_num matches available GPUs
  • For port conflicts (EADDRINUSE), run models sequentially or set a different MASTER_PORT for each model

4. LoRA Not Loading (WAN Video)

Error: TypeError: LoraConfig.__init__() got an unexpected keyword argument 'lora_bias'

Cause: PEFT version incompatibility

Solutions:

Solution 1 (Recommended): Upgrade PEFT

pip install --upgrade "peft>=0.14.0"

Solution 2: Patch diffusers/utils/peft_utils.py in your environment's site-packages (for example /opt/conda/lib/python3.11/site-packages/diffusers/utils/peft_utils.py)

# In _create_lora_config function (around line 324)
try:
    lora_config_kwargs.pop("lora_bias", None)  # Remove unsupported parameter
    return LoraConfig(**lora_config_kwargs)
except TypeError as e:
    raise TypeError("`LoraConfig` class could not be instantiated.") from e
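
To see whether the upgrade is needed at all, the installed PEFT version can be compared against the 0.14.0 minimum named above. The sketch uses `importlib.metadata` from the standard library. `parse_version` and `peft_status` are hypothetical helpers, and the parser is a deliberate simplification that reads only the leading numeric components.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Parse leading numeric components, e.g. '0.14.0.dev0' -> (0, 14, 0)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def peft_status(minimum: str = "0.14.0") -> str:
    """Describe whether the installed peft satisfies the minimum version."""
    try:
        installed = metadata.version("peft")
    except metadata.PackageNotFoundError:
        return "peft is not installed"
    ok = parse_version(installed) >= parse_version(minimum)
    return f"peft {installed} " + ("meets" if ok else "is older than") + f" {minimum}"


print(peft_status())
```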

5. Qwen Models Import Error

Error: ImportError: cannot import name 'QwenImagePipeline' from 'diffusers'

Cause: Qwen support not available in current diffusers version

Solutions:

  • Update diffusers to a version with Qwen support
  • Check model-specific requirements.txt for compatible versions
  • Install model dependencies: pip install -r typemovie_infer/models/image_qwen_dist/requirements.txt
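
Code that must run whether or not Qwen support is present can guard the import. This is a generic sketch. The `QWEN_AVAILABLE` flag is a hypothetical name, and catching `RuntimeError` as well is a precaution for lazy-import failures rather than documented behavior.

```python
try:
    from diffusers import QwenImagePipeline  # noqa: F401
    QWEN_AVAILABLE = True
except (ImportError, RuntimeError):
    # ImportError covers a missing diffusers install or a version without the class.
    QWEN_AVAILABLE = False

if not QWEN_AVAILABLE:
    print("QwenImagePipeline unavailable; skip qwen_ddp and qwen_edit_ddp.")
```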

Debugging Tips

  1. Enable verbose logging:

    import os
    os.environ['NCCL_DEBUG'] = 'INFO'
    
  2. Check model info:

    model = load_model("flux_paraattn")
    print(model.model_info)
    
  3. Verify GPU availability:

    import torch
    print(f"GPUs available: {torch.cuda.device_count()}")
    
  4. Check configuration:

    from typemovie_infer import ModelRegistry
    config = ModelRegistry.get_model_config("flux_paraattn")
    print(config._raw_config)
    

Advanced Topics

Adding a New Model

  1. Create model directory:

    mkdir typemovie_infer/models/my_model
    
  2. Create config.yaml:

    model:
      name: "my_model"
      display_name: "My Model"
      type: "image_generation"
      description: "My custom model"
    
    my_model:
      model_path: "/path/to/model"
      gpu_num: 2
      # ... other params
    
    output_dir: "results/my_model"
    
  3. Implement Handler in handler.py:

    import os
    from typemovie_infer.models.base import BaseHandler
    
    class MyModelHandler(BaseHandler):
        def __init__(self, config):
            self.config = config
            os.environ['MASTER_ADDR'] = os.getenv('MASTER_ADDR', '127.0.0.1')
            os.environ['MASTER_PORT'] = os.getenv('MASTER_PORT', '23456')
            self._setup()
    
        def _setup(self):
            # Initialize model
            pass
    
        def inference(self, **kwargs):
            # Inference logic: replace this placeholder with the real model call
            outputs = []
            return outputs
    
        @property
        def model_info(self):
            return {
                'name': self.config.name,
                'type': self.config.type,
                'display_name': self.config.display_name,
            }
    
        def __call__(self, *args, **kwargs):
            return self.inference(*args, **kwargs)
    
  4. Export Handler in __init__.py:

    from .handler import MyModelHandler as Handler
    
  5. Test your model:

    from typemovie_infer import load_model
    model = load_model("my_model")
    result = model.inference(...)
    

Model Registry API

from typemovie_infer import ModelRegistry

# List all models
models = ModelRegistry.discover_models()

# Get model config
config = ModelRegistry.get_model_config("flux_paraattn")

# Load handler
handler = ModelRegistry.load_handler("flux_paraattn")

# Clear cache
ModelRegistry.clear_cache()
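
The two-level caching mentioned in the architecture overview can be sketched as one cache for parsed configs and one for constructed handlers. This is an interpretation, not the package's code. `SketchRegistry` and its constructor arguments are hypothetical, and only the three method names mirror the API shown above.

```python
class SketchRegistry:
    """Hypothetical registry with a config cache and a handler cache."""

    def __init__(self, config_loader, handler_factory):
        self._config_loader = config_loader
        self._handler_factory = handler_factory
        self._configs = {}   # level 1: parsed configs, cheap to keep
        self._handlers = {}  # level 2: built handlers, expensive to create

    def get_model_config(self, name):
        if name not in self._configs:
            self._configs[name] = self._config_loader(name)
        return self._configs[name]

    def load_handler(self, name):
        if name not in self._handlers:
            config = self.get_model_config(name)
            self._handlers[name] = self._handler_factory(config)
        return self._handlers[name]

    def clear_cache(self):
        self._configs.clear()
        self._handlers.clear()


calls = {"config": 0, "handler": 0}


def load_config(name):
    calls["config"] += 1
    return {"name": name}


def build_handler(config):
    calls["handler"] += 1
    return object()


registry = SketchRegistry(load_config, build_handler)
first = registry.load_handler("flux_paraattn")
second = registry.load_handler("flux_paraattn")
print(first is second, calls)
```

The second `load_handler` call returns the cached object, so neither the config loader nor the handler factory runs again until `clear_cache()` is called.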

Performance Tuning

  1. Optimize GPU memory:

    • Use FP8 quantization: quant_type: "qfloat8_e4m3fn"
    • Enable CPU offloading: high_cpu_memory: true
    • Increase offload level: parameters_level: 3
  2. Optimize speed:

    • Use more GPUs (linear scaling with ParaAttn)
    • Reduce num_inference_steps (trade-off with quality)
    • Use RTX 5090 for 27% faster inference
  3. Optimize quality:

    • Increase num_inference_steps
    • Tune guidance_scale
    • Use appropriate LoRA for specific styles

Contributing

We welcome contributions! Please follow these guidelines:

Reporting Issues

  • Use GitHub Issues for bug reports and feature requests
  • Provide detailed information: OS, GPU, Python version, error messages
  • Include minimal reproducible code

Pull Requests

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Make your changes with clear commit messages
  4. Add tests if applicable
  5. Run code quality checks:
    black .
    flake8 .
    isort .
    pytest
    
  6. Submit pull request with detailed description

Development Setup

# Clone repository
git clone https://github.com/your-org/TypemovieInfer.git
cd TypemovieInfer

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
isort .

# Lint code
flake8 .

License

This project is licensed under the MIT License. See the LICENSE file for details.


Acknowledgments


Contact & Support


Made with ❤️ by TypeMovie Team

⭐ Star us on GitHub if this project helps you!
