TypemovieInfer
January 26, 2026 · View on GitHub
Production-Ready Multi-GPU Inference Framework for Consumer GPUs
Table of Contents
- Overview
- Key Features
- Supported Models
- Installation
- Quick Start
- Usage Guide
- Performance Benchmarks
- Configuration
- Troubleshooting
- Advanced Topics
- Contributing
- License
Overview
TypemovieInfer is a unified, production-ready inference framework designed for consumer-grade GPUs (RTX 4090/5090). It provides optimized multi-GPU parallelism for state-of-the-art image and video generation models.
Why TypemovieInfer?
- 🚀 High Performance: Multi-GPU parallelism with Para-Attention acceleration; in the WAN video benchmarks, the inference phase runs 27% faster on RTX 5090 than on RTX 4090
- 🔌 Unified API: Single, consistent API for all models - just 3 lines of code to run any model
- 💾 Memory Optimized: FP8 quantization and smart offloading strategies enable running 8B models on a single RTX 4090
- 🎛️ Flexible Configuration: YAML-based configuration with hot-reload support for rapid iteration
- 🔧 Extensible Architecture: Plugin-based system allows adding new models in minutes
- 📦 Production Ready: 5 of 7 models tested successfully (71% coverage), with benchmarks on RTX 4090 and RTX 5090
Architecture Highlights
- Model Registry System: Dynamic model discovery and loading with two-level caching
- Handler Pattern: Unified interface for all models (inference(), model_info, __call__)
- Multiple Parallel Strategies:
  - ParaAttn (Para-Attention) for FLUX, Kontext, and WAN models
  - DDP (Distributed Data Parallel) for Qwen models
  - USP (Ulysses Sequence Parallel) support via xfuser
- LoRA Management: Dynamic LoRA switching for video models with cache management
- Memory Optimization: CPU offloading, FP8 quantization, and smart parameter management
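The handler pattern listed above can be pictured with a minimal sketch. Only the three members (inference(), model_info, __call__) come from this README; `HandlerSketch` and `EchoHandler` are hypothetical names used for illustration and are not part of TypemovieInfer.

```python
from abc import ABC, abstractmethod


class HandlerSketch(ABC):
    """Hypothetical base class mirroring the three members named above."""

    @abstractmethod
    def inference(self, **kwargs):
        """Run the model and return its outputs."""

    @property
    @abstractmethod
    def model_info(self) -> dict:
        """Return static metadata about the model."""

    def __call__(self, *args, **kwargs):
        # Calling the handler is shorthand for calling inference().
        return self.inference(*args, **kwargs)


class EchoHandler(HandlerSketch):
    """Toy handler that echoes the prompt instead of generating images."""

    def inference(self, prompt="", **kwargs):
        return [f"generated:{prompt}"]

    @property
    def model_info(self) -> dict:
        return {"name": "echo", "type": "image_generation"}


handler = EchoHandler()
result = handler(prompt="a cat")
```

Because every handler exposes the same members, calling code does not need to know which model it is driving.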
Key Features
1. Multi-GPU Acceleration
TypemovieInfer supports multiple parallelization strategies:
- Para-Attention (ParaAttn): Splits attention computation across GPUs for FLUX, Kontext, and WAN models
- Distributed Data Parallel (DDP): Distributes batch processing for Qwen models
- Sequence Parallel (USP): Advanced sequence parallelism via xfuser integration
2. Memory Optimization
Run large models on consumer GPUs:
- FP8 Quantization: Stores weights in 8 bits instead of 16, roughly halving weight memory with minimal quality loss
- Smart Offloading: CPU offloading with configurable granularity (3 levels)
- LoRA Cache Management: Efficient loading/unloading of LoRA weights
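As a rough back-of-the-envelope check of the FP8 saving, the sketch below estimates weight memory only. It assumes a 16-bit baseline (FP16/BF16) and ignores activations, LoRA weights, and framework overhead, so real usage will differ; `weight_memory_gb` is a hypothetical helper, not a TypemovieInfer function.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (10^9 bytes), ignoring activations."""
    return num_params * bytes_per_param / 1e9


params_8b = 8e9  # the 8B model size mentioned in the overview
fp16_gb = weight_memory_gb(params_8b, 2)  # FP16/BF16: 2 bytes per weight
fp8_gb = weight_memory_gb(params_8b, 1)   # FP8: 1 byte per weight

print(f"FP16 weights: {fp16_gb:.0f} GB, FP8 weights: {fp8_gb:.0f} GB")
```

Under these assumptions an 8B model drops from about 16 GB to about 8 GB of weights, which leaves considerably more headroom on a 24 GB card.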
3. Flexible Configuration
YAML-based configuration system:
- Model-specific settings in config.yaml
- Hot-reload support for rapid development
- Easy parameter tuning without code changes
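One common way to implement hot reload is to re-read the file whenever its modification time changes. The sketch below shows that idea only; it is not TypemovieInfer's actual mechanism. `ReloadingConfig` is a hypothetical class, and it reads JSON rather than YAML purely to stay within the standard library.

```python
import json
import os
import tempfile


class ReloadingConfig:
    """Reload a config file whenever its modification time changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._data = {}

    def get(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:
            # File is new or has changed on disk: parse it again.
            with open(self.path, "r", encoding="utf-8") as f:
                self._data = json.load(f)
            self._mtime = mtime
        return self._data


# Demonstration with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "config.json")
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump({"num_inference_steps": 30}, f)

    cfg = ReloadingConfig(cfg_path)
    first = cfg.get()["num_inference_steps"]

    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump({"num_inference_steps": 8}, f)
    # Bump the mtime explicitly so the change is visible even on
    # filesystems with coarse timestamp resolution.
    stat = os.stat(cfg_path)
    os.utime(cfg_path, (stat.st_atime, stat.st_mtime + 1))

    second = cfg.get()["num_inference_steps"]
```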
4. Production Ready
Comprehensive testing and validation:
- 5/7 models tested successfully (71% coverage)
- Detailed performance benchmarks on 4090/5090
- Comprehensive error handling and diagnostics
Supported Models
Image Generation Models
FLUX (ParaAttn)
- Model: flux_paraattn
- Type: Text-to-Image
- GPUs: 2× RTX 4090/5090
- Features:
- High-quality image generation (1024×1024 default)
- Multiple LoRA styles (Film, Realistic, Cartoon, Anime, Sketch, Pixar, Ghibli, etc.)
- FP8 quantization enabled
- Trigger words automatically prepended
- Usage:
model = load_model("flux_paraattn")
images = model.inference(prompt="A beautiful landscape", style_idx=0)
Kontext (ParaAttn)
- Model: kontext_paraattn
- Type: Context-aware Image Editing
- GPUs: 2× RTX 4090/5090
- Features:
- Text-to-image and image-to-image generation
- Auto-concatenate multiple input images
- Multiple aspect ratios (16:9, 9:16, 1:1, etc.)
- Usage:
model = load_model("kontext_paraattn")
images = model.inference(prompt="Add sunset", image=input_img)
Qwen (DDP)
- Models: qwen_ddp, qwen_edit_ddp
- Type: Text-to-Image with Chinese Support
- GPUs: Multiple GPUs via DDP
- Features:
- Chinese language support
- Image editing variant available
- Various aspect ratios
- Status: ⚠️ Requires QwenImagePipeline from diffusers
- Usage:
model = load_model("qwen_ddp")
# Prompt means "A beautiful landscape"
images = model.inference(prompt="一个美丽的风景", ratio="16:9")
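A ratio string such as "16:9" has to be turned into a concrete width and height at some point. The README does not say how TypemovieInfer does this, so the following is a sketch under stated assumptions: roughly one megapixel of output (matching the 1024×1024 default) and dimensions rounded to a multiple of 16. `size_for_ratio` is a hypothetical helper.

```python
import math


def size_for_ratio(ratio: str, base: int = 1024, multiple: int = 16):
    """Return (width, height) with about base*base pixels for a 'W:H' ratio."""
    w_part, h_part = (int(x) for x in ratio.split(":"))
    scale = math.sqrt(w_part / h_part)
    # Scale width up and height down by the same factor, then snap to the grid.
    width = round(base * scale / multiple) * multiple
    height = round(base / scale / multiple) * multiple
    return width, height


sizes = {r: size_for_ratio(r) for r in ("1:1", "16:9", "9:16")}
print(sizes)
```

The resolutions actually used for each ratio are defined by the model, so treat list_ratios() as the source of truth.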
Video Generation Models
WAN Video (4090/5090 variants)
- Models: video_wan_lora_4090, video_wan_lora_5090
- Type: Image-to-Video (I2V)
- GPUs: 8× RTX 4090 or 8× RTX 5090
- Features:
  - 81 frames @ 15-16 fps
  - Multiple LoRA categories:
    - Basic (0_x): General motion
    - Talk (1_x): Talking animations
    - Pet (2_x): Pet movements
    - Landscape (3_x): Landscape animations
    - Effects (5_x, 6_x): Special effects
  - 4090 variant: Accelerated LoRA option (8 steps vs 30)
  - 5090 variant: Enhanced stability and 27% faster inference
  - FP8 quantization enabled
  - Memory optimization with CPU offloading
- Performance: RTX 5090 is 27% faster than RTX 4090
- Usage:
from PIL import Image
from typemovie_infer import load_model
model = load_model("video_wan_lora_5090")
video_path = model.inference(
    input_image=Image.open("input.jpg"),
    lora_id="0_1",  # Motion LoRA
    seed=42
)
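The LoRA IDs follow a category-prefix scheme, which can be decoded with a few lines of code. The sketch below is illustrative only: `describe_lora` and `LORA_CATEGORIES` are hypothetical names, the prefix-to-category mapping is taken from the feature list above, and "-1" is treated as "no LoRA" as in the usage guide.

```python
LORA_CATEGORIES = {
    "0": "Basic",
    "1": "Talk",
    "2": "Pet",
    "3": "Landscape",
    "5": "Effects",
    "6": "Effects",
}


def describe_lora(lora_id: str) -> str:
    """Map a LoRA ID such as '0_1' to its category name."""
    if lora_id == "-1":
        return "No LoRA"
    # The category is the part before the first underscore.
    prefix = lora_id.split("_", 1)[0]
    return LORA_CATEGORIES.get(prefix, "Unknown")


for example in ("0_1", "2_3", "-1"):
    print(example, "->", describe_lora(example))
```

For the authoritative list of IDs and descriptions, use model.list_loras().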
Installation
Prerequisites
- Python: 3.8 or higher
- CUDA: 11.8 or higher
- GPU: RTX 4090/5090 or similar (24GB+ VRAM recommended)
- OS: Linux (tested), Windows (untested; expected to work)
Basic Installation
# Clone repository
git clone https://github.com/your-org/TypemovieInfer.git
cd TypemovieInfer
# Install from source (recommended)
pip install -e .
Installation with Optional Features
# Install all features
pip install -e ".[all]"
# Install specific feature sets
pip install -e ".[image]" # Image generation only
pip install -e ".[video]" # Video generation only
pip install -e ".[distributed]" # Distributed/parallel features
pip install -e ".[dev]" # Development dependencies
Verify Installation
python -c "from typemovie_infer import load_model, list_models; print(list_models())"
Quick Start
3-Line Image Generation
from typemovie_infer import load_model
model = load_model("flux_paraattn")
images = model.inference(prompt="A beautiful sunset over mountains", seed=42)
Complete Example
from typemovie_infer import list_models, load_model
# List available models
print("Available models:", list_models())
# Load FLUX model
model = load_model("flux_paraattn")
# Check model info
print("Model info:", model.model_info)
# List available styles
print("Available styles:", model.list_styles())
# Generate images
images = model.inference(
    prompt="A serene Japanese garden in autumn, 4K, highly detailed",
    num_images=4,
    seed=42,
    style_idx=0  # Film style
)
# Save images
for i, img in enumerate(images):
    img.save(f"output_{i}.png")
    print(f"Saved: output_{i}.png")
Usage Guide
Image Generation (FLUX)
from typemovie_infer import load_model
# Load model
model = load_model("flux_paraattn")
# List available LoRA styles
styles = model.list_styles()
print(f"Available styles: {styles}")
# Generate with specific style
images = model.inference(
    prompt="A futuristic cityscape",
    num_images=4,
    seed=42,
    style_idx=1,  # Realistic style
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=30
)
# Save results
for i, img in enumerate(images):
    img.save(f"flux_output_{i}.png")
Image Editing (Kontext)
from typemovie_infer import load_model
from PIL import Image
# Load model
model = load_model("kontext_paraattn")
# Load input image
input_img = Image.open("input.jpg")
# Edit image
images = model.inference(
    prompt="Add a beautiful sunset in the background",
    image=input_img,
    num_images=3,
    seed=42
)
# Save results
for i, img in enumerate(images):
    img.save(f"kontext_output_{i}.png")
Video Generation (WAN Video)
from typemovie_infer import load_model
from PIL import Image
# Load model (5090 variant for best performance)
model = load_model("video_wan_lora_5090")
# List available LoRAs
loras = model.list_loras()
print("Available LoRAs:")
for lora_id, info in loras.items():
    print(f" {lora_id}: {info['description']}")
# Load input image
input_img = Image.open("input.jpg")
# Generate video with LoRA
video_path = model.inference(
    input_image=input_img,
    lora_id="0_1",  # Basic motion LoRA
    seed=42,
    num_frames=81,
    fps=16
)
print(f"Video generated: {video_path}")
# Generate without LoRA (native inference)
video_path_native = model.inference(
    input_image=input_img,
    lora_id="-1",  # No LoRA
    seed=42
)
Chinese Image Generation (Qwen)
from typemovie_infer import load_model
# Load model
model = load_model("qwen_ddp")
# List supported aspect ratios
ratios = model.list_ratios()
print(f"Supported ratios: {ratios}")
# Generate images with Chinese prompt
images = model.inference(
    # Prompt means "A beautiful Chinese landscape painting, ink-wash style, high definition"
    prompt="一个美丽的中国山水画,水墨风格,高清",
    num_images=4,
    seed=42,
    ratio="16:9"
)
# Save results
for i, img in enumerate(images):
    img.save(f"qwen_output_{i}.png")
Performance Benchmarks
Hardware Tested
- RTX 4090: 8× GPUs, 24GB VRAM each
- RTX 5090: 8× GPUs, 32GB VRAM each
- CPU: Intel Xeon / AMD EPYC
- RAM: 256GB+
Benchmark Results
Image Generation (FLUX)
| Configuration | Resolution | Images | GPU Memory |
|---|---|---|---|
| 2× RTX 4090 | 1024×1024 | 4 | 1.5GB/GPU |
| 2× RTX 5090 | 1024×1024 | 4 | 1.5GB/GPU |
Video Generation (WAN Video)
| Model | GPUs | Frames | FPS | GPU Memory |
|---|---|---|---|---|
| video_wan_lora_4090 | 8× 4090 | 81 | 15 | 1.64GB/GPU |
| video_wan_lora_5090 | 8× 5090 | 81 | 15 | 1.65GB/GPU |
Performance Comparison (5090 vs 4090):
- Overall Performance: 12% faster on RTX 5090
- Inference Phase: 27% faster on RTX 5090
- Memory efficiency: Both GPUs use similar memory (~1.65GB per GPU for 8-GPU setup)
Key Findings
- RTX 5090 provides significant speedup: 27% faster in inference phase for video generation
- Memory efficiency: Both GPUs use similar memory (~1.65GB per GPU for 8-GPU setup)
- Scalability: Linear scaling with number of GPUs for ParaAttn models
- LoRA support: Dynamic LoRA switching enables diverse motion styles
Configuration
Configuration Files
Each model has a config.yaml file located at:
typemovie_infer/models/<model_directory>/config.yaml
Example: typemovie_infer/models/image_flux_paraattn/config.yaml
Configuration Structure
model:
  name: "flux_paraattn"            # Model name (used in load_model())
  display_name: "FLUX ParaAttn"    # Display name
  type: "image_generation"         # Type: image_generation or video_generation
  description: "High-quality text-to-image generation with Para-Attention"
# Model-specific configuration
flux:
  model_path: "/path/to/flux/model"
  gpu_num: 2                       # Number of GPUs (ParaAttn)
  quant_type: "qfloat8_e4m3fn"     # FP8 quantization
  height: 1024
  width: 1024
  num_inference_steps: 30
  guidance_scale: 3.5
# Output settings
output_dir: "results/flux_paraattn"
save_image_dir: "results/flux_paraattn/images"
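Before loading a model, it can help to sanity-check a parsed config against the structure shown above. The sketch below works on a plain dictionary (for example, the result of parsing config.yaml). `validate_config` is a hypothetical helper, and treating all four `model` keys as required is an assumption; the framework's own validation may differ.

```python
REQUIRED_MODEL_KEYS = ("name", "display_name", "type", "description")
VALID_TYPES = ("image_generation", "video_generation")


def validate_config(config: dict) -> list:
    """Return a list of problems found in a parsed config dictionary."""
    problems = []
    model = config.get("model")
    if not isinstance(model, dict):
        return ["missing 'model' section"]
    for key in REQUIRED_MODEL_KEYS:
        if key not in model:
            problems.append(f"model.{key} is missing")
    if "type" in model and model["type"] not in VALID_TYPES:
        problems.append(f"model.type must be one of {VALID_TYPES}")
    return problems


good = {"model": {"name": "flux_paraattn", "display_name": "FLUX ParaAttn",
                  "type": "image_generation", "description": "demo"}}
bad = {"model": {"name": "x", "type": "audio"}}

print(validate_config(good))
print(validate_config(bad))
```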
Key Configuration Parameters
For ParaAttn Models (FLUX, Kontext, WAN)
- gpu_num: Number of GPUs to use
- quant_type: Quantization type (e.g., qfloat8_e4m3fn)
- model_path: Absolute path to model weights
For DDP Models (Qwen)
- world_size: Number of GPUs for distributed inference
- model_path: Absolute path to model weights
For Video Models (WAN)
- num_frames: Number of frames to generate (default: 81)
- fps: Frames per second (default: 15-16)
- high_cpu_memory: Enable CPU offloading (default: true)
- parameters_level: Offload granularity (1-3)
- keep_lora_gpu: Keep LoRA weights on GPU (default: false)
- quant_model: Enable FP8 quantization (default: true)
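The num_frames and fps settings together determine how long the generated clip plays. A small worked example, using the defaults listed above (`clip_seconds` is a hypothetical helper):

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback duration of a clip in seconds."""
    return num_frames / fps


at_15 = clip_seconds(81, 15)
at_16 = clip_seconds(81, 16)

print(f"81 frames at 15 fps: {at_15:.2f} s")
print(f"81 frames at 16 fps: {at_16:.2f} s")
```

So the default 81 frames give a clip of a little over five seconds at either frame rate.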
LoRA Configuration (WAN Video)
loras:
  "0_1": "motion_basic_v1.safetensors"
  "1_1": "talking_animation_v1.safetensors"
  "2_1": "pet_movement_v1.safetensors"
  # ...
Environment Variables
TypemovieInfer uses the following environment variables:
- MASTER_ADDR: Master address for multi-GPU (default: '127.0.0.1')
- MASTER_PORT: Master port for multi-GPU (default: '23456')
- NCCL_DEBUG: NCCL debug level (set to 'INFO' for debugging)
- NCCL_IB_DISABLE: Disable InfiniBand (set to '1' if needed)
⚠️ Port Conflict Warning: All handlers use the same default port (23456). To run multiple models concurrently, set a different MASTER_PORT before loading each one:
import os
os.environ['MASTER_PORT'] = '23457'
model1 = load_model("flux_paraattn")
os.environ['MASTER_PORT'] = '23458'
model2 = load_model("video_wan_lora_4090")
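Instead of hard-coding port numbers, one option is to ask the operating system for an unused port and export it as MASTER_PORT. This is a generic sketch using the standard library; `find_free_port` is a hypothetical helper, not a TypemovieInfer function. Note the small race: another process could claim the port between this check and the moment the model binds to it.

```python
import os
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a TCP port that is currently unused on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


port = find_free_port()
os.environ["MASTER_PORT"] = str(port)
print(f"MASTER_PORT set to {port}")
```

Call this before each load_model() so that every model gets its own port.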
Troubleshooting
Common Issues
1. Model Not Found
Error: Model 'xxx' not found
Solution:
- Check that config.yaml exists in the model directory
- Verify the model name in config.yaml matches the name you're using
- Use list_models() to see all available models
from typemovie_infer import list_models
print(list_models())
2. CUDA Out of Memory
Error: RuntimeError: CUDA out of memory
Solutions:
- Reduce batch_size in config
- Enable CPU offloading: high_cpu_memory: true
- Increase parameters_level (1-3) for more aggressive offloading
- Enable quantization: quant_model: true
- Use fewer GPUs with smaller batch size
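The memory-saving options above can be applied step by step rather than all at once. The sketch below generates successively more aggressive settings from a starting config. `oom_fallbacks` is a hypothetical helper, and the order (quantization first, then offloading, then higher offload levels) is an assumption rather than a documented recommendation.

```python
def oom_fallbacks(config: dict):
    """Yield progressively more memory-saving copies of a config."""
    current = dict(config)
    if not current.get("quant_model", False):
        current["quant_model"] = True
        yield dict(current)
    if not current.get("high_cpu_memory", False):
        current["high_cpu_memory"] = True
        yield dict(current)
    level = current.get("parameters_level", 1)
    for next_level in range(level + 1, 4):  # levels run from 1 to 3
        current["parameters_level"] = next_level
        yield dict(current)


steps = list(oom_fallbacks({"quant_model": True, "high_cpu_memory": False,
                            "parameters_level": 1}))
for step in steps:
    print(step)
```

Each yielded dictionary could be written back to config.yaml and retried until the out-of-memory error disappears.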
3. Multi-GPU Communication Failures
Error: RuntimeError: NCCL error or Address already in use
Solutions:
- Verify NCCL installation:
  import torch
  print(torch.cuda.nccl.version())
- Check MASTER_ADDR and MASTER_PORT are set correctly
- Ensure gpu_num matches available GPUs
- For port conflicts (EADDRINUSE), run models sequentially or set different MASTER_PORT
4. LoRA Not Loading (WAN Video)
Error: TypeError: LoraConfig.__init__() got an unexpected keyword argument 'lora_bias'
Cause: PEFT version incompatibility
Solutions:
Solution 1 (Recommended): Upgrade PEFT
# Quote the requirement so the shell does not treat ">" as a redirect
pip install --upgrade "peft>=0.14.0"
Solution 2: Apply patch to /opt/conda/lib/python3.11/site-packages/diffusers/utils/peft_utils.py
# In _create_lora_config function (around line 324)
try:
    lora_config_kwargs.pop("lora_bias", None)  # Remove unsupported parameter
    return LoraConfig(**lora_config_kwargs)
except TypeError as e:
    raise TypeError("`LoraConfig` class could not be instantiated.") from e
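To decide between the two solutions, it may help to check which PEFT version is installed. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8). `version_tuple` and `peft_needs_upgrade` are hypothetical helpers; the 0.14.0 threshold comes from the upgrade command above, and the version parsing is deliberately simple.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '0.14.0' into (0, 14, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def peft_needs_upgrade(minimum: str = "0.14.0"):
    """Return True/False, or None when peft is not installed."""
    try:
        installed = metadata.version("peft")
    except metadata.PackageNotFoundError:
        return None
    return version_tuple(installed) < version_tuple(minimum)


print("PEFT needs upgrade:", peft_needs_upgrade())
```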
5. Qwen Models Import Error
Error: ImportError: cannot import name 'QwenImagePipeline' from 'diffusers'
Cause: Qwen support not available in current diffusers version
Solutions:
- Update diffusers to a version with Qwen support
- Check model-specific requirements.txt for compatible versions
- Install model dependencies:
  pip install -r typemovie_infer/models/image_qwen_dist/requirements.txt
Debugging Tips
- Enable verbose logging:
  import os
  os.environ['NCCL_DEBUG'] = 'INFO'
- Check model info:
  model = load_model("flux_paraattn")
  print(model.model_info)
- Verify GPU availability:
  import torch
  print(f"GPUs available: {torch.cuda.device_count()}")
- Check configuration:
  from typemovie_infer import ModelRegistry
  config = ModelRegistry.get_model_config("flux_paraattn")
  print(config._raw_config)
Advanced Topics
Adding a New Model
- Create model directory:
  mkdir typemovie_infer/models/my_model
- Create config.yaml:
  model:
    name: "my_model"
    display_name: "My Model"
    type: "image_generation"
    description: "My custom model"
  my_model:
    model_path: "/path/to/model"
    gpu_num: 2
    # ... other params
  output_dir: "results/my_model"
- Implement Handler in handler.py:
  import os
  from typemovie_infer.models.base import BaseHandler
  class MyModelHandler(BaseHandler):
      def __init__(self, config):
          self.config = config
          # Keep any values already set in the environment
          os.environ['MASTER_ADDR'] = os.getenv('MASTER_ADDR', '127.0.0.1')
          os.environ['MASTER_PORT'] = os.getenv('MASTER_PORT', '23456')
          self._setup()
      def _setup(self):
          # Initialize model
          pass
      def inference(self, **kwargs):
          # Inference logic; replace the placeholder with real outputs
          outputs = []
          return outputs
      @property
      def model_info(self):
          return {
              'name': self.config.name,
              'type': self.config.type,
              'display_name': self.config.display_name,
          }
      def __call__(self, *args, **kwargs):
          return self.inference(*args, **kwargs)
- Export Handler in __init__.py:
  from .handler import MyModelHandler as Handler
- Test your model:
  from typemovie_infer import load_model
  model = load_model("my_model")
  result = model.inference(...)
Model Registry API
from typemovie_infer import ModelRegistry
# List all models
models = ModelRegistry.discover_models()
# Get model config
config = ModelRegistry.get_model_config("flux_paraattn")
# Load handler
handler = ModelRegistry.load_handler("flux_paraattn")
# Clear cache
ModelRegistry.clear_cache()
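The two-level caching mentioned in the architecture highlights could look roughly like the sketch below. The README does not say what the two levels are, so this assumes one cache for parsed configs and one for loaded handlers. `RegistrySketch` and `fake_loader` are hypothetical; only the method names get_model_config, load_handler, and clear_cache mirror the API above.

```python
class RegistrySketch:
    """Hypothetical registry with separate config and handler caches."""

    def __init__(self, config_loader, handler_factory):
        self._config_loader = config_loader      # name -> config dict
        self._handler_factory = handler_factory  # config dict -> handler
        self._config_cache = {}   # level 1: parsed configs (cheap)
        self._handler_cache = {}  # level 2: loaded handlers (expensive)

    def get_model_config(self, name):
        if name not in self._config_cache:
            self._config_cache[name] = self._config_loader(name)
        return self._config_cache[name]

    def load_handler(self, name):
        if name not in self._handler_cache:
            config = self.get_model_config(name)
            self._handler_cache[name] = self._handler_factory(config)
        return self._handler_cache[name]

    def clear_cache(self):
        self._config_cache.clear()
        self._handler_cache.clear()


calls = []


def fake_loader(name):
    # Record each call so cache hits can be observed.
    calls.append(name)
    return {"name": name}


registry = RegistrySketch(fake_loader, lambda cfg: object())
first = registry.load_handler("flux_paraattn")
second = registry.load_handler("flux_paraattn")
```

The second load_handler() call returns the cached handler without touching the loader again; clear_cache() forces both levels to be rebuilt on the next request.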
Performance Tuning
- Optimize GPU memory:
  - Use FP8 quantization: quant_type: "qfloat8_e4m3fn"
  - Enable CPU offloading: high_cpu_memory: true
  - Increase offload level: parameters_level: 3
- Optimize speed:
  - Use more GPUs (linear scaling with ParaAttn)
  - Reduce num_inference_steps (trade-off with quality)
  - Use RTX 5090 for 27% faster inference
- Optimize quality:
  - Increase num_inference_steps
  - Tune guidance_scale
  - Use appropriate LoRA for specific styles
Contributing
We welcome contributions! Please follow these guidelines:
Reporting Issues
- Use GitHub Issues for bug reports and feature requests
- Provide detailed information: OS, GPU, Python version, error messages
- Include minimal reproducible code
Pull Requests
- Fork the repository
- Create a feature branch: git checkout -b feature/my-feature
- Make your changes with clear commit messages
- Add tests if applicable
- Run code quality checks:
  black .
  flake8 .
  isort .
  pytest
- Submit pull request with detailed description
Development Setup
# Clone repository
git clone https://github.com/your-org/TypemovieInfer.git
cd TypemovieInfer
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black .
isort .
# Lint code
flake8 .
License
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgments
- Built with PyTorch, Diffusers, and Transformers
- Para-Attention optimization inspired by xDiT
- Thanks to all contributors and the open-source community
Contact & Support
- GitHub Issues: Report bugs or request features
- Email: leeguandon@gmail.com
- Documentation: Full documentation
- Developer Guide: CLAUDE.md - Architecture and development
- Test Reports: FINAL_TEST_REPORT.md - Comprehensive benchmarks
Made with ❤️ by TypeMovie Team
⭐ Star us on GitHub if this project helps you!