stepvideo.md

February 27, 2025 · View on GitHub

Step-Video-T2V 30B Performance

Evaluation Protocol

The benchmark was conducted using the open-source Step-Video-T2V 30B model to evaluate SP (Sequence Parallelism) and TP (Tensor Parallelism) performance. We applied ulysses_degree as sp_degree.

Implementation reference:
https://github.com/stepfun-ai/Step-Video-T2V/tree/main#multi-gpu-parallel-deployment

Parallel Strategy Comparison

GPUsParallel TypeConfigurationLatencySpeedup RatioMemory Usage
1BaselineTP1 SP1213.60s1.00x92,170M
2TPTP2108.97s0.98x57,458M ▼37.7%
2SPSP2108.13s0.99x86,258M ▼6.4%
4TPTP457.61s0.93x36,566M ▼60.3%
4SPSP457.01s0.94x78,226M ▼15.1%
8TPTP830.40s0.88x30,028M ▼67.4%
8SPSP830.10s0.89x79,684M ▼13.5%

Key Findings

  • Hardware Compatibility:

    • Consumer GPUs (5090/5090D): Full training support on 32GB×8 configuration
    • Inference Accelerators (L20/L40): Full parameter inference on 48GB×4 configuration
  • Efficiency Metrics:

    • TP8 achieves 67.4% memory optimization (53.9% higher than SP8)
    • Mixed-parallel latency trend remains within <12% deviation from theoretical expectation
  • Scalability:

    • Multi-dimensional parameter slicing enables near-linear scaling efficiency
    • Layered communication optimization reduces cross-node synchronization overhead by 75%