Parallel Scaling Law for Language Model

May 17, 2025

Yet Another Scaling Law beyond Parameters and Inference Time Scaling

💡 Key Findings | 📈 Scaling Law | ⚡ Cost Analysis | 🔥 Models | 📚 Citation

🌟 About

  • It is commonly believed that scaling language models incurs a heavy cost in either space (parameter scaling) or time (inference-time scaling).
  • We introduce a third scaling paradigm for LLMs, Parallel Scaling (ParScale), which leverages parallel computation during both training and inference.
  • We apply $P$ diverse and learnable transformations to the input, execute forward passes of the model in parallel, and dynamically aggregate the $P$ outputs.
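
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the released implementation: the additive `shifts` standing in for the learnable input transformations, the `score_fn` used to derive aggregation weights, and the function name `parscale_forward` are all made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def parscale_forward(model, x, shifts, score_fn):
    """Toy ParScale-style forward pass for a single input vector x.

    model    : callable mapping a list of floats to a list of floats
    shifts   : P vectors, a stand-in for P learnable input transformations (assumption)
    score_fn : maps one stream's output to a scalar score (assumption)
    """
    # 1) Apply P different transformations to the same input.
    streams = [[xi + si for xi, si in zip(x, shift)] for shift in shifts]
    # 2) Run the same model on every stream (a real implementation would batch these).
    outputs = [model(stream) for stream in streams]
    # 3) Aggregate dynamically: the weights depend on the outputs themselves.
    weights = softmax([score_fn(out) for out in outputs])
    dim = len(outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]

# Tiny demo with a hand-written "model" and P = 2 streams.
def toy_model(v):
    return [2.0 * a for a in v]

y = parscale_forward(toy_model, [1.0, 2.0], shifts=[[0.0, 0.0], [1.0, 1.0]], score_fn=sum)
print(y)
```

With a single stream and a zero shift, the function reduces to an ordinary forward pass, which is a quick sanity check that $P=1$ corresponds to the baseline.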

💡 Key Findings

Here are the core insights and benefits distilled from our theoretical analysis and empirical evaluations:

📈 Logarithmic Scaling Law: We theoretically and empirically establish that scaling with $P$ parallel streams is comparable to scaling the number of parameters by $O(\log P)$. This suggests that parallel computation can serve as an efficient substitute for parameter growth, especially for larger models.

✅ Universal Applicability: Unlike inference-time scaling, which requires specialized data and has limited applicability, ParScale works with any model architecture, optimization method, data, or downstream task.

🧠 Stronger Performance on Reasoning Tasks: Reasoning-intensive tasks (e.g., coding or math) benefit more from ParScale, which suggests that scaling computation can effectively push the boundary of reasoning.

⚡ Superior Inference Efficiency: For the same performance improvement, ParScale incurs up to 22x less memory increase and 6x less latency increase than parameter scaling (batch size = 1).

🧱 Cost-Efficient Training via Two-Stage Strategy: Training a parallel-scaled model doesn't require starting from scratch. With a two-stage training strategy, we can post-train the parallel components using only a small amount of data.

🔁 Dynamic Adaptation at Inference Time: We find that ParScale remains effective with frozen main parameters for different $P$. This illustrates the potential of dynamic parallel scaling: switching $P$ to dynamically adapt model capabilities during inference.
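
A minimal sketch of what switching $P$ at inference time could look like: one frozen backbone is shared, and a small set of parallel components is stored per value of $P$, so changing $P$ is only a lookup. The class `DynamicParScale`, its `set_p` method, and the scalar shift/weight components are hypothetical and chosen for illustration; fixed aggregation weights are used for brevity, and none of this is taken from the released code.

```python
class DynamicParScale:
    """Toy wrapper: a shared frozen backbone plus per-P parallel components (hypothetical)."""

    def __init__(self, backbone, components_by_p):
        self.backbone = backbone                # shared main parameters, never modified
        self.components_by_p = components_by_p  # {P: (input_shifts, output_weights)}
        self.p = min(components_by_p)           # start with the cheapest setting

    def set_p(self, p):
        """Switch the number of parallel streams without touching the backbone."""
        if p not in self.components_by_p:
            raise ValueError(f"no parallel components available for P={p}")
        self.p = p

    def __call__(self, x):
        shifts, weights = self.components_by_p[self.p]
        # One backbone call per stream, then a weighted sum of the P outputs.
        return sum(w * self.backbone(x + s) for s, w in zip(shifts, weights))


def backbone(x):
    """Stand-in for the frozen model."""
    return x * x

model = DynamicParScale(backbone, {
    1: ([0.0], [1.0]),
    2: ([0.0, 1.0], [0.5, 0.5]),
})
print(model(3.0))   # evaluated with P=1
model.set_p(2)
print(model(3.0))   # same backbone, now with P=2
```

The design point is that only the per-$P$ components differ between settings, so a deployment could trade latency for quality per request without reloading the backbone.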

We release the inference code in modeling_qwen2_parscale.py and configuration_qwen2_parscale.py. Our 67 checkpoints are available at 🤗 HuggingFace.


📈 Scaling Law

  • We carry out large-scale pre-training experiments on the Stack-V2 and Pile corpora, varying $P$ from 1 to 8 and model parameters from 500M to 4.4B.
  • We use the results to fit a new parallel scaling law that generalizes the Chinchilla scaling law.
  • We release our parametric fitting code in parametric_fit.py.
  • Feel free to try the 🤗 HuggingFace Space for a nice visualization of the parallel scaling law!
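
To make the idea of a law that generalizes Chinchilla concrete, the sketch below assumes the loss takes the form $L(N, P) = \left(\frac{A}{N\,(k \log P + 1)}\right)^{\alpha} + E$, that is, the parameter term of a Chinchilla-style law with $N$ multiplied by a factor that grows as $O(\log P)$. Both this functional form and every constant in the example are assumptions for illustration, not fitted values; parametric_fit.py and the paper remain the authoritative sources.

```python
import math

def parallel_scaling_loss(n_params, p, A, alpha, E, k):
    """Assumed parallel scaling law: Chinchilla-style term with N scaled by (k*log(P) + 1)."""
    effective_n = n_params * (k * math.log(p) + 1.0)
    return (A / effective_n) ** alpha + E

def equivalent_params(n_params, p, k):
    """Parameter count of a P=1 model with the same predicted loss under the assumed law."""
    return n_params * (k * math.log(p) + 1.0)

# Placeholder constants, NOT fitted values from the paper.
A, alpha, E, k = 1.0e9, 0.2, 1.0, 0.4

for p in (1, 2, 4, 8):
    loss = parallel_scaling_loss(1.8e9, p, A, alpha, E, k)
    n_eq = equivalent_params(1.8e9, p, k)
    print(f"P={p}: loss={loss:.4f}, equivalent N={n_eq:.3e}")
```

Under this assumed form, $P=1$ recovers the plain parameter-scaling term, and each doubling of $P$ adds the same increment to the effective parameter count.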

⚡ Cost Analysis

  • We further compare the inference efficiency between parallel scaling and parameter scaling at equivalent performance levels.
  • We release our analysis code in cost_analysis.py. Before using it, you should first install llm-analysis:
git clone https://github.com/cli99/llm-analysis.git
cd llm-analysis
pip install .
  • You can use the following command to analyze the inference memory and latency cost for our 4.4B model, with $P=2$ and batch size = 2:
python cost_analysis.py --hidden_size 2560 --intermediate_size 13824 --P 2 --batch_size 2
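
To compare several settings in one go, the command above can be wrapped in a short Python script. The sketch below uses only the four flags shown in that command; the helper names `build_command` and `run_sweep` and the particular sweep over $P$ are assumptions, and whether cost_analysis.py accepts every value in the sweep has not been verified here. It expects to be run from the directory that contains cost_analysis.py.

```python
import subprocess

def build_command(p, batch_size, hidden_size=2560, intermediate_size=13824):
    """Build the argv list for one cost_analysis.py run (flags copied from the command above)."""
    return [
        "python", "cost_analysis.py",
        "--hidden_size", str(hidden_size),
        "--intermediate_size", str(intermediate_size),
        "--P", str(p),
        "--batch_size", str(batch_size),
    ]

def run_sweep(p_values=(1, 2, 4, 8), batch_size=2):
    """Run the analysis once per value of P; raises CalledProcessError if a run fails."""
    for p in p_values:
        cmd = build_command(p, batch_size)
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)

# Preview the commands without executing them; call run_sweep() to actually run them.
for p in (1, 2, 4, 8):
    print(" ".join(build_command(p, batch_size=2)))
```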

🔥 Models

Models marked with ✨ are our recommendations for strong models!

Base models for scaling training data to 1T tokens

These models demonstrate strong competitiveness among existing small models, including SmolLM, gemma, and Llama-3.2.

| Model | Description | Download |
| --- | --- | --- |
| ParScale-1.8B-P1 | ✨ Baseline $P=1$ | 🤗 ParScale/ParScale-1.8B-P1 |
| ParScale-1.8B-P2 | ✨ ParScale $P=2$ | 🤗 ParScale/ParScale-1.8B-P2 |
| ParScale-1.8B-P4 | ✨ ParScale $P=4$ | 🤗 ParScale/ParScale-1.8B-P4 |
| ParScale-1.8B-P8 | ✨ ParScale $P=8$ | 🤗 ParScale/ParScale-1.8B-P8 |

Instruct models for scaling training data to 1T tokens

We post-trained the base models above on SmolTalk-1M to enable conversational capabilities.

| Model | Description | Download |
| --- | --- | --- |
| ParScale-1.8B-P1-Inst | ✨ Baseline $P=1$ | 🤗 ParScale/ParScale-1.8B-P1-Inst |
| ParScale-1.8B-P2-Inst | ✨ ParScale $P=2$ | 🤗 ParScale/ParScale-1.8B-P2-Inst |
| ParScale-1.8B-P4-Inst | ✨ ParScale $P=4$ | 🤗 ParScale/ParScale-1.8B-P4-Inst |
| ParScale-1.8B-P8-Inst | ✨ ParScale $P=8$ | 🤗 ParScale/ParScale-1.8B-P8-Inst |

Continual Pretraining Qwen-2.5-3B

We froze the parameters of Qwen-2.5-3B and only fine-tuned the newly introduced parameters on Stack-V2-Python. Since the following models share the same backbone parameters as Qwen-2.5-3B, they have the potential for dynamic ParScale: switching P to adapt model capabilities during inference.

| Model | Description | Download |
| --- | --- | --- |
| ParScale-Qwen-3B-P2-Python | ✨ ParScale $P=2$ | 🤗 ParScale/ParScale-Qwen-3B-P2-Python |
| ParScale-Qwen-3B-P4-Python | ✨ ParScale $P=4$ | 🤗 ParScale/ParScale-Qwen-3B-P4-Python |
| ParScale-Qwen-3B-P8-Python | ✨ ParScale $P=8$ | 🤗 ParScale/ParScale-Qwen-3B-P8-Python |

  • For full continual pretraining on Stack-V2-Python:

| Model | Description | Download |
| --- | --- | --- |
| ParScale-QwenInit-3B-P1-Python | Baseline $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Python |
| ParScale-QwenInit-3B-P2-Python | ParScale $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Python |
| ParScale-QwenInit-3B-P4-Python | ParScale $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Python |
| ParScale-QwenInit-3B-P8-Python | ParScale $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Python |

  • For full continual pretraining on Pile:

| Model | Description | Download |
| --- | --- | --- |
| ParScale-QwenInit-3B-P1-Pile | Baseline $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Pile |
| ParScale-QwenInit-3B-P2-Pile | ParScale $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Pile |
| ParScale-QwenInit-3B-P4-Pile | ParScale $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Pile |
| ParScale-QwenInit-3B-P8-Pile | ParScale $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Pile |

Checkpoints Used to Fit the Scaling Law

Download link: https://huggingface.co/ParScale/ParScale-{size}-{P}-{dataset}

  • {size}: model size, from {0.7B, 0.9B, 1.3B, 1.8B, 3B, 4.7B}
  • {P}: number of parallel streams, from {P1, P2, P4, P8}
  • {dataset}: training dataset, from {Python, Pile}
  • $6\times 4 \times 2=48$ checkpoints in total.
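
The naming scheme above can be expanded programmatically. The sketch below enumerates every repository id implied by the template; the function name `scaling_law_checkpoints` is hypothetical, and the ids are derived from the template rather than checked against the Hub.

```python
from itertools import product

SIZES = ["0.7B", "0.9B", "1.3B", "1.8B", "3B", "4.7B"]
PARALLELS = ["P1", "P2", "P4", "P8"]
DATASETS = ["Python", "Pile"]

def scaling_law_checkpoints():
    """Return every repo id of the form ParScale/ParScale-{size}-{P}-{dataset}."""
    return [
        f"ParScale/ParScale-{size}-{p}-{dataset}"
        for size, p, dataset in product(SIZES, PARALLELS, DATASETS)
    ]

repos = scaling_law_checkpoints()
print(len(repos))   # number of size/P/dataset combinations
print(repos[0])
```

Any of these ids can be used as the `name` argument in the usage example below.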

Usage Example with 🤗 Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer
name = "ParScale/ParScale-1.8B-P8" # or anything else you like
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(name)
inputs = tokenizer.encode("Hello, how are you today?", return_tensors="pt").to("cuda")
outputs = model.generate(inputs, max_new_tokens=128)[0]
print(tokenizer.decode(outputs))

📚 Citation

@article{ParScale,
      title={Parallel Scaling Law for Language Models}, 
      author={Mouxiang Chen and Binyuan Hui and Zeyu Cui and Jiaxi Yang and Dayiheng Liu and Jianling Sun and Junyang Lin and Zhongxin Liu},
      year={2025},
      eprint={2505.10475},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      journal={arXiv preprint arXiv:2505.10475},
      url={https://arxiv.org/abs/2505.10475}, 
}