Qwen3 Next

March 5, 2026

Qwen3-Next is Alibaba's 80B-parameter Mixture-of-Experts (MoE) model, which activates only 3B parameters per token. It features a hybrid attention architecture that combines Gated DeltaNet (linear attention) with Gated Attention (full attention) to scale to very long contexts. This documentation covers the integration of Qwen3-Next-80B-A3B into MaxText.

For more details on the architecture, see the Qwen3 Technical Blog.
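The hybrid layout mentioned above can be pictured with a small sketch. The layer count (48) and the 3:1 interleaving of Gated DeltaNet to Gated Attention layers are assumptions taken from the public Qwen3-Next model card, not values read from the MaxText config, and the names `NUM_LAYERS`, `FULL_ATTENTION_EVERY`, and `layer_types` are hypothetical.

```python
# Illustrative only: the constants below are assumptions based on the public
# Qwen3-Next model card, not values read from the MaxText model config.
NUM_LAYERS = 48           # assumed number of decoder layers
FULL_ATTENTION_EVERY = 4  # assumed: every 4th layer uses Gated Attention


def layer_types(num_layers: int, full_attention_every: int) -> list:
    """Return the attention type assumed for each decoder layer, in order."""
    return [
        "gated_attention" if (i + 1) % full_attention_every == 0 else "gated_deltanet"
        for i in range(num_layers)
    ]


layout = layer_types(NUM_LAYERS, FULL_ATTENTION_EVERY)
print(layout[:4])
print(layout.count("gated_deltanet"), layout.count("gated_attention"))
```

Under these assumptions, three out of every four layers use linear attention, which is what keeps the cost of long contexts down while the periodic full-attention layers retain global token mixing.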

Pre-Training

You can train from scratch to generate a new checkpoint. The following example command runs pre-training with Qwen3-Next on a v5p-64 slice:

python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
    base_output_directory=${BASE_OUTPUT_DIRECTORY} \
    run_name=q3_next_pre_training \
    per_device_batch_size=1 \
    enable_checkpointing=false \
    model_name=qwen3-next-80b-a3b \
    ici_fsdp_parallelism=-1 \
    steps=5 \
    max_target_length=1024 \
    async_checkpointing=false \
    tokenizer_type=huggingface \
    tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer \
    attention=flash \
    dtype=bfloat16 \
    weight_dtype=bfloat16 \
    megablox=False \
    sparse_matmul=False \
    dataset_type=synthetic
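As a rough sanity check on the command above, the number of tokens processed per step can be estimated from `per_device_batch_size` and `max_target_length`. This is a minimal sketch that assumes the global batch is simply the device count times the per-device batch size (no gradient accumulation); `NUM_DEVICES` is a hypothetical placeholder, so substitute the device count that JAX reports on your slice.

```python
def tokens_per_step(num_devices: int, per_device_batch_size: int, max_target_length: int) -> int:
    """Estimate tokens consumed by one training step (no gradient accumulation)."""
    return num_devices * per_device_batch_size * max_target_length


NUM_DEVICES = 32  # hypothetical device count; replace with your slice's value
per_step = tokens_per_step(NUM_DEVICES, per_device_batch_size=1, max_target_length=1024)
total = per_step * 5  # steps=5 in the command above
print(per_step, total)
```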

Checkpoint Conversion

To get started, you first need a MaxText-compatible checkpoint.

Download the Model: Download the official model from Hugging Face. For a faster download, install the hf_transfer package and enable it with HF_HUB_ENABLE_HF_TRANSFER=1.

# Example for Qwen3-Next-80B-A3B-Instruct
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Qwen/Qwen3-Next-80B-A3B-Instruct --local-dir /path/to/qwen3_next_hf_checkpoint

Convert the Checkpoint: Run the maxtext.checkpoint_conversion.to_maxtext module to convert the Hugging Face weights into the Orbax format required by MaxText.

# scan_layers=true produces a scanned checkpoint; set it to false for an unscanned checkpoint.
JAX_PLATFORMS=cpu python3 -m maxtext.checkpoint_conversion.to_maxtext src/maxtext/configs/base.yml \
    model_name=qwen3-next-80b-a3b \
    base_output_directory=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt \
    hf_access_token=${HF_TOKEN} \
    scan_layers=true \
    use_multimodal=false
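The commands in this document load the converted weights from a path of the form `<base_output_directory>/0/items`. A minimal sketch of deriving that path is shown here; the helper name `converted_params_path` is hypothetical, and the `0/items` layout is inferred from the example paths in this document rather than from a documented guarantee.

```python
def converted_params_path(base_output_directory: str, step: int = 0) -> str:
    """Build the parameter path under the conversion output directory.

    Assumes the converter writes to <base_output_directory>/<step>/items,
    as the example paths in this document suggest.
    """
    return f"{base_output_directory.rstrip('/')}/{step}/items"


print(converted_params_path("gs://your-gcs-bucket/qwen3_next_maxtext_ckpt"))
```

The result can be passed as `load_parameters_path` to the fine-tuning, decoding, and logit-checking commands.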

Fine-tuning

After converting the checkpoint, you can use it for fine-tuning. The command below is an example for fine-tuning on a v5p-64 slice.

python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
    base_output_directory=${BASE_OUTPUT_DIRECTORY} \
    dataset_path=${DATASET_PATH} \
    load_parameters_path=gs://your-gcs-bucket/qwen3_next_maxtext_ckpt/0/items \
    run_name=qwen3_next_finetuning \
    per_device_batch_size=1 \
    model_name=qwen3-next-80b-a3b \
    steps=30 \
    max_target_length=4096 \
    ici_fsdp_parallelism=-1 \
    tokenizer_type=huggingface \
    tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer

Decoding

The following example command runs decoding with Qwen3-Next on a v5p-64 slice, using an unscanned checkpoint for fast decoding:

python3 -m maxtext.inference.decode src/maxtext/configs/base.yml \
    base_output_directory=${BASE_OUTPUT_DIRECTORY} \
    load_parameters_path=${CONVERTED_CHECKPOINT} \
    run_name=q3-next-decode \
    per_device_batch_size=1 \
    enable_checkpointing=false \
    model_name=qwen3-next-80b-a3b \
    max_prefill_predict_length=64 \
    max_target_length=1024 \
    tokenizer_type=huggingface \
    tokenizer_path=src/maxtext/assets/tokenizers/qwen3-tokenizer \
    attention=dot_product \
    dtype=bfloat16 \
    weight_dtype=bfloat16 \
    megablox=False \
    sparse_matmul=False \
    ici_tensor_parallelism=1 \
    ici_fsdp_parallelism=1 \
    ici_expert_parallelism=-1 \
    prompt="An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and outputs are all vectors. The output is " \
    scan_layers=False
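In the command above, `ici_expert_parallelism=-1` leaves one mesh axis unspecified so that it can absorb whatever devices remain after the other axes are fixed. The sketch below illustrates that idea in plain Python; it is not MaxText's implementation, and `resolve_parallelism` and the device count of 32 are hypothetical.

```python
from math import prod


def resolve_parallelism(axes: dict, num_devices: int) -> dict:
    """Fill in the single axis marked -1 so the axis sizes multiply to num_devices."""
    unknown = [name for name, size in axes.items() if size == -1]
    if len(unknown) > 1:
        raise ValueError("at most one axis may be set to -1")
    known = prod(size for size in axes.values() if size != -1)
    if num_devices % known != 0:
        raise ValueError("fixed axis sizes do not divide the device count")
    resolved = dict(axes)
    if unknown:
        resolved[unknown[0]] = num_devices // known
    elif known != num_devices:
        raise ValueError("axis sizes must multiply to the device count")
    return resolved


# Axis sizes from the decoding command; 32 devices is a hypothetical example.
mesh = resolve_parallelism({"tensor": 1, "fsdp": 1, "expert": -1}, num_devices=32)
print(mesh)
```

With tensor and FSDP parallelism both set to 1, every device ends up on the expert axis, so the experts are spread across the whole slice.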

Correctness Validation

We perform two primary checks:

Logit Comparison: We compare the logits generated by our implementation against those from the Hugging Face implementation for a set of given prompts.
MMLU Score Validation: We validate the MMLU score against established benchmarks.

The following example command generates golden logits from Hugging Face for Qwen3-Next:

python3 -m tests.assets.logits_generation.generate_hf_golden_logits \
    --model-id=Qwen/Qwen3-Next-80B-A3B-Instruct \
    --output-path=golden_Qwen3_Next.jsonl \
    --prompts='I love to;Today is a;What is the'

You should see logs like the following:

...
File is stored locally at golden_Qwen3_Next.jsonl.
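Before running the comparison, it can help to confirm that the golden file contains one record per prompt. The sketch below assumes that `--prompts` is split on `;` and that each line of the `.jsonl` file is one JSON object; it deliberately makes no assumption about the field names. Both helper names are hypothetical.

```python
import json


def split_prompts(prompts: str) -> list:
    """Split a ';'-separated prompt string into individual prompts."""
    return [p for p in prompts.split(";") if p]


def summarize_jsonl(path: str) -> list:
    """Return the sorted field names of each JSON record in a .jsonl file."""
    summaries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                summaries.append(sorted(json.loads(line).keys()))
    return summaries


prompts = split_prompts("I love to;Today is a;What is the")
print(len(prompts), prompts)
# After generating the file, for example:
#   records = summarize_jsonl("golden_Qwen3_Next.jsonl")
#   assert len(records) == len(prompts)
```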

Run the command below to compare logits between Hugging Face and MaxText.

python3 -m tests.utils.forward_pass_logit_checker \
    src/maxtext/configs/base.yml \
    tokenizer_type=huggingface \
    tokenizer_path=Qwen/Qwen3-Next-80B-A3B-Instruct \
    load_parameters_path=${CONVERTED_CHECKPOINT} \
    run_name=forward_pass_test_qwen3_next \
    per_device_batch_size=1 \
    model_name=qwen3-next-80b-a3b \
    max_prefill_predict_length=4 \
    max_target_length=4 \
    scan_layers=false \
    sparse_matmul=False \
    dtype=float32 \
    activations_in_float32=true \
    matmul_precision=high \
    --max_kl_div=2e-4 \
    --golden_logits_path=${PWD}/golden_Qwen3_Next.jsonl
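The `--max_kl_div=2e-4` threshold bounds how far the MaxText output distribution may drift from the golden one. As a minimal sketch of that kind of check, the code below computes the KL divergence between the softmax distributions of two logit vectors for a single position. It is an illustration in plain Python, not the checker's actual code, and the direction KL(golden || model) is an assumption.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(golden_logits, model_logits):
    """KL(golden || model) between the softmax distributions of two logit vectors."""
    log_p = log_softmax(golden_logits)
    log_q = log_softmax(model_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


MAX_KL_DIV = 2e-4  # threshold from the command above

golden = [2.0, 1.0, 0.1]
close = [2.0, 1.0, 0.1001]  # tiny numerical difference
far = [0.1, 1.0, 2.0]       # clearly different distribution

print(kl_divergence(golden, close) <= MAX_KL_DIV)
print(kl_divergence(golden, far) <= MAX_KL_DIV)
```

Because the comparison is made on distributions rather than raw logits, a constant offset added to every logit does not affect the result.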

To run MMLU benchmarks and validate the model's performance, follow the instructions provided here.

Supported MoE Strategies

This model implementation supports both Token Dropping and Dropless strategies for Mixture of Experts routing. See the MaxText documentation on MoE configs for the flags to set for each strategy.
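The difference between the two strategies can be illustrated with a toy top-1 router. In this sketch, token dropping caps each expert at a fixed capacity and discards the overflow, while dropless routing is modeled as a capacity large enough to hold every token. The function names and the `capacity_factor` parameter are hypothetical and do not refer to MaxText flags.

```python
import math


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    """Per-expert token budget: an even share of tokens scaled by capacity_factor."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def route_with_capacity(assignments, num_experts: int, capacity: int):
    """Keep tokens in arrival order until their expert is full; drop the rest."""
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)
    return kept, dropped


assignments = [0, 0, 0, 1, 0, 2]  # expert chosen for each of 6 tokens (top-1)

# Token dropping: capacity = ceil(1.0 * 6 / 3) = 2 tokens per expert.
capacity = expert_capacity(len(assignments), num_experts=3, capacity_factor=1.0)
print(route_with_capacity(assignments, 3, capacity))

# Dropless: every expert can hold all tokens, so nothing is dropped.
print(route_with_capacity(assignments, 3, capacity=len(assignments)))
```

Token dropping keeps per-expert work bounded and evenly shaped at the cost of discarding some tokens under imbalanced routing, whereas dropless routing processes every token but must handle uneven expert loads.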