Example Run Script

September 9, 2025

To build and run the AutoDeploy example, use the examples/auto_deploy/build_and_run_ad.py script:

cd examples/auto_deploy
python build_and_run_ad.py --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

You can configure your experiment with various options. Use the -h/--help flag to see available options:

python build_and_run_ad.py --help

The following is a non-exhaustive list of common configuration options:

| Configuration Key | Description |
|---|---|
| --model | The HF model card or path to an HF checkpoint folder |
| --args.model-factory | Choose model factory implementation ("AutoModelForCausalLM", ...) |
| --args.skip-loading-weights | Only load the architecture, not the weights |
| --args.model-kwargs | Extra kwargs passed to the model initializer in the model factory |
| --args.tokenizer-kwargs | Extra kwargs passed to the tokenizer initializer in the model factory |
| --args.world-size | The number of GPUs used for auto-sharding the model |
| --args.runtime | Specifies which type of Engine to use during runtime ("demollm" or "trtllm") |
| --args.compile-backend | Specifies how to compile the graph at the end |
| --args.attn-backend | Specifies kernel implementation for attention |
| --args.mla-backend | Specifies implementation for multi-head latent attention |
| --args.max-seq-len | Maximum sequence length for inference/cache |
| --args.max-batch-size | Maximum dimension for statically allocated KV cache |
| --args.attn-page-size | Page size for attention |
| --prompt.batch-size | Number of queries to generate |
| --benchmark.enabled | Whether to run the built-in benchmark (true/false) |

For default values and additional configuration options, refer to the ExperimentConfig class in the examples/auto_deploy/build_and_run_ad.py file.
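The dotted option names (--args.world-size, --benchmark.enabled) read as paths into a nested configuration. The sketch below illustrates that reading only: parse_dotted_flags is a hypothetical helper, not the script's actual parser, and it assumes every flag is followed by exactly one value. It keeps values as strings, whereas the real ExperimentConfig may validate and convert them.

```python
from typing import Any


def parse_dotted_flags(argv: list[str]) -> dict[str, Any]:
    """Group '--a.b-c value' pairs into a nested dict (illustrative only)."""
    config: dict[str, Any] = {}
    # Walk the list two items at a time: a flag, then its value.
    for flag, value in zip(argv[0::2], argv[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a flag, got {flag!r}")
        # '--args.world-size' -> ['args', 'world_size']
        keys = flag[2:].replace("-", "_").split(".")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value  # values stay strings in this sketch
    return config


example = parse_dotted_flags([
    "--model", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "--args.world-size", "2",
    "--args.runtime", "demollm",
    "--benchmark.enabled", "True",
])
print(example)
```

Under this reading, options that share a prefix such as args end up grouped under one key, which is a convenient way to think about which part of the experiment each flag configures.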

The following is a more complete example of using the script:

cd examples/auto_deploy
python build_and_run_ad.py \
  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
  --args.world-size 2 \
  --args.runtime "demollm" \
  --args.compile-backend "torch-compile" \
  --args.attn-backend "flashinfer" \
  --benchmark.enabled True
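When sweeping over several configurations, it can be convenient to assemble the same command from Python. The following is a minimal sketch under stated assumptions: build_command is a hypothetical helper (not part of the repository), and the flags are simply the ones shown in the example above. It only builds and prints the argument list; launching the script is left as a commented-out line.

```python
import shlex
import sys


def build_command(model: str, overrides: dict[str, object]) -> list[str]:
    """Assemble an argv list for build_and_run_ad.py (hypothetical helper)."""
    cmd = [sys.executable, "build_and_run_ad.py", "--model", model]
    for key, value in overrides.items():
        # Each override becomes '--<dotted-key> <value>'.
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_command(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    {
        "args.world-size": 2,
        "args.runtime": "demollm",
        "args.compile-backend": "torch-compile",
        "args.attn-backend": "flashinfer",
        "benchmark.enabled": True,
    },
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))

# To launch it (requires a working AutoDeploy setup and the GPUs requested):
# import subprocess
# subprocess.run(cmd, cwd="examples/auto_deploy", check=True)
```

Passing the arguments as a list avoids shell-quoting problems with model names or paths that contain spaces.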