Profile Hugging Face TGI Models with AIPerf

March 7, 2026

AIPerf can benchmark Large Language Models (LLMs) served through the Hugging Face Text Generation Inference (TGI) generate API. TGI exposes two standard HTTP endpoints for text generation:

| Endpoint | Description | AIPerf Flag |
|---|---|---|
| `/generate` | Returns the full text completion in one response (non-streaming). | (default) |
| `/generate_stream` | Streams generated tokens as they are produced (SSE). | `--streaming` |
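
As a rough illustration of how the two endpoints differ from a client's point of view, the Python sketch below picks the route from a streaming switch and builds a minimal request body. The `inputs` field matches the `curl` check used in this guide and `parameters.max_new_tokens` follows TGI's request schema; the helper names `tgi_path` and `build_request_body` are hypothetical and are not part of AIPerf.

```python
import json


def tgi_path(streaming: bool) -> str:
    """Return the TGI route that matches the chosen mode."""
    return "/generate_stream" if streaming else "/generate"


def build_request_body(prompt: str, max_new_tokens: int = 32) -> str:
    """Serialize a minimal TGI generate request (hypothetical helper)."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return json.dumps(body)


print(tgi_path(streaming=False), build_request_body("Hello world"))
```

The same body can be sent to either route; only the response format (single JSON object versus a stream of server-sent events) changes.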

Start a Hugging Face TGI Server

To launch a Hugging Face TGI server, use the official ghcr.io image:

docker run --gpus all --rm -it \
  -p 8080:80 \
  -e MODEL_ID=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  ghcr.io/huggingface/text-generation-inference:latest

# In a second terminal, once the model has loaded, verify the server is running
curl -s http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs":"Hello world"}' | jq
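
Because the model can take a while to download and load, a benchmark started too early may fail on its first requests. The sketch below polls a readiness probe until it succeeds. It assumes TGI's `/health` route is reachable on the mapped port; the function names, attempt count, and delay are illustrative choices, not anything AIPerf provides.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 2.0,
) -> bool:
    """Call `probe` until it succeeds or `attempts` runs out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # back off before the next try
    return False
```

A call such as `wait_until_ready(lambda: http_ok("http://localhost:8080/health"))` would block until the server answers or the attempts are exhausted. Passing the probe as a callable keeps the retry loop independent of the HTTP details.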

Profile with AIPerf

You can benchmark TGI models in either non-streaming or streaming mode, using either synthetic inputs or a custom input file.

Non-Streaming (`/generate`)

Profile with synthetic inputs

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate endpoint (non-streaming)
INFO     AIPerf System is PROFILING

Profiling: 10/10 |████████████████████████| 100% [00:08<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1234.56 │ 987.34 │ 1567.89 │ 1567.89 │ 1198.45 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.34 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json

Profile with custom input file

You can also provide your own text prompts with the `--input-file` option. The file must be in JSONL format: one JSON object per line, with the prompt stored in a `text` field.

cat > inputs.jsonl <<'EOF'
{"text": "Hello TinyLlama!"}
{"text": "Tell me a joke."}
EOF

Then run:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --input-file ./inputs.jsonl \
    --custom-dataset-type single_turn \
    --request-count 10
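
For larger prompt sets, the input file can be generated rather than typed by hand. The following is a minimal sketch, assuming only the one-object-per-line `{"text": ...}` layout used in the example file; `write_prompts` and `read_prompts` are hypothetical helpers, not AIPerf functions.

```python
import json
from pathlib import Path


def write_prompts(path: str, prompts: list[str]) -> int:
    """Write one {"text": ...} object per line; return the line count."""
    # Blank prompts are skipped so the file never contains empty entries.
    lines = [json.dumps({"text": p}, ensure_ascii=False) for p in prompts if p.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def read_prompts(path: str) -> list[str]:
    """Parse the file back and fail loudly on a line without a "text" field."""
    prompts = []
    text = Path(path).read_text(encoding="utf-8")
    for number, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        record = json.loads(line)
        if "text" not in record:
            raise ValueError(f"line {number} has no 'text' field")
        prompts.append(record["text"])
    return prompts
```

Reading the file back before a run is a cheap way to catch malformed lines early instead of partway through a benchmark.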

Streaming (`/generate_stream`)

When the `--streaming` flag is set, AIPerf sends requests to the TGI server's `/generate_stream` endpoint instead of `/generate`.
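
To give a feel for what a streaming client has to handle, the sketch below extracts token text from server-sent `data:` lines. The event shape (a `token` object with `text` and `special` fields, and `generated_text` filled in on the last event) reflects TGI's streaming response as understood here; the sample events are hand-written for illustration, not captured from a server, and `iter_token_texts` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_token_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each non-special token from SSE `data:` lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        if token and not token.get("special", False):
            yield token["text"]


# Hand-written sample events (illustrative only).
sample = [
    'data:{"token":{"id":1,"text":"Hello","special":false},"generated_text":null}',
    "",
    'data:{"token":{"id":2,"text":" there","special":false},"generated_text":null}',
    "",
    'data:{"token":{"id":3,"text":"</s>","special":true},"generated_text":"Hello there"}',
]
print("".join(iter_token_texts(sample)))  # prints "Hello there"
```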

Profile with synthetic inputs

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --streaming \
    --request-count 10

Sample Output (Successful Run):

INFO     Starting AIPerf System
INFO     Using Hugging Face TGI /generate_stream endpoint (streaming)
INFO     AIPerf System is PROFILING

Profiling: 10/10 |████████████████████████| 100% [00:09<00:00]

INFO     Benchmark completed successfully
INFO     Results saved to: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/

            NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃                      Metric ┃     avg ┃    min ┃     max ┃     p99 ┃     p50 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│        Request Latency (ms) │ 1189.45 │ 945.67 │ 1498.34 │ 1498.34 │ 1156.78 │
│    Time to First Token (ms) │  234.56 │ 189.34 │  298.45 │  298.45 │  228.90 │
│    Inter Token Latency (ms) │   14.23 │  11.45 │   18.90 │   18.90 │   13.89 │
│ Output Token Count (tokens) │  256.00 │ 200.00 │  300.00 │  300.00 │  254.00 │
│  Request Throughput (req/s) │    2.56 │      - │       - │       - │       - │
└─────────────────────────────┴─────────┴────────┴─────────┴─────────┴─────────┘

JSON Export: artifacts/TinyLlama_TinyLlama-1.1B-Chat-v1.0-generate-concurrency1/profile_export_aiperf.json
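
The streaming table adds Time to First Token and Inter Token Latency to the request latency reported in non-streaming mode. The sketch below shows one common way to derive such figures from chunk arrival timestamps for a single request. These definitions are an assumption for illustration and are not necessarily the exact formulas AIPerf uses; the function name and input values are made up.

```python
from statistics import mean


def streaming_latencies(start_s: float, arrivals_s: list[float]) -> dict[str, float]:
    """Derive per-request latencies (ms) from chunk arrival timestamps (s)."""
    if not arrivals_s:
        raise ValueError("need at least one token arrival")
    # Time from sending the request to receiving the first token.
    ttft_ms = (arrivals_s[0] - start_s) * 1000
    # Time from sending the request to receiving the last token.
    total_ms = (arrivals_s[-1] - start_s) * 1000
    # Average gap between consecutive token arrivals.
    gaps_ms = [(b - a) * 1000 for a, b in zip(arrivals_s, arrivals_s[1:])]
    itl_ms = mean(gaps_ms) if gaps_ms else 0.0
    return {"ttft_ms": ttft_ms, "itl_ms": itl_ms, "request_latency_ms": total_ms}


print(streaming_latencies(0.0, [0.25, 0.5, 0.75]))
```

Neither Time to First Token nor Inter Token Latency can be measured in non-streaming mode, because the client sees only one response per request.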

Profile with custom input file

Create your own prompt file in JSONL format:

cat > inputs.jsonl <<'EOF'
{"text": "Explain quantum computing in simple terms."}
{"text": "Write a haiku about rain."}
{"text": "Summarize the causes of the French Revolution."}
EOF

Then run:

aiperf profile \
    -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --endpoint-type huggingface_generate \
    --url localhost:8080 \
    --input-file ./inputs.jsonl \
    --custom-dataset-type single_turn \
    --streaming \
    --request-count 10