Chapter 3: Command Line Interface
April 13, 2026
Welcome to Chapter 3: Command Line Interface. In this part of llama.cpp Tutorial: Local LLM Inference, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Master llama-cli with advanced options, interactive modes, and conversation management.
CLI Execution Modes
flowchart TD
CLI[llama-cli] --> M1[Single Prompt Mode\n-p text -n tokens]
CLI --> M2[Interactive Mode\n--interactive]
CLI --> M3[Instruction Mode\n--instruct]
CLI --> M4[Conversation Mode\n--conversation]
M1 --> O1[Generate and exit]
M2 --> O2[REPL loop with user input]
M3 --> O3[Instruction-tuned chat loop]
M4 --> O4[Multi-turn with full history]
Overview
The llama-cli tool provides comprehensive control over model inference. This chapter covers all major options, interactive modes, and advanced usage patterns.
Basic Command Structure
llama-cli [options] -m model.gguf [prompt_options]
Essential Options
Model and Context
# Specify model file
-m, --model <path> # Path to GGUF model file
# Context management
-c, --ctx-size <n> # Context size (default: 4096)
-n, --n-predict <n> # Number of tokens to predict (default: -1, unlimited)
--keep <n> # Number of tokens to keep from initial prompt
# Memory management
--memory-f32 # Use f32 instead of f16 for memory KV
--mlock # Force system to keep model in RAM
--no-mmap # Don't memory-map model (slower load, but may reduce pageouts if not using --mlock)
Generation Parameters
# Sampling parameters
--temp <float> # Temperature (0.0-2.0, default: 0.8)
--top-k <int> # Top-k sampling (default: 40)
--top-p <float> # Top-p (nucleus) sampling (default: 0.9)
--min-p <float> # Minimum probability (default: 0.0)
--tfs-z <float> # Tail free sampling (default: 1.0)
--typical-p <float> # Typical sampling (default: 1.0)
# Penalty parameters
--repeat-last-n <int> # Lookback for repetition penalty (default: 64)
--repeat-penalty <float> # Repetition penalty (default: 1.1)
--presence-penalty <float> # Presence penalty (default: 0.0)
--frequency-penalty <float> # Frequency penalty (default: 0.0)
# Mirostat sampling (alternative to top-k/top-p)
--mirostat <int> # Use Mirostat sampling (1 or 2)
--mirostat-lr <float> # Mirostat learning rate (default: 0.1)
--mirostat-ent <float> # Mirostat target entropy (default: 5.0)
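To build intuition for what --temp does, the sketch below applies temperature to a handful of logits and prints the resulting probabilities. It uses the standard formulation (divide each logit by the temperature, then apply softmax); softmax_temp is a hypothetical helper written for this illustration, not part of llama.cpp, and the logits are made up.

```bash
#!/bin/bash
# softmax_temp TEMP LOGIT...  (hypothetical helper, illustration only)
# Prints the softmax probabilities of the logits after dividing by TEMP.
# TEMP must be greater than 0.
softmax_temp() {
  local temp="$1"
  shift
  printf '%s\n' "$@" | awk -v t="$temp" '
    { logit[NR] = $1 / t; if (NR == 1 || logit[NR] > max) max = logit[NR] }
    END {
      # Subtract the maximum before exponentiating for numerical stability
      for (i = 1; i <= NR; i++) { e[i] = exp(logit[i] - max); sum += e[i] }
      for (i = 1; i <= NR; i++) printf "%.3f%s", e[i] / sum, (i < NR ? " " : "\n")
    }'
}
```

With the made-up logits 2, 1, 0, `softmax_temp 1.0 2 1 0` prints `0.665 0.245 0.090`, while `softmax_temp 0.5 2 1 0` prints `0.867 0.117 0.016`: a lower temperature concentrates probability on the top token, which is why low values suit code and analytical tasks.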
Interactive Modes
Basic Interactive Chat
# Start interactive session
./llama-cli -m model.gguf --interactive
# With custom settings
./llama-cli -m model.gguf \
--interactive \
--ctx-size 4096 \
--temp 0.8 \
--top-k 40 \
--top-p 0.9 \
--repeat-penalty 1.1 \
--color # Enable colored output
Conversation Mode
# Start a conversation with labelled turns
./llama-cli -m model.gguf \
--interactive \
--conversation \
--in-prefix "Human: " \
--in-suffix "Assistant: "
Chat Template Mode
# Use chat templates (for chat models)
./llama-cli -m model.gguf \
--interactive \
--chat-template chatml # or llama2, mistral, etc.
Input Methods
Direct Prompt
# Simple prompt
./llama-cli -m model.gguf --prompt "Hello, how are you?"
# Multi-line prompt
./llama-cli -m model.gguf \
--prompt "You are a helpful assistant.
User: What is the capital of France?
Assistant: The capital of France is Paris.
User: What about Italy?
Assistant:"
File Input
# Read prompt from file
./llama-cli -m model.gguf --file prompt.txt
# Read from stdin
echo "What is machine learning?" | ./llama-cli -m model.gguf --file /dev/stdin
# Process multiple files
for file in prompts/*.txt; do
echo "Processing $file..."
./llama-cli -m model.gguf --file "$file" --n-predict 100
done
Reverse Prompt (Stopping)
# Stop generation when certain text appears
./llama-cli -m model.gguf \
--prompt "Write a function to sort an array" \
--reverse-prompt "```" # Stop at the first ``` the model emits
# Multiple reverse prompts
./llama-cli -m model.gguf \
--prompt "Explain photosynthesis" \
--reverse-prompt "Human:" \
--reverse-prompt "AI:"
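The effect of a stop string can be pictured as cutting the text at its first occurrence. The sketch below shows that idea on plain strings, independent of llama-cli; truncate_at_stop is a hypothetical helper, and it can also serve as a post-processing step on captured output.

```bash
#!/bin/bash
# truncate_at_stop TEXT STOP  (hypothetical helper, illustration only)
# Prints TEXT up to, but not including, the first occurrence of STOP.
# If STOP does not occur, TEXT is printed unchanged.
truncate_at_stop() {
  # %% removes the longest suffix matching STOP followed by anything,
  # which is everything from the first occurrence of STOP onward.
  printf '%s\n' "${1%%"$2"*}"
}
```

To emulate several stop strings, apply the helper once per string; the earliest match wins because each pass can only shorten the text.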
Output Control
Formatting
# Enable colored output
./llama-cli -m model.gguf --color
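# Print the prompt and its tokens before generating
./llama-cli -m model.gguf --prompt "Hello" --verbose-prompt
# Token probabilities: --logits is not listed by llama-cli --help; llama-server can return per-token probabilities through the n_probs request field
# Basic I/O for subprocesses and limited consoles (plain text, not JSON)
./llama-cli -m model.gguf --prompt "Hello" --simple-io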
Logging and Debugging
# Disable logging (this flag silences log output; it does not add any)
./llama-cli -m model.gguf --log-disable
# Print the tokenized prompt for debugging
./llama-cli -m model.gguf --verbose-prompt
# Performance profiling
./llama-cli -m model.gguf --prompt "Hello" --perf
Advanced Features
Multi-Turn Conversation
# Manual conversation management
./llama-cli -m model.gguf \
--prompt "You are a helpful assistant. Human: Hello\nAssistant:" \
--keep 50 # Keep first 50 tokens
# Interactive with history
./llama-cli -m model.gguf \
--interactive \
--ctx-size 2048 \
--keep 256 # Keep the first 256 prompt tokens (e.g. the system prompt) when the context fills up
Grammar-Based Generation
# Use GBNF grammar (covered in Chapter 7)
./llama-cli -m model.gguf \
--grammar-file grammar.gbnf \
--prompt "Generate a JSON object with name and age"
# Example grammar file (grammar.gbnf)
root ::= object
object ::= "{" ws string ":" ws value "}"
string ::= "\"" [a-zA-Z0-9 ]+ "\""
value ::= [0-9]+
ws ::= [ \t\n]*
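It helps to know exactly which strings the example grammar accepts before running the model against it. The sketch below translates the five rules into one bash regular expression; matches_grammar is a hypothetical helper for this illustration, not a GBNF parser.

```bash
#!/bin/bash
# matches_grammar STRING  (hypothetical helper, illustration only)
# Succeeds if STRING is accepted by the example grammar:
#   "{" ws string ":" ws value "}"
matches_grammar() {
  local ws=$'[ \t\n]*'   # ws ::= [ \t\n]*
  # string ::= "\"" [a-zA-Z0-9 ]+ "\""   and   value ::= [0-9]+
  local re="^[{]${ws}\"[a-zA-Z0-9 ]+\":${ws}[0-9]+[}]\$"
  [[ "$1" =~ $re ]]
}
```

`matches_grammar '{"age":30}'` succeeds, but `matches_grammar '{"name": "Al"}'` fails because value only allows digits. The grammar also permits a single key, so it cannot yield an object holding both a name and an age; extend object and value if the prompt asks for that.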
Seed Control
# Deterministic output
./llama-cli -m model.gguf \
--prompt "Write a haiku" \
--seed 42 \
--temp 0.8
# Different seeds for variety
for seed in 1 2 3; do
echo "Seed $seed:"
./llama-cli -m model.gguf --prompt "Joke about programming" --seed $seed --n-predict 50
echo
done
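Whether a fixed seed really gives repeatable output on your machine is easy to verify by running the same command twice and comparing the results. The sketch below does that for any command; check_repeatable is a hypothetical helper, and it assumes the command exits on its own and writes the generated text to stdout.

```bash
#!/bin/bash
# check_repeatable COMMAND [ARGS...]  (hypothetical helper)
# Runs the command twice and compares stdout.
# Returns 0 if both runs match, 1 if they differ, 2 if a run fails.
check_repeatable() {
  local first second
  first=$("$@") || return 2
  second=$("$@") || return 2
  [ "$first" = "$second" ]
}
```

A possible invocation: `check_repeatable ./llama-cli -m model.gguf --prompt "Write a haiku" --seed 42 --temp 0.8 --n-predict 32`. Results may still differ across hardware, backends, or thread counts, so treat a match as a property of one setup.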
Batch Processing
Multiple Prompts
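# Process multiple prompts, one per line
while IFS= read -r prompt; do
echo "Prompt: $prompt"
echo "Response:"
# </dev/null keeps llama-cli from consuming the remaining lines of prompts.txt
./llama-cli -m model.gguf --prompt "$prompt" --n-predict 100 < /dev/null
echo "---"
done < prompts.txt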
Parallel Processing
#!/bin/bash
# parallel_inference.sh
model="model.gguf"
prompts=("prompt1.txt" "prompt2.txt" "prompt3.txt")
# Process in parallel (adjust based on your CPU)
for prompt_file in "${prompts[@]}"; do
./llama-cli -m "$model" --file "$prompt_file" --n-predict 200 &
done
wait # Wait for all to complete
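Launching every prompt at once can oversubscribe the CPU and memory, since each process holds its own context. One way to cap concurrency is xargs -P, sketched below; run_limited and the LLAMA_CLI variable are hypothetical names introduced here, and -0/-P are GNU and BSD xargs options rather than POSIX.

```bash
#!/bin/bash
# Assumption: path to the binary; override it to point elsewhere.
LLAMA_CLI="${LLAMA_CLI:-./llama-cli}"

# run_limited MAX_JOBS MODEL PROMPT_FILE...  (hypothetical helper)
# Runs one llama-cli process per prompt file, at most MAX_JOBS at a time.
run_limited() {
  local jobs="$1" model="$2"
  shift 2
  # NUL-separated names keep paths with spaces intact
  printf '%s\0' "$@" |
    xargs -0 -P "$jobs" -I{} "$LLAMA_CLI" -m "$model" --file {} --n-predict 200
}
```

For example, `run_limited 2 model.gguf prompt1.txt prompt2.txt prompt3.txt` keeps two processes busy until all three files are done. Output from concurrent jobs can interleave, so redirect each job to its own file if you need clean transcripts.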
Performance Optimization
CPU Settings
# Use all available threads
threads=$(nproc)
./llama-cli -m model.gguf --threads $threads
# Batch processing for better throughput
./llama-cli -m model.gguf --batch-size 512
# CPU-specific optimizations
./llama-cli -m model.gguf --threads $threads --batch-size 512
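nproc comes from GNU coreutils and is missing on some systems, stock macOS among them. A small fallback chain keeps the thread setting portable; cpu_count is a hypothetical helper name.

```bash
#!/bin/bash
# cpu_count  (hypothetical helper)
# Prints the number of online logical CPUs, falling back to 1.
cpu_count() {
  local n
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) && [ -n "$n" ]; then
    echo "$n"
  else
    echo 1
  fi
}
```

Use it as `./llama-cli -m model.gguf --threads "$(cpu_count)"`. The count covers logical CPUs; on machines with SMT it is worth also testing half that value, since more threads do not always mean more tokens per second.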
Memory Optimization
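# For large models with limited RAM
# (comments cannot follow a line-continuation backslash, so the notes sit up here)
# --ctx-size 1024: a smaller context means a smaller KV cache
# --mlock: lock the model in RAM, only if you have enough
# Leave out --memory-f32 here: an f32 KV cache takes twice the memory of f16
# Leave out --no-mmap here: it loads the whole model up front instead of paging it in on demand
./llama-cli -m model.gguf \
--ctx-size 1024 \
--mlock
# Print timing statistics (use a system tool such as top to watch memory)
./llama-cli -m model.gguf --prompt "Hello" --perf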
Custom Scripts and Automation
Response Quality Checker
#!/bin/bash
# quality_check.sh
model="$1"
prompt="$2"
min_length=50
response=$(./llama-cli -m "$model" --prompt "$prompt" --n-predict 200 --simple-io)
# Check response quality
if [ ${#response} -lt $min_length ]; then
echo "❌ Response too short"
exit 1
fi
if echo "$response" | grep -qE "I don't know|I cannot"; then
echo "❌ Model refused to answer"
exit 1
fi
echo "✅ Response quality check passed"
echo "$response"
Benchmarking Script
#!/bin/bash
# benchmark.sh
model="$1"
prompts=("simple.txt" "medium.txt" "complex.txt")
echo "Benchmarking $model"
echo "Prompt | Tokens/sec | Total Time | Memory"
for prompt_file in "${prompts[@]}"; do
if [ -f "$prompt_file" ]; then
start_time=$(date +%s.%3N)
# Run with performance logging
./llama-cli -m "$model" \
--file "$prompt_file" \
--n-predict 100 \
--perf \
--threads $(nproc) \
> /dev/null 2>&1
end_time=$(date +%s.%3N)
duration=$(echo "$end_time - $start_time" | bc)
echo "$prompt_file | TBD | ${duration}s | TBD"
fi
done
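The Tokens/sec column in the script above is left as TBD. A rough wall-clock estimate can be derived from the requested token count and the measured duration, as sketched below; tokens_per_sec is a hypothetical helper, and the figure assumes every requested token was actually generated.

```bash
#!/bin/bash
# tokens_per_sec TOKENS SECONDS  (hypothetical helper)
# Prints TOKENS / SECONDS with two decimals, or "n/a" for a non-positive duration.
tokens_per_sec() {
  awk -v n="$1" -v s="$2" 'BEGIN {
    if (s <= 0) { print "n/a"; exit }
    printf "%.2f\n", n / s
  }'
}
```

In the loop this would be called as `tokens_per_sec 100 "$duration"`. Because the duration includes model loading and prompt processing, the result understates pure generation speed; the timings that llama-cli prints itself are the better source when you keep its output.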
Error Handling
Common Errors and Solutions
# Model file not found
if [ ! -f "$model" ]; then
echo "Error: Model file $model not found"
exit 1
fi
# Context too large
./llama-cli -m model.gguf --ctx-size 2048 # Reduce if getting errors
# Timeout handling
timeout 300 ./llama-cli -m model.gguf --prompt "Long prompt"
# Check exit codes
./llama-cli -m model.gguf --prompt "test"
exit_code=$?
if [ $exit_code -ne 0 ]; then
echo "llama-cli failed with exit code $exit_code"
exit $exit_code
fi
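When timeout wraps the command, the exit code alone distinguishes a time limit from other failures: GNU timeout exits with 124 when the limit is hit. The sketch below turns that into a label; classify_exit is a hypothetical helper.

```bash
#!/bin/bash
# classify_exit CODE  (hypothetical helper)
# Maps an exit code from `timeout ... llama-cli ...` to a short label.
classify_exit() {
  case "$1" in
    0)   echo "ok" ;;
    124) echo "timeout" ;;   # GNU timeout: the time limit was reached
    *)   echo "failed" ;;
  esac
}
```

Typical use is to run `timeout 300 ./llama-cli -m model.gguf --prompt "Long prompt"`, capture `$?` on the next line, and branch on `classify_exit "$?"`, for example retrying with a smaller --n-predict after a timeout.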
Configuration Files
Save common settings:
# Create config files for different use cases
# chat_config.txt
cat > chat_config.txt << EOF
--interactive
--ctx-size 4096
--temp 0.8
--top-k 40
--top-p 0.9
--repeat-penalty 1.1
--color
EOF
# code_config.txt
cat > code_config.txt << EOF
--temp 0.2
--top-k 20
--repeat-penalty 1.1
--ctx-size 2048
EOF
# Use configs
./llama-cli -m model.gguf $(cat chat_config.txt) --prompt "Hello"
# Or create wrapper scripts
cat > chat_model.sh << 'EOF'
#!/bin/bash
model="$1"
./llama-cli -m "$model" \
--interactive \
--ctx-size 4096 \
--temp 0.8 \
--color
EOF
chmod +x chat_model.sh
./chat_model.sh model.gguf
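Expanding a config file with $(cat ...) relies on word splitting, which breaks values that contain spaces and expands glob characters. A bash array avoids both. The sketch below assumes one flag per line, with the rest of the line taken as a single value; load_config and CONFIG_ARGS are hypothetical names.

```bash
#!/bin/bash
# load_config FILE  (hypothetical helper)
# Fills the global array CONFIG_ARGS from FILE.
# Format assumption: one flag per line; anything after the flag is one value.
# Blank lines and lines starting with # are skipped.
load_config() {
  CONFIG_ARGS=()
  local flag value
  while read -r flag value || [ -n "$flag" ]; do
    case "$flag" in
      '' | '#'*) continue ;;
    esac
    CONFIG_ARGS+=("$flag")
    if [ -n "$value" ]; then
      CONFIG_ARGS+=("$value")
    fi
  done < "$1"
  return 0
}
```

After `load_config chat_config.txt`, pass the settings with `./llama-cli -m model.gguf "${CONFIG_ARGS[@]}" --prompt "Hello"`. A line such as `--in-prefix You said:` then reaches llama-cli as one flag and one value.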
Integration with Other Tools
With fzf for Interactive Selection
#!/bin/bash
# interactive_model.sh
# List available models
model=$(find models -name "*.gguf" | fzf --prompt="Select model: ")
if [ -n "$model" ]; then
echo "Selected: $model"
./llama-cli -m "$model" --interactive
fi
With tmux for Persistent Sessions
#!/bin/bash
# persistent_chat.sh
session_name="llama-chat"
# Create new tmux session if it doesn't exist
if ! tmux has-session -t "$session_name" 2>/dev/null; then
tmux new-session -d -s "$session_name" "./llama-cli -m model.gguf --interactive"
fi
# Attach to session
tmux attach-session -t $session_name
Best Practices
- Start Simple: Use default parameters, then tune for your use case
- Test Parameters: Always test parameter combinations on your specific model
- Monitor Performance: Use --perf to understand bottlenecks
- Save Good Configs: Keep working parameter sets for different tasks
- Batch When Possible: Use batch processing for multiple requests
- Handle Errors: Always check exit codes and handle failures
- Version Control: Track which parameter combinations work best
Quick Reference
Most Used Options
# Chat mode
./llama-cli -m model.gguf --interactive --color
# Code generation
./llama-cli -m model.gguf --temp 0.2 --repeat-penalty 1.1
# Creative writing
./llama-cli -m model.gguf --temp 0.9 --top-p 0.95
# Analytical tasks
./llama-cli -m model.gguf --temp 0.1 --top-k 20
# Balanced general-purpose sampling (sampling flags do not change generation speed)
./llama-cli -m model.gguf --temp 0.8 --top-k 40 --top-p 0.9
The CLI provides complete control over model behavior. Experiment with different parameters to find what works best for your specific use case and model.
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for model, llama, gguf so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about llama-cli as an operating subsystem of your local inference setup, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around prompt, float, file as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, Chapter 3: Command Line Interface usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for model.
- Input normalization: shape incoming data so llama receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through gguf.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Source Walkthrough
Key source files in ggerganov/llama.cpp:
- examples/main/main.cpp -- argument parsing for all llama-cli flags; shows which params map to which llama_context_params fields
- common/arg.cpp -- full CLI argument registry; authoritative reference for all flags and their defaults
- common/sampling.cpp -- sampling pipeline: top-k, top-p, temperature, repetition penalty logic
Suggested trace: search arg.cpp for any flag (e.g., --temp) to find its default, type, and description, then find how it flows into the sampling or context params struct.
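The suggested trace can start with a plain grep. The sketch below assumes flags appear in the source as quoted string literals (for example "--temp") and that you run it from a checkout of the repository; find_flag is a hypothetical helper.

```bash
#!/bin/bash
# find_flag FILE FLAG  (hypothetical helper)
# Prints the numbered lines of FILE that contain FLAG as a quoted literal.
find_flag() {
  # -F matches the text literally; the bare -- stops grep from
  # treating a pattern that starts with a dash as one of its own options.
  grep -nF -- "\"$2\"" "$1"
}
```

For example, `find_flag common/arg.cpp --temp` lists the lines to read first; the default, type, and description are usually within a few lines of the match.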