Chapter 2: Model Gallery and Management
April 13, 2026 ยท View on GitHub
Welcome to Chapter 2: Model Gallery and Management. In this part of LocalAI Tutorial: Self-Hosted OpenAI Alternative, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Discover available models, install different architectures, and manage your local model collection.
Model Gallery and Installation Flow
flowchart LR
GALLERY[LocalAI Model Gallery\ngallery.yaml definitions] --> API[POST /models/apply\nwith model id]
API --> DOWNLOAD[Auto-download GGUF / weights]
DOWNLOAD --> CONFIG[Generate model config YAML]
CONFIG --> READY[Model available at /v1/models]
MANUAL[Manual: place .gguf + config.yaml\nin models/ directory] --> READY
Overview
LocalAI supports a wide variety of models through its gallery system. This chapter covers model discovery, installation, and management of different model types.
Model Gallery
Accessing the Gallery
# View available models
curl http://localhost:8080/models/available
# Get detailed model information
curl http://localhost:8080/models/gallery
Model Categories
LocalAI supports several model types:
Large Language Models (LLMs)
# Install popular LLMs
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "mistral-7b-instruct"}'
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "llama-2-7b-chat"}'
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "codellama-7b-instruct"}'
Small and Fast Models
# Phi-2 (2.7B parameters, fast inference)
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "phi-2"}'
# TinyLlama (1.1B parameters, very fast)
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "tinyllama"}'
# Orca Mini (3B parameters, good quality/speed balance)
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "orca-mini"}'
Model Status and Progress
# Check installation status
curl http://localhost:8080/models/jobs
# Get specific model status
curl http://localhost:8080/models/jobs/phi-2
# View installed models
curl http://localhost:8080/v1/models
Manual Model Installation
Downloading from HuggingFace
# Create models directory
mkdir -p models
# Download Phi-2 GGUF
cd models
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf
wget https://huggingface.co/microsoft/phi-2/resolve/main/tokenizer.json
wget https://huggingface.co/microsoft/phi-2/resolve/main/tokenizer_config.json
Creating Model Configuration
# models/phi-2.yaml
name: phi-2
backend: llama
parameters:
model: phi-2.Q4_K_M.gguf
temperature: 0.7
top_p: 0.9
top_k: 40
max_tokens: 512
context_size: 2048
# For chat models
chat_template: chatml # or llama2, mistral, etc.
Model Configuration Options
# Comprehensive model config
name: my-custom-model
backend: llama # or gpt4all, transformers, etc.
parameters:
model: model.gguf
temperature: 0.8
top_p: 0.95
top_k: 50
repeat_penalty: 1.1
repeat_last_n: 64
context_size: 4096
max_tokens: 2048
threads: 4
batch_size: 512
f16: false # Use f32 for compatibility
# Backend-specific settings
backend_settings:
mmap: true
mlock: false
gpu_layers: 0 # For GPU acceleration
# Chat template
chat_template: llama2
Model Backends
Llama.cpp Backend (GGUF Models)
Best for most LLM use cases:
# GGUF model config
name: llama-model
backend: llama
parameters:
model: llama-2-7b-chat.Q4_K_M.gguf
context_size: 4096
threads: 8
batch_size: 512
GPT4All Backend
For older GPT-J/GPT-NeoX models:
name: gpt4all-model
backend: gpt4all
parameters:
model: gpt4all-model.bin
context_size: 2048
Transformers Backend
For PyTorch models (slower, more memory):
name: bert-model
backend: transformers
parameters:
model: bert-base-uncased
task: text-classification
Model Management
Listing Models
# Get all models
curl http://localhost:8080/v1/models
# Get specific model info
curl http://localhost:8080/v1/models/phi-2
# Check model health
curl http://localhost:8080/models/health/phi-2
Updating Models
# Update model configuration
curl -X POST http://localhost:8080/models/config/phi-2 \
-H "Content-Type: application/json" \
-d '{
"parameters": {
"temperature": 0.8,
"max_tokens": 1024
}
}'
Removing Models
# Remove model from memory
curl -X DELETE http://localhost:8080/models/phi-2
# Remove model files (manual)
rm -rf models/phi-2/
Model Performance Tuning
Memory Optimization
# Low-memory configuration
name: phi-2-memory-optimized
backend: llama
parameters:
model: phi-2.Q4_K_M.gguf
context_size: 1024 # Smaller context
threads: 2 # Fewer threads
batch_size: 256 # Smaller batch
f16: false # Use f32
mmap: true # Memory mapping
mlock: false # Don't lock in RAM
Speed Optimization
# High-speed configuration
name: phi-2-fast
backend: llama
parameters:
model: phi-2.Q4_K_M.gguf
context_size: 2048
threads: 8 # More threads
batch_size: 1024 # Larger batch
gpu_layers: 20 # GPU acceleration
flash_attn: true # Flash attention
Quality Optimization
# High-quality configuration
name: phi-2-quality
backend: llama
parameters:
model: phi-2.Q4_K_M.gguf
temperature: 0.1 # More deterministic
top_p: 0.1 # Focused sampling
repeat_penalty: 1.2 # Reduce repetition
context_size: 4096 # Larger context
Custom Model Training
Fine-tuning with Axolotl
# Install Axolotl
pip install axolotl
# Prepare dataset
# Train model
axolotl train config.yml
# Convert to GGUF
python convert.py /path/to/trained/model \
--outfile fine-tuned.gguf \
--outtype f16
LoRA Training
# Use Unsloth for efficient LoRA training
pip install unsloth
# Train LoRA adapter
# Apply to base model
Model Validation
Testing Model Quality
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
def test_model(model_name):
"""Test model with various prompts."""
test_cases = [
"Explain quantum computing in simple terms",
"Write a Python function to reverse a string",
"What are the benefits of renewable energy?",
"Tell me a joke about programming"
]
for prompt in test_cases:
response = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": prompt}],
max_tokens=150
)
print(f"Prompt: {prompt}")
print(f"Response: {response.choices[0].message.content[:100]}...")
print("-" * 50)
# Test different models
test_model("phi-2")
test_model("mistral-7b-instruct")
Performance Benchmarking
import time
import openai
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
def benchmark_model(model_name, prompt, runs=5):
"""Benchmark model performance."""
times = []
token_counts = []
for i in range(runs):
start_time = time.time()
response = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": prompt}],
max_tokens=100
)
end_time = time.time()
times.append(end_time - start_time)
token_counts.append(response.usage.completion_tokens)
avg_time = sum(times) / len(times)
avg_tokens = sum(token_counts) / len(token_counts)
tokens_per_sec = avg_tokens / avg_time
print(f"Model: {model_name}")
print(".2f")
print(".1f")
print(".1f")
print("-" * 30)
# Benchmark models
benchmark_model("phi-2", "Write a haiku about artificial intelligence")
benchmark_model("mistral-7b-instruct", "Write a haiku about artificial intelligence")
Model Gallery API
Programmatic Model Management
import requests
class LocalAIModelManager:
def __init__(self, base_url="http://localhost:8080"):
self.base_url = base_url
def list_available_models(self):
"""List available models in gallery."""
response = requests.get(f"{self.base_url}/models/available")
return response.json()
def install_model(self, model_id, config=None):
"""Install a model from gallery."""
data = {"id": model_id}
if config:
data.update(config)
response = requests.post(
f"{self.base_url}/models/apply",
json=data
)
return response.json()
def get_model_status(self, model_id):
"""Get model installation status."""
response = requests.get(f"{self.base_url}/models/jobs/{model_id}")
return response.json()
def list_installed_models(self):
"""List currently installed models."""
response = requests.get(f"{self.base_url}/v1/models")
return response.json()
def delete_model(self, model_id):
"""Remove a model."""
response = requests.delete(f"{self.base_url}/models/{model_id}")
return response.status_code == 200
# Usage
manager = LocalAIModelManager()
# List available models
available = manager.list_available_models()
print("Available models:", len(available))
# Install a model
manager.install_model("phi-2")
# Check status
status = manager.get_model_status("phi-2")
print("Installation status:", status)
# List installed models
installed = manager.list_installed_models()
print("Installed models:", [m["id"] for m in installed["data"]])
Troubleshooting Model Issues
Common Installation Problems
# Check disk space
df -h
# Check download progress
curl http://localhost:8080/models/jobs
# Restart LocalAI if download stuck
docker restart localai-container
# Check logs for errors
docker logs localai-container 2>&1 | tail -50
Model Loading Issues
# Verify model file exists
ls -la models/model.gguf
# Check model file integrity
file models/model.gguf
# Test with simple config
curl -X POST http://localhost:8080/models/config/test-model \
-H "Content-Type: application/json" \
-d '{"parameters": {"model": "phi-2.Q4_K_M.gguf"}}'
Performance Issues
# Monitor system resources
top -p $(pgrep localai)
free -h
nvidia-smi # If using GPU
# Adjust model parameters
curl -X POST http://localhost:8080/models/config/model-name \
-H "Content-Type: application/json" \
-d '{
"parameters": {
"threads": 4,
"batch_size": 256,
"context_size": 1024
}
}'
Best Practices
- Start Small: Begin with smaller models like Phi-2 or TinyLlama
- Monitor Resources: Track RAM/VRAM usage when installing larger models
- Test Quality: Always validate model outputs for your use case
- Version Control: Keep track of model versions and configurations
- Backup Models: Maintain backups of working model configurations
- Update Regularly: Check for updated model versions in the gallery
Model Compatibility Matrix
| Model Type | Backend | Requirements | Performance |
|---|---|---|---|
| GGUF (Llama/Mistral) | llama | CPU/GPU | Excellent |
| GPT4All format | gpt4all | CPU | Good |
| PyTorch models | transformers | CPU/GPU | Variable |
| Custom models | varies | depends | depends |
Next: Learn how to use LocalAI for text generation with different parameters and chat formats.
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for models, model, http so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 2: Model Gallery and Management as an operating subsystem inside LocalAI Tutorial: Self-Hosted OpenAI Alternative, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around localhost, curl, json as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, Chapter 2: Model Gallery and Management usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for
models. - Input normalization: shape incoming data so
modelreceives stable contracts. - Core execution: run the main logic branch and propagate intermediate state through
http. - Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Source Walkthrough
Use the following upstream sources to verify implementation details while reading this chapter:
-
core/gallery/gallery.goGallery system for discovering and installing model presets. ImplementsInstallModel()which downloads YAML model configs and model weights from configured gallery URLs (including the officiallocalai-gallery). -
core/config/backend_config.goBackendConfigstruct that maps to per-model YAML configuration files. Fields includeBackend(llama-cpp, whisper, diffusers),Parameters(temperature, top-p, max_tokens),GPUlayers,Threads, andContextSize. This is the schema for everymodels/*.yamlfile. -
core/services/gallery.goGallery service layer connecting HTTP endpoints to gallery operations: list available models, install from gallery, get install status. Powers the/models/availableand/models/applyAPI endpoints. -
core/config/config.goConfig loader that scans the models directory, parses YAML backend configs, and builds the model registry. Shows how LocalAI auto-discovers model files without explicit registration.
Suggested trace strategy:
- Read
core/config/backend_config.goto understand every field available in a model's YAML config file - Trace
core/gallery/gallery.goInstallModel()to understand how gallery install downloads both the YAML config and model weights - Check
core/config/config.goLoadConfigs()to see how the models directory is scanned and which YAML fields are required vs optional