Chapter 3: Text Generation and Chat Completions
April 13, 2026 · View on GitHub
Welcome to Chapter 3: Text Generation and Chat Completions. In this part of LocalAI Tutorial: Self-Hosted OpenAI Alternative, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Master text generation with LocalAI using OpenAI-compatible APIs, chat formats, and advanced parameters.
Text Generation Request Flow
sequenceDiagram
participant C as Client (openai SDK)
participant L as LocalAI Server
participant B as llama.cpp Backend
C->>L: POST /v1/chat/completions\n{model, messages, stream}
L->>B: Forward to loaded model
B->>L: Token stream
L->>C: SSE chunks (if stream=true)
L->>C: Final JSON response
Overview
LocalAI provides complete OpenAI API compatibility for text generation. This chapter covers chat completions, parameter tuning, and conversation management.
Chat Completions API
Basic Chat Completion
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
response = client.chat.completions.create(
model="phi-2",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
)
print(response.choices[0].message.content)
Response Structure
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "phi-2",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 25,
"completion_tokens": 7,
"total_tokens": 32
}
}
Advanced Parameters
Generation Control
response = client.chat.completions.create(
model="phi-2",
messages=[{"role": "user", "content": "Write a creative story"}],
max_tokens=500, # Maximum response length
temperature=0.8, # Randomness (0.0-2.0)
top_p=0.9, # Nucleus sampling (0.0-1.0)
top_k=40, # Top-k sampling
frequency_penalty=0.0, # Reduce repetition (-2.0 to 2.0)
presence_penalty=0.0, # Encourage diversity (-2.0 to 2.0)
repeat_penalty=1.1, # Repetition penalty
seed=42 # For reproducible results
)
Parameter Guide
| Parameter | Purpose | Recommended Values |
|---|---|---|
temperature | Controls randomness | 0.1-0.3 (factual), 0.7-1.0 (creative) |
top_p | Nucleus sampling | 0.1-0.5 (focused), 0.9-1.0 (diverse) |
top_k | Top-k sampling | 10-50 (most use cases) |
max_tokens | Response length limit | 100-2000 depending on use case |
frequency_penalty | Reduce repetition | 0.0-0.5 (slight reduction) |
presence_penalty | Encourage new topics | 0.0-0.3 (moderate encouragement) |
Chat Formats and Templates
Supported Chat Templates
LocalAI automatically detects and applies chat templates:
# Llama 2 chat format (automatically applied)
response = client.chat.completions.create(
model="llama-2-7b-chat",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
# Mistral chat format
response = client.chat.completions.create(
model="mistral-7b-instruct",
messages=[
{"role": "user", "content": "Explain recursion"}
]
)
Custom Chat Templates
# model-config.yaml
name: custom-model
backend: llama
parameters:
model: model.gguf
# Custom chat template
chat_template: |
{% for message in messages %}
{% if message.role == "system" %}{{ message.content }}{% endif %}
{% if message.role == "user" %}[INST] {{ message.content }} [/INST]{% endif %}
{% if message.role == "assistant" %}{{ message.content }}{% endif %}
{% endfor %}
Streaming Responses
Basic Streaming
response = client.chat.completions.create(
model="phi-2",
messages=[{"role": "user", "content": "Tell me a long story"}],
stream=True,
max_tokens=1000
)
for chunk in response:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print() # New line
Streaming with Event Handling
def stream_with_events():
response = client.chat.completions.create(
model="phi-2",
messages=[{"role": "user", "content": "Explain machine learning"}],
stream=True
)
full_response = ""
for chunk in response:
delta = chunk.choices[0].delta
# Handle different content types
if delta.content:
content = delta.content
print(content, end="", flush=True)
full_response += content
# Check for completion
if chunk.choices[0].finish_reason:
print(f"\n\nFinished: {chunk.choices[0].finish_reason}")
return full_response
# Usage
result = stream_with_events()
Conversation Management
Multi-Turn Conversations
class ConversationManager:
def __init__(self, model="phi-2"):
self.model = model
self.messages = []
def add_message(self, role, content):
"""Add a message to the conversation."""
self.messages.append({"role": role, "content": content})
def send_message(self, user_message):
"""Send user message and get AI response."""
self.add_message("user", user_message)
response = client.chat.completions.create(
model=self.model,
messages=self.messages,
max_tokens=500
)
ai_response = response.choices[0].message.content
self.add_message("assistant", ai_response)
return ai_response
def get_history(self):
"""Get conversation history."""
return self.messages.copy()
# Usage
chat = ConversationManager()
print("AI:", chat.send_message("Hello! My name is Alice."))
print("AI:", chat.send_message("What's my name?"))
print("AI:", chat.send_message("Can you remind me what we talked about?"))
System Messages and Personas
def create_persona_chat(persona_description, model="phi-2"):
"""Create a chat session with a specific persona."""
chat_manager = ConversationManager(model)
# Add system message
chat_manager.add_message("system", persona_description)
def chat(user_input):
return chat_manager.send_message(user_input)
return chat
# Create different personas
coding_assistant = create_persona_chat(
"You are an expert Python programmer. Provide clear, well-commented code examples."
)
creative_writer = create_persona_chat(
"You are a creative writing assistant. Help users develop stories and characters."
)
# Use personas
print("Coding Assistant:", coding_assistant("Write a function to reverse a string"))
print("Creative Writer:", creative_writer("Help me name a character for a sci-fi story"))
Advanced Text Generation
Structured Output
def generate_structured_response(schema_description, user_query):
"""Generate responses that follow a specific structure."""
system_prompt = f"""
You must respond in a structured format. {schema_description}
Always follow the exact format specified. Be concise but complete.
"""
response = client.chat.completions.create(
model="phi-2",
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_query}
],
temperature=0.1 # Low temperature for consistency
)
return response.choices[0].message.content
# Examples
# JSON output
json_schema = "Respond with valid JSON containing 'name', 'age', and 'occupation' fields."
json_response = generate_structured_response(json_schema, "Create a profile for a software engineer")
# List format
list_schema = "Respond with a numbered list of exactly 5 items."
list_response = generate_structured_response(list_schema, "List the benefits of exercise")
Few-Shot Prompting
def few_shot_generation(examples, task_description, model="phi-2"):
"""Use few-shot prompting for better results."""
system_message = "You are an AI that learns from examples. Respond in the same style as the examples."
# Build prompt with examples
prompt = "Here are some examples:\n\n"
for example in examples:
prompt += f"Input: {example['input']}\n"
prompt += f"Output: {example['output']}\n\n"
prompt += f"Now respond to: {task_description}"
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system_message},
{"role": "user", "content": prompt}
],
temperature=0.7
)
return response.choices[0].message.content
# Example usage
examples = [
{
"input": "The sky is blue",
"output": "The sky appears blue due to light scattering"
},
{
"input": "Plants need sunlight",
"output": "Plants use sunlight for photosynthesis to create energy"
}
]
result = few_shot_generation(
examples,
"Why do we have seasons?",
model="phi-2"
)
Chain of Thought Prompting
def chain_of_thought_reasoning(problem, model="phi-2"):
"""Use chain of thought prompting for complex reasoning."""
cot_prompt = f"""
Solve this problem step by step. Show your reasoning clearly.
Problem: {problem}
Think through this systematically:
1. Understand what is being asked
2. Identify the key information provided
3. Consider what approach or formula to use
4. Perform the necessary calculations
5. Provide the final answer
Let's work through this step by step:
"""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": cot_prompt}],
max_tokens=1000,
temperature=0.3 # Balanced creativity and focus
)
return response.choices[0].message.content
# Usage
math_problem = "If a train travels at 60 mph for 2.5 hours, how far does it go?"
solution = chain_of_thought_reasoning(math_problem)
logic_problem = "All roses are flowers. Some flowers fade quickly. Can we conclude that some roses fade quickly?"
logic_solution = chain_of_thought_reasoning(logic_problem)
Batch Processing
Multiple Requests
import asyncio
import aiohttp
async def async_chat_completion(session, model, messages, **kwargs):
"""Async chat completion."""
async with session.post(
"http://localhost:8080/v1/chat/completions",
json={
"model": model,
"messages": messages,
**kwargs
}
) as response:
return await response.json()
async def batch_process(prompts, model="phi-2"):
"""Process multiple prompts concurrently."""
async with aiohttp.ClientSession() as session:
tasks = []
for prompt in prompts:
messages = [{"role": "user", "content": prompt}]
task = async_chat_completion(session, model, messages, max_tokens=200)
tasks.append(task)
# Wait for all to complete
results = await asyncio.gather(*tasks)
return [result["choices"][0]["message"]["content"] for result in results]
# Usage
prompts = [
"Explain recursion simply",
"What is machine learning?",
"Write a Python hello world",
"What are cloud services?",
"Explain quantum computing"
]
results = await asyncio.run(batch_process(prompts))
for i, (prompt, result) in enumerate(zip(prompts, results)):
print(f"{i+1}. {prompt}")
print(f" {result[:100]}...")
print()
Error Handling and Validation
Robust API Calls
import time
import requests
from typing import Optional
def robust_chat_completion(
messages,
model="phi-2",
max_retries=3,
timeout=30
):
"""Robust chat completion with error handling."""
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model=model,
messages=messages,
max_tokens=500,
timeout=timeout
)
# Validate response
if not response.choices:
raise ValueError("No choices in response")
content = response.choices[0].message.content
if not content or not content.strip():
raise ValueError("Empty response content")
return response
except client.APIError as e:
if attempt == max_retries - 1:
raise
print(f"API error (attempt {attempt + 1}): {e}")
time.sleep(2 ** attempt) # Exponential backoff
except client.RateLimitError as e:
if attempt == max_retries - 1:
raise
print(f"Rate limited (attempt {attempt + 1}): {e}")
time.sleep(5) # Fixed delay for rate limits
except Exception as e:
print(f"Unexpected error (attempt {attempt + 1}): {e}")
if attempt == max_retries - 1:
raise
time.sleep(1)
# Usage
try:
response = robust_chat_completion([
{"role": "user", "content": "Hello, world!"}
])
print("Success:", response.choices[0].message.content)
except Exception as e:
print(f"Failed after retries: {e}")
Performance Optimization
Parameter Tuning
# Fast generation (lower quality)
fast_response = client.chat.completions.create(
model="phi-2",
messages=[{"role": "user", "content": "Summarize this article"}],
max_tokens=100,
temperature=0.1, # More deterministic
top_p=0.5, # Focused sampling
top_k=20 # Smaller top-k
)
# Quality generation (slower)
quality_response = client.chat.completions.create(
model="phi-2",
messages=[{"role": "user", "content": "Write a detailed analysis"}],
max_tokens=1000,
temperature=0.8, # More creative
top_p=0.95, # Diverse sampling
top_k=50 # Larger top-k
)
Context Management
def manage_context(messages, max_context_length=4000, model_context=4096):
"""Manage conversation context to stay within limits."""
# Reserve space for response
available_context = model_context - 1000 # Leave room for response
total_length = sum(len(msg["content"]) for msg in messages)
if total_length <= available_context:
return messages
# Truncate older messages
truncated_messages = []
current_length = 0
# Always keep system message if present
system_msg = None
if messages and messages[0]["role"] == "system":
system_msg = messages[0]
messages = messages[1:]
truncated_messages.append(system_msg)
current_length += len(system_msg["content"])
# Add recent messages
for msg in reversed(messages):
msg_length = len(msg["content"])
if current_length + msg_length <= available_context:
truncated_messages.insert(-1 if system_msg else 0, msg)
current_length += msg_length
else:
break
return truncated_messages
# Usage
long_conversation = [
{"role": "system", "content": "You are a helpful assistant."},
# ... many messages ...
]
optimized_messages = manage_context(long_conversation)
response = client.chat.completions.create(
model="phi-2",
messages=optimized_messages
)
Best Practices
- Parameter Tuning: Start with conservative parameters and adjust based on results
- Context Management: Monitor context length to avoid truncation
- Error Handling: Always implement retry logic for production use
- Streaming: Use streaming for better user experience with long responses
- Validation: Validate response content and structure
- Performance: Balance quality vs speed based on use case requirements
Next: Explore image generation capabilities with Stable Diffusion models.
What Problem Does This Solve?
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for content, messages, model so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 3: Text Generation and Chat Completions as an operating subsystem inside LocalAI Tutorial: Self-Hosted OpenAI Alternative, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around response, chat, role as your checklist when adapting these patterns to your own repository.
How it Works Under the Hood
Under the hood, Chapter 3: Text Generation and Chat Completions usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for
content. - Input normalization: shape incoming data so
messagesreceives stable contracts. - Core execution: run the main logic branch and propagate intermediate state through
model. - Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Source Walkthrough
Use the following upstream sources to verify implementation details while reading this chapter:
-
core/http/endpoints/openai/chat.goHTTP handler forPOST /v1/chat/completions. Parses the OpenAIChatCompletionRequest, resolves the model backend, dispatches to the inference engine, and formats the streaming or non-streaming response. The critical file for understanding OpenAI API compatibility. -
core/http/endpoints/openai/completion.goHandler forPOST /v1/completions(legacy text completion API). Shows howpromptparameter maps to the backend inference call, distinct from the chat completions message format. -
backend/python/transformers/Python gRPC backend for HuggingFace Transformers models. Thebackend.pyfile shows how LocalAI calls a subprocess gRPC server for Python-based backends, enabling use of any HuggingFace model. -
core/backend/llm.goCore LLM inference dispatcher. Routes text generation requests to the appropriate backend (llama-cpp, transformers, vllm, etc.) based on model config. Shows how streaming token callbacks are implemented across backends.
Suggested trace strategy:
- Trace
core/http/endpoints/openai/chat.go→core/backend/llm.goto follow a chat completion request from HTTP parse to backend inference - Compare the streaming response format in
chat.gowithcompletion.goto understand SSE chunking implementation - Check
backend/python/transformers/backend.pyto see how Python backends communicate with the Go server via gRPC protobuf