SQLite-AI: API Reference
March 6, 2026
This document provides reference-level documentation for all public SQLite-AI functions, virtual tables, and metadata properties exposed to SQL. These functions enable loading and interacting with LLMs, configuring samplers, generating embeddings and text, and managing chat sessions.
ai_version()
Returns: TEXT
Description: Returns the current version of the SQLite-AI extension.
Example:
SELECT ai_version();
-- e.g., '1.0.0'
ai_log_info(extended_enable BOOLEAN)
Returns: NULL
Description:
Enables or disables extended logging information. Use 1 to enable, 0 to disable.
Example:
SELECT ai_log_info(1);
llm_model_load(path TEXT, options TEXT)
Returns: NULL
Description:
Loads a GGUF model from the specified file path with optional comma-separated key=value configuration.
If no options are provided, the following default is used: gpu_layers=99.
The following keys are available:
| Key | Type | Meaning |
|---|---|---|
| gpu_layers | number | Number of layers to store in VRAM. |
| main_gpu | number | GPU used for the entire model when split_mode is 0. |
| split_mode | 0, 1 or 2 | How to split the model across multiple GPUs: 0 = none, 1 = layer, 2 = rows. |
| vocab_only | 1 or 0 | Only load the vocabulary, no weights. |
| use_mmap | 1 or 0 | Use mmap if possible. |
| use_mlock | 1 or 0 | Force the system to keep the model in RAM. |
| check_tensors | 1 or 0 | Validate model tensor data. |
| log_info | 1 or 0 | Enable or disable info logging. |
Example:
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=99');
llm_model_free()
Returns: NULL
Description: Unloads the current model and frees associated memory.
Example:
SELECT llm_model_free();
llm_context_create(context_settings TEXT)
Parameters: context_settings: comma-separated key=value pairs (see [context settings](#context_settings)).
Returns: NULL
Description: Creates a new inference context with comma-separated key=value configuration.
A context must be explicitly created before performing any AI operation!
context_settings
The following keys are available in context_settings:
General
| Key | Type | Meaning |
|---|---|---|
| generate_embedding | 1 or 0 | Force the model to generate embeddings. |
| normalize_embedding | 1 or 0 | Force normalization during embedding generation (defaults to 1). |
| json_output | 1 or 0 | Force JSON output in embedding generation (defaults to 0). |
| max_tokens | number | Set a maximum number of input tokens. If the input is larger, an error is returned. |
| n_predict | number | Control the maximum number of tokens generated during text generation. |
| embedding_type | FLOAT32, FLOAT16, BFLOAT16, UINT8, INT8 | Set the model's native type; mandatory during embedding generation. |
Core sizing & threading
| Key | Type | Meaning |
|---|---|---|
| context_size | number | Equivalent to n_ctx = N and n_batch = N. |
| n_ctx | number | Text context length (tokens). 0 = from model. |
| n_batch | number | Logical max batch size submitted to llama_decode. |
| n_ubatch | number | Physical max micro-batch size. |
| n_seq_max | number | Max concurrent sequences (parallel states for recurrent models). |
| n_threads | number | Threads for generation. |
| n_threads_batch | number | Threads for batch processing. |
Attention, pooling & flash-attention
| Key | Type | Meaning |
|---|---|---|
| pooling_type | none, unspecified, mean, cls, last or rank | How to aggregate token embeddings (e.g., mean). |
| attention_type | unspecified, causal, non_causal | Attention algorithm for embeddings. |
| flash_attn_type | auto, disabled, enabled | Controls when/if Flash-Attention is used. |
RoPE & YaRN (positional scaling)
| Key | Type | Meaning |
|---|---|---|
| rope_scaling_type | unspecified, none, linear, yarn, longrope | RoPE scaling strategy. |
| rope_freq_base | float number | RoPE base frequency. 0 = from model. |
| rope_freq_scale | float number | RoPE frequency scaling factor. 0 = from model. |
| yarn_ext_factor | float number | YaRN extrapolation mix factor. <0 = from model. |
| yarn_attn_factor | float number | YaRN magnitude scaling. |
| yarn_beta_fast | float number | YaRN low correction dimension. |
| yarn_beta_slow | float number | YaRN high correction dimension. |
| yarn_orig_ctx | number | YaRN original context size. |
KV cache types (experimental)
| Key | Type | Meaning |
|---|---|---|
| type_k | ggml_type | Data type for the K cache. |
| type_v | ggml_type | Data type for the V cache. |
Flags
Key order in the option string does not matter.
| Key | Type | Meaning |
|---|---|---|
| embeddings | 1 or 0 | If 1, extract embeddings (with logits). Used by the embedding preset. |
| offload_kqv | 1 or 0 | Offload KQV ops (incl. KV cache) to GPU. |
| no_perf | 1 or 0 | Disable performance timing. |
| op_offload | 1 or 0 | Offload host tensor ops to device. |
| swa_full | 1 or 0 | Use full-size SWA cache. When 0 and n_seq_max > 1, performance may degrade. |
| kv_unified | 1 or 0 | Use a unified buffer across input sequences during attention. Try disabling when n_seq_max > 1 and sequences do not share a long prefix. |
| defrag_thold | float number | Deprecated. Defragment the KV cache if holes/size > thold. <= 0 disables. |
Example:
SELECT llm_context_create('n_ctx=2048,n_threads=6,n_batch=256');
llm_context_create_embedding(context_settings TEXT)
Parameters: context_settings (optional): Comma-separated key=value pairs to override or extend default settings (see context settings in llm_context_create).
Returns: NULL
Description: Creates a new inference context specifically set for embedding generation.
It is equivalent to SELECT llm_context_create('generate_embedding=1,normalize_embedding=1,pooling_type=mean');
A context must be explicitly created before performing any AI operation!
Example:
SELECT llm_context_create_embedding();
llm_context_create_chat(context_settings TEXT)
Parameters: context_settings (optional): Comma-separated key=value pairs to override or extend default settings (see context settings in llm_context_create).
Returns: NULL
Description: Creates a new inference context specifically set for chat conversation.
It is equivalent to SELECT llm_context_create('context_size=4096');
A context must be explicitly created before performing any AI operation!
Example:
SELECT llm_context_create_chat();
llm_context_create_textgen(context_settings TEXT)
Parameters: context_settings (optional): Comma-separated key=value pairs to override or extend default settings (see context settings in llm_context_create).
Returns: NULL
Description: Creates a new inference context specifically set for text generation.
It is equivalent to SELECT llm_context_create('context_size=4096');
A context must be explicitly created before performing any AI operation!
Example:
SELECT llm_context_create_textgen();
llm_context_free()
Returns: NULL
Description: Frees the current inference context.
Example:
SELECT llm_context_free();
llm_context_size()
Returns: INTEGER
Description:
Returns the total token capacity (context window) of the current llama context. Use this after llm_context_create to confirm the configured context_size. Raises an error if no context is active.
Example:
SELECT llm_context_size();
-- 4096
llm_context_used()
Returns: INTEGER
Description:
Returns how many tokens of the current llama context have already been consumed. Combine this with llm_context_size() to monitor usage. Raises an error if no context is active.
Example:
SELECT llm_context_used();
-- 1024
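Together, llm_context_size() and llm_context_used() can drive a simple usage check. A sketch (the percentage calculation is illustrative, not part of the API):

```sql
-- Report context-window usage; only llm_context_size() and
-- llm_context_used() are SQLite-AI functions, the math is plain SQL.
SELECT
  llm_context_used() AS used_tokens,
  llm_context_size() AS total_tokens,
  ROUND(100.0 * llm_context_used() / llm_context_size(), 1) AS pct_used;
```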
llm_sampler_create()
Returns: NULL
Description: Initializes a new sampling strategy for text generation. A sampler is the mechanism that determines how the model selects the next token (word or subword) during text generation. If no sampler is explicitly created, one will be created automatically when needed.
Example:
SELECT llm_sampler_create();
llm_sampler_free()
Returns: NULL
Description: Frees resources associated with the current sampler.
Example:
SELECT llm_sampler_free();
llm_lora_load(path TEXT, scale REAL)
Returns: NULL
Description: Loads a LoRA adapter from the given file path with a mandatory scale value. LoRA (Low-Rank Adaptation) is a technique to inject trainable, low-rank layers into a pre-trained model.
Example:
SELECT llm_lora_load('./adapters/adapter.lora', 1.0);
llm_lora_free()
Returns: NULL
Description: Unloads any currently loaded LoRA adapter.
Example:
SELECT llm_lora_free();
llm_sampler_init_greedy()
Returns: NULL
Description: Configures the sampler to use greedy decoding (always pick the most probable token).
Example:
SELECT llm_sampler_init_greedy();
llm_sampler_init_dist(seed INT)
Returns: NULL
Description: Initializes a random distribution-based sampler with the given seed. If no seed value is specified, the default value 0xFFFFFFFF is used.
Example:
SELECT llm_sampler_init_dist(42);
llm_sampler_init_top_k(k INT)
Returns: NULL
Description:
Limits sampling to the top k most likely tokens.
Top-K sampling as described in the paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Example:
SELECT llm_sampler_init_top_k(40);
llm_sampler_init_top_p(p REAL, min_keep INT)
Returns: NULL
Description:
Top-p sampling retains tokens with cumulative probability >= p. Always keeps at least min_keep tokens.
Nucleus sampling as described in the paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
Example:
SELECT llm_sampler_init_top_p(0.9, 1);
llm_sampler_init_min_p(p REAL, min_keep INT)
Returns: NULL
Description:
Like top-p but with a minimum token probability threshold p.
Minimum P sampling as described in https://github.com/ggml-org/llama.cpp/pull/3841
Example:
SELECT llm_sampler_init_min_p(0.05, 1);
llm_sampler_init_typical(p REAL, min_keep INT)
Returns: NULL
Description: Typical sampling prefers tokens near the expected entropy level. Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666
Example:
SELECT llm_sampler_init_typical(0.95, 1);
llm_sampler_init_temp(t REAL)
Returns: NULL
Description: Adjusts the sampling temperature to control randomness.
Example:
SELECT llm_sampler_init_temp(0.8);
llm_sampler_init_temp_ext(t REAL, delta REAL, exponent REAL)
Returns: NULL
Description: Advanced temperature control using exponential scaling. Dynamic temperature implementation (a.k.a. entropy) described in the paper https://arxiv.org/abs/2309.02772
Example:
SELECT llm_sampler_init_temp_ext(0.8, 0.1, 2.0);
llm_sampler_init_xtc(p REAL, t REAL, min_keep INT, seed INT)
Returns: NULL
Description: Combines top-p, temperature, and seed-based sampling with a minimum token count. XTC sampler as described in https://github.com/oobabooga/text-generation-webui/pull/6335
Example:
SELECT llm_sampler_init_xtc(0.9, 0.8, 1, 42);
llm_sampler_init_top_n_sigma(n REAL)
Returns: NULL
Description:
Limits sampling to tokens within n standard deviations.
Top-n-sigma sampling as described in the paper "Top-nσ: Not All Logits Are You Need" https://arxiv.org/pdf/2411.07641
Example:
SELECT llm_sampler_init_top_n_sigma(1.5);
llm_sampler_init_mirostat(seed INT, tau REAL, eta REAL, m INT)
Returns: NULL
Description: Initializes Mirostat sampling with entropy control. Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
Example:
SELECT llm_sampler_init_mirostat(42, 5.0, 0.1, 100);
llm_sampler_init_mirostat_v2(seed INT, tau REAL, eta REAL)
Returns: NULL
Description: Mirostat v2 entropy-based sampling. Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
Example:
SELECT llm_sampler_init_mirostat_v2(42, 5.0, 0.1);
llm_sampler_init_grammar(grammar_str TEXT, grammar_root TEXT)
Returns: NULL
Description: Constrains output to match a specified grammar. Grammar syntax described in https://github.com/ggml-org/llama.cpp/tree/master/grammars
Example:
SELECT llm_sampler_init_grammar('...BNF...', 'root');
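As a concrete illustration, a minimal GBNF grammar (the syntax used by llama.cpp grammar files) can restrict output to a fixed set of answers; exact behavior depends on the loaded model:

```sql
-- Constrain generation to exactly "yes" or "no" with a minimal GBNF grammar;
-- 'root' names the start rule, as in llama.cpp grammar files.
SELECT llm_sampler_init_grammar('root ::= "yes" | "no"', 'root');
SELECT llm_text_generate('Is SQLite embeddable? Answer yes or no.', 'n_predict=4');
```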
llm_sampler_init_infill()
Returns: NULL
Description: Enables infill (prefix-suffix) mode for completions.
Example:
SELECT llm_sampler_init_infill();
llm_sampler_init_penalties(n INT, repeat REAL, freq REAL, present REAL)
Returns: NULL
Description: Applies repetition, frequency, and presence penalties.
Example:
SELECT llm_sampler_init_penalties(64, 1.2, 0.5, 0.8);
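Whether successive llm_sampler_init_* calls compose into one pipeline is not stated above; the sketch below assumes they do, mirroring llama.cpp's sampler-chain behavior:

```sql
-- Hypothetical sampler pipeline (assumes init calls compose into a chain):
-- top-k filtering, nucleus sampling, temperature, then seeded selection.
SELECT llm_sampler_create();
SELECT llm_sampler_init_top_k(40);
SELECT llm_sampler_init_top_p(0.9, 1);
SELECT llm_sampler_init_temp(0.8);
SELECT llm_sampler_init_dist(42);
```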
llm_token_count(text TEXT)
Returns: INTEGER
Description:
Returns how many tokens the current model would consume for the supplied text, using the active context’s vocabulary. Requires a context created via llm_context_create.
Example:
SELECT llm_token_count('Hello world!');
-- 5
llm_embed_generate(text TEXT, options TEXT)
Returns: BLOB or TEXT
Description:
Generates a text embedding as a BLOB vector, with optional configuration provided as a comma-separated list of key=value pairs.
By default, the embedding is normalized unless normalize_embedding=0 is specified.
If json_output=1 is set, the function returns a JSON object instead of a BLOB.
Example:
SELECT llm_embed_generate('hello world', 'json_output=1');
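A common pattern is to store embeddings alongside the rows they describe. A sketch, assuming a model is already loaded and using a hypothetical documents table:

```sql
-- Hypothetical table; llm_embed_generate returns a BLOB vector by default.
CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, content TEXT, embedding BLOB);
SELECT llm_context_create_embedding();
UPDATE documents SET embedding = llm_embed_generate(content) WHERE embedding IS NULL;
```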
llm_text_generate(text TEXT, [image1, image2, ...], options TEXT)
Returns: TEXT
Description: Generates a full-text completion based on input, with optional configuration provided as a comma-separated list of key=value pairs.
When a vision model is loaded via llm_vision_load(), you can pass one or more images as additional arguments. Images can be file paths (TEXT) or raw image data (BLOB). Supported image formats: JPG, PNG, BMP, GIF.
Examples:
-- Text-only generation
SELECT llm_text_generate('Once upon a time', 'n_predict=1024');
-- Vision: describe an image
SELECT llm_text_generate('Describe this image', './photos/cat.jpg');
-- Vision: compare multiple images
SELECT llm_text_generate('What is different between these images?', './img1.jpg', './img2.jpg');
-- Vision: image from BLOB column
SELECT llm_text_generate('What do you see?', image_data) FROM photos WHERE id = 1;
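Putting the pieces together, a minimal end-to-end text-generation session might look like this (the model path is illustrative):

```sql
-- End-to-end sketch: load a model, create a text-generation context,
-- generate, then release resources. The model path is illustrative.
SELECT llm_model_load('./models/llama.gguf', 'gpu_layers=99');
SELECT llm_context_create_textgen();
SELECT llm_text_generate('Once upon a time', 'n_predict=256');
SELECT llm_context_free();
SELECT llm_model_free();
```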
llm_chat(prompt TEXT)
Returns: VIRTUAL TABLE
Description: Streams a chat-style reply one token per row.
Example:
SELECT reply FROM llm_chat('Tell me a joke.');
llm_chat_create()
Returns: TEXT
Description: Starts a new in-memory chat session and returns a unique UUIDv7 identifier. If no chat is explicitly created, one will be created automatically when needed.
Example:
SELECT llm_chat_create();
llm_chat_free()
Returns: NULL
Description: Ends the current chat session.
Example:
SELECT llm_chat_free();
llm_chat_save(title TEXT, meta TEXT)
Returns: TEXT
Description: Saves the current chat session with optional title and meta into the ai_chat_history and ai_chat_messages tables and returns a UUID.
Example:
SELECT llm_chat_save('Support Chat', '{"user": "Marco"}');
llm_chat_restore(uuid TEXT)
Returns: NULL
Description: Restores a previously saved chat session by UUID.
Example:
SELECT llm_chat_restore('b59e...');
llm_chat_respond(text TEXT, [image1, image2, ...])
Returns: TEXT
Description: Generates a context-aware reply using chat memory, returned as a single, complete response. For a streaming model reply, use the llm_chat virtual table.
When a vision model is loaded via llm_vision_load(), you can pass one or more images as additional arguments. Images can be file paths (TEXT) or raw image data (BLOB). Supported image formats: JPG, PNG, BMP, GIF.
Examples:
-- Text-only chat
SELECT llm_chat_respond('What are the most visited cities in Italy?');
-- Vision: ask about an image
SELECT llm_chat_respond('What is in this photo?', './photos/landscape.jpg');
-- Vision: multiple images
SELECT llm_chat_respond('Compare these two charts', './chart1.png', './chart2.png');
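A full session lifecycle, combining the chat functions above, might be sketched as follows (the system prompt and titles are illustrative):

```sql
-- Chat lifecycle sketch: start a session, converse, persist it,
-- and later restore it by the UUID returned from llm_chat_save.
SELECT llm_chat_create();
SELECT llm_chat_system_prompt('You are a concise travel assistant.');
SELECT llm_chat_respond('What are the most visited cities in Italy?');
SELECT llm_chat_save('Italy trip', '{"user": "Marco"}');  -- returns a UUID
SELECT llm_chat_free();
-- Later, in the same database:
-- SELECT llm_chat_restore('<uuid returned by llm_chat_save>');
```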
llm_chat_system_prompt(text TEXT)
Returns: TEXT or NULL
Description:
Gets or sets the system prompt for chat sessions. When called without arguments, returns the current system prompt (or NULL if none is set). When called with a text argument, sets the system prompt and returns NULL. The system prompt is automatically prepended as a system-role message at the beginning of chat conversations.
Example:
-- Set a system prompt
SELECT llm_chat_system_prompt('You are a helpful assistant that speaks Italian.');
-- Get the current system prompt
SELECT llm_chat_system_prompt();
Vision Functions
llm_vision_load(path TEXT, options TEXT)
Returns: NULL
Description:
Loads a multimodal projector (mmproj) model for vision capabilities. This requires a text model to already be loaded via llm_model_load(). The mmproj file is a separate GGUF file that contains the vision encoder and projector weights.
Once loaded, vision capabilities are available through llm_text_generate() and llm_chat_respond() by passing image arguments.
The following option keys are available:
| Key | Type | Default | Meaning |
|---|---|---|---|
| use_gpu | 1 or 0 | 1 | Use GPU for vision encoding. |
| n_threads | number | 4 | Number of threads for vision processing. |
| warmup | 1 or 0 | 1 | Run a warmup pass on load for faster first use. |
| flash_attn_type | auto, disabled, enabled | auto | Controls Flash Attention for the vision encoder. |
| image_min_tokens | number | 0 | Minimum image tokens for dynamic resolution models (0 = from model). |
| image_max_tokens | number | 0 | Maximum image tokens for dynamic resolution models (0 = from model). |
Example:
-- Load text model first
SELECT llm_model_load('./models/Gemma-3-4B-IT-Q4_K_M.gguf', 'gpu_layers=99');
SELECT llm_context_create_textgen();
-- Load vision projector
SELECT llm_vision_load('./models/mmproj-Gemma-3-4B-IT-f16.gguf');
-- Now use vision with llm_text_generate or llm_chat_respond
SELECT llm_text_generate('Describe this image', './photos/cat.jpg');
llm_vision_free()
Returns: NULL
Description: Unloads the current vision (mmproj) model and frees associated memory. The text model remains loaded.
Example:
SELECT llm_vision_free();
Audio Functions
audio_model_load(path TEXT, options TEXT)
Returns: NULL
Description:
Loads a Whisper model from the specified file path with optional comma-separated key=value configuration. The model is used for audio transcription via audio_model_transcribe. Only one Whisper model can be loaded at a time per connection.
Example:
-- Load with defaults
SELECT audio_model_load('./models/ggml-tiny.bin');
-- Load with options
SELECT audio_model_load('./models/ggml-base.bin', 'gpu_layers=0');
audio_model_free()
Returns: NULL
Description: Unloads the current Whisper model and frees associated memory.
Example:
SELECT audio_model_free();
audio_model_transcribe(input TEXT/BLOB, options TEXT)
Returns: TEXT
Description: Transcribes audio to text using the loaded Whisper model. The input can be either:
- TEXT: A file path to an audio file (WAV, MP3, or FLAC)
- BLOB: Raw audio data (format auto-detected from magic bytes)
An optional second parameter accepts comma-separated key=value pairs to configure transcription behavior.
Supported audio formats: WAV, MP3, FLAC. Audio is automatically converted to mono 16kHz PCM as required by Whisper.
Transcription options:
| Key | Type | Default | Meaning |
|---|---|---|---|
| language | text | en | Language code (e.g., en, it, fr, de). |
| translate | 1 or 0 | 0 | Translate to English. |
| n_threads | number | 4 | Number of threads for decoding. |
| offset_ms | number | 0 | Start transcription at this offset (milliseconds). |
| duration_ms | number | 0 | Transcribe only this duration (0 = full audio). |
| no_timestamps | 1 or 0 | 0 | Suppress timestamps in output. |
| single_segment | 1 or 0 | 0 | Force single-segment output. |
| token_timestamps | 1 or 0 | 0 | Enable token-level timestamps. |
| initial_prompt | text | | Initial prompt to guide the model. |
| temperature | float | 0.0 | Sampling temperature. |
| beam_size | number | -1 | Beam search size (-1 = use default). |
| audio_ctx | number | 0 | Audio context size (0 = use default). |
| suppress_regex | text | | Regex pattern for suppressing tokens. |
| max_len | number | 0 | Maximum segment length in characters (0 = no limit). |
| print_timestamps | 1 or 0 | 0 | Include timestamps in transcribed text. |
Examples:
-- Transcribe from a file path
SELECT audio_model_transcribe('./audio/speech.wav');
-- Transcribe from a BLOB column
SELECT audio_model_transcribe(audio_data) FROM recordings WHERE id = 1;
-- Transcribe with options
SELECT audio_model_transcribe('./audio/speech.mp3', 'language=it,translate=1');
-- Transcribe a single segment with no timestamps
SELECT audio_model_transcribe('./audio/clip.flac', 'single_segment=1,no_timestamps=1');
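For bulk work, transcription results can be written back into the database in a single statement. A sketch with hypothetical recordings and transcripts tables:

```sql
-- Batch-transcribe sketch; the recordings/transcripts tables are hypothetical.
SELECT audio_model_load('./models/ggml-base.bin');
INSERT INTO transcripts (recording_id, text)
SELECT id, audio_model_transcribe(audio_data, 'language=en,no_timestamps=1')
FROM recordings
WHERE id NOT IN (SELECT recording_id FROM transcripts);
```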
Model Metadata
These functions return internal model properties:
SELECT
llm_model_n_params(),
llm_model_size(),
llm_model_n_ctx_train(),
llm_model_n_embd(),
llm_model_n_layer(),
llm_model_n_head(),
llm_model_n_head_kv(),
llm_model_n_swa(),
llm_model_rope_freq_scale_train(),
llm_model_n_cls_out(),
llm_model_cls_label(),
llm_model_desc(),
llm_model_has_encoder(),
llm_model_has_decoder(),
llm_model_is_recurrent(),
llm_model_chat_template();
All return INTEGER, REAL, or TEXT values depending on the property.
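These properties are useful for sanity checks. For example, a FLOAT32 embedding BLOB should hold 4 bytes per dimension, where llm_model_n_embd() reports the model's embedding dimension (a sketch, assuming FLOAT32 embeddings):

```sql
-- Compare the model's embedding dimension with the size of a generated
-- BLOB; length() on a BLOB returns its size in bytes in SQLite.
SELECT llm_model_n_embd() AS dims,
       length(llm_embed_generate('probe text')) / 4 AS blob_dims;
```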