Multimodal Memory
June 10, 2026
EverOS turns non-text content — images, PDFs, audio, office documents,
HTML, email — into the same structured, searchable memory as plain
text. You attach the asset to a message at ingest time; a vision/audio
capable LLM parses it into text, and from there it flows through the
identical extraction → markdown → index pipeline as any text turn. The
result is fully retrievable with the same /search stack.
Table of contents
- How it works
- Prerequisites
- Supported modalities
- Sending multimodal content
- Configuration reference
- Errors and limits
- Searching multimodal memory
How it works
POST /api/v1/memory/add
messages[].content = [ ContentItem, ContentItem, ... ]
│
│ text items → used verbatim
│ non-text items → multimodal LLM (everalgo-parser)
▼
parsed text merged back into the session buffer (in original order)
│
▼
boundary detector → extraction LLM → MemCell
│
▼
markdown (truth) + SQLite (state) + LanceDB (vector + BM25)
│
▼
retrievable via /search and /get like any text memory
Each non-text ContentItem is routed through the parser, which calls
a separate, vision/audio capable LLM (configured independently from the
main extraction [llm], so parsing can target a multimodal endpoint
without changing boundary or extraction behaviour). Visual/audio formats
(image / pdf / audio / office) always go through that LLM; a few
text-bearing formats can be parsed without it (e.g. a plain email with no
inline images). The parser returns text; that text takes the place of the
asset in the message buffer. Nothing downstream of the parser
knows or cares that the content originated as an image or PDF — the raw
bytes are not persisted past extraction (the episode and memory cell
store only the parsed text).
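The merge step described above can be sketched in a few lines of Python. This is an illustration of the idea only, not EverOS's implementation: `flatten_content` and `fake_parser` are hypothetical names, and the stub stands in for the real multimodal LLM call.

```python
from typing import Callable

def flatten_content(items: list[dict], parse_asset: Callable[[dict], str]) -> str:
    """Replace each non-text item with its parsed text, keeping the original order."""
    parts = []
    for item in items:
        if item["type"] == "text":
            parts.append(item["text"])        # text items are used verbatim
        else:
            parts.append(parse_asset(item))   # stand-in for the multimodal LLM call
    return "\n".join(parts)

def fake_parser(item: dict) -> str:
    # Hypothetical stub: a real parser would return text extracted by the LLM.
    return f"[parsed {item['type']}: {item['uri']}]"

merged = flatten_content(
    [
        {"type": "text", "text": "Whiteboard from planning:"},
        {"type": "image", "uri": "https://example.com/whiteboard.png"},
    ],
    fake_parser,
)
print(merged)
```

After this step the downstream stages only ever see `merged`, which is why retrieval needs no multimodal-specific handling.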
Prerequisites
Install the extra
Multimodal parsing lives behind an optional dependency group so the base install stays lean:
uv pip install 'everos[multimodal]' # or: pip install 'everos[multimodal]'
This pulls in everalgo-parser[svg] — the [svg] bundle adds cairosvg
so SVG works out of the box.
LibreOffice (office documents only)
Office formats (.doc / .docx / .ppt / .pptx / .xls / .xlsx)
are converted to PDF before being fed to the multimodal LLM. The parser
shells out to soffice, LibreOffice's headless renderer, so LibreOffice
must be present on the server host:
brew install --cask libreoffice # macOS
sudo apt-get install -y libreoffice # Debian / Ubuntu
Without LibreOffice, office uploads return 415 with a clear error;
image / PDF / audio / HTML / email parsing is unaffected.
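A quick preflight check can confirm that `soffice` is discoverable before you send office documents. The sketch below assumes the parser finds the binary through `PATH`; the function name `office_parsing_available` is hypothetical, and the check should run as the same user that runs the server.

```python
import shutil

def office_parsing_available(binary: str = "soffice") -> bool:
    """Return True if the LibreOffice binary can be found on PATH."""
    return shutil.which(binary) is not None

if not office_parsing_available():
    print("soffice not found on PATH; office uploads are expected to return 415")
```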
Configure the multimodal LLM
The parser uses its own LLM section, independent from [llm]. The model
must accept OpenAI image_url parts. everos init writes these into the
generated .env:
EVEROS_MULTIMODAL__MODEL=google/gemini-3-flash-preview
EVEROS_MULTIMODAL__API_KEY=<your key>
EVEROS_MULTIMODAL__BASE_URL=https://openrouter.ai/api/v1
The default targets Gemini via OpenRouter so a single key covers both chat extraction and multimodal parsing. See Configuration reference for the full list.
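To check whether a candidate model accepts `image_url` parts, you can send it a small request in the OpenAI Chat Completions format. The sketch below only builds the request body; `build_image_probe` is a hypothetical helper, and whether EverOS sends exactly this shape internally is an assumption.

```python
import os

def build_image_probe(image_url: str) -> dict:
    """Build an OpenAI-style chat request containing one image_url part."""
    model = os.environ.get("EVEROS_MULTIMODAL__MODEL", "google/gemini-3-flash-preview")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

probe = build_image_probe("https://example.com/whiteboard.png")
```

Posting `probe` to the `chat/completions` path under `EVEROS_MULTIMODAL__BASE_URL`, with the API key as a bearer token, should succeed on an endpoint that supports image input and fail on one that does not.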
Supported modalities
| type | Typical formats | Payload | Notes |
|---|---|---|---|
| text | - | text | Plain text; the string shorthand also maps here |
| image | PNG / JPG / GIF / WebP / SVG | uri or base64 | SVG via the bundled cairosvg |
| pdf | PDF | uri or base64 | - |
| audio | MP3 / WAV / … | uri or base64 | Endpoint must accept audio parts |
| doc | DOC / DOCX / PPT / PPTX / XLS / XLSX | uri or base64 | Requires LibreOffice (converted to PDF first) |
| html | HTML | uri or base64 | To inline HTML as plain text instead, send it as type: "text" |
| email | EML / MSG | uri or base64 | - |

A non-text item must carry a fetchable/decodable payload (uri or
base64). A non-text item that only carries text returns 415 — the
parser has nothing to parse.
Sending multimodal content
Multimodal input is a content array of ContentItem objects on a
MessageItem. A bare string content is shorthand
for a single text item; switch to the array form when you mix text with
non-text assets. Field-level rules are in
api.md → ContentItem; the essentials:
| Field | Purpose |
|---|---|
| type | One of the modalities above |
| text | The literal text (only for type: "text") |
| uri | http(s):// (fetched server-side) or file:// (read from the server fs) |
| base64 | Inline payload, plain base64 (no data: prefix) |
| ext | Extension hint ("pdf", "png", …); effectively required for base64 |
| name | Display filename for logs |

Carry the payload in exactly one of text / uri / base64.
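The field rules above are easy to check on the client before sending a request. The validator below is a minimal sketch of those rules as stated on this page; `check_content_item` is a hypothetical helper, and the server's own validation (see api.md) remains authoritative.

```python
PAYLOAD_FIELDS = ("text", "uri", "base64")

def check_content_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item looks well formed."""
    problems = []
    present = [f for f in PAYLOAD_FIELDS if item.get(f)]
    if len(present) != 1:
        problems.append(f"expected exactly one of text/uri/base64, got {present}")
    if item.get("type") == "text" and present != ["text"]:
        problems.append("a text item must carry its payload in text")
    if item.get("type") != "text" and "text" in present:
        problems.append("a non-text item must carry uri or base64, not text")
    if "base64" in present and not item.get("ext"):
        problems.append("base64 payloads should set ext")
    return problems

ok = check_content_item({"type": "image", "uri": "https://example.com/whiteboard.png"})
bad = check_content_item({"type": "pdf", "text": "Quarterly report"})
```

Here `ok` is empty, while `bad` reports the case that the server rejects with 415.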
Payload: uri vs base64
| | uri (http(s)://) | base64 |
|---|---|---|
| Where the bytes live | Fetched transiently at parse time | Held verbatim in the SQLite session buffer until flush |
| Wire size | URL only | ~4/3 × the raw size (base64 inflation) |
| Best for | Large assets, S3/OSS presigned URLs | Small assets, or when no reachable URL exists |

Prefer uri for anything large. A multi-MB base64 blob becomes multi-MB of SQLite buffer text for the buffer's lifetime and slows request parsing. The bytes are never persisted past extraction either way; only the parsed text is.
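The 4/3 inflation factor follows from base64 encoding every 3 input bytes as 4 output characters, with padding up to a multiple of 4. As a worked example (the helper name `base64_size` is ours, not part of EverOS), a 5 MiB asset becomes roughly 6.67 MiB of text on the wire and in the session buffer:

```python
import base64

def base64_size(raw_bytes: int) -> int:
    """Length in characters of padded base64 for raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)

raw = 5 * 1024 * 1024            # 5 MiB asset
encoded = base64_size(raw)       # 6_990_508 characters
print(f"{raw} raw bytes -> {encoded} base64 characters ({encoded / raw:.3f}x)")

# Cross-check the formula against the standard library encoder.
assert len(base64.b64encode(b"x" * 1000)) == base64_size(1000)
```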
Example: image by URL
TS=$(($(date +%s) * 1000)) # v1 contract: timestamp in ms
curl -X POST http://127.0.0.1:8000/api/v1/memory/add \
  -H 'Content-Type: application/json' \
  -d "{
    \"session_id\": \"mm-001\",
    \"messages\": [
      {
        \"sender_id\": \"alice\",
        \"role\": \"user\",
        \"timestamp\": $TS,
        \"content\": [
          { \"type\": \"image\", \"uri\": \"https://example.com/whiteboard.png\" }
        ]
      }
    ]
  }"
Example: mixed text + image in one turn
{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text", "text": "Here's the whiteboard from today's planning session." },
        { "type": "image", "uri": "https://example.com/whiteboard.png", "name": "whiteboard.png" }
      ]
    }
  ]
}
Example: inline PDF via base64
{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text", "text": "Quarterly report attached." },
        { "type": "pdf", "base64": "JVBERi0xLjQK...", "ext": "pdf", "name": "q3.pdf" }
      ]
    }
  ]
}
ext is effectively required for base64 payloads because it drives
modality dispatch. Without it the server falls back to MIME inference and
returns 415 if that also fails.
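Building a base64 item from a local file on the client takes only the standard library. The sketch below fills `base64`, `ext`, and `name` from the file itself; `base64_item` is a hypothetical helper, and deriving `ext` from the filename suffix is a convenience assumption rather than an EverOS requirement.

```python
import base64
from pathlib import Path

def base64_item(path: str, item_type: str) -> dict:
    """Build a ContentItem dict that carries the file inline as plain base64."""
    p = Path(path)
    return {
        "type": item_type,
        # Plain base64 with no "data:" prefix, as the field table requires.
        "base64": base64.b64encode(p.read_bytes()).decode("ascii"),
        "ext": p.suffix.lstrip(".").lower(),   # e.g. "pdf" for q3.pdf
        "name": p.name,
    }
```

For example, `base64_item("q3.pdf", "pdf")` yields an item shaped like the PDF entry in the JSON example above.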
Example: local file via file://
A file:// URI is read from the server's local filesystem (the path
must be reachable by the server process), guardrailed by size and an
optional allowlist:
{ "type": "pdf", "uri": "file:///srv/uploads/q3.pdf" }
Guardrails (a violation surfaces as 415):
- the resolved path (symlinks followed) must be an existing regular file;
- size ≤ EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES (default 50 MiB);
- if EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS is set, the path must lie within one of the listed roots (unset = any readable file, the local-first default; confine this when exposing the API beyond loopback).
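The three guardrails can be pictured as the following check. This is a sketch of the documented rules, not the server's actual code: `check_file_uri_path` and its reason strings are hypothetical, and the real server reports a violation as a 415.

```python
from pathlib import Path
from typing import Optional, Sequence

MAX_BYTES = 50 * 1024 * 1024  # documented default for FILE_URI_MAX_BYTES

def check_file_uri_path(
    path: str,
    allow_dirs: Optional[Sequence[str]] = None,
    max_bytes: int = MAX_BYTES,
) -> Optional[str]:
    """Return a reason string if a guardrail fails, or None if the path passes."""
    resolved = Path(path).resolve()              # follows symlinks
    if not resolved.is_file():
        return "not an existing regular file"
    if resolved.stat().st_size > max_bytes:
        return "file too large"
    if allow_dirs:                               # empty or unset means any readable file
        roots = [Path(d).resolve() for d in allow_dirs]
        if not any(resolved.is_relative_to(root) for root in roots):
            return "outside the allowlist"
    return None
```

Resolving the path before the allowlist comparison matters: a symlink inside an allowed directory that points elsewhere is judged by its target, not by where the link lives.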
Calling from Python (plain HTTP)
There is no EverOS Python client; call the HTTP API directly with any HTTP library:
import httpx

httpx.post(
    "http://127.0.0.1:8000/api/v1/memory/add",
    json={
        "session_id": "mm-001",
        "messages": [
            {
                "sender_id": "alice",
                "role": "user",
                "timestamp": 1748390400000,
                "content": [
                    {"type": "text", "text": "Here's the whiteboard from today's meeting."},
                    {"type": "image", "uri": "https://example.com/whiteboard.png"},
                ],
            }
        ],
    },
)
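In practice you will want the timestamp generated rather than hard-coded. The helper below is a small sketch that mirrors the millisecond convention from the curl example; `build_add_payload` is a hypothetical name, and the `now_s` parameter exists only to make the function deterministic in tests.

```python
import time
from typing import Optional, Union

def build_add_payload(
    session_id: str,
    sender_id: str,
    content: Union[str, list],
    now_s: Optional[float] = None,
) -> dict:
    """Build a /memory/add request body with a millisecond timestamp."""
    seconds = time.time() if now_s is None else now_s
    return {
        "session_id": session_id,
        "messages": [
            {
                "sender_id": sender_id,
                "role": "user",
                "timestamp": int(seconds * 1000),  # v1 contract: milliseconds
                "content": content,                # bare string or list of ContentItem dicts
            }
        ],
    }

payload = build_add_payload(
    "mm-001",
    "alice",
    [{"type": "image", "uri": "https://example.com/whiteboard.png"}],
)
```

The resulting `payload` can be passed as the `json=` argument to `httpx.post` exactly as in the example above.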
Configuration reference
All fields bind from the environment via the parent Settings
(EVEROS_MULTIMODAL__<FIELD>) or the [multimodal] TOML section.
| Env var | Default | Meaning |
|---|---|---|
| EVEROS_MULTIMODAL__MODEL | google/gemini-3-flash-preview | Parsing model; must accept image_url parts |
| EVEROS_MULTIMODAL__API_KEY | - | API key for the multimodal endpoint |
| EVEROS_MULTIMODAL__BASE_URL | https://openrouter.ai/api/v1 | OpenAI-compatible base URL |
| EVEROS_MULTIMODAL__MAX_CONCURRENCY | 4 | Cap on parallel multimodal calls within one extraction |
| EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES | 52428800 (50 MiB) | Max size of a file:// asset |
| EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS | [] (any) | JSON list of allowlisted base dirs for file:// URIs |

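The value format that most often trips people up is the allowlist, which is a JSON list rather than a colon- or comma-separated string. The sketch below reads the numeric and list settings with the documented defaults; `load_multimodal_limits` is a hypothetical helper and does not reproduce how EverOS's own Settings class parses them.

```python
import json
import os
from typing import Mapping

def load_multimodal_limits(env: Mapping[str, str] = os.environ) -> dict:
    """Read the limit-related settings, falling back to the documented defaults."""
    prefix = "EVEROS_MULTIMODAL__"
    return {
        "max_concurrency": int(env.get(prefix + "MAX_CONCURRENCY", "4")),
        "file_uri_max_bytes": int(env.get(prefix + "FILE_URI_MAX_BYTES", "52428800")),
        "file_uri_allow_dirs": json.loads(env.get(prefix + "FILE_URI_ALLOW_DIRS", "[]")),
    }

# Example value: two allowlisted roots, written as a JSON list.
example_env = {"EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS": '["/srv/uploads", "/data/docs"]'}
limits = load_multimodal_limits(example_env)
```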
Errors and limits
Two failure classes behave differently. Deterministic problems
(nothing to parse, no handler, missing system dependency) abort the
whole /add batch with 415. A transient multimodal-LLM failure
(timeout, rate-limit, the model rejecting the asset) degrades just that
item — the request still returns 200, the item is marked
parse_status="failed" and contributes no text, and the rest of the
batch extracts normally.
| Condition | Result |
|---|---|
| Non-text item carries only text (no uri / base64) | 415 (batch aborted) |
| Extension / modality the parser has no handler for | 415 (batch aborted) |
| base64 without a resolvable ext / MIME to dispatch on | 415 (batch aborted) |
| Office document but no LibreOffice (soffice) on host | 415 (batch aborted) |
| file:// fails a guardrail (missing / non-regular / too large / outside allowlist) | 415 (batch aborted) |
| Multimodal LLM call fails (timeout / rate-limit / model rejects the asset) | 200; that item is skipped (parse_status="failed"), the rest of the batch still extracts |

The 415 body uses the standard error envelope with the parse-failure
reason in error.message — see
api.md → POST /add.
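On the client side, the two failure classes call for different handling: a 415 means the request itself must be fixed, while a 200 may still hide individually skipped items. A minimal sketch of that branching follows; `describe_add_result` is a hypothetical helper, and it relies only on the `error.message` field named above. Where `parse_status` appears in a 200 response is not specified on this page, so the sketch does not try to read it.

```python
def describe_add_result(status_code: int, body: dict) -> str:
    """Summarise an /add response using only the fields documented here."""
    if status_code == 415:
        # Deterministic failure: the whole batch was rejected.
        reason = body.get("error", {}).get("message", "unknown reason")
        return f"batch rejected: {reason}"
    if status_code == 200:
        # Accepted; individual items may still have parse_status="failed".
        return "batch accepted"
    return f"unexpected status {status_code}"
```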
Searching multimodal memory
Nothing special is required. Because parsed text is folded into the same episodes and memory cells as text turns, every retrieval method works across multimodal-derived memory unchanged:
curl -X POST http://127.0.0.1:8000/api/v1/memory/search \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "alice",
    "query": "whiteboard from the planning session",
    "method": "hybrid"
  }'
keyword, vector, hybrid (default), and agentic all apply — see
api.md → SearchMethod.
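The same search request can be assembled in Python. The sketch below builds the body and rejects method names other than the four listed above; `build_search_body` is a hypothetical helper, and the client-side check is a convenience, since the server validates the method itself.

```python
SEARCH_METHODS = {"keyword", "vector", "hybrid", "agentic"}

def build_search_body(user_id: str, query: str, method: str = "hybrid") -> dict:
    """Build a /memory/search request body, validating the retrieval method."""
    if method not in SEARCH_METHODS:
        raise ValueError(f"unknown search method: {method!r}")
    return {"user_id": user_id, "query": query, "method": method}

body = build_search_body("alice", "whiteboard from the planning session")
```

Posting `body` as JSON to /api/v1/memory/search is equivalent to the curl call above.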