Multimodal Memory

June 10, 2026 · View on GitHub

EverOS turns non-text content — images, PDFs, audio, office documents, HTML, email — into the same structured, searchable memory as plain text. You attach the asset to a message at ingest time; a vision/audio capable LLM parses it into text, and from there it flows through the identical extraction → markdown → index pipeline as any text turn. The result is fully retrievable with the same /search stack.

Table of contents

How it works

POST /api/v1/memory/add
  messages[].content = [ ContentItem, ContentItem, ... ]

        │  text items      → used verbatim
        │  non-text items  → multimodal LLM (everalgo-parser)

  parsed text merged back into the session buffer (in original order)


  boundary detector → extraction LLM → MemCell


  markdown (truth)  +  SQLite (state)  +  LanceDB (vector + BM25)


  retrievable via /search and /get like any text memory

Each non-text ContentItem is routed through the parser, which calls a separate, vision/audio capable LLM (configured independently from the main extraction [llm], so parsing can target a multimodal endpoint without changing boundary or extraction behaviour). Visual/audio formats (image / pdf / audio / office) always go through that LLM; a few text-bearing formats can be parsed without it (e.g. a plain email with no inline images). The parser returns text; that text takes the place of the asset in the message buffer. Nothing downstream of the parser knows or cares that the content originated as an image or PDF — the raw bytes are not persisted past extraction (the episode and memory cell store only the parsed text).

Prerequisites

Install the extra

Multimodal parsing lives behind an optional dependency group so the base install stays lean:

uv pip install 'everos[multimodal]'    # or: pip install 'everos[multimodal]'

This pulls in everalgo-parser[svg] — the [svg] bundle adds cairosvg so SVG works out of the box.

LibreOffice (office documents only)

Office formats (.doc / .docx / .ppt / .pptx / .xls / .xlsx) are converted to PDF before being fed to the multimodal LLM. The parser shells out to soffice, LibreOffice's headless renderer, so LibreOffice must be present on the server host:

brew install --cask libreoffice          # macOS
sudo apt-get install -y libreoffice       # Debian / Ubuntu

Without LibreOffice, office uploads return 415 with a clear error; image / PDF / audio / HTML / email parsing is unaffected.

Configure the multimodal LLM

The parser uses its own LLM section, independent from [llm]. The model must accept OpenAI image_url parts. everos init writes these into the generated .env:

EVEROS_MULTIMODAL__MODEL=google/gemini-3-flash-preview
EVEROS_MULTIMODAL__API_KEY=<your key>
EVEROS_MULTIMODAL__BASE_URL=https://openrouter.ai/api/v1

The default targets Gemini via OpenRouter so a single key covers both chat extraction and multimodal parsing. See Configuration reference for the full list.

Supported modalities

typeTypical formatsPayloadNotes
texttextPlain text; the string shorthand also maps here
imagePNG / JPG / GIF / WebP / SVGuri or base64SVG via the bundled cairosvg
pdfPDFuri or base64
audioMP3 / WAV / …uri or base64Endpoint must accept audio parts
docDOC / DOCX / PPT / PPTX / XLS / XLSXuri or base64Requires LibreOffice (converted to PDF first)
htmlHTMLuri or base64To inline HTML as plain text instead, send it as type: "text"
emailEML / MSGuri or base64

A non-text item must carry a fetchable/decodable payload (uri or base64). A non-text item that only carries text returns 415 — the parser has nothing to parse.

Sending multimodal content

Multimodal input is a content array of ContentItem objects on a MessageItem. A bare string content is shorthand for a single text item; switch to the array form when you mix text with non-text assets. Field-level rules are in api.md → ContentItem; the essentials:

FieldPurpose
typeOne of the modalities above
textThe literal text — only for type: "text"
urihttp(s):// (fetched server-side) or file:// (read from the server fs)
base64Inline payload, plain base64 (no data: prefix)
extExtension hint ("pdf", "png", …); effectively required for base64
nameDisplay filename for logs

Carry the payload in exactly one of text / uri / base64.

Payload: uri vs base64

uri (http(s)://)`base64$
\text{Where} \text{the} \text{bytes} \text{live}\text{Fetched} \text{transiently} \text{at} \text{parse} \text{time}\text{Held} \text{verbatim} \text{in} \text{the} \text{SQLite} \text{session} \text{buffer} \text{until} \text{flush}
\text{Wire} \text{size}\text{URL} \text{only}~4/3 \times \text{the} \text{raw} \text{size} (\text{base64} \text{inflation})
\text{Best} \text{for}\text{Large} \text{assets}, \text{S3}/\text{OSS} \text{presigned} \text{URLs}\text{Small} \text{assets}, \text{or} \text{when} \text{no} \text{reachable} \text{URL} \text{exists}

\text{Prefer} $uri` for anything large. A multi-MB base64 blob becomes multi-MB of SQLite buffer text for the buffer's lifetime and slows request parsing. The bytes are never persisted past extraction either way — only the parsed text is.

Example: image by URL

TS=$(($(date +%s) * 1000))     # v1 contract: timestamp in ms
curl -X POST http://127.0.0.1:8000/api/v1/memory/add \
  -H 'Content-Type: application/json' \
  -d "{
    \"session_id\": \"mm-001\",
    \"messages\": [
      {
        \"sender_id\": \"alice\",
        \"role\": \"user\",
        \"timestamp\": $TS,
        \"content\": [
          { \"type\": \"image\", \"uri\": \"https://example.com/whiteboard.png\" }
        ]
      }
    ]
  }"

Example: mixed text + image in one turn

{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text",  "text": "Here's the whiteboard from today's planning session." },
        { "type": "image", "uri": "https://example.com/whiteboard.png", "name": "whiteboard.png" }
      ]
    }
  ]
}

Example: inline PDF via base64

{
  "session_id": "mm-001",
  "messages": [
    {
      "sender_id": "alice",
      "role": "user",
      "timestamp": 1748390400000,
      "content": [
        { "type": "text", "text": "Quarterly report attached." },
        { "type": "pdf",  "base64": "JVBERi0xLjQK...", "ext": "pdf", "name": "q3.pdf" }
      ]
    }
  ]
}

ext is effectively required for base64 payloads — it drives modality dispatch. Without it the server falls back to MIME inference and otherwise 415s.

Example: local file via file://

A file:// URI is read from the server's local filesystem (the path must be reachable by the server process), guardrailed by size and an optional allowlist:

{ "type": "pdf", "uri": "file:///srv/uploads/q3.pdf" }

Guardrails (a violation surfaces as 415):

  • the resolved path (symlinks followed) must be an existing regular file;
  • size ≤ EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES (default 50 MiB);
  • if EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS is set, the path must lie within one of the listed roots (unset = any readable file, the local-first default — confine this when exposing the API beyond loopback).

Calling from Python (plain HTTP)

There is no EverOS Python client; call the HTTP API directly with any HTTP library:

import httpx

httpx.post(
    "http://127.0.0.1:8000/api/v1/memory/add",
    json={
        "session_id": "mm-001",
        "messages": [
            {
                "sender_id": "alice",
                "role": "user",
                "timestamp": 1748390400000,
                "content": [
                    {"type": "text", "text": "Here's the whiteboard from today's meeting."},
                    {"type": "image", "uri": "https://example.com/whiteboard.png"},
                ],
            }
        ],
    },
)

Configuration reference

All fields bind from the environment via the parent Settings (EVEROS_MULTIMODAL__<FIELD>) or the [multimodal] TOML section.

Env varDefaultMeaning
EVEROS_MULTIMODAL__MODELgoogle/gemini-3-flash-previewParsing model; must accept image_url parts
EVEROS_MULTIMODAL__API_KEYAPI key for the multimodal endpoint
EVEROS_MULTIMODAL__BASE_URLhttps://openrouter.ai/api/v1OpenAI-compatible base URL
EVEROS_MULTIMODAL__MAX_CONCURRENCY4Cap on parallel multimodal calls within one extraction
EVEROS_MULTIMODAL__FILE_URI_MAX_BYTES52428800 (50 MiB)Max size of a file:// asset
EVEROS_MULTIMODAL__FILE_URI_ALLOW_DIRS[] (any)JSON list of allowlisted base dirs for file:// URIs

Errors and limits

Two failure classes behave differently. Deterministic problems (nothing to parse, no handler, missing system dependency) abort the whole /add batch with 415. A transient multimodal-LLM failure (timeout, rate-limit, the model rejecting the asset) degrades just that item — the request still returns 200, the item is marked parse_status="failed" and contributes no text, and the rest of the batch extracts normally.

ConditionResult
Non-text item carries only text (no uri / base64)415 (batch aborted)
Extension / modality the parser has no handler for415 (batch aborted)
base64 without a resolvable ext / MIME to dispatch on415 (batch aborted)
Office document but no LibreOffice (soffice) on host415 (batch aborted)
file:// fails a guardrail (missing / non-regular / too large / outside allowlist)415 (batch aborted)
Multimodal LLM call fails (timeout / rate-limit / model rejects the asset)200 — that item is skipped (parse_status="failed"), the rest of the batch still extracts

The 415 body uses the standard error envelope with the parse-failure reason in error.message — see api.md → POST /add.

Searching multimodal memory

Nothing special is required. Because parsed text is folded into the same episodes and memory cells as text turns, every retrieval method works across multimodal-derived memory unchanged:

curl -X POST http://127.0.0.1:8000/api/v1/memory/search \
  -H 'Content-Type: application/json' \
  -d '{
    "user_id": "alice",
    "query": "whiteboard from the planning session",
    "method": "hybrid"
  }'

keyword, vector, hybrid (default), and agentic all apply — see api.md → SearchMethod.