Create embeddings
May 8, 2025

High-throughput, OpenAI-compatible text embedding & reranker powered by Infinity
Quickstart
- **Pull an image** – use the tag shown on the latest GitHub release page (e.g. `runpod/worker-infinity-embedding:<version>`)
- **Configure** – set at least `MODEL_NAMES` (see Endpoint Configuration)
- **Deploy** – create a RunPod Serverless endpoint
- **Call the API** – follow the example in the Usage section
Endpoint Configuration
All behaviour is controlled through environment variables:
| Variable | Required | Default | Description |
|---|---|---|---|
| `MODEL_NAMES` | Yes | – | One or more Hugging Face model IDs. Separate multiple IDs with a semicolon. Example: `BAAI/bge-small-en-v1.5` |
| `BATCH_SIZES` | No | `32` | Per-model batch size; semicolon-separated list matching `MODEL_NAMES`. |
| `BACKEND` | No | `torch` | Inference engine for all models: `torch`, `optimum`, or `ctranslate2`. |
| `DTYPES` | No | `auto` | Precision per model (`auto`, `fp16`, `fp8`). Semicolon-separated, must match `MODEL_NAMES`. |
| `INFINITY_QUEUE_SIZE` | No | `48000` | Max items queueable inside the Infinity engine. |
| `RUNPOD_MAX_CONCURRENCY` | No | `300` | Max concurrent requests the RunPod wrapper will accept. |
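For example, deploying two models with per-model batch sizes and precisions could use an environment block like this (illustrative values only; the positions in each semicolon-separated list line up with `MODEL_NAMES`):

```shell
# Two models, configured position-by-position via semicolon-separated lists.
MODEL_NAMES=BAAI/bge-small-en-v1.5;intfloat/e5-large-v2
BATCH_SIZES=32;16
DTYPES=auto;fp16
BACKEND=torch
```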
API Specification
Two flavours, one schema.
- **OpenAI-compatible** – a drop-in replacement for `/v1/models` and `/v1/embeddings`. Point any OpenAI client at this endpoint by swapping the base URL for `https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1` and using your RunPod API key in place of the OpenAI one.
- **Standard RunPod** – call `/run` or `/runsync` with a JSON body under the `input` key.
Base URL: `https://api.runpod.ai/v2/<ENDPOINT_ID>`
Except for transport (path + wrapper object) the JSON you send/receive is identical. The tables below describe the shared payload.
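The relationship between the two flavours can be sketched in a few lines of Python – the Standard flavour simply nests the same payload under an `"input"` key:

```python
import json

# The shared payload, as described in the tables below.
payload = {"model": "BAAI/bge-small-en-v1.5", "input": "Hello world"}

openai_body = payload               # body for POST /openai/v1/embeddings
standard_body = {"input": payload}  # body for POST /runsync

# Unwrapping the Standard body recovers the OpenAI body exactly.
assert standard_body["input"] == openai_body
print(json.dumps(standard_body))
```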
List Models
| Method | Path | Body |
|---|---|---|
| GET | `/openai/v1/models` | – |
| POST | `/runsync` | `{ "input": { "openai_route": "/v1/models" } }` |
Response
```json
{
  "data": [
    { "id": "BAAI/bge-small-en-v1.5", "stats": {} },
    { "id": "intfloat/e5-large-v2", "stats": {} }
  ]
}
```
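For programmatic use, the deployed model IDs can be read straight out of the `data` list; a minimal Python sketch, assuming a parsed response shaped like the JSON above:

```python
# A /v1/models-style response, already parsed from JSON.
response = {
    "data": [
        {"id": "BAAI/bge-small-en-v1.5", "stats": {}},
        {"id": "intfloat/e5-large-v2", "stats": {}},
    ]
}

# Collect the model IDs available on the endpoint.
model_ids = [entry["id"] for entry in response["data"]]
print(model_ids)  # ['BAAI/bge-small-en-v1.5', 'intfloat/e5-large-v2']
```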
Create Embeddings
Request Fields (shared)
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | One of the IDs supplied via `MODEL_NAMES`. |
| `input` | string or array | Yes | A single text string or a list of texts to embed. |
OpenAI route vs. Standard:
| Flavour | Method | Path | Body |
|---|---|---|---|
| OpenAI | POST | `/v1/embeddings` | `{ "model": "…", "input": "…" }` |
| Standard | POST | `/runsync` | `{ "input": { "model": "…", "input": "…" } }` |
Response (both flavours)
```json
{
  "object": "list",
  "model": "BAAI/bge-small-en-v1.5",
  "data": [
    { "object": "embedding", "embedding": [0.01, -0.02 /* … */], "index": 0 }
  ],
  "usage": { "prompt_tokens": 2, "total_tokens": 2 }
}
```
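When `input` is a list, one embedding object comes back per text, and the `index` field ties each vector to its input position. A minimal Python sketch of reassembling the vectors in input order, using a response shaped like the one above (illustrative values):

```python
# A parsed embeddings response; "data" entries may arrive in any order.
response = {
    "object": "list",
    "model": "BAAI/bge-small-en-v1.5",
    "data": [
        {"object": "embedding", "embedding": [0.03, -0.04], "index": 1},
        {"object": "embedding", "embedding": [0.01, -0.02], "index": 0},
    ],
    "usage": {"prompt_tokens": 4, "total_tokens": 4},
}

# Sort by "index" so vectors[i] corresponds to the i-th input text.
vectors = [item["embedding"]
           for item in sorted(response["data"], key=lambda d: d["index"])]
print(vectors)  # [[0.01, -0.02], [0.03, -0.04]]
```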
Rerank Documents (Standard only)
| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Any deployed reranker model. |
| `query` | string | Yes | The search/query text. |
| `docs` | array | Yes | List of documents to rerank. |
| `return_docs` | bool | No | If `true`, return the documents in ranked order (default `false`). |
Call pattern
```http
POST /runsync
Content-Type: application/json

{
  "input": {
    "model": "BAAI/bge-reranker-large",
    "query": "Which product has warranty coverage?",
    "docs": [
      "Product A comes with a 2-year warranty",
      "Product B is available in red and blue colors",
      "All electronics include a standard 1-year warranty"
    ],
    "return_docs": true
  }
}
```
Response contains either scores or the full docs list, depending on return_docs.
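If you receive scores rather than pre-ranked documents, you can pair them with the docs you submitted and sort client-side. A Python sketch under the assumption that the scores come back aligned with the submitted `docs` list (the exact response field layout may differ; the score values here are purely illustrative):

```python
# The docs submitted in the request, in order.
docs = [
    "Product A comes with a 2-year warranty",
    "Product B is available in red and blue colors",
    "All electronics include a standard 1-year warranty",
]
# Hypothetical relevance scores, assumed aligned with `docs`.
scores = [0.92, 0.05, 0.88]

# Pair each doc with its score and sort best-first.
ranked = sorted(zip(scores, docs), reverse=True)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```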
Usage
Below are minimal curl snippets so you can copy-paste from any machine.
Replace `<ENDPOINT_ID>` with your endpoint ID and `<API_KEY>` with a RunPod API key.
OpenAI-Compatible Calls
```bash
# List models
curl -H "Authorization: Bearer <API_KEY>" \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/models

# Create embeddings
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"BAAI/bge-small-en-v1.5","input":"Hello world"}' \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/embeddings
```
Standard RunPod Calls
```bash
# Create embeddings (wait for result)
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"input":{"model":"BAAI/bge-small-en-v1.5","input":"Hello world"}}' \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync

# Rerank
curl -X POST \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"input":{"model":"BAAI/bge-reranker-large","query":"Which product has warranty coverage?","docs":["Product A comes with a 2-year warranty","Product B is available in red and blue colors","All electronics include a standard 1-year warranty"],"return_docs":true}}' \
  https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync
```
Further Documentation
- Infinity Engine – how the ultra-fast backend works.
- RunPod Docs – serverless concepts, limits, and API reference.
Acknowledgements
Special thanks to Michael Feil for creating the Infinity engine and for his ongoing support of this project.