Whisper Speech-to-Text Auto Setup Script

June 28, 2026

Whisper speech-to-text server installer for Ubuntu, Debian, AlmaLinux, Rocky Linux, CentOS, RHEL and Fedora.

This script installs and configures a self-hosted Whisper speech-to-text API server powered by faster-whisper, providing OpenAI-compatible /v1/audio/transcriptions and /v1/audio/translations endpoints. Transcribe and translate audio files using any app that supports the OpenAI audio API.

Features:

  • Fully automated Whisper server setup, no user input needed
  • Supports interactive install using custom options
  • Supports pre-downloading models and managing the server
  • OpenAI-compatible POST /v1/audio/transcriptions and POST /v1/audio/translations endpoints — switch any app with a one-line change
  • Streaming transcription — receive segments via SSE as they are decoded, without waiting for the whole file to be processed
  • Word-level timestamps — per-word start/end times and confidence scores in verbose_json output
  • Multiple output formats: json, text, verbose_json, srt, vtt
  • Offline/air-gapped mode — run without internet access using pre-cached models (WHISPER_LOCAL_ONLY)
  • Audio stays on your server — no data sent to third parties
  • Installs Whisper as a systemd service with a dedicated system user
  • Models downloaded from HuggingFace and cached in /var/lib/whisper

Community

  • 📬 Subscribe for project updates (1–2 emails/month) — get free AI and VPN deployment guides (PDF)
  • 💬 Join the r/selfhostedstack community for discussions and showcases
  • ⭐ Star the repository if you find it useful — it helps others discover it

Other self-hosted projects: Setup IPsec VPN, IPsec VPN on Docker, WireGuard, OpenVPN, Headscale.

Requirements

  • A Linux server (cloud server, VPS, dedicated server or home server)
  • Python 3.9 or higher (the script installs it automatically on supported distros)
  • At least 700 MB RAM for the default base model (see model table)
  • Internet access for the initial model download (the model is cached locally afterwards). Not required if using WHISPER_LOCAL_ONLY with pre-cached models.

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. When using a reverse proxy, set WHISPER_LISTEN_ADDR=127.0.0.1 in /etc/whisper/whisper.conf to prevent direct access to the unencrypted port.

Installation

Download the script on your Linux server:

wget -O whisper.sh https://github.com/hwdsl2/whisper-install/raw/main/whisper-install.sh

Option 1: Auto install with default options.

sudo bash whisper.sh --auto

This installs the base model (~145 MB) on port 9000. The model is downloaded from HuggingFace on first start.

Option 2: Auto install with custom options.

sudo bash whisper.sh --auto --model small --port 9000

Option 3: Interactive install using custom options.

sudo bash whisper.sh

If wget is unavailable, use curl to download the script instead:

curl -fL -o whisper.sh https://github.com/hwdsl2/whisper-install/raw/main/whisper-install.sh

If neither command works, open whisper-install.sh, then click the Raw button on the right. Press Ctrl/Cmd+A to select all, Ctrl/Cmd+C to copy, then paste into your favorite editor and save the file as whisper.sh.

Usage information for the script:

Usage: bash whisper.sh [options]

Options:

  --showinfo                           show server info (model, endpoint, API docs)
  --showkey                            show the API key, if configured
  --getkey                             output the API key (machine-readable, no decoration)
  --listmodels                         list available Whisper model names and sizes
  --downloadmodel <model>              pre-download a model to the cache directory
  --uninstall                          remove Whisper and delete all configuration
  -y, --yes                            assume "yes" as answer to prompts
  -h, --help                           show this help message and exit

Install options (optional):

  --auto                               auto install using default or custom options
  --model      <name>                  Whisper model to use (default: base)
  --port       <number>                TCP port for the API server (default: 9000)
  --listenaddr [address]               listen address (default: 0.0.0.0, use 127.0.0.1 for local only)

Available models: tiny, tiny.en, base, base.en, small, small.en,
                  medium, medium.en, large-v1, large-v2, large-v3,
                  large-v3-turbo (or: turbo)

After installation

On first run, the script:

  1. Installs system packages: python3, python3-venv, curl
  2. Creates a whisper system user and group
  3. Creates a Python virtual environment at /opt/whisper/venv
  4. Installs faster-whisper, fastapi, uvicorn, and python-multipart
  5. Generates an API key for fresh installs
  6. Writes the configuration to /etc/whisper/whisper.conf
  7. Installs and starts the whisper systemd service

The first start will download the selected model from HuggingFace. This can take several minutes depending on the model size and network speed. The model is cached in /var/lib/whisper and reused on subsequent starts.

Check service status and logs:

sudo systemctl status whisper
sudo journalctl -u whisper -n 50

Once you see "Whisper speech-to-text server is ready", transcribe your first audio file:

API_KEY=$(sudo bash whisper.sh --getkey)

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@audio.mp3 -F model=whisper-1

Response:

{"text": "Your transcribed text appears here."}
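For scripted deployments, a small polling loop can stand in for watching the journal by hand. The following is a minimal sketch, not part of the install script. It assumes that GET /v1/models answers with HTTP 200 only once the server is ready to accept requests, which this document does not state explicitly. The function names models_endpoint_ok and wait_until_ready are hypothetical.

```python
import http.client
import time
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str, api_key: str = "", timeout: float = 5.0) -> bool:
    """Return True if GET /v1/models answers with HTTP 200 (assumed readiness signal)."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/v1/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or an HTTP error status: treat as not ready.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     delay: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example call (placeholder address and key):
# wait_until_ready(lambda: models_endpoint_ok("http://127.0.0.1:9000", "YOUR_API_KEY"))
```

The probe is passed in as a callable so the retry logic can be exercised without a running server. With the defaults shown, the loop gives up after roughly five minutes, which may be too short for the larger models on a slow connection.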

Tip: Need a sample audio file to test? Download this English speech sample (WAV, MIT License) from the Azure Samples repository:

curl -L -o sample_speech.wav \
    "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/sampledata/audiofiles/katiesteve.wav"

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@sample_speech.wav \
  -F model=whisper-1

API reference

The API is compatible with OpenAI's audio transcription and audio translation endpoints. Any application already calling https://api.openai.com/v1/audio/transcriptions can switch to the self-hosted server by setting:

OPENAI_BASE_URL=http://<server-ip>:9000

OpenAI-only transcription options such as gpt-4o-transcribe-diarize, response_format=diarized_json, include=logprobs, chunking_strategy, known_speaker_names, and known_speaker_references are not supported and return HTTP 400.

Transcribe audio

POST /v1/audio/transcriptions
Content-Type: multipart/form-data

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | file | | Audio file. Supported formats: mp3, mp4, m4a, wav, webm, ogg, flac, and all other formats supported by ffmpeg. |
| model | string | | Pass whisper-1 (value is accepted but the active model is always used). |
| language | string | | BCP-47 language code (e.g. en, fr, zh). Overrides WHISPER_LANGUAGE for this request. |
| prompt | string | | Optional text to guide the model's style or continue a previous segment. |
| response_format | string | | Output format. Default: json. See response formats. Ignored when stream=true. OpenAI-only diarized_json is not supported. |
| temperature | float | | Sampling temperature (0–1). Default: 0. |
| stream | boolean | | Enable SSE streaming. When true, segments are returned as text/event-stream events as they are decoded. Default: false. |
| timestamp_granularities[] | array | | Timestamp granularities to populate. Values: word, segment. When word is included, verbose_json output includes a top-level words array with per-word timing and confidence. Default: ["segment"]. |

Example:

API_KEY=$(sudo bash whisper.sh --getkey)

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@meeting.m4a \
  -F model=whisper-1 \
  -F language=en

If API key authentication is disabled, omit the Authorization header.
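The same request can be sent from Python without third-party packages. The sketch below builds the multipart/form-data body by hand and posts it with urllib. It is an illustration only: the helper names encode_multipart and transcribe are hypothetical, and transcribe assumes the default json response format.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields: dict, file_name: str, file_bytes: bytes,
                     boundary: str = "") -> tuple:
    """Build a multipart/form-data body; returns (content_type, body_bytes)."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    # Plain form fields such as model, language, or response_format.
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # The audio itself goes in the "file" field, as in the curl examples.
    mime = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_name}"\r\nContent-Type: {mime}\r\n\r\n'
    ).encode()
    parts.append(header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


def transcribe(base_url: str, audio_path: str, api_key: str = "", **fields) -> str:
    """POST a file to /v1/audio/transcriptions and return the "text" value."""
    path = Path(audio_path)
    form = {"model": "whisper-1", **fields}
    content_type, body = encode_multipart(form, path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/transcriptions", data=body, method="POST"
    )
    request.add_header("Content-Type", content_type)
    if api_key:  # omit the header when authentication is disabled
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]


# Example call (placeholder address and key):
# print(transcribe("http://127.0.0.1:9000", "meeting.m4a", "YOUR_API_KEY", language="en"))
```

The whole file is read into memory before sending, which is fine for short clips but may not suit uploads near the configured size limit.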

Response formats

| response_format | Description |
|---|---|
| json | {"text": "..."} (default; matches OpenAI's basic response) |
| text | Plain text, no JSON wrapper |
| verbose_json | Full JSON with language, duration, per-segment timestamps, log-probabilities |
| srt | SubRip subtitle format (.srt) |
| vtt | WebVTT subtitle format (.vtt) |

Example — stream segments as they are decoded:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@long-audio.mp3 \
  -F model=whisper-1 \
  -F stream=true

SSE response (uses the OpenAI streaming transcription protocol):

data: {"type":"transcript.text.delta","delta":"Hello, how are you?"}

data: {"type":"transcript.text.delta","delta":" I'm doing well, thank you."}

data: {"type":"transcript.text.done","text":"Hello, how are you? I'm doing well, thank you."}

data: [DONE]

The first delta typically arrives within 1–3 seconds of upload. Each transcript.text.delta event contains the incremental text for the segment just decoded. The final transcript.text.done event contains the full assembled transcript, equivalent to the standard json response.
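The event sequence described above can be consumed with a few lines of Python. This is a minimal sketch that assumes each event arrives as a single data: line, as in the sample response; the function names iter_sse_events and collect_transcript are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from data: lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank frame separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


def collect_transcript(lines: Iterable[str]) -> str:
    """Return the final transcript, falling back to the joined deltas."""
    deltas = []
    for event in iter_sse_events(lines):
        if event.get("type") == "transcript.text.delta":
            deltas.append(event["delta"])  # incremental text for one segment
        elif event.get("type") == "transcript.text.done":
            return event["text"]  # full assembled transcript
    # Stream ended without a done event: use what was received so far.
    return "".join(deltas)
```

The lines argument can be any iterable of text lines. With urllib, for example, a response opened with stream=true can be passed as (raw.decode("utf-8") for raw in response).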

Example — stream from a browser using fetch:

const form = new FormData();
form.append("file", audioBlob, "audio.webm");
form.append("model", "whisper-1");
form.append("stream", "true");

const res = await fetch("http://<server-ip>:9000/v1/audio/transcriptions", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}` },
  body: form,
});

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // SSE frames are separated by "\n\n"; split and process complete frames
  const frames = buffer.split("\n\n");
  buffer = frames.pop(); // keep any incomplete trailing frame
  for (const frame of frames) {
    if (!frame.startsWith("data: ")) continue;
    const payload = frame.slice(6);
    if (payload.startsWith("[DONE]")) break;
    const event = JSON.parse(payload);
    if (event.type === "transcript.text.delta") console.log(event.delta);
    if (event.type === "transcript.text.done") console.log("Full text:", event.text);
  }
}

Example — get SRT subtitles:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@video.mp4 \
  -F model=whisper-1 \
  -F response_format=srt

Example — verbose JSON with timestamps:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1 \
  -F response_format=verbose_json

Example — word-level timestamps:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word"

When timestamp_granularities[] includes word, the verbose_json response includes a top-level words array:

{
  "text": "Hello world.",
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.42, "probability": 0.98},
    {"word": "world.", "start": 0.42, "end": 0.88, "probability": 0.97}
  ],
  "segments": [...]
}
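One use for the words array is building subtitles with a custom line length. The sketch below groups words into fixed-size cues and formats them as SRT. It relies only on the word, start, and end fields shown above; the helper names and the max_words default are illustrative choices, not server behavior.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list, max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT text."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        # Strip each word in case the engine includes surrounding whitespace.
        text = " ".join(w["word"].strip() for w in chunk)
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

For plain subtitles, response_format=srt already returns a ready-made file; a helper like this is only useful when the cue length or timing needs to differ from the server's segments.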

Translate audio

POST /v1/audio/translations
Content-Type: multipart/form-data

Translates audio in any supported language into English text; the output is always English, regardless of the source language. The endpoint is compatible with OpenAI's audio translation endpoint and accepts the common translation parameters.

Note: Translation is not supported with English-only (.en) models. Use a multilingual model such as base, small, or large-v3-turbo.

Example:

curl http://<server-ip>:9000/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@french-audio.mp3 \
  -F model=whisper-1

List models

GET /v1/models

Returns the active model in OpenAI-compatible format.

curl http://<server-ip>:9000/v1/models \
  -H "Authorization: Bearer $API_KEY"

Interactive API docs

An interactive Swagger UI is available at:

http://<server-ip>:9000/docs

Available models

| Name | Disk | RAM (approx.) | Notes |
|---|---|---|---|
| tiny | ~75 MB | ~250 MB | Fastest; lower accuracy |
| tiny.en | ~75 MB | ~250 MB | English-only variant |
| base | ~145 MB | ~700 MB | Good balance; default |
| base.en | ~145 MB | ~700 MB | English-only variant |
| small | ~465 MB | ~1.5 GB | Better accuracy |
| small.en | ~465 MB | ~1.5 GB | English-only variant |
| medium | ~1.5 GB | ~5 GB | High accuracy |
| medium.en | ~1.5 GB | ~5 GB | English-only variant |
| large-v1 | ~3 GB | ~10 GB | Older large model |
| large-v2 | ~3 GB | ~10 GB | Very high accuracy |
| large-v3 | ~3 GB | ~10 GB | Best accuracy |
| large-v3-turbo | ~1.6 GB | ~6 GB | Fast + high accuracy ⭐ |
| turbo | ~1.6 GB | ~6 GB | Alias for large-v3-turbo |

Tip: large-v3-turbo offers accuracy close to large-v3 at roughly half the resource cost. It is the recommended upgrade path from base for most deployments.

Notes:

  • English-only (.en) variants are slightly faster for English audio.
  • INT8 quantization (default) reduces RAM usage by approximately 50%.
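When provisioning servers of different sizes, the approximate RAM figures from the model table can drive the model choice. The sketch below is one possible heuristic, not something the script does. The preference order and the 0.5 GB headroom are assumptions made for illustration, and pick_model is a hypothetical name.

```python
from typing import Optional

# Approximate RAM needs in GB, copied from the model table.
MODEL_RAM_GB = {
    "tiny": 0.25, "tiny.en": 0.25,
    "base": 0.7, "base.en": 0.7,
    "small": 1.5, "small.en": 1.5,
    "medium": 5.0, "medium.en": 5.0,
    "large-v1": 10.0, "large-v2": 10.0, "large-v3": 10.0,
    "large-v3-turbo": 6.0,
}

# Assumed preference order, most accurate first (an illustrative choice).
PREFERENCE = ["large-v3", "large-v3-turbo", "medium", "small", "base", "tiny"]


def pick_model(available_ram_gb: float, english_only: bool = False,
               headroom_gb: float = 0.5) -> Optional[str]:
    """Return the most preferred model whose approximate RAM need fits the budget."""
    budget = available_ram_gb - headroom_gb
    for name in PREFERENCE:
        # Use the .en variant when requested and when one exists for this size.
        english_name = f"{name}.en"
        candidate = english_name if english_only and english_name in MODEL_RAM_GB else name
        if MODEL_RAM_GB[candidate] <= budget:
            return candidate
    return None  # not even tiny fits
```

The result can be passed to --model at install time or written to WHISPER_MODEL. Note that .en variants cannot be used with the translation endpoint.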

Managing Whisper

After setup, run the script again to manage your server.

Show server info:

sudo bash whisper.sh --showinfo

Show API key:

sudo bash whisper.sh --showkey

For scripts, output only the raw key:

sudo bash whisper.sh --getkey

List available models:

sudo bash whisper.sh --listmodels

Pre-download a model:

sudo bash whisper.sh --downloadmodel large-v3-turbo

Pre-downloading a model avoids a delay when switching models. After downloading, update WHISPER_MODEL in the configuration file and restart the service.

Remove Whisper:

sudo bash whisper.sh --uninstall

Model files in /var/lib/whisper are preserved. To also remove them:

sudo rm -rf /var/lib/whisper

Show help:

sudo bash whisper.sh --help

You may also run the script without arguments for an interactive management menu.

Configuration

The configuration file is at /etc/whisper/whisper.conf. Edit this file to change settings, then restart the service:

sudo systemctl restart whisper

All variables are optional. If not set, defaults are used automatically.

| Variable | Description | Default |
|---|---|---|
| WHISPER_MODEL | Whisper model to use. See model table for options. | base |
| WHISPER_PORT | TCP port for the API server (1–65535). | 9000 |
| WHISPER_LISTEN_ADDR | Listen address for the API server. Use 0.0.0.0 to listen on all interfaces, or 127.0.0.1 for local access only. | 0.0.0.0 |
| WHISPER_LANGUAGE | Default transcription language. BCP-47 code (e.g. en, fr, zh) or auto to autodetect. | auto |
| WHISPER_DEVICE | Compute device. | cpu |
| WHISPER_COMPUTE_TYPE | Quantization type. int8 is recommended for CPU. | int8 |
| WHISPER_THREADS | CPU threads for inference. Set to the number of physical cores for best latency. | 2 |
| WHISPER_BEAM | Beam size for decoding. Higher values may improve accuracy at the cost of speed. Use 1 for fastest (greedy) decoding. | 5 |
| WHISPER_MAX_UPLOAD_MB | Maximum uploaded audio file size in MB. Requests above this limit return HTTP 413. Set to 0 to disable the limit. | 1024 |
| WHISPER_API_KEY | Optional Bearer token. Fresh installs auto-generate one. If set, all API requests must include Authorization: Bearer <key>. Set explicitly empty to disable authentication. | Auto-generated for fresh installs |
| WHISPER_LOG_LEVEL | Log level: DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |
| WHISPER_LOCAL_ONLY | When set to any non-empty value, disables all HuggingFace model downloads. For offline or air-gapped deployments with pre-cached models. | (not set) |
| WHISPER_WORD_TIMESTAMPS | When set to true, enables word-level timestamps globally for all requests. The verbose_json output will include a top-level words array with per-word timing and confidence. Can also be enabled per-request via timestamp_granularities[]=word. | (not set) |
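Before restarting the service, it can be useful to check an edited configuration for obvious mistakes. The sketch below assumes the file uses plain KEY=VALUE lines with # comments, which matches the WHISPER_MODEL=small example in this document but is otherwise an assumption about the format. The defaults are copied from the table; parse_conf and validate are hypothetical helpers, not part of the script.

```python
# Defaults copied from the configuration table (variables without a fixed default are omitted).
DEFAULTS = {
    "WHISPER_MODEL": "base",
    "WHISPER_PORT": "9000",
    "WHISPER_LISTEN_ADDR": "0.0.0.0",
    "WHISPER_LANGUAGE": "auto",
    "WHISPER_DEVICE": "cpu",
    "WHISPER_COMPUTE_TYPE": "int8",
    "WHISPER_THREADS": "2",
    "WHISPER_BEAM": "5",
    "WHISPER_MAX_UPLOAD_MB": "1024",
    "WHISPER_LOG_LEVEL": "INFO",
}

LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def parse_conf(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments, on top of the defaults."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")  # drop optional quotes
    return settings


def validate(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    port = settings["WHISPER_PORT"]
    if not (port.isdigit() and 1 <= int(port) <= 65535):
        problems.append(f"WHISPER_PORT must be 1-65535, got {port!r}")
    level = settings["WHISPER_LOG_LEVEL"]
    if level not in LOG_LEVELS:
        problems.append(f"WHISPER_LOG_LEVEL is not a known level: {level!r}")
    return problems
```

A typical call would be validate(parse_conf(open("/etc/whisper/whisper.conf").read())), run with enough privileges to read the file.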

Switching models

  1. Pre-download the new model (optional but recommended):
    sudo bash whisper.sh --downloadmodel small
    
  2. Edit the configuration file:
    sudo nano /etc/whisper/whisper.conf
    # Set: WHISPER_MODEL=small
    
  3. Restart the service:
    sudo systemctl restart whisper
    

Securing your server

If your Whisper server is reachable from the public internet — even briefly — apply at minimum these protections. Whisper is CPU/GPU-intensive, so an unauthenticated endpoint can be abused to burn your compute resources.

1. Use an API key. Fresh installs auto-generate an API key. Display it with sudo bash whisper.sh --showkey, or use sudo bash whisper.sh --getkey in scripts. Existing configuration files are not modified; if an existing install has no key, set WHISPER_API_KEY in /etc/whisper/whisper.conf to enable authentication manually. All authenticated requests must include Authorization: Bearer <key>.

# Generate a 32-byte random key
openssl rand -hex 32

2. Bind to localhost when fronted by a reverse proxy. Set WHISPER_LISTEN_ADDR=127.0.0.1 in /etc/whisper/whisper.conf so the unencrypted port is not reachable directly from outside the host. Restart with sudo systemctl restart whisper.

3. Limit upload size. The server rejects uploads above WHISPER_MAX_UPLOAD_MB (default 1024). For internet-facing deployments, also configure your reverse proxy to reject oversized uploads (e.g. nginx client_max_body_size 100M;) before they reach the app.

4. Mind the log level. WHISPER_LOG_LEVEL=DEBUG may write transcript text to logs. Keep it at INFO or higher on shared systems.

5. Enable CORS at the proxy if calling from a browser. The server does not set Access-Control-Allow-Origin headers by default; add them at your reverse proxy if you intend to call the API directly from a web page on a different origin.

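6. Consider rate limiting. Place a rate limit in front of the server to cap how many requests each client IP can make (e.g. nginx limit_req_zone with limit_req, or a rate-limiting plugin for Caddy). To cap concurrent connections per client in nginx, use limit_conn_zone with limit_conn.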
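For item 1, a key can also be generated from Python, and the Bearer header format can be checked in test doubles or custom gateways. The sketch below is illustrative only and does not describe how the Whisper server validates keys. The function names are hypothetical, and the constant-time comparison is a general precaution rather than a documented behavior.

```python
import hmac
import secrets


def generate_api_key() -> str:
    """Return 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def bearer_matches(authorization_header: str, expected_key: str) -> bool:
    """Check an Authorization header value of the form 'Bearer <key>'."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())
```

A key generated this way can be set as WHISPER_API_KEY in /etc/whisper/whisper.conf, followed by a service restart.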

Using a reverse proxy

For internet-facing deployments, place a reverse proxy in front of Whisper to handle HTTPS termination.

Example with Caddy (automatic TLS via Let's Encrypt):

whisper.example.com {
  reverse_proxy localhost:9000
}

Example with nginx:

server {
    listen 443 ssl;
    server_name whisper.example.com;

    ssl_certificate     /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    # Audio files can be large — increase the upload limit as needed
    client_max_body_size 100M;

    location / {
        proxy_pass http://127.0.0.1:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
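        proxy_http_version 1.1;       # required for chunked streaming (SSE)
        proxy_buffering off;          # forward SSE events as they arrive instead of buffering them
        proxy_read_timeout 300s;      # allow up to 300 s between upstream reads on long transcriptions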
    }
}

Using with other AI services

Whisper can be used as the speech-to-text service in a broader self-hosted AI setup.

For full and lightweight Docker Compose stacks, manual docker run examples, and voice/RAG/MCP pipeline examples with Kokoro, Embeddings, LiteLLM, Ollama, Docling, and MCP Gateway, see Self-Hosted AI Stack.

Auto install using custom options

sudo bash whisper.sh --auto --model base --port 9000

All install options are optional when using --auto. Defaults: model base, port 9000, listen address 0.0.0.0.

Technical details

  • OS support: Ubuntu 22.04+, Debian 11+, AlmaLinux/Rocky/CentOS 9+, RHEL 9+, Fedora
  • Runtime: Python 3.9+ (virtual environment at /opt/whisper/venv)
  • STT engine: faster-whisper with CTranslate2 (INT8 by default)
  • API framework: FastAPI + Uvicorn
  • API server: api_server.py (installed to /opt/whisper/api_server.py)
  • Audio decoding: PyAV (bundled FFmpeg libraries — no system ffmpeg required)
  • Data directory: /var/lib/whisper (model cache, persistent across upgrades)
  • Config file: /etc/whisper/whisper.conf
  • Service: whisper.service (systemd, runs as dedicated whisper system user)

License

Copyright (C) 2026 Lin Song
This work is licensed under the MIT License.

faster-whisper is Copyright (C) SYSTRAN, and is distributed under the MIT License.

This project is an independent setup for Whisper and is not affiliated with, endorsed by, or sponsored by OpenAI or SYSTRAN.