Whisper Speech-to-Text Auto Setup Script

June 28, 2026

Whisper speech-to-text server installer for Ubuntu, Debian, AlmaLinux, Rocky Linux, CentOS, RHEL and Fedora.

This script installs and configures a self-hosted Whisper speech-to-text API server powered by faster-whisper, providing OpenAI-compatible /v1/audio/transcriptions and /v1/audio/translations endpoints. Transcribe and translate audio files using any app that supports the OpenAI audio API.

Features:

Fully automated Whisper server setup, no user input needed
Supports interactive install using custom options
Supports pre-downloading models and managing the server
OpenAI-compatible POST /v1/audio/transcriptions and POST /v1/audio/translations endpoints — switch any app with a one-line change
Streaming transcription — receive segments via SSE as they are decoded, with no waiting for the full file
Word-level timestamps — per-word start/end times and confidence scores in verbose_json output
Multiple output formats: json, text, verbose_json, srt, vtt
Offline/air-gapped mode — run without internet access using pre-cached models (WHISPER_LOCAL_ONLY)
Audio stays on your server — no data sent to third parties
Installs Whisper as a systemd service with a dedicated system user
Models downloaded from HuggingFace and cached in /var/lib/whisper

Also available:

AI stack: Self-Hosted AI Stack
Docker-based AI services: Whisper (STT), Kokoro (TTS), Embeddings, LiteLLM, Ollama (LLM), Docling, MCP Gateway

Community

📬 Subscribe for project updates (1–2 emails/month) — get free AI and VPN deployment guides (PDF)
💬 Join the r/selfhostedstack community for discussions and showcases
⭐ Star the repository if you find it useful — it helps others discover it

Other self-hosted projects: Setup IPsec VPN, IPsec VPN on Docker, WireGuard, OpenVPN, Headscale.

Requirements

A Linux server (cloud server, VPS, dedicated server or home server)
Python 3.9 or higher (the script installs it automatically on supported distros)
At least 700 MB RAM for the default base model (see model table)
Internet access for the initial model download (the model is cached locally afterwards). Not required if using WHISPER_LOCAL_ONLY with pre-cached models.

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. When using a reverse proxy, set WHISPER_LISTEN_ADDR=127.0.0.1 in /etc/whisper/whisper.conf to prevent direct access to the unencrypted port.

Installation

Download the script on your Linux server:

wget -O whisper.sh https://github.com/hwdsl2/whisper-install/raw/main/whisper-install.sh

Option 1: Auto install with default options.

sudo bash whisper.sh --auto

This installs the base model (~145 MB) on port 9000. The model is downloaded from HuggingFace on first start.

Option 2: Auto install with custom options.

sudo bash whisper.sh --auto --model small --port 9000

Option 3: Interactive install using custom options.

sudo bash whisper.sh

If the wget command above does not work, use one of the alternatives below.

You may also use curl to download:

curl -fL -o whisper.sh https://github.com/hwdsl2/whisper-install/raw/main/whisper-install.sh

If you are unable to download, open whisper-install.sh, then click the Raw button on the right. Press Ctrl/Cmd+A to select all, Ctrl/Cmd+C to copy, then paste into your favorite editor.

Usage information for the script:

Usage: bash whisper.sh [options]

Options:

  --showinfo                           show server info (model, endpoint, API docs)
  --showkey                            show the API key, if configured
  --getkey                             output the API key (machine-readable, no decoration)
  --listmodels                         list available Whisper model names and sizes
  --downloadmodel <model>              pre-download a model to the cache directory
  --uninstall                          remove Whisper and delete all configuration
  -y, --yes                            assume "yes" as answer to prompts
  -h, --help                           show this help message and exit

Install options (optional):

  --auto                               auto install using default or custom options
  --model      <name>                  Whisper model to use (default: base)
  --port       <number>                TCP port for the API server (default: 9000)
  --listenaddr [address]               listen address (default: 0.0.0.0, use 127.0.0.1 for local only)

Available models: tiny, tiny.en, base, base.en, small, small.en,
                  medium, medium.en, large-v1, large-v2, large-v3,
                  large-v3-turbo (or: turbo)

After installation

On first run, the script:

Installs system packages: python3, python3-venv, curl
Creates a whisper system user and group
Creates a Python virtual environment at /opt/whisper/venv
Installs faster-whisper, fastapi, uvicorn, and python-multipart
Generates an API key for fresh installs
Writes the configuration to /etc/whisper/whisper.conf
Installs and starts the whisper systemd service

The first start will download the selected model from HuggingFace. This can take several minutes depending on the model size and network speed. The model is cached in /var/lib/whisper and reused on subsequent starts.

Check service status and logs:

sudo systemctl status whisper
sudo journalctl -u whisper -n 50
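
For unattended deployments, a script can poll the server instead of watching the logs. The sketch below is one possible approach, not part of the installer: it assumes that GET /v1/models only answers with HTTP 200 once the model is loaded, which this document does not state, and the function names are illustrative.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url, api_key=None, timeout=5.0):
    """Return True if GET /v1/models answers with HTTP 200."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/v1/models")
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe, max_wait=600.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() every `interval` seconds until it returns True or `max_wait` elapses."""
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A call such as wait_until_ready(lambda: server_is_up("http://127.0.0.1:9000", api_key)) returns True as soon as the probe succeeds, or False after ten minutes.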

Once you see "Whisper speech-to-text server is ready", transcribe your first audio file:

API_KEY=$(sudo bash whisper.sh --getkey)

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@audio.mp3 -F model=whisper-1

Response:

{"text": "Your transcribed text appears here."}

Tip: Need a sample audio file to test? Download this English speech sample (WAV, MIT License) from the Azure Samples repository:

curl -L -o sample_speech.wav \
    "https://github.com/Azure-Samples/cognitive-services-speech-sdk/raw/master/sampledata/audiofiles/katiesteve.wav"

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@sample_speech.wav \
  -F model=whisper-1

API reference

The API is compatible with OpenAI's audio transcription and audio translation endpoints. Any application already calling https://api.openai.com/v1/audio/transcriptions can switch to the self-hosted server by setting:

OPENAI_BASE_URL=http://<server-ip>:9000

Clients that append only the endpoint path to the base URL, as the official OpenAI SDKs do (for example /audio/transcriptions), typically need the /v1 suffix as well: http://<server-ip>:9000/v1.

OpenAI-only transcription options such as gpt-4o-transcribe-diarize, response_format=diarized_json, include=logprobs, chunking_strategy, known_speaker_names, and known_speaker_references are not supported and return HTTP 400.

Transcribe audio

POST /v1/audio/transcriptions
Content-Type: multipart/form-data

Parameters:

Parameter	Type	Required	Description
`file`	file	✅	Audio file. Supported formats: `mp3`, `mp4`, `m4a`, `wav`, `webm`, `ogg`, `flac`, and all other formats supported by ffmpeg.
`model`	string	✅	Pass `whisper-1`. The value is accepted for compatibility, but the server always uses its active model (`WHISPER_MODEL`).
`language`	string	—	BCP-47 language code (e.g. `en`, `fr`, `zh`). Overrides `WHISPER_LANGUAGE` for this request.
`prompt`	string	—	Optional text to guide the model's style or continue a previous segment.
`response_format`	string	—	Output format. Default: `json`. See response formats. Ignored when `stream=true`. OpenAI-only `diarized_json` is not supported.
`temperature`	float	—	Sampling temperature (0–1). Default: `0`.
`stream`	boolean	—	Enable SSE streaming. When `true`, segments are returned as `text/event-stream` events as they are decoded. Default: `false`.
`timestamp_granularities[]`	array	—	Timestamp granularities to populate. Values: `word`, `segment`. When `word` is included, `verbose_json` output includes a top-level `words` array with per-word timing and confidence. Default: `["segment"]`.

Example:

API_KEY=$(sudo bash whisper.sh --getkey)

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@meeting.m4a \
  -F model=whisper-1 \
  -F language=en

If API key authentication is disabled, omit the Authorization header.
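
The same request can be built from Python with only the standard library. This is a minimal sketch, not the project's client: the function name is illustrative, and sending the file part as application/octet-stream is an assumption (the document does not say whether the server inspects the part's content type).

```python
import urllib.request
import uuid


def build_transcription_request(base_url, audio_bytes, filename, api_key=None, **fields):
    """Build (but do not send) a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    form = {"model": "whisper-1", **fields}
    parts = []
    for name, value in form.items():
        # One part per plain form field, e.g. model=whisper-1 or language=en.
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The audio file part carries the raw bytes unchanged.
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
    )
    request.add_header("Content-Type", f"multipart/form-data; boundary={boundary}")
    if api_key:  # omit the header when authentication is disabled
        request.add_header("Authorization", f"Bearer {api_key}")
    return request
```

Passing the returned object to urllib.request.urlopen sends the upload; with the default json format the response body can be read with json.load and its "text" field holds the transcript.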

Response formats

`response_format`	Description
`json`	`{"text": "..."}` — default, matches OpenAI's basic response
`text`	Plain text, no JSON wrapper
`verbose_json`	Full JSON with language, duration, per-segment timestamps, log-probabilities
`srt`	SubRip subtitle format (`.srt`)
`vtt`	WebVTT subtitle format (`.vtt`)

Example — stream segments as they are decoded:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@long-audio.mp3 \
  -F model=whisper-1 \
  -F stream=true

SSE response (uses the OpenAI streaming transcription protocol):

data: {"type":"transcript.text.delta","delta":"Hello, how are you?"}

data: {"type":"transcript.text.delta","delta":" I'm doing well, thank you."}

data: {"type":"transcript.text.done","text":"Hello, how are you? I'm doing well, thank you."}

data: [DONE]

The first delta typically arrives within 1–3 seconds of upload. Each transcript.text.delta event contains the incremental text for the segment just decoded. The final transcript.text.done event contains the full assembled transcript, equivalent to the standard json response.
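
The framing described above can be parsed in a few lines of Python. The sketch below is illustrative (the function name is not part of the project) and assumes frames shaped like the sample response: one data: line per frame, frames separated by a blank line. It buffers raw bytes so that a multi-byte character split across network chunks is only decoded once its frame is complete.

```python
import json


def iter_transcript_events(chunks):
    """Yield decoded JSON events from an iterable of raw SSE byte chunks.

    Frames are separated by a blank line; iteration stops at "data: [DONE]".
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n\n" in buffer:
            frame, buffer = buffer.split(b"\n\n", 1)
            line = frame.decode("utf-8").strip()
            if not line.startswith("data:"):
                continue  # ignore comments or other SSE fields
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":
                return
            yield json.loads(payload)
```

Any iterable of byte chunks works as input, for example repeated read calls on an HTTP response object.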

Example — stream from a browser using fetch

const form = new FormData();
form.append("file", audioBlob, "audio.webm");
form.append("model", "whisper-1");
form.append("stream", "true");

const res = await fetch("http://<server-ip>:9000/v1/audio/transcriptions", {
  method: "POST",
  headers: { Authorization: `Bearer ${apiKey}` },
  body: form,
});

if (!res.ok) throw new Error(`Transcription failed: HTTP ${res.status}`);

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  // SSE frames are separated by "\n\n"; split and process complete frames
  const frames = buffer.split("\n\n");
  buffer = frames.pop(); // keep any incomplete trailing frame
  for (const frame of frames) {
    if (!frame.startsWith("data: ")) continue;
    const payload = frame.slice(6);
    if (payload.startsWith("[DONE]")) {
      finished = true; // end of stream: leave the outer read loop too
      break;
    }
    const event = JSON.parse(payload);
    if (event.type === "transcript.text.delta") console.log(event.delta);
    if (event.type === "transcript.text.done") console.log("Full text:", event.text);
  }
}

Example — get SRT subtitles:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@video.mp4 \
  -F model=whisper-1 \
  -F response_format=srt
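
If the subtitles need further processing, the SRT text can be turned into structured cues. The parser below is a small sketch of the standard SubRip layout (index line, timing line, text lines, blank line); the sample input used to exercise it is made up, not real server output, and the names Cue and parse_srt are illustrative.

```python
import re
from dataclasses import dataclass

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm" (a dot is also accepted before the milliseconds).
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def parse_srt(srt_text):
    """Parse SubRip text into a list of Cue objects, skipping malformed blocks."""
    cues = []
    blocks = re.split(r"\n\s*\n", srt_text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        if len(lines) < 2:
            continue
        match = _TIMING.search(lines[1])
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        cues.append(
            Cue(
                index=int(lines[0]),
                start=h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
                end=h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
                text="\n".join(lines[2:]),
            )
        )
    return cues
```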

Example — verbose JSON with timestamps:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1 \
  -F response_format=verbose_json

Example — word-level timestamps:

curl http://<server-ip>:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@audio.mp3 \
  -F model=whisper-1 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word"

When timestamp_granularities[] includes word, the verbose_json response includes a top-level words array:

{
  "text": "Hello world.",
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.42, "probability": 0.98},
    {"word": "world.", "start": 0.42, "end": 0.88, "probability": 0.97}
  ],
  "segments": [...]
}
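
The words array lends itself to simple post-processing. The sketch below uses the sample response above (with the elided segments left out) to flag words the model was unsure about and to extract the words spoken in a time window. The helper names and the thresholds are illustrative choices, not part of the API.

```python
def low_confidence_words(response, threshold=0.9):
    """Return the word entries whose probability is below `threshold`."""
    return [w for w in response.get("words", []) if w["probability"] < threshold]


def words_between(response, start, end):
    """Return the text of all words that lie fully inside [start, end] seconds."""
    return " ".join(
        w["word"].strip()
        for w in response.get("words", [])
        if w["start"] >= start and w["end"] <= end
    )


# Sample data copied from the response shown above.
response = {
    "text": "Hello world.",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42, "probability": 0.98},
        {"word": "world.", "start": 0.42, "end": 0.88, "probability": 0.97},
    ],
}

print(low_confidence_words(response, threshold=0.975))
print(words_between(response, 0.0, 0.5))
```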

Translate audio

POST /v1/audio/translations
Content-Type: multipart/form-data

Translates audio in any supported language to English text; the output is always English, regardless of the source language. Compatible with OpenAI's audio translation endpoint and accepts its common parameters.

Note: Translation is not supported with English-only (.en) models. Use a multilingual model such as base, small, or large-v3-turbo.

Example:

curl http://<server-ip>:9000/v1/audio/translations \
  -H "Authorization: Bearer $API_KEY" \
  -F file=@french-audio.mp3 \
  -F model=whisper-1

List models

GET /v1/models

Returns the active model in OpenAI-compatible format.

curl http://<server-ip>:9000/v1/models \
  -H "Authorization: Bearer $API_KEY"
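
OpenAI's list-models response is a JSON object with a data array of model entries. The sketch below assumes this server follows the same shape. The id value shown is a placeholder for illustration, not the server's actual output.

```python
import json


def model_ids(payload):
    """Return the model ids from an OpenAI-style list-models response body."""
    body = json.loads(payload)
    return [entry["id"] for entry in body.get("data", [])]


# Hypothetical response body, for illustration only.
example = '{"object": "list", "data": [{"id": "example-model", "object": "model"}]}'
print(model_ids(example))
```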

Interactive API docs

An interactive Swagger UI is available at:

http://<server-ip>:9000/docs

Available models

Name	Disk	RAM (approx)	Notes
`tiny`	~75 MB	~250 MB	Fastest; lower accuracy
`tiny.en`	~75 MB	~250 MB	English-only variant
`base`	~145 MB	~700 MB	Good balance — default
`base.en`	~145 MB	~700 MB	English-only variant
`small`	~465 MB	~1.5 GB	Better accuracy
`small.en`	~465 MB	~1.5 GB	English-only variant
`medium`	~1.5 GB	~5 GB	High accuracy
`medium.en`	~1.5 GB	~5 GB	English-only variant
`large-v1`	~3 GB	~10 GB	Older large model
`large-v2`	~3 GB	~10 GB	Very high accuracy
`large-v3`	~3 GB	~10 GB	Best accuracy
`large-v3-turbo`	~1.6 GB	~6 GB	Fast + high accuracy ⭐
`turbo`	~1.6 GB	~6 GB	Alias for `large-v3-turbo`

Tip: large-v3-turbo offers accuracy close to large-v3 at roughly half the resource cost. It is the recommended upgrade path from base for most deployments.

Notes:

English-only (.en) variants are slightly faster for English audio.
INT8 quantization (default) reduces RAM usage by approximately 50%.
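
As a rough aid for choosing a model, the table can be turned into a lookup. The sketch below copies the approximate RAM figures from the table (GB converted to MB) and returns the largest model that fits a given memory budget. Treating "needs more RAM" as "more accurate" is a simplification, the older large-v1 and large-v2 models are left out, and pick_model is an illustrative name.

```python
# Approximate RAM needs in MB, smallest first, taken from the model table.
MULTILINGUAL = [
    ("tiny", 250),
    ("base", 700),
    ("small", 1500),
    ("medium", 5000),
    ("large-v3-turbo", 6000),
    ("large-v3", 10000),
]
ENGLISH_ONLY = [
    ("tiny.en", 250),
    ("base.en", 700),
    ("small.en", 1500),
    ("medium.en", 5000),
]


def pick_model(available_ram_mb, english_only=False):
    """Return the largest listed model whose approximate RAM need fits, or None."""
    choice = None
    for name, ram_mb in (ENGLISH_ONLY if english_only else MULTILINGUAL):
        if ram_mb <= available_ram_mb:
            choice = name  # later entries need more RAM, so keep the last fit
    return choice
```

The figures are approximate, so leaving some headroom below the machine's free memory is advisable.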

Managing Whisper

After setup, run the script again to manage your server.

Show server info:

sudo bash whisper.sh --showinfo

Show API key:

sudo bash whisper.sh --showkey

For scripts, output only the raw key:

sudo bash whisper.sh --getkey

List available models:

sudo bash whisper.sh --listmodels

Pre-download a model:

sudo bash whisper.sh --downloadmodel large-v3-turbo

Pre-downloading a model avoids a delay when switching models. After downloading, update WHISPER_MODEL in the configuration file and restart the service.

Remove Whisper:

sudo bash whisper.sh --uninstall

Model files in /var/lib/whisper are preserved. To also remove them:

sudo rm -rf /var/lib/whisper

Show help:

sudo bash whisper.sh --help

You may also run the script without arguments for an interactive management menu.

Configuration

The configuration file is at /etc/whisper/whisper.conf. Edit this file to change settings, then restart the service:

sudo systemctl restart whisper

All variables are optional. If not set, defaults are used automatically.

Variable	Description	Default
`WHISPER_MODEL`	Whisper model to use. See model table for options.	`base`
`WHISPER_PORT`	TCP port for the API server (1–65535).	`9000`
`WHISPER_LISTEN_ADDR`	Listen address for the API server. Use `0.0.0.0` to listen on all interfaces, or `127.0.0.1` for local access only.	`0.0.0.0`
`WHISPER_LANGUAGE`	Default transcription language. BCP-47 code (e.g. `en`, `fr`, `zh`) or `auto` to autodetect.	`auto`
`WHISPER_DEVICE`	Compute device.	`cpu`
`WHISPER_COMPUTE_TYPE`	Quantization type. `int8` is recommended for CPU.	`int8`
`WHISPER_THREADS`	CPU threads for inference. Set to the number of physical cores for best latency.	`2`
`WHISPER_BEAM`	Beam size for decoding. Higher values may improve accuracy at the cost of speed. Use `1` for fastest (greedy) decoding.	`5`
`WHISPER_MAX_UPLOAD_MB`	Maximum uploaded audio file size in MB. Requests above this limit return HTTP 413. Set to `0` to disable the limit.	`1024`
`WHISPER_API_KEY`	Optional Bearer token. Fresh installs auto-generate one. If set, all API requests must include `Authorization: Bearer <key>`. Set explicitly empty to disable authentication.	Auto-generated for fresh installs
`WHISPER_LOG_LEVEL`	Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`.	`INFO`
`WHISPER_LOCAL_ONLY`	When set to any non-empty value, disables all HuggingFace model downloads. For offline or air-gapped deployments with pre-cached models.	(not set)
`WHISPER_WORD_TIMESTAMPS`	When set to `true`, enables word-level timestamps globally for all requests. The `verbose_json` output will include a top-level `words` array with per-word timing and confidence. Can also be enabled per-request via `timestamp_granularities[]=word`.	(not set)
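
Tools that need to read the effective settings can apply the same defaults. The sketch below assumes the file consists of plain KEY=VALUE lines with optional # comments, as the examples in this document suggest. It is a read-only illustration, not the parser the service itself uses. WHISPER_API_KEY, WHISPER_LOCAL_ONLY, and WHISPER_WORD_TIMESTAMPS have no fixed default and are left out of the defaults map.

```python
# Defaults copied from the configuration table.
DEFAULTS = {
    "WHISPER_MODEL": "base",
    "WHISPER_PORT": "9000",
    "WHISPER_LISTEN_ADDR": "0.0.0.0",
    "WHISPER_LANGUAGE": "auto",
    "WHISPER_DEVICE": "cpu",
    "WHISPER_COMPUTE_TYPE": "int8",
    "WHISPER_THREADS": "2",
    "WHISPER_BEAM": "5",
    "WHISPER_MAX_UPLOAD_MB": "1024",
    "WHISPER_LOG_LEVEL": "INFO",
}


def parse_conf(text):
    """Merge KEY=VALUE lines from `text` over the documented defaults."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not an assignment
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")
    port = int(settings["WHISPER_PORT"])
    if not 1 <= port <= 65535:
        raise ValueError(f"WHISPER_PORT out of range: {port}")
    return settings
```

Reading the real file then amounts to parse_conf(open("/etc/whisper/whisper.conf").read()), run with sufficient permissions.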

Switching models

1. Pre-download the new model (optional but recommended):

sudo bash whisper.sh --downloadmodel small

2. Edit the configuration file:

sudo nano /etc/whisper/whisper.conf
# Set: WHISPER_MODEL=small

3. Restart the service:

sudo systemctl restart whisper

Securing your server

If your Whisper server is reachable from the public internet, even briefly, apply at least the protections below. Whisper is CPU/GPU-intensive, so an unauthenticated endpoint can be abused to burn your compute resources.

1. Use an API key. Fresh installs auto-generate an API key. Display it with sudo bash whisper.sh --showkey, or use sudo bash whisper.sh --getkey in scripts. Existing configuration files are not modified; if an existing install has no key, set WHISPER_API_KEY in /etc/whisper/whisper.conf to enable authentication manually. Once a key is set, every API request must include Authorization: Bearer <key>.

# Generate a 32-byte random key
openssl rand -hex 32

2. Bind to localhost when fronted by a reverse proxy. Set WHISPER_LISTEN_ADDR=127.0.0.1 in /etc/whisper/whisper.conf so the unencrypted port is not reachable directly from outside the host. Restart with sudo systemctl restart whisper.

3. Limit upload size. The server rejects uploads above WHISPER_MAX_UPLOAD_MB (default 1024). For internet-facing deployments, also configure your reverse proxy to reject oversized uploads (e.g. nginx client_max_body_size 100M;) before they reach the app.

4. Mind the log level. WHISPER_LOG_LEVEL=DEBUG may write transcript text to logs. Keep it at INFO or higher on shared systems.

5. Enable CORS at the proxy if calling from a browser. The server does not set Access-Control-Allow-Origin headers by default; add them at your reverse proxy if you intend to call the API directly from a web page on a different origin.

6. Consider rate limiting. Place a rate limiter (e.g. nginx limit_req_zone with limit_req, or Caddy rate_limit) in front of the server to cap the request rate per client IP.
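
For step 1, a key can also be generated without openssl. The sketch below uses Python's secrets module to produce the same kind of value as openssl rand -hex 32 (64 hexadecimal characters); the function name is illustrative.

```python
import secrets


def generate_api_key(num_bytes=32):
    """Return a random hex string, two characters per byte."""
    return secrets.token_hex(num_bytes)


print(generate_api_key())
```

The printed value can be stored as WHISPER_API_KEY in /etc/whisper/whisper.conf, followed by a service restart.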

Using a reverse proxy

For internet-facing deployments, place a reverse proxy in front of Whisper to handle HTTPS termination.

Example with Caddy (automatic TLS via Let's Encrypt):

whisper.example.com {
  reverse_proxy localhost:9000
}

Example with nginx:

server {
    listen 443 ssl;
    server_name whisper.example.com;

    ssl_certificate     /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    # Audio files can be large — increase the upload limit as needed
    client_max_body_size 100M;

    location / {
        proxy_pass http://127.0.0.1:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;       # required for chunked streaming (SSE)
        proxy_buffering off;          # pass SSE events through immediately instead of buffering them
        proxy_read_timeout 300s;
    }
}

Using with other AI services

Whisper can be used as the speech-to-text service in a broader self-hosted AI setup.

For full and lightweight Docker Compose stacks, manual docker run examples, and voice/RAG/MCP pipeline examples with Kokoro, Embeddings, LiteLLM, Ollama, Docling, and MCP Gateway, see Self-Hosted AI Stack.

Auto install using custom options

sudo bash whisper.sh --auto --model base --port 9000

All install options are optional when using --auto. Defaults: model base, port 9000, listen address 0.0.0.0.

Technical details

OS support: Ubuntu 22.04+, Debian 11+, AlmaLinux/Rocky/CentOS 9+, RHEL 9+, Fedora
Runtime: Python 3.9+ (virtual environment at /opt/whisper/venv)
STT engine: faster-whisper with CTranslate2 (INT8 by default)
API framework: FastAPI + Uvicorn
API server: api_server.py (installed to /opt/whisper/api_server.py)
Audio decoding: PyAV (bundled FFmpeg libraries — no system ffmpeg required)
Data directory: /var/lib/whisper (model cache, persistent across upgrades)
Config file: /etc/whisper/whisper.conf
Service: whisper.service (systemd, runs as dedicated whisper system user)

License

faster-whisper is Copyright (C) SYSTRAN, and is distributed under the MIT License.

This project is an independent setup for Whisper and is not affiliated with, endorsed by, or sponsored by OpenAI or SYSTRAN.