March 9, 2026
#+STARTUP: showeverything
** whisper.el
Speech-to-Text interface for Emacs using OpenAI's [[https://github.com/openai/whisper][whisper speech recognition model]]. For inference, it uses the C/C++ port [[https://github.com/ggerganov/whisper.cpp][whisper.cpp]] that can run on consumer grade CPU without requiring a high end GPU.
You can capture audio with your input device (microphone) or choose a media file on disk, and have the transcribed text inserted into your Emacs buffer, optionally after translating to English from your local language. This runs offline without having to use non-free cloud services (though quality varies depending on the [[https://github.com/openai/whisper#available-models-and-languages][language]]).
*** Install and Usage
Aside from Git, a C++ compiler and CMake (to build whisper.cpp), the system needs to have =FFmpeg= for recording audio.
You can install =whisper.el= by cloning this repo somewhere, and then use it like:
#+begin_src elisp
(use-package whisper
  :load-path "path/to/whisper.el"
  :bind ("C-H-r" . whisper-run)
  :config
  (setopt whisper-install-directory "/tmp/"
          whisper-model "base"
          whisper-language "en"
          whisper-translate nil
          whisper-return-cursor 'start
          whisper-use-threads (/ (num-processors) 2)))
#+end_src
Or from Emacs 29.1 onwards you can install a package directly from vc source:
#+begin_src elisp
(use-package whisper
  :vc (:url "https://github.com/natrys/whisper.el" :branch "master"))
#+end_src
The entry points of this package are these two functions:
- =whisper-run=: Toggle between recording from your microphone and transcribing
- =whisper-file=: Same as before but transcribes a local file on disk
Invoking =whisper-run= with a prefix argument (C-u) has the same effect as =whisper-file=.
Both of these functions will automatically compile the whisper.cpp dependency and download the language model the first time they are run. When recording is in progress, invoking them stops it and starts transcribing. Otherwise, if a compilation, download (of the model file), or transcription job is in progress, calling them again stops it.
Additionally, =whisper-select-language= function lets you set your language interactively.
Note for macOS users: If whisper.el fails silently, it might be because Emacs doesn't have permission to use the microphone. Follow one of the [[https://github.com/natrys/whisper.el/wiki/MacOS-Configuration#grant-emacs-permission-to-use-mic][recipes]] in the wiki to grant it explicitly.
*** Variables
- =whisper-install-directory=: Location where whisper.cpp will be installed. Default is =~/.emacs.d/.cache/=.
- =whisper-language=: Specify your spoken language; default is =en=. For all possible short-codes (ISO 639-1): [[https://github.com/ggerganov/whisper.cpp/blob/aa6adda26e1ee9843dddb013890e3312bee52cfe/whisper.cpp#L31][see here]]. You can also set it to =auto= to allow whisper.cpp to infer the language from first 30 seconds of audio. Note that you can set this interactively with =whisper-select-language= function too.
- =whisper-model=: Which language model to use. Default is =base=. Valid values are: tiny, base, small, medium, large-v1, large-v2, large-v3, large-v3-turbo. Bigger models are more accurate, but take more time and more RAM to run (aside from more disk space and download size), see: [[https://github.com/ggerganov/whisper.cpp#memory-usage][resource requirements]]. Note that tiny, base, small and medium come with =.en= variants (e.g. =small.en=) that might be faster, but are for English only.
- =whisper-translate=: Default =nil= means transcription output language is same as spoken language. Setting it to =t= translates it to English first.
- =whisper-use-threads=: Default =nil= means letting whisper.cpp choose an appropriate value (which it sets with the formula min(4, num_of_cores)). If you want to use more than 4 threads (because you have more than 4 CPU cores), set this number manually.
Additionally, depending on your input device and system you may need to modify these variables to get recording to work:
- =whisper--ffmpeg-input-format=: This is what you would pass to FFmpeg's =-f= flag to select the audio input format. Default is =pulse= on Linux, =avfoundation= on macOS and =dshow= on Windows.
- =whisper--ffmpeg-input-device=: This is what you would pass to FFmpeg's =-i= flag to record audio. If you are using PulseAudio or pipewire-pulse on Linux, the default is the =default= source; otherwise this will likely need to be set. For macOS users, the wiki contains a recipe that lets you set this interactively: [[https://github.com/natrys/whisper.el/wiki/MacOS-Configuration#what-should-be-the-value-of-whisper--ffmpeg-input-device][see here]].
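As an illustration, on a Linux system with PulseAudio or PipeWire you might record from a specific source instead of the default. The device name below is hypothetical; list the ones available on your system with =pactl list short sources=:

#+begin_src elisp
;; Both values below are examples; the device name in particular
;; depends entirely on your system (see `pactl list short sources').
(setq whisper--ffmpeg-input-format "pulse"
      whisper--ffmpeg-input-device "alsa_input.usb-microphone.analog-stereo")
#+end_src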
A few other variables for customising the workflow:
- =whisper-insert-text-at-point=: By default whisper.el inserts the transcribed text at the point where =whisper-run= or =whisper-file= was invoked. But if you set this to =nil=, the text will be displayed in a separate buffer instead.
- If you don't want to display the buffer at all and want to take control of your own workflow through functions in =whisper-after-transcription-hook=, then also set =whisper-display-transcription-buffer= to =nil=.
- By default these temporary output buffers are differentiated by timestamp; you have to take care of cleanup once you are done. If you want to keep reusing the same buffer, do:
  #+begin_src elisp
  (setq whisper-transcription-buffer-name-function
        #'whisper--simple-transcription-buffer-name)
  #+end_src
- =whisper-return-cursor=: When whisper.el inserts transcribed text at the point where =whisper-run= was invoked, where we put the cursor depends on this variable. It can have the following values:
- ='start= (default): Puts cursor at the start of the inserted text (so exactly where =whisper-run= was invoked)
- ='end=: Puts cursor at the end of the inserted text
- =nil=: Don't move the cursor at all. Useful when you have drifted away after calling =whisper-run= and don't want to lose your current position
- =whisper-server-mode=: Choose between different transcription modes:
  - =nil= (default): Use whisper.cpp directly
  - =local=: Spawns a local whisper.cpp HTTP server and uses that over the network
  - =remote=: In case you already have a whisper.cpp HTTP server running on a remote machine
  - =openai=: Use an OpenAI compatible transcription API (requires an API key)
- =whisper-server-baseurl=: Base URL for the whisper.cpp server (for both =local= and =remote= modes). Should be a full URL including protocol, host, and port (e.g. =http://localhost:8642= or =https://whisper.example.com=)
- =whisper-openai-api-baseurl=: When using OpenAI server mode we use their API by default, but you could use other services too (e.g. for Mistral set it to =https://api.mistral.ai=)
- =whisper-openai-api-key=: API key for the OpenAI compatible whisper API (required when using OpenAI server mode)
- =whisper-openai-model=: When using OpenAI server mode we default to their proprietary model =gpt-4o-transcribe=, but you could use =whisper-1= or other service-appropriate models like =voxtral-mini-latest=, which is Mistral's open-weight model
- =whisper-recording-timeout=: Default is =300= seconds. We do not want to start recording and then forget. The intermediate temporary file is stored in uncompressed =wav= format (roughly 4.5MB per minute, but can vary); it can grow and fill the disk even though =/tmp/= is used for it by default.
- =whisper-show-progress-in-mode-line=: By default, the progress of the running whisper.cpp job is shown in the mode line.
- =whisper-quantize=: Whether to quantize the model (default =nil=). Non-nil valid values are: q4_0, q4_1, q5_0, q5_1, q8_0. For an explanation of what quantization means, [[https://github.com/natrys/whisper.el#quantize-the-model][see below]]. If it is set, whisper.el will automatically quantize the model before using it.
- =whisper-install-whispercpp=: By default the installation of whisper.cpp is done automatically. If you are on a platform where our automatic install fails, but you are able to do so manually at =whisper-install-directory=, you can set this to ='manual= to ensure we don't try and fail to install it automatically. Also if you are planning to not use whisper.cpp at all by overriding =whisper-command= ([[https://github.com/natrys/whisper.el#use-something-other-than-whispercpp][see below]]), you can just set this to =nil= to ensure no whisper.cpp related runtime checks and downloads will be performed.
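For example, if the automatic install fails on your platform but you have built whisper.cpp by hand under =whisper-install-directory=, you can opt out of the automatic installation:

#+begin_src elisp
;; Skip the automatic build; this assumes a working whisper.cpp
;; already exists under `whisper-install-directory'.
(setq whisper-install-whispercpp 'manual)
#+end_src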
*** Note on recording quality
PulseAudio and PipeWire users who haven't further configured their =default= source may find that results are better when at least the =echocancel= filter is enabled, by loading the relevant module. Then you could either set that as the default source (using e.g. =pactl=) or use that source's name in =whisper--ffmpeg-input-device=. Otherwise, the following programs/plugins can be used to improve the quality of audio recording in general:
- [[https://github.com/wwmm/easyeffects][Easy Effects]]
- [[https://github.com/werman/noise-suppression-for-voice][RNNoise Plugin]]
- [[https://github.com/noisetorch/NoiseTorch][NoiseTorch]]
- [[https://github.com/Rikorose/DeepFilterNet][DeepFilterNet]] (see also [[https://github.com/Rikorose/DeepFilterNet/blob/main/ladspa/README.md][PipeWire integration]])
*** Hooks
There are a few hooks provided for registering user defined actions:
- =whisper-before-transcription-hook=: Functions here are run before anything else. Helpful to ensure suitable condition to run whisper (e.g. check if buffer is read-only).
- =whisper-after-transcription-hook=: If you want to do some text formatting or transformation on the whisper output, add a function here. Each function runs in a temporary buffer containing the transcription output, with point set to the beginning of the buffer. For example, the default command output is one big line of text. If you want to do something like adding a paragraph break every N sentences, you could do:
  #+begin_src elisp
  (defun whisper--break-sentences (n)
    "Put a paragraph break every N sentences."
    (catch 'return
      (while t
        (dotimes (_ n)
          (forward-sentence 1)
          (when (eobp)
            (throw 'return nil)))
        (insert "\n\n")
        (when (= (char-after) ?\ )
          (delete-horizontal-space)))))

  ;; add a paragraph break every 5 sentences
  (add-hook 'whisper-after-transcription-hook
            (lambda () (whisper--break-sentences 5)))
  #+end_src
- =whisper-after-insert-hook=: These functions are run after transcription is completed and the text has been inserted into the original buffer.
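As a sketch, the first and last hooks could be used together to guard against read-only buffers before recording and to save the buffer after insertion. Both functions below are user-defined examples, not part of whisper.el:

#+begin_src elisp
(defun my/whisper-guard-read-only ()
  "Abort early if the current buffer cannot accept text."
  (when buffer-read-only
    (user-error "Buffer is read-only; not starting whisper")))

(defun my/whisper-save-buffer ()
  "Save the buffer once the transcribed text has been inserted."
  (when (buffer-file-name)
    (save-buffer)))

(add-hook 'whisper-before-transcription-hook #'my/whisper-guard-read-only)
(add-hook 'whisper-after-insert-hook #'my/whisper-save-buffer)
#+end_src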
*** Performance Guide for Advanced Users
By default, whisper.cpp performance on CPU is likely good enough for most people and most use cases. However if it's not good enough for you, here are some things you could do:
**** Update whisper.cpp
The upstream whisper.cpp is continuously improving. If you are using an old version, updating whisper.cpp is the first thing you could try. Simplest way to do that is to delete your whisper.cpp installation folder and re-run the command, which will reinstall from latest commit.
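Assuming the default layout where whisper.cpp is cloned into a =whisper.cpp= subdirectory of =whisper-install-directory= (verify the path on your machine before deleting), the deletion could be done from inside Emacs:

#+begin_src elisp
;; Assumption: whisper.cpp lives in a `whisper.cpp' subdirectory of
;; `whisper-install-directory'; double-check before deleting.
(delete-directory
 (expand-file-name "whisper.cpp" whisper-install-directory) t)
#+end_src

The next invocation of =whisper-run= will then reinstall from the latest commit.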
**** Quantize the model
Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types. This sacrifices precision for resource efficiency. The idea is that a quantized version of a bigger model may allow you to use it (e.g. if you are RAM constrained) with some penalty to accuracy, while hopefully still being more accurate than the smaller model you would otherwise be using.
**** Re-compile whisper.cpp for hardware acceleration
Offloading the encoder inference to hardware or optimised external libraries may result in a speed-up. There are options to use: Core ML (for Apple hardware), cuBLAS (for NVIDIA GPUs), OpenVINO (Intel CPU/GPU), CLBlast (for GPUs that support OpenCL), OpenBLAS (an optimised matrix processing library for CPUs). Consult the [[https://github.com/ggerganov/whisper.cpp][whisper.cpp README]] for how to re-compile whisper.cpp to enable those.
**** Use something other than whisper.cpp
If you think there is something else you want to use, you have the option to override the =whisper-command= function definition, or define an overriding advice. This function takes a path to the input audio file as argument, and returns a list denoting the command (compatible with the =:command= argument to [[https://www.gnu.org/software/emacs/manual/html_node/elisp/Asynchronous-Processes.html][make-process]]) to be run instead of whisper.cpp. You can use the variables described above in this readme to devise the command. The wiki [[https://github.com/natrys/whisper.el/wiki/Setup-to-use-whisper%E2%80%90ctranslate2-instead-of-whisper.cpp][contains a recipe]] that shows how to use [[https://github.com/Softcatala/whisper-ctranslate2][whisper-ctranslate2]] with whisper.el. This client is compatible with OpenAI's original one, so porting the recipe to use the original client should be possible.
Note that when you are using something other than whisper.cpp, the onus is on you to make sure the target program is properly installed and the relevant model files for it are downloaded beforehand. We don't support anything other than whisper.cpp, so problems integrating other software with whisper.el that are specific to that software may be beyond our ability to address.
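As a sketch, an override advice might look like the following; =my-whisper-cli= and its flags are entirely hypothetical stand-ins for whatever program you actually use:

#+begin_src elisp
;; `my-whisper-cli' is a hypothetical executable; substitute your own.
(advice-add 'whisper-command :override
            (lambda (input-file)
              ;; Return a list suitable as the :command argument
              ;; to `make-process'.
              (list "my-whisper-cli"
                    "--model" whisper-model
                    "--language" whisper-language
                    (expand-file-name input-file))))
#+end_src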
**** Server Modes
whisper.el supports three different modes of operation:
***** Direct Mode (default)
The default mode runs whisper.cpp directly for each transcription request. This is the simplest setup, but it requires the subprocess to load the model each time, and the latency incurred may or may not be trivial depending on your machine and usage.
***** Local Server Mode
When =whisper-server-mode= is set to =local=, whisper.el will run whisper.cpp as a persistent HTTP server. Benefits: the model is loaded once and kept in memory, giving better performance for multiple transcription requests; the server can be shared between multiple Emacs instances and other programs. To use local server mode:
#+begin_src elisp
(setq whisper-server-mode 'local
      whisper-server-baseurl "http://127.0.0.1:8642")
#+end_src
***** Remote Server Mode
For when you want to connect to an existing whisper.cpp server running on e.g. another machine.
#+begin_src elisp
(setq whisper-server-mode 'remote
      whisper-server-baseurl "http://192.168.0.101:8642")
#+end_src
You need to ensure the server is running and is accessible (for the =/inference= endpoint).
***** OpenAI API Mode
When =whisper-server-mode= is set to =openai=, whisper.el will use OpenAI's official Whisper API (or another compatible provider). Benefits: no local model or CPU requirements, access to better models (though often proprietary) with lower error rates, potentially faster transcription etc.
Note: You need to bring your API key (which will incur charges). Non-local services also have privacy issues.
To use OpenAI API compatible server mode:
#+begin_src elisp
(setq whisper-server-mode 'openai
      whisper-openai-api-key (getenv "OPENAI_API_KEY"))
#+end_src
You don't necessarily have to use OpenAI's service. Whisper is served by other providers like Groq. Or you could, for example, use Mistral's open-weight model ([[https://mistral.ai/news/voxtral][Voxtral]]) from their platform:
#+begin_src elisp
(setq whisper-server-mode 'openai
      whisper-openai-model "voxtral-mini-latest"
      whisper-openai-api-baseurl "https://api.mistral.ai/"
      whisper-openai-api-key (getenv "MISTRAL_API_KEY"))
#+end_src
*** Caveats
- Whisper is open source in the sense that the weights and the engine source are available, but the training data and methodology are not.
- Real-time transcription is probably not feasible with it yet. Accuracy is better when it has a bigger window of surrounding context, and it would need beefy hardware to keep up, possibly using a smaller model. There is some interesting activity going on at whisper.cpp upstream, but in the end I don't see the appeal of that in my workflow (yet).