forced-alignment-tools

April 18, 2026

A collection of links and notes on forced alignment tools

Did I miss an aligner? Please open an issue, or fork the repository and submit a pull request.

Definition of Forced Alignment

Given an audio file containing speech, and the corresponding transcript, computing a forced alignment is the process of determining, for each fragment of the transcript, the time interval (in the audio file) containing the spoken text of the fragment.

The granularity of "text fragment" might be arbitrarily defined as any of the following:

  • a paragraph,
  • a sentence,
  • a clause/portion of a sentence (i.e., a sequence of words),
  • a word, or
  • a phoneme (i.e., a single sound);

but note that a given aligner might be designed to produce a good alignment only at a specific granularity, while it might produce incorrect results or even no output at all at finer granularities.

For example, given this text file and this audio file, a forced alignment at verse level could be the following:

1                                                     => [00:00:00.000, 00:00:02.640]
From fairest creatures we desire increase,            => [00:00:02.640, 00:00:05.880]
That thereby beauty's rose might never die,           => [00:00:05.880, 00:00:09.240]
But as the riper should by time decease,              => [00:00:09.240, 00:00:11.920]
His tender heir might bear his memory:                => [00:00:11.920, 00:00:15.280]
...
Pity the world, or else this glutton be,              => [00:00:43.640, 00:00:48.080]
To eat the world's due, by the grave and thee.        => [00:00:48.080, 00:00:53.240]
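
The mapping above can be modeled as a list of text fragments, each carrying a begin and an end time. The following is a minimal sketch under that assumption. The `Fragment` class and the helper functions are hypothetical names chosen for illustration and do not correspond to the API of any aligner listed here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One aligned text fragment (hypothetical structure, not an aligner API)."""
    text: str
    begin: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS.mmm, the notation used in the example above."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def is_contiguous(fragments: list[Fragment]) -> bool:
    """True if every fragment begins exactly where the previous one ends."""
    return all(a.end == b.begin for a, b in zip(fragments, fragments[1:]))


def render(fragments: list[Fragment]) -> list[str]:
    """Format fragments as 'text => [begin, end]' lines, padded to equal width."""
    width = max(len(f.text) for f in fragments)
    return [
        f"{f.text:<{width}} => "
        f"[{format_timestamp(f.begin)}, {format_timestamp(f.end)}]"
        for f in fragments
    ]


if __name__ == "__main__":
    verses = [
        Fragment("1", 0.0, 2.64),
        Fragment("From fairest creatures we desire increase,", 2.64, 5.88),
    ]
    print("\n".join(render(verses)))
```

In the example above each interval starts where the previous one ends, which is what `is_contiguous` checks. Contiguity is a property of that particular output, not a requirement of forced alignment in general.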

Typical applications of forced alignment include Audio-eBooks, closed captioning, and automating the creation of training data for automatic speech recognition (ASR) systems.
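
To illustrate the closed captioning use case, here is a minimal sketch that turns a sentence-level or verse-level alignment into SubRip (SRT) caption text. The input shape, a list of `(text, begin_seconds, end_seconds)` tuples, and the function names are assumptions made for this example. Real aligners emit their own output formats, which would need to be parsed first.

```python
def srt_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(alignment) -> str:
    """Build SRT text from (text, begin_seconds, end_seconds) tuples.

    Each cue is: a 1-based index, a 'begin --> end' line, the caption text,
    and a blank line separating it from the next cue.
    """
    blocks = []
    for index, (text, begin, end) in enumerate(alignment, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(begin)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    alignment = [
        ("From fairest creatures we desire increase,", 2.64, 5.88),
        ("That thereby beauty's rose might never die,", 5.88, 9.24),
    ]
    print(to_srt(alignment))
```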

Programs and Libraries

The following matrix lists open source programs and libraries for computing forced alignments that have been verified to install and run (although the installation procedure for some of them is fairly complex).

All tools, except aeneas, are based on speech recognition algorithms; all tools, except aeneas and gentle, are maintained by research groups or individuals in academia.

Most tools are based on HTK (the Hidden Markov Model Toolkit), which is not free for commercial use, although a commercial license can be purchased from the University of Cambridge.

You can also download the raw data file in JSON format.

| Name | Algorithm | Finest Supported Granularity | Supported Language(s) | Interface | Code Language(s) | License | Documentation | Mailing List/Forum | Active | Notes |
|---|---|---|---|---|---|---|---|---|---|---|
| aeneas | DTW | word | 30+ | CLI, LIB, Web | Python, C | AGPL | Y | Y | Y | Not based on ASR |
| CMU Sphinx | HMM (own), RNN | phoneme | 11 | CLI, LIB | C, Java, Python | MIT-like | Y | Y | Y | |
| CTC segmentation | RNN | phoneme | English, German, Chinese, French | CLI | Python | Apache | Y | N | Y | Can be extended to any other language where an ASR model is available |
| DARLA | HMM (HTK) | phoneme | English | Web | ? | ? | Y | N | N? | Based on Prosodylab-Aligner or YouTube ASR |
| FAVE-align | HMM (HTK) | phoneme | English | CLI, (Web) | Python | GPL | Y | Y | Y | Acoustic models from P2FA; GitHub code updated more frequently than Web |
| Gentle | HMM (Kaldi) | phoneme | English | CLI, Web | Python | MIT | N | N | Y | Based on Kaldi |
| Julius | HMM (own) | phoneme | English, Japanese | CLI, LIB | C | MIT-like | Y | Y | N? | |
| Kaldi | HMM (own), DNN, RNN | phoneme | English | CLI, LIB | C++ | Apache | Y | Y | Y | CUDA support |
| kaldi-dnn-ali-gop | HMM (Kaldi), DNN (Kaldi nnet3) | phoneme | English | CLI, LIB | Shell Script, C++, Python | GPL | N | N | Y | Works with other languages given Kaldi acoustic models |
| LaBB-CAT | HMM (HTK) | phoneme | English | Web | Java | GPL | Y | Y | Y | |
| MAUS | HMM (HTK) | phoneme | 21 | CLI, Web | C | All rights reserved | README | Y | Y | |
| Montreal Forced Aligner | HMM (Kaldi) | phoneme | English | CLI | Python | MIT | Y | N | Y | Can train other languages |
| Penn Forced Aligner (P2FA) | HMM (HTK) | phoneme | English | CLI, Web | Python | ? | README, Tutorial | N | N? | |
| Prosodylab-Aligner | HMM (HTK) | phoneme | English | CLI | Python | MIT | README, Tutorial | N | Y | Can train other languages |
| SailAlign | HMM (HTK) | phoneme | English, Greek, Spanish | CLI | Perl | GPL | README | N | N? | |
| SPPAS | HMM (Julius) | phoneme | 12+ | CLI, GUI | Python | GPL | Y | Y | Y | Can train other languages, several plugins |
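
The matrix lists DTW (dynamic time warping) as the algorithm behind aeneas, the one tool not based on ASR. As a rough illustration of the general technique only, the sketch below aligns two short numeric sequences with the textbook DTW recurrence and recovers the warping path. It is not aeneas's implementation: the function name, the absolute-difference cost, and the one-dimensional inputs are assumptions made for this example, whereas a real aligner would compare sequences of audio feature vectors.

```python
def dtw_path(a, b):
    """Return (total_cost, path) for a DTW alignment of sequences a and b.

    path is a list of (i, j) index pairs, meaning a[i] is matched with b[j].
    Assumes both sequences are non-empty and contain numbers.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
            )

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


if __name__ == "__main__":
    total, path = dtw_path([0, 1, 2], [0, 1, 1, 2])
    print(total, path)
```

Once such a path is known, the time stamps of one sequence can be transferred to the other, which is the basic idea behind using DTW for alignment.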

Additional Pointers