forced-alignment-tools


A collection of links and notes on forced alignment tools

Version: 1.0.11
Date: 2026-04-18
Author: Alberto Pettarin (contact)
License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Did I miss an aligner? Please open an issue or directly fork-commit-pullrequest.

Definition of Forced Alignment

Given an audio file containing speech, and the corresponding transcript, computing a forced alignment is the process of determining, for each fragment of the transcript, the time interval (in the audio file) containing the spoken text of the fragment.

The granularity of "text fragment" might be arbitrarily defined as any of the following:

a paragraph,
a sentence,
a clause/portion of a sentence (i.e., a sequence of words),
a word, or
a phoneme (i.e., a single sound);

but note that a given aligner might be designed to produce a good alignment only at a specific granularity, while it might produce incorrect results or even no output at all at finer granularities.

For example, given this text file and this audio file, a forced alignment at verse level might be the following:

1                                                     => [00:00:00.000, 00:00:02.640]
From fairest creatures we desire increase,            => [00:00:02.640, 00:00:05.880]
That thereby beauty's rose might never die,           => [00:00:05.880, 00:00:09.240]
But as the riper should by time decease,              => [00:00:09.240, 00:00:11.920]
His tender heir might bear his memory:                => [00:00:11.920, 00:00:15.280]
...
Pity the world, or else this glutton be,              => [00:00:43.640, 00:00:48.080]
To eat the world's due, by the grave and thee.        => [00:00:48.080, 00:00:53.240]
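
A listing like the one above can be read as a sequence of (text, begin, end) fragments. The following is a minimal Python sketch, assuming the `text => [begin, end]` layout shown above with `HH:MM:SS.mmm` timestamps. The names `Fragment`, `to_seconds`, and `parse_alignment` are illustrative only and do not belong to any aligner's API.

```python
import re
from typing import NamedTuple


class Fragment(NamedTuple):
    text: str
    begin: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


# Assumed line layout: "<text> => [HH:MM:SS.mmm, HH:MM:SS.mmm]"
LINE_RE = re.compile(
    r"^(?P<text>.*?)\s*=>\s*\[(?P<begin>[\d:.]+),\s*(?P<end>[\d:.]+)\]\s*$"
)


def to_seconds(stamp: str) -> float:
    """Convert an HH:MM:SS.mmm timestamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_alignment(lines):
    """Parse alignment lines into Fragment tuples, skipping lines that do not match."""
    fragments = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. the "..." placeholder line
        fragments.append(
            Fragment(
                match["text"],
                to_seconds(match["begin"]),
                to_seconds(match["end"]),
            )
        )
    return fragments


sample = [
    "1                                                     => [00:00:00.000, 00:00:02.640]",
    "From fairest creatures we desire increase,            => [00:00:02.640, 00:00:05.880]",
    "...",
]
fragments = parse_alignment(sample)
for fragment in fragments:
    print(fragment)
```

Storing times as seconds (rather than strings) makes it easy to compute fragment durations or to convert to another timestamp format later.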

Typical applications of forced alignment include Audio-eBooks, closed captioning, and automating the creation of training data for automated speech recognition systems.
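
To illustrate the closed captioning use case, here is a hedged Python sketch that turns aligned fragments into SubRip (SRT) caption text. It assumes the fragments are already available as `(text, begin_seconds, end_seconds)` tuples; the function names `srt_timestamp` and `to_srt` are hypothetical and not taken from any of the tools listed in this document.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(fragments) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    blocks = []
    for index, (text, begin, end) in enumerate(fragments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(begin)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Example fragments (text, begin, end), times in seconds.
fragments = [
    ("From fairest creatures we desire increase,", 2.640, 5.880),
    ("That thereby beauty's rose might never die,", 5.880, 9.240),
]
srt = to_srt(fragments)
print(srt)
```

Several of the aligners listed below can write subtitle formats directly, so a conversion step like this is only needed when a tool emits raw time intervals.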

Programs and Libraries

The following matrix lists open source programs and libraries for computing forced alignments that have been verified to install and run (although the installation procedure for some of them is rather complex).

All tools, except aeneas, are based on speech recognition algorithms; all tools, except aeneas and gentle, are maintained by research groups or individuals in academia.

Many of the tools are based on HTK (the Hidden Markov Model Toolkit), which is not free for commercial use, although a commercial license can be purchased from the University of Cambridge.

You can also download the raw data file in JSON format.

Name	Algorithm	Finest Supported Granularity	Supported Language(s)	Interface	Code Language(s)	License	Documentation	Mailing List/Forum	Active	Notes
aeneas	DTW	word	30+	CLI, LIB, Web	Python, C	AGPL	Y	Y	Y	Not based on ASR
CMU Sphinx	HMM (own), RNN	phoneme	11	CLI, LIB	C, Java, Python	MIT-like	Y	Y	Y
CTC segmentation	RNN	phoneme	English, German, Chinese, French	CLI	Python	Apache	Y	N	Y	Can be extended to any other language where an ASR model is available
DARLA	HMM (HTK)	phoneme	English	Web	?	?	Y	N	N?	Based on Prosodylab-Aligner or YouTube ASR
FAVE-align	HMM (HTK)	phoneme	English	CLI, (Web)	Python	GPL	Y	Y	Y	Acoustic models from P2FA; GitHub code updated more frequently than Web
Gentle	HMM (Kaldi)	phoneme	English	CLI, Web	Python	MIT	N	N	Y	Based on Kaldi
Julius	HMM (own)	phoneme	English, Japanese	CLI, LIB	C	MIT-like	Y	Y	N?
Kaldi	HMM (own), DNN, RNN	phoneme	English	CLI, LIB	C++	Apache	Y	Y	Y	CUDA support
kaldi-dnn-ali-gop	HMM (Kaldi), DNN (Kaldi nnet3)	phoneme	English	CLI, LIB	Shell Script, C++, Python	GPL	N	N	Y	Works with other languages given Kaldi acoustic models
LaBB-CAT	HMM (HTK)	phoneme	English	Web	Java	GPL	Y	Y	Y
MAUS	HMM (HTK)	phoneme	21	CLI, Web	C	All rights reserved	README	Y	Y
Montreal Forced Aligner	HMM (Kaldi)	phoneme	English	CLI	Python	MIT	Y	N	Y	Can train other languages
Penn Forced Aligner (P2FA)	HMM (HTK)	phoneme	English	CLI, Web	Python	?	README, Tutorial	N	N?
Prosodylab-Aligner	HMM (HTK)	phoneme	English	CLI	Python	MIT	README, Tutorial	N	Y	Can train other languages
SailAlign	HMM (HTK)	phoneme	English, Greek, Spanish	CLI	Perl	GPL	README	N	N?
SPPAS	HMM (Julius)	phoneme	12+	CLI, GUI	Python	GPL	Y	Y	Y	Can train other languages; several plugins

AGPL: GNU Affero General Public License
Apache: Apache License
CLI: command line interface
DNN: Deep Neural Network
DTW: Dynamic Time Warping
GPL: GNU General Public License
GUI: graphical interface
HMM: Hidden Markov Model
LIB: library callable by third party software
MFCC: Mel-frequency Cepstral Coefficients
MIT: MIT License
RNN: Recurrent Neural Network
Web: Web-based graphical interface, local and/or remote
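
Since DTW is the one non-ASR algorithm in the matrix, a small illustration may help. The following Python sketch implements textbook Dynamic Time Warping on two sequences of numbers and returns the accumulated cost together with the warping path. It is a simplified illustration under stated assumptions, not the implementation used by aeneas or any other tool: real aligners typically compare sequences of feature vectors (such as MFCCs) and constrain the search for efficiency. The function name `dtw` and the sample sequences are made up for this example.

```python
def dtw(a, b):
    """Return (total_cost, path) aligning sequences a and b with classic DTW.

    path is a list of (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = abs(a[i - 1] - b[j - 1])
            cost[i][j] = distance + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# The second sequence is a "stretched" version of the first.
reference = [0, 1, 2, 3]
stretched = [0, 0, 1, 2, 2, 3]
total, path = dtw(reference, stretched)
print(total, path)
```

In a forced alignment setting, the warping path is what maps positions in one signal to time instants in the other; fragment boundaries are then read off that mapping.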

Additional Pointers

AZP2FA (fork of P2FA)
Automated Audio Segmentation Using Forced Alignment
Automatic and Accurate Captioning (based on CMU Sphinx)
Berkeley Phonetics Machine
Building Acoustic Models using Kaldi Voxforge recipe to obtain word level transcripts for long video files
CTC segmentation: Implementation in ESPnet
CTC segmentation: Implementation in Nvidia NeMo
CTC segmentation: Implementation in Speechbrain
CTC segmentation: Implementation in pytorch
DARLA
EasyAlign: phonetic alignment with Praat
FAVE-align (the Web interface for the Penn Forced Aligner)
FAVE-align (source code)
Forced Alignment Overview (ISIP)
Forced Alignment and Speech Recognition Systems (Oxford)
Forced Alignment of Spoken Audio
Forced Alignment with InproTK (and Sphinx)
Gentle (based on Kaldi)
HTKBook (has a chapter on computing forced alignments with HTK, requires registration)
InproTK
Introduction to Speech Analysis with FAVE
Julius
Kaldi Forced Alignment
Kaldi
Korean Phonetic Aligner (Web only, Korean only)
LaBB-CAT
Long Audio Aligner Landed in Trunk (Sphinx)
MAUS
Montreal Forced Aligner
Penn Forced Aligner
Penn Forced Aligner
Praatalign: an interactive Praat plug-in for performing phonetic forced alignment
ProsodyLab-Aligner
Robust Automatic Transcription of Speech (RATS)
SPPAS Automatic Annotation of Speech (based on Julius)
Simple English Forced Alignment (UPenn LING521)
VoxForge
WebMAUS (the Web interface for MAUS)
What is forced alignment? (ICSI)
What is forced alignment? (VoxForge)
aeneas
andreemic/forced-alignment (Replicate.com interface to aeneas)
speech.zone