Common Voice Datasets

June 16, 2026 · View on GitHub

This repo contains release details and metadata for the Common Voice datasets. Please visit the Mozilla Data Collective Common Voice section to download the latest datasets.

Dataset Types

Common Voice collects voice data through multiple modalities. Each dataset type has its own release information, data structure, and documentation.

Type	Alias	Status	Releases	Latest (2026-06)	Languages
Scripted Speech	SCS	Active	26	v26.0	294
Spontaneous Speech	SPS	Active	4	v4.0	78
Code Switching	CS	Alpha	--	--	--

See each dataset type's documentation for detailed information about data structures, fields in metadata files (.tsv), archive contents, and release changelogs. Note that the "date" in releases represents the cut-off date for data collection and validation, not the actual release date of the dataset.

Data Pipeline

flowchart LR
    subgraph SCS["Scripted Speech (SCS)"]
        SCS_DB[("DB")]
        SCS_GCS["GCS"]
    end
    subgraph SCS_BUN["SCS Bundler"]
        CC["CorporaCreator"]
    end
    subgraph SCS_BUN2["SCS Bundler"]
      UP["Uploader"]
    end

    DSH["cv-datasheets"]

    subgraph SPS["Spontaneous Speech (SPS)"]
        SPS_DB[("DB")]
        SPS_GCS["GCS"]
    end
    subgraph SPS_BUN["SPS Bundler"]
        QA["QA Pipeline"]
    end

    BUN_GCS["GCS
    datasets
    datasheets
    stats"]

    MDC[["MDC
    downloads"]]
    CDS[["cv-dataset ◀"]]

    SCS_DB -->|data| SCS_BUN
    SCS_GCS -->|clips| SCS_BUN
    DSH -->|JSON| SCS_BUN
    DSH -->|JSON| SPS_BUN
    SPS_DB -->|data| SPS_BUN
    SPS_GCS -->|audio| SPS_BUN
    SCS_BUN --> BUN_GCS
    SPS_BUN --> BUN_GCS
    BUN_GCS -->|datasets| UP
    BUN_GCS -->|datasheets| UP
    UP -->|API| MDC
    BUN_GCS -->|stats| CDS

    style CDS fill:#1a73e8,color:#ffffff,stroke:#1558b0,stroke-width:2px

Overview

Scripted Speech (SCS)

---
config:
    xyChart:
        width: 900
        height: 400
    themeVariables:
        xyChart:
            plotColorPalette: "#1a73e8, #34a853"
---
xychart-beta
    title "Scripted Speech: Total & Validated Hours"
    x-axis ["1","2","3","4","5.1","6.1","7","8","9","10","11","12","13","14","15","16.1","17","18","19","20","21","22","23","24","25","26"]
    y-axis "Hours" 0 --> 43000
    bar [1368,2366,2454,4257,7226,9283,13905,18243,20217,20817,24231,26119,27141,28117,28750,30328,31175,32121,32584,33154,33534,33815,35921,38932,41792,42388]
    bar [1096,1872,1979,3401,5671,7335,11192,14122,14973,15234,16429,17127,17689,18651,19159,19915,20408,20943,21593,22106,22344,22640,24600,25886,28377,28893]

For details see: Scripted Speech documentation

Spontaneous Speech (SPS)

---
config:
    xyChart:
        width: 600
        height: 350
    themeVariables:
        xyChart:
            plotColorPalette: "#1a73e8, #34a853"
---
xychart-beta
    title "Spontaneous Speech: Total vs Validated Hours"
    x-axis ["v1.0","v2.0","v3.0","v4.0"]
    y-axis "Hours" 0 --> 600
    bar [428,454,508,515]
    bar [263,268,269,273]

For details see: Spontaneous Speech documentation

Dataset Access

You can download the Common Voice datasets from the Mozilla Data Collective (MDC) platform:

Directly from the browser
Using the MDC API
Using the MDC Python SDK to directly load the datasets as pandas DataFrame in your Python codebase

Generating Dataset Statistics

Helper scripts are available in the helpers/ directory for processing bundler output into dataset statistics. See helpers/README.md for detailed usage and examples.

All helper scripts support multiple dataset types via the first argument:

node helpers/createStats.js <dataset-type> <stats-folder>
node helpers/compareReleases.js <dataset-type> <dataset-1> <dataset-2>
node helpers/createDeltaStatistics.js <dataset-type> <dataset-1> <dataset-2>
node helpers/recalculateStats.js <dataset-type> <dataset>

Citation

If you use the data in a published academic work we would appreciate if you cite the following article:

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., Morais, R., Saunders, L., Tyers, F. M. and Weber, G. (2020) "Common Voice: A Massively-Multilingual Speech Corpus". Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). pp. 4211--4215

@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}

Feedback

Please only use this repo to provide feedback on technical issues with the dataset, such as file corruptions, problems with the partitions, and so on. For more expansive discussions, please join us in Discourse or Matrix.