The 'Spotify Subset' lists file names from the Spotify Dataset (Tanaka et al., 2022) for classifying language variations in Brazilian Portuguese. The file names were selected by filtering the original dataset's metadata for idiomatic expressions and for names or acronyms of locations.
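
The metadata filter described above could look roughly like the following. This is a minimal sketch, not the procedure actually used to build the subset: the field names `file_name` and `description`, the helper names, and both keyword lists are hypothetical placeholders, since the real filter terms and metadata columns are not listed here.

```python
import re

# Hypothetical keyword lists, for illustration only.
LOCATION_TERMS = ["Recife", "São Paulo", "SP", "Minas Gerais", "MG"]
IDIOM_TERMS = ["uai", "oxente", "bah"]


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any term as a whole word."""
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def select_file_names(rows, terms):
    """Return the file names of metadata rows whose description mentions a term.

    `rows` is an iterable of dicts with the assumed keys
    'file_name' and 'description'.
    """
    pattern = build_pattern(terms)
    return [row["file_name"] for row in rows if pattern.search(row["description"])]


# Toy metadata rows (made up) to show the behaviour.
rows = [
    {"file_name": "a.ogg", "description": "Podcast gravado em Recife"},
    {"file_name": "b.ogg", "description": "Um papo sobre esportes"},
    {"file_name": "c.ogg", "description": "Uai, que trem bom"},
]
selected = select_file_names(rows, LOCATION_TERMS + IDIOM_TERMS)
print(selected)
```

Whole-word matching matters for short acronyms: without the `\b` boundaries, a term such as "SP" would also match inside unrelated words like "esportes".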
Spotify A subset
General Table
| Speakers | Duration | Episodes | Female | Male |
|---|---|---|---|---|
| 92 | ~15hrs 24 min | 52 | 43 | 38 |
Subset A Information
| Accent | Speakers | Duration | Female | Male |
|---|---|---|---|---|
| Rio de Janeiro | 5 | 49 min | 2 | 3 |
| Bahia | 4 | 1hr 27 min | 4 | |
| Mato Grosso do Sul | 4 | 18 min | 3 | 1 |
| Maranhão | 7 | 1hr 18 min | 2 | 3 |
| Minas Gerais | ~35 | 5hrs 23 min | ~13 | ~22 |
| Recife | 10 | 3hrs 45 min | | |
| São Paulo | ~25 | 1hr 18 min | ~19 | ~7 |
| Rio Grande do Sul | 2 | ~53 min | | 2 |
Spotify B subset
General Table
| Accent | Train_speakers | Dev_speakers | Test_speakers | Podcasts | Episodes | Duration | Segments |
|---|---|---|---|---|---|---|---|
| RE | 69 | 23 | 11 | 15 | 57 | ~48.23 | 14,008 |
| SP | 52 | 18 | 15 | 11 | 78 | ~30.88 | 11,906 |