Spotify Subset

January 10, 2024

The 'Spotify Subset' lists file names from the Spotify Dataset (Tanaka et al., 2022) selected for classifying language variation in Brazilian Portuguese. The file names were selected by filtering the original dataset's metadata for idiomatic expressions and for names or acronyms of locations.
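The metadata filter described above might look roughly like the sketch below. It is an illustration only, not the procedure actually used to build the subset: the term list, the field names (`show_name`, `episode_description`, `episode_filename`), and the sample rows are all assumptions made for the example.

```python
import re

# Hypothetical term list: place names and acronyms (illustrative only).
LOCATION_TERMS = ["Recife", "São Paulo", "SP", "Minas Gerais", "MG"]


def build_pattern(terms):
    """Compile one case-sensitive regex matching any term as a whole word."""
    # Longer terms first so multi-word names win over shorter alternatives.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    # Lookarounds keep "SP" from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


def select_file_names(rows, pattern,
                      text_fields=("show_name", "episode_description")):
    """Return the file name of every row whose text fields match the pattern."""
    selected = []
    for row in rows:
        text = " ".join(row.get(field, "") for field in text_fields)
        if pattern.search(text):
            selected.append(row["episode_filename"])
    return selected


# Made-up metadata rows; the field names are assumptions, not the real schema.
rows = [
    {"episode_filename": "ep_001", "show_name": "Papo de Recife",
     "episode_description": "Conversa sobre a cidade."},
    {"episode_filename": "ep_002", "show_name": "Tech Talk",
     "episode_description": "Novidades do Spotify."},
    {"episode_filename": "ep_003", "show_name": "Notícias",
     "episode_description": "O trânsito em SP hoje."},
]

pattern = build_pattern(LOCATION_TERMS)
selected = select_file_names(rows, pattern)
print(selected)
```

Matching whole words and keeping the search case-sensitive is a design choice for short acronyms such as "SP", which would otherwise match inside unrelated words. A filter for idiomatic expressions could reuse the same functions with a different term list.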

Spotify A subset

General Table

Speakers	Duration	Episodes	Female	Male
92	~15hrs 24 min	52	43	38

Subset A Information

Accent	Speakers	Duration	Female	Male
Rio de Janeiro	5	49 min	2	3
Bahia	4	1hr 27 min	4
Mato Grosso do Sul	4	18 min	3	1
Maranhão	7	1hr 18 min	2	3
Minas Gerais	~35	5hrs 23 min	~13	~22
Recife	10	3hrs 45 min
São Paulo	~25	1hr 18 min	~19	~7
Rio Grande do Sul	2	~53 min		2

Spotify B subset

General Table

Accent	Train_speakers	Dev_speakers	Test_speakers	Podcasts	Episodes	Duration	Segments
RE	69	23	11	15	57	~48.23	14,008
SP	52	18	15	11	78	~30.88	11,906
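As a quick reading aid, the speaker counts in the table above can be turned into per-accent totals and split shares. This is a small sketch that assumes no speaker appears in more than one split, which the table itself does not state; the `splits` dictionary and the `split_shares` helper are names introduced here for illustration.

```python
# Speaker counts copied from the Subset B table above.
splits = {
    "RE": {"train": 69, "dev": 23, "test": 11},
    "SP": {"train": 52, "dev": 18, "test": 15},
}


def split_shares(counts):
    """Return (total speakers, share of each split as a fraction of the total).

    Assumes the splits are disjoint, i.e. each speaker is counted once.
    """
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


for accent, counts in splits.items():
    total, shares = split_shares(counts)
    summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
    print(f"{accent}: {total} speakers ({summary})")
```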