Screenshot of Mozilla Common Voice v20 splits by gender as a horizontal bar chart
https://mediacdn.aus.social/media_attachments/files/113/637/520/565/369/301/original/228394c542876be0.png
The Mozilla #CommonVoice #dataset v20 was released yesterday - the largest open #speech dataset in the world. My #dataviz, linked below, shows a continuation of patterns seen for some years now:
➡️ There's more data collected for #Catalan (ca) than for #English (en) - testament to the independence and language reclamation efforts in Catalunya. Language and cultural transmission are deeply intertwined.
➡️ Some of the newer #languages to Common Voice, like #Ligurian / #Genoese (lij) have contributions from mostly older speakers, which is unusual in comparison to the rest of the dataset. This may reflect the population that currently speak those languages - as many regional languages in Italy are in rapid decline.
➡️ Some languages such as Eastern Mari / Meadow Mari (mhr) - a #Uralic language spoken in the Mari-El Republic within Russia - have samples from predominantly female-identifying speakers, again contrasting to the rest of the dataset. Other languages here include #Cantonese (yue), #Georgian (ka), and #Kalenjin (kln).
➡️ A key part in the preparation of the Common Voice dataset is the validation of utterances to assure they match their written transcription - which requires at least two validations by separate speakers. Some newer languages to Common Voice, such as Erzya (myv) and Moksha (mdf), both Uralic languages, have nearly 100% validation.
What are your interpretations of the dataset?
https://observablehq.com/@kathyreid/mozilla-common-voice-v20-dataset-metadata-coverage
GNU social JP is a social network, courtesy of GNU social JP管理人. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.
All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.