A recent Google study has revealed significant quality issues in three major multilingual speech datasets: Mozilla Common Voice 17.0, FLEURS, and VoxPopuli. These datasets are crucial for developing multilingual automatic speech recognition (ASR) models, yet the researchers found flaws that could mislead results, particularly for low-resource languages. Problems identified include short and repetitive recordings, poor audio quality, and a lack of speaker diversity, which can introduce biases related to gender, age, and regional accents.

The study highlights that while some issues can be addressed programmatically, others require a deeper understanding of sociolinguistic contexts. The researchers advocate for a comprehensive approach to dataset creation, emphasizing the need for clear design goals and guidelines for contributors. They suggest that future efforts should include both automated and manual quality checks, as well as detailed metadata to help users interpret the data accurately.

For more insights into the study’s findings and recommendations, readers are encouraged to explore the full article.

→ Read full article via slator.com