Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Source: arXiv
Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Isabel Papadimitriou, Ayanda Mnyakeni, Pallavi Baljekar, Andre Niyongabo Rubungo, Isaac Caswell, Duygu Ataman, Benoît Sagot, Allahsera Tapo, Clara Rivera, Orhan Firat, Mathias Jenny, Mofetoluwa Adeyemi, Monang Setyawan, Supheakmungkol Sarin, Nasanbayar Ulzii-Orshikh, Ankur Bapna, Oghenefego Ahia, Orevaoghene Ahia, Daan van Esch, Ahmed Baruwa, Sakine Çabuk Ballı, Mathias Müller, Stella Biderman, Claytone Sikasote, Lisa Wang, Shamsuddeen Hassan Muhammad, Artem Sokolov, Annette Rios, Jamshidbek Mirzakhalov, Sweta Agrawal, Sakhile Dlamini, Israel Abebe Azime, Colin Leong, Ahsan Wahab, Alessia Battisti, Yacine Jernite, Toan Q. Nguyen, Nze Lawson, Bonaventure F. P. Dossou, Julia Kreutzer, Ayodele Awokoya, Tapiwanashe Matangira, Kelechi Ogueji, Nishant Subramani, Nanda Muhammad, André Müller, Nisansa de Silva, Pedro Ortiz Suarez, Sokhar Samb, Sneha Kudugunta, Iroro Orife, Salomey Osei

DOI: 10.1162/tacl_a_00447

Subjects: Linguistics; Indo-European languages; Commonly used foreign languages

Isabel Papadimitriou, Ayanda Mnyakeni, Pallavi Baljekar, Andre Niyongabo Rubungo, Isaac Caswell, Duygu Ataman, Benoît Sagot, Allahsera Tapo, Clara Rivera, Orhan Firat, Mathias Jenny, Mofetoluwa Adeyemi, Monang Setyawan, Supheakmungkol Sarin, Nasanbayar Ulzii-Orshikh, Ankur Bapna, Oghenefego Ahia, Orevaoghene Ahia, Daan van Esch, Ahmed Baruwa, Sakine Çabuk Ballı, Mathias Müller, Stella Biderman, Claytone Sikasote, Lisa Wang, Shamsuddeen Hassan Muhammad, Artem Sokolov, Annette Rios, Jamshidbek Mirzakhalov, Sweta Agrawal, Sakhile Dlamini, Israel Abebe Azime, Colin Leong, Ahsan Wahab, Alessia Battisti, Yacine Jernite, Toan Q. Nguyen, Nze Lawson, Bonaventure F. P. Dossou, Julia Kreutzer, Ayodele Awokoya, Tapiwanashe Matangira, Kelechi Ogueji, Nishant Subramani, Nanda Muhammad, André Müller, Nisansa de Silva, Pedro Ortiz Suarez, Sokhar Samb, Sneha Kudugunta, Iroro Orife, Salomey Osei. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets[EB/OL]. (2021-03-22)[2025-07-16]. https://arxiv.org/abs/2103.12028.
