NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Source: arXiv
Abstract

We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.

Ika Alfina, Samsul Rahmadani, Keith Stevens, Ryandito Diandaru, Christian Wibisono, Jennifer Santoso, Alham Fikri Aji, Bryan Wilie, Dyah Damapuspita, David Moeljadi, Ichwanul Muslim Karo Karo, James Jaya, Wenliang Dai, Ilham Firdausi Putra, Herry Sujaini, Karissa Vincentio, Sebastian Ruder, Kaustubh D. Dhole, Tirana Noor Fatyanosa, Pascale Fung, Muhammad Farid Adilazuarda, Graham Neubig, Muhammad Satrio Wicaksono, Ayu Purwarianti, Arie Ardiyanti Suryani, Frederikus Hudi, Rifki Afina Putri, Ali Akbar Septiandri, Yan Xu, Ryan Ignatius, Dan Su, Samuel Cahyawijaya, Vito Ghifari, Fajri Koto, Made Nindyatama Nityasya, Timothy Baldwin, Cahya Wirawan, Holy Lovenia, Sakriani Sakti, Yulianti Oenang, Ziwei Ji, Ade Romadhony, Genta Indra Winata, Tiezheng Yu, Cuk Tho, Ivan Halim Parmonangan, Rahmad Mahendra

Subjects: Austronesian (Malayo-Polynesian) languages; commonly used foreign languages; linguistics

Ika Alfina, Samsul Rahmadani, Keith Stevens, Ryandito Diandaru, Christian Wibisono, Jennifer Santoso, Alham Fikri Aji, Bryan Wilie, Dyah Damapuspita, David Moeljadi, Ichwanul Muslim Karo Karo, James Jaya, Wenliang Dai, Ilham Firdausi Putra, Herry Sujaini, Karissa Vincentio, Sebastian Ruder, Kaustubh D. Dhole, Tirana Noor Fatyanosa, Pascale Fung, Muhammad Farid Adilazuarda, Graham Neubig, Muhammad Satrio Wicaksono, Ayu Purwarianti, Arie Ardiyanti Suryani, Frederikus Hudi, Rifki Afina Putri, Ali Akbar Septiandri, Yan Xu, Ryan Ignatius, Dan Su, Samuel Cahyawijaya, Vito Ghifari, Fajri Koto, Made Nindyatama Nityasya, Timothy Baldwin, Cahya Wirawan, Holy Lovenia, Sakriani Sakti, Yulianti Oenang, Ziwei Ji, Ade Romadhony, Genta Indra Winata, Tiezheng Yu, Cuk Tho, Ivan Halim Parmonangan, Rahmad Mahendra. NusaCrowd: Open Source Initiative for Indonesian NLP Resources [EB/OL]. (2022-12-19) [2025-06-22]. https://arxiv.org/abs/2212.09648.