
BanglaByT5: Byte-Level Modelling for Bangla


Source: arXiv
Abstract

Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLMs use traditional tokenizers such as BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google's ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight the efficacy of byte-level modelling for morphologically rich languages and demonstrate BanglaByT5's potential as a lightweight yet powerful tool for Bangla NLP, in both resource-constrained and scalable environments.
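To make the abstract's point about tokenization concrete, here is a minimal sketch (not the authors' code) contrasting byte-level tokenization with SentencePiece subwords on a Bangla word, using the Hugging Face transformers library. google/byt5-small is the public checkpoint the paper builds on; google/mt5-small stands in for the multilingual subword baselines it mentions.

# Illustrative sketch only; assumes `pip install transformers sentencepiece`.
from transformers import AutoTokenizer

text = "বাংলা"  # "Bangla" written in Bengali script

# ByT5 operates directly on UTF-8 bytes: no learned vocabulary, so every
# Bangla character is representable and nothing maps to <unk>.
byte_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print(byte_tok(text).input_ids)  # one id per UTF-8 byte (15 bytes here) plus EOS

# mT5's SentencePiece tokenizer segments the same string into learned
# subwords, whose boundaries need not align with Bangla morphology.
sp_tok = AutoTokenizer.from_pretrained("google/mt5-small")
print(sp_tok.tokenize(text))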

Pramit Bhattacharyya, Arnab Bhattacharya

Computational techniques for Indo-European languages; computer technology

Pramit Bhattacharyya, Arnab Bhattacharya. BanglaByT5: Byte-Level Modelling for Bangla [EB/OL]. (2025-05-21) [2025-07-09]. https://arxiv.org/abs/2505.17102.
