Qomhrá: A Bilingual Irish and English Large Language Model

Khanh-Tung Tran Liam Lonergan Ailbhe Ní Chasaide Neasa Ní Chiaráin Barry Devereux Joseph McInerney

Abstract

Large language model (LLM) research and development has overwhelmingly focused on the world's major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces Qomhrá, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate "accepted" and "rejected" responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2 Irish speakers, diverging from LLM-as-a-judge ratings, indicating a misalignment between current LLMs and the Irish-language community. Subsequently, we leverage Gemini-2.5-Pro to translate a large-scale English-language instruction tuning dataset to Irish and to synthesise a first-of-its-kind Irish-language human preference dataset. We comprehensively evaluate Qomhrá across several benchmarks, testing translation, gender understanding, topic identification, and world knowledge; these evaluations show gains of up to 29% in Irish and 44% in English compared to the existing open-source Irish LLM baseline, UCCIX. The results of our framework provide insight and guidance for developing LLMs for both Irish and other low-resource languages.

Cite this article

Khanh-Tung Tran, Liam Lonergan, Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Barry Devereux, Joseph McInerney. Qomhrá: A Bilingual Irish and English Large Language Model [EB/OL]. (2026-01-08) [2026-04-04]. https://arxiv.org/abs/2510.17652.

Subject classification

Common foreign languages / Indo-European languages

First published: 2026-01-08