QuranMorph: Morphologically Annotated Quranic Corpus

Source: arXiv
Abstract

We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part of speech by three expert linguists. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at https://sina.birzeit.edu/quran.

Diyam Akra, Tymaa Hammouda, Mustafa Jarrar

Subjects: Linguistics; Hamito-Semitic languages (Afro-Asiatic language family)

Diyam Akra, Tymaa Hammouda, Mustafa Jarrar. QuranMorph: Morphologically Annotated Quranic Corpus [EB/OL]. (2025-06-22) [2025-07-02]. https://arxiv.org/abs/2506.18148.
