Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results

Source: arXiv
Abstract

Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
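The effect described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's method: the helper names `scripts_in` and `mixed_script_tokens`, the whitespace tokeniser, and the sample strings are all assumptions made for illustration. The sketch shows a homoglyph splitting one visible word into two corpus types, and an emoji with a skin-tone modifier occupying two code points, which a character-level tokeniser would separate.

```python
import unicodedata
from collections import Counter


def scripts_in(token: str) -> set[str]:
    """Approximate the scripts used in a token from Unicode character names.

    Hypothetical helper: it takes the first word of each letter's Unicode
    name (e.g. 'LATIN', 'CYRILLIC') as a rough script label.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    }


def mixed_script_tokens(tokens: list[str]) -> list[str]:
    """Return tokens whose letters come from more than one script."""
    return [t for t in tokens if len(scripts_in(t)) > 1]


latin = "data"
spoofed = "d\u0430ta"  # second letter is CYRILLIC SMALL LETTER A (homoglyph)

# Naive whitespace tokenisation (an assumption, not the paper's tokeniser).
tokens = f"{latin} {spoofed} {latin}".split()
counts = Counter(tokens)

# The two spellings look alike on screen but are counted as separate types.
print(len(counts))
print(mixed_script_tokens(tokens) == [spoofed])

# Compatibility normalisation (NFKC) does not fold the Cyrillic letter
# into its Latin look-alike, so normalisation alone leaves the split intact.
print(unicodedata.normalize("NFKC", spoofed) == latin)

# An emoji with a skin-tone modifier is two code points; splitting text
# character by character would break it into two tokens.
thumbs_up = "\U0001F44D\U0001F3FD"
print(len(thumbs_up))
```

Flagging mixed-script tokens is only one possible check; whether such tokens should be corrected, kept, or annotated depends on the corpus and the research question.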

Matteo Di Cristofaro

Linguistics

Matteo Di Cristofaro. Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results [EB/OL]. (2025-07-02) [2025-07-16]. https://arxiv.org/abs/2507.01764.
