
Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results


Abstract

Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative method (e.g. collocations) while ensuring the reliability of qualitative approaches. This paper examines how discrepancies in tokenisation affect the representation of language data and the validity of analytical findings: investigating the challenges posed by emojis and homoglyphs, the study highlights the necessity of preprocessing these elements to maintain corpus fidelity to the source data. The research presents methods for ensuring that digital texts are accurately represented in corpora, thereby supporting reliable linguistic analysis and guaranteeing the repeatability of linguistic interpretations. The findings emphasise the necessity of a detailed understanding of both linguistic and technical aspects involved in digital textual data to enhance the accuracy of corpus analysis, and have significant implications for both quantitative and qualitative approaches in corpus-based research.
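The effect described in the abstract can be illustrated with a small sketch. It is not the paper's method: the function names `scripts_in` and `mixed_script_tokens`, the sample tokens, and the heuristic of reading the script from the first word of each character's Unicode name are all assumptions made for illustration. It shows how a homoglyph splits one visible word form into two frequency entries, and how a single visible emoji can consist of more than one code point.

```python
import unicodedata
from collections import Counter


def scripts_in(token):
    """Approximate the scripts used in a token.

    Assumption: the first word of a letter's Unicode name
    (e.g. LATIN, CYRILLIC) is a good enough proxy for its script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    }


def mixed_script_tokens(tokens):
    """Return tokens whose letters come from more than one script."""
    return [t for t in tokens if len(scripts_in(t)) > 1]


# Hypothetical sample data: the two forms look identical on screen,
# but the second letter of `spoofed` is CYRILLIC SMALL LETTER A (U+0430).
latin = "paypal"
spoofed = "p\u0430ypal"
tokens = [latin, spoofed, latin]

# A plain frequency count treats the two forms as different types.
counts = Counter(tokens)

# Tokens flagged for inspection before any quantitative analysis.
flagged = mixed_script_tokens(tokens)

# A thumbs-up with a skin-tone modifier is one visible symbol
# but two code points, so naive per-character splitting breaks it.
thumbs_up = "\U0001F44D\U0001F3FD"
code_points = [f"U+{ord(ch):04X}" for ch in thumbs_up]
```

In this sketch, `counts` holds two separate entries for what a reader would see as a single word, which is the kind of discrepancy that can distort frequency-based measures such as collocations if it is not handled during preprocessing.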

Author: Matteo Di Cristofaro


Subject classification: Linguistics

Recommended citation: Matteo Di Cristofaro. Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results [EB/OL]. (2025-07-02) [2025-07-16]. https://arxiv.org/abs/2507.01764.
