National Preprint Platform

Chinese Text Error Correction Algorithm Based on Transformer

Abstract

Chinese Spelling Correction (CSC) is the task of detecting and correcting spelling errors in Chinese text. Most current Chinese text error correction work adopts BERT-based end-to-end language models. Among these, Soft-Masked BERT is one of the mainstream methods: a detection network first regresses the error probability of the character at each position, and the feature vectors are then encoded via a soft-masking mechanism and fed into a BERT-based correction network. However, Soft-Masked BERT uses a GRU as its detection network, which cannot capture long-range dependencies. To address this problem, this paper builds on Soft-Masked BERT and proposes a Transformer-based Chinese text error correction algorithm. It consists of an error detection network based on stacked Transformer encoders and a BERT-based correction network, with the detection network connected to the correction network through the soft-masking mechanism. In addition, to reduce the influence of erroneous characters on the result, we remove the skip connection around the detection network. Experiments on the SIGHAN13, SIGHAN14, and SIGHAN15 benchmark datasets show that the proposed method outperforms Soft-Masked BERT; on SIGHAN14, detection accuracy improves by 2.8% and correction accuracy by 2.9%.
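The soft-masking mechanism shared by Soft-Masked BERT and the method above can be sketched as follows. This is a minimal, framework-free illustration: the function name `soft_mask` and the toy embeddings and probabilities are our own assumptions for demonstration, not values from the paper, which operates on BERT token embeddings.

```python
# Minimal sketch of the soft-masking step between the detection network
# and the BERT-based correction network (toy values, pure Python).

def soft_mask(embeddings, mask_embedding, error_probs):
    """For each position i, interpolate the token embedding e_i with the
    [MASK] embedding e_mask using the detector's error probability p_i:

        e_i' = p_i * e_mask + (1 - p_i) * e_i

    A position the detector is sure is wrong (p_i = 1) is fully masked,
    so the correction network must predict its character from context.
    """
    return [
        [p * m + (1.0 - p) * e for e, m in zip(row, mask_embedding)]
        for row, p in zip(embeddings, error_probs)
    ]

# Toy example: 3 tokens with 4-dimensional embeddings.
emb = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0]]
mask = [0.0, 0.0, 0.0, 0.0]   # hypothetical [MASK] embedding
probs = [0.0, 1.0, 0.5]       # hypothetical detection-network output

out = soft_mask(emb, mask, probs)
# Position 0 keeps its embedding, position 1 becomes the [MASK]
# embedding, position 2 is the midpoint between the two.
```

Note that the soft mask is differentiable in `error_probs`, which is what lets the detection and correction networks be trained end to end.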

申悦

Chinese

Artificial Intelligence; Chinese Spelling Correction; Transformer; BERT

申悦. Chinese Text Error Correction Algorithm Based on Transformer [EB/OL]. (2022-03-24) [2025-06-15]. http://www.paper.edu.cn/releasepaper/content/202203-363.
