
A Comparison of Lexicon Integration Methods for Chinese Word Segmentation Models (中文分词模型词典融入方法比较)

Abstract

Chinese word segmentation is a fundamental task in Chinese natural language processing. The mainstream methods currently exploit statistical machine learning models. These methods usually require manually annotated, sentence-level segmented corpora for training, yet they neglect the large-scale, manually built lexicon resources accumulated over many years. Such resources are especially valuable in cross-domain settings, where gold-standard sentence-level annotations for the target domain are scarce, so how to exploit lexicon information fully and effectively in statistical models is a question worth studying. Recently, the integration of lexicon information into word segmentation models has gained increasing interest. Broadly, the integration methods fall into two categories: one incorporates lexicon features into character-based models that cast word segmentation as a sequence labeling problem, while the other incorporates them into word-based models that decode with beam search. In this paper, we compare these two approaches and further combine them. Experimental results on benchmark data sets show that the combination exploits lexicon information more fully, and the combined model achieves better performance in both in-domain and cross-domain settings.
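To make the first category concrete, the following is a minimal sketch, in Python, of how a hand-built lexicon can be turned into per-character matching features for a character-based (BMES-style) sequence labeling segmenter such as a CRF. The feature template and all names here are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch (not the paper's exact feature set) of turning a
# hand-built lexicon into per-character features for a character-based
# BMES-style CRF segmenter. All names and templates are illustrative.

def lexicon_features(sentence, lexicon, max_word_len=6):
    """For each character, record the length of the longest lexicon word
    that begins at it and of the longest one that ends at it."""
    n = len(sentence)
    begin_len = [0] * n  # longest lexicon match starting at each position
    end_len = [0] * n    # longest lexicon match ending at each position
    for i in range(n):
        for l in range(1, min(max_word_len, n - i) + 1):
            if sentence[i:i + l] in lexicon:
                begin_len[i] = l  # l only grows, so this keeps the longest
                end_len[i + l - 1] = max(end_len[i + l - 1], l)
    # One discrete feature per character; a real CRF template would also
    # conjoin these with character n-gram features.
    return [f"B{b}_E{e}" for b, e in zip(begin_len, end_len)]

if __name__ == "__main__":
    lexicon = {"中文", "分词", "中文分词", "模型"}
    print(lexicon_features("中文分词模型", lexicon))
    # ['B4_E0', 'B0_E2', 'B2_E0', 'B0_E4', 'B2_E0', 'B0_E2']
```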
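For the second category, here is an equally rough sketch of a word-based decoder: beam search over partial segmentations, with a scorer that rewards candidate words found in the lexicon. The toy linear scorer and its hand-set weights are invented for illustration; the models discussed in the paper learn such weights from data.

```python
# A rough sketch of the word-based alternative: beam search over partial
# segmentations, where the scorer rewards candidate words found in the
# lexicon. The hand-set weights stand in for a trained model.

def beam_search_segment(sentence, lexicon, beam_size=4, max_word_len=6):
    n = len(sentence)
    # Hypotheses indexed by number of characters consumed;
    # each hypothesis is a (score, words_so_far) pair.
    beams = {0: [(0.0, [])]}
    for i in range(n):
        for score, words in beams.get(i, []):
            for l in range(1, min(max_word_len, n - i) + 1):
                w = sentence[i:i + l]
                # Toy scoring: bonus for in-lexicon words, small penalty
                # for out-of-lexicon multi-character words.
                s = score + (2.0 if w in lexicon else (0.0 if l == 1 else -1.0))
                beams.setdefault(i + l, []).append((s, words + [w]))
        if i + 1 in beams:  # all ways to reach i+1 are known; prune now
            beams[i + 1] = sorted(beams[i + 1], reverse=True)[:beam_size]
    return max(beams[n])[1]

if __name__ == "__main__":
    lexicon = {"中文", "分词", "中文分词", "模型"}
    print(beam_search_segment("中文分词模型", lexicon))
    # ['中文', '分词', '模型']
```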

Author: 冯雪 (Feng Xue)

DOI: 10.12074/201805.00241V1

Language: Chinese

Keywords: Chinese word segmentation; conditional random fields; beam search; domain adaptation

Citation: 冯雪. 中文分词模型词典融入方法比较 [EB/OL]. (2018-05-20) [2025-08-02]. https://chinaxiv.org/abs/201805.00241.
