Research on Chinese Lexical Features Based on Conditional Random Fields

Written Chinese has no natural word delimiters: nothing marks where one word ends and the next begins. Recognizing words in Chinese text therefore depends on the habits and rules of Chinese writing, referred to here as lexical features. Previous studies usually annotate lexical features explicitly in the experimental corpus to improve recognition, which adds manual processing steps and greatly increases the work of porting an algorithm. This paper studies and summarizes the lexical features of commonly used Chinese and, drawing on the feature extraction capability of conditional random fields (CRF), implements complex feature functions that extract lexical features implicitly from a corpus carrying only simple annotations, improving recognition. Experiments show that applying complex lexical features to Chinese word segmentation effectively improves recognition performance, and the approach offers a new way to improve the portability of recognition algorithms in applications.
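To make the idea of extracting lexical features implicitly from a simply annotated corpus more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a character-level BMES tagging scheme (Begin, Middle, End, Single) as the "simple annotation" and a small context-window feature template; the abstract specifies neither. The names `bmes_tags` and `char_features` and the feature keys are hypothetical.

```python
from typing import Dict, List, Tuple


def bmes_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Convert a segmented sentence into (character, tag) pairs.

    Assumption: BMES tags stand in for the paper's "simple annotation".
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))          # single-character word
        else:
            pairs.append((word[0], "B"))       # first character of a word
            pairs.extend((ch, "M") for ch in word[1:-1])  # interior characters
            pairs.append((word[-1], "E"))      # last character of a word
    return pairs


def char_features(chars: List[str], i: int) -> Dict[str, str]:
    """Build context-window features for the character at position i.

    Hypothetical template: unigrams and bigrams in a window of +/-1.
    A CRF learns weights over such features, so regularities of word
    formation are captured without being annotated by hand.
    """
    def at(offset: int) -> str:
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else "<PAD>"

    return {
        "c0": at(0),                 # current character
        "c-1": at(-1),               # previous character
        "c+1": at(1),                # next character
        "c-1c0": at(-1) + at(0),     # bigram ending at the current character
        "c0c+1": at(0) + at(1),      # bigram starting at the current character
        "c-1c+1": at(-1) + at(1),    # characters surrounding the current one
    }


# Example: one pre-segmented training sentence.
words = ["条件", "随机场", "的", "特征"]
pairs = bmes_tags(words)
chars = [c for c, _ in pairs]
tags = [t for _, t in pairs]
features = [char_features(chars, i) for i in range(len(chars))]
```

Each element of `features` pairs with the tag at the same index in `tags`, which is the per-position input shape that CRF toolkits generally expect. Richer feature functions of the kind the abstract mentions would extend the dictionary returned by `char_features`.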

史晟辉、黄定琦

10.12074/201905.00046V1

Chinese linguistics

Conditional random fields; Chinese lexical features; information extraction

史晟辉, 黄定琦. Research on Chinese Lexical Features Based on Conditional Random Fields [EB/OL]. (2019-05-10) [2025-08-03]. https://chinaxiv.org/abs/201905.00046.