BI-LSTM-CRF Chinese word segmentation technology in the field of coal science
(双向LSTM-CRF在煤炭学领域的中文分词技术)

Chen He (陈赫), Qian Xu (钱旭)

Abstract

In English, spaces act as natural delimiters between words, whereas Chinese text has no such explicit delimiter. As a result, deep learning models and methods that work well for English natural language processing cannot be applied directly. Recurrent neural networks (RNNs) handle sequence labeling problems well and have been widely applied to natural language processing (NLP) tasks. This paper proposes a bidirectional long short-term memory conditional random field (BI-LSTM-CRF) model, an improvement on the long short-term memory (LSTM) neural network. The model retains the LSTM's ability to exploit contextual information and, through the CRF layer, also accounts for dependencies between adjacent output labels. Using this model, word segmentation experiments were carried out on the Bakeoff2005 dataset with pre-trained character embedding vectors and with different word-position tag sets. The results show that the BI-LSTM-CRF model achieves better segmentation performance than the LSTM and bidirectional LSTM models and generalizes well; that neural network models using the six-tag set achieve better segmentation performance than those using the four-tag set; and that the BI-LSTM-CRF model and training method can effectively address word segmentation and part-of-speech tagging in Chinese natural language processing, with good results in the coal science domain.
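To make the idea of word-position tag sets concrete, the sketch below converts a segmented sentence into per-character tags and back. It is an illustration only, not the authors' code. The abstract does not list the tags, so the four-tag set is assumed to be B/M/E/S and the six-tag set is assumed to be B/B2/B3/M/E/S, a convention common in the segmentation literature. The function names and the example sentence are hypothetical.

```python
from typing import List


def tag_word(word: str, scheme: str = "4") -> List[str]:
    """Return one word-position tag per character of a single word.

    Assumed tag sets (not specified in the abstract):
      "4": B (begin), M (middle), E (end), S (single-character word)
      "6": B, B2, B3 (first, second, third character), M, E, S
    """
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["S"]
    if scheme == "4":
        return ["B"] + ["M"] * (n - 2) + ["E"]
    if scheme == "6":
        # Up to three leading tags, never covering the final character.
        head = ["B", "B2", "B3"][: n - 1]
        return head + ["M"] * (n - 1 - len(head)) + ["E"]
    raise ValueError(f"unknown scheme: {scheme!r}")


def tag_sentence(words: List[str], scheme: str = "4") -> List[str]:
    """Flatten the per-word tags into one tag sequence for the sentence."""
    return [tag for word in words for tag in tag_word(word, scheme)]


def decode(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation: a word ends at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Hypothetical coal-domain example sentence, already segmented.
words = ["采煤机", "在", "综采工作面", "运行"]
print(tag_sentence(words, "4"))
print(tag_sentence(words, "6"))
print(decode("".join(words), tag_sentence(words, "6")))
```

Under either tag set, segmentation becomes a per-character labeling task. A sequence model such as a BI-LSTM-CRF predicts one tag per character, and `decode` turns the predicted tags back into words.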

Keywords

Chinese word segmentation / BI-LSTM-CRF / word-position tagging / coal science

Cite this article

Chen He, Qian Xu. BI-LSTM-CRF Chinese word segmentation technology in the field of coal science [EB/OL]. (2019-01-03) [2025-12-13]. http://www.paper.edu.cn/releasepaper/content/201901-16.

Subject classification

Chinese language / Mining engineering theory and methodology

First published: 2019-01-03