A New Approach to Automatic Classification of Query Intent
[Objective] This paper automatically identifies query intent from Sogou query log data. [Methods] (1) We map the ODP topic category system onto the intent taxonomy of Rose et al. and combine heuristics with ontology matching to form annotation rules, which label queries automatically. (2) We use the LTP toolkit to extract natural-language-level features of each query (word segmentation, part-of-speech, and syntactic dependency relations between words), together with statistical and user-behavior features. (3) Using the labeled query log data set and the extracted features, we train a GBDT model to identify query intent automatically. [Results] (1) The label proportions in the data set produced by the proposed automatic annotation rules are close to those in a manually labeled data set. (2) A classifier trained on the proposed feature set identifies intent better than one trained without the syntactic dependency features. (3) The GBDT classification model achieves an accuracy, precision, recall, and F1 of 0.75, 0.76, 0.93, and 0.84, respectively. [Limitations] Only Sogou query log data is used, so the approach needs further validation on other data sets; natural language processing tools cannot fully extract the syntactic and semantic features of queries. [Conclusions] The proposed annotation rules can quickly produce a large labeled training set; fully extracting natural-language-level features improves query intent identification; and GBDT, as an ensemble machine learning model, outperforms linear classification models (such as logistic regression and support vector machines) at intent identification.
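As a quick consistency check on the reported figures, F1 is the harmonic mean of precision and recall, so the stated precision (0.76) and recall (0.93) should reproduce the stated F1 (0.84). The sketch below is illustrative only: the function names `f1_score` and `binary_metrics` are assumptions introduced here, not code from the paper, and the confusion-matrix counts in the usage note are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Precision and recall as reported in the abstract.
print(round(f1_score(0.76, 0.93), 2))  # 0.84
```

Given raw counts instead of rates, `binary_metrics(tp=3, fp=1, fn=1, tn=5)` (hypothetical counts) returns all four measures at once, which makes it easy to see how a high recall can coexist with a lower accuracy when one class dominates.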
He Guoxiu (贺国秀), Zhang Xiaojuan (张晓娟)
School of Information Management, Wuhan University; College of Computer and Information Science, Southwest University
Scientific Communication and Knowledge Dissemination
GBDT; Machine Learning; Query Log; Query Intent; Natural Language Processing
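National Natural Science Foundation of China: Research on Modeling and Framework Implementation of General Entity Retrieval Based on Language Models (71173164); National Social Science Fund of China: Research on a Query Recommendation Model Integrating Users' Personalized and Real-Time Intent (15CTQ019)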
He Guoxiu, Zhang Xiaojuan. A New Approach to Automatic Classification of Query Intent [EB/OL]. (2022-06-29) [2024-12-22]. https://sinoxiv.napstic.cn/article/3444792.