National Preprint Platform (国家预印本平台)

Research on a PMI-Based Word Segmentation Method for Variant SMS Messages

(Original title: 基于PMI的变体短信分词方法研究)

Abstract

Spam SMS messages have spread alongside the growth of mobile communication networks, troubling mobile phone users and challenging network operators. In spam governance, sound word segmentation of message text is a prerequisite for recognition, classification, and interception. Common segmentation tools lose accuracy on illicit messages because they adapt poorly to irregular grammar, frequent character and word variants, and interspersed special symbols. This paper combines an improved pointwise mutual information (PMI) measure with a proposed cross-skip-bi-grams model to address the segmentation of variant SMS messages. It also proposes optimal segmentation, segment merging, incremental training, and feedback training to improve the practicality and robustness of the method. Experimental results show that the method improves segmentation accuracy on illicit SMS messages.
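To illustrate the underlying idea, the following is a minimal Python sketch of segmentation with standard PMI, PMI(x, y) = log(p(x, y) / (p(x) p(y))), computed over adjacent characters. It is not the improved PMI or the cross-skip-bi-grams model of the paper, neither of which is defined in this abstract. The function names, the toy corpus, and the cut threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(zip(text, text[1:]))
    return unigrams, bigrams


def pmi(x, y, unigrams, bigrams):
    """Standard PMI: log(p(x, y) / (p(x) * p(y))); -inf for unseen pairs."""
    pair_count = bigrams[(x, y)]
    if pair_count == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = pair_count / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log(p_xy / (p_x * p_y))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Cut between adjacent characters whose PMI falls below the threshold.

    The threshold is a hypothetical tuning parameter, not a value from the paper.
    """
    if not text:
        return []
    words, current = [], text[0]
    for prev, nxt in zip(text, text[1:]):
        if pmi(prev, nxt, unigrams, bigrams) >= threshold:
            current += nxt  # strongly associated: keep in the same word
        else:
            words.append(current)  # weakly associated: start a new word
            current = nxt
    words.append(current)
    return words


# Toy corpus (illustrative only): "ab" and "cd" co-occur often, "bc" rarely.
unigrams, bigrams = train_counts(["abab", "cdcd", "abcd"])
result = segment("abcd", unigrams, bigrams, threshold=1.0)
```

In this toy corpus the pair "ab" has a higher PMI than "bc", so with a threshold of 1.0 the string "abcd" is cut between "b" and "c". The same functions work on Chinese text, since Python strings iterate over Unicode characters.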

Authors: Wei Xin (魏昕), Zhang Yong (张勇), Gao Peng (高鹏)

Subject areas: communications; computing technology and computer technology

Keywords: natural language processing; Chinese word segmentation; PMI; word variants

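Cite as: Wei Xin, Zhang Yong, Gao Peng. 基于PMI的变体短信分词方法研究 [Research on a PMI-based word segmentation method for variant SMS messages] [EB/OL]. (2016-08-22) [2025-08-02]. http://www.paper.edu.cn/releasepaper/content/201608-121.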
