Microblog Topic Detection Based on the LDA Model
EVENT DETECTION FROM MICROBLOGS BASED ON LDA MODEL
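With the continuing growth in microblog users, Twitter abroad and Sina Weibo in China have become important platforms on which media organizations and individuals publish information. Microblog posts are a special kind of text: they are usually shorter than 140 characters, carry rich social information, and contain not only topic-bearing text but also redundant text that says nothing about the topic, so traditional text mining algorithms do not extract microblog topics well. This paper combines Chinese part-of-speech (POS) tagging with the LDA (Latent Dirichlet Allocation) topic model to extract microblog topics. Chinese POS tagging filters out the words in a post that have no power to characterize a topic, while the LDA topic model represents the text in a low-dimensional topic space, which helps uncover latent relationships in the text and mine microblog topics at the semantic level. Experiments show that, compared with traditional text analysis methods, combining Chinese POS tagging with the LDA model improves the accuracy of topic discovery. Finally, the paper proposes a way to compute topic heat and ranks the topics by it.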
As the number of microblog users grows, Twitter and Weibo have become important information platforms for media and individuals. Microblog posts are usually short (fewer than 140 characters) and contain a wealth of social information, so traditional text mining algorithms are not good at extracting microblog topics. In this paper, we combine Chinese POS (part-of-speech) tagging and the LDA (Latent Dirichlet Allocation) topic model to extract topics from microblogs. POS tagging filters out uninformative words from microblogs, and the LDA model represents text data in a low-dimensional topic space. Our experiments show that combining POS tagging with the LDA model improves the accuracy of topic extraction from microblogs. We also propose a new method for calculating the heat of the extracted topics and rank the topics by heat.
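The two-stage pipeline described in the abstract (POS filtering, then LDA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the posts are hypothetical and already tagged, the tag set and the choice to keep only nouns and verbs are assumptions, and the LDA step is a small collapsed Gibbs sampler with hyperparameters (`alpha`, `beta`, `n_iter`) picked for the example because the abstract does not state them.

```python
import random
from collections import Counter

# Hypothetical pre-tagged posts as (word, tag) pairs. The tag set is an
# assumption for this sketch: "n" noun, "v" verb, "u" particle, "e" interjection.
tagged_posts = [
    [("wow", "e"), ("earthquake", "n"), ("rescue", "v"), ("of", "u"), ("team", "n")],
    [("earthquake", "n"), ("donate", "v"), ("oh", "e"), ("rescue", "v")],
    [("film", "n"), ("premiere", "n"), ("haha", "e"), ("actor", "n")],
    [("actor", "n"), ("film", "n"), ("watch", "v"), ("of", "u")],
]
KEEP_TAGS = {"n", "v"}  # assumption: only nouns and verbs characterize a topic


def pos_filter(tagged_doc, keep=KEEP_TAGS):
    """Keep only the words whose POS tag is in `keep`."""
    return [word for word, tag in tagged_doc if tag in keep]


def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


docs = [pos_filter(post) for post in tagged_posts]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, _ in counts.most_common(3)])
```

In practice the tagging would come from a Chinese POS tagger and the LDA step from an established library; the hand-written sampler is used here only to keep the sketch self-contained.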
Wang Jinxiang (汪进祥), Liu Nian (刘念)
Computing technology, computer technology
topic model; topic detection; part-of-speech tagging; short text
LDA; event detection; part-of-speech tagging; short text
Wang Jinxiang, Liu Nian. Event Detection from Microblogs Based on LDA Model [EB/OL]. (2014-12-01) [2025-08-04]. http://www.paper.edu.cn/releasepaper/content/201412-24.