National Preprint Platform

Research on a Theme-Based Real-Time Webpage Classification Model

Abstract

This paper first examines the general classification model and analyzes its shortcomings for real-time webpage classification. On this basis, it proposes a theme-based webpage classification model that is better suited to real-time classification. First, a topic crawler for a vertical search engine is built with Nutch; it crawls web pages continuously, which keeps the collected pages current. Second, theme denoising is applied to the Nutch crawl results to remove pages that are irrelevant to the classification. Finally, the remaining crawled pages are classified. Experiments show that the model substantially improves both the speed and the accuracy of webpage classification. For the large data volumes that real-time webpage classification involves, the model effectively optimizes the input samples and saves computation time.
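
The three stages described in the abstract (crawl, theme denoising, classification) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how theme denoising or classification is implemented, so a simple keyword-overlap score stands in for both, and an in-memory list of strings stands in for the Nutch crawl output. The names `THEME_TERMS`, `CATEGORY_TERMS`, `theme_score`, `denoise`, `classify`, and `run_pipeline`, as well as the threshold value, are hypothetical.

```python
import re
from collections import Counter

# Hypothetical theme vocabulary; the source does not list its theme terms.
THEME_TERMS = {"football", "match", "league", "goal", "player"}

# Hypothetical categories within the theme, each with its own term set.
CATEGORY_TERMS = {
    "results": {"goal", "score", "won", "match"},
    "transfers": {"transfer", "signed", "contract", "fee"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def theme_score(text, theme_terms=THEME_TERMS):
    """Return the fraction of tokens that belong to the theme vocabulary."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in theme_terms for t in tokens) / len(tokens)


def denoise(pages, threshold=0.1):
    """Keep only pages whose theme score reaches the (assumed) threshold."""
    return [p for p in pages if theme_score(p) >= threshold]


def classify(text, category_terms=CATEGORY_TERMS):
    """Pick the category whose terms occur most often in the page."""
    counts = Counter(tokenize(text))
    scores = {
        cat: sum(counts[t] for t in terms)
        for cat, terms in category_terms.items()
    }
    return max(scores, key=scores.get)


def run_pipeline(pages):
    """Denoise the crawled pages first, then classify only the survivors."""
    return [(page, classify(page)) for page in denoise(pages)]


if __name__ == "__main__":
    # Stand-in for crawl output; a real system would read pages from the crawler.
    crawled = [
        "The league match ended with a late goal and the home side won",
        "The player signed a new contract after the transfer fee was agreed by the league",
        "Ten recipes for a quick weeknight dinner",
    ]
    for page, label in run_pipeline(crawled):
        print(label, "<-", page)
```

Filtering before classification is the point of the sketch: off-theme pages never reach the classifier, which is one way the input sample could shrink and computation time could be saved.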

Ma Jianhong (马健红), Zhang Chenguang (张晨光)

Computing technology; computer technology

computer application technology; theme; classification; real-time classification

Ma Jianhong, Zhang Chenguang. Research on a Theme-Based Real-Time Webpage Classification Model [EB/OL]. (2013-07-12) [2025-08-11]. http://www.paper.edu.cn/releasepaper/content/201307-189.
