Research on an Enumeration-Based Method for Crawling Web Entities
Enumeration-based Web-scale Entity Crawling
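Information on the Internet is growing explosively, and how to quickly obtain large numbers of entities of the same type has become an important research topic; traditional relation-based methods for crawling web entities cannot solve this kind of problem. This paper first analyzes the characteristics of websites and finds that the URLs of some sites are enumerable, and on that basis proposes a new enumeration-based method for crawling web entities. The method was evaluated on Sina Weibo's location service and Baidu Wenku; the results show that it achieves good sampling coverage and accuracy, and that in most cases it is more effective than relation-based crawling.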
Information on the Internet is growing rapidly, and obtaining entities at web scale has become an important research topic. However, the traditional relation-based crawling strategy has shortcomings. To address them, this paper first analyzes the enumerability of URLs and then proposes an enumeration-based algorithm framework for crawling, comprising naïve feature incorporation, clustering, feature incorporation after clustering, feature completion, and enumeration sequence optimization. Experiments on Sina Weibo Place and Baidu Wenku show that the approach achieves good sample coverage and accuracy, and that it outperforms the traditional approach in many cases.
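The core idea of probing URLs whose identifiers can be enumerated can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and not taken from the paper: the URL template, the ID range, and the `fake_fetch` stub that stands in for an HTTP client. The sketch covers only plain enumeration; it does not attempt the clustering, feature completion, or enumeration sequence optimization steps named in the abstract.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional


def enumerate_urls(template: str, ids: Iterable[int]) -> Iterator[str]:
    """Yield one candidate URL per ID by filling the template."""
    for entity_id in ids:
        yield template.format(id=entity_id)


def crawl_by_enumeration(
    template: str,
    ids: Iterable[int],
    fetch: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Probe every candidate URL and keep the ones that resolve to an entity.

    `fetch` returns the page content, or None when no entity exists there.
    """
    found: Dict[str, str] = {}
    for url in enumerate_urls(template, ids):
        page = fetch(url)
        if page is not None:
            found[url] = page
    return found


# Hypothetical site: only these two IDs correspond to real entities.
_FAKE_SITE = {
    "http://example.com/place/3": "Place 3",
    "http://example.com/place/7": "Place 7",
}


def fake_fetch(url: str) -> Optional[str]:
    """Stand-in for an HTTP request; looks the URL up in the fake site."""
    return _FAKE_SITE.get(url)


hits = crawl_by_enumeration(
    "http://example.com/place/{id}", range(1, 11), fake_fetch
)
# Fraction of probed IDs that turned out to be real entities.
hit_rate = len(hits) / 10
```

In this toy setting the hit rate is a rough proxy for how densely the ID space is populated; a sparse ID space is presumably where ordering or pruning the enumeration sequence would matter most.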
Xiao Yanghua (肖仰华), Zhang Junjun (张俊骏)
Computing technology; computer technology
Web entity; crawler; enumeration
Web-scale Entity; Crawling; Enumeration
Xiao Yanghua, Zhang Junjun. Research on an Enumeration-Based Method for Crawling Web Entities [EB/OL]. (2014-01-23)[2025-08-16]. http://www.paper.edu.cn/releasepaper/content/201401-1056.