Web Content Extraction Based on HttpClient and HTMLParser
With the rapid growth of the Internet, analyzing and processing Internet content has become increasingly important. This paper studies HttpClient, HTMLParser, and related technologies, and proposes and implements a web page fetching and parsing method based on HttpClient and HTMLParser. The method fetches and parses HTML pages quickly and effectively and extracts the required text content.
Cui Hongyan, Chen Zhibin
Computing technology; computer technology
text extraction; HttpClient; HTMLParser; Hadoop
Cui Hongyan, Chen Zhibin. Web Content Extraction Based on HttpClient and HTMLParser [EB/OL]. (2011-12-22) [2025-08-16]. http://www.paper.edu.cn/releasepaper/content/201112-569.