Web Content Extraction Based on HttpClient and HTMLParser

Abstract

With the rapid development of the Internet, analyzing and processing Internet content has become increasingly important. This paper studies HttpClient, HTMLParser, and related technologies, and proposes and implements a web page fetching and parsing method based on HttpClient and HTMLParser. The method fetches and parses HTML pages quickly and effectively and extracts the required text content.
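
The paper's own implementation is not reproduced on this page. As a rough illustration of the fetch-then-parse pipeline the abstract describes, the following minimal Java sketch may help. It is an assumption-laden stand-in, not the authors' method: to stay free of third-party dependencies it substitutes the JDK's built-in `java.net.http.HttpClient` (Java 11+) for Apache HttpClient and the JDK's Swing HTML parser for the HTMLParser library. The class name `TextExtractionSketch` and the methods `fetch` and `extractText` are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch: JDK classes stand in for Apache HttpClient and HTMLParser.
class TextExtractionSketch {

    // Fetch step: download the page body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Parse step: walk the HTML and keep text that is not inside <script> or <style>.
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int skipDepth = 0; // > 0 while inside a script or style element

            private boolean isSkipped(HTML.Tag tag) {
                return tag == HTML.Tag.SCRIPT || tag == HTML.Tag.STYLE;
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (isSkipped(tag)) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (isSkipped(tag) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                String chunk = new String(data).trim();
                if (skipDepth == 0 && !chunk.isEmpty()) {
                    if (out.length() > 0) {
                        out.append(' '); // separate text chunks with a single space
                    }
                    out.append(chunk);
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // With a URL argument, fetch and extract; otherwise run on an inline sample.
    public static void main(String[] args) throws Exception {
        String html = args.length > 0
                ? fetch(args[0])
                : "<html><body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

Separating `fetch` from `extractText` keeps the parsing step usable on pages that are already stored locally, and it is the natural place to add site-specific filtering of navigation or advertising blocks, which this sketch does not attempt.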

Authors: Cui Hongyan (崔鸿雁), Chen Zhibin (陈智彬)

Subject classification: computing technology, computer technology

Keywords: text extraction; HttpClient; HTMLParser; Hadoop

Recommended citation: Cui Hongyan, Chen Zhibin. Web Content Extraction Based on HttpClient and HTMLParser (基于HttpClient与HTMLParser的网页正文提取) [EB/OL]. (2011-12-22) [2025-08-16]. http://www.paper.edu.cn/releasepaper/content/201112-569
