ABSTRACT
Creating and maintaining a large-scale news content aggregator is a tedious task that requires more than a simple RSS aggregator. New news sites appear on the Internet every day and publish content at different refresh rates; well-established news sites restrict access to their content to subscribers or online readers without offering RSS feeds, while other sites update their CMS or website template and cause crawlers to produce fetch errors. The main problem arising from this continuous generation and alteration of pages is the automated discovery of appropriate, useful content, together with the dynamic rules that crawlers must apply so that they do not become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on automatic extraction of the patterns that form each domain. The system achieves high performance by combining information gathered while discovering the structure of a news site with the "knowledge" it acquires at each crawling step, in order to improve the quality of the next steps of its own procedure. Additionally, the system can recognize changes in patterns and rebuild the domain rules whenever a domain changes structure. The system has been successfully implemented in palo.rs, the first news search engine in Serbia.
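As an illustration of the general idea of per-domain pattern extraction with rule rebuilding, the following is a minimal Python sketch. It is not the mechanism the paper implements. It assumes well-nested HTML, treats text that repeats verbatim across sample pages as template material, and picks the tag path with the longest varying text as the body rule. All names (`learn_body_rule`, `rule_is_stale`, `min_hit_ratio`) and the heuristics are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are kept off the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collects visible text keyed by a simple tag path, e.g. 'html/body/div.story/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text_by_path = defaultdict(str)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Assumption: the markup is well nested, so a plain pop is enough.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.text_by_path["/".join(self.stack)] += text + " "


def paths_to_text(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return {path: text.strip() for path, text in parser.text_by_path.items()}


def learn_body_rule(sample_pages):
    """Return the tag path whose text differs on every sample page and is longest on average."""
    if not sample_pages:
        return None
    per_page = [paths_to_text(page) for page in sample_pages]
    shared = set(per_page[0]).intersection(*per_page[1:])
    best_path, best_len = None, 0.0
    for path in shared:
        texts = [page[path] for page in per_page]
        if len(set(texts)) < len(texts):
            continue  # repeated text: likely template material (menu, footer)
        avg_len = sum(map(len, texts)) / len(texts)
        if avg_len > best_len:
            best_path, best_len = path, avg_len
    return best_path


def extract_body(html, rule):
    return paths_to_text(html).get(rule, "")


def rule_is_stale(recent_pages, rule, min_hit_ratio=0.5):
    """Flag the rule for rebuilding when it yields text on too few recent pages."""
    hits = sum(1 for page in recent_pages if extract_body(page, rule))
    return hits < min_hit_ratio * len(recent_pages)
```

In this sketch, a crawler would call `learn_body_rule` once per domain on a handful of article pages, apply `extract_body` during regular crawling, and call `learn_body_rule` again whenever `rule_is_stale` reports that the stored path no longer matches recent pages.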
Index Terms
- An automatic wrapper generation process for large scale crawling of news websites