An automatic wrapper generation process for large scale crawling of news websites

DOI: 10.1145/2645791.2645824
Published: 02 October 2014

ABSTRACT

The creation and maintenance of a large-scale news content aggregator is a tedious task that requires more than a simple RSS aggregator. Many news sites appear on the Internet every day, providing new content at different refresh rates; well-established news sites restrict access to their content to subscribers or online readers without offering RSS feeds, whereas other sites update their CMS or website template and cause crawlers to run into fetch errors. The main problem arising from this continuous generation and alteration of pages is the automated discovery of appropriate, useful content and of the dynamic rules that crawlers need to apply so that they do not become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on the automatic extraction of the patterns that form each domain. The system achieves high performance by combining information gathered while discovering the structure of a news site with "knowledge" that it acquires at each crawling step, in order to improve the quality of the subsequent steps of its own procedure. Additionally, the system can recognize changes in patterns and rebuild the domain rules whenever a domain changes its structure. This system has been successfully implemented in palo.rs, the first news search engine in Serbia.
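
The abstract does not spell out the extraction algorithm, so the following is only a minimal illustrative sketch of the general idea of a per-domain wrapper, not the authors' method. It assumes that a "pattern" can be approximated by the element path (tag plus class attribute) that carries the most text across sample pages of one domain, and that a structure change can be detected when that path stops matching recent pages. All function names, the path notation, and the `min_hit_rate` threshold are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are ignored when building paths.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class PathTextCollector(HTMLParser):
    """Collects visible text per element path, e.g. 'html/body/div.story/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, outermost first
        self.chunks = {}  # path -> list of text fragments found at that path

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag to tolerate sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.chunks.setdefault("/".join(self.stack), []).append(text)


def text_by_path(html):
    """Return a dict mapping each element path to its concatenated text."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return {path: " ".join(parts) for path, parts in collector.chunks.items()}


def induce_body_rule(sample_pages):
    """Assumed heuristic: the body rule is the path holding the most text overall."""
    totals = Counter()
    for html in sample_pages:
        for path, text in text_by_path(html).items():
            totals[path] += len(text)
    return totals.most_common(1)[0][0]


def extract_body(html, rule):
    """Apply a previously induced rule; returns '' when the rule does not match."""
    return text_by_path(html).get(rule, "")


def needs_rebuild(recent_pages, rule, min_hit_rate=0.5):
    """Flag the domain for rule re-induction when the rule matches too rarely."""
    if not recent_pages:
        return False
    hits = sum(1 for html in recent_pages if extract_body(html, rule))
    return hits / len(recent_pages) < min_hit_rate
```

In this sketch, a crawler would call `induce_body_rule` once on a handful of sample articles from a domain, reuse the resulting rule with `extract_body` on later pages, and call `needs_rebuild` on a recent batch to decide whether the domain's template has changed. A real system would also need rules for the title and media, which this example leaves out.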

      Published in

        PCI '14: Proceedings of the 18th Panhellenic Conference on Informatics
        October 2014
        355 pages
        ISBN: 9781450328975
        DOI: 10.1145/2645791

        General Chairs: Katsikas Sokratis, Hatzopoulos Michael, Apostolopoulos Theodoros, Anagnostopoulos Dimosthenis
        Program Chairs: Carayiannis Elias, Varvarigou Theodora, Nikolaidou Mara

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        PCI '14 Paper Acceptance Rate: 51 of 102 submissions, 50%
        Overall Acceptance Rate: 190 of 390 submissions, 49%