An automatic wrapper generation process for large scale crawling of news websites

Authors:
Iraklis Varlamis
Department of Informatics and Telematics, Harokopio University of Athens, Omirou 9, Tavros, Athens, Greece

Nikos Tsirakis
Palo LTD, Kokkoni Corinthias 20002, Corinthia, Greece

Vasilis Poulopoulos
Palo LTD, Kokkoni Corinthias 20002, Corinthia, Greece

Panagiotis Tsantilas
Palo LTD, Kokkoni Corinthias 20002, Corinthia, Greece

PCI '14: Proceedings of the 18th Panhellenic Conference on Informatics, October 2014, Pages 1–6
https://doi.org/10.1145/2645791.2645824

Published: 02 October 2014

ABSTRACT

The creation and maintenance of a large-scale news content aggregator is a tedious task that requires more than a simple RSS aggregator. Many news sites appear on the Internet every day, providing new content at different refresh rates; well-established news sites restrict access to their content to subscribers or online readers and offer no RSS feeds, whereas other sites update their CMS or website template and cause crawlers to run into fetch errors. The main problem that arises from this continuous generation and alteration of pages on the Internet is the automated discovery of appropriate and useful content, along with the dynamic rules that crawlers need to apply in order not to become outdated. In this paper we present an innovative mechanism for extracting useful content (title, body and media) from news article web pages, based on the automatic extraction of the patterns that form each domain. The system achieves high performance by combining information gathered while discovering the structure of a news site with "knowledge" that it acquires at each crawling step, in order to improve the quality of the next steps of its own procedure. Additionally, the system can recognize changes in patterns and rebuild the domain rules whenever the domain changes structure. This system has been successfully implemented in palo.rs, the first news search engine in Serbia.
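
The abstract does not spell out the extraction algorithm, so the following is only a minimal sketch of the general idea of per-domain wrapper generation, not the authors' method. It assumes reasonably well-formed HTML and a deliberately naive heuristic: on each sample page of a domain, the element path holding the most text is taken as the body candidate, and the path that wins on most pages becomes the domain rule. All names (`PathTextCollector`, `learn_body_path`, `extract_body`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags without a closing tag; they must not be pushed on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Groups page text by element path, e.g. 'html/body/div.story/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.chunks = {}  # path -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            css_class = dict(attrs).get("class")
            self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Assumption: tags are properly nested, so the top of the stack matches.
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.chunks.setdefault("/".join(self.stack), []).append(text)


def text_by_path(html):
    """Return {element path: concatenated text} for one page."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return {path: " ".join(parts) for path, parts in parser.chunks.items()}


def learn_body_path(sample_pages):
    """Pick the path that most often holds the longest text across samples."""
    votes = Counter()
    for html in sample_pages:
        texts = text_by_path(html)
        if texts:
            votes[max(texts, key=lambda p: len(texts[p]))] += 1
    return votes.most_common(1)[0][0] if votes else None


def extract_body(html, body_path):
    """Apply the learned rule; None suggests the template has changed."""
    return text_by_path(html).get(body_path)


if __name__ == "__main__":
    template = ("<html><body><h1>{}</h1><div class='nav'>Home</div>"
                "<div class='story'><p>{}</p><p>{}</p></div></body></html>")
    samples = [
        template.format("A", "First paragraph of story A.", "Second paragraph."),
        template.format("B", "First paragraph of story B.", "More text here."),
    ]
    rule = learn_body_path(samples)
    print(rule)
    print(extract_body(samples[0], rule))
```

When `extract_body` returns `None` for pages of a domain that previously matched, a crawler built along these lines could treat that as a signal to relearn the rule from fresh samples, which loosely mirrors the rule-rebuilding behaviour the abstract describes.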

Index Terms

An automatic wrapper generation process for large scale crawling of news websites
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published in
PCI '14: Proceedings of the 18th Panhellenic Conference on Informatics
October 2014
355 pages
ISBN: 9781450328975
DOI: 10.1145/2645791
General Chairs:
Katsikas Sokratis, Department of Digital Systems, University of Piraeus
Hatzopoulos Michael, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens
Apostolopoulos Theodoros, Department of Informatics, Athens University of Economics and Business
Anagnostopoulos Dimosthenis, Department of Informatics & Telematics, Harokopio University of Athens
Program Chairs:
Carayiannis Elias, Department of Systems & Technology Management, School of Business, George Washington University
Varvarigou Theodora, School of Electrical and Computer Engineering, National Technical University of Athens
Nikolaidou Mara, Department of Informatics & Telematics, Harokopio University of Athens
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Web Mining
Web crawling
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
PCI '14 Paper Acceptance Rate: 51 of 102 submissions, 50%
Overall Acceptance Rate: 190 of 390 submissions, 49%