A fast template-based approach to automatically identify primary text content of a web page

Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham, The Duy Bui

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > peer-review

7 Citations (Scopus)

Abstract

Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is that search engines also look at non-informative blocks of web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that automatically detects the main content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates that represent the structure of content blocks in a website, content blocks of a new web page from the website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones.
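
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general template idea it describes, not the authors' FastContentExtractor. It assumes well-formed HTML, a hypothetical set of block-level tags, a block "template" that is simply a tag path (with `id` where present), and the heuristic that a block whose text is identical on every sample page of a site is non-informative. All names (`BlockParser`, `learn_template`, `extract`) are illustrative.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: these tags delimit "blocks"; the paper may define blocks differently.
BLOCK_TAGS = {"div", "td", "p", "article", "section"}


class BlockParser(HTMLParser):
    """Collect [path, text] pairs for block elements, in document order."""

    def __init__(self):
        super().__init__()
        self.names = []   # path components of the currently open blocks
        self.open = []    # indices into self.blocks for the open blocks
        self.blocks = []  # [path, text] entries in document order

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            ident = dict(attrs).get("id")
            self.names.append(f"{tag}#{ident}" if ident else tag)
            self.open.append(len(self.blocks))
            self.blocks.append(["/".join(self.names), ""])

    def handle_endtag(self, tag):
        # Assumes well-formed markup: every block tag is explicitly closed.
        if tag in BLOCK_TAGS and self.open:
            self.names.pop()
            self.open.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open block.
        if self.open:
            self.blocks[self.open[-1]][1] += data


def parse_blocks(html):
    """Return (path, normalized text) for every non-empty block."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return [(path, " ".join(text.split()))
            for path, text in parser.blocks if text.strip()]


def learn_template(sample_pages):
    """Store the paths whose text varies across sample pages of one site."""
    variants = defaultdict(set)
    for html in sample_pages:
        per_page = defaultdict(list)
        for path, text in parse_blocks(html):
            per_page[path].append(text)
        for path, texts in per_page.items():
            variants[path].add(tuple(texts))
    # Identical text on every page is treated as navigation, ads, etc.
    return {path for path, seen in variants.items() if len(seen) > 1}


def extract(html, template):
    """Keep blocks matching the stored template, preserving page order."""
    return [text for path, text in parse_blocks(html) if path in template]
```

Under these assumptions, the template is learned once from a few pages of a site with `learn_template`, and each new page then needs only a single parsing pass in `extract`, which is where the speed-up would come from. Because blocks are filtered rather than re-sorted, the output keeps the order in which they appear in the page.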

Original language: English
Title of host publication: KSE 2009 - The 1st International Conference on Knowledge and Systems Engineering
Place of Publication: Washington, DC
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 232-236
Number of pages: 5
ISBN (Print): 9780769538464
Publication status: Published - 2009
Externally published: Yes
Event: 1st International Conference on Knowledge and Systems Engineering, KSE 2009 - Hanoi, Viet Nam
Duration: 13 Oct 2009 to 17 Oct 2009

Keywords

  • Data mining
  • Template detection
  • Web mining
