DOM-based XHTML document structure analysis separating content from navigation elements

Constantine Mantratzis, Steve Cassidy

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > Research > peer-review

Abstract

This paper describes an algorithm that attempts to distinguish core content from clutter within a web document. The end goal is to aid in the separation of the core content from hyperlinked clutter such as text advertisements and long links of syndicated references to other web documents. Its advantage over other approaches is its ability to identify both loosely and tightly defined "table-like" or "list-like" structures of hyperlinks (from nested tables to simple, bullet-pointed lists) by operating at various levels within the DOM tree. The resulting data can then be used to "narrow down" the core content of a web document for semantic analysis or other information retrieval purposes, as well as to aid in the process of "clipping" a web document to its bare essentials for use with hardware-limited devices such as PDAs and cell phones.
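The general idea of spotting "list-like" or "table-like" groups of hyperlinks at different levels of the DOM tree can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it is a simple link-to-text ratio heuristic chosen only for illustration. The function names, the `min_links` and `threshold` parameters, and the sample page are all assumptions, and the sample omits the XHTML namespace so that plain tag names such as `a` match.

```python
import xml.etree.ElementTree as ET


def text_length(node):
    """Count the characters of text under node, ignoring surrounding whitespace."""
    return sum(len(chunk.strip()) for chunk in node.itertext())


def linked_text_length(node):
    """Count the characters of text that sit inside <a> elements under node."""
    return sum(text_length(a) for a in node.iter("a"))


def link_ratio(node):
    """Fraction of the text under node that is hyperlinked (0.0 if no text)."""
    total = text_length(node)
    return linked_text_length(node) / total if total else 0.0


def find_link_clusters(node, min_links=3, threshold=0.8):
    """Return the outermost elements whose text is mostly hyperlinks.

    Hypothetical heuristic: an element is treated as link clutter when it
    holds at least min_links anchors and link_ratio is at least threshold.
    The test is applied at every level of the tree, top down.
    """
    n_links = sum(1 for _ in node.iter("a"))
    if n_links >= min_links and link_ratio(node) >= threshold:
        return [node]
    clusters = []
    for child in node:
        clusters.extend(find_link_clusters(child, min_links, threshold))
    return clusters


# Hypothetical sample page: a bulleted menu, a content block, a table of ads.
PAGE = """<html><body>
<ul id="nav"><li><a href="/a">Home</a></li><li><a href="/b">News</a></li><li><a href="/c">Contact</a></li></ul>
<div id="main"><p>This paragraph is the core content of the page and has <a href="/d">one link</a> in it.</p></div>
<table id="ads"><tr><td><a href="/x">Buy now</a></td><td><a href="/y">Cheap deals</a></td></tr><tr><td><a href="/z">Click here</a></td></tr></table>
</body></html>"""

root = ET.fromstring(PAGE)
clusters = find_link_clusters(root)
print([c.get("id") for c in clusters])  # prints ['nav', 'ads']
```

In this sketch the bulleted menu and the table of text advertisements are flagged, while the paragraph with a single inline link is left as candidate core content. Real pages would need the XHTML namespace handled and thresholds tuned against actual data.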

Language: English
Title of host publication: Proceedings International Conference on Computational Intelligence for Modelling, Control & Automation CIMCA 2005 Jointly with International Conference on Intelligent Agents, Web Technologies & Internet Commerce IAWTIC 2005
Editors: M. Mohammadian
Place of Publication: Los Alamitos, CA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 633-637
Number of pages: 5
Volume: 1
ISBN (Print): 0769525040, 9780769525044
DOIs: https://doi.org/10.1109/CIMCA.2005.1631334
Publication status: Published - Nov 2005
Event: International Conference on Computational Intelligence for Modelling, Control and Automation, CIMCA 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, IAWTIC 2005 - Vienna, Austria
Duration: 28 Nov 2005 to 30 Nov 2005

Other

Other: International Conference on Computational Intelligence for Modelling, Control and Automation, CIMCA 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, IAWTIC 2005
Country: Austria
City: Vienna
Period: 28/11/05 to 30/11/05

Fingerprint

Personal digital assistants
Information retrieval
Navigation
Semantics
Hardware

Cite this

Mantratzis, C., & Cassidy, S. (2005). DOM-based XHTML document structure analysis separating content from navigation elements. In M. Mohammadian (Ed.), Proceedings International Conference on Computational Intelligence for Modelling, Control & Automation CIMCA 2005 Jointly with International Conference on Intelligent Agents, Web Technologies & Internet Commerce IAWTIC 2005 (Vol. 1, pp. 633-637). [1631334] Los Alamitos, CA: Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/CIMCA.2005.1631334
@inproceedings{a490cd1658984dca91c9a1dcafa5a8f5,
title = "DOM-based XHTML document structure analysis separating content from navigation elements",
abstract = "This paper describes an algorithm that attempts to distinguish core content from clutter within a web document. The end goal is to aid in the separation of the core-content from hyperlinked-clutter such as text advertisements and long links of syndicated references to other web documents. Its advantage over other approaches is its ability to identify both loosely as well as tightly defined {"}table-like{"} or {"}list-like{"} structures of hyperlinks (from nested tables to simple, bullet-pointed lists) by operating at various levels within the DOM tree. The resulting data can then be used to {"}narrow down{"} the core-content of a web document for semantic analysis or other information retrieval purposes as well as to aid in the process of {"}clipping{"} a web document to its bare essentials for use with hardware-limited devices such as PDAs and cell phones.",
author = "Constantine Mantratzis and Steve Cassidy",
year = "2005",
month = "11",
doi = "10.1109/CIMCA.2005.1631334",
language = "English",
isbn = "0769525040",
volume = "1",
pages = "633--637",
editor = "M. Mohammadian",
booktitle = "Proceedings International Conference on Computational Intelligence for Modelling, Control \& Automation CIMCA 2005 Jointly with International Conference on Intelligent Agents, Web Technologies \& Internet Commerce IAWTIC 2005",
publisher = "Institute of Electrical and Electronics Engineers (IEEE)",
address = "Los Alamitos, CA",

}


