A model for detecting and merging vertically spanned table cells in plain text documents

Vanessa Long, Robert Dale, Steve Cassidy

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > Research > peer-review

Abstract

A spanned cell in a table is a single, complete unit that physically occupies multiple columns and/or multiple rows. Spanned cells are common in tables, and they are a significant cause of error in the extraction of tables from free text documents. In this paper, we present a model for the detection and merging of vertically spanned cells for tables presented in plain text documents. Our model and algorithm are based purely on the layout features of the tables, and they require no semantic understanding of the documents. When tested on the 98 tables appearing in 40 randomly selected documents from a corpus of company announcements from the Australian Stock Exchange (ASX), our algorithm achieves an accuracy of 86.79% in detecting and merging vertically spanned cells.
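
To make the problem concrete, the sketch below shows a naive, layout-only heuristic for one common case of vertical spanning: a cell whose text wraps onto the next physical line. It is an illustration only and is not the model or algorithm from the paper. The column offsets, the sample table, the function names, and the rule that a line with an empty first column continues the row above are all assumptions made for this example.

```python
from typing import List, Tuple

Column = Tuple[int, int]  # (start, end) character offsets, end exclusive


def fixed_width(*cells: str, width: int = 14) -> str:
    """Build one plain-text table line with every cell padded to `width`."""
    return "".join(cell.ljust(width) for cell in cells).rstrip()


def slice_cells(line: str, columns: List[Column]) -> List[str]:
    """Cut one physical text line into cell strings at fixed column offsets."""
    return [line[start:end].strip() for start, end in columns]


def merge_vertical_spans(lines: List[str], columns: List[Column]) -> List[List[str]]:
    """Merge continuation lines into the logical row above them.

    Assumed heuristic (hypothetical, not taken from the paper): a line whose
    first cell is empty but which has text elsewhere continues the previous row.
    """
    rows: List[List[str]] = []
    for line in lines:
        cells = slice_cells(line, columns)
        if not any(cells):
            continue  # ignore blank lines entirely
        if rows and cells[0] == "":
            previous = rows[-1]
            for i, text in enumerate(cells):
                if text:
                    # Append the wrapped fragment to the cell directly above it.
                    previous[i] = (previous[i] + " " + text).strip()
        else:
            rows.append(cells)
    return rows


# Hypothetical sample table: "Ordinary / fully paid" is one cell on two lines.
COLUMNS: List[Column] = [(0, 14), (14, 28), (28, 42)]
SAMPLE_LINES = [
    fixed_width("Director", "Shares held", "Change"),
    fixed_width("J. Smith", "Ordinary", "1,000"),
    fixed_width("", "fully paid", ""),
    fixed_width("A. Jones", "Options", "250"),
]

if __name__ == "__main__":
    for row in merge_vertical_spans(SAMPLE_LINES, COLUMNS):
        print(row)
```

On the sample, the four physical lines collapse into three logical rows, with the second row's middle cell reading "Ordinary fully paid". A heuristic this simple uses no semantics, in the spirit of the abstract, but it would misfire on tables whose first column is legitimately empty, which is one reason a fuller layout model is needed.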

Language: English
Title of host publication: Proceedings Eighth International Conference on Document Analysis and Recognition
Editors: Bob Werner
Place of Publication: Los Alamitos, CA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 1242-1246
Number of pages: 5
Volume: 1
ISBN (Print): 0769524206, 9780769524207
DOIs: https://doi.org/10.1109/ICDAR.2005.21
Publication status: Published - Sep 2005
Event: 8th International Conference on Document Analysis and Recognition - Seoul, Korea, Republic of
Duration: 31 Aug 2005 to 1 Sep 2005

Other

Other: 8th International Conference on Document Analysis and Recognition
Country: Korea, Republic of
City: Seoul
Period: 31/08/05 to 1/09/05

Fingerprint

Merging
Semantics
Industry

Bibliographical note

Copyright 2005 IEEE. Reprinted from Eighth International Conference on Document Analysis and Recognition : proceedings : August 31 to September 1, 2005, Seoul, Korea. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Macquarie University’s products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

Cite this

Long, V., Dale, R., & Cassidy, S. (2005). A model for detecting and merging vertically spanned table cells in plain text documents. In B. Werner (Ed.), Proceedings Eighth International Conference on Document Analysis and Recognition (Vol. 1, pp. 1242-1246). [1575741] Los Alamitos, CA: Institute of Electrical and Electronics Engineers (IEEE). https://doi.org/10.1109/ICDAR.2005.21
Long, Vanessa ; Dale, Robert ; Cassidy, Steve. / A model for detecting and merging vertically spanned table cells in plain text documents. Proceedings Eighth International Conference on Document Analysis and Recognition. editor / Bob Werner. Vol. 1 Los Alamitos, CA : Institute of Electrical and Electronics Engineers (IEEE), 2005. pp. 1242-1246
@inproceedings{48576e7202fc4ded8afb174751ecd2f9,
title = "A model for detecting and merging vertically spanned table cells in plain text documents",
abstract = "A spanned cell in a table is a single, complete unit that physically occupies multiple columns and/or multiple rows. Spanned cells are common in tables, and they are a significant cause of error in the extraction of tables from free text documents. In this paper, we present a model for the detection and merging of vertically spanned cells for tables presented in plain text documents. Our model and algorithm are based purely on the layout features of the tables, and they require no semantic understanding of the documents. When tested on the 98 tables appearing in 40 randomly selected documents from a corpus of company announcements from the Australian Stock Exchange (ASX), our algorithm achieves an accuracy of 86.79{\%} in detecting and merging vertically spanned cells.",
author = "Vanessa Long and Robert Dale and Steve Cassidy",
note = "Copyright 2005 IEEE. Reprinted from Eighth International Conference on Document Analysis and Recognition : proceedings : August 31 to September 1, 2005, Seoul, Korea. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Macquarie University's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.",
year = "2005",
month = "9",
doi = "10.1109/ICDAR.2005.21",
language = "English",
isbn = "0769524206",
volume = "1",
pages = "1242--1246",
editor = "Bob Werner",
booktitle = "Proceedings Eighth International Conference on Document Analysis and Recognition",
publisher = "Institute of Electrical and Electronics Engineers (IEEE)",
address = "United States",

}

Long, V, Dale, R & Cassidy, S 2005, A model for detecting and merging vertically spanned table cells in plain text documents. in B Werner (ed.), Proceedings Eighth International Conference on Document Analysis and Recognition. vol. 1, 1575741, Institute of Electrical and Electronics Engineers (IEEE), Los Alamitos, CA, pp. 1242-1246, 8th International Conference on Document Analysis and Recognition, Seoul, Korea, Republic of, 31/08/05. https://doi.org/10.1109/ICDAR.2005.21

TY  - GEN
T1  - A model for detecting and merging vertically spanned table cells in plain text documents
AU  - Long, Vanessa
AU  - Dale, Robert
AU  - Cassidy, Steve
N1  - Copyright 2005 IEEE. Reprinted from Eighth International Conference on Document Analysis and Recognition : proceedings : August 31 to September 1, 2005, Seoul, Korea. This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of Macquarie University’s products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
PY  - 2005/9
Y1  - 2005/9
N2  - A spanned cell in a table is a single, complete unit that physically occupies multiple columns and/or multiple rows. Spanned cells are common in tables, and they are a significant cause of error in the extraction of tables from free text documents. In this paper, we present a model for the detection and merging of vertically spanned cells for tables presented in plain text documents. Our model and algorithm are based purely on the layout features of the tables, and they require no semantic understanding of the documents. When tested on the 98 tables appearing in 40 randomly selected documents from a corpus of company announcements from the Australian Stock Exchange (ASX), our algorithm achieves an accuracy of 86.79% in detecting and merging vertically spanned cells.
AB  - A spanned cell in a table is a single, complete unit that physically occupies multiple columns and/or multiple rows. Spanned cells are common in tables, and they are a significant cause of error in the extraction of tables from free text documents. In this paper, we present a model for the detection and merging of vertically spanned cells for tables presented in plain text documents. Our model and algorithm are based purely on the layout features of the tables, and they require no semantic understanding of the documents. When tested on the 98 tables appearing in 40 randomly selected documents from a corpus of company announcements from the Australian Stock Exchange (ASX), our algorithm achieves an accuracy of 86.79% in detecting and merging vertically spanned cells.
UR  - http://www.scopus.com/inward/record.url?scp=33947420202&partnerID=8YFLogxK
U2  - 10.1109/ICDAR.2005.21
DO  - 10.1109/ICDAR.2005.21
M3  - Conference proceeding contribution
SN  - 0769524206
SN  - 9780769524207
VL  - 1
SP  - 1242
EP  - 1246
BT  - Proceedings Eighth International Conference on Document Analysis and Recognition
PB  - Institute of Electrical and Electronics Engineers (IEEE)
CY  - Los Alamitos, CA
ER  -

Long V, Dale R, Cassidy S. A model for detecting and merging vertically spanned table cells in plain text documents. In Werner B, editor, Proceedings Eighth International Conference on Document Analysis and Recognition. Vol. 1. Los Alamitos, CA: Institute of Electrical and Electronics Engineers (IEEE). 2005. p. 1242-1246. 1575741 https://doi.org/10.1109/ICDAR.2005.21