TEXUS: table extraction system for PDF documents

Roya Rastan, Hye Young Paik, John Shepherd, Seung Hwan Ryu*, Amin Beheshti

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

7 Citations (Scopus)

Abstract

Tables in documents are a rich and under-exploited source of structured data in otherwise unstructured documents. The extraction and understanding of tabular data is a challenging task that has attracted the attention of researchers from a range of disciplines such as information retrieval, machine learning and natural language processing. In this demonstration, we present an end-to-end table extraction and understanding system which takes a PDF file and automatically generates a set of XML and CSV files containing the extracted cells, rows and columns of tables, as well as a complete reading order analysis of the tables. Unlike many systems that work as black-box, ad hoc solutions, our system design incorporates an open, reusable and extensible architecture to support research into, and development of, table-processing systems. During the demo, users will see how our system gradually transforms a PDF document into a set of structured files through a series of processing modules, namely locating, segmenting and function/structure analysis.

Original language: English
Title of host publication: Databases Theory and Applications
Subtitle of host publication: 29th Australasian Database Conference, ADC 2018, Proceedings
Editors: Junhu Wang, Gao Cong, Jinjun Chen, Jianzhong Qi
Place of Publication: Cham
Publisher: Springer, Springer Nature
Pages: 345-349
Number of pages: 5
ISBN (Electronic): 9783319920139
ISBN (Print): 9783319920122
DOIs
Publication status: Published - 1 Jan 2018
Event: 29th Australasian Database Conference, ADC 2018 - Gold Coast, Australia
Duration: 24 May 2018 to 27 May 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10837 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th Australasian Database Conference, ADC 2018
Country: Australia
City: Gold Coast
Period: 24/05/18 to 27/05/18

Keywords

  • Document processing
  • Information extraction
  • Table extraction
  • Table processing
  • TEXUS
