CNN-IETS: a CNN-based probabilistic approach for information extraction by text segmentation

Meng Hu, Zhixu Li*, Yongxin Shen, An Liu, Guanfeng Liu, Kai Zheng, Lei Zhao

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

5 Citations (Scopus)

Abstract

Information Extraction by Text Segmentation (IETS) aims at segmenting text inputs to extract the implicit data values contained in them. State-of-the-art IETS approaches mainly rely on machine learning techniques, either supervised or unsupervised. However, while the supervised approaches require a large amount of labelled training data, the performance of the unsupervised ones can be unstable across different data sets. To overcome these weaknesses, this paper introduces CNN-IETS, a novel unsupervised probabilistic approach that takes advantage of pre-existing data and a Convolution Neural Network (CNN)-based probabilistic classification model. While using the CNN model eases the burden of selecting high-quality features for associating text segments with the attributes of a given domain, the pre-existing data, serving as a domain knowledge base, provides training data with a comprehensive list of features for building the CNN model. Given an input text, we first perform an initial segmentation (according to the occurrences of its words in the knowledge base) to generate text segments, which the CNN then classifies with probabilities. Then, based on the probabilistic CNN classification results, we find the most probable labelling of the whole input text. As a complement, a bidirectional sequencing model learned on demand from the test data is finally deployed to further adjust problematically labelled segments. Our experimental study, conducted on several real data collections, shows that CNN-IETS improves the extraction quality of state-of-the-art approaches by more than 10%.
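The first two stages described in the abstract (knowledge-base-driven initial segmentation, then choosing the most probable labelling of the whole input) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy knowledge base `KB`, the token-overlap scorer `segment_probs` (a stand-in for the CNN classifier), and the rule that each attribute labels at most one segment are all assumptions made for illustration, and the final bidirectional sequencing step is omitted.

```python
import math
from itertools import permutations

# Hypothetical toy knowledge base: attribute -> known values (assumption).
KB = {
    "author": ["meng hu", "zhixu li"],
    "title": ["information extraction by text segmentation"],
    "year": ["2017"],
}

def kb_attributes(token):
    """Attributes whose known values contain the token as a word."""
    return {a for a, vals in KB.items() for v in vals if token in v.split()}

def initial_segments(text):
    """Group adjacent tokens that share at least one knowledge-base attribute."""
    segments, current, current_attrs = [], [], set()
    for tok in text.lower().split():
        attrs = kb_attributes(tok)
        if current and attrs & current_attrs:
            current.append(tok)
            current_attrs &= attrs
        else:
            if current:
                segments.append(" ".join(current))
            current, current_attrs = [tok], attrs
    if current:
        segments.append(" ".join(current))
    return segments

def segment_probs(segment):
    """Stand-in for the CNN: smoothed token-overlap scores, normalised to sum to 1."""
    toks = segment.split()
    scores = {}
    for attr, vals in KB.items():
        vocab = {w for v in vals for w in v.split()}
        scores[attr] = 1e-3 + sum(t in vocab for t in toks)
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

def best_labelling(segments):
    """Pick the labelling with the highest joint log-probability.

    Assumption: each attribute labels at most one segment, so candidates are
    permutations of attributes. Returns None if there are more segments than
    attributes (no valid candidate under this assumption).
    """
    probs = [segment_probs(s) for s in segments]

    def log_score(labels):
        return sum(math.log(p[a]) for p, a in zip(probs, labels))

    best = max(permutations(KB, len(segments)), key=log_score, default=None)
    return None if best is None else list(zip(segments, best))

segments = initial_segments("Zhixu Li information extraction 2017")
print(best_labelling(segments))
```

Exhaustive search over permutations is only practical for a handful of attributes; it is used here to make the "most probable labelling of the whole input" idea explicit, not to suggest how the paper performs the search.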

Original language: English
Title of host publication: CIKM '17 Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 1159-1168
Number of pages: 10
ISBN (Electronic): 9781450349185
Publication status: Published - 2017
Externally published: Yes
Event: ACM Conference on Information and Knowledge Management (CIKM) - Singapore, Singapore
Duration: 6 Nov 2017 to 10 Nov 2017

Conference

Conference: ACM Conference on Information and Knowledge Management (CIKM)
Country: Singapore
City: Singapore
Period: 6/11/17 to 10/11/17

Keywords

  • Convolution Neural Network
  • Information Extraction
  • IETS
