Emerald 110k: a multidisciplinary dataset for abstract sentence classification

Connor Stead, Stephen Smith, Peter Busch, Savanid Vatanasakdakul

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

Background: Datasets available for abstract sentence classification modelling are predominantly composed of abstracts sourced from biomedical research.

Aims: To contribute a large non-biomedical multidisciplinary dataset for abstract sentence classification model research.

Method: Bulk extraction and transformation of Emerald Group Publishing structured abstracts indexed on Scopus.

Results: We present the largest multidisciplinary dataset for abstract sentence classification modelling, consisting of 1,050,397 sentences from 103,457 abstracts.
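
The transformation step described under Method can be pictured as turning each labelled section of a structured abstract into (label, sentence) pairs. The following is a minimal sketch only, not the authors' pipeline: the function name `label_sentences`, the "Label: text" input layout, the naive regex sentence splitter, and the toy input text are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration; not taken from the paper.
# A section is assumed to look like "Label: body text" on a single line.
SECTION_RE = re.compile(r"^(?P<label>[A-Z][A-Za-z/ ]*?):\s*(?P<body>.+)$")
# Naive sentence boundary: end punctuation, whitespace, then a capital letter.
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def label_sentences(abstract: str) -> list[tuple[str, str]]:
    """Return (section label, sentence) pairs for a structured abstract."""
    pairs = []
    for line in abstract.splitlines():
        match = SECTION_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines and text without a section label
        label = match.group("label")
        for sentence in SENTENCE_END_RE.split(match.group("body")):
            sentence = sentence.strip()
            if sentence:
                pairs.append((label, sentence))
    return pairs


# Toy input written for this sketch, not drawn from the dataset.
example = """Background: Datasets are mostly biomedical. They cover few disciplines.
Aims: To contribute a multidisciplinary dataset."""

pairs = label_sentences(example)
for label, sentence in pairs:
    print(f"{label}\t{sentence}")
```

Each output row pairs a sentence with the heading it appeared under, which is the kind of supervision an abstract sentence classifier would train on. A real extraction would need a proper sentence tokenizer, since the regex above mis-splits on abbreviations.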

Original language: English
Title of host publication: Proceedings of the 17th Workshop of the Australasian Language Technology Association
Editors: Meladel Mistica, Massimo Piccardi, Andrew MacKinlay
Place of Publication: Melbourne, VIC
Publisher: Australasian Language Technology Association
Pages: 120-125
Number of pages: 6
Publication status: Published - 2019
Event: 17th Annual Workshop of The Australasian Language Technology Association (ALTA 2019) - Sydney, Australia
Duration: 4 Dec 2019 to 6 Dec 2019

Conference

Conference: 17th Annual Workshop of The Australasian Language Technology Association (ALTA 2019)
Country/Territory: Australia
City: Sydney
Period: 4/12/19 to 6/12/19

Bibliographical note

Copyright the Publisher 2019. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Structured abstracts
  • Natural language processing
  • Information systems
