Emerald 110k: a multidisciplinary dataset for abstract sentence classification

Connor Stead, Stephen Smith, Peter Busch, Savanid Vatanasakdakul

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > Research > peer-review

Abstract

Background: Datasets available for abstract sentence classification modelling are predominately comprised of abstracts sourced from biomedical research.

Aims: To contribute a large non-biomedical multidisciplinary dataset for abstract sentence classification model research.

Method: Bulk extract and transformation of Emerald Group Publishing structured abstracts indexed on Scopus.

Results: We present the largest multidisciplinary dataset for abstract sentence classification modelling, consisting of 1,050,397 sentences from 103,457 abstracts.

Language: English
Title of host publication: 17th Annual Workshop of the Australasian Language Technology Association
Subtitle of host publication: ALTA 2019
Number of pages: 8
Publication status: Accepted/In press - 25 Oct 2019
Event: 17th Annual Workshop of The Australasian Language Technology Association (ALTA 2019) - Sydney, Australia
Duration: 4 Dec 2019 to 6 Dec 2019

Conference

Conference: 17th Annual Workshop of The Australasian Language Technology Association (ALTA 2019)
Country: Australia
City: Sydney
Period: 4/12/19 to 6/12/19

Fingerprint

Biomedical Research
Research
Datasets

Keywords

  • Structured abstracts
  • Natural language processing
  • information systems

Cite this

Stead, C., Smith, S., Busch, P., & Vatanasakdakul, S. (Accepted/In press). Emerald 110k: a multidisciplinary dataset for abstract sentence classification. In 17th Annual Workshop of the Australasian Language Technology Association: ALTA 2019
Stead, Connor ; Smith, Stephen ; Busch, Peter ; Vatanasakdakul, Savanid. / Emerald 110k : a multidisciplinary dataset for abstract sentence classification. 17th Annual Workshop of the Australasian Language Technology Association: ALTA 2019. 2019.
@inproceedings{d9763441cdde486e999856906a4ddd09,
title = "Emerald 110k: a multidisciplinary dataset for abstract sentence classification",
abstract = "Background: Datasets available for abstract sentence classification modelling are predominately comprised of abstracts sourced from biomedical research. Aims: To contribute a large non-biomedical multidisciplinary dataset for abstract sentence classification model research. Method: Bulk extract and transformation of Emerald Group Publishing structured abstracts indexed on Scopus. Results: We present the largest multidisciplinary dataset for abstract sentence classification modelling, consisting of 1,050,397 sentences from 103,457 abstracts.",
keywords = "Structured abstracts, Natural language processing, information systems",
author = "Connor Stead and Stephen Smith and Peter Busch and Savanid Vatanasakdakul",
year = "2019",
month = "10",
day = "25",
language = "English",
booktitle = "17th Annual Workshop of the Australasian Language Technology Association",

}

Stead, C, Smith, S, Busch, P & Vatanasakdakul, S 2019, Emerald 110k: a multidisciplinary dataset for abstract sentence classification. in 17th Annual Workshop of the Australasian Language Technology Association: ALTA 2019. 17th Annual Workshop of The Australasian Language Technology Association (ALTA 2019), Sydney, Australia, 4/12/19.

Emerald 110k : a multidisciplinary dataset for abstract sentence classification. / Stead, Connor; Smith, Stephen; Busch, Peter; Vatanasakdakul, Savanid.

17th Annual Workshop of the Australasian Language Technology Association: ALTA 2019. 2019.

TY  - GEN
T1  - Emerald 110k
T2  - a multidisciplinary dataset for abstract sentence classification
AU  - Stead, Connor
AU  - Smith, Stephen
AU  - Busch, Peter
AU  - Vatanasakdakul, Savanid
PY  - 2019/10/25
Y1  - 2019/10/25
N2  - Background: Datasets available for abstract sentence classification modelling are predominately comprised of abstracts sourced from biomedical research. Aims: To contribute a large non-biomedical multidisciplinary dataset for abstract sentence classification model research. Method: Bulk extract and transformation of Emerald Group Publishing structured abstracts indexed on Scopus. Results: We present the largest multidisciplinary dataset for abstract sentence classification modelling, consisting of 1,050,397 sentences from 103,457 abstracts.
AB  - Background: Datasets available for abstract sentence classification modelling are predominately comprised of abstracts sourced from biomedical research. Aims: To contribute a large non-biomedical multidisciplinary dataset for abstract sentence classification model research. Method: Bulk extract and transformation of Emerald Group Publishing structured abstracts indexed on Scopus. Results: We present the largest multidisciplinary dataset for abstract sentence classification modelling, consisting of 1,050,397 sentences from 103,457 abstracts.
KW  - Structured abstracts
KW  - Natural language processing
KW  - information systems
M3  - Conference proceeding contribution
BT  - 17th Annual Workshop of the Australasian Language Technology Association
ER  -

Stead C, Smith S, Busch P, Vatanasakdakul S. Emerald 110k: a multidisciplinary dataset for abstract sentence classification. In 17th Annual Workshop of the Australasian Language Technology Association: ALTA 2019. 2019