Publishing the Trove Newspaper Corpus

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; Research; peer-review

Abstract

The Trove Newspaper Corpus is derived from the National Library of Australia’s digital archive of newspaper text. The corpus is a snapshot of the NLA collection taken in 2015 to be made available for language research as part of the Alveo Virtual Laboratory and contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large scale processing of the newspaper text in a cloud-based environment.
Language: English
Title of host publication: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Place of Publication: Luxemburg
Publisher: European Language Resources Association (ELRA)
Pages: 4520-4525
Number of pages: 6
ISBN (Electronic): 9782951740891
Publication status: Published - May 2016
Event: International Conference on Language Resources and Evaluation (10th : 2016) - Portorož, Slovenia
Duration: 23 May 2016 to 28 May 2016

Conference

Conference: International Conference on Language Resources and Evaluation (10th : 2016)
Abbreviated title: LREC 2016
Country: Slovenia
City: Portorož
Period: 23/05/16 to 28/05/16


Bibliographical note

Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • newspaper
  • corpus
  • linked data

Cite this

Cassidy, S. (2016). Publishing the Trove Newspaper Corpus. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, ... S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4520-4525). Luxemburg: European Language Resources Association (ELRA).
@inproceedings{a39bab70cfcb4c7a8f06be2262bf482b,
title = "Publishing the Trove Newspaper Corpus",
abstract = "The Trove Newspaper Corpus is derived from the National Library of Australia’s digital archive of newspaper text. The corpus is a snapshot of the NLA collection taken in 2015 to be made available for language research as part of the Alveo Virtual Laboratory and contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large scale processing of the newspaper text in a cloud-based environment.",
keywords = "newspaper, corpus, linked data",
author = "Stephen Cassidy",
note = "Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.",
year = "2016",
month = "5",
language = "English",
pages = "4520--4525",
editor = "Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Mariani, {Joseph } and Mazo, {H{\'e}l{\`e}ne } and Moreno, {Asunci{\'o}n } and Odijk, {Jan } and Piperidis, {Stelios }",
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)",
publisher = "European Language Resources Association (ELRA)",

}



TY - GEN
T1 - Publishing the Trove Newspaper Corpus
AU - Cassidy, Stephen
N1 - Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
PY - 2016/5
Y1 - 2016/5
AB - The Trove Newspaper Corpus is derived from the National Library of Australia’s digital archive of newspaper text. The corpus is a snapshot of the NLA collection taken in 2015 to be made available for language research as part of the Alveo Virtual Laboratory and contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large scale processing of the newspaper text in a cloud-based environment.
KW - newspaper
KW - corpus
KW - linked data
UR - http://www.lrec-conf.org/proceedings/lrec2016/index.html
UR - http://www.scopus.com/inward/record.url?scp=85037126101&partnerID=8YFLogxK
M3 - Conference proceeding contribution
SP - 4520
EP - 4525
BT - Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Grobelnik, Marko
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Hélène
A2 - Moreno, Asunción
A2 - Odijk, Jan
A2 - Piperidis, Stelios
PB - European Language Resources Association (ELRA)
CY - Luxemburg
ER -
