Publishing the Trove Newspaper Corpus

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution



The Trove Newspaper Corpus is derived from the National Library of Australia’s (NLA) digital archive of newspaper text. The corpus is a snapshot of the NLA collection taken in 2015 and made available for language research as part of the Alveo Virtual Laboratory. It contains 143 million articles dating from 1806 to 2007. This paper describes the work we have done to make this large corpus available as a research collection, facilitating access to individual documents and enabling large-scale processing of the newspaper text in a cloud-based environment.
Original language: English
Title of host publication: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Place of publication: Luxemburg
Publisher: European Language Resources Association (ELRA)
Number of pages: 6
ISBN (Electronic): 9782951740891
Publication status: Published - May 2016
Event: International Conference on Language Resources and Evaluation (10th : 2016) - Portorož, Slovenia
Duration: 23 May 2016 to 28 May 2016


Conference: International Conference on Language Resources and Evaluation (10th : 2016)
Abbreviated title: LREC 2016

Bibliographical note

Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.


  • newspaper
  • corpus
  • linked data


  • Cite this

    Cassidy, S. (2016). Publishing the Trove Newspaper Corpus. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, ... S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4520-4525). Luxemburg: European Language Resources Association (ELRA).