WikiWars: A new corpus for research on temporal expressions

Paweł Mazur*, Robert Dale

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference proceeding contributionpeer-review

49 Citations (Scopus)

Abstract

The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

Original languageEnglish
Title of host publicationEMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Place of PublicationStroudsburg, PA
PublisherAssociation for Computational Linguistics (ACL)
Pages913-922
Number of pages10
ISBN (Print)1932432868, 9781932432862
Publication statusPublished - 2010
EventConference on Empirical Methods in Natural Language Processing, EMNLP 2010 - Cambridge, MA, United States
Duration: 9 Oct 201011 Oct 2010

Other

OtherConference on Empirical Methods in Natural Language Processing, EMNLP 2010
CountryUnited States
CityCambridge, MA
Period9/10/1011/10/10

Fingerprint Dive into the research topics of 'WikiWars: A new corpus for research on temporal expressions'. Together they form a unique fingerprint.

Cite this