Updating the ICE annotation system

Tagging, parsing and validation

Deanna Wong*, Steve Cassidy, Pam Peters

*Corresponding author for this work

Research output: Contribution to journal (Article)

5 Citations (Scopus)

Abstract

The textual markup scheme of the International Corpus of English (ICE) project evolved continuously from 1988 onwards, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Furthermore, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that of many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focusses on several points of difficulty that are inherent in the system, especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool brings the Australian ICE corpus into line with the current ICE standard, and also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternative systems of corpus annotation, such as that developed by the TEI.

Original language: English
Pages (from-to): 115-144
Number of pages: 30
Journal: Corpora
Volume: 6
Issue number: 2
Publication status: Published - Nov 2011
