Updating the ICE annotation system: Tagging, parsing and validation

Research output: Contribution to journal › Article › Research › peer-review

Abstract

The textual markup scheme of the International Corpus of English (ICE) project evolved continuously from 1988 on, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Furthermore, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that in many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focusses on several points of difficulty that are inherent in the system - especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool brings the Australian version into line with the current ICE standard, and also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternative systems of corpus annotation, such as that developed by the TEI.

Language: English
Pages: 115-144
Number of pages: 30
Journal: Corpora
Volume: 6
Issue number: 2
DOI: 10.3366/cor.2011.0009
Publication status: Published - Nov 2011

Fingerprint

Parsing
International Corpus of English
Annotation
Tagging
conversation
linguistics
Encoding
Convert
Corpus Annotation

Cite this

@article{68d7ee33fe354447a07bd1425ee5c1f3,
title = "Updating the ICE annotation system: Tagging, parsing and validation",
abstract = "The textual markup scheme of the International Corpus of English (ICE) project evolved continuously from 1988 on, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Furthermore, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that in many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focusses on several points of difficulty that are inherent in the system - especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool brings the Australian version into line with the current ICE standard, and also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternative systems of corpus annotation, such as that developed by the TEI.",
author = "Deanna Wong and Steve Cassidy and Pam Peters",
year = "2011",
month = "11",
doi = "10.3366/cor.2011.0009",
language = "English",
volume = "6",
pages = "115--144",
journal = "Corpora",
issn = "1749-5032",
publisher = "Edinburgh University Press",
number = "2",
}

Updating the ICE annotation system: Tagging, parsing and validation. / Wong, Deanna; Cassidy, Steve; Peters, Pam.

In: Corpora, Vol. 6, No. 2, Nov. 2011, pp. 115-144.

TY - JOUR

T1 - Updating the ICE annotation system: Tagging, parsing and validation

T2 - Corpora

AU - Wong, Deanna

AU - Cassidy, Steve

AU - Peters, Pam

PY - 2011/11

Y1 - 2011/11

AB - The textual markup scheme of the International Corpus of English (ICE) project evolved continuously from 1988 on, more or less independently of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate comparisons of their linguistic content. However, this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Furthermore, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that in many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focusses on several points of difficulty that are inherent in the system - especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool brings the Australian version into line with the current ICE standard, and also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternative systems of corpus annotation, such as that developed by the TEI.

UR - http://www.scopus.com/inward/record.url?scp=82455171657&partnerID=8YFLogxK

U2 - 10.3366/cor.2011.0009

DO - 10.3366/cor.2011.0009

M3 - Article

VL - 6

SP - 115

EP - 144

JO - Corpora

JF - Corpora

SN - 1749-5032

IS - 2

ER -