A central task in clinical information extraction is the classification of sentences to identify key information in publications, such as interventions and outcomes. Surface tokens and part-of-speech tags have been the most commonly used feature types for this task. In this paper we evaluate the use of word representations, induced from approximately 100 million tokens of unlabelled in-domain data, as a form of semi-supervised learning for this task. We take an approach based on unsupervised word clusters induced with the Brown clustering algorithm, and our results show that this method outperforms the standard features. We inspect the induced word representations and the resulting discriminative model features to gain further insight into this approach.
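As an illustration of how Brown clusters can serve as sentence-level features, the following is a minimal sketch and not the paper's implementation. Brown clustering assigns each word a bit-string path in a binary hierarchy, and prefixes of that path give coarser or finer clusters. The `BROWN_PATHS` map, the `cluster_features` helper, the example bit strings and the prefix lengths are all hypothetical; real paths would come from clustering the unlabelled in-domain corpus.

```python
from collections import Counter

# Hypothetical token -> Brown cluster bit-string map (illustrative values only).
# Real paths would be produced by running Brown clustering on unlabelled text.
BROWN_PATHS = {
    "patients": "0110",
    "participants": "0111",
    "placebo": "1010",
    "aspirin": "1011",
}


def cluster_features(tokens, paths=BROWN_PATHS, prefix_lengths=(2, 4)):
    """Count Brown-cluster prefix features for one tokenised sentence.

    The prefix lengths are an assumption made for this sketch.
    """
    feats = Counter()
    for tok in tokens:
        path = paths.get(tok.lower())
        if path is None:
            continue  # token has no cluster assignment
        for n in prefix_lengths:
            # Shorter prefixes group more words into the same feature.
            feats[f"brown{n}={path[:n]}"] += 1
    return dict(feats)


features = cluster_features(["Patients", "received", "aspirin", "or", "placebo"])
```

In this toy map, "aspirin" and "placebo" share the two-bit prefix `10`, so both contribute to the same coarse feature even if only one of them appears in the labelled training sentences. Feature dictionaries of this kind could then be passed to any discriminative classifier alongside token and part-of-speech features.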
|Number of pages|9|
|Journal|ALTA 2015 : Proceedings of Australasian Language Technology Association Workshop 2015|
|Publication status|Published - 2015|
|Event|Australasian Language Technology Association Workshop (13th : 2015) - Parramatta, NSW|
|Event duration|8 Dec 2015 → 9 Dec 2015|