Abstract
This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of a top-performing segmentation model by 2.5% on a 27k-utterance dataset. We posit that word segmentation is easier in context because the learner is not trying to access irrelevant lexical items. We use topics from a Latent Dirichlet Allocation model as a proxy for "activity" contexts to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.
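The context-labelling step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the toy utterances, vectorizer settings, and number of topics are assumptions, and the paper works on the Providence corpus rather than this example data.

```python
# Sketch: LDA topics as a proxy for "activity" contexts.
# Each utterance is labelled with its most probable topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy child-directed utterances (illustrative only).
utterances = [
    "do you want some milk",
    "drink your milk please",
    "look at the red ball",
    "throw the ball to me",
]

# Bag-of-words counts per utterance.
counts = CountVectorizer().fit_transform(utterances)

# Fit a small LDA model; n_components=2 is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# The argmax topic serves as the utterance's context label,
# which a segmentation model can then condition on.
context_labels = doc_topics.argmax(axis=1)
print(list(context_labels))
```

The resulting labels can be attached to each utterance before training a context-aware Adaptor Grammar, as the paper proposes.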
| Original language | English |
|---|---|
| Title of host publication | COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers |
| Place of Publication | Stroudsburg, PA |
| Publisher | Association for Computational Linguistics, ACL Anthology |
| Pages | 2326-2334 |
| Number of pages | 9 |
| ISBN (Electronic) | 9781941643266 |
| Publication status | Published - 2014 |
| Event | 25th International Conference on Computational Linguistics, COLING 2014 - Dublin, Ireland Duration: 23 Aug 2014 → 29 Aug 2014 |
Other
| Other | 25th International Conference on Computational Linguistics, COLING 2014 |
|---|---|
| Country/Territory | Ireland |
| City | Dublin |
| Period | 23/08/14 → 29/08/14 |
Bibliographical note
Copyright the Author(s) 2014. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.