Abstract
Adaptor grammars are a framework for expressing and performing inference over a variety of non-parametric linguistic models. These models currently provide state-of-the-art performance on unsupervised word segmentation from phonemic representations of child-directed unsegmented English utterances. This paper investigates the applicability of these models to unsupervised word segmentation of Mandarin. We investigate a wide variety of segmentation models, and show that the best segmentation accuracy is obtained from models that capture inter-word "collocational" dependencies. Surprisingly, enhancing the models to exploit syllable structure regularities and to capture tone information does not improve overall word segmentation accuracy, perhaps because the information these elements convey is redundant when compared to the inter-word dependencies.
Original language | English |
---|---|
Title of host publication | Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference |
Editors | Chu-Ren Huang, Dan Jurafsky |
Place of Publication | China |
Publisher | Press of Tsinghua University |
Pages | 528-536 |
Number of pages | 9 |
Volume | 2 |
Publication status | Published - 2010 |
Event | 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China; Duration: 23 Aug 2010 → 27 Aug 2010 |
Other
Other | 23rd International Conference on Computational Linguistics, Coling 2010 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 23/08/10 → 27/08/10 |