Abstract
Cross-linguistic studies of unsupervised word segmentation have consistently shown that English is easier to segment than other languages. In this paper, we propose an explanation of this finding based on the notion of segmentation ambiguity. We show that English has very low segmentation ambiguity compared to Japanese and that this difference correlates with segmentation performance under a unigram model. We suggest that segmentation ambiguity is linked to a trade-off between syllable structure complexity and word length distribution.
Original language | English |
---|---|
Title of host publication | Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics |
Subtitle of host publication | CMCL 2013 : August 8, 2013, Sofia, Bulgaria |
Place of Publication | Stroudsburg, PA |
Publisher | Association for Computational Linguistics |
Pages | 1-10 |
Number of pages | 10 |
ISBN (Print) | 9781937284619 |
Publication status | Published - 2013 |
Event | Annual Workshop on Cognitive Modeling and Computational Linguistics (4th : 2013) - Sofia, Bulgaria. Duration: 8 Aug 2013 → 8 Aug 2013 |
Workshop
Workshop | Annual Workshop on Cognitive Modeling and Computational Linguistics (4th : 2013) |
---|---|
City | Sofia, Bulgaria |
Period | 8/08/13 → 8/08/13 |