Abstract
Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.
Original language | English |
---|---|
Title of host publication | Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics |
Place of Publication | Stroudsburg, PA |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 673-680 |
Number of pages | 8 |
Volume | 1 |
ISBN (Print) | 1932432655, 9781932432657 |
Publication status | Published - Jul 2006 |
Externally published | Yes |
Event | 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, COLING/ACL 2006, Sydney, Australia. Duration: 17 Jul 2006 → 21 Jul 2006 |