Studying the effect of input size for Bayesian word segmentation on the Providence corpus

Benjamin Börschinger*, Katherine Demuth, Mark Johnson

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

4 Citations (Scopus)

Abstract

Studies of computational models of language acquisition depend in large part on the input available for experiments. In this paper, we study the effect that input size has on the performance of word segmentation models embodying different kinds of linguistic assumptions. Because currently available corpora for word segmentation are not suited for addressing this question, we perform our study on a novel corpus based on the Providence Corpus (Demuth et al., 2006). We find that input size can have dramatic effects on segmentation performance and that, somewhat surprisingly, models performing well on smaller amounts of data can show a marked decrease in performance when exposed to larger amounts of data. We also present the dataset on which we perform our experiments, comprising longitudinal data for six children. This corpus makes it possible to ask more specific questions about computational models of word segmentation, in particular about intra-language variability and about how the performance of different models can change over time.

Original language: English
Title of host publication: 24th International Conference on Computational Linguistics
Subtitle of host publication: Proceedings of COLING 2012: Technical Papers
Editors: Martin Kay, Christian Boitet
Place of Publication: Mumbai
Publisher: Indian Institute of Technology
Pages: 325-340
Number of pages: 16
Publication status: Published - 2012
Event: 24th International Conference on Computational Linguistics, COLING 2012 - Mumbai, India
Duration: 8 Dec 2012 to 15 Dec 2012

Other

Other: 24th International Conference on Computational Linguistics, COLING 2012
Country/Territory: India
City: Mumbai
Period: 8/12/12 to 15/12/12
