Unsupervised word segmentation in context

Gabriel Synnaeve, Isabelle Dautriche, Benjamin Borschinger, Mark Johnson, Emmanuel Dupoux

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review



This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of a top-performing segmentation model by 2.5% on a 27k-utterance dataset. We posit that word segmentation is easier in context because the learner is not trying to access irrelevant lexical items. We use topics from a Latent Dirichlet Allocation model as a proxy for "activity" contexts to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.
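As a minimal sketch of the context-labeling idea, the following uses scikit-learn's LDA to assign each utterance the topic with the highest posterior probability, which can then serve as its context label. The toy utterances, topic count, and library choice are illustrative assumptions, not the authors' actual setup on the Providence corpus.

```python
# Hedged sketch: LDA topic assignments as a proxy for "activity" context labels.
# The corpus, number of topics, and preprocessing here are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy child-directed utterances (hypothetical, standing in for Providence data).
utterances = [
    "do you want some more juice",
    "pour the juice in your cup",
    "look at the doggy in the book",
    "read the book about the doggy",
]

# Bag-of-words counts, then fit an LDA model with a fixed seed.
X = CountVectorizer().fit_transform(utterances)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each utterance is labeled with its most probable topic; these labels
# would then condition the word segmentation model.
labels = lda.transform(X).argmax(axis=1)
print(labels.tolist())
```

A segmentation model conditioned on such labels can maintain context-specific lexicons, so that at inference time only lexical items plausible in the current activity compete for a parse.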

Original language: English
Title of host publication: COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers
Place of publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics, ACL Anthology
Number of pages: 9
ISBN (Electronic): 9781941643266
Publication status: Published - 2014
Event: 25th International Conference on Computational Linguistics, COLING 2014 - Dublin, Ireland
Duration: 23 Aug 2014 - 29 Aug 2014


Bibliographical note

Copyright the Author(s) 2014. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
