Unsupervised word segmentation in context

Gabriel Synnaeve, Isabelle Dautriche, Benjamin Borschinger, Mark Johnson, Emmanuel Dupoux

Research output: Chapter in Book/Report/Conference proceeding, Conference proceeding contribution, Research, peer-review

Abstract

This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of a top-performing segmentation model by 2.5% on a dataset of 27k utterances. We posit that word segmentation is easier in context because the learner is not trying to access irrelevant lexical items. We use topics from a Latent Dirichlet Allocation model as a proxy for "activity" contexts to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.
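For readers unfamiliar with the metric, the following is a minimal sketch of how a token F-score is commonly computed in word segmentation work; it is not taken from the paper, and the function names (`token_spans`, `token_f_score`) and the toy utterance are illustrative assumptions. A predicted token counts as correct only when both of its boundaries match a gold token.

```python
def token_spans(words):
    """Return the set of (start, end) character spans of a segmented utterance."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def token_f_score(gold_utterances, predicted_utterances):
    """Harmonic mean of token precision and recall over paired utterances.

    Assumes both arguments are lists of utterances, each utterance a list of
    words, and that each pair covers the same underlying character string.
    """
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_utterances, predicted_utterances):
        gold_spans, pred_spans = token_spans(gold), token_spans(pred)
        correct += len(gold_spans & pred_spans)  # tokens with both boundaries right
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the model under-segments "the book" as "thebook".
gold = [["you", "want", "the", "book"]]
pred = [["you", "want", "thebook"]]
score = token_f_score(gold, pred)
```

In this toy case two of the three predicted tokens are correct (precision 2/3) and two of the four gold tokens are recovered (recall 1/2), so the score is 4/7.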

Language: English
Title of host publication: COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics, ACL Anthology
Pages: 2326-2334
Number of pages: 9
ISBN (Electronic): 9781941643266
Publication status: Published - 2014
Event: 25th International Conference on Computational Linguistics, COLING 2014 - Dublin, Ireland
Duration: 23 Aug 2014 to 29 Aug 2014

Bibliographical note

Copyright the Author(s) 2014. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Cite this

Synnaeve, G., Dautriche, I., Borschinger, B., Johnson, M., & Dupoux, E. (2014). Unsupervised word segmentation in context. In COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers (pp. 2326-2334). Stroudsburg, PA: Association for Computational Linguistics, ACL Anthology.
@inproceedings{c40ee351bb0140bd8ad3270cce89fd94,
title = "Unsupervised word segmentation in context",
abstract = "This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of a top-performing segmentation model by 2.5{\%} on a dataset of 27k utterances. We posit that word segmentation is easier in context because the learner is not trying to access irrelevant lexical items. We use topics from a Latent Dirichlet Allocation model as a proxy for {"}activity{"} contexts to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.",
author = "Gabriel Synnaeve and Isabelle Dautriche and Benjamin Borschinger and Mark Johnson and Emmanuel Dupoux",
note = "Copyright the Author(s) 2014. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.",
year = "2014",
language = "English",
pages = "2326--2334",
booktitle = "COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers",
publisher = "Association for Computational Linguistics, ACL Anthology",
}

TY - GEN
T1 - Unsupervised word segmentation in context
AU - Synnaeve, Gabriel
AU - Dautriche, Isabelle
AU - Borschinger, Benjamin
AU - Johnson, Mark
AU - Dupoux, Emmanuel
N1 - Copyright the Author(s) 2014. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
PY - 2014
Y1 - 2014
AB - This paper extends existing word segmentation models to take non-linguistic context into account. It improves the token F-score of a top-performing segmentation model by 2.5% on a dataset of 27k utterances. We posit that word segmentation is easier in context because the learner is not trying to access irrelevant lexical items. We use topics from a Latent Dirichlet Allocation model as a proxy for "activity" contexts to label the Providence corpus. We present Adaptor Grammar models that use these context labels, and we study their performance with and without context annotations at test time.
UR - http://www.scopus.com/inward/record.url?scp=84959918809&partnerID=8YFLogxK
M3 - Conference proceeding contribution
SP - 2326
EP - 2334
BT - COLING 2014 - 25th International Conference on Computational Linguistics, Proceedings of COLING 2014: Technical Papers
PB - Association for Computational Linguistics, ACL Anthology
CY - Stroudsburg, PA
ER -
