Unsupervised text segmentation based on native language characteristics

Shervin Malmasi, Mark Dras, Mark Johnson, Lan Du, Magdalena Wolska

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; peer-review

Abstract

Most work on segmenting text does so on the basis of topic changes, but it can be of interest to segment by other, stylistically expressed characteristics such as change of authorship or native language. We propose a Bayesian unsupervised text segmentation approach to the latter. While baseline models achieve essentially random segmentation on our task, indicating its difficulty, a Bayesian model that incorporates appropriately compact language models and alternating asymmetric priors can achieve scores on the standard metrics around halfway to perfect segmentation.

Original language: English
Title of host publication: ACL 2017
Subtitle of host publication: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Long Papers)
Editors: Regina Barzilay, Min-Yen Kan
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics (ACL)
Pages: 1457-1469
Number of pages: 13
Volume: 1
ISBN (Electronic): 9781945626753
Publication status: Published - 2017
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: 30 Jul 2017 to 4 Aug 2017

Conference

Conference: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Country/Territory: Canada
City: Vancouver
Period: 30/07/17 to 4/08/17

Bibliographical note

Copyright the Publisher. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
