Grammar induction from (lots of) words alone

John K. Pate, Mark Johnson

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review



Grammar induction is the task of learning syntactic structure in a setting where that structure is hidden. Grammar induction from words alone is interesting because it is similar to the problem that a child learning a language faces. Previous work has typically assumed richer but cognitively implausible input, such as POS-tag annotated data, which makes that work less relevant to human language acquisition. We show that grammar induction from words alone is in fact feasible when the model is provided with sufficient training data, and present two new streaming or mini-batch algorithms for PCFG inference that can learn from millions of words of training data. We compare the performance of these algorithms to a batch algorithm that learns from less data. The mini-batch algorithms outperform the batch algorithm, showing that cheap inference with more data is better than intensive inference with less data. Additionally, we show that the harmonic initialiser, which previous work identified as essential when learning from small POS-tag annotated corpora (Klein and Manning, 2004), is not superior to a uniform initialisation.
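
The contrast between batch and mini-batch inference can be illustrated with a minimal sketch. It is not the authors' algorithm: it shows a generic stepwise update in which running PCFG rule counts are interpolated toward the counts obtained from each mini-batch, then normalised per left-hand side. The function names, the step-size schedule, and the toy per-batch counts are all assumptions made for illustration; in a real system the per-batch counts would be expected counts produced by an inference routine, which is omitted here.

```python
from collections import defaultdict

def step_size(k, kappa=0.7, tau=1.0):
    """Decaying step size for the k-th mini-batch (k = 0, 1, ...).
    The schedule and its constants are illustrative assumptions."""
    return (k + tau) ** (-kappa)

def update_counts(running, batch_counts, k):
    """Interpolate the running rule counts toward one mini-batch's counts."""
    eta = step_size(k)
    rules = set(running) | set(batch_counts)
    return {
        r: (1 - eta) * running.get(r, 0.0) + eta * batch_counts.get(r, 0.0)
        for r in rules
    }

def normalise(counts):
    """Convert rule counts to probabilities that sum to one per left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), c in counts.items():
        totals[lhs] += c
    return {
        (lhs, rhs): c / totals[lhs]
        for (lhs, rhs), c in counts.items()
        if totals[lhs] > 0
    }

# Hypothetical per-batch rule counts (toy numbers, not real data).
batches = [
    {("S", ("NP", "VP")): 3.0, ("S", ("VP",)): 1.0},
    {("S", ("NP", "VP")): 1.0, ("S", ("VP",)): 1.0},
]

running = {}
for k, batch in enumerate(batches):
    running = update_counts(running, batch, k)

probs = normalise(running)
```

Each update touches only one mini-batch, so the cost per step does not grow with corpus size; a batch procedure would instead recompute counts over the whole corpus before every parameter update.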

Original language: English
Title of host publication: COLING 2016 - the 26th International Conference on Computational Linguistics
Subtitle of host publication: Proceedings of COLING 2016: Technical Papers
Publisher: Association for Computational Linguistics, ACL Anthology
Number of pages: 10
ISBN (Print): 9784879747020
Publication status: Published - 1 Jan 2016
Event: 26th International Conference on Computational Linguistics, COLING 2016 - Osaka, Japan
Duration: 11 Dec 2016 to 16 Dec 2016


Conference: 26th International Conference on Computational Linguistics, COLING 2016

Bibliographical note

Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

