PCFGs, topic models, Adaptor Grammars and learning topical collocations and the structure of proper names

Mark Johnson*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

35 Citations (Scopus)

Abstract

This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as "topic models" to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer topic models as well. Adaptor Grammars (AGs) are a hierarchical, nonparametric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.
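The claim that an LDA topic model can be viewed as a special kind of PCFG can be made concrete with a small sketch. The Python below enumerates one rule schema that is consistent with that claim: each string starts with a document-identifier terminal, each following word is generated by first choosing a topic for that document and then a word for that topic. Under this reading, the probabilities on the `Doc_j -> Topic_i` rules would play the role of the per-document topic distributions, and the probabilities on the `Topic_i -> w` rules the per-topic word distributions. The function name, the nonterminal spellings, and the toy vocabulary are assumptions made for illustration and are not taken from this record.

```python
def lda_as_pcfg_rules(num_docs, num_topics, vocab):
    """Enumerate (lhs, rhs) rules of a PCFG whose trees mirror LDA's generative story.

    Illustrative sketch only: nonterminal names are hypothetical, and rule
    probabilities (which would carry the LDA parameters) are not assigned here.
    """
    rules = []
    for j in range(1, num_docs + 1):
        spine = f"Doc'_{j}"
        # The start symbol picks which document the string belongs to.
        rules.append(("Sentence", (spine,)))
        # The left-most terminal is a document identifier such as "_3".
        rules.append((spine, (f"_{j}",)))
        # Left-recursive spine: one Doc_j node is added per word in the document.
        rules.append((spine, (spine, f"Doc_{j}")))
        # Each word position chooses a topic; these rules correspond to
        # the document-specific topic distribution.
        for i in range(1, num_topics + 1):
            rules.append((f"Doc_{j}", (f"Topic_{i}",)))
    # Each topic emits a word; these rules correspond to the
    # topic-specific word distribution.
    for i in range(1, num_topics + 1):
        for w in vocab:
            rules.append((f"Topic_{i}", (w,)))
    return rules


if __name__ == "__main__":
    # Toy sizes chosen only to keep the printed grammar short.
    for lhs, rhs in lda_as_pcfg_rules(2, 3, ["shallow", "circuits"]):
        print(lhs, "->", " ".join(rhs))
```

With `m` documents, `k` topics, and a vocabulary of size `V`, this schema has `3m + mk + kV` rules, so the grammar grows with the corpus, which is one reason general-purpose PCFG inference is a conceptual bridge here and not necessarily an efficient replacement for dedicated LDA samplers.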

Original language: English
Title of host publication: ACL 2010 - 48th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics (ACL)
Pages: 1148-1157
Number of pages: 10
ISBN (Print): 9781617388088
Publication status: Published - 2010
Event: 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010 - Uppsala, Sweden
Duration: 11 Jul 2010 to 16 Jul 2010

Other

Other: 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010
Country/Territory: Sweden
City: Uppsala
Period: 11/07/10 to 16/07/10
