PCFG Models of Linguistic Tree Representations

Mark Johnson*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

241 Citations (Scopus)


The kinds of tree representations used in a treebank corpus can have a dramatic effect on performance of a parser based on the PCFG estimated from that corpus, causing the estimated likelihood of a tree to differ substantially from its frequency in the training corpus. This paper points out that the Penn II treebank representations are of the kind predicted to have such an effect, and describes a simple node relabeling transformation that improves a treebank PCFG-based parser's average precision and recall by around 8%, or approximately half of the performance difference between a simple PCFG model and the best broad-coverage parsers available today. This performance variation comes about because any PCFG, and hence the corpus of trees from which the PCFG is induced, embodies independence assumptions about the distribution of words and phrases. The particular independence assumptions implicit in a tree representation can be studied theoretically and investigated empirically by means of a tree transformation/detransformation process.
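The effect described in the abstract can be illustrated with a small sketch. The abstract does not specify the relabeling, so the code below assumes one common choice, parent annotation, in which each nonterminal label is suffixed with its parent's label. The tuple-based tree encoding, the function names (`parent_annotate`, `deannotate`, `productions`, `estimate_pcfg`), and the two-tree toy corpus are all hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
from collections import Counter

# Assumed encoding: a tree is (label, child, child, ...); a leaf is a plain string.

def parent_annotate(tree, parent=None):
    """Relabel every nonterminal as 'label^parent'; the root keeps its label."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    return (new_label, *(parent_annotate(c, label) for c in children))

def deannotate(tree):
    """Undo parent_annotate by dropping everything after '^' in each label."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return (label.split("^")[0], *(deannotate(c) for c in children))

def productions(tree):
    """Yield (lhs, rhs) for every local tree, top-down."""
    if isinstance(tree, str):
        return
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        yield from productions(c)

def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(p for t in trees for p in productions(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Toy corpus: subject NPs are pronouns, object NPs are determiner + noun.
t1 = ("S", ("NP", ("PRP", "she")),
           ("VP", ("V", "saw"), ("NP", ("D", "the"), ("N", "dog"))))
t2 = ("S", ("NP", ("PRP", "he")),
           ("VP", ("V", "saw"), ("NP", ("D", "a"), ("N", "cat"))))
corpus = [t1, t2]

plain = estimate_pcfg(corpus)
annotated = estimate_pcfg([parent_annotate(t) for t in corpus])

print(plain[("NP", ("PRP",))])              # 0.5
print(annotated[("NP^S", ("PRP^NP",))])     # 1.0
print(deannotate(parent_annotate(t1)) == t1)  # True
```

In the untransformed grammar, the rule NP → PRP gets probability 0.5 regardless of where the NP occurs, because a PCFG treats an NP's expansion as independent of its context. After relabeling, subject and object NPs have distinct labels, so the estimate can capture that pronouns occur only in subject position in this toy corpus. Detransforming the trees recovers the original labels, so output can still be scored against the original treebank representation.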

Original language: English
Pages (from-to): 613-632
Number of pages: 20
Journal: Computational Linguistics
Issue number: 4
Publication status: Published - Dec 1998
Externally published: Yes

