Estimating an author’s vocabulary

Donald R. McNeil*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

The problem of estimating an author’s vocabulary, given a sample of the author’s writings, is considered. It is assumed that the vocabulary is fixed and finite, and that the author writes a composition by successively drawing words from this collection, independently of the previous configuration. Attention is focussed on the random variable X(n), the total number of different words used in a sample of n. It is shown that under fairly general conditions, the distribution of X(n), suitably normalized and scaled, is asymptotically Gaussian, and this result may be used to obtain a large sample estimator of vocabulary size.
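The sampling model described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's estimator: the function names and the uniform word probabilities in the usage note are assumptions made for illustration. It draws n words independently from a fixed, finite vocabulary, counts X(n), and compares the simulated mean with the exact expectation E[X(n)] = sum over words i of 1 - (1 - p_i)^n, which follows directly from the independence assumption.

```python
import random


def distinct_words_in_sample(probs, n, rng):
    """Draw n words independently from a fixed vocabulary and return X(n).

    probs[i] is the (assumed) probability of drawing word i on each draw.
    """
    vocabulary = range(len(probs))
    draws = rng.choices(vocabulary, weights=probs, k=n)
    return len(set(draws))


def expected_distinct(probs, n):
    """Exact E[X(n)]: word i appears at least once with prob. 1 - (1 - p_i)^n."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)


def simulate(probs, n, reps, seed=0):
    """Return reps independent realisations of X(n)."""
    rng = random.Random(seed)
    return [distinct_words_in_sample(probs, n, rng) for _ in range(reps)]


if __name__ == "__main__":
    # Hypothetical setup: a 50-word vocabulary with equal probabilities.
    vocabulary_size = 50
    probs = [1.0 / vocabulary_size] * vocabulary_size
    samples = simulate(probs, n=100, reps=2000)
    mean = sum(samples) / len(samples)
    print("simulated mean of X(n):", mean)
    print("exact expectation:     ", expected_distinct(probs, 100))
```

A histogram of the simulated values of X(n) for large n and a large vocabulary gives an informal look at the asymptotic normality stated in the abstract. The equal-probability vocabulary is only a convenient special case; the abstract itself refers to fairly general conditions.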

Original language: English
Pages (from-to): 92-96
Number of pages: 5
Journal: Journal of the American Statistical Association
Issue number: 341
Publication status: Published - 1973
Externally published: Yes


