Abstract
The problem of estimating an author’s vocabulary, given a sample of the author’s writings, is considered. It is assumed that the vocabulary is fixed and finite, and that the author writes a composition by successively drawing words from this collection, independently of the previous configuration. Attention is focussed on the random variable X(n), the total number of different words used in a sample of n. It is shown that under fairly general conditions, the distribution of X(n), suitably normalized and scaled, is asymptotically Gaussian, and this result may be used to obtain a large sample estimator of vocabulary size.
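The sampling model described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's estimator. It only draws n words independently from a fixed, finite vocabulary and counts the distinct ones, X(n), then compares the simulated mean with the standard identity E[X(n)] = Σᵢ (1 − (1 − pᵢ)ⁿ). The Zipf-like word probabilities, the vocabulary size, and all function names are assumptions chosen for illustration.

```python
import random


def zipf_probs(vocab_size):
    """Hypothetical word probabilities, proportional to 1/rank (an assumption)."""
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(probs, n):
    """E[X(n)] = sum_i (1 - (1 - p_i)^n) under independent draws."""
    return sum(1.0 - (1.0 - p) ** n for p in probs)


def simulate_distinct(probs, n, rng):
    """Draw n words independently and return X(n), the number of distinct words."""
    words = rng.choices(range(len(probs)), weights=probs, k=n)
    return len(set(words))


if __name__ == "__main__":
    rng = random.Random(0)      # fixed seed so runs are repeatable
    probs = zipf_probs(500)     # assumed vocabulary of 500 words
    n = 2000                    # sample size (number of words written)
    samples = [simulate_distinct(probs, n, rng) for _ in range(200)]
    mean = sum(samples) / len(samples)
    print(f"simulated mean of X(n): {mean:.1f}")
    print(f"expected value E[X(n)]: {expected_distinct(probs, n):.1f}")
```

A histogram of `samples` would be one informal way to inspect the approximately Gaussian shape that the abstract refers to.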
Original language | English |
---|---|
Pages (from-to) | 92-96 |
Number of pages | 5 |
Journal | Journal of the American Statistical Association |
Volume | 68 |
Issue number | 341 |
Publication status | Published - 1973 |
Externally published | Yes |