Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports

Juan C. Quiroz*, Liliana Laranjo, Catalin Tufanaru, Ahmet Baki Kocaballi, Dana Rezazadegan, Shlomo Berkovsky, Enrico Coiera

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Background: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.

Objective: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language in which word frequency follows a discrete power-law distribution.

Method: We examined 20,000 medical discharge reports from the MIMIC-III dataset. We split the discharge reports into tokens, counted token frequencies, fitted power-law distributions to the data, and tested whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.
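
The first three steps of this pipeline (tokenizing, counting, and fitting a power law) can be illustrated with a minimal sketch. It is not the authors' implementation: the regex tokenizer, the function names `token_frequencies` and `estimate_alpha`, the choice `xmin=1`, and the two toy report strings are all assumptions made for illustration. The exponent is estimated with the commonly used closed-form approximation to the discrete power-law maximum likelihood estimate, which is known to be rough for small `xmin`. Comparing against alternative distributions is not shown.

```python
import math
import re
from collections import Counter


def token_frequencies(reports):
    """Split each report into lowercase alphanumeric tokens and count them.

    The regex tokenizer is an assumption; the abstract does not say
    which tokenizer was used.
    """
    counts = Counter()
    for text in reports:
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


def estimate_alpha(frequencies, xmin=1):
    """Approximate MLE of the exponent of a discrete power law.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over all x_i >= xmin.
    This closed form is an approximation and is least accurate for small xmin.
    """
    tail = [x for x in frequencies if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy stand-ins for discharge reports (hypothetical text, not MIMIC-III data).
reports = [
    "Patient admitted with chest pain. Patient discharged home.",
    "Patient stable on discharge. Follow up with cardiology.",
]

counts = token_frequencies(reports)
alpha = estimate_alpha(counts.values(), xmin=1)
print(counts.most_common(3))
print(f"estimated exponent: {alpha:.2f}")
```

On a real corpus, the fitted power law would then be compared against the alternative distributions, for example with likelihood-ratio tests, to decide which family describes the token frequencies best.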

Result: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, as the truncated power-law provides a better fit than a pure power-law.

Conclusion: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.

Original language: English
Article number: 104324
Pages (from-to): 1-9
Number of pages: 9
Journal: International Journal of Medical Informatics
Volume: 145
Early online date: 2 Nov 2020
Publication status: Published - Jan 2021

Keywords

  • Data mining
  • MIMIC-III dataset
  • Machine learning
  • Maximum likelihood estimation
  • Power-law with exponential cut-off
  • Statistical distributions
