Abstract
Background: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.
Objective: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language where word frequency follows a discrete power-law distribution.
Method: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.
Result: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, in that the truncated power-law provides a superior fit to a pure power-law.
Conclusion: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors, and from non-parametric models that capture power-law behavior.
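The first three steps named in the Method (tokenising, counting token frequency, and fitting a power-law) can be illustrated with a minimal sketch. This is not the authors' code: the tokenisation rule, the function names, the choice of `xmin`, and the toy sentences are all assumptions made for illustration, and the exponent is estimated with a continuous approximation to the discrete maximum-likelihood estimator rather than an exact fit. The comparison against alternative distributions is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters, digits, or apostrophes.

    This is an assumed, deliberately simple tokenisation rule.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def token_frequencies(reports):
    """Count how often each token occurs across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return counts


def fit_power_law_alpha(values, xmin=1):
    """Approximate maximum-likelihood exponent for values >= xmin.

    Uses the continuous approximation given by Clauset, Shalizi and Newman:
        alpha ~= 1 + n / sum(ln(x / (xmin - 0.5)))
    The approximation is known to be rough for small xmin, so treat the
    result as indicative only.
    """
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no values at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy sentences invented for illustration; they are not MIMIC-III data.
example_reports = [
    "Patient stable. Patient discharged home.",
    "Patient admitted with chest pain.",
]
example_counts = token_frequencies(example_reports)
example_alpha = fit_power_law_alpha(list(example_counts.values()))
```

On a real corpus, the fitted exponent would then be compared with fits of the alternative distributions, for example by likelihood-ratio tests, before drawing any conclusion about Zipfian behaviour.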
| Original language | English |
|---|---|
| Article number | 104324 |
| Pages (from-to) | 1-9 |
| Number of pages | 9 |
| Journal | International Journal of Medical Informatics |
| Volume | 145 |
| Early online date | 2 Nov 2020 |
| Publication status | Published - Jan 2021 |
Keywords
- Data mining
- MIMIC-III dataset
- Machine learning
- Maximum likelihood estimation
- Power-law with exponential cut-off
- Statistical distributions
Fingerprint
Dive into the research topics of 'Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports'. Together they form a unique fingerprint.
Projects
- 1 Finished
Centre of Research Excellence in Digital Health (CREDiH)
Coiera, E. (Chief Investigator), Glasziou, P. (Chief Investigator), Hansen, D. (Chief Investigator), Magrabi, F. (Chief Investigator), Sintchenko, V. (Chief Investigator), Verspoor, K. (Chief Investigator), Gallego-Luxan, B. (Chief Investigator), Lau, A. (Chief Investigator), Dunn, A. (Associate Investigator), Longhurst, C. (Associate Investigator), Tsafnat, G. (Associate Investigator), Cutler, H. (Associate Investigator), Makeham, M. (Associate Investigator), Shaw, T. (Associate Investigator), Shah, N. (Associate Investigator), Runciman, W. (Chief Investigator) & Liaw, S. T. (Chief Investigator)
1/01/18 → 31/12/22
Project: Research