Abstract
Background: Bayesian modelling and statistical text analysis rely on informed probability priors to encourage good solutions.
Objective: This paper empirically analyses whether text in medical discharge reports follows Zipf's law, a commonly assumed statistical property of language where word frequency follows a discrete power-law distribution.
Method: We examined 20,000 medical discharge reports from the MIMIC-III dataset. Methods included splitting the discharge reports into tokens, counting token frequency, fitting power-law distributions to the data, and testing whether alternative distributions (lognormal, exponential, stretched exponential, and truncated power-law) provided superior fits to the data.
Result: Discharge reports are best fit by the truncated power-law and lognormal distributions. Discharge reports appear to be near-Zipfian, in that the truncated power-law provides a superior fit over a pure power-law.
Conclusion: Our findings suggest that Bayesian modelling and statistical text analysis of discharge report text would benefit from using truncated power-law and lognormal probability priors and non-parametric models that capture power-law behavior.
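The first steps of the method (tokenising, counting token frequency, and fitting a power-law) can be illustrated with a minimal sketch. This is not the authors' code: the tokeniser, the function names, and the toy reports are hypothetical, and the exponent is estimated with the standard approximate maximum-likelihood formula for discrete power laws rather than whatever fitting procedure the paper used. The comparison against lognormal, exponential, stretched exponential, and truncated power-law alternatives is not covered here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_frequencies(reports):
    """Count how often each token occurs across all reports."""
    counts = Counter()
    for report in reports:
        counts.update(tokenize(report))
    return counts


def estimate_alpha(frequencies, xmin=1):
    """Approximate MLE of a discrete power-law exponent for values >= xmin.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))). The approximation
    is rougher for small xmin, so treat the result as indicative only.
    """
    tail = [x for x in frequencies if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical toy input; real MIMIC-III discharge reports are not reproduced here.
reports = [
    "patient admitted with chest pain",
    "patient discharged home in stable condition",
    "chest pain resolved and patient stable",
]
counts = token_frequencies(reports)
alpha = estimate_alpha(counts.values(), xmin=1)
```

On a real corpus, `counts.most_common()` gives the rank-frequency ordering that Zipf's law describes, and the fitted exponent would then be compared against the alternative distributions using likelihood-based tests.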
Original language | English |
---|---|
Article number | 104324 |
Pages (from-to) | 1-9 |
Number of pages | 9 |
Journal | International Journal of Medical Informatics |
Volume | 145 |
Early online date | 2 Nov 2020 |
Publication status | Published - Jan 2021 |
Keywords
- Data mining
- MIMIC-III dataset
- Machine learning
- Maximum likelihood estimation
- Power-law with exponential cut-off
- Statistical distributions
Title: Empirical analysis of Zipf's law, power law, and lognormal distributions in medical discharge reports
Projects
- 1 Finished
- Centre of Research Excellence in Digital Health (CREDiH)
Coiera, E., Glasziou, P., Hansen, D., Magrabi, F., Sintchenko, V., Verspoor, K., Gallego-Luxan, B., Lau, A., Dunn, A., Longhurst, C., Tsafnat, G., Cutler, H., Makeham, M., Shaw, T., Shah, N., Runciman, W. & Liaw, S. T.
1/01/18 → 31/12/22
Project: Research