RHMD: a real-world dataset for health mention classification on Reddit

Usman Naseem*, Matloob Khushi, Jinman Kim, Adam G. Dunn

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)


People on social media share their thoughts and experiences using disease and symptom words for purposes other than describing their own health, which can introduce biases in data-driven public health applications. To advance research on health mention classification (HMC), we present the Reddit health mention dataset (RHMD), a new multi-domain dataset of Reddit posts for HMC. RHMD is composed of 10,015 manually annotated Reddit posts that include 15 common disease or symptom terms and are annotated with four labels: personal health mentions (HMs), nonpersonal HMs, figurative HMs, and hyperbolic HMs. Empirical evaluation using recently proposed methods demonstrates the challenge of labeling user-generated text across these four types. The contributions of this work include the public release of a robustly annotated Reddit dataset (RHMD) for HM tasks and a comprehensive performance analysis of baseline methods. We expect that the release of the dataset and the evaluations will help facilitate the development of new methods for detecting HMs in user-generated text. The dataset is available at https://github.com/usmaann/RHMD-Health-Mention-Dataset.

Original language: English
Pages (from-to): 2325-2334
Number of pages: 10
Journal: IEEE Transactions on Computational Social Systems
Issue number: 5
Early online date: 11 Jul 2022
Publication status: Published - Oct 2023
Externally published: Yes

