Identification of disease or symptom terms in Reddit to improve health mention classification

Usman Naseem, Jinman Kim, Matloob Khushi, Adam G. Dunn

Research output: Chapter in Book/Report/Conference proceeding, Conference proceeding contribution, peer-review

22 Citations (Scopus)

Abstract

In user-generated text, such as posts on social media platforms and online forums, people often use disease or symptom terms in ways other than to describe their health. In data-driven public health surveillance, the health mention classification (HMC) task aims to identify posts in which users are discussing health conditions rather than using disease and symptom terms for other reasons. Existing computational research typically studies health mentions only on Twitter, covers a limited set of disease or symptom terms, and ignores both user behavior information and the other ways people use disease or symptom terms. To advance HMC research, we present the Reddit health mention dataset (RHMD), a new dataset of multi-domain Reddit data for HMC. RHMD consists of 10,015 manually labeled Reddit posts that mention 15 common disease or symptom terms and are annotated with four labels: personal health mentions, non-personal health mentions, figurative health mentions, and hyperbolic health mentions. With RHMD, we propose HMCNET, which hierarchically combines target keyword (disease or symptom term) identification and user behavior to improve HMC. Experimental results demonstrate that the proposed approach outperforms state-of-the-art methods with an F1-score of 0.75 (an increase of 11% over the state of the art) and show that our new dataset poses a strong challenge to existing HMC methods.
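The four-label scheme and the F1-based evaluation mentioned in the abstract can be illustrated with a small Python sketch. This is not HMCNET and not the authors' code. The keyword list `EXAMPLE_TERMS`, the helper `find_target_terms`, and the choice of macro averaging in `macro_f1` are all assumptions made for illustration; the abstract neither lists the 15 terms nor states which F1 averaging is reported.

```python
import re
from enum import Enum


class HealthMention(Enum):
    """The four annotation labels named in the abstract."""
    PERSONAL = "personal health mention"
    NON_PERSONAL = "non-personal health mention"
    FIGURATIVE = "figurative health mention"
    HYPERBOLIC = "hyperbolic health mention"


# Hypothetical keyword list; the abstract does not enumerate the 15 terms.
EXAMPLE_TERMS = ["headache", "heart attack", "depression"]


def find_target_terms(post, terms=EXAMPLE_TERMS):
    """Return the terms that occur in the post as whole words (case-insensitive)."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, post, flags=re.IGNORECASE):
            found.append(term)
    return found


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over labels seen in gold or predicted."""
    labels = set(gold) | set(predicted)
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

A post such as "I almost had a heart attack when I saw the bill" contains a target term but would plausibly carry a figurative or hyperbolic label instead of a personal one. Keyword matching alone cannot make that distinction, which is why the task is framed as classification.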

Original language: English
Title of host publication: WWW '22
Subtitle of host publication: Proceedings of the ACM Web Conference 2022
Place of Publication: New York
Publisher: Association for Computing Machinery
Pages: 2573-2581
Number of pages: 9
ISBN (Electronic): 9781450390965
Publication status: Published - 2022
Externally published: Yes
Event: 31st ACM World Wide Web Conference, WWW 2022 - Virtual, Online, France
Duration: 25 Apr 2022 to 29 Apr 2022

Conference

Conference: 31st ACM World Wide Web Conference, WWW 2022
Country/Territory: France
City: Virtual, Online
Period: 25/04/22 to 29/04/22

Keywords

  • Health Mention Classification
  • Public Health Surveillance
  • Reddit
