A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter

Usman Naseem*, Imran Razzak, Peter W. Eklund

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

61 Citations (Scopus)

Abstract

Pre-processing plays an essential role in disambiguating the meaning of short texts, not only in applications that classify short texts but also in clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature than feature extraction and classification. This paper analyzes twelve different pre-processing techniques on three pre-classified Twitter datasets on hate speech and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing that applies different pre-processing techniques so as to retain features without information loss. Two different word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques negatively affect performance; these are identified, along with the best-performing combination of pre-processing techniques.
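
As an illustration of what composing several pre-processing techniques on a tweet can look like, here is a minimal Python sketch. The abstract does not list the twelve techniques or the proposed ordering, so the steps below (URL removal, mention removal, hashtag unpacking, lowercasing, whitespace collapsing), their order, and every function name are assumptions chosen for illustration only, not the authors' package.

```python
import re
from typing import Callable, List

# A pre-processing step maps one string to another.
Step = Callable[[str], str]

# Hypothetical patterns; real tweet tokenizers handle more edge cases.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
SPACE_RE = re.compile(r"\s+")


def remove_urls(text: str) -> str:
    """Replace URLs with a space so neighbouring tokens stay separated."""
    return URL_RE.sub(" ", text)


def remove_mentions(text: str) -> str:
    """Replace @user mentions with a space."""
    return MENTION_RE.sub(" ", text)


def unpack_hashtags(text: str) -> str:
    """Drop the '#' but keep the tag text, which may carry meaning."""
    return HASHTAG_RE.sub(r"\1", text)


def lowercase(text: str) -> str:
    return text.lower()


def collapse_whitespace(text: str) -> str:
    """Squeeze runs of whitespace into single spaces and trim the ends."""
    return SPACE_RE.sub(" ", text).strip()


def preprocess(text: str, steps: List[Step]) -> str:
    """Apply each step in order; the order is part of the configuration."""
    for step in steps:
        text = step(text)
    return text


# An assumed ordering, not the combination identified in the paper.
DEFAULT_STEPS: List[Step] = [
    remove_urls,
    remove_mentions,
    unpack_hashtags,
    lowercase,
    collapse_whitespace,
]

if __name__ == "__main__":
    print(preprocess("@user Check THIS out https://t.co/abc #NoHate", DEFAULT_STEPS))
```

Keeping each technique as a separate function makes it straightforward to switch individual steps on or off and measure how each one affects a downstream classifier, which is the kind of comparison the abstract describes.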

Original language: English
Pages (from-to): 35239-35266
Number of pages: 28
Journal: Multimedia Tools and Applications
Volume: 80
Issue number: 28-29
Publication status: Published - Nov 2021
Externally published: Yes

Keywords

  • Natural language processing
  • Text pre-processing
  • Tweet classification
  • Machine learning
