Imbalanced classification for protein subcellular localization with multilabel oversampling

Priyanka Rana, Arcot Sowmya, Erik Meijering, Yang Song*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Motivation: Subcellular localization of human proteins is essential to comprehend their functions and roles in physiological processes, which in turn helps in diagnostic and prognostic studies of pathological conditions and impacts clinical decision-making. Since proteins reside at multiple locations at the same time and a few subcellular locations host far more proteins than other locations, the computational task for their subcellular localization is to train a multilabel classifier while handling data imbalance. In imbalanced data, minority classes are underrepresented, thus leading to a heavy bias towards the majority classes and the degradation of predictive capability for the minority classes. Furthermore, data imbalance in multilabel settings is an even more complex problem due to the coexistence of majority and minority classes.

Results: Our studies reveal that based on the extent of concurrence of majority and minority classes, oversampling of minority samples through appropriate data augmentation techniques holds promising scope for boosting the classification performance for the minority classes. We measured the magnitude of data imbalance per class and the concurrence of majority and minority classes in the dataset. Based on the obtained values, we identified minority and medium classes, and a new oversampling method is proposed that includes non-linear mixup, geometric and colour transformations for data augmentation and a sampling approach to prepare minibatches. Performance evaluation on the Human Protein Atlas Kaggle challenge dataset shows that the proposed method is capable of achieving better predictions for minority classes than existing methods.
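The two quantities mentioned above, the per-class magnitude of imbalance and the concurrence of majority and minority classes, can be illustrated with a minimal sketch. The abstract does not name the exact measures used, so the definitions below (a max-count imbalance ratio and a simple co-occurrence fraction), the function names, the toy label matrix and the threshold of 2.0 are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def imbalance_ratios(Y):
    """Per-class imbalance ratio (assumed definition):
    count of the most frequent class divided by the count of each class."""
    counts = Y.sum(axis=0).astype(float)
    return counts.max() / np.maximum(counts, 1.0)  # guard against empty classes

def minority_majority_concurrence(Y, minority_mask):
    """Fraction of samples with at least one minority label that also
    carry at least one non-minority label (assumed definition)."""
    has_minority = Y[:, minority_mask].any(axis=1)
    has_majority = Y[:, ~minority_mask].any(axis=1)
    n_minority = has_minority.sum()
    if n_minority == 0:
        return 0.0
    return float((has_minority & has_majority).sum() / n_minority)

# Toy multilabel matrix: rows are samples, columns are classes (hypothetical data).
Y = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
])

ratios = imbalance_ratios(Y)
minority_mask = ratios > 2.0  # hypothetical threshold for calling a class "minority"
concurrence = minority_majority_concurrence(Y, minority_mask)
print(ratios, minority_mask, concurrence)
```

A high concurrence value would suggest that oversampling minority samples also replicates majority labels, which is why the extent of concurrence matters when choosing an oversampling strategy.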

Availability and implementation: Data used in this study are available at https://www.kaggle.com/competitions/human-protein-atlas-image-classification/data. Source code is available at https://github.com/priyarana/Protein-subcellular-localisation-method.
Original language: English
Article number: btac841
Pages (from-to): 1-7
Number of pages: 7
Journal: Bioinformatics
Volume: 39
Issue number: 1
Early online date: 29 Dec 2022
Publication status: Published - 1 Jan 2023
Externally published: Yes

Bibliographical note

Copyright the Author(s) 2022. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Humans
  • Algorithms
  • Proteins/metabolism
  • Software
  • Clinical Decision-Making
  • Protein Transport
