ENHANCING PREDICTIVE PERFORMANCE IN COVID-19 HEALTHCARE DATASETS: A CASE STUDY BASED ON HYPER ADASYN OVER-SAMPLING AND GENETIC FEATURE SELECTION


Date

2024

Journal Title

Journal ISSN

Volume Title

Publisher

Taylors Univ Sdn Bhd

Access Rights

info:eu-repo/semantics/closedAccess

Abstract

Predictive analytics is widely applied in the health industry, where it helps improve forecast accuracy on big data. Healthcare datasets, however, are frequently imbalanced. This study uses two COVID-19 datasets from Kaggle as a case study of dataset imbalance. On imbalanced datasets such as these, conventional sampling methods like ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning) tend to yield only modest accuracy. To also address the problem of finding the optimal features, this study proposes a novel approach that combines oversampling techniques with genetic feature selection (GFs) using laboratory data. The method aims to construct machine-learning clinical prediction models for identifying COVID-19-infected patients on two widely recognized datasets. By combining hyper ADASYN over-sampling with genetic feature selection, it stands out for its unprecedented precision in identifying the features that are crucial for accurate predictions. Unlike the traditional approach, the proposed hypermodel both solves the class imbalance problem and tunes the feature space, which brings a dramatic increase in accuracy, precision, recall, and overall predictive performance. The approach significantly enhanced classifier performance: the Random Forest (RF) model with n trees reached 99% accuracy, with 99% precision, recall, and F1-score on each dataset. The Decision Tree (DT) model achieved 92% on all metrics for Dataset I and 95% on all metrics for Dataset II. The Multilayer Perceptron (MLP) achieved 99% on all metrics for both datasets. Gradient Boosting (XGB) achieved 97% on all metrics for Dataset I and 98% on all metrics for Dataset II. These results underscore the efficacy of the proposed method in balancing COVID-19 datasets and enhancing predictive accuracy.
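As a rough illustration of the oversampling step mentioned in the abstract, the sketch below implements the standard ADASYN idea in plain Python: minority points whose neighbourhoods contain more majority-class points receive more synthetic samples, and each synthetic sample is interpolated between a minority point and one of its minority neighbours. It is not the paper's "hyper ADASYN" variant, and it omits the genetic feature selection stage. The function name `adasyn`, its parameters, and the toy data are assumptions made for this example only.

```python
import math
import random


def adasyn(X, y, minority=1, k=3, beta=1.0, seed=0):
    """Generate synthetic minority samples in the style of ADASYN (illustrative sketch)."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_min = len(minority_idx)
    n_maj = len(y) - n_min
    # Total number of synthetic samples requested; beta=1.0 targets full balance.
    total = round((n_maj - n_min) * beta)
    if total <= 0:
        return []

    def nearest(i, candidates):
        # Up to k nearest candidates to X[i] by Euclidean distance, excluding i itself.
        others = [j for j in candidates if j != i]
        return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

    # For each minority point: share of majority-class points among its k neighbours.
    ratios = [
        sum(y[j] != minority for j in nearest(i, range(len(y)))) / k
        for i in minority_idx
    ]
    norm = sum(ratios)
    if norm == 0:
        return []

    synthetic = []
    for i, r in zip(minority_idx, ratios):
        # Harder points (higher ratio) get a larger share of the budget.
        # Per-point rounding means the final count may differ slightly from `total`.
        g_i = round(r / norm * total)
        neighbours = nearest(i, minority_idx)
        if not neighbours:
            continue
        for _ in range(g_i):
            j = rng.choice(neighbours)
            lam = rng.random()
            # Linear interpolation between the point and a minority neighbour.
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy data (hypothetical): six majority points (label 0), three minority points (label 1).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 1],
     [1.5, 1.5], [5, 5], [5, 6]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
synthetic = adasyn(X, y, minority=1, k=2)
print(len(synthetic), "synthetic minority samples generated")
```

In practice a maintained implementation such as the `ADASYN` class in the imbalanced-learn package would normally be preferred, and oversampling should be applied to the training split only so that synthetic points do not leak into evaluation.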

Description

Keywords

ADASYN, Genetic feature selection, Healthcare, Imbalanced datasets, Oversampling, Predictive analytics

Source

Journal of Engineering Science and Technology

WoS Q Value

Scopus Q Value

Q3

Volume

19

Issue

2

Citation