ENHANCING PREDICTIVE PERFORMANCE IN COVID-19 HEALTHCARE DATASETS: A CASE STUDY BASED ON HYPER ADASYN OVER-SAMPLING AND GENETIC FEATURE SELECTION
Date
2024
Journal Title
Journal ISSN
Volume Title
Publisher
Taylors Univ Sdn Bhd
Access Rights
info:eu-repo/semantics/closedAccess
Abstract
Predictive analytics is widely applied in the healthcare industry, where it improves forecast accuracy by drawing on big data. Healthcare datasets, however, are frequently imbalanced. In this study, two COVID-19 datasets from Kaggle serve as a case study of dataset imbalance. On imbalanced datasets such as these, conventional sampling methods like ADASYN (Adaptive Synthetic Sampling Approach for Imbalanced Learning) tend to yield only modest accuracy. To also address the problem of finding the optimal features, this study proposes a novel approach that combines oversampling techniques with genetic feature selection (GFs) using laboratory data. The method aims to construct machine-learning clinical prediction models for identifying COVID-19-infected patients from two widely recognized datasets. By using hyper ADASYN over-sampling and genetic feature selection, it stands out for its unprecedented precision in identifying the features crucial for accurate predictions. Unlike the traditional approach, our hypermodel both resolves the class imbalance problem and tunes the feature space, bringing about a dramatic increase in accuracy, precision, recall, and overall predictive performance. Our approach significantly enhanced classifier performance: the Random Forest (RF) model with n trees reached an accuracy of up to 99%, with 99% precision, 99% recall, and a 99% F1-score on each dataset. The Decision Tree (DT) model achieved 92% on all metrics for Dataset I and 95% on all metrics for Dataset II. The Multilayer Perceptron (MLP) achieved 99% on all metrics for both datasets. Gradient Boosting (XGB) achieved 97% on all metrics for Dataset I and 98% on all metrics for Dataset II.
These results underscore the efficacy of our proposed method in balancing COVID-19 datasets and enhancing predictive accuracy.
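The abstract does not say how hyper ADASYN or the genetic search is configured, so the following is only a minimal sketch of the general idea: plain ADASYN-style oversampling followed by a simple genetic search over feature masks, written with the Python standard library. The function names (`knn`, `adasyn_like`, `fitness`, `genetic_select`), all parameter values, and the toy data are assumptions made for illustration, and the leave-one-out 1-nearest-neighbour fitness is a stand-in for the classifiers evaluated in the study.

```python
import math
import random


def knn(point, pool, k):
    """Return indices of the k rows in pool nearest to point (Euclidean)."""
    return sorted(range(len(pool)), key=lambda j: math.dist(point, pool[j]))[:k]


def adasyn_like(X, y, minority=1, k=3, rng=None):
    """ADASYN-style oversampling; assumes at least two minority rows."""
    rng = rng or random.Random(0)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    need = len(y) - 2 * len(min_idx)  # synthetic rows needed to balance classes
    if need <= 0:
        return list(X), list(y)
    # r_i: share of majority rows among the k nearest neighbours of each minority row
    ratios = []
    for i in min_idx:
        others = [j for j in range(len(X)) if j != i]
        near = knn(X[i], [X[j] for j in others], k)
        ratios.append(sum(y[others[n]] != minority for n in near) / k)
    total = sum(ratios)
    uniform = [1 / len(ratios)] * len(ratios)  # fallback when no row is "hard"
    weights = [r / total for r in ratios] if total > 0 else uniform
    X_new, y_new = list(X), list(y)
    for pos, i in enumerate(min_idx):
        # Simplification: interpolate only towards the nearest minority rows
        mates = [X[j] for j in min_idx if j != i]
        near = knn(X[i], mates, k)
        for _ in range(round(weights[pos] * need)):
            mate = mates[rng.choice(near)]
            lam = rng.random()
            X_new.append([a + lam * (b - a) for a, b in zip(X[i], mate)])
            y_new.append(minority)
    return X_new, y_new


def fitness(mask, X, y):
    """Leave-one-out 1-NN accuracy using only the features switched on in mask."""
    cols = [c for c, on in enumerate(mask) if on]
    if not cols:
        return 0.0
    Z = [[row[c] for c in cols] for row in X]
    hits = 0
    for i in range(len(Z)):
        j = min((j for j in range(len(Z)) if j != i),
                key=lambda j: math.dist(Z[i], Z[j]))
        hits += y[j] == y[i]
    return hits / len(Z)


def genetic_select(X, y, pop_size=12, generations=15, p_mut=0.1, rng=None):
    """Evolve boolean feature masks; needs at least two features."""
    rng = rng or random.Random(0)
    n = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, X, y), reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection, best masks survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))


# Hypothetical toy data: 6 minority rows, 30 majority rows, 4 features,
# where only feature 0 carries class information.
demo_rng = random.Random(42)
y = [1 if i < 6 else 0 for i in range(36)]
X = [[demo_rng.gauss(1.5 if label else 0.0, 0.5)]
     + [demo_rng.gauss(0.0, 1.0) for _ in range(3)] for label in y]

X_bal, y_bal = adasyn_like(X, y, rng=demo_rng)
best = genetic_select(X_bal, y_bal, rng=demo_rng)
print("class counts:", y_bal.count(0), y_bal.count(1))
print("selected mask:", best, "fitness:", fitness(best, X_bal, y_bal))
```

In practice one would more likely rely on an established implementation such as the `ADASYN` class in imbalanced-learn and score candidate feature masks with cross-validated versions of the actual classifiers; the sketch only illustrates how the two steps fit together.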
Description
Keywords
ADASYN, Genetic feature selection, Healthcare, Imbalanced datasets, Oversampling, Predictive analytics
Source
Journal of Engineering Science and Technology
WoS Q Value
Scopus Q Value
Q3
Volume
19
Issue
2