Strategic application of SMOTE and its variants to enhance AI-driven healthcare classification in imbalanced datasets
The strategic integration of Artificial Intelligence (AI) into healthcare systems offers significant opportunities to drive innovation, improve productivity, and enhance patient outcomes. A persistent challenge in clinical AI applications is class imbalance, where minority cases, which often represent the most critical health risks, are significantly underrepresented in datasets. This imbalance reduces model sensitivity and limits the effectiveness of AI-driven decision-making. To address this issue, this study presents a comparative analysis of five widely used oversampling strategies: SMOTE, Borderline-SMOTE, SMOTE-Tomek, SMOTE-ENN, and SVM-SMOTE. Using two imbalanced healthcare datasets (ASD-Child and Stroke), we evaluate each technique's impact on the performance of four machine learning classifiers: Logistic Regression, Random Forest, XGBoost, and Gradient Boosting. The models are assessed using multiple evaluation metrics: recall, precision, F1-score, ROC-AUC, and PR-AUC. By identifying optimal combinations of resampling techniques and classifiers, this study provides practical guidance for healthcare organizations aiming to implement AI strategies that support fairer predictions, stronger clinical insights, and more effective data-driven systems.
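To make the experimental design concrete, the following is a minimal sketch (not the authors' exact pipeline) of how such a sampler-by-classifier comparison could be set up with scikit-learn, imbalanced-learn, and xgboost. The synthetic dataset, hyperparameters, and cross-validation settings are illustrative assumptions standing in for the ASD-Child and Stroke data; using `imblearn.pipeline.Pipeline` ensures resampling is applied only to the training folds, so the evaluation metrics reflect performance on untouched test data.

```python
# Sketch: grid of five SMOTE-family samplers x four classifiers,
# evaluated with recall, precision, F1, ROC-AUC, and PR-AUC via cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.pipeline import Pipeline  # resamples only inside training folds

# Illustrative imbalanced data (95:5) standing in for ASD-Child / Stroke.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

samplers = {
    "SMOTE": SMOTE(random_state=0),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
    "SMOTE-Tomek": SMOTETomek(random_state=0),
    "SMOTE-ENN": SMOTEENN(random_state=0),
    "SVM-SMOTE": SVMSMOTE(random_state=0),
}
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}
scoring = {"recall": "recall", "precision": "precision", "f1": "f1",
           "roc_auc": "roc_auc", "pr_auc": "average_precision"}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for s_name, sampler in samplers.items():
    for c_name, clf in classifiers.items():
        pipe = Pipeline([("resample", sampler), ("clf", clf)])
        scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
        print(f"{s_name:>17} + {c_name:<18} "
              f"recall={scores['test_recall'].mean():.3f} "
              f"pr_auc={scores['test_pr_auc'].mean():.3f}")
```

Note that PR-AUC (average precision) is often the more informative of the two AUC metrics under heavy imbalance, since ROC-AUC can remain high even when the minority class is poorly detected.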