Unlocking the Power of Machine Learning for Diabetes Prediction: A Comparative Evaluation
Abstract
Diabetes is a chronic metabolic condition with severe long-term health consequences, so early and accurate detection is crucial to preventing complications. Traditional diagnostic methods are reliable, but they can be time-consuming and less effective at detecting diabetes in its early stages. This study applies machine learning (ML) techniques to develop a predictive model for diabetes using the Pima Indians Diabetes Dataset, which includes clinical and demographic features such as glucose levels, BMI, and age. The data were cleaned to address missing values, normalized using feature scaling, and divided into training (80%) and testing (20%) sets. A Logistic Regression model was selected for its interpretability and efficiency in binary classification tasks. On the test set, the model achieved an accuracy of 79.3%, with a precision, recall, and F1-score of 76.5%, 74.1%, and 75.3%, respectively. In a comparative analysis against other ML models, such as Random Forest and XGBoost, Logistic Regression offered a solid balance between predictive performance and simplicity. The key predictors were glucose levels and BMI, which is consistent with clinical understanding of diabetes risk. Although the model shows potential for early diabetes detection, class imbalance and false negatives remain challenges; these could be addressed through oversampling techniques or more advanced algorithms. This work illustrates how ML can improve diabetes diagnostic tools by offering a scalable, data-driven aid to healthcare decision-making. Future research could explore deep learning approaches and larger datasets to further improve accuracy and generalization.
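As a quick consistency check on the reported metrics, recall that the F1-score is the harmonic mean of precision and recall. The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, and the only figures taken from the abstract are the reported precision (76.5%), recall (74.1%), and F1-score (75.3%).

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1-score, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives for the positive (diabetic) class.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Values reported in the abstract, expressed as fractions.
reported_precision = 0.765
reported_recall = 0.741
reported_f1 = 0.753

derived_f1 = f1_from_precision_recall(reported_precision, reported_recall)
# The reported F1 is given to three decimals, so allow half a unit in the last place.
is_consistent = abs(derived_f1 - reported_f1) < 0.0005

print(f"F1 derived from reported precision and recall: {derived_f1:.3f}")
print(f"Consistent with reported F1: {is_consistent}")
```

The same `classification_metrics` helper makes the false-negative concern concrete: every missed diabetic case increases `fn`, which lowers recall (and therefore F1) even when accuracy stays comparatively high.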