Abstract
Diabetes is a silent killer that can be fatal if left without medication and a real change in lifestyle. According to the International Diabetes Federation (IDF) Diabetes Atlas (2021), 10.5% of adults (20-79 years) worldwide have diabetes [1], and the number is rising. Thus, in this study, we aim to build a prediction model using the Pima Indian Diabetes (PID) dataset. The dataset required heavy processing because of its low quality, notably many missing values and class imbalance. This paper shows how enhancing data quality can effectively improve models’ performance. Based on the conducted experiments, ensemble models such as Random Forest show the highest performance (0.86 AUC-ROC) and the largest improvement among all models, around 4%.
Keywords
Diabetes Prediction
Ensemble Models
Pima Indian Diabetes (PID) Dataset