
Early Onset Detection of Diabetes Using Feature Selection and Boosting Techniques


Affiliations
1 Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, India
2 Department of MCA, DG Vaishnav College, India
     



Diabetes is one of the most common diseases in humans. It is a metabolic disease with no permanent cure, but early detection can increase longevity. This research work focuses on predicting the early onset of diabetes, using the diabetes dataset from the UCI Machine Learning Repository. The necessary preprocessing techniques were carried out to make the data more robust and suitable for further processing. The work combines two feature selection techniques with two ensemble boosting techniques, giving four combinations (models) for predicting the presence of diabetes in a person. As a novelty, the number of features chosen by the feature selection techniques is reduced further, which lowers the memory and time complexity of the model. Among the proposed models, Light Gradient Boosting (LightGBM) with Recursive Feature Elimination (RFE) as the feature selector performed better than the other combinations. LightGBM with the fewest features also gave satisfactory results.
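The idea of wrapping a boosting model in Recursive Feature Elimination can be illustrated with a minimal sketch. This is not the authors' implementation: to stay dependency-free it swaps LightGBM for a tiny AdaBoost over one-feature decision stumps, and the toy data, the function names (`fit_adaboost`, `importances`, `rfe`) and the importance measure (sum of stump weights per feature) are all assumptions made for illustration. In practice one would more likely combine library implementations such as scikit-learn's `RFE` with LightGBM's `LGBMClassifier`.

```python
import math


def fit_adaboost(X, y, features, rounds=5):
    """Stand-in booster (NOT LightGBM): AdaBoost over one-feature stumps.

    X holds rows of 0/1 values, y holds labels in {-1, +1}.
    Returns a list of (feature_index, polarity, alpha) stumps.
    """
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for f in features:
            for pol in (1, -1):
                # Weighted error of the stump "predict pol * (2x - 1)".
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if pol * (2 * row[f] - 1) != yi)
                if best is None or err < best[0]:
                    best = (err, f, pol)
        err, f, pol = best
        err = min(max(err, 1e-9), 1 - 1e-9)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((f, pol, alpha))
        # Up-weight misclassified rows, down-weight correct ones.
        w = [wi * math.exp(-alpha * yi * pol * (2 * row[f] - 1))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def importances(model, features):
    """Hypothetical importance score: total stump weight per feature."""
    imp = {f: 0.0 for f in features}
    for f, _, alpha in model:
        imp[f] += alpha
    return imp


def rfe(X, y, n_keep):
    """Refit repeatedly, dropping the least important feature each time."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        imp = importances(fit_adaboost(X, y, features), features)
        features.remove(min(features, key=lambda f: imp[f]))
    return features


# Synthetic toy data (not the UCI dataset): columns 0 and 1 track the
# label with a few errors, columns 2 and 3 are uninformative.
X = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 1],
     [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

selected = rfe(X, y, n_keep=2)
reduced_model = fit_adaboost(X, y, selected)  # model on the reduced set
print("kept feature indices:", selected)
```

Lowering `n_keep` mirrors the further feature reduction described above: each refit searches fewer candidate features, which is where the memory and time savings come from.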

Keywords

Data Mining, Boosting, Medical Mining, Diabetes, Feature Selection.



Authors

Shruti Srivatsan
Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, India
T. Santhanam
Department of MCA, DG Vaishnav College, India

