
Ensemble Approaches for Class Imbalance Problem: A Review


Affiliations
1 Department of Information Technology, USICT, GGSIP University, Dwarka, Delhi, India
     



In data mining, classifying data with a skewed class distribution is a challenging problem. Traditional Classification Techniques (TCT) work well on data with a balanced class distribution, because their internal design favors balanced datasets. The Class Imbalance Problem (CIP) arises when the instances of one class greatly outnumber the instances of the other classes. Factors that contribute to the problem include noisy data, borderline samples, the degree of class overlap, and small disjuncts. In machine learning, ensembles are built to improve on the performance and accuracy of a single classifier by training multiple classifiers and combining their outputs into a single class label. In this paper, we review ensemble learning methods for the two-class problem. We categorize these methods at three levels: the data level, the algorithm level, and according to the base classifier.

Keywords

Bagging, Boosting, Classification, Class Imbalance Problem, Oversampling, Skewed Data Distribution, Undersampling.
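
As an illustration of the ideas named in the abstract and keywords, the following minimal sketch combines random undersampling of the majority class with bagging and a majority vote for a two-class problem. It is not the authors' method or any specific algorithm from the reviewed literature. The synthetic one-dimensional data, the threshold "stump" base classifier, and all function names (imbalance_ratio, undersample, fit_stump, fit_ensemble, predict) are assumptions made only for this example.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def undersample(X, y, rng):
    """Build a balanced bag: every minority sample plus an equally sized
    random draw (without replacement) from the majority class."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    keep = min_idx + rng.sample(maj_idx, len(min_idx))
    return [X[i] for i in keep], [y[i] for i in keep]


def fit_stump(X, y):
    """Hypothetical base classifier: predict 1 when x > t, else 0.
    Picks the threshold t with the fewest training errors."""
    best_t, best_err = None, None
    for t in sorted(set(X)):
        err = sum((x > t) != (label == 1) for x, label in zip(X, y))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def fit_ensemble(X, y, n_estimators=11, seed=0):
    """Train one stump per balanced bag; returns the list of thresholds."""
    rng = random.Random(seed)
    return [fit_stump(*undersample(X, y, rng)) for _ in range(n_estimators)]


def predict(thresholds, x):
    """Combine the base classifiers by majority vote into one class label."""
    votes = sum(x > t for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0


# Synthetic skewed data (assumed for illustration): 950 majority samples
# (class 0) and 50 minority samples (class 1) on a single feature.
data_rng = random.Random(42)
X = [data_rng.gauss(0.0, 1.0) for _ in range(950)]
X += [data_rng.gauss(2.5, 1.0) for _ in range(50)]
y = [0] * 950 + [1] * 50

ensemble = fit_ensemble(X, y)
```

Because each base classifier sees a balanced bag, no single stump is fitted to the 19:1 skew of the full dataset, and an odd number of estimators avoids tied votes. Oversampling the minority class, or reweighting instances as boosting does, would replace the undersample step in this sketch.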







Authors

Anjana Gosain
Department of Information Technology, USICT, GGSIP University, Dwarka, Delhi, India
Arushi Gupta
Department of Information Technology, USICT, GGSIP University, Dwarka, Delhi, India
