
An Empirical and Comparative Study of Under-Sampling and Over-Sampling for Defect-Prone Dataset Models in Light of Machine Learning


Affiliations
1 School of Control & Computer Engineering, North China Electric Power University, Beijing, China
2 Dept. of Information Communication Engineering, Guilin University of Electronic Technology, Guilin, China
 

Abstract

Relatively few researchers have addressed class imbalance when analysing datasets. Two types of class imbalance occur in datasets. In the first, some classes contain many more instances than others; this is called between-class imbalance. In the second, some subsets of one class contain fewer instances than other subsets of the same class; this is within-class imbalance. Over-sampling and under-sampling techniques play significant roles in tackling the class-imbalance problem, and many variants of both have been applied to imbalanced dataset models. In this paper we apply two sampling techniques to our imbalanced datasets: over-sampling using SMOTE, and under-sampling using SpreadSubsample. All experimental results are reported with performance measures suited to class imbalance (precision, recall, F-measure, and area under the ROC curve), computed for 12 different classifiers, in order to compare the two sampling techniques. The overall analysis shows that the proportion of correctly classified instances improves with over-sampling in a few classifiers compared with under-sampling. In terms of TP rate and positive accuracy, stacking is the worst classifier in these experiments, and MultiClass classification and LMT could not increase the TP rate under under-sampling. Both techniques improve on using no sampling at all, but over-sampling proves the more valuable for addressing the class-imbalance problem.
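The two sampling strategies compared in the abstract can be sketched as follows. This is a minimal, illustrative sketch only, not the paper's actual experimental setup: the `smote`, `random_undersample`, and `prf` helpers and the toy module data below are invented here for illustration, and `random_undersample` stands in for the SpreadSubsample-style filter named in the abstract.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """SMOTE sketch: interpolate between a minority point and one of
    its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def random_undersample(majority, n_keep, seed=0):
    """Under-sampling sketch: keep a random subset of the majority class."""
    return random.Random(seed).sample(majority, n_keep)

def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy defect data: 20 clean modules (majority) vs 4 defective (minority).
clean = [(float(i), float(i % 5)) for i in range(20)]
defective = [(0.5, 0.5), (1.5, 1.0), (0.8, 1.2), (1.1, 0.4)]

# Over-sampling: grow the minority class until the classes balance.
oversampled = clean + defective + smote(defective, len(clean) - len(defective))
# Under-sampling: shrink the majority class until the classes balance.
undersampled = random_undersample(clean, len(defective)) + defective
```

With the classes balanced either way, each classifier can then be trained on the rebalanced data and scored with precision, recall, F-measure and AUC, which is the kind of comparison the abstract reports across its 12 classifiers.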

Keywords

- Software prediction, Under-sampling, Over-sampling, Sampling, Class imbalance, Defect-Prone.
References

  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
  • M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: One-Sided Selection”, in Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, Tennessee, Morgan Kaufmann, 1997, pp. 179-186.
  • N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, 16, 2002, pp. 321-357.
  • H. Han, W.Y. Wang, and B.H. Mao, “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning”, in Proceedings of the International Conference on Intelligent Computing 2005, Part I, LNCS 3644, 2005, pp. 878–887.
  • G. M. Weiss and F. Provost, “Learning when training data are costly: the effect of class distribution on tree induction”, Journal of Artificial Intelligence Research, 19, 2003, pp. 315-354.
  • H. Han, L. Wang, M. Wen, and W. Y. Wang, “Oversampling Algorithm Based on Preliminary Classification in Imbalanced Data Sets Learning”, Journal of Computer Applications (in Chinese), 2006, Vol. 26, No. 8, pp. 1894-1897.
  • Miroslav Kubat, Robert C. Holte, and Stan Matwin. 1998. Machine Learning for the Detection of Oil Spills in Satellite Radar Images. Machine Learning 30, 2-3 (1998), 195–215. http://dblp.uni-trier.de/db/journals/ml/ml30.html#KubatHM98
  • Jorma Laurikkala. 2001. Improving Identification of Difficult Small Classes by Balancing Class Distribution. In AIME (Lecture Notes in Computer Science), Silvana Quaglini, Pedro Barahona, and Steen Andreassen (Eds.), Vol. 2101. Springer, 63–66. http://dblp.uni-trier.de/db/conf/aime/aime2001.html#Laurikkala01; http://dx.doi.org/10.1007/3-540-48229-6_9
  • Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18, 17 (2017), 1–5. http://jmlr.org/papers/v18/16-365
  • Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. 2006. Exploratory Under-Sampling for Class-Imbalance Learning. In ICDM. IEEE Computer Society, 965–969. http://dblp.uni-trier.de/db/conf/icdm/icdm2006.html#LiuWZ06
  • David Mease, A. J. Wyner, and A. Buja. 2007. Boosted classification trees and class probability/quantile estimation. The Journal of Machine Learning Research 8 (2007), 409–439. http://dl.acm.org/citation.cfm?id=1248675
  • Iman Nekooeimehr and Susana K. Lai-Yuen. 2016. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Syst. Appl. 46 (2016), 405–416. http://dblp.uni-trier.de/db/journals/eswa/eswa46.html#NekooeimehrL16
  • Yuxin Peng. 2015. Adaptive Sampling with Optimal Cost for Class-Imbalance Learning. In AAAI, Blai Bonet and Sven Koenig (Eds.). AAAI Press, 2921–2927. http://dblp.uni-trier.de/db/conf/aaai/aaai2015.html#Peng15
  • Jonathan K. Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of Population Structure Using Multilocus Genotype Data. Genetics 155 (June 2000), 945–959. http://pritch.bsd.uchicago.edu/publications/structure.pdf
  • Muhammad Atif Tahir, Josef Kittler, and Fei Yan. 2012. Inverse random under sampling for class imbalance problem and its application to multi-label classification. Pattern Recognition 45, 10 (2012), 3738–3750. http://dblp.uni-trier.de/db/journals/pr/pr45.html#TahirKY12
  • Japkowicz, N. Class Imbalance: Are We Focusing on the Right Issue? in Notes from the ICML Workshop on Learning from Imbalanced Data Sets II. 2003.
  • Chawla, N.V., Data mining for imbalanced datasets: An overview, in Data Mining and Knowledge Discovery Handbook, 2005, Springer, p. 853-867.
  • Batista, G.E., R.C. Prati, and M.C. Monard, Balancing strategies and class overlapping, in Advances in Intelligent Data Analysis VI, 2005, Springer, p. 24-35.
  • Visa, S. and Ralescu, A., Issues in mining imbalanced data sets - a review paper, in Proceedings of the Midwest Artificial Intelligence and Cognitive Science Conference, Dayton, 2005.
  • He, H. and E.A. Garcia, Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(9): p. 1263-1284.
  • Fan, W., Stolfo, S. J., Zhang, J., and Chan, P. K., Adacost: misclassification cost-sensitive boosting, in MACHINE LEARNING-INTERNATIONAL WORKSHOP THEN CONFERENCE-, pages 97–105, Cite- seer, 1999.
  • Domingos, P., MetaCost: a general method for making classifiers cost-sensitive, in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 155–164, ACM, 1999.
  • Kotsiantis, S. et al., GESTS International Transactions on Computer Science and Engineering 30 (2006) 25.
  • Bhowan, U., Zhang, M., and Johnston, M., Genetic programming for image classification with unbalanced data, in Proceedings of the 24th International Conference Image and Vision Computing New Zealand, IVCNZ ’09, pages 316–321, Wellington, 2009, IEEE.
  • Bhowan, U., Johnston, M., and Zhang, M., Differentiating between individual class performance in genetic programming fitness for classification with unbalanced data, in Evolutionary Computation, 2009. CEC’09. IEEE Congress on, pages 2802–2809, IEEE, 2009.
  • K. Veropoulos, C. Campbell, and N. Cristianini, “Controlling the sensitivity of support vector machines”, in Proceedings of the International Joint Conference on AI, 1999, pp. 55–60.
  • K.Z. Huang, H.Q. Yang, I. King, and M.R. Lyu, “Learning Classifiers from Imbalanced Data Based on Biased Minimax Probability Machine”, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
  • N. Japkowicz, “Supervised versus unsupervised binary learning by feed forward neural networks”, Machine Learning, 42(1/2), 2001, pp. 97-122.
  • B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution”, Neural Computation, 13(7), 2001, pp. 1443-1472.
  • D. Tax, “One-class classification”, Ph.D. dissertation, Delft University of Technology, 2001.
  • L. M. Manevitz and M. Yousef, “One-class SVMs for document classification”, Journal of Machine Learning Research, 2, 2001, pp. 139-154.
  • P. Riddle, R. Segal, and O. Etzioni, “Representation design and brute-force induction in a Boeing manufacturing domain”, Applied Artificial Intelligence, 8, 1994, pp. 125-147.
  • Lucia, A.D., Fasano, F., Grieco, C., Tortora, G.: Recovering design rationale from email repositories. In: Proceedings of ICSM 2009 (25th IEEE International Conference on Software Maintenance), IEEE CS Press (2009)
  • Pattison, D., Bird, C., Devanbu, P.: Talk and Work: a Preliminary Report. In: Proceedings of the Fifth International Working Conference on Mining Software Repositories, ACM (2008) 113–116.
  • Maaz Rasheed Malik, Liu Yining, “A Model Vector Machine Tree Classification for Software Fault Forecast Model (TSMO/TSVM)”, IJANA Journal, Volume 12, Issue 4, pp. 4650-4655, DOI: 10.35444/IJANA.2021.12407.


Authors

Salahuddin Shaikh
School of Control & Computer Engineering, North China Electric Power University, Beijing, China
Liu Changan
School of Control & Computer Engineering, North China Electric Power University, Beijing, China
Maaz Rasheed Malik
Dept. of Information Communication Engineering, Guilin University of Electronic Technology, Guilin, China

