
Pearson Correlation Coefficient K-Nearest Neighbor Outlier Classification on Real-Time Datasets


Affiliations
Department of Computer Science, Nandha Arts and Science College, India
     



Detection and classification of data that do not conform to expected behavior (outliers) play a major role in a wide variety of applications, such as military surveillance, intrusion detection in cyber security, and fraud detection in online transactions. Accurate detection of outliers in high-dimensional data remains a major challenge, and the trade-off between high accuracy and low computational time is the key requirement in outlier prediction and classification. The presence of a large number of diverse features necessitates a reduction mechanism prior to classification. To achieve this, a Distance-based Outlier Classification (DOC) approach is proposed in this paper. The proposed work utilizes the Pearson Correlation Coefficient (PCC) to measure the correlation between data instances, and minimum-instance learning through PCC estimation reduces the dimensionality. The proposed work is split into two phases, namely training and testing. During training, labeling the most frequent samples isolates them from the infrequent ones and reduces the data size effectively. The testing phase employs the k-Nearest Neighbor (κ-NN) scheme to classify the frequent samples. The dimensionality and the k-value are inversely proportional to each other; hence, in the proposed work, selecting a large value of k offers a significant reduction in dimensionality. The combination of PCC-based instance learning and a high value of k reduces the dimensionality and the noise, respectively. A comparative analysis of the proposed PCC-κ-NN against conventional algorithms such as Decision Tree, Naïve Bayes, Instance-Based K-means (IBK), and Triangular Boundary-based Classification (TBC), in terms of sensitivity, specificity, accuracy, precision, and recall, proves its effectiveness in outlier classification. Besides, experimental validation of the proposed PCC-κ-NN against state-of-the-art methods with respect to execution time confirms the trade-off between low time consumption and high accuracy.
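The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact DOC algorithm: the correlation `threshold`, the use of Euclidean distance for the neighbor search, and the simple majority vote are all choices made here for demonstration.

```python
from collections import Counter
from math import sqrt, dist  # math.dist: Euclidean distance (Python 3.8+)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def reduce_instances(train, threshold=0.99):
    """Drop training instances that are nearly collinear (|PCC| >= threshold)
    with one already kept -- a stand-in for the PCC-based
    minimum-instance learning step of the training phase."""
    kept = []
    for x, label in train:
        if all(abs(pearson(x, k)) < threshold for k, _ in kept):
            kept.append((x, label))
    return kept

def knn_classify(train, query, k=3):
    """Majority vote over the k nearest (Euclidean) training instances."""
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Here a perfectly correlated instance such as (2, 4, 6) versus (1, 2, 3) is pruned before the κ-NN step, so the neighbor search runs on the reduced set.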

Keywords

Data Mining, Distance-Based Instance Learning, Outlier Detection, Outlier Classification, Pearson Correlation Coefficient, κ-Nearest Neighbor.

References

  • M.A.G. Sagade and R. Thakur, “Study of Outlier Detection Techniques for Low and High Dimensional Data”, International Journal of Scientific Engineering and Technology, Vol. 3, No. 9, pp. 1-5, 2014.
  • T. Al-Khateeb, M. M. Masud, L. Khan, C. Aggarwal, H. Jiawei and B. Thuraisingham, “Stream Classification with Recurring and Novel Class Detection using Class-Based Ensemble”, Proceedings of 12th International Conference on Data Mining, pp. 31-40, 2012.
  • A. Albanese, S.K. Pal and A. Petrosino, “Rough Sets, Kernel Set, and Spatiotemporal Outlier Detection”, IEEE Transactions on Knowledge and Data Engineering, Vol. 26, pp. 194-207, 2014.
  • Y. Qian, Q. Wang, H. Cheng, J. Liang and C. Dang, “Fuzzy-Rough Feature Selection Accelerator”, Fuzzy Sets and Systems, Vol. 258, pp. 61-78, 2015.
  • F. Angiulli, “Prototype-Based Domain Description for One Class Classification”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, pp. 1131-1144, 2012.
  • N. Pham and R. Pagh, “A Near-Linear Time Approximation Algorithm for Angle-based Outlier Detection in High-Dimensional Data”, Proceedings of 18th ACM International Conference on Knowledge Discovery and Data Mining, pp. 877-885, 2012.
  • L. Galluccio, O. Michel, P. Comon, M. Kliger and A.O. Hero, “Clustering with a New Distance Measure Based on a Dual-Rooted Tree”, Information Sciences, Vol. 251, pp. 96-113, 2013.
  • B. Krawczyk, M. Wozniak and B. Cyganek, “Clustering-Based Ensembles for One-Class Classification”, Information Sciences, Vol. 264, pp. 182-195, 2014.
  • H. Kriegel, P. Kroger, E. Schubert and A. Zimek, “Outlier Detection in Arbitrarily Oriented Subspaces”, Proceedings of 12th IEEE International Conference on Data Mining, pp. 379-388, 2012.
  • O.M.B. Saeed, S. Sankaran, A.R.M. Shariff, H.Z.M. Shafri, R. Ehsani and M.S. Alfatni, “Classification of Oil Palm Fresh Fruit Bunches based on their Maturity using Portable Four-Band Sensor System”, Computers and Electronics in Agriculture, Vol. 82, pp. 55-60, 2012.
  • A. Zimek, M. Gaudet, R.J. Campello and J. Sander, “Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles”, Proceedings of 19th ACM International Conference on Knowledge Discovery and Data Mining, pp. 428-436, 2013.
  • S. Krishnan and H.G. Kerkhoff, “Exploiting Multiple Mahalanobis Distance Metrics to Screen Outliers From Analog Product Manufacturing Test Responses”, IEEE Design and Test, Vol. 30, No. 3, pp. 18-24, 2013.
  • R. Todeschini, D. Ballabio, V. Consonni, F. Sahigara and P. Filzmoser, “Locally Centred Mahalanobis Distance: A New Distance Measure with Salient Features Towards Outlier Detection”, Analytica Chimica Acta, Vol. 787, pp. 1-9, 2013.
  • A. Akila and E. Chandra, “Slope Finder-A Distance Measure for DTW based Isolated Word Speech Recognition”, International Journal of Engineering and Computer Science, Vol. 2, No. 12, pp. 3411-3417, 2013.
  • Z. Chao, S. Huanfeng, Z. Mingliang, Z. Liangpei and W. Penghai, “Reconstructing MODIS LST Based on Multitemporal Classification and Robust Regression”, IEEE Geoscience and Remote Sensing Letters, Vol. 12, No. 3, pp. 512-516, 2015.
  • L. Xu, S.-M. Yan, C.B. Cai and X.P. Yu, “One-Class Partial Least Squares (OCPLS) Classifier”, Chemometrics and Intelligent Laboratory Systems, Vol. 126, pp. 1-5, 2013.
  • K. Zawadzki, C. Feenders, M.P. Viana, M. Kaiser and L.D.F. Costa, “Morphological Homogeneity of Neurons: Searching for Outlier Neuronal Cells”, Neuroinformatics, Vol. 10, No. 2, pp. 379-389, 2012.
  • M. Hund, M. Behrisch, I. Farber, M. Sedlmair, T. Schreck and T. Seidl, “Subspace Nearest Neighbor Search-Problem Statement, Approaches, and Discussion”, Proceedings of International Conference on Similarity Search and Applications, pp. 307-313, 2015.
  • H. Liu and S. Zhang, “Noisy Data Elimination using Mutual K-Nearest Neighbor for Classification Mining”, Journal of Systems and Software, Vol. 85, No. 2, pp. 1067-1074, 2012.
  • Z.-G. Liu, Q. Pan and J. Dezert, “A New Belief-based K-Nearest Neighbor Classification Method”, Pattern Recognition, Vol. 46, No. 3, pp. 834-844, 2013.
  • N. Tomasev and D. Mladenic, “Hubness-Aware Shared Neighbor Distances for High-Dimensional K-Nearest Neighbor Classification”, Proceedings of International Conference on Hybrid Artificial Intelligence Systems, pp. 116-124, 2014.
  • Z. Yu, Y. Deng and A.K. Jain, “Keystroke Dynamics for User Authentication”, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 117-123, 2012.
  • A.B.E.D. Ahmed and I.S. Elaraby, “Data Mining: A Prediction for Students Performance using Classification Method”, World Journal of Computer Application and Technology, Vol. 2, No. 1, pp. 43-47, 2014.
  • H. M. Harb and M.A. Moustafa, “Selecting Optimal Subset of Features for Student Performance Model”, International Journal of Computer Science Issues, Vol. 9, No. 5, pp. 253-262, 2012.
  • P. Lam, L. Wang, H.Y. Ngan, N.H. Yung and A.G. Yeh, “Outlier Detection in Large-Scale Traffic Data by Naive Bayes Method and Gaussian Mixture Model Method”, Proceedings of International Symposium on Electronic Imaging, Intelligent Robotics and Industrial Applications using Computer Vision, pp. 111-118, 2015.
  • D. Rajakumari and S. Pannirselvam, “A Novel Triangular Boundary Based Classification Approach for Effective Prediction of Outliers”, International Journal of Applied Engineering Research, Vol. 6, No. 1, pp. 322-328, 2015.
  • Veerabhadrappa and L. Rangarajan, “Multi-Level Dimensionality Reduction Methods using Feature Selection and Feature Extraction”, International Journal of Computer Applications, Vol. 4, No. 2, pp. 33-38, 2010.
  • T. Sridevi and A. Murugan, “A Novel Feature Selection Method for Effective Breast Cancer Diagnosis and Prognosis”, International Journal of Computer Applications, Vol. 88, No. 11, pp. 1-12, 2014.

Authors

D. Rajakumari
Department of Computer Science, Nandha Arts and Science College, India