
Development of an Efficient Hierarchical Clustering Analysis using an Agglomerative Clustering Algorithm


Affiliations
1 Department of Computer Science, Lahore College for Women University, Lahore 54000, Pakistan
2 Department of Information Technology, Government College University Faisalabad 38000, Pakistan
3 Department of Computer Science, National Textile University, Faisalabad 37610, Pakistan
 

Clustering algorithms group elements that share similar characteristics. Among the different families of clustering algorithms, the agglomerative algorithm is widely used in the document clustering domain. This study examined the effectiveness of the agglomerative clustering algorithm in document clustering by enhancing its efficiency and evaluating it through implementation. The resulting values (precision = 0.8571, recall = 0.8571 and F-measure = 0.857076) indicate higher accuracy and efficiency than the existing algorithm.

Keywords

Cosine Similarity Measure, Document Clustering, F-Measure, Hierarchical Agglomerative Clustering, Preprocessing, TF-IDF.
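The keywords above name the stages of a typical document clustering pipeline: TF-IDF weighting, cosine similarity, hierarchical agglomerative clustering and F-measure evaluation. The following is a minimal sketch of how those stages fit together, not the authors' implementation. The whitespace tokenizer, the average-linkage merge rule, the balanced (harmonic-mean) form of the F-measure, the toy documents and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def agglomerative_clusters(vectors, n_clusters):
    """Start with singleton clusters and repeatedly merge the most similar pair.

    Uses average linkage (an assumption) until n_clusters clusters remain.
    """
    sim = [[cosine_similarity(a, b) for b in vectors] for a in vectors]
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        x, y = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy corpus (hypothetical): two pet documents and two finance documents.
docs = [
    "cat dog pet animal",
    "dog cat pet food",
    "stock market trade price",
    "market price stock bank",
]
print(agglomerative_clusters(tfidf_vectors(docs), 2))  # [[0, 1], [2, 3]]
```

This naive version recomputes cluster similarities on every merge, so it only suits small corpora; it is meant to show the data flow between the stages rather than an efficient algorithm.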



Authors

Arshia Naeem
Department of Computer Science, Lahore College for Women University, Lahore 54000, Pakistan
Mariam Rehman
Department of Information Technology, Government College University Faisalabad 38000, Pakistan
Maria Anjum
Department of Computer Science, Lahore College for Women University, Lahore 54000, Pakistan
Muhammad Asif
Department of Computer Science, National Textile University, Faisalabad 37610, Pakistan


DOI: https://doi.org/10.18520/cs%2Fv117%2Fi6%2F1045-1053