
MODC: Multi-Objective Distance based Optimal Document Clustering by GA


Affiliations
1 Department of Computer Science and Engineering, MRCE, Hyderabad - 500100, Telangana State, India
2 Department of Computer Science, Sri Venkateswara University, Tirupathi - 517502, Andhra Pradesh, India
3 Annamacharya PG College of Computer Studies, Rajampet - 516126, Andhra Pradesh, India
 

Background/Objective: Unsupervised learning of text documents is an essential process in knowledge discovery and data mining. Concept, context and semantic relevance are important factors that are specific to text mining, whereas they are out of scope in unsupervised learning of record-structured data.
Methods/Statistical Analysis: Most current benchmark document clustering models rely on term frequency and do not consider concept, context and semantic relations during clustering. To address this, our earlier work introduced novel document clustering approaches, one of which is Document Clustering by Conceptual, Contextual and Semantic Relevance (DC3SR). The lessons learned from the empirical study of that contribution motivated us to propose a Multi-Objective Distance based optimal document Clustering (MODC) approach, which optimizes the resultant clusters using the well-known evolutionary computation technique, the Genetic Algorithm.
Findings: The significant contributions of this proposal are feature formation by concept, context and semantic relevance, and optimization of the resultant clusters by a genetic algorithm. We propose an unsupervised learning approach that forms the initial clusters by estimating the similarity between any two documents from their concept, context and semantic relevance scores, and then optimizes those clusters with a genetic algorithm. The method represents concept as the correlation between arguments and activities in the given documents, and context as the correlation between the meta-text of the documents; semantic relevance is assessed by estimating the similarity between documents through the hyponyms of the arguments. The meta-text considered for context assessment comprises the list of authors, the list of keywords and the list of document versioning time schedules.
Application/Improvements: Experiments were conducted to assess the significance of the proposed model. The results indicate that MODC performs well across varying document counts, with a cluster formation accuracy of 97%. Dimensionality reduction by concept, context and semantic relevance is left as a future enhancement of the proposed model.
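
The abstract does not state the distance formulas or the genetic operators, so the following is only a minimal sketch of the general idea: a distance that combines concept, context and semantic components, and a simple genetic algorithm that refines an initial cluster assignment. Everything specific here is an assumption made for illustration and not the authors' method: the `Doc` fields, the use of Jaccard distance for each component, the equal weights, the intra-cluster cost, and the function names `combined_distance` and `refine_clusters` are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical feature sets; the paper's actual features are not given here.
    concepts: frozenset   # e.g. (argument, activity) pairs
    meta: frozenset       # e.g. authors, keywords, version timestamps
    hyponyms: frozenset   # e.g. hyponyms of the arguments


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def combined_distance(d1, d2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of concept, context and semantic distances (assumed form)."""
    parts = (
        jaccard_distance(d1.concepts, d2.concepts),
        jaccard_distance(d1.meta, d2.meta),
        jaccard_distance(d1.hyponyms, d2.hyponyms),
    )
    return sum(w * p for w, p in zip(weights, parts))


def intra_cluster_cost(docs, labels):
    """Sum of pairwise distances between documents sharing a label (lower is better)."""
    cost = 0.0
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if labels[i] == labels[j]:
                cost += combined_distance(docs[i], docs[j])
    return cost


def refine_clusters(docs, initial_labels, k, generations=50,
                    pop_size=20, mutation_rate=0.1, seed=0):
    """Generic elitist GA over label vectors; operators are illustrative choices."""
    rng = random.Random(seed)

    def mutate(labels):
        # Reassign each document to a random cluster with small probability.
        return [rng.randrange(k) if rng.random() < mutation_rate else lab
                for lab in labels]

    def crossover(a, b):
        # Single-point crossover of two label vectors.
        if len(a) < 2:
            return list(a)
        cut = rng.randrange(1, len(a))
        return a[:cut] + b[cut:]

    population = [list(initial_labels)]
    population += [mutate(initial_labels) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda lab: intra_cluster_cost(docs, lab))
        survivors = population[: max(1, pop_size // 2)]  # keep the best half
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors)))
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=lambda lab: intra_cluster_cost(docs, lab))
```

Because the best half of each generation is carried over unchanged, the returned assignment never has a higher cost than the initial one under this cost function. A real multi-objective formulation would likely keep the three distances as separate objectives and not collapse them into a single weighted sum.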

Keywords

Concept Distance, Context Distance, Document Clustering, Meta-text, MODC, Multi Objective Distance Function, Text Mining, Unsupervised Learning.

Authors

Annaluri Sreenivasa Rao
Department of Computer Science and Engineering, MRCE, Hyderabad - 500100, Telangana State, India
S. Ramakrishna
Department of Computer Science, Sri Venkateswara University, Tirupathi - 517502, Andhra Pradesh, India
P. Chitti Babu
Annamacharya PG College of Computer Studies, Rajampet - 516126, Andhra Pradesh, India


DOI: https://doi.org/10.17485/ijst%2F2016%2Fv9i28%2F132458