A Survey on Different Similarity Join to Improve Clustering, Classification and Similarity Search


Affiliations
1 Sri Narayana Guru College, Coimbatore-641105, Tamil Nadu, India
 

Objectives: To analyze various similarity join techniques that can improve the data mining process.

Findings: A similarity join evaluates the similarity between pairs of objects and finds all pairs whose similarity is not smaller than a given similarity threshold. It can be performed between objects, strings, nodes and so on. Many applications, such as data cleaning, data integration, near duplicate detection and data mining processes in general, can benefit extensively from the similarity join measure. Different techniques and approaches are used to find the similarity join between objects in a homogeneous information network. This paper provides detailed information about the different similarity join techniques.
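The threshold definition above can be illustrated with a minimal sketch. It is not one of the surveyed techniques: it assumes objects are token sets, uses Jaccard similarity as the similarity measure (the abstract does not fix a measure), and compares every pair naively. The names `jaccard`, `similarity_join` and the sample `records` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; taken as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_join(objects: dict, threshold: float) -> list:
    """Return every pair (id1, id2, sim) whose similarity is >= threshold.

    Naive self-join over all pairs (quadratic in the number of objects).
    Assumption: Jaccard over token sets stands in for the similarity measure.
    """
    result = []
    for id1, id2 in combinations(sorted(objects), 2):
        sim = jaccard(objects[id1], objects[id2])
        if sim >= threshold:  # "not smaller than the similarity threshold"
            result.append((id1, id2, sim))
    return result


# Hypothetical sample data: three records represented as token sets.
records = {
    "r1": {"data", "cleaning", "join"},
    "r2": {"data", "cleaning", "joins"},
    "r3": {"graph", "network"},
}
pairs = similarity_join(records, threshold=0.5)
print(pairs)
```

With these sample records, only the pair r1 and r2 shares enough tokens (2 of 4 distinct tokens) to reach the threshold of 0.5. The all-pairs loop is the cost that more specialised join techniques aim to reduce.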

Results: In this paper, various similarity join techniques are compared on a set of parameters to show that the path based similarity join performs better than the other techniques.

Application/Improvements: The findings of this work show that the path based similarity join provides better results than the other approaches.


Keywords

Similarity Join, Data Cleaning, Data Integration, Near Duplicate Detection.


Authors

C. P. Rushida
Sri Narayana Guru College, Coimbatore-641105, Tamil Nadu, India
V. R. Nagarajan
Sri Narayana Guru College, Coimbatore-641105, Tamil Nadu, India

