
Comparison of String Similarity Algorithms to Measure Lexical Similarity


Affiliations
1 Shrimad Rajchandra Institute of Management and Computer Applications, UTU, Bardoli, India
2 Shrimad Rajchandra Inst. of Management & Comp. Appl., UTU, Bardoli, India
     



String similarity represents the lexical similarity between two words. It can be further exploited to identify similarity between questions. Several string similarity algorithms exist in the literature. In this paper the authors have implemented five of them, viz. Dice coefficient, Jaccard similarity, Levenshtein distance, Jaro distance and Cosine similarity. The results of these algorithms are then compared with the judgments of human judges to determine which algorithm most closely resembles the way humans judge the dissimilarity of the given strings. The experiments were conducted on 1000 English word pairs.
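For orientation, the five measures named in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how strings are tokenized, so the use of character bigrams for Dice, Jaccard and Cosine is an assumption, and every function name here is hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    """Character bigrams of s (assumed tokenization, not stated in the abstract)."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Dice coefficient over sets of bigrams: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(a, b):
    """Jaccard similarity over sets of bigrams: |X ∩ Y| / |X ∪ Y|."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def cosine(a, b):
    """Cosine similarity between bigram count vectors."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(x[g] * y[g] for g in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def jaro(a, b):
    """Jaro similarity based on matching characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    m = 0
    for i, ca in enumerate(a):
        # Look for an unused matching character within the sliding window.
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = t = 0
    for i, ca in enumerate(a):
        if a_hit[i]:
            while not b_hit[k]:
                k += 1
            if ca != b[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3
```

Note that Levenshtein returns a distance (lower means more similar) while the other four return a similarity in [0, 1], so a comparison against human ratings would need the distance rescaled first, for example by dividing by the longer string's length; that normalization is likewise an assumption here.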




Authors

Sagar J. Gandhi
Shrimad Rajchandra Institute of Management and Computer Applications, UTU, Bardoli, India
Mihirraj M. Thakor
Shrimad Rajchandra Institute of Management and Computer Applications, UTU, Bardoli, India
Jikitsha Sheth
Shrimad Rajchandra Inst. of Management & Comp. Appl., UTU, Bardoli, India
Hariom I. Pandit
Shrimad Rajchandra Institute of Management and Computer Applications, UTU, Bardoli, India
Hemin S. Patel
Shrimad Rajchandra Institute of Management and Computer Applications, UTU, Bardoli, India

