
Generic Approach of Measuring Text Semantic Similarity


Affiliations
1 Department of Computer Science and Engineering, Bhilai Institute of Technology, India
     

Abstract


Text semantic similarity is a challenging task, as is evident from the sustained interest of the NLP research community, which has set achievable milestones through active participation in the SemEval task series over the past decade. These developments showed that comparing the semantics of texts depends largely on the grammatical validity of the sentences and on how the sentences are formulated. This paper addresses the computation of text semantic similarity by devising a novel set of generic similarity metrics based on both the word sense of the phrases that constitute the text and the grammatical layout and sequencing of those phrases into meaningful text. The combined word-sense and grammatical similarity metrics were applied to benchmark sentential datasets, where they obtained the highest Pearson’s correlation coefficient (0.89) with mean human similarity scores among closely competing structure-based models. To demonstrate the generic utility of the devised metrics, the plagiarism-detection classification task was then revisited with our model on the well-known paragraph-level Rewrite corpus compiled by Clough and Stevenson (2011). There, too, the classification performance was nearly competitive (accuracy 76.8%), which encourages further work on improving the dependency (grammatical relations) component in order to raise the count of true positives and lower the count of false negatives.

Keywords

Structural Features, Word-Sense Similarity, Grammatical Similarity, Generic Similarity Metrics, Wikipedia Rewrite Corpus.
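
The abstract reports two evaluation figures: Pearson’s correlation between model scores and mean human similarity scores, and classification accuracy on a plagiarism-detection task. The following is a minimal illustrative sketch of how such figures are commonly computed; it is not the authors' implementation. The `combine` function and its `weight` parameter are hypothetical, since the abstract does not state how the word-sense and grammatical metrics are merged, and all numbers in the usage lines are made-up toy values rather than results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def combine(word_sense, grammatical, weight=0.5):
    """Hypothetical convex combination of two similarity scores in [0, 1].

    ASSUMPTION: a simple weighted average, chosen only for illustration.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * word_sense + (1.0 - weight) * grammatical


def accuracy(tp, tn, fp, fn):
    """Fraction of correct decisions given confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no examples")
    return (tp + tn) / total


if __name__ == "__main__":
    # Toy values only: per-pair (word-sense, grammatical) scores and human means.
    pair_scores = [(0.9, 0.7), (0.4, 0.5), (0.2, 0.1), (0.8, 0.9)]
    human_means = [0.85, 0.40, 0.10, 0.90]
    model_scores = [combine(ws, gr) for ws, gr in pair_scores]
    print("Pearson r:", pearson(model_scores, human_means))
    print("Accuracy:", accuracy(tp=3, tn=1, fp=0, fn=1))
```

For a fixed set of truly plagiarised items, every additional true positive is one fewer false negative, which is why improving recall on the positive class raises `accuracy` in this sketch.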
References

  • R. Mihalcea, C. Corley and C. Strapparava, “Corpus-Based and Knowledge-Based Measures of Text Semantic Similarity”, Proceedings of American Association for Artificial Intelligence, pp. 775-780, 2006.
  • Y. Li, D. McLean, Z.A. Bandar, J.D. O’Shea and K. Crockett, “Sentence Similarity based on Semantic Nets and Corpus Statistics”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 8, pp. 1138-1145, 2006.
  • A. Islam and D. Inkpen, “Semantic Text Similarity using Corpus based Word Similarity and String Similarity”, ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 2, pp. 1-10, 2008.
  • M.C. Lee, “A Novel Sentence Similarity Measure for Semantic based Expert Systems”, Expert Systems with Applications, Vol. 38, No. 5, pp. 6392-6399, 2011.
  • D. Gupta, “Detection of Idea Plagiarism using Syntax - Semantic Concept Extractions with Genetic Algorithm”, Expert Systems with Applications, Vol. 73, No. 3, pp. 11-26, 2017.
  • S. Ozates, A. Ozgur and D. Radev, “Sentence Similarity based on Dependency Tree Kernels for Multi-document Summarization”, Proceedings of International Conference on Language Resources and Evaluation, pp. 2833-2838, 2016.
  • P. Zhang, X. Huang and L. Zhang, “Information Mining and Similarity Computation for Semi- Un-Structured Sentences from the Social Data”, IEEE Internet of Things, Vol. 34, No. 2, pp. 2352-8648, 2020.
  • S. Alzahrani, M. Salmon and A. Abraham, “An Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods”, IEEE Transactions on Systems, Man, and Cybernetics Part C: Application and Reviews, Vol. 42, No. 2, pp. 133-149, 2012.
  • E. Canhasi, “Measuring the Sentence Level Similarity”, Master Thesis, Department of Computer Science, University of Prizren, pp. 1-42, 2013.
  • S. Alzahrani, N. Salim and V. Palade, “Uncovering Highly Obfuscated Plagiarism Cases using Fuzzy Semantic-Based Similarity Model”, Journal of King Saud University - Computer and Information Sciences, Vol. 27, pp. 248-268, 2015.
  • A. Pawar and V. Mago, “Calculating the Similarity between Words and Sentences using a Lexical Database and Corpus Statistics”, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, pp. 1-14, 2018.
  • K. Vani and D. Gupta, “A Study on Extrinsic Text Plagiarism Detection Techniques and Tools”, Journal of Engineering Science and Technology, Vol. 9, No. 4, pp. 150-164, 2013.
  • S. Alzahrani and N. Salim, “Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection”, Proceedings of International Conference and Workshop on Multilingual and Multimodal Information Systems, pp. 145-155, 2010.
  • M. Potthast, B. Stein, A. Eiselt, A. Barron-Cedeno and P. Rosso, “Overview of the 1st International Competition on Plagiarism Detection”, Proceedings of International Conference on Spanish Society for Natural Language Processing, pp. 1-69, 2009.
  • M. Potthast, B. Stein, A. Eiselt, A. Barron-Cedeno and P. Rosso, “Overview of the 2nd International Competition on Plagiarism Detection”, Proceedings of International Conference on Spanish Society for Natural Language Processing, pp. 1-71, 2010.
  • M. Potthast, B. Stein, A. Eiselt, A. Barron-Cedeno and P. Rosso, “Overview of the 3rd International Competition on Plagiarism Detection”, Proceedings of International Conference on Spanish Society for Natural Language Processing, pp. 1-78, 2011.
  • M. Potthast, T. Gollub, M. Hagen, J. Grabegger, J. Kiesel, M. Michel, A. Barron-Cedeno and P. Rosso, “Overview of the 4th International Competition on Plagiarism Detection”, Proceedings of International Conference on Spanish Society for Natural Language Processing, pp. 1-68, 2012.
  • M. Potthast, T. Gollub, M. Hagen, M. Tippmann, J. Kiesel, P. Rosso, E. Stamatatos and B. Stein, “Overview of the 5th International Competition on Plagiarism Detection”, Proceedings of International Conference on Spanish Society for Natural Language Processing, pp. 1-58, 2013.
  • M. Potthast, M. Hagen, B. Anna, B. Matthias, M. Tippmann, P. Rosso and B. Stein, “Overview of the 6th International Competition on Plagiarism Detection”, Proceedings of International Conference on Spanish Society for Natural Language Processing, pp. 1-66, 2014.
  • R. Gaizauskas, J. Foster and Y. Wilks, “The METER Corpus: A Corpus for Analyzing Journalistic Text Reuse”, Proceedings of International Conference on Corpus Linguistics, pp. 214-223, 2001.
  • Brown Corpus Information, Available at http://clwww.essex.ac.uk/w3c/corpus_ling/content/corpora/list/private/brown/brown.html, Accessed in 2005.
  • P. Clough and M. Stevenson, “Developing a Corpus of Plagiarized Short Answers”, Language Resources and Evaluation: Special Issue on Plagiarism and Authorship Analysis, Vol. 45, No. 1, pp. 5-24, 2011.
  • S. Burrows, M. Potthast, B. Stein and A. Eiselt, “Webis Crowd Paraphrase Corpus 2011”, Available at https://webis.de/data/webis-cpc-11.html, Accessed in 2013.
  • B. Dolan, C. Quirk and C. Brockett, “Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources”, Proceedings of International Conference on Computational Linguistics, pp. 350-355, 2004.
  • B. Pang, K. Knight and D. Marcu, “Syntax-Based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences”, Proceedings of International Conference on Human Language Technology, pp. 181-188, 2003.
  • P. Resnik, “Using Information Content to Evaluate Semantic Similarity in a Taxonomy”, Proceedings of International Joint Conference on Artificial Intelligence, pp. 448-453, 1995.
  • D. Lin, “An Information-Theoretic Definition of Similarity”, Proceedings of International Conference on Machine Learning, pp. 296-304, 1998.
  • J.J. Jiang and D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy”, Proceedings of International Conference on Research in Computational Linguistics, pp. 19-33, 1997.
  • C. Leacock, M. Chodorow and G.A. Miller, “Combining Local Context and WordNet Sense Similarity for Word Sense Identification”, MIT Press, 1998.
  • Z. Wu and M. Palmer, “Verb Semantics and Lexical Selection”, Proceedings of Annual Meeting of the Association for Computational Linguistics, pp. 133-138, 1994.
  • M. Honnibal, “Spacy (version 1.3.0)”, Available at https://spacy.io/, Accessed in 2016.


Authors

Richa Dhagat
Department of Computer Science and Engineering, Bhilai Institute of Technology, India
Arpana Rawal
Department of Computer Science and Engineering, Bhilai Institute of Technology, India
Sunita Soni
Department of Computer Science and Engineering, Bhilai Institute of Technology, India
