Word Embedding Models for Finding Semantic Relationship between Words in Tamil Language


Affiliations
1 Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amrita University, Coimbatore - 641112, Tamil Nadu, India
 

Abstract

Objective: Word embedding models are widely used in NLP tasks such as document classification, author identification, and story understanding. In this paper we compare two word embedding models for semantic similarity in the Tamil language. Each of the two models has its own way of predicting relationships between words in a corpus. Method/Analysis: In Natural Language Processing, a word embedding is a representation of words as vectors. Word embedding is used as an unsupervised approach in place of traditional feature extraction. Word embedding models use neural networks to generate numerical representations of the given words. A morphologically rich language such as Tamil is a demanding test case for finding the model that best captures semantic relationships between words. Tamil is one of the oldest Dravidian languages and is known for its morphological richness: it is possible to construct 10,000 words from a single root word. Findings: We compare Content based and Context based Word embedding models. We tried different feature vector sizes for the same word in order to comment on the accuracy of the models for semantic similarity. Novelty/Improvement: Analysing word embedding models for a morphologically rich language like Tamil helps us classify words better based on their semantics.
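As an illustration of how semantic similarity is commonly measured once words are represented as vectors, the sketch below computes cosine similarity between embeddings and ranks neighbours of a query word. It is a minimal example under stated assumptions, not the authors' evaluation procedure: the abstract does not name the similarity measure used, and the names `cosine_similarity`, `most_similar`, and `toy_embeddings`, along with the 3-dimensional vectors, are hypothetical and made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: placeholder words and invented values,
# not vectors produced by any trained model.
toy_embeddings = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [1.0, 1.0, 0.0],
    "word_c": [0.0, 0.0, 1.0],
}


def most_similar(query, embeddings):
    """Rank every other word by cosine similarity to `query`, highest first."""
    q = embeddings[query]
    scores = [
        (word, cosine_similarity(q, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `most_similar("word_a", toy_embeddings)` ranks `word_b` ahead of `word_c`, because `word_c` is orthogonal to `word_a`. With a trained model, the same comparison could be repeated for each feature vector size to see how the ranking of related words changes.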

Keywords

CBOW, Content based Word Embedding, Context based Word Embedding, Morphology, Semantic and Syntactic, Skip Gram.

Authors

S. G. Ajay
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amrita University, Coimbatore - 641112, Tamil Nadu, India
M. Srikanth
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amrita University, Coimbatore - 641112, Tamil Nadu, India
M. Anand Kumar
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amrita University, Coimbatore - 641112, Tamil Nadu, India
K. P. Soman
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Amrita University, Coimbatore - 641112, Tamil Nadu, India


DOI: https://doi.org/10.17485/ijst/2016/v9i45/128521