Indian Languages IR using Latent Semantic Indexing


Affiliations
1 Department of Computer Science Engineering, JNTUACE, Anantapur, India
2 Department of Computer Science Engineering, Osmania University, Hyderabad, India
3 Department of Computer Science Engineering, JNTUHCE, Nachupalli, India
 

Retrieving information across languages raises problems such as polysemy and synonymy, which Latent Semantic Indexing (LSI) techniques can address. This paper uses the Singular Value Decomposition (SVD) step of LSI to achieve effective indexing for English and Hindi. A parallel corpus of Hindi and English documents is created and used for training and testing the system. Stop words are removed from the documents, followed by stemming and normalization, to reduce the feature space and capture relations between the two languages. Cosine similarity is then computed between the query document and the target document. Our experimental results show that LSI-based Cross Language Information Retrieval (CLIR) outperforms non-LSI-based retrieval, with retrieval success rates of 67% and 9% respectively.
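The pipeline summarized above (stop-word removal, SVD of a term-document matrix built from a parallel corpus, then cosine similarity) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the four-document toy corpus with romanized Hindi tokens, the stop-word list, the choice of k = 2, and the helper names `tokenize`, `term_vector`, `fit_lsi`, `fold_in` and `cosine`. Stemming and normalization are omitted for brevity, and queries are projected with the standard LSI fold-in formula.

```python
import numpy as np

# Toy parallel corpus (illustrative only, not the paper's data): each training
# document is an English text concatenated with a romanized Hindi rendering.
STOP_WORDS = {"the", "is", "in", "ka", "hai", "mein"}
parallel_docs = [
    "the river water is clean nadi ka pani saaf hai",
    "river water flows nadi pani behta hai",
    "the school teacher is in the school vidyalaya ka shikshak vidyalaya mein hai",
    "teacher school book shikshak vidyalaya kitab",
]

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

# Vocabulary and raw term-document count matrix A (terms x documents).
vocab = sorted({t for d in parallel_docs for t in tokenize(d)})
index = {t: i for i, t in enumerate(vocab)}

def term_vector(text):
    """Raw term-count vector over the training vocabulary."""
    v = np.zeros(len(vocab))
    for t in tokenize(text):
        if t in index:
            v[index[t]] += 1.0
    return v

A = np.column_stack([term_vector(d) for d in parallel_docs])

def fit_lsi(A, k):
    """Truncated SVD: keep the k largest singular values and left vectors."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k]

def fold_in(text, U_k, s_k):
    """Project a text into the k-dimensional latent space: q^T U_k S_k^{-1}."""
    return (term_vector(text) @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

U_k, s_k = fit_lsi(A, k=2)

# English query against Hindi-only targets.
query = "river water"
sim_related = cosine(fold_in(query, U_k, s_k), fold_in("nadi pani", U_k, s_k))
sim_unrelated = cosine(fold_in(query, U_k, s_k),
                       fold_in("vidyalaya shikshak", U_k, s_k))

# Non-LSI baseline: raw term vectors share no terms across languages.
raw_related = cosine(term_vector(query), term_vector("nadi pani"))

print(sim_related, sim_unrelated, raw_related)
```

In this toy setting the English query and its Hindi counterpart have no terms in common, so raw term matching cannot relate them, while the latent space learned from the concatenated parallel documents places them close together. The sketch only shows the mechanism; it says nothing about the 67% and 9% figures reported above.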

Keywords

Latent Semantic Indexing, Cross Language Information Retrieval, Indexing, Singular Value Decomposition.

Authors

A. P. SivaKumar
Department of Computer Science Engineering, JNTUACE, Anantapur, India
P. Premchand
Department of Computer Science Engineering, Osmania University, Hyderabad, India
A. Govardhan
Department of Computer Science Engineering, JNTUHCE, Nachupalli, India
