
A Probabilistic Smoothing Approach for Language Models Applied to Protein Sequence Data









Authors

Gopal Suresh
Department of Statistics, Manonmaniam Sundaranar University, Tirunelveli-627012, India
Chellapa Vijayalakshmi
Department of Mathematics, Sathyabama University, Chennai, India

Abstract


Most modern statistical language-modeling techniques are widely applied in domains such as speech recognition, machine translation, and information retrieval. A language model is probabilistic at its core: it estimates a probability distribution over strings, typically sentences. One of the core problems a language model must address is smoothing, whose primary goal is to improve model accuracy by adjusting the maximum likelihood estimates of the probabilities. To meet this challenge, this paper applies a well-known smoothing technique, Good-Turing, to a bioinformatics task on protein sequence data. The computational procedure is implemented as an R program that estimates the bigram and trigram probabilities of the language models for the protein sequence. Experimental results show that exponential and linear smoothing curves fit the bigram and trigram sequences, respectively, with very high model accuracy.
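
Since the computations are carried out in R, a minimal R sketch of the Good-Turing adjustment for bigram counts over a protein sequence is given below. The toy sequence and the bare frequency-of-frequencies estimator are illustrative assumptions, not the authors' actual program: the adjusted count is c* = (c + 1) N_{c+1} / N_c, and the mass N_1 / N is reserved for unseen bigrams.

    # Minimal Good-Turing sketch over protein bigrams (illustrative only;
    # the sequence below is a hypothetical example, not the paper's data)
    seq_str  <- "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    residues <- strsplit(seq_str, "")[[1]]

    # Overlapping bigrams of adjacent amino-acid residues
    bigrams <- paste0(residues[-length(residues)], residues[-1])
    counts  <- table(bigrams)
    N       <- sum(counts)

    # Frequency of frequencies: N_c = number of distinct bigrams seen c times
    fof <- table(as.numeric(counts))
    Nc  <- function(c) {
      v <- fof[as.character(c)]
      if (is.na(v)) 0 else as.numeric(v)
    }

    # Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c;
    # keep the raw count where N_{c+1} = 0
    gt_count <- function(c) {
      if (Nc(c) > 0 && Nc(c + 1) > 0) (c + 1) * Nc(c + 1) / Nc(c) else c
    }

    adjusted <- sapply(as.numeric(counts), gt_count)
    probs    <- setNames(adjusted / N, names(counts))

    p_unseen <- Nc(1) / N   # probability mass reserved for unseen bigrams
    print(round(sort(probs, decreasing = TRUE)[1:5], 4))
    print(p_unseen)

Where N_{c+1} is zero the raw count is kept in this sketch; in practice a curve fitted over the frequency-of-frequencies, such as the exponential and linear fits the paper reports for bigrams and trigrams, fills those gaps.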

Keywords


Bigram Model, Language Model, Smoothing, N-Gram Model, Trigram Model.