
Classification of Gujarati Documents using Naïve Bayes Classifier


Affiliations
1 School of Computer Science, R. K. University, Rajkot - 360020, Gujarat, India
2 Narmada College of Computer Application, Bharuch - 392011, Gujarat, India
 

Objectives: Information overload on the web is a major problem for institutions and businesses today. Retrieving useful documents written in an Indian language is challenging because of morphological variance and the language barrier. To date, no document classifier is available for the Gujarati language. Methods: Keyword search is one way to retrieve meaningful documents from the web, but it does not discriminate by context. In this paper we present the Naïve Bayes (NB) statistical machine learning algorithm for the classification of Gujarati documents. Six pre-defined categories are used: sports, health, entertainment, business, astrology and spiritual. A corpus of 280 Gujarati documents per category is used for training and testing the categorizer. We used k-fold cross-validation to evaluate the performance of the Naïve Bayes classifier. Findings: The experimental results show that the accuracy of the NB classifier was 75.74% without feature selection and 88.96% with feature selection. These results indicate that the NB classifier contributes effectively to Gujarati document classification. Applications: The proposed work is useful for implementing directory search in web portals to sort useful documents, and for many Information Retrieval (IR) applications.
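
The pipeline described in the abstract (a Naïve Bayes text classifier evaluated with k-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a multinomial model with add-one (Laplace) smoothing, already tokenized input, and a simple interleaved fold split, none of which the abstract specifies. The function names (`train_nb`, `predict_nb`, `k_fold_accuracy`) and the toy English tokens are hypothetical, and the feature selection step is omitted because the abstract does not say which method was used.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # per-class token counts
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    return class_docs, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for one document."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):        # sorted for deterministic ties
        score = math.log(class_docs[label] / total_docs)   # log prior
        # Add-one smoothing: every vocabulary word gets one extra count.
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:                  # skip tokens unseen in training
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def k_fold_accuracy(docs, labels, k):
    """Mean accuracy over k folds; fold i holds out documents i, i+k, i+2k, ..."""
    accuracies = []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train_docs = [d for j, d in enumerate(docs) if j not in test_idx]
        train_labels = [l for j, l in enumerate(labels) if j not in test_idx]
        model = train_nb(train_docs, train_labels)
        correct = sum(predict_nb(model, docs[j]) == labels[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Toy, made-up tokenized documents (NOT from the paper's Gujarati corpus).
docs = [
    ["match", "team", "score"], ["doctor", "diet", "sleep"],
    ["team", "win", "goal"], ["diet", "vitamin", "doctor"],
    ["score", "goal", "match"], ["sleep", "vitamin", "health"],
    ["win", "team", "match"], ["health", "doctor", "diet"],
]
labels = ["sports", "health"] * 4

model = train_nb(docs, labels)
print(predict_nb(model, ["goal", "team"]))
print(k_fold_accuracy(docs, labels, k=4))
```

On real data the tokens would come from a Gujarati tokenizer, and a feature selection step would shrink the vocabulary before training. The toy example only shows how the classifier and the cross-validation loop fit together, and its accuracy says nothing about the figures reported above.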

Keywords

Classification, Document Categorization, Gujarati Language, Naïve Bayes.

Authors

Rajnish M. Rakholia
School of Computer Science, R. K. University, Rajkot - 360020, Gujarat, India
Jatinderkumar R. Saini
Narmada College of Computer Application, Bharuch - 392011, Gujarat, India




DOI: https://doi.org/10.17485/ijst%2F2017%2Fv10i5%2F150085