
Automatic Text Correction for Devanagari OCR


Affiliations
1 Department of Computer Science, Punjabi University, Patiala – 147002, Punjab, India
 

Objectives: This paper proposes a new technique, based on a confusion matrix, for correcting errors made by a Devanagari OCR (Optical Character Reader) system. Methods/Statistical Analysis: The confusion matrix is generated from a large Hindi corpus. The system takes each word of the OCR output and generates a number of candidate strings from the five most frequently confused characters for each character of the input word, along with the probability of each string for ranking. Each string is validated against a character trigram dictionary, and the valid strings are used to produce the best suggestions. Findings: The top five words are taken as suggestions. The system has been tested on a variety of Devanagari OCR output documents. For all correctly recognized words, the system places the word itself at the top of the suggestions. For more than 10,000 unique words in Devanagari OCR output, the system achieves an accuracy of 97%. Application/Improvements: The system is used in the post-processing of Devanagari OCR. With some improvements, it can also be used for the Gurumukhi and Urdu scripts.
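
The pipeline outlined in the abstract (confused-character alternatives, probability ranking, trigram validation, top five suggestions) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `^`/`$` word-boundary padding, the layout of the `confusion` dictionary, and the toy probabilities are all assumptions made for illustration. The sketch also assumes that the recognized character itself counts as one of the alternatives and that a candidate's score is the product of its per-character probabilities.

```python
from itertools import product
from math import prod


def trigrams(word):
    """Character trigrams of a word, padded with boundary markers (an assumption)."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_trigram_set(lexicon):
    """Collect every character trigram seen in a list of known-good words."""
    return {t for w in lexicon for t in trigrams(w)}


def suggest(word, confusion, valid_trigrams, per_char=5, top_n=5):
    """Return up to top_n (candidate, score) pairs for one OCR output word.

    confusion maps an observed character to {intended character: probability}.
    Characters missing from confusion are assumed to be recognized correctly.
    """
    options = []
    for ch in word:
        alts = confusion.get(ch, {ch: 1.0})
        # Keep only the per_char most probable alternatives for this position.
        ranked = sorted(alts.items(), key=lambda kv: kv[1], reverse=True)
        options.append(ranked[:per_char])

    candidates = []
    for combo in product(*options):
        cand = "".join(c for c, _ in combo)
        # Discard candidates containing a trigram never seen in the lexicon.
        if all(t in valid_trigrams for t in trigrams(cand)):
            candidates.append((cand, prod(p for _, p in combo)))

    candidates.sort(key=lambda cp: cp[1], reverse=True)
    return candidates[:top_n]


# Toy data (hypothetical): the OCR sometimes outputs "म" where "भ" was printed.
toy_trigrams = build_trigram_set(["भारत", "भारती"])
toy_confusion = {"म": {"म": 0.8, "भ": 0.2}}
print(suggest("मारत", toy_confusion, toy_trigrams))  # [('भारत', 0.2)]
```

In this toy run the unchanged string is rejected because its leading trigrams never occur in the lexicon, so only the corrected word survives. Enumerating every combination grows as 5^n with word length n, so a practical system would presumably prune candidates earlier; the abstract does not say how.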

Keywords

Automatic Text Correction, Confusion Matrix, Devanagari, OCR, Trigram.

Authors

Atul Kumar
Department of Computer Science, Punjabi University, Patiala – 147002, Punjab, India
Gurpreet Singh Lehal
Department of Computer Science, Punjabi University, Patiala – 147002, Punjab, India




DOI: https://doi.org/10.17485/ijst%2F2016%2Fv9i45%2F128528