A Fast and Efficient Framework for Creating Parallel Corpus

B. Premjith; S. Sachin Kumar; R. Shyam; M. Anand Kumar; K. P. Soman

doi:10.17485/ijst/2016/v9i45/128508

Affiliations
Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita University, Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore – 641 112, Tamilnadu, India

Abstract

Objectives: To present a framework that combines the Scansnap SV600 scanner with Google Optical Character Recognition (OCR) to create parallel corpora, an essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training an SMT system depends heavily on the availability of a parallel corpus, so an effective approach to collecting parallel sentences is a key first step in building an MT system. However, creating a parallel corpus manually requires extensive knowledge of both languages and is time consuming. These limitations make it difficult to digitize documents, which in turn affects the quality of machine translation systems. In this paper, we propose a faster and more efficient way of generating English to Indian language parallel corpora with less human involvement. With a special type of scanner, the Scansnap SV600, together with Google OCR and a little linguistic knowledge, a parallel corpus can be created for any language pair, provided that paper documents containing parallel sentences are available. Findings: This approach generated 40 parallel sentences in 1 hour. Sophisticated morphological tools were used to change the morphology of the generated text and thereby increase the size of the corpus. An additional benefit is that ancient scriptures and other manuscripts can be converted to digital format, which coming generations can consult to keep up the traditions of a nation or a society. Novelty: Incorporating Google OCR and a book scanner reduces the time required to create a parallel corpus.

Keywords

Google OCR, Machine Translation, Parallel Corpus, Statistical Machine Translation, Scansnap SV600 Scanner.

Indian Journal of Science and Technology
