
A Fast and Efficient Framework for Creating Parallel Corpus


Affiliations
1 Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita University, Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore – 641 112, Tamilnadu, India
 

Objectives: To present a framework that combines the Scansnap SV600 scanner with Google optical character recognition (OCR) to create parallel corpora, an essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training an SMT system depends heavily on the availability of a parallel corpus, so an effective method for collecting parallel sentences is a key step in building a machine translation (MT) system. However, creating a parallel corpus manually requires extensive knowledge of both languages and is time-consuming. These limitations make it difficult to digitize documents, which in turn affects the quality of machine translation systems. In this paper, we propose a faster and more efficient way of generating English-to-Indian-language parallel corpora with less human involvement. Using the Scansnap SV600 scanner, Google OCR and a little linguistic knowledge, a parallel corpus can be created for any language pair, provided that paper documents containing parallel sentences are available. Findings: With this approach, it was possible to generate 40 parallel sentences in one hour. Sophisticated morphological tools were used to change the morphology of the generated text and thereby increase the size of the corpus. An additional benefit is that ancient scriptures and other manuscripts can be made available in digital format, which coming generations can consult to preserve the traditions of a nation or society. Novelty: The time required to create a parallel corpus is reduced by incorporating Google OCR and a book scanner.

Keywords

Google OCR, Machine Translation, Parallel Corpus, Statistical Machine Translation, Scansnap SV600 Scanner.
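
The abstract does not give implementation details, so the following is only a minimal sketch of the final pairing step, assuming the OCR output for each language has already been saved as plain text with one sentence per line. The function names (`clean_line`, `pair_sentences`, `to_tsv`) and the tab-separated output format are hypothetical choices made for illustration; the sketch does not call the scanner or Google OCR and is not the authors' implementation.

```python
from typing import Iterable, List, Tuple


def clean_line(line: str) -> str:
    """Collapse whitespace runs (common in OCR output) and trim both ends."""
    return " ".join(line.split())


def pair_sentences(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair source and target sentences by position.

    Assumption: after blank lines are dropped, line i of the source text is
    the translation counterpart of line i of the target text.
    """
    src = [s for s in (clean_line(l) for l in source_lines) if s]
    tgt = [t for t in (clean_line(l) for l in target_lines) if t]
    if len(src) != len(tgt):
        # A mismatch means the two sides are not aligned line by line,
        # so a human should inspect the OCR output before pairing.
        raise ValueError(
            f"sentence counts differ: {len(src)} source vs {len(tgt)} target"
        )
    return list(zip(src, tgt))


def to_tsv(pairs: List[Tuple[str, str]]) -> str:
    """Serialize sentence pairs as one tab-separated pair per line."""
    return "\n".join(f"{s}\t{t}" for s, t in pairs)
```

Calling `pair_sentences` on the English lines and the corresponding Indian-language lines, then `to_tsv` on the result, yields a two-column file. Raising an error on a count mismatch, rather than truncating silently, keeps the limited human checking the abstract mentions focused on pages where the OCR output is not aligned.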

Authors

B. Premjith
S. Sachin Kumar
R. Shyam
M. Anand Kumar
K. P. Soman
All authors: Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Amrita University, Amrita Vishwa Vidyapeetham, Amritanagar, Coimbatore – 641 112, Tamilnadu, India

DOI: https://doi.org/10.17485/ijst%2F2016%2Fv9i45%2F128508