
A Phrase Table Filtering Model Based on Binary Classification for Uyghur-Chinese Machine Translation


Affiliations
1 Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China
 

In statistical machine translation, a large number of unreasonable phrase pairs in a phrase table can reduce decoding efficiency and overall translation performance, especially in Uyghur-Chinese machine translation. In this paper, we present a novel phrase table filtering model based on binary classification, which takes the differences between Uyghur and Chinese into account and draws on binary classification techniques from machine learning. Our model considers four features: 1) the difference in length between the source and target phrases; 2) the proportion of translated words in a phrase pair; 3) the proportion of symbol words; 4) the average number of co-occurring words in the training corpus. We use this model to generate a filtered phrase table. Experimental results show that the new filtering model improves both the performance and the efficiency of our current Uyghur-Chinese machine translation system.

Keywords

Uyghur-Chinese Machine Translation, Phrase Table Filtering, Binary Classification.
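
The four features named in the abstract could be computed roughly as in the Python sketch below. The abstract does not define them precisely, so every definition here is an assumption made for illustration: phrases are lists of tokens, "translated words" are looked up in a hypothetical bilingual lexicon, "symbol words" are tokens made up only of punctuation or symbol characters, and co-occurrence is counted per sentence pair. The function names are likewise hypothetical, not taken from the paper.

```python
import unicodedata
from collections import Counter
from itertools import product


def is_symbol(token):
    """True if every character is Unicode punctuation (P*) or a symbol (S*)."""
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def build_cooccurrence(parallel_corpus):
    """Count, for each (source word, target word), the sentence pairs containing both."""
    counts = Counter()
    for src_sent, tgt_sent in parallel_corpus:
        counts.update(product(set(src_sent), set(tgt_sent)))
    return counts


def phrase_pair_features(src, tgt, lexicon, cooc):
    """Return the four features for a non-empty tokenized phrase pair.

    lexicon maps a source word to a set of target words (assumed resource);
    cooc is the Counter produced by build_cooccurrence.
    """
    tgt_set = set(tgt)
    # 1) Difference in length between source and target phrase.
    length_diff = abs(len(src) - len(tgt))
    # 2) Share of source words with a lexicon translation present in the target.
    translated = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    translated_ratio = translated / len(src)
    # 3) Share of tokens (both sides) that are pure punctuation or symbols.
    symbol_ratio = sum(is_symbol(t) for t in src + tgt) / (len(src) + len(tgt))
    # 4) Average corpus co-occurrence count over all source-target word pairs.
    avg_cooc = sum(cooc[(s, t)] for s in src for t in tgt) / (len(src) * len(tgt))
    return [length_diff, translated_ratio, symbol_ratio, avg_cooc]


# Toy placeholder tokens, not real Uyghur or Chinese data.
corpus = [(["a", "b"], ["x", "y"]), (["a"], ["x"])]
cooc = build_cooccurrence(corpus)
features = phrase_pair_features(["a", "b"], ["x", "."], {"a": {"x"}}, cooc)
```

A feature vector of this kind could then be passed to any binary classifier that labels a phrase pair as kept or filtered; the abstract does not say which classifier the authors use.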



Authors

Chenggang Mi
Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China
Yating Yang
Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China
Xi Zhou
Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China
Lei Wang
Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China
Xiao Li
Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China
Eziz Tursun
Xinjiang Technical Institute of Physics and Chemistry of Chinese Academy of Sciences, Urumqi 830011, China

