
Combining Different Seed Dictionaries to Extract Lexicon from Comparable Corpus


Affiliations
1 Department of Informatica, Universita di Pisa, Pisa, Italy
2 Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran, Islamic Republic of
3 Department of Computer Science and Engineering, Shiraz University, Iran, Islamic Republic of
4 Department of Electronics, Informatics and Systems, University of Calabria, Rende, Italy
 

In recent years, many studies have proposed methods for extracting new bilingual lexicons from non-parallel (comparable) corpora. Nearly all of them use a small existing dictionary or another resource to build an initial list, called the seed dictionary. In this paper we discuss using different types of dictionaries, and combinations of them, as the initial list for producing a bilingual Persian-Italian lexicon from a comparable corpus. Our experiments apply state-of-the-art techniques to four different seed dictionaries: an existing dictionary and three dictionaries created with a pivot-based scheme, using English, Arabic, and French as the pivot languages. An interesting challenge in our approach is devising a method for combining different dictionaries to produce a better and more accurate lexicon. To combine the seed dictionaries, we propose two novel combination models and examine their effect on comparable corpora collected from news agencies. The experimental results obtained with our implementation show the effectiveness of the proposed combinations.
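As a rough illustration of the pivot-based idea mentioned in the abstract, the sketch below composes a source-to-pivot dictionary with a pivot-to-target dictionary. It is a generic sketch, not the authors' implementation: the function name `build_pivot_dictionary`, the link-count scoring, and the toy entries (transliterated Persian, with English as the pivot) are all assumptions made for this example.

```python
from collections import defaultdict


def build_pivot_dictionary(src_to_pivot, pivot_to_tgt):
    """Compose a source->pivot and a pivot->target dictionary.

    Each candidate target word is scored by the number of distinct
    pivot words that link it to the source word (an assumed heuristic).
    """
    result = {}
    for src_word, pivot_words in src_to_pivot.items():
        counts = defaultdict(int)
        for pivot_word in set(pivot_words):
            # Pivot words missing from the second dictionary add nothing.
            for tgt_word in set(pivot_to_tgt.get(pivot_word, [])):
                counts[tgt_word] += 1
        if counts:
            result[src_word] = dict(counts)
    return result


# Toy, hypothetical entries: Persian (transliterated) -> English -> Italian.
fa_to_en = {"ketab": ["book"], "shir": ["milk", "lion", "tap"]}
en_to_it = {
    "book": ["libro"],
    "milk": ["latte"],
    "lion": ["leone"],
    "tap": ["rubinetto"],
}

seed = build_pivot_dictionary(fa_to_en, en_to_it)
```

The ambiguous entry `shir` shows why pivot-derived seed dictionaries are noisy: every sense of the source word survives the composition, which is one motivation for comparing and combining dictionaries built from several pivot languages.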

Keywords

Bilingual Lexicon, Comparable Corpus, Pivot Language

Authors

Ebrahim Ansari
Department of Informatica, Universita di Pisa, Pisa, Italy
M. H. Sadreddini
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran, Islamic Republic of
Alireza Tabebordbar
Department of Computer Science and Engineering, Shiraz University, Iran, Islamic Republic of
Mehdi Sheikhalishahi
Department of Electronics, Informatics and Systems, University of Calabria, Rende, Italy


DOI: https://doi.org/10.17485/ijst%2F2014%2Fv7i9%2F59466