Translation

Shallow-Transfer Translation System

WP Leader: Hrafn Loftsson; Linguistic Expert: Eiríkur Rögnvaldsson; Machine Translation Expert: Mikel L. Forcada; Master’s Students: Martha Dís Brandt and NN.

Hrafn Loftsson is the lead investigator and provides expertise in computer science, programming, and parsing methods. Eiríkur Rögnvaldsson provides expertise in Icelandic syntax and morphology. Mikel L. Forcada provides expertise in shallow-transfer machine translation and the Apertium system. Martha Dís Brandt works on training the system, PoS tagging, dictionary building, and structural transfer rules, and writes her Master’s thesis within the project. Master’s student NN works on dictionary creation, shallow parsing, target language generation, evaluation, and the web interface, and writes his/her Master’s thesis within the project.

Machine translation (MT) is the attempt to automate all or part of the process of translating from one human language, the source language (SL), to another, the target language (TL). One MT approach is the transfer method, in which the SL is translated to the TL with the help of linguistic knowledge in the form of rules. This approach generally requires lexicons, tools like taggers and parsers, and a set of transfer rules which map a representation of a sentence in the SL to a corresponding sentence in the TL. The SYSTRAN MT system is probably the best-known transfer system in use today (Flanagan and McClure 2002). A contrasting method is statistical MT, which is based on bilingual (parallel) text corpora. Due to the increasing availability of large text corpora and the original work by Brown et al. (1990), this method has gained popularity in the last 15 years or so. For example, the Google language translation tools (http://www.google.com/language_tools) are based on statistical MT.

Shallow-transfer MT (STMT) is a simplified version of the transfer-based approach. In an STMT system, full parsing of the SL is not carried out and the transfer rules are typically operations on groups of lexical units, i.e. the operations apply to a shallow syntactic analysis instead of full parse trees. This is the main advantage of an STMT system, because developing parsers able to perform full parsing is a very time-consuming task. While full parsing may become available for Icelandic as a result of the treebank part of our project, it is important to be able to implement MT without such a resource, and such techniques will be relevant to many less-resourced languages.
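The idea of a transfer rule operating on a group of lexical units, rather than on a parse tree, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag names ("n.def", "poss"), the tiny bilingual dictionary, and the single reordering rule are hypothetical, and the code does not use Apertium's actual rule format or the project's tools.

```python
from typing import Dict, List, Tuple

# A lexical unit is a (lemma, tag) pair, as a tagger and lemmatiser might
# produce. The tag set here is hypothetical.
Token = Tuple[str, str]

# Toy bilingual dictionary (illustrative entries only).
BIDIX: Dict[str, str] = {"ég": "I", "lesa": "read", "bók": "book", "minn": "my"}


def translate_lemma(lemma: str) -> str:
    """Look up a lemma, falling back to the source lemma if it is unknown."""
    return BIDIX.get(lemma, lemma)


def shallow_transfer(tokens: List[Token]) -> List[str]:
    """Scan left to right and apply one pattern-based rule to adjacent units.

    Rule (assumed for illustration): a definite noun followed by a
    possessive is reordered to possessive + noun in the target language.
    All other units are translated word for word.
    """
    output: List[str] = []
    i = 0
    while i < len(tokens):
        is_pair = i + 1 < len(tokens)
        if is_pair and tokens[i][1] == "n.def" and tokens[i + 1][1] == "poss":
            output.append(translate_lemma(tokens[i + 1][0]))  # possessive first
            output.append(translate_lemma(tokens[i][0]))      # then the noun
            i += 2
        else:
            output.append(translate_lemma(tokens[i][0]))
            i += 1
    return output


sentence = [("ég", "pron"), ("lesa", "v"), ("bók", "n.def"), ("minn", "poss")]
print(" ".join(shallow_transfer(sentence)))
```

The rule only inspects a flat window of two tagged units, which is the sense in which no full parse is needed. A real system would also handle agreement, tense, and target-language generation, none of which is modelled here.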

The open-source Apertium platform (Armentano-Oller et al. 2005) is an STMT platform which has been used to develop a number of MT systems during the last few years. The purpose of Apertium is to achieve a reasonable translation quality between related languages. More specifically, the goal is to a) produce drafts that can be post-edited to make them adequate for a purpose, and b) produce text that can be read for understanding. The development of Apertium has been led by Professor Mikel L. Forcada at the Universitat d’Alacant. He has argued that in the case of minor languages the development of a rule-based MT system is easier than developing a statistical MT system, because of “the amounts of sentence-aligned parallel text (of the order of hundreds of thousands or millions of words) required to get reasonable results in pure corpus-based MT, such as statistical MT” (Forcada 2006). Note that no parallel corpus containing Icelandic text exists.

In this work package, we will develop an STMT system. The project has the following three main objectives: i) to find the most economical methods for creating the rules and data needed for a successful implementation of an STMT system; ii) to find ways of incorporating existing LT tools into the Apertium platform; and iii) to use these means to develop a prototype of an STMT system. For this purpose, we will use an Icelandic-English translation system as a test case.

The existing BLARK tools that we intend to incorporate into the Apertium platform are our POS tagger (Loftsson 2008), our lemmatiser (Ingason et al. 2008), and our shallow parser (Loftsson and Rögnvaldsson 2007a). All these tools, as well as the data constructed, will be made open source during the project (the GPL open-source license will be used; see Section 7). Instead of spending a considerable amount of time constructing a morphological dictionary/analyser for the SL (Icelandic), as is generally needed in Apertium, we put emphasis on developing methods that utilize our existing tools. A morphological dictionary for the TL is also needed, but since our TL is English we can use an open-source English morphological dictionary generated in other Apertium projects. We assume that our eventual findings will be beneficial for other (inflectional) languages for which similar BLARK tools exist. Moreover, if such tools do not exist for a language (e.g. some minor language), our findings may encourage the development of such tools for the purpose of STMT. Translation data created during the project will be made available to other researchers, thus encouraging others to develop alternative ways to implement an MT system for Icelandic.

We will use two evaluation criteria in this work package. First, a group of users will be asked to evaluate the system by answering the following question: Is the output of our system of such quality that correcting it is more worthwhile than translating the text from scratch? Second, the system will be evaluated with regard to the error rate, measured as the number of words that have to be inserted, deleted or substituted per 100 words.
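The second criterion can be sketched as a word-level edit distance between the system output and a corrected version of it. This is only one possible reading of the measure: the function names are hypothetical, tokenisation is a plain whitespace split, and the count is normalised by the length of the corrected text, which the description above does not specify.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Minimum number of word insertions, deletions and substitutions
    needed to turn hyp into ref (Levenshtein distance over words)."""
    prev = list(range(len(ref) + 1))  # distance from empty hyp to ref[:j]
    for i, h in enumerate(hyp, start=1):
        cur = [i]  # distance from hyp[:i] to empty ref
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # delete h
                cur[j - 1] + 1,          # insert r
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def errors_per_100_words(mt_output: str, corrected: str) -> float:
    """Edits per 100 words, normalised by the corrected text (an assumption)."""
    hyp, ref = mt_output.split(), corrected.split()
    if not ref:
        raise ValueError("corrected text must contain at least one word")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


print(errors_per_100_words("I read book my today", "I read my book"))
```

In the example, two edits (one deletion and one substitution) are needed against a four-word corrected sentence, so the function reports 50 errors per 100 words. Normalising by the length of the system output instead would be an equally plausible choice.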
