Treebank

Development of Parsing Strategies and a Treebank

WP Leader: Eiríkur Rögnvaldsson; Computer Science Expert: Hrafn Loftsson; Treebanking and Parsing Expert: Anthony Kroch; Post-Doc: Joel Wallenberg; Doctoral Researcher: Sigrún Helgadóttir; Master's Student: Anton Karl Ingason; Researcher: NN.

Eiríkur Rögnvaldsson is the lead investigator and contributes expertise in Modern and Old Icelandic syntax. Hrafn Loftsson contributes expertise in computer science and parsing algorithms. Anthony Kroch provides expertise in corpus building, parsing, and treebank construction. Joel Wallenberg develops and adapts parsing methods used at the University of Pennsylvania and works on creating a treebank of Icelandic texts. Sigrún Helgadóttir provides expertise in statistical methods. Her dissertation is written within the project and concentrates on the construction, training, and evaluation of statistical parsers. Anton Karl Ingason works on annotation software and related tasks and writes his Master's thesis within the project. The researcher(s) manually correct the annotation, collect data, and carry out related tasks.

A treebank is a linguistically annotated corpus that includes grammatical analysis beyond the part-of-speech level. Treebanks are now generally recognized as extremely important resources, both for linguistic studies and for language technology (LT), and they are therefore being built for various languages. These treebanks differ widely, both in their annotation schemes and in their theoretical basis (cf. Nivre et al. 2005). The impact of the well-known Penn Treebank for English (http://www.cis.upenn.edu/~treebank/) clearly demonstrates the importance of such a resource for the LT of a given language. Treebanks are especially needed for training statistical parsers, and they are a valuable resource for syntactic research, synchronic as well as diachronic (e.g. Kroch 1989; Kroch and Taylor 2000; Kroch et al. 2004). However, constructing a treebank for a new language is not a trivial task, particularly if resources are limited, as is the case for Icelandic and many other languages. In addition, because the field has been driven largely by work on English, the role of rich morphology in treebank construction and in the training of statistical parsers remains important but poorly addressed in the literature. Our goal is to make it realistic to develop a treebank for a less-resourced language, and we therefore focus on semi-automatic (machine-assisted) annotation.
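To make "grammatical analysis beyond the part-of-speech level" concrete, the following is a minimal sketch of how one treebank sentence might be stored and read, assuming a Penn-style bracketed format. The function names `parse_bracketed` and `leaves`, the example sentence, and its labels are illustrative assumptions, not the project's actual annotation scheme or tools.

```python
def parse_bracketed(text):
    """Parse a Penn-style bracketed string into nested (label, children) tuples.

    Children are either further (label, children) tuples or plain word strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # category label, e.g. a phrase or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                     # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the words of the sentence in order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Hypothetical example: "Jón las bókina" ("Jón read the book").
EXAMPLE = "(IP-MAT (NP-SBJ (NPR-N Jón)) (VBDI las) (NP-OB1 (N-A bókina)))"
tree = parse_bracketed(EXAMPLE)
print(tree[0], leaves(tree))
```

The innermost labels play the role of part-of-speech tags, while the enclosing labels record the phrase structure that a tagged-only corpus would lack.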

The first stage of our project will involve developing and refining software tools and methodologies for constructing treebanks with limited resources, taking advantage of the resources that are available. A full parser for Icelandic does not exist, but experiments have shown (Kroch and Wallenberg 2008) that a combination of other resources can achieve reasonable results for the purpose of semi-automatic treebank construction (cf. also Kulic et al. 2006). IceParser, a finite-state parser (Loftsson and Rögnvaldsson 2007a,b, 2008), is used for chunking (shallow parsing), and its output is fed to the Collins/Bikel statistical parser (Collins 1999, 2003; Bikel 2004a,b), trained on the syntactic structure of Early Modern English (EME). This yields fully parsed output that is good enough to make hand-correction feasible. The approach works reasonably well because Icelandic syntax is a subset of EME syntax: V2, VO, Quantifier Movement, and Object Shift (cf. Thráinsson 2007), for example, all exist in EME. Building on those experiments, we expect to improve the results further by integrating linguistic knowledge into the method, focusing on the information encoded in the morphology of the language. In the first phase of the project we will also develop and improve the software tools used for hand-correction. For this we will build on existing solutions such as CorpusDraw, a part of the CorpusSearch software written by Beth Randall and directed by Anthony Kroch (http://corpussearch.sourceforge.net/).

In the second stage of the project we will start creating a treebank for Icelandic using the methods and tools developed in the first stage. Two linguists, one from the University of Iceland and one from the University of Pennsylvania, will manually correct the machine annotation. During the treebank construction we will continue to refine our methods, and as the work progresses we will start training the statistical parser on the new Icelandic treebank in combination with the EME treebank. We will use a combination of Modern Icelandic and Old Icelandic data, so the treebank will be a fully parsed corpus of Icelandic with a diachronic dimension. This opens up important possibilities for the study of the historical syntax of Icelandic, in addition to the general advantages of having such a resource for Icelandic LT (cf. also Rögnvaldsson and Helgadóttir 2008).

All the software tools we develop, as well as the data we construct, will be made open-source during the project. Evaluation will focus on quality (the annotators reviewing each other's work) and efficiency (the output per hour for a given work setting). The tools, together with documentation of the methods employed, will be published online, and we expect our results to make important contributions to the creation of treebanks for other languages with limited LT resources and/or rich morphology.
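The two evaluation criteria named above could be quantified in many ways; the text does not specify the measures. As one possible sketch, the code below scores quality as the labeled-bracket agreement (an F1-style score over every labeled node, including POS-level nodes) between two annotators' trees for the same sentence, and efficiency as words corrected per hour. All function names, the tree representation, and the example trees are assumptions made for illustration.

```python
from collections import Counter


def labeled_spans(tree, start=0):
    """Return (end, spans) for a (label, children) tree.

    spans is a Counter of (label, start, end) triples, one per node,
    where start and end are word offsets in the sentence.
    """
    label, children = tree
    spans = Counter()
    pos = start
    for child in children:
        if isinstance(child, str):
            pos += 1                       # a word advances the offset
        else:
            pos, child_spans = labeled_spans(child, pos)
            spans += child_spans
    spans[(label, start, pos)] += 1
    return pos, spans


def bracket_agreement(tree_a, tree_b):
    """F1-style agreement between two annotations of the same sentence."""
    _, a = labeled_spans(tree_a)
    _, b = labeled_spans(tree_b)
    matched = sum((a & b).values())        # spans both annotators produced
    total = sum(a.values()) + sum(b.values())
    return 2 * matched / total if total else 1.0


def words_per_hour(words_corrected, minutes_spent):
    """Annotation throughput for one work session."""
    return words_corrected * 60 / minutes_spent


# Hypothetical trees: the annotators disagree only on the object label.
annotator_1 = ("IP-MAT", [("NP-SBJ", [("NPR-N", ["Jón"])]),
                          ("VBDI", ["las"]),
                          ("NP-OB1", [("N-A", ["bókina"])])])
annotator_2 = ("IP-MAT", [("NP-SBJ", [("NPR-N", ["Jón"])]),
                          ("VBDI", ["las"]),
                          ("NP-OB2", [("N-A", ["bókina"])])])

print(round(bracket_agreement(annotator_1, annotator_2), 3))
print(words_per_hour(1500, 90))
```

In this example five of the six nodes in each tree match, so the agreement is 10/12; averaging such scores over a doubly annotated sample, alongside throughput per work setting, would give one comparable figure for each criterion.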
