Wordnet

Semantic Network with Semantic Mining

WP Leader: Matthew J. Whelpton; Consultant Lexicographer: Kristín Bjarnadóttir; Doctoral Researcher: Anna Björk Nikulásdóttir; Researcher: NN.

Matthew Whelpton is the lead investigator in this work package and will contribute expertise in semantics and syntax. Anna Nikulásdóttir will carry out most of the work and use it as her Ph.D. project. Kristín Bjarnadóttir will provide expertise in Icelandic morphology and lexicography and in the analysis of textual data. The researcher(s) will assist in collecting and analysing the data.

Extraction of semantic information from linguistic input (semantic mining) has become one of the central challenges of contemporary LT and is essential for the future development of the field, as witnessed, for example, by the resources being invested by the large internet companies in the “semantic web”. Considerable efforts have already been made to build semantic networks manually, both lexical (WordNet, http://wordnet.princeton.edu) and ontological (Cyc, http://www.cyc.com). However, manual construction of such resources is not a realistic aim for Icelandic LT given manpower and financial constraints. In the present project, we therefore intend to develop methods to (semi-)automatically build a database of semantic relations from existing lexical and textual resources at the Árni Magnússon Institute for Icelandic Studies; the resulting network of semantic relations will allow for future mapping to existing WordNet networks (cf. http://www.globalwordnet.org/).

The first stage of the project will address the central aim of extracting semantic relations from unstructured texts, using syntactic and lexicosyntactic patterns (see e.g. Hearst 1998). The project's doctoral researcher, Anna Nikulásdóttir, has already shown in her M.A. thesis (2007a) that, using syntactic and lexicosyntactic patterns, semantic relations of noun lemmata from a monolingual Icelandic dictionary can be extracted with an overall accuracy of 94.8%. These methods will be further developed in her doctoral work for the project. The project will extend these methods to text corpora (i.e. regular texts rather than dictionary entries) with a view to extracting encyclopaedic (real-world) knowledge. The lexical semantic method will be extended to include prepositions and their associated noun phrases; patterns of frequent co-occurrence will also be exploited.
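As a rough illustration of what a lexicosyntactic pattern looks like in practice, the following minimal sketch extracts hyponym relations with a single pattern of the Hearst type. It is not the project's method: the English pattern "X such as A, B and C", the function name extract_hyponyms and the relation label hyponym_of are all assumptions made for the example, and the sketch works on raw single-word tokens rather than on lemmatised, parsed noun phrases.

```python
import re

# Assumed separators between list items: ", ", " and ", " or ", ", and ", ", or ".
SEP = r"(?:,? (?:and|or) |, )"

# Hypothetical English pattern "HYPERNYM such as HYPONYM(, HYPONYM)*".
# The project's own Icelandic patterns are not given in the text.
SUCH_AS = re.compile(rf"(\w+) such as (\w+(?:{SEP}\w+)*)")


def extract_hyponyms(sentence):
    """Return (hyponym, 'hyponym_of', hypernym) triples found in one sentence."""
    relations = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the matched list of items on the same separators.
        for hyponym in re.split(SEP, match.group(2)):
            relations.append((hyponym, "hyponym_of", hypernym))
    return relations


print(extract_hyponyms("Many birds such as ravens, puffins and eagles nest here."))
```

A single surface pattern like this over-generates on real text (for instance when "and" introduces a new clause), which is one reason the proposal applies patterns to tagged and parsed input and verifies the results manually.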

Such work requires vast amounts of text, and just such resources are being developed at the Árni Magnússon Institute for Icelandic Studies (Helgadóttir 2004a). When these texts have been tagged and parsed for morphological and syntactic information (using the IceNLP toolkit described in Loftsson and Rögnvaldsson 2007b), noun phrases and prepositional phrases can be extracted and their syntactic distributions analysed. The most frequent distributional patterns can then be analysed manually and evaluated for systematic semantic relations. The results of the dictionary-based pattern extraction can then be compared to the text-based pattern extraction. This method can also be extended to other relation-building words.
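The step of finding the most frequent distributional patterns could be sketched as follows. The input format is an assumption made for the example: each sentence is a list of chunks, with noun phrases as ("NP", head) and prepositional phrases as ("PP", preposition, head). This is not the actual output format of IceNLP, and the toy data is English for readability.

```python
from collections import Counter


def count_np_pp_patterns(sentences):
    """Count noun phrases immediately followed by a prepositional phrase."""
    prepositions = Counter()
    triples = Counter()
    for chunks in sentences:
        # Look at each pair of adjacent chunks in the sentence.
        for left, right in zip(chunks, chunks[1:]):
            if left[0] == "NP" and right[0] == "PP":
                _, noun = left
                _, prep, complement = right
                prepositions[prep] += 1
                triples[(noun, prep, complement)] += 1
    return prepositions, triples


# Hypothetical, hand-made chunked sentences.
toy = [
    [("NP", "book"), ("PP", "about", "bird")],
    [("NP", "house"), ("PP", "in", "town"), ("VP", "stand")],
    [("NP", "film"), ("PP", "about", "war")],
]
preps, triples = count_np_pp_patterns(toy)
print(preps.most_common(1))  # [('about', 2)]
```

The most frequent prepositions and triples would then be the candidates handed over for manual analysis.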

The second stage involves testing of the results from the first stage. A reserved portion of the parsed corpus will be used as a test corpus. The syntactic and lexicosyntactic patterns identified as significant in the first stage will be extracted using the automated methods developed in the first stage. The resulting semantic relations will then be verified manually to test accuracy. The patterns that give reliable results according to this analysis will form the core of a tool for extracting semantic relations from parsed texts. The user interface for this Semantic Extraction Tool will allow for manual corrections and extensions of the automatically extracted relations. A method will be developed to partially connect the results internally. The central pattern-based methodology will also be tested against established statistical methods for automatic thesaurus construction (cf. Grossman and Frieder 2004). The aim is to exploit as broad a range of methodologies as possible to refine the success of the automated tool.
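One simple way to turn the manual verification into a selection of reliable patterns is to compute a precision score per pattern on the test corpus and keep the patterns above a cut-off. The sketch below assumes that the manual judgements are available as (pattern name, correct or not) pairs; the function names, the pattern names and the threshold of 0.9 are illustrative assumptions, not values from the project.

```python
from collections import defaultdict


def pattern_precision(judgements):
    """Map each pattern to the share of its extracted relations judged correct."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pattern, is_correct in judgements:
        total[pattern] += 1
        correct[pattern] += bool(is_correct)
    return {pattern: correct[pattern] / total[pattern] for pattern in total}


def reliable_patterns(judgements, threshold=0.9):
    """Return the patterns whose precision reaches the (assumed) threshold."""
    scores = pattern_precision(judgements)
    return sorted(p for p, score in scores.items() if score >= threshold)


# Hypothetical manual judgements on relations extracted from the test corpus.
checked = [("such_as", True)] * 19 + [("such_as", False)]
checked += [("np_of_np", True), ("np_of_np", False)]
print(pattern_precision(checked))
print(reliable_patterns(checked))
```

Precision alone says nothing about how many relations a pattern misses, so a fuller evaluation would also need some estimate of coverage.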
