Odysseus 2008

For ODCSSS 2008, our theme was The Global Family; The Global Workplace - "Technologies for Social Connectedness". In 2008, 16 students from around the world worked under this theme.

Projects 0706-dcu and 0806-dcu: parsing Wikipedia into predicate-argument structures (2 projects)

Wikipedia (http://www.Wikipedia.org/) is an on-line, publicly available, non-commercial encyclopedia resource where individual entries are generated by the collaborative effort of users. Currently, the English version of Wikipedia contains more than 850,000 articles on various topics. It is an extremely rich source of "world-knowledge" information relevant to many NLP applications, including question answering, anaphora resolution, disambiguation and machine translation. In order to automatically mine this information into, e.g., a concept hierarchy, two complementary types of information encoded in Wikipedia are required. The first type of information is the link structure (topological information) relating head words in Wikipedia, and the second is the propositional content of the individual articles represented in terms of predicate-argument structures (who did what to whom, where, when, ...). For example, the predicate-argument structure of "Pakistani forces surrendered in Bangladesh 1971" becomes:

surrender(e) & subj(e,"Pakistani forces") & loc(e,Bangladesh) & tmp(e,past) & tmp(e,1971)

meaning that there is a surrendering event e, that the subj(ect) of e is "Pakistani forces", that e is geographically loc(ated) in Bangladesh, and that e is temporally located (tmp) in the past, in 1971.
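The representation above can be sketched as a small data structure. This is only an illustration, not the output format of the LFG resources: the class name `Event` and its methods `add` and `render` are hypothetical, chosen here to show how one event variable collects a predicate and a list of role assertions.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One event variable with its predicate and role assertions (hypothetical class)."""
    predicate: str
    var: str = "e"
    roles: list = field(default_factory=list)  # (role, value, quoted) triples

    def add(self, role, value, quoted=False):
        """Attach a role assertion such as subj, loc or tmp to this event."""
        self.roles.append((role, value, quoted))
        return self  # returning self allows chained calls

    def render(self):
        """Render the event as a conjunction of predicates joined by ' & '."""
        parts = [f"{self.predicate}({self.var})"]
        for role, value, quoted in self.roles:
            shown = f'"{value}"' if quoted else str(value)
            parts.append(f"{role}({self.var},{shown})")
        return " & ".join(parts)


# Build the structure for "Pakistani forces surrendered in Bangladesh 1971".
event = (
    Event("surrender")
    .add("subj", "Pakistani forces", quoted=True)
    .add("loc", "Bangladesh")
    .add("tmp", "past")
    .add("tmp", 1971)
)
print(event.render())
```

Keeping the roles as a list rather than a dictionary allows the same role (here tmp) to appear more than once for a single event.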

In the proposed project we will use the treebank-based, automatically induced, wide-coverage, robust, probabilistic Lexical-Functional Grammar (LFG) resources of Cahill et al. (2005) [6] to parse a 50,000,000 word section of the English Wikipedia into predicate-argument structures. In order to do this, the resources of Cahill et al. (2005) need to be integrated with named entity (NE) recognisers and multi-word expression (MWE) recognisers. In addition, a processing infrastructure needs to be programmed that automatically downloads and preprocesses textual resources from Wikipedia, parses the downloads and stores the results.
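The download, preprocess, parse and store steps could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's design: `run_pipeline`, `preprocess`, and the `fetch` and `parse` callables are hypothetical names, the sentence splitter is a naive placeholder, SQLite is used only as a stand-in store, and the parser is stubbed out rather than calling the LFG resources.

```python
import re
import sqlite3
from typing import Callable, Iterable, List


def preprocess(raw_text: str) -> List[str]:
    """Collapse whitespace and split naively into sentences (placeholder logic)."""
    text = re.sub(r"\s+", " ", raw_text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def run_pipeline(
    titles: Iterable[str],
    fetch: Callable[[str], str],   # assumed: returns the plain text of one article
    parse: Callable[[str], str],   # assumed: returns a predicate-argument string
    db: sqlite3.Connection,
) -> int:
    """Fetch each article, parse every sentence, store the results; return row count."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS parses (title TEXT, sentence TEXT, structure TEXT)"
    )
    stored = 0
    for title in titles:
        for sentence in preprocess(fetch(title)):
            db.execute(
                "INSERT INTO parses VALUES (?, ?, ?)",
                (title, sentence, parse(sentence)),
            )
            stored += 1
    db.commit()
    return stored


# Usage with stand-ins: an in-memory "download" source and a dummy parser.
fake_articles = {"Example": "First sentence.  Second\nsentence."}
db = sqlite3.connect(":memory:")
count = run_pipeline(
    fake_articles,
    fetch=fake_articles.__getitem__,
    parse=lambda s: f"parsed({len(s.split())} tokens)",
    db=db,
)
rows = db.execute("SELECT sentence, structure FROM parses").fetchall()
```

Passing the downloader and the parser in as functions keeps the infrastructure independent of the parsing software, so the two student projects described here could be developed and tested separately.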

The proposal is to have two students working on this area. The first student will integrate the wide-coverage parsing software of Cahill et al. (2005) with named entity and multi-word expression recognisers. The second student will design and implement an interface environment to automatically access, download, parse and store parse results for Wikipedia articles, and will carry out the parsing jointly with the first student. The projects collaborate closely with Dr. Tony Veale's (UCD) project on the Wikipedia-based development of resources supporting the computational processing of metaphor, analogy and metonymy (the cross-referential structure, through which related concepts are explicitly connected).

Relevance of Project to the Host Laboratories:

At the National Centre for Language Technology (NCLT) at DCU we have pioneered the treebank-based automatic induction of wide-coverage, robust, probabilistic LFG resources (Cahill et al., 2005). This technology is currently outperforming the best hand-crafted resources. To date, our resources have been used to parse a 90,000,000 word subsection of the British National Corpus (BNC) into predicate-argument structures for the automatic extraction of lexical (subcategorisation) information.

Supervisors:

Prof. Josef van Genabith (NCLT, DCU)

Keywords:

Wikipedia, parsing, predicate-argument structures

[6] (Cahill et al. 2005) Aoife Cahill, Michael Burke, Martin Forst, Ruth O'Donovan, Christian Rohrer, Josef van Genabith, Andy Way. 2005. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Journal of Research on Language and Computation, Volume 3, Number 2, Springer, pp. 247-279.