3LB: Building a syntactic-semantic-trees-based database

FIT-15050-2002-244
Financed by the Science and Technology Ministry (PROFIT)

OBJECTIVE
PROJECT OUTLINE
PARTICIPANT RESEARCH GROUPS
PARTICIPANT ENTITIES

Access to Project's Information Server

OBJECTIVE

The main objective of this project is to build three treebanks (syntactically annotated corpora) for Spanish, Catalan and Basque. In addition to the syntactic annotation, a semantic annotation will be carried out using the synsets of the wordnets (http://www.cogsci.princeton.edu/~wn/w3wn.html) built for each language, together with an annotation of anaphoric and elliptic elements and of co-reference. The corpus size for Spanish and Catalan will be 100,000 words, and 50,000 words for Basque, owing to its greater annotation complexity and the smaller coverage of its wordnet (35,000 entries, compared with 100,000 for Spanish and 65,000 for Catalan).
The CLiC-TALP corpus for Spanish [25] currently consists of 100,000 words manually annotated at the morpho-syntactic level. The rest of the corpus, up to 5.5 million words, is tagged automatically, with an estimated error rate of 3%.
The Basque corpus in this project consists of 40,000 words morpho-syntactically tagged by hand. The goal during this project is to tag it syntactically and semantically according to the proposal and to enlarge it to 50,000 words with morphological, syntactic and semantic annotation.
Although building a treebank is an expensive task, we believe it is indispensable for the development of real applications in the field of Natural Language Processing (NLP) and for the development of the Information Society. For these applications it is essential to derive computational grammars [14][15] from the corpus, which is a first step toward later processes that require more elaboration. Among these processes is the detection of discourse entities which, together with the identification of anaphoric and co-referent elements, improves the quality of Machine Translation (MT), Information Extraction (IE), Information Retrieval (IR), summarization and Question Answering (QA) systems. Other linguistic tasks that can be tackled with a treebank are the learning of selectional constraints and of verb subcategorization patterns. The first of these two tasks is addressed in section 2.6 of this proposal as a validation of the usefulness of the resulting treebank.
At the purely linguistic level, the treebank is an essential database for the study of language because it provides analyzed and annotated examples of real usage. This linguistic study feeds directly back into the quality of the resources mentioned above, making them more robust.

PROJECT OUTLINE
This is the work plan proposed to carry out the project. It is organized into modules, which are divided into activities. Below, the state of the art in Spain and the lines of work pursued by the research groups are described.

MODULE 1: COORDINATION OF THE PROJECT
This module consists of the coordination of the project itself. An initial meeting of the project participants will establish the foundations and working protocols for the following modules.
MODULE 2: INTEGRATION OF TOOLS AND RESOURCES FOR THE TAGGING ELABORATION

Activity 2.1: Building a tree editor.
On the one hand, the annotators will have a user-friendly tool that makes the tagging task easier; on the other hand, the editor will incorporate a learning system that adds new knowledge while the annotation is being carried out [10]. The tool will be used to mark the constituents, the roles for Spanish and Catalan, and the dependencies for Basque. In all cases, the tool will assist in detecting anaphoric reference and in assigning EuroWordNet senses (http://www.hum.uva.nl/~ewn/). Adding knowledge in this way progressively improves the efficiency of the annotation, as well as the consistency and speed of the process. The tool will be flexible enough to allow several annotation levels to be handled independently, progressively or simultaneously.
Activity 2.2: XML converter
XML markup facilitates the portability of data in electronic form. This step is now indispensable, since it imposes a structure on the data and XML is a very stable, standard format. In addition, software for querying the corpus over the web will be prepared. The tagging levels of the corpus will be morphological, syntactic and semantic, as proposed in this project.
In this activity, tools will be developed for loading documents in XML format and for converting the result of the manual annotation to XML.
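
As an illustration only, the conversion of an annotated tree to XML could be sketched as follows. The project does not specify its XML schema here, so the element and attribute names (`node`, `cat`, `word`) and the tuple representation of the tree are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def tree_to_xml(node):
    """Convert a (label, child, ...) tuple into nested XML elements.

    Assumed layout: constituents become <node cat="...">, and string
    leaves become <word> elements holding the word form.
    """
    label, *children = node
    elem = ET.Element("node", cat=label)
    for child in children:
        if isinstance(child, str):
            # Leaf: a word form attached to its preterminal node.
            word = ET.SubElement(elem, "word")
            word.text = child
        else:
            elem.append(tree_to_xml(child))
    return elem


# A toy constituent tree for "el gato duerme" (hypothetical labels).
sentence = ("S", ("NP", ("D", "el"), ("N", "gato")), ("VP", ("V", "duerme")))
xml_text = ET.tostring(tree_to_xml(sentence), encoding="unicode")
print(xml_text)
```

The resulting string can be read back with `ET.fromstring`, which is the kind of round trip that loading and exporting tools would need to support.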

Activity 2.3: Development of selection tools
Tools for exploiting the corpus will be developed. With these tools, users will be able to select information at each of the annotation levels and to choose among different output formats according to the needs of the application. The system will support both interactive manual inspection of the corpus and large-scale automatic exploitation of its content.
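
A minimal sketch of such a selection, assuming the corpus is stored as XML with hypothetical `sentence`, `node` and `word` elements and `cat` and `sense` attributes (none of these names come from the project itself):

```python
import xml.etree.ElementTree as ET

# A one-sentence toy corpus; the schema and sense labels are invented.
CORPUS = """<corpus>
  <sentence id="s1">
    <node cat="S">
      <node cat="NP">
        <node cat="D"><word>el</word></node>
        <node cat="N"><word sense="gato.1">gato</word></node>
      </node>
      <node cat="VP">
        <node cat="V"><word sense="dormir.1">duerme</word></node>
      </node>
    </node>
  </sentence>
</corpus>"""

root = ET.fromstring(CORPUS)

# Syntactic level: the text covered by every NP constituent.
noun_phrases = [
    " ".join(w.text for w in node.iter("word"))
    for node in root.iter("node")
    if node.get("cat") == "NP"
]

# Semantic level: every word that carries a sense annotation.
senses = [(w.text, w.get("sense")) for w in root.iter("word") if w.get("sense")]

print(noun_phrases)
print(senses)
```

The same two queries could be run over the whole corpus in batch mode or behind an interactive interface; only the selection criteria change.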
Activity 2.4: Building an evaluation system
Software will be developed to compare any annotation of the corpus against the reference corpus at each of the annotation levels. For this purpose, the metrics commonly established in the literature will be used.
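
The text does not name the metrics. As one plausible example for the syntactic level, the sketch below computes labeled bracket precision, recall and F1, in the style of PARSEVAL-type evaluations. The span representation `(label, start, end)` is an assumption made for this example.

```python
from collections import Counter


def bracket_scores(gold, predicted):
    """Labeled precision, recall and F1 over (label, start, end) spans."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a span counts once per matching occurrence.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Reference annotation versus an annotation with one wrong NP boundary.
gold = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
pred = [("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)]
p, r, f = bracket_scores(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

The same function could be applied to other annotation levels by changing what a "span" encodes, for instance `(word_index, synset)` pairs for sense annotation.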
Activity 2.5: Integration of an anaphoric tagging system for Spanish and its adaptation to Basque and Catalan.
Building on the experience in anaphora resolution of the research groups participating in the project, a system will be integrated that automatically identifies anaphoric expressions, their possible candidates and the antecedent proposed by the system. This will ease the task of the annotator, who accepts or rejects the system's proposal.
Activity 2.6: Construction of a system to obtain selectional restrictions for the verbs.
A system will be built to obtain the selectional restrictions of verbs, that is, the set of semantic restrictions that each verb imposes on its arguments. It will use the word sense disambiguation (WSD) systems developed for Basque, Catalan and Spanish by the participating research groups. The selectional restrictions will be obtained semi-automatically: by locating all <verb, argument type, argument> patterns in the treebank, semantically disambiguating the arguments, taking the WN synsets associated with them, and generalizing until one or several common elements are obtained.
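
The generalization step can be illustrated with a small sketch. It assumes the arguments have already been disambiguated and uses a toy hypernym table in place of WordNet; the sense names, the table and the function names are all hypothetical, and the project's actual procedure may differ.

```python
# Toy hypernym hierarchy standing in for WordNet (invented for this sketch).
HYPERNYM = {
    "apple": "fruit",
    "bread": "food",
    "fruit": "food",
    "food": "entity",
    "water": "liquid",
    "liquid": "entity",
}


def ancestors(sense):
    """Return the sense followed by its hypernyms, most specific first."""
    chain = [sense]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def generalize(senses):
    """Most specific class shared by all the given senses, or None."""
    common = set(ancestors(senses[0]))
    for sense in senses[1:]:
        common &= set(ancestors(sense))
    for candidate in ancestors(senses[0]):
        if candidate in common:
            return candidate
    return None


# <verb, argument type, argument> patterns as they might come from a treebank.
patterns = [
    ("eat", "object", "apple"),
    ("eat", "object", "bread"),
    ("drink", "object", "water"),
]

# Group the argument senses by (verb, argument type), then generalize.
by_slot = {}
for verb, role, sense in patterns:
    by_slot.setdefault((verb, role), []).append(sense)
restrictions = {slot: generalize(senses) for slot, senses in by_slot.items()}
print(restrictions)
```

With a single observed argument the sketch returns that sense unchanged, which shows why the process is described as semi-automatic: a human would still decide how far to generalize when evidence is sparse.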
MODULE 3: ANNOTATION AND SUPERVISION OF THE CORPUS
Activity 3.1: Annotation proposal
A syntactic, semantic and anaphoric annotation scheme with a solid linguistic and methodological basis will be defined and designed. The depth of the annotation will be established, and decisions will be made on how to treat problematic cases such as discontinuous constituents, coordination, comparative elements, ellipsis, etc.

Activity 3.2: Annotation of syntactic constituents.
The syntactic constituents of each sentence in the corpus will be annotated manually with the help of the tools of module 2.
Activity 3.3: Annotation of syntactic functions/dependencies
In the same way, and with the help of the tools of module 2, the syntactic functions of the sentences will be annotated.
Activity 3.4: Elaboration of syntactic rules to improve the existing analyzers
Syntactic rules will be written that allow constituents, syntactic functions and syntactic dependencies to be determined more precisely.
For Basque, the constraint grammar formalism will be used [26]; for Catalan and Spanish, a set of rules will be defined to improve the current analyses.
Activity 3.5: Annotation of senses
Words belonging to the categories with the most semantic content, such as nouns and verbs, will be annotated with their corresponding sense (synset) in EuroWordNet.
Activity 3.6: Elliptic elements
As in the previous activity, and with the help of the tools of module 2, the elliptic elements will be manually supervised.
Activity 3.7: Anaphoric elements
As in the previous activity, and with the help of the tools of module 2, the anaphoric elements and their corresponding antecedents will be manually supervised.
Activity 3.8: Co-referent elements
As in the previous activity, and with the help of the tools of module 2, the co-reference chains will be manually supervised.
MODULE 4: EVALUATION AND DISSEMINATION OF THE RESULTS
This module will consist of the quantitative and qualitative evaluation of the results obtained, as well as their periodic dissemination.

PARTICIPANT RESEARCH GROUPS

Natural Language Processing and Information Systems Group of the University of Alicante

Natural Language Processing Group of the Technical University of Catalonia

Natural Language Processing Group of the Technical University of Valencia

Language and Computation Center (CLiC) of the University of Barcelona

Natural Language Processing Group of the University of the Basque Country

PARTICIPANT ENTITIES

Updated Tuesday, 19 November 2002