Construction of a Cross-Language Information Retrieval system for the Web

This project (code Fit-150500-2002-416) is funded by the Ministry of Science and Technology (Project PROFIT) and runs from July 2002 to December 2003. The participating organizations are:
University of Alicante:
University of Jaén:
University of Sevilla:


The primary goal of the project is to build an information retrieval (IR) system that integrates a series of natural language processing tools. The system aims to improve on the traditional IR systems that work on the Web from three points of view.
The main scientific and technological objective of the project lies in the research field of Cross-Language Information Retrieval. This field is an extension of traditional Information Retrieval, which works on a single language: the question and the documents searched for information are in the same language. In the multilingual extension, the question and the documents do not need to be in the same language. The objective of this project is therefore to search a document collection that can be in different languages, independently of the language in which the question is asked. Although we plan to develop technology that makes it easy to incorporate new languages in the future, we will initially focus on the languages of the European Economic Community, restricting the application of Natural Language Processing (NLP) techniques to English and Spanish.
This research field also extends to Question Answering applications, in which the result is not the complete document but the text snippet that contains the answer to the user's question. One of the objectives of the project falls within this field, although it will be applied only to English and Spanish, since this type of application requires NLP techniques that increase the degree of understanding of the texts being searched.
In addition, another scientific objective of this project lies within the research field of Computational Linguistics, specifically NLP: the aim is to add new sources of information to the search process, which will improve the precision and quality of the results returned. The information we expect to incorporate includes lexical and syntactic analysis, the resolution of linguistic problems, and word sense disambiguation. The traditional IR systems currently available do not consider this kind of information; they are usually based solely on the occurrences of words in documents. For example, these systems discard pronouns as non-content words, so the information these pronouns refer to is also discarded. By resolving this type of anaphora beforehand, we can improve the precision of searches because the referenced information is no longer discarded. The set of documents the system works on will not be restricted, although a later specialization to restricted domains is planned, where the precision of the system can reasonably be expected to improve. The input data set to be searched will consist of heterogeneous, unstructured documents, that is, documents in natural language, in addition to the multilingual capability described previously.
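
The effect of pronoun removal described above can be illustrated with a minimal sketch. It is not the project's implementation: the stopword list, the function names `index_terms` and `resolve_pronouns`, and the hand-written antecedent table (which stands in for the output of a real anaphora resolver) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "is", "was", "in", "and", "of", "he", "she", "it", "they"}


def index_terms(text):
    """Count content words after lowercasing and stopword removal."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)


def resolve_pronouns(text, antecedents):
    """Replace pronoun tokens with their antecedents.

    `antecedents` maps a whitespace-token position to the referent string.
    Here it is written by hand; a real system would obtain it from an
    anaphora resolution component.
    """
    tokens = text.split()
    for position, referent in antecedents.items():
        tokens[position] = referent
    return " ".join(tokens)


text = "Cervantes wrote Don Quixote. He was born in Alcala."

# Word-occurrence indexing alone: "He" is a stopword, so the second
# sentence contributes nothing to the count for "cervantes".
plain_index = index_terms(text)

# Token 4 is "He"; the antecedent is supplied by hand for this example.
resolved_text = resolve_pronouns(text, {4: "Cervantes"})
resolved_index = index_terms(resolved_text)

print(plain_index["cervantes"], resolved_index["cervantes"])
```

With the pronoun resolved before indexing, the second sentence also counts as evidence about "Cervantes", which is the kind of information a purely occurrence-based index loses.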

Tools to be used in the project

Publications derived from the project

For any questions or suggestions, email Antonio Ferrández Rodríguez.

Last update: January 17th, 2002