Department of Software and Computing Systems

Lecture

Title: Building machine translation systems for language pairs with scarce resources
Doctoral thesis defence
Presenter: Víctor M. Sánchez Cartagena
Venue: Sala Claude Shannon (EPS IV)
Date & time: 10:30 02/07/2015
Estimated duration: 2:30 hours
More information: http://edua.ua.es/es/secretaria/tesis-doctoral/tesis-en-proceso-de-tramitacion/victor-manuel-sanchez-cartagena.html
Contact person: Sánchez Martínez, Felipe (fsanchez[Perdone'm]dlsi.ua.es)
Abstract:
Machine translation can be defined as the process carried out by a computer
in order to automatically translate a text in a natural language, the source
language (SL), into another language, the target language (TL). According to
the kind of knowledge used in their development, machine translation systems
may be said to be corpus-based or rule-based. Corpus-based approaches use
large collections of parallel texts as the source of knowledge, statistical
machine translation (SMT) being the leading corpus-based approach, while
rule-based MT (RBMT) systems use linguistic resources such as dictionaries
and structural transfer rules.

As SMT systems need relatively large corpora in order to be competitive (in
our experiments, a phrase-based SMT system needs up to a few million words
in each language in order to outperform the Apertium RBMT system), they
are unsuitable for under-resourced language pairs for which the required
amount of parallel corpora is not available. Thus, RBMT becomes the only
alternative. However, building RBMT systems usually implies a considerable
investment in the development of the linguistic resources, some of which can
only be developed by trained experts. In this dissertation, we introduce
three new methods that ease the building of both kinds of MT systems when
resources (both parallel corpora and RBMT linguistic data) are scarce,
together with empirical evidence of their successful application to building
MT systems for different language pairs.

Firstly, a new method that uses scarce parallel corpora (barely a few
hundred parallel sentences) and existing morphological dictionaries to
automatically infer a set of shallow-transfer rules to be integrated into an
RBMT system is described. This new method avoids the need for human experts
to handcraft these rules and overcomes many relevant limitations of previous
rule inference approaches. Namely, it is able to achieve a higher degree of
generalisation over the linguistic phenomena observed in the training corpus,
and it is able to select the proper subset of rules that ensures the most
appropriate segmentation of the input sentences to be translated. In addition,
this new rule inference approach is the first one in which conflicts between
rules are resolved by choosing the most appropriate ones according to a
global minimisation function rather than proceeding in a pairwise greedy
fashion. Experiments conducted using five different language pairs with
the free/open-source RBMT platform Apertium show that translation quality
significantly improves when compared to previous approaches and is close
to that obtained using hand-crafted rules. Moreover, the resulting number of
rules is considerably smaller, which eases human revision and maintenance. The
adoption of the rule inference approach presented in this dissertation will
hopefully contribute towards making the development of transfer rules for
new language pairs in MT systems like Apertium a much more cost-effective
and technically feasible process.

Secondly, we present a new hybridisation strategy aimed at integrating
shallow-transfer rules and dictionaries from RBMT into phrase-based SMT. The
new hybridisation strategy, which is specific to shallow-transfer RBMT,
addresses the main limitations of existing strategies for integrating RBMT
resources into SMT; namely, the presence of alignment errors in the phrase
pairs obtained from the RBMT system and the inability to find an adequate
balance between the weight of the phrase pairs extracted from the parallel
corpus and those obtained from the RBMT system. The experiments performed
confirm that the new approach delivers a higher translation quality than
existing ones, and that shallow-transfer rules are especially useful when
the parallel corpus available for training is small or when translating
out-of-domain texts that are well covered by the shallow-transfer RBMT system.
Indeed, a system built by following this hybridisation approach was one of the
winners of the pairwise manual evaluation of the WMT 2011 shared translation
task for the Spanish--English language pair. In addition, the translation
quality achieved by hybrid systems built with automatically inferred rules
is similar to that obtained with hand-crafted rules. The combination of the
hybridisation strategy and the rule inference algorithm presented in this
dissertation will contribute to alleviating the data sparseness problem suffered
by SMT systems, since the resulting hybrid system is able to generalise the
translation knowledge contained in the parallel corpus to sequences of lemmas
that have not been observed in the corpus.

Thirdly, as the two aforementioned approaches need morphological dictionaries
in order to be applied, and with the aim of easing the task of building
them, a novel approach that allows non-expert users to insert new entries in
monolingual morphological dictionaries is presented. The scenario considered
is that of non-expert users of an RBMT system who have to introduce into its
dictionaries the words found in an input text that are unknown to the system,
so that it can subsequently translate them correctly. Given an SL surface form
(i.e., a word as it is found in running texts, without any kind of analysis)
to be inserted, the proposed strategy iteratively asks the users (average
speakers of a language) polar questions to validate whether certain inflected
forms of the word to be inserted are correct. The new approach uses the
answers of the users and the existing inflection paradigms in the monolingual
dictionary in order to automatically insert the corresponding entry in the
dictionary. An inflection paradigm may be defined as a collection of suffixes
and their corresponding morphological information; they are commonly used in
RBMT systems to group regularities in the inflection of a set of words. In
addition, a monolingual corpus, a hidden Markov model and a binary decision
tree built with the ID3 algorithm are used to reduce the number of polar
questions that need to be asked for gathering all the necessary information for
the insertion of the entry. The experiments carried out show that non-expert
users are able to successfully answer the polar questions in most cases, and
that the ID3 algorithm increases the efficiency of the approach (compared
to a previously developed heuristic approach) and the robustness against
possible erroneous information extracted from the monolingual corpus. If the
user is bilingual and provides the translation of the inserted SL word, the
process is repeated to insert the corresponding entry in the TL monolingual
dictionary. In this case, the information about the SL entry already inserted
and the correlation between morphological features in both languages are used
to further increase the efficiency of the approach. Once the entries have been
inserted in both monolingual morphological dictionaries, the corresponding
entry in the bilingual dictionary can be inserted automatically.
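
The polar-question loop described above can be illustrated with a small
sketch. It is not the method evaluated in the dissertation: the paradigms are
hypothetical toy data, the names candidates, expand and insert_word are
invented for this example, and the ID3 decision tree, the monolingual corpus
and the hidden Markov model are replaced here by a much simpler heuristic that
picks the inflected form splitting the remaining candidates most evenly.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical toy paradigms (not taken from any real dictionary):
# paradigm name -> suffixes that can be appended to a stem.
PARADIGMS: Dict[str, List[str]] = {
    "house_n": ["", "s"],     # house, houses
    "wish_n": ["", "es"],     # wish, wishes
    "city_n": ["y", "ies"],   # city, cities
    "sheep_n": [""],          # sheep (invariable)
}

Candidate = Tuple[str, str]  # (stem, paradigm name)


def candidates(surface: str) -> List[Candidate]:
    """Return every (stem, paradigm) pair able to generate the surface form."""
    found: List[Candidate] = []
    for name, suffixes in PARADIGMS.items():
        for suffix in suffixes:
            if surface.endswith(suffix):
                stem = surface[: len(surface) - len(suffix)]
                found.append((stem, name))
    return found


def expand(candidate: Candidate) -> FrozenSet[str]:
    """Generate all inflected forms of a (stem, paradigm) candidate."""
    stem, name = candidate
    return frozenset(stem + suffix for suffix in PARADIGMS[name])


def insert_word(
    surface: str, is_valid_form: Callable[[str], bool]
) -> Tuple[List[Candidate], List[str]]:
    """Ask yes/no questions until at most one candidate entry remains.

    is_valid_form stands in for the user: it answers whether an inflected
    form is a correct form of the word being inserted.
    """
    remaining = candidates(surface)
    asked: List[str] = []
    while len(remaining) > 1:
        # Count how many remaining candidates generate each inflected form.
        counts: Dict[str, int] = {}
        for cand in remaining:
            for form in expand(cand):
                counts[form] = counts.get(form, 0) + 1
        # Forms generated by every candidate cannot discriminate between them.
        informative = [f for f, n in counts.items() if n < len(remaining)]
        if not informative:
            break  # the remaining candidates generate identical forms
        # Simplified stand-in for ID3: prefer the most balanced split,
        # breaking ties alphabetically so the behaviour is deterministic.
        form = min(
            informative,
            key=lambda f: (abs(2 * counts[f] - len(remaining)), f),
        )
        asked.append(form)
        answer = is_valid_form(form)
        # Keep only the candidates that agree with the user's answer.
        remaining = [c for c in remaining if (form in expand(c)) == answer]
    return remaining, asked


if __name__ == "__main__":
    # Simulated user who knows the correct forms of "city".
    known_forms = {"city", "cities"}
    entry, questions = insert_word("city", lambda form: form in known_forms)
    print(entry, questions)
```

In the usage at the bottom, the lambda plays the role of the non-expert user.
With only the paradigms as evidence, the sketch may ask about implausible
forms; the abstract states that a monolingual corpus, a hidden Markov model
and an ID3 decision tree are used to reduce the number of questions.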
