Title: | Building machine translation systems for language pairs with scarce resources |
---|---|---|
Type: | doctoral thesis defence | |
By: | Víctor M. Sánchez Cartagena | |
Venue: | Sala Claude Shannon (EPS IV) | |
Date/time: | 10:30, 2 July 2015 | |
Approximate duration: | 2:30 hours | |
More information: | http://edua.ua.es/es/secretaria/tesis-doctoral/tesis-en-proceso-de-tramitacion/victor-manuel-sanchez-cartagena.html | |
Contact person: | Sánchez Martínez, Felipe (fsanchezdlsi.ua.es) | |
Abstract: | Machine translation (MT) can be defined as the process carried out by a computer in order to automatically translate a text in a natural language, the source language (SL), into another language, the target language (TL). According to the kind of knowledge used in their development, MT systems may be said to be corpus-based or rule-based. Corpus-based approaches use large collections of parallel texts as the source of knowledge, with statistical machine translation (SMT) being the leading corpus-based approach, while rule-based MT (RBMT) systems use linguistic resources such as dictionaries and structural transfer rules. As SMT systems need relatively large corpora in order to be competitive (in our experiments, a phrase-based SMT system needs up to a few million words in each language in order to outperform the Apertium RBMT system), they are unsuitable for under-resourced language pairs for which the required amount of parallel corpora is not available. Thus, RBMT becomes the only alternative. However, building RBMT systems usually implies a considerable investment in the development of the linguistic resources, some of which can only be developed by trained experts. In this dissertation, we introduce three new methods that ease the building of both kinds of MT systems when resources (both parallel corpora and RBMT linguistic data) are scarce, together with empirical proofs of their successful application for building MT systems for different language pairs. Firstly, we describe a new method that uses scarce parallel corpora (barely a few hundred parallel sentences) and existing morphological dictionaries to automatically infer a set of shallow-transfer rules to be integrated into an RBMT system. This new method avoids the need for human experts to handcraft these rules and overcomes many relevant limitations of previous rule inference approaches. 
Namely, it is able to achieve a higher degree of generalisation over the linguistic phenomena observed in the training corpus, and it is able to select the proper subset of rules that ensures the most appropriate segmentation of the input sentences to be translated. In addition, this new rule inference approach is the first one in which conflicts between rules are resolved by choosing the most appropriate ones according to a global minimisation function rather than proceeding in a pairwise greedy fashion. Experiments conducted using five different language pairs with the free/open-source RBMT platform Apertium show that translation quality improves significantly over previous approaches and is close to that obtained using hand-crafted rules. Moreover, the resulting number of rules is considerably smaller, which eases human revision and maintenance. The adoption of the rule inference approach presented in this dissertation will hopefully contribute towards making the development of transfer rules for new language pairs in MT systems like Apertium a much more cost-effective and technically feasible process. Secondly, we present a new hybridisation strategy aimed at integrating shallow-transfer rules and dictionaries from RBMT into phrase-based SMT. The new hybridisation strategy, which is specific to shallow-transfer RBMT, addresses the main limitations of existing strategies for integrating RBMT resources into SMT; namely, the presence of alignment errors in the phrase pairs obtained from the RBMT system and the inability to find an adequate balance between the weight of the phrase pairs extracted from the parallel corpus and those obtained from the RBMT system. 
The experiments performed confirm that the new approach delivers a higher translation quality than existing ones, and that shallow-transfer rules are especially useful when the parallel corpus available for training is small or when translating out-of-domain texts that are well covered by the shallow-transfer RBMT system. Indeed, a system built by following this hybridisation approach was one of the winners of the pairwise manual evaluation of the WMT 2011 shared translation task for the Spanish-English language pair. In addition, the translation quality achieved by hybrid systems built with automatically inferred rules is similar to that obtained with hand-crafted rules. The combination of the hybridisation strategy and the rule inference algorithm presented in this dissertation will contribute to alleviating the data sparseness problem suffered by SMT systems, since the resulting hybrid system is able to generalise the translation knowledge contained in the parallel corpus to sequences of lemmas that have not been observed in the corpus. Thirdly, as the two aforementioned approaches need morphological dictionaries in order to be applied, and with the aim of easing the task of building them, we present a novel approach that allows non-expert users to insert new entries in monolingual morphological dictionaries. The scenario considered is that of non-expert users of an RBMT system who have to introduce into its dictionaries the words found in an input text that are unknown to the system, so that it can subsequently translate them correctly. Given an SL surface form (i.e., a word as it is found in running texts, without any kind of analysis) to be inserted, the proposed strategy iteratively asks the users (average speakers of a language) polar questions to validate whether certain inflected forms of the word to be inserted are correct. 
The new approach uses the answers of the users and the existing inflection paradigms in the monolingual dictionary in order to automatically insert the corresponding entry in the dictionary. An inflection paradigm may be defined as a collection of suffixes and their corresponding morphological information; paradigms are commonly used in RBMT systems to group regularities in the inflection of a set of words. In addition, a monolingual corpus, a hidden Markov model and a binary decision tree built with the ID3 algorithm are used to reduce the number of polar questions that need to be asked to gather all the information necessary for the insertion of the entry. The experiments carried out show that non-expert users are able to successfully answer the polar questions in most cases, and that the ID3 algorithm increases the efficiency of the approach (compared to a heuristic approach previously developed) and its robustness against possible erroneous information extracted from the monolingual corpus. If the user is bilingual and provides the translation of the inserted SL word, the process is repeated to insert the corresponding entry in the TL monolingual dictionary. In this case, the information about the SL entry already inserted and the correlation between morphological features in both languages are used to further increase the efficiency of the approach. Once the entries have been inserted in both monolingual morphological dictionaries, the corresponding entry in the bilingual dictionary can be inserted automatically. |
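
The polar-question idea in the abstract can be illustrated with a minimal Python sketch. It is not the method of the thesis: the paradigms in `PARADIGMS`, the function names and the greedy "most even split" question selection are all assumptions made for illustration, and the greedy heuristic stands in for the ID3 decision tree, with no monolingual corpus or hidden Markov model involved.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical toy paradigms (paradigm name -> suffixes); not Apertium data.
PARADIGMS: Dict[str, Tuple[str, ...]] = {
    "noun_s": ("", "s"),
    "noun_es": ("", "es"),
    "verb_reg": ("", "s", "ed", "ing"),
}

Candidate = Tuple[str, str]  # (stem, paradigm name)


def candidates(surface: str, paradigms: Dict[str, Tuple[str, ...]]) -> List[Candidate]:
    """List every (stem, paradigm) pair that could generate the surface form."""
    found = []
    for name, suffixes in paradigms.items():
        for suffix in suffixes:
            if surface.endswith(suffix) and len(surface) > len(suffix):
                found.append((surface[: len(surface) - len(suffix)], name))
    return found


def expand(cand: Candidate, paradigms: Dict[str, Tuple[str, ...]]) -> FrozenSet[str]:
    """Generate all inflected forms predicted by a candidate."""
    stem, name = cand
    return frozenset(stem + suffix for suffix in paradigms[name])


def choose_entry(
    surface: str,
    paradigms: Dict[str, Tuple[str, ...]],
    is_valid: Callable[[str], bool],
) -> Tuple[List[Candidate], int]:
    """Ask yes/no questions until the remaining candidates predict the same forms."""
    cands = candidates(surface, paradigms)
    forms = {c: expand(c, paradigms) for c in cands}
    asked = 0
    while len({forms[c] for c in cands}) > 1:
        def yes_count(form: str) -> int:
            return sum(form in forms[c] for c in cands)

        all_forms = set().union(*(forms[c] for c in cands))
        # Only forms on which the candidates disagree carry information.
        informative = sorted(f for f in all_forms if 0 < yes_count(f) < len(cands))
        # Greedy choice: the form whose answer splits the candidates most evenly.
        question = min(informative, key=lambda f: abs(2 * yes_count(f) - len(cands)))
        asked += 1
        answer = is_valid(question)  # the user's answer to "is <question> a correct form?"
        cands = [c for c in cands if (question in forms[c]) == answer]
    return cands, asked


# Simulated user who knows the inflected forms of the verb "walk".
known_forms = {"walk", "walks", "walked", "walking"}
entry, n_questions = choose_entry("walks", PARADIGMS, lambda form: form in known_forms)
print(entry, n_questions)
```

Here the simulated user stands in for an average speaker answering the questions; in a real setting `is_valid` would prompt a person, and the question order would come from the decision tree described in the abstract.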