Resources

July 12, 2021

GoURMET parallel corpora for low-resource languages

Monolingual and parallel corpora for a number of low-resource languages (such as Swahili, Turkish, Amharic and Kyrgyz, among others) crawled as part of the Universitat d'Alacant contribution to the GoURMET project (Global Under-Resourced MEdia Translation), funded by the European Union (grant agreement id 825299). The corpora are available through the GoURMET webpage.
Download corpora

GoURMET translation models for low-resource language pairs

Neural machine translation models for translation between English and a number of low-resource languages (such as Swahili, Pashto and Macedonian, among others) developed as part of the Universitat d'Alacant contribution to the GoURMET project (Global Under-Resourced MEdia Translation; grant agreement id 825299). Dockerised translation models are available through the GoURMET webpage.
Download translation models

Morphological segmentation using Apertium resources

Free/open-source tool that uses Apertium resources to segment text morphologically. Useful as a pre-processing step before applying byte-pair encoding (BPE) when training neural machine translation systems. Funded by the EU through the GoURMET project (grant agreement id 825299).
Download

LinguaCrawl: Top-level domain crawler

Free/open-source tool implemented in Python 3 that crawls a number of top-level domains and downloads any text documents in the languages specified by the user. Funded by the EU through the GoURMET project (grant agreement id 825299).
Download

LASERtrain (language-agnostic sentence embeddings)

Free/open-source piece of software that reproduces the architecture described by Artetxe and Schwenk (2018, 2019) to train language-agnostic sentence embeddings. Funded by the EU through the GoURMET project (grant agreement id 825299).
Download

IMPACT-es diachronic corpus

Diachronic corpus of historical Spanish that compiles over one hundred books (containing approximately 8 million words), together with a complementary lexicon that links more than 10,000 lemmas with attestations of the different variants found in the documents. Released under an open Creative Commons BY-NC-SA license.
Download : Related paper

ruLearn: toolkit for the automatic inference of shallow-transfer rules for MT

Free/open-source toolkit for the automatic inference of rules for shallow-transfer MT from scarce parallel corpora and morphological dictionaries. ruLearn makes it possible to build machine translation systems for under-resourced language pairs because it avoids the need for human experts to handcraft transfer rules and requires only a few hundred parallel sentences. The inferred rules can be used for rule-based MT, or together with a hybridisation strategy for integrating linguistic resources into phrase-based statistical machine translation (see Rule2Phrase).
Download : Read paper

Rule2Phrase: toolkit for integrating shallow-transfer rules into phrase-based SMT

Free/open-source toolkit to enrich a phrase-based SMT system (Moses) with phrase pairs generated from the linguistic resources of a shallow-transfer rule-based MT system (Apertium). A system built with this toolkit was not outperformed by any other participant in the shared translation task of the Sixth Workshop on Statistical Machine Translation (WMT 11) for the Spanish–English language pair.
Download : Read paper

Gamblr-CAT: word-level quality estimation in TM-based CAT

Free/open-source software to obtain binary quality estimations at the level of words (also called word-keeping recommendations) for translation suggestions produced by a translation memory tool by using either statistical word alignments or external sources of bilingual information.
Download : Read paper

Gamblr-MT: word-level quality estimation in MT

Collection of free/open-source scripts to extract a set of features for word-level MT quality estimation using external sources of bilingual information.
Download : Read paper

DocTrans: document translation retrieval based on SMT techniques

Free/open-source piece of software implementing a method based on SMT techniques to retrieve documents that are plausible translations of a given source text. Given a source document as input, the method provides the terms to use in a query to retrieve its translation. In combination with a text search engine such as Apache Lucene, it can be used for aligning documents with their translations. It relies on the free/open-source SMT system Moses and was last tested with revision 2281.
Download : Read paper

Apertium-tagger-training-tools: target-language-driven POS tagger trainer

Free/open-source package for the unsupervised training of hidden-Markov-model-based POS taggers involved in MT. It uses information not only from the source language but also from the target language; to this end, the Apertium MT platform is used. Training produces a file containing the hidden-Markov-model parameters, which can be used directly within the Apertium MT platform.
Download : Read paper

Apertium-morph: using morphological information with Apache Lucene

Free/open-source package providing a set of tools and Java classes that allow the Apache Lucene text search engine to use morphological information to index and search. To that end, the linguistic resources developed for the Apertium MT platform are used to extract morphological information while indexing.
Download : Read paper