Moses Installation and Training Run-Through

Comments in this form by Mikel L. Forcada & Francis M. Tyers, to adapt the document to the tutorial at the CNGL (April 2009)

The purpose of this guide is to offer a step-by-step example of downloading, compiling, and running the Moses decoder and related support tools. I make no claims that all of the steps here will work perfectly on every machine you try them on, or that things will stay the same as the software changes. Please remember that Moses is research software under active development.


PART I - Download and Configure Tools and Data

Support Tools Background

Moses has a number of scripts designed to aid training, and they rely on GIZA++ and mkcls to function. More information on the origins of these tools is available at:

A Google Code project has been set up where GIZA++ and mkcls are now maintained:

Moses uses SRILM-style language models. SRILM is available from:

(Optional) The IRSTLM tools provide the ability to use quantized and disk memory-mapped language models. It's optional, but we'll be using it in this tutorial:

Support Tools Installation

Before we start building and using the Moses codebase, we have to download and compile all of these tools. See the list of versions to double-check that you are using the same code.

I'll be working under /home/guest in these examples. I assume you've set up an appropriately named directory on your own system. I'm installing these tools under a Fedora Core 6 (FC6) distribution.

Changes needed to run the same setup under Mac OS X 10.5 are noted inline. For the Mac I'm running under /Users/josh/demo.

Machine Translation Marathon changes are also noted inline. We probably won't have time to train a full model today.

mkdir tools
cd tools
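
For orientation, here is roughly what tools/ should contain by the end of Part I; these are the paths the later commands assume (GIZA++ and mkcls can live wherever you like, as long as the training script can find them):

ls /home/guest/tools
irstlm  moses  moses-scripts  mteval-v11b.pl  scripts  srilm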

Get The Latest Moses Version

Moses is available via Subversion from Sourceforge. See the list of versions to double-check that you are using the same code as this example. From the tools/ directory:

cd /home/guest/tools
mkdir moses
svn co https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk moses

This will copy all of the Moses source code to your local machine.

Compile Moses

Within the Moses folder structure are projects for Eclipse, Xcode, and Visual Studio -- though these are not well maintained and may not be up to date. I'll focus on the Linux command-line method, which is the preferred way to compile.

For OS X versions 10.4 and lower, you need to upgrade aclocal and automake to at least version 1.9 (1.6 is the default in 10.4) and set the variables ACLOCAL and AUTOMAKE in ./regenerate-makefiles.sh.

cd moses
./regenerate-makefiles.sh
./configure --with-irstlm=/home/guest/tools/irstlm
make

(The -j option to make is optional: make -j X, where X is the number of simultaneous tasks, speeds up the build on machines with multiple processors.)

This creates several binaries we will be using, most importantly the decoder itself at moses-cmd/src/moses.

Confirm Setup Success

We'll do this to test the Moses installation.

A sample model capable of translating one sentence is available on the Moses website. Download it and translate the sample input file.

cd /home/guest/
mkdir data
cd data
wget http://www.statmt.org/moses/download/sample-models.tgz
On the Mac: curl -O http://www.statmt.org/moses/download/sample-models.tgz
tar -xzvf sample-models.tgz
cd sample-models/phrase-model/
../../../tools/moses/moses-cmd/src/moses -f moses.ini < in > out

The input has "das ist ein kleines haus" listed twice, so the output file (out) should contain "this is a small house" twice.
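
A quick check that the decode produced what we expect:

cat out
this is a small house
this is a small house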

At this point, it might be wise for you to experiment with the command-line options of the Moses decoder. A tutorial using this example model is available at http://www.statmt.org/moses/?n=Moses.Tutorial.

Compile Moses Support Scripts

Moses uses a set of scripts to support training, tuning, and other tasks. The support scripts used by Moses are "released" by a Makefile which edits their paths to match your local environment. First, make a place for the scripts to live:

cd ../../../tools/
mkdir moses-scripts
cd moses/scripts

Edit the Makefile as needed. Here's my diff:

13,14c13,14
< TARGETDIR?=/home/s0565741/terabyte/bin
< BINDIR?=/home/s0565741/terabyte/bin
---
> TARGETDIR?=/home/guest/tools/moses-scripts
> BINDIR?=/home/guest/tools/bin

make release

This will create a time-stamped folder named /home/guest/tools/moses-scripts/scripts-YYYYMMDD-HHMM with released versions of all the scripts. You will call these versions when training and tuning Moses. Some Moses training scripts also require a SCRIPTS_ROOTDIR environment variable to be set. The output of make release should indicate this. Most scripts allow you to override it by setting a -scripts-root-dir flag or something similar.

export SCRIPTS_ROOTDIR=/home/guest/tools/moses-scripts/scripts-YYYYMMDD-HHMM
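
A quick check that the release worked and that the variable points at the right place:

ls $SCRIPTS_ROOTDIR

You should see subdirectories such as training/ and recaser/; the commands later in this guide call scripts out of these.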

Additional Scripts

There are a few scripts not included with Moses which are useful for preparing data. These were originally made available as part of the WMT08 Shared Task and Europarl v3 releases; I've consolidated some of them into one set.


cd /home/guest/tools
wget http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz
On the Mac: curl -O http://homepages.inf.ed.ac.uk/jschroe1/how-to/scripts.tgz
tar -xzvf scripts.tgz

We'll also get a NIST scoring tool.

wget ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl
On the Mac, use ftp or a web browser to get the file. curl and I had a fight about it.
chmod +x mteval-v11b.pl

PART II - Build a Model

This part is very different from that in the original text.

We'll use a bilingual sentence-aligned corpus of around 510,000 sentences in Welsh and English taken from the proceedings of the National Assembly for Wales. This should be good enough for testing purposes and still be doable in a reasonable amount of time on most machines. The source language is Welsh (cy) and the target language is English (en).

cd ../data
wget http://xixona.dlsi.ua.es/corpora/UAGT-PNAW/UAGT-PNAW-1.0.1.tar.gz

tar -xzvf UAGT-PNAW-1.0.1.tar.gz

If you're low on disk space, remove the full tar.
rm UAGT-PNAW-1.0.1.tar.gz

cd ../

Prepare Data

First we'll set up a working directory where we'll store all the data we prepare.

mkdir work
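
The training and language-modelling commands below expect tokenized and lowercased files under work/corpus/. Here is a minimal sketch of that preparation; the file names inside the extracted UAGT-PNAW archive are my assumption, so adjust the input paths to whatever tar actually produced, and this also assumes the WMT08 scripts package provides tokenizer.perl alongside the lowercase.perl and detokenizer.perl we use elsewhere.

mkdir -p work/corpus
tools/scripts/tokenizer.perl -l cy < data/UAGT-PNAW-1.0.1/pnaw.cy > work/corpus/pnaw-full.tok.cy
tools/scripts/tokenizer.perl -l en < data/UAGT-PNAW-1.0.1/pnaw.en > work/corpus/pnaw-full.tok.en
tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/clean-corpus-n.perl work/corpus/pnaw-full.tok cy en work/corpus/pnaw.clean 1 40
tools/scripts/lowercase.perl < work/corpus/pnaw.clean.cy > work/corpus/pnaw.lowercased.cy
tools/scripts/lowercase.perl < work/corpus/pnaw.clean.en > work/corpus/pnaw.lowercased.en

The clean-corpus-n.perl step drops empty lines and sentence pairs longer than 40 tokens, which keeps GIZA++ training manageable; the full (uncleaned) tokenized English side is still used for the language model in the next step.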

Build Language Model

Language models are concerned only with n-grams in the data, so sentence length doesn't impact training times the way it does in GIZA++. Many people incorporate extra target-language monolingual data into their language models. So, we'll lowercase the full 510,792 sentences to use for language modeling.

mkdir work/lm
tools/scripts/lowercase.perl < work/corpus/pnaw-full.tok.en > work/lm/pnaw-full.lowercased.en

We will use IRSTLM to build a trigram language model.

export IRSTLM=/home/guest/tools/irstlm
tools/irstlm/bin/build-lm.sh -t /tmp -i work/lm/pnaw-full.lowercased.en -o work/lm/pnaw-full.en.lm
tools/irstlm/bin/compile-lm work/lm/pnaw-full.en.lm.gz work/lm/pnaw-full.en.blm

Check the path on the last line: build-lm.sh gzips its output, so the file to compile is pnaw-full.en.lm.gz.
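
To confirm a 3-gram model was actually built, peek at the header of the intermediate (gzipped, ARPA-style) file; the exact header format varies a little between IRSTLM versions:

zcat work/lm/pnaw-full.en.lm.gz | head

You should see a \data\ section with ngram 1=..., ngram 2=... and ngram 3=... counts.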

Train Phrase Model

Skip...

The Moses toolkit does a great job of wrapping up calls to mkcls and GIZA++ inside a training script, and outputting the phrase and reordering tables needed for decoding. The script that does this is called train-factored-phrase-model.perl.

If you want to skip this step, you can use the pre-prepared model and ini files located at /afs/ms/u/m/mtm52/BIG/work/model/moses.ini and /afs/ms/u/m/mtm52/BIG/work/model/moses-bin.ini instead of the local references used in this tutorial. Move on to sanity checking your setup.

... up to here.

We'll run this in the background and nice it since it'll peg the CPU while it runs. It may take up to an hour, so this might be a good time to run through the tutorial page mentioned earlier using the sample-models data.

Don't forget to change YYYYMMDD-HHMM to the actual value in your directory.

nohup nice tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir tools/moses-scripts/scripts-YYYYMMDD-HHMM/ -root-dir work -corpus work/corpus/pnaw.lowercased -f cy -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 1:3:/home/guest/work/lm/pnaw-full.en.blm >& work/training.out &

(For reference, the original guide's French-English news-commentary command on the Mac was:)

nohup nice tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -scripts-root-dir tools/moses-scripts/scripts-YYYYMMDD-HHMM/ -root-dir work -corpus work/corpus/news-commentary.lowercased -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/Users/josh/demo/work/lm/news-commentary.lm >& work/training.out &

You can tail -f the work/training.out file to watch the progress of the training script. The last step will say something like:

(9) create moses.ini @ Tue Jan 27 19:40:46 CET 2009

Now would be a good time to look at what we've done.

cd work
ls
corpus giza.cy-en giza.en-cy  lm  model

We'll look in the model directory. The three files we really care about are moses.ini, phrase-table.gz, and reordering-table.gz. (Your sizes and dates may be quite different.)

cd model
ls -l
total 192554
-rw-r--r-- 1 jschroe1 people  5021309 Jan 27 19:23 aligned.grow-diag-final-and
-rw-r--r-- 1 jschroe1 people 27310991 Jan 27 19:24 extract.gz
-rw-r--r-- 1 jschroe1 people 27043024 Jan 27 19:25 extract.inv.gz
-rw-r--r-- 1 jschroe1 people 21069284 Jan 27 19:25 extract.o.gz
-rw-r--r-- 1 jschroe1 people  6061767 Jan 27 19:23 lex.e2f
-rw-r--r-- 1 jschroe1 people  6061767 Jan 27 19:23 lex.f2e
-rw-r--r-- 1 jschroe1 people     1032 Jan 27 19:40 moses.ini
-rw-r--r-- 1 jschroe1 people 67333222 Jan 27 19:40 phrase-table.gz
-rw-r--r-- 1 jschroe1 people 26144298 Jan 27 19:40 reordering-table.gz

Memory-Map LM and Phrase Table (Optional)

We'll skip this one.

The language model and phrase table can be memory-mapped on disk to minimize the amount of RAM they consume. This isn't really necessary for this size of model, but we'll do it just for the experience.

More information is available on the Moses' web site at: http://www.statmt.org/moses/?n=Moses.AdvancedFeatures and http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel.

Performing these steps can lead to heavy disk use during decoding - you're basically using your hard drive as RAM. Proceed at your own risk, especially if you're using a (slow) networked drive.
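
If you do want to try it, the rough shape of the commands is sketched below, based on the Moses and IRSTLM documentation pages linked above; we won't run these today, and the exact flags may differ in your checkout, so treat it as a starting point. It assumes the processPhraseTable and processLexicalTable helpers were built under tools/moses/misc/.

cd work/model
export LC_ALL=C
zcat phrase-table.gz | sort | ../../tools/moses/misc/processPhraseTable -ttable 0 0 - -nscores 5 -out phrase-table.0-0
../../tools/moses/misc/processLexicalTable -in reordering-table.gz -out reordering-table
../../tools/irstlm/bin/compile-lm --memmap 1 ../lm/pnaw-full.en.lm.gz ../lm/pnaw-full.en.blm.mm
cd ../../

You would then copy moses.ini to moses-bin.ini and point its phrase-table, reordering-table, and language-model entries at these binarized/memory-mapped files; that is the moses-bin.ini referred to elsewhere in this guide.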

Sanity Check Trained Model

We'll do the next one.

We haven't tuned yet, but let's just check that the decoder works. (Add -v 2 to the command line if you want more logging output.)

Here's an excerpt of moses initializing with the memory-mapped (binary) files in place (note the near-instant load times, and the TMP=/tmp setting that IRSTLM's memory-mapped LM needs):

echo "c' est une petite maison ." | TMP=/tmp tools/moses/moses-cmd/src/moses -f work/model/moses-bin.ini
Loading lexical distortion models...
have 1 models
Creating lexical reordering...
weights: 0.300 0.300 0.300 0.300 0.300 0.300 
binary file loaded, default OFF_T: -1
Created lexical orientation reordering
Start loading LanguageModel /home/jschroe1/demo/work/lm/news-commentary.blm.mm : [0.000] seconds
In LanguageModelIRST::Load: nGramOrder = 3
Loading LM file (no MAP)
blmt
loadbin()
mapping 36035 1-grams
mapping 411595 2-grams
mapping 118368 3-grams
done
OOV code is 1468
IRST: m_unknownId=1468
Finished loading LanguageModels : [0.000] seconds
Start loading PhraseTable /amd/nethome/jschroe1/demo/work/model/phrase-table.0-0 : [0.000] seconds
using binary phrase tables for idx 0
reading bin ttable
size of OFF_T 8
binary phrasefile loaded, default OFF_T: -1
Finished loading phrase tables : [1.000] seconds
IO from STDOUT/STDIN

And here's one if you skipped the memory-mapping steps (this is the one we'll do):

echo "Dwyedodd llefarydd ar ran y llywodraeth bod Rali GB Cymru i fod i gael statws Pencampwriaeth y Byd bob blwyddyn ." | tools/moses/moses-cmd/src/moses -f work/model/moses.ini
Loading lexical distortion models...
have 1 models
Creating lexical reordering...
weights: 0.300 0.300 0.300 0.300 0.300 0.300 
Loading table into memory...done.
Created lexical orientation reordering
Start loading LanguageModel /home/jschroe1/demo/work/lm/news-commentary.lm : [47.000] seconds
/home/jschroe1/demo/work/lm/news-commentary.lm: line 1476: warning: non-zero probability for <unk> in closed-vocabulary LM
Finished loading LanguageModels : [49.000] seconds
Start loading PhraseTable /amd/nethome/jschroe1/demo/work/model/phrase-table.0-0.gz : [49.000] seconds
Finished loading phrase tables : [259.000] seconds
IO from STDOUT/STDIN

Compare the two runs: the memory-mapped models load almost instantly and keep the memory footprint small, but decoding will be slower because of disk access; the in-memory models take far longer to load but decode faster.


This will be enough for the day!

PART III - Prepare Tuning and Test Sets

Prepare Data

We'll use some of the dev and devtest data from WMT08. We'll stick with news-commentary data and use dev2007 and test2007. Only the input (FR) side of the test data needs to be prepared for decoding; the reference SGML files are used as-is when scoring.
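
A sketch of the preparation, assuming you have already downloaded and unpacked the WMT08 dev and devtest archives under data/dev and data/devtest, and that the plain-text files are named as below (adjust to what the archives actually contain); the SGML files under data/devtest are left alone and used directly for scoring in Part VII:

mkdir -p work/tuning work/evaluation
tools/scripts/tokenizer.perl -l fr < data/dev/nc-dev2007.fr > work/tuning/nc-dev2007.tok.fr
tools/scripts/tokenizer.perl -l en < data/dev/nc-dev2007.en > work/tuning/nc-dev2007.tok.en
tools/scripts/lowercase.perl < work/tuning/nc-dev2007.tok.fr > work/tuning/nc-dev2007.lowercased.fr
tools/scripts/lowercase.perl < work/tuning/nc-dev2007.tok.en > work/tuning/nc-dev2007.lowercased.en
tools/scripts/tokenizer.perl -l fr < data/devtest/nc-test2007.fr > work/evaluation/nc-test2007.tok.fr
tools/scripts/lowercase.perl < work/evaluation/nc-test2007.tok.fr > work/evaluation/nc-test2007.lowercased.fr

The tuning set needs both sides (mert-moses.pl scores its decoder output against the English reference); the test set only needs the French input prepared, since scoring uses the SGML reference.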


PART IV - Tuning

Note that this step can take many hours, even days, to run on large phrase tables and tuning sets. We'll use the non-memory-mapped versions for decoding speed. The tuning script keeps large phrase and reordering tables manageable by filtering them down to the entries relevant to the tuning set (we'll do this ourselves for the test data later).

nohup nice tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl work/tuning/nc-dev2007.lowercased.fr work/tuning/nc-dev2007.lowercased.en tools/moses/moses-cmd/src/moses work/model/moses.ini --working-dir work/tuning/mert --rootdir /home/guest/tools/moses-scripts/scripts-YYYYMMDD-HHMM/ --decoder-flags "-v 0" >& work/tuning/mert.out &

Since this can take so long, we can instead make a small, 100 sentence tuning set just to see if the tuning process works. This won't generate very good weights, but it will let us confirm that our tools work.

head -n 100 work/tuning/nc-dev2007.lowercased.fr > work/tuning/nc-dev2007.lowercased.100.fr
head -n 100 work/tuning/nc-dev2007.lowercased.en > work/tuning/nc-dev2007.lowercased.100.en
nohup nice tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/mert-moses.pl work/tuning/nc-dev2007.lowercased.100.fr work/tuning/nc-dev2007.lowercased.100.en tools/moses/moses-cmd/src/moses work/model/moses.ini --working-dir work/tuning/mert --rootdir /home/guest/tools/moses-scripts/scripts-YYYYMMDD-HHMM/ --decoder-flags "-v 0" >& work/tuning/mert.out &

(Note that the scripts rootdir path needs to be absolute).

While this runs, check out the contents of work/tuning/mert. You'll see a set of runs, n-best lists for each, and run*.moses.ini files showing the weights used for each run. You can see the score each run is getting by looking at the last line of each run*.cmert.log file:

cd work/tuning/mert
tail -n 1 run*.cmert.log

==> run1.cmert.log <==
Best point: 0.028996 0.035146 -0.661477 -0.051250 0.001667 0.056762 0.009458 0.005504 -0.006458 0.029992 0.009502 0.012555 0.000000 -0.091232 => 0.282865

==> run2.cmert.log <==
Best point: 0.056874 0.039994 0.046105 -0.075984 0.032895 0.020815 -0.412496 0.018823 -0.019820 0.038267 0.046375 0.011876 -0.012047 -0.167628 => 0.281207

==> run3.cmert.log <==
Best point: 0.041904 0.030602 -0.252096 -0.071206 0.012997 0.516962 0.001084 0.010466 0.001683 0.008451 0.001386 0.007512 -0.014841 -0.028811 => 0.280953

==> run4.cmert.log <==
Best point: 0.088423 0.118561 0.073049 0.060186 0.043942 0.293692 -0.147511 0.037605 0.008851 0.019371 0.015986 0.018539 0.001918 -0.072367 => 0.280063

==> run5.cmert.log <==
Best point: 0.059100 0.049655 0.187688 0.010163 0.054140 0.077241 0.000584 0.101203 0.014712 0.144193 0.219264 -0.005517 -0.047385 -0.029156 => 0.280930

This gives you an idea of whether the system is improving. You can see that in this case it isn't, because we don't have enough data in our system and we haven't let tuning run for enough iterations. Kill mert-moses.pl after a few iterations just to get some weights to use (one way to do this is shown below).
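
One way to stop it, assuming nothing else called moses or mert-moses.pl is running on the machine (the wrapper spawns decoder and cmert child processes, so stop those too if they linger):

pkill -f mert-moses.pl
pkill -f moses-cmd/src/moses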

If mert were to finish successfully, it would create a file named work/tuning/mert/moses.ini containing all the weights we need. Since we killed mert, copy the best moses.ini config to be the one we'll use. Note that the weights calculated in run1.cmert.log were used to make the config file for run2, so we want run2.moses.ini.

If you want to use the weights from a finished mert run, try /afs/ms/u/m/mtm52/BIG/work/tuning/mert/moses.ini

cp run2.moses.ini moses.ini

Insert weights into configuration file

cd ../../../
tools/scripts/reuse-weights.perl work/tuning/mert/moses.ini < work/model/moses.ini > work/tuning/moses-tuned.ini
tools/scripts/reuse-weights.perl work/tuning/mert/moses.ini < work/model/moses-bin.ini > work/tuning/moses-tuned-bin.ini

PART V - Filtering Test Data

Filtering is another way, like binarizing, to help reduce memory requirements. It makes smaller phrase and reordering tables that contain only entries that will be used for a particular test set. Binarized models don't need to be filtered since they don't take up RAM when used. Moses has a script that does this for us, which we'll apply to the evaluation test set we prepared earlier:

tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/filter-model-given-input.pl  work/evaluation/filtered.nc-test2007 work/tuning/moses-tuned.ini work/evaluation/nc-test2007.lowercased.fr 
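
The script writes a reduced phrase table and reordering table into that directory, along with a moses.ini that points at them; that moses.ini is the configuration we'll hand to the decoder in Part VI.

ls work/evaluation/filtered.nc-test2007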

There is also a filter-and-binarize-model-given-input.pl script if your filtered table would still be too large to load into memory.


PART VI - Run Tuned Decoder on Development Test Set

We'll try this a few ways.

All three of these outputs should be identical, but they will take different amounts of time and memory to compute.
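
Here is the filtered run as a sketch; the other variants just swap in work/tuning/moses-tuned.ini (unfiltered, in-memory) or work/tuning/moses-tuned-bin.ini (memory-mapped) as the -f argument, and the name of the stderr log file below is just a suggestion:

nohup nice tools/moses/moses-cmd/src/moses -f work/evaluation/filtered.nc-test2007/moses.ini < work/evaluation/nc-test2007.lowercased.fr > work/evaluation/nc-test2007.tuned-filtered.output 2> work/evaluation/nc-test2007.decode.out &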

If you don't have time to run a full decoding session, you can use an output located at /afs/ms/u/m/mtm52/BIG/work/evaluation/nc-test2007.tuned-filtered.output


PART VII - Evaluation

Train Recaser

Now we'll train a recaser. It uses a statistical model to "translate" between lowercased and cased data.

mkdir work/recaser
tools/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/train-recaser.perl -train-script tools/moses-scripts/scripts-YYYYMMDD-HHMM/training/train-factored-phrase-model.perl -ngram-count tools/srilm/bin/i686/ngram-count -corpus work/corpus/news-commentary.tok.en -dir /home/guest/work/recaser -scripts-root-dir tools/moses-scripts/scripts-YYYYMMDD-HHMM/

This goes through a whole GIZA and LM training run to go from lowercase sentences to cased sentences. Note that the -dir flag needs to be absolute.

Recase the output

tools/moses-scripts/scripts-YYYYMMDD-HHMM/recaser/recase.perl -model work/recaser/moses.ini -in work/evaluation/nc-test2007.tuned-filtered.output -moses tools/moses/moses-cmd/src/moses > work/evaluation/nc-test2007.tuned-filtered.output.recased

Detokenize the output

tools/scripts/detokenizer.perl -l en < work/evaluation/nc-test2007.tuned-filtered.output.recased > work/evaluation/nc-test2007.tuned-filtered.output.detokenized

Wrap the output in XML

tools/scripts/wrap-xml.perl data/devtest/nc-test2007-ref.en.sgm en my-system-name < work/evaluation/nc-test2007.tuned-filtered.output.detokenized > work/evaluation/nc-test2007.tuned-filtered.output.sgm

Score with NIST-BLEU

tools/mteval-v11b.pl -s data/devtest/nc-test2007-src.fr.sgm -r data/devtest/nc-test2007-ref.en.sgm -t work/evaluation/nc-test2007.tuned-filtered.output.sgm -c

  Evaluation of any-to-en translation using:
    src set "nc-test2007" (1 docs, 2007 segs)
    ref set "nc-test2007" (1 refs)
    tst set "nc-test2007" (1 systems)

NIST score = 6.9126  BLEU score = 0.2436 for system "my-system-name"

We got a BLEU score of 24.4! Hooray! Best translations ever! Let's all go to the pub!

Appendix A - Versions