This mini-tutorial assumes no previous knowledge or expertise. It should be easy for any researcher to follow; if it is not, please contact us.
First of all, you must download our software. Windows users: our software works only on Unix systems, so please install Cygwin. Please make sure that you can execute the three provided files; otherwise, run this command:
chmod +x ft*.sh sample.py # BTW, lines starting with the # symbol are command-line comments and so can be safely ignored.
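Imagine that we want to build a memrep (memorable and representative) phrase set in Spanish. Our target language is therefore Spanish. We will need two different text datasets, each consisting of one sentence per line: a Universe dataset, which represents the target language, and a Domain dataset, from which the phrases are sampled.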
Typically, the Universe dataset is bigger than the Domain dataset, but it can be smaller as long as it represents the target language accurately enough.
For this mini-tutorial, we will use the News Crawl dataset (articles from 2012) as Universe and the Spanish-English Europarl parallel corpus as Domain (we will only use the Spanish sentences). You will need around 4 GB of free disk space. We will work in the /tmp directory.
# Download (3.1 GB).
wget http://www.statmt.org/wmt13/training-monolingual-news-2012.tgz
# Extract files.
tar xzf training-monolingual-news-2012.tgz
# The file training-monolingual/news.2012.es.shuffled contains the Spanish sentences.

# Download (186 MB).
wget http://www.statmt.org/europarl/v7/es-en.tgz
# Extract files.
tar xzf es-en.tgz
# The file europarl-v7.es-en.es contains the Spanish sentences.
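Since both datasets are expected to contain one sentence per line, it can help to check the extracted files before going further. The following is a minimal sketch, not part of the provided software: the function name `corpus_stats` is hypothetical, and words are counted by splitting on whitespace, which is an assumption and may differ from how the provided scripts count words.

```python
import statistics


def corpus_stats(path):
    """Summarize a one-sentence-per-line text file.

    Returns (total_lines, empty_lines, mean_words, sd_words), where the
    word statistics are computed over non-empty lines only.
    """
    word_counts = []
    empty_lines = 0
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # Naive whitespace tokenization (an assumption of this sketch).
            words = line.split()
            if words:
                word_counts.append(len(words))
            else:
                empty_lines += 1
    total_lines = len(word_counts) + empty_lines
    mean_words = statistics.fmean(word_counts) if word_counts else 0.0
    # Population standard deviation of the number of words per line.
    sd_words = statistics.pstdev(word_counts) if len(word_counts) > 1 else 0.0
    return total_lines, empty_lines, mean_words, sd_words
```

For example, `corpus_stats("/tmp/europarl-v7.es-en.es")` reports how many lines the Domain file has, how many of them are empty, and the mean and standard deviation of words per line.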
Next, compute the phrase features:

./ftget.sh -c europarl-v7.es-en.es -u training-monolingual/news.2012.es.shuffled -r -y > feats-es.txt
# Computing universe words...
# Computing universe vocabulary...
# Computing word counts per phrase...
# Computing phrase features...

We are almost ready to sample our Spanish phrases from the Domain dataset. The sampling code reads a configuration file based on this template:

[input]
sentences_file = /path/to/sentences.file
features_file = /path/to/sentence-features.file

[output]
num_phrases = 500

[representativeness]
; Mean and SD of given features, for representativeness scoring.
; These are corpus-dependent, and can be easily computed with ftrep.sh,
; e.g. `ftrep.sh sentence-features.file`
m_nw = ??
s_nw = ??
m_sdchars = ??
s_sdchars = ??
m_oovs = ??
s_oovs = ??
m_lprob = ??
s_lprob = ??

[memorability]
; Weights for memorability scoring, according to Leiva & Sanchis-Trilles paper.
; These are corpus-independent and should not be modified.
w_nw = 0.83
w_sdchars = 0.49
w_oovs = 6.95
w_lprob = -1.00

As you might know, we need to specify two input files and a number of values for "Mean and SD of given features, for representativeness scoring." For the input files, we will just point to our Domain files (europarl-v7.es-en.es and feats-es.txt). For the values, we will just execute the following command:

./ftrep.sh feats-es.txt
# Output: 26.261@16.3112 3.25248@0.644104 0.0139927@0.059889 -7.8097@1.04948
# These are the Mean@SD tuples for the 4 features (NW, SDChr, OOV, LProb) in Domain dataset.

So we can now save our config-es.ini file. We will indicate that we want 10 phrases, just for testing purposes:

[input]
sentences_file = /tmp/europarl-v7.es-en.es
features_file = /tmp/feats-es.txt

[output]
num_phrases = 10

[representativeness]
m_nw = 26.261
s_nw = 16.3112
m_sdchars = 3.25248
s_sdchars = 0.644104
m_oovs = 0.0139927
s_oovs = 0.059889
m_lprob = -7.8097
s_lprob = 1.04948

[memorability]
w_nw = 0.83
w_sdchars = 0.49
w_oovs = 6.95
w_lprob = -1.00

Now run the sampling code:

./sample.py config-es.ini > phrases-es.txt

As you can see, the sampled phrases are very long. This is because the Domain dataset has 26.261 words per sentence on average! We can therefore indicate a more adequate range for the number of words in order to conduct a better text entry experiment. Use these values:

m_nw = 6.261
s_nw = 1.3112

And run the sampling code again.

Finally, consider the following representativeness scores:

m_nw = 5.41317
s_nw = 1.12024
m_sdchars = 2.50494
s_sdchars = 0.92618
m_oovs = 0.309881
s_oovs = 0.149455
m_lprob = -9.17064
s_lprob = 2.41214

You could use these representativeness scores to build a custom English phrase set whose phrases would perform similarly to MacKenzie & Soukoreff's, though in practice you may want to change only m_nw and s_nw and preserve the rest of the computed scores.
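Copying the Mean@SD tuples into the configuration file by hand is error-prone, so the step can be scripted. The sketch below is one possible way to do it and is not part of the provided software: the helper names `parse_tuples` and `build_config` are hypothetical, and it assumes that sample.py accepts an INI file as written by Python's standard `configparser` module.

```python
import configparser

# Feature order of the Mean@SD tuples, as described in this tutorial:
# NW, SDChr, OOV, LProb.
FEATURE_KEYS = ("nw", "sdchars", "oovs", "lprob")

# Memorability weights copied from the config template (corpus-independent).
MEMORABILITY = {
    "w_nw": "0.83",
    "w_sdchars": "0.49",
    "w_oovs": "6.95",
    "w_lprob": "-1.00",
}


def parse_tuples(line):
    """Turn 'mean@sd mean@sd mean@sd mean@sd' into m_*/s_* config entries."""
    items = line.split()
    if len(items) != len(FEATURE_KEYS):
        raise ValueError("expected %d Mean@SD tuples" % len(FEATURE_KEYS))
    scores = {}
    for key, item in zip(FEATURE_KEYS, items):
        mean, sd = item.split("@")
        float(mean), float(sd)  # validate only; values are kept as strings
        scores["m_" + key] = mean
        scores["s_" + key] = sd
    return scores


def build_config(sentences_file, features_file, num_phrases, tuples_line,
                 overrides=None):
    """Assemble a config; `overrides` replaces selected scores (e.g. m_nw)."""
    scores = parse_tuples(tuples_line)
    scores.update(overrides or {})
    config = configparser.ConfigParser()
    config["input"] = {
        "sentences_file": sentences_file,
        "features_file": features_file,
    }
    config["output"] = {"num_phrases": str(num_phrases)}
    config["representativeness"] = scores
    config["memorability"] = dict(MEMORABILITY)
    return config
```

A possible usage, with the values from this tutorial: call `build_config("/tmp/europarl-v7.es-en.es", "/tmp/feats-es.txt", 10, "26.261@16.3112 3.25248@0.644104 0.0139927@0.059889 -7.8097@1.04948", overrides={"m_nw": "6.261", "s_nw": "1.3112"})` and write the result with `config.write(open("config-es.ini", "w"))`. Passing only `m_nw` and `s_nw` as overrides keeps the rest of the computed scores untouched.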
(cc) 2021 Luis A. Leiva, Germán Sanchis-Trilles