This mini-tutorial assumes no previous knowledge or expertise. Any researcher should be able to follow it; if you cannot, please contact us.
First of all, download our software. Windows users: our software works only on Unix-like systems, so please install Cygwin. Make sure that you can execute the three provided files; if you cannot, run this command:
chmod +x ft*.sh sample.py
# BTW, lines starting with the # symbol are command-line comments and so can be safely ignored.
Imagine that we want to build a memrep (memorable and representative) phrase set in Spanish. Our target language is therefore Spanish. We will need two different text datasets, each consisting of one sentence per line: a Universe dataset, which represents the target language, and a Domain dataset, from which the phrases will be sampled.
Typically, the Universe dataset is bigger than the Domain dataset, but it can be smaller as long as it represents the target language accurately enough.
For this mini-tutorial, we will use the News Crawl dataset (articles from 2012) as Universe and the Europarl Spanish-English parallel corpus as Domain (we will only use the Spanish sentences). You will need around 4 GB of free disk space. We will work in the /tmp directory.
# Download (3.1 GB).
wget http://www.statmt.org/wmt13/training-monolingual-news-2012.tgz
# Extract files.
tar xzf training-monolingual-news-2012.tgz
# The file training-monolingual/news.2012.es.shuffled contains the Spanish sentences.
# Download (186 MB).
wget http://www.statmt.org/europarl/v7/es-en.tgz
# Extract files.
tar xzf es-en.tgz
# The file europarl-v7.es-en.es contains the Spanish sentences.
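Since both datasets are expected to contain one sentence per line, a quick sanity check can be useful before going further. The following is a minimal sketch, not part of our software: the function names `line_stats` and `file_stats` are hypothetical, and words are counted by splitting on whitespace, which is an assumption that may differ from the tokenization used by the provided scripts.

```python
import math


def line_stats(lines):
    """Return (num_lines, mean_words, sd_words) for an iterable of sentences."""
    n = total = total_sq = 0
    for line in lines:
        words = len(line.split())  # whitespace tokenization (an assumption)
        n += 1
        total += words
        total_sq += words * words
    if n == 0:
        return 0, 0.0, 0.0
    mean = total / n
    # Population variance, clamped at zero to absorb rounding errors.
    variance = max(total_sq / n - mean * mean, 0.0)
    return n, mean, math.sqrt(variance)


def file_stats(path):
    """Stream a one-sentence-per-line file, so large files are not loaded in memory."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return line_stats(handle)


# Tiny in-memory demonstration; for a real file, call e.g.
# file_stats("/tmp/europarl-v7.es-en.es").
print(line_stats(["hola mundo", "esto es una prueba"]))
```

The number of lines should match the number of sentences you expect, and the mean gives a rough idea of how long the sentences are.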
Now we can compute the features of the Domain sentences, using the Universe dataset as reference:
./ftget.sh -c europarl-v7.es-en.es -u training-monolingual/news.2012.es.shuffled -r -y > feats-es.txt
# Computing universe words...
# Computing universe vocabulary...
# Computing word counts per phrase...
# Computing phrase features...
We are almost ready to sample our Spanish phrases from the Domain dataset. First we need a config file, which follows this template:
[input]
sentences_file = /path/to/sentences.file
features_file = /path/to/sentence-features.file
[output]
num_phrases = 500
[representativeness]
; Mean and SD of given features, for representativeness scoring.
; These are corpus-dependent, and can be easily computed with ftrep.sh,
; e.g. `ftrep.sh sentence-features.file`
m_nw = ??
s_nw = ??
m_sdchars = ??
s_sdchars = ??
m_oovs = ??
s_oovs = ??
m_lprob = ??
s_lprob = ??
[memorability]
; Weights for memorability scoring, according to the Leiva & Sanchis-Trilles paper.
; These are corpus-independent and should not be modified.
w_nw = 0.83
w_sdchars = 0.49
w_oovs = 6.95
w_lprob = -1.00
As you can see, we need to specify two input files and the mean and SD of each feature, which are used for representativeness scoring.
For the input files, we will just point to our Domain files (europarl-v7.es-en.es and feats-es.txt).
For the means and SDs, we will just execute the following command:
./ftrep.sh feats-es.txt
# Output: 26.261@16.3112 3.25248@0.644104 0.0139927@0.0598896 -7.8097@1.04948
# These are the Mean@SD tuples for the 4 features (NW, SDChr, OOV, LProb) in Domain dataset.
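Copying these eight numbers by hand is error-prone, so the conversion can be scripted. The following is a minimal sketch, not part of our software: the function name `ftrep_to_ini` is hypothetical, and it assumes that the output is a single line of four Mean@SD tuples in the order stated above (NW, SDChr, OOV, LProb).

```python
# Config key suffixes, in the order in which ftrep.sh reports the features.
FEATURES = ("nw", "sdchars", "oovs", "lprob")


def ftrep_to_ini(output):
    """Turn a 'mean@sd mean@sd ...' line into a [representativeness] section."""
    tuples = output.split()
    if len(tuples) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} Mean@SD tuples, got {len(tuples)}")
    lines = ["[representativeness]"]
    for name, pair in zip(FEATURES, tuples):
        mean, sd = pair.split("@")
        float(mean), float(sd)  # raises ValueError if a value is not numeric
        # Keep the original strings so that no digits are lost or reformatted.
        lines.append(f"m_{name} = {mean}")
        lines.append(f"s_{name} = {sd}")
    return "\n".join(lines)


print(ftrep_to_ini("26.261@16.3112 3.25248@0.644104 0.0139927@0.0598896 -7.8097@1.04948"))
```

The printed section can be pasted directly into the config file.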
So we can now save our config-es.ini file. We will indicate that we want 10 phrases, just for testing purposes:
[input]
sentences_file = /tmp/europarl-v7.es-en.es
features_file = /tmp/feats-es.txt
[output]
num_phrases = 10
[representativeness]
m_nw = 26.261
s_nw = 16.3112
m_sdchars = 3.25248
s_sdchars = 0.644104
m_oovs = 0.0139927
s_oovs = 0.0598896
m_lprob = -7.8097
s_lprob = 1.04948
[memorability]
w_nw = 0.83
w_sdchars = 0.49
w_oovs = 6.95
w_lprob = -1.00
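Before sampling, it may be worth checking that the config file is complete and that no `??` placeholder from the template is left. The following is a minimal sketch, not part of our software: the function name `check_config` is hypothetical, and it reads the file with Python's standard `configparser` module, which may or may not be how sample.py reads it.

```python
import configparser

FEATURES = ("nw", "sdchars", "oovs", "lprob")
# Sections and keys that appear in the config template.
REQUIRED = {
    "input": ["sentences_file", "features_file"],
    "output": ["num_phrases"],
    "representativeness": [p + f for f in FEATURES for p in ("m_", "s_")],
    "memorability": ["w_" + f for f in FEATURES],
}


def check_config(ini_text):
    """Return a list of problems found in the config text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        for key in keys:
            value = parser.get(section, key, fallback=None)
            if value is None or value == "":
                problems.append(f"[{section}] {key} is missing")
            elif section != "input":
                # Everything outside [input] must be numeric; '??' is not.
                try:
                    float(value)
                except ValueError:
                    problems.append(f"[{section}] {key} is not a number: {value!r}")
    return problems
```

For example, `check_config(open("config-es.ini").read())` should return an empty list for the file above.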
./sample.py config-es.ini > phrases-es.txt
If you open phrases-es.txt, you will see that the sampled phrases are very long. This is because the Domain dataset has 26.261 words per sentence on average!
Therefore, we can specify a more adequate mean and SD for the number of words, in order to conduct a better text entry experiment.
Use these values:
m_nw = 6.261
s_nw = 1.3112
And run the sampling script again.
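If you want to try several target lengths, you may prefer to generate config variants instead of editing the file by hand. The following is a minimal sketch, not part of our software: the function name `with_length_target` is hypothetical, and it relies on Python's standard `configparser` module.

```python
import configparser
import io


def with_length_target(ini_text, mean_words, sd_words):
    """Return a copy of the config text with new m_nw and s_nw values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    section = parser["representativeness"]  # KeyError if the section is absent
    section["m_nw"] = str(mean_words)
    section["s_nw"] = str(sd_words)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


example = "[representativeness]\nm_nw = 26.261\ns_nw = 16.3112\n"
print(with_length_target(example, 6.261, 1.3112))
```

Note that `configparser` drops comment lines when writing, so it is safer to save the result under a new file name and keep the original config untouched.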
For comparison, the following values correspond to MacKenzie & Soukoreff's English phrase set:
m_nw = 5.41317
s_nw = 1.12024
m_sdchars = 2.50494
s_sdchars = 0.92618
m_oovs = 0.309881
s_oovs = 0.149455
m_lprob = -9.17064
s_lprob = 2.41214
So you could use these representativeness scores to build a custom English phrase set whose phrases would perform similarly to MacKenzie & Soukoreff's,
though in practice you may want to change only m_nw and s_nw and keep the rest of the computed scores.
(cc) 2024 Luis A. Leiva, Germán Sanchis-Trilles