Representatively Memorable: Sampling the Right Phrase Set to Get the Text Entry Experiment Right

This mini-tutorial assumes no previous knowledge or expertise. It should be easy for any researcher to follow; if it is not, please contact us.

How-To guide

First, download our software. Windows users: our software works only on Unix systems, so please install Cygwin. Make sure that you can execute the three provided files; otherwise, run this command:


chmod +x ft*.sh sample.py
# BTW, lines starting with the # symbol are command-line comments and so can be safely ignored.

An illustrative example with the Spanish language

Imagine that we want to build a memrep (memorable and representative) phrase set in Spanish. Our target language is therefore Spanish. We will need two different text datasets, each consisting of one sentence per line:

Universe: A dataset to extract some basic knowledge of the language.
Domain: The dataset from which phrases will be sampled.

Typically, the Universe dataset is bigger than the Domain dataset, but it can be smaller as long as it represents the target language accurately enough.

For this mini-tutorial, we will use the News Crawl dataset (articles from 2012) as Universe and the Spanish-English Europarl parallel corpus as Domain (we will use only the Spanish sentences). You will need around 4 GB of free disk space. We will work in the /tmp directory.

Steps (aka The Recipe)

Get the News Crawl dataset:


# Download (3.1 GB).
wget http://www.statmt.org/wmt13/training-monolingual-news-2012.tgz
# Extract files.
tar xzf training-monolingual-news-2012.tgz
# The file training-monolingual/news.2012.es.shuffled contains the Spanish sentences.

Get the Spanish Europarl dataset:


# Download (186 MB).
wget http://www.statmt.org/europarl/v7/es-en.tgz
# Extract files.
tar xzf es-en.tgz
# The file europarl-v7.es-en.es contains the Spanish sentences.

Compute the sentence features of the Domain dataset. This should take around 5 minutes on a modern computer. The basic syntax of this command is as follows:


ftget.sh -c domain_file -u universe_file

We will indicate that the output file should have a header, with the optional -r parameter, and will redirect the output to a TXT file. In sum, run this command:


./ftget.sh -c europarl-v7.es-en.es -u training-monolingual/news.2012.es.shuffled -r -y > feats-es.txt
# Computing universe words...
# Computing universe vocabulary...
# Computing word counts per phrase...
# Computing phrase features...

We are almost ready to sample our Spanish phrases from the Domain dataset.

The sampling program needs a config INI file, so we will create one. We need to supply the following information:


[input]
sentences_file = /path/to/sentences.file
features_file  = /path/to/sentence-features.file

[output]
num_phrases = 500

[representativeness]
; Mean and SD of given features, for representativeness scoring.
; These are corpus-dependent, and can be easily computed with ftrep.sh,
; e.g. `ftrep.sh sentence-features.file`
m_nw = ??
s_nw = ??
m_sdchars = ??
s_sdchars = ??
m_oovs = ??
s_oovs = ??
m_lprob = ??
s_lprob = ??

[memorability]
; Weights for memorability scoring, according to the Leiva & Sanchis-Trilles paper.
; These are corpus-independent and should not be modified.
w_nw = 0.83
w_sdchars = 0.49
w_oovs = 6.95
w_lprob = -1.00

As you can see, we need to specify two input files and the mean and SD of each feature for representativeness scoring. For the former, we will just point to our Domain files (europarl-v7.es-en.es and feats-es.txt). For the latter, we will just execute the following command:


./ftrep.sh feats-es.txt
# Output: 26.261@16.3112 3.25248@0.644104 0.0139927@0.0598896 -7.8097@1.04948
# These are the Mean@SD pairs for the 4 features (NW, SDChr, OOV, LProb) in the Domain dataset.
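
Copying the eight numbers by hand is error-prone, so the mapping from the ftrep.sh output to the config keys can be sketched in Python (the language of sample.py). The helper name parse_ftrep_output is hypothetical and not part of the provided software; the sketch only assumes the space-separated Mean@SD format and the feature order (NW, SDChr, OOV, LProb) shown above.

```python
def parse_ftrep_output(line):
    """Map 'mean@sd mean@sd ...' to the m_*/s_* keys of the config file.

    Hypothetical helper: assumes four space-separated Mean@SD pairs in the
    order NW, SDChr, OOV, LProb. Values are kept as strings so that no
    digits are lost or reformatted.
    """
    feature_names = ["nw", "sdchars", "oovs", "lprob"]
    pairs = line.split()
    if len(pairs) != len(feature_names):
        raise ValueError("expected %d Mean@SD pairs, got %d"
                         % (len(feature_names), len(pairs)))
    values = {}
    for name, pair in zip(feature_names, pairs):
        mean, sd = pair.split("@")
        values["m_" + name] = mean
        values["s_" + name] = sd
    return values


# The output line reported by ftrep.sh in this tutorial.
output = "26.261@16.3112 3.25248@0.644104 0.0139927@0.0598896 -7.8097@1.04948"
for key, value in parse_ftrep_output(output).items():
    print(key, "=", value)
```

The printed lines can be pasted directly under the [representativeness] section of the config file.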

So we can now save our config-es.ini file. We will indicate that we want 10 phrases, just for testing purposes:


  [input]
  sentences_file = /tmp/europarl-v7.es-en.es
  features_file  = /tmp/feats-es.txt

  [output]
  num_phrases = 10

  [representativeness]
  m_nw = 26.261
  s_nw = 16.3112
  m_sdchars = 3.25248
  s_sdchars = 0.644104
  m_oovs = 0.0139927
  s_oovs = 0.0598896
  m_lprob = -7.8097
  s_lprob = 1.04948

  [memorability]
  w_nw = 0.83
  w_sdchars = 0.49
  w_oovs = 6.95
  w_lprob = -1.00
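
Before sampling, it can help to confirm that no key was forgotten. The following is a minimal sketch using Python's standard configparser module; the REQUIRED_KEYS table and the missing_keys function are illustrative names built from the template above, and sample.py itself may read or validate the file differently.

```python
import configparser

# Sections and keys taken from the config template in this tutorial.
REQUIRED_KEYS = {
    "input": ["sentences_file", "features_file"],
    "output": ["num_phrases"],
    "representativeness": ["m_nw", "s_nw", "m_sdchars", "s_sdchars",
                           "m_oovs", "s_oovs", "m_lprob", "s_lprob"],
    "memorability": ["w_nw", "w_sdchars", "w_oovs", "w_lprob"],
}


def missing_keys(ini_text):
    """Return 'section.key' names that are absent from the INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            # has_option() is False when the section itself is missing.
            if not parser.has_option(section, key):
                missing.append(section + "." + key)
    return missing


# A deliberately incomplete config, to show what gets reported.
example = """
[input]
sentences_file = /tmp/europarl-v7.es-en.es
features_file = /tmp/feats-es.txt

[output]
num_phrases = 10
"""
print(missing_keys(example))
```

To check a real file, pass its contents instead, e.g. missing_keys(open("config-es.ini").read()); an empty list means every expected key is present.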

Finally, perform sampling:


./sample.py config-es.ini > phrases-es.txt

As you can see, the sampled phrases are very long. This is because the Domain dataset has 26.261 words per sentence on average! Therefore, we can indicate a more adequate range for the number of words in order to conduct a better text entry experiment. Use these values:


m_nw = 6.261
s_nw = 1.3112

And run the sampling code again.
Much better, right? In general, text entry experiments should prefer short sentences. The rest of the parameters can also be tuned at will, but for now we leave them as they are. FYI, the well-known MacKenzie & Soukoreff dataset has the following representativeness scores according to the English version of the News Crawl dataset (articles from 2012):


m_nw = 5.41317
s_nw = 1.12024
m_sdchars = 2.50494
s_sdchars = 0.92618
m_oovs = 0.309881
s_oovs = 0.149455
m_lprob = -9.17064
s_lprob = 2.41214

So you could use these representativeness scores to build a custom English phrase set whose phrases would perform similarly to MacKenzie & Soukoreff's, though in practice you may want to change only m_nw and s_nw and preserve the rest of the computed scores.
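
If you would rather derive m_nw and s_nw from a reference phrase set of your own, the computation can be sketched as follows. This is only an approximation under stated assumptions: words are split on whitespace and the SD is the population standard deviation, whereas ftget.sh and ftrep.sh may tokenize or compute the SD differently. The function name word_count_stats and the three toy phrases are illustrative.

```python
import statistics


def word_count_stats(phrases):
    """Return (mean, SD) of the number of words per phrase.

    Assumptions: whitespace tokenization and population SD; the provided
    scripts may use different conventions. Blank lines are skipped.
    """
    counts = [len(phrase.split()) for phrase in phrases if phrase.strip()]
    return statistics.mean(counts), statistics.pstdev(counts)


# Toy phrases for illustration only, not taken from any real phrase set.
toy_phrases = [
    "el gato duerme",
    "hoy hace mucho calor",
    "vamos a la playa mañana temprano",
]
mean_nw, sd_nw = word_count_stats(toy_phrases)
print("m_nw =", round(mean_nw, 4))
print("s_nw =", round(sd_nw, 4))
```

For a phrase file with one sentence per line, pass the open file object directly, since iterating over it yields one line at a time.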

(cc) 2025 Luis A. Leiva, Germán Sanchis-Trilles