Representatively Memorable: Sampling the Right Phrase Set to Get the Text Entry Experiment Right

This mini-tutorial assumes no previous knowledge or expertise. Any researcher should be able to follow it; if you get stuck, please contact us.

How-To guide

First of all, download our software. Windows users: our software works only on Unix systems, so please install Cygwin. Make sure that you can execute the three provided files; otherwise, run this command:

chmod +x ft*.sh sample.py
# BTW, lines starting with the # symbol are command-line comments and so can be safely ignored.

An illustrative example with the Spanish language

Imagine that we want to build a memrep (memorable and representative) phrase set in Spanish. Our target language is therefore Spanish. We will need two different text datasets, each consisting of one sentence per line:

  1. Universe: A dataset to extract some basic knowledge of the language.
  2. Domain: The dataset from which phrases will be sampled.

Typically, the Universe dataset is bigger than the Domain dataset, but it can be smaller as long as it represents the target language accurately enough.
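
Since both datasets must contain one sentence per line, it can help to inspect a file before using it. The following is a minimal sketch, not part of the provided software: the function name summarize_corpus is hypothetical, words are split on whitespace, and the SD is the population SD. Both choices are assumptions and may differ from what the provided scripts compute.

```python
import math


def summarize_corpus(path):
    """Summarize a one-sentence-per-line text file.

    Returns (sentences, blank_lines, mean_words, sd_words).
    Hypothetical helper: whitespace tokenization and population SD
    are assumptions of this sketch.
    """
    sentences = blank_lines = 0
    total = total_sq = 0
    # Running sums keep memory use flat, even for multi-gigabyte files.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            n_words = len(line.split())
            if n_words == 0:
                blank_lines += 1
                continue
            sentences += 1
            total += n_words
            total_sq += n_words * n_words
    if sentences == 0:
        raise ValueError(f"{path} contains no sentences")
    mean = total / sentences
    variance = max(total_sq / sentences - mean * mean, 0.0)
    return sentences, blank_lines, mean, math.sqrt(variance)
```

Calling summarize_corpus on a candidate Domain file gives a rough idea of how many sentences it holds and how long they are on average.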

For this mini-tutorial, we will use the News Crawl dataset (articles from 2012) as Universe and the Spanish-English Europarl parallel corpus as Domain (we will only use the Spanish sentences). You will need around 4 GB of free disk space. We will work in the /tmp directory.
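
Before downloading, the free space in the working directory can be checked with a short sketch like the one below. The function name has_free_space is hypothetical, and the 4 GB default simply mirrors the estimate given above.

```python
import shutil


def has_free_space(directory, required_gb=4.0):
    """Return True if `directory` has at least `required_gb` GB of free space."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * 1024 ** 3
```

For example, has_free_space("/tmp") tells whether /tmp can hold both datasets; if it returns False, another working directory may be a better choice.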

Steps (aka The Recipe)

  1. Get the News Crawl dataset:
    # Download (3.1 GB).
    wget http://www.statmt.org/wmt13/training-monolingual-news-2012.tgz
    # Extract files.
    tar xzf training-monolingual-news-2012.tgz
    # The file training-monolingual/news.2012.es.shuffled contains the Spanish sentences.
    
  2. Get the Spanish Europarl dataset:
    # Download (186 MB).
    wget http://www.statmt.org/europarl/v7/es-en.tgz
    # Extract files.
    tar xzf es-en.tgz
    # The file europarl-v7.es-en.es contains the Spanish sentences.
    
  3. Compute the sentence features of the Domain dataset. This should take around 5 minutes on a modern computer. The basic syntax of this command is as follows: ftget.sh -c domain_file -u universe_file
    We will request a header in the output with the optional -r parameter, and will redirect the output to a TXT file. In sum, run this command:
    ./ftget.sh -c europarl-v7.es-en.es -u training-monolingual/news.2012.es.shuffled -r -y > feats-es.txt
    # Computing universe words...
    # Computing universe vocabulary...
    # Computing word counts per phrase...
    # Computing phrase features...
    
    We are almost ready to sample our Spanish phrases from the Domain dataset.
  4. The sampling program needs a config INI file, so we will create one. We need to supply the following information:
    [input]
    sentences_file = /path/to/sentences.file
    features_file  = /path/to/sentence-features.file
    
    [output]
    num_phrases = 500
    
    [representativeness]
    ; Mean and SD of given features, for representativeness scoring.
    ; These are corpus-dependent, and can be easily computed with ftrep.sh,
    ; e.g. `ftrep.sh sentence-features.file`
    m_nw = ??
    s_nw = ??
    m_sdchars = ??
    s_sdchars = ??
    m_oovs = ??
    s_oovs = ??
    m_lprob = ??
    s_lprob = ??
    
    [memorability]
    ; Weights for memorability scoring, according to the Leiva & Sanchis-Trilles paper.
    ; These are corpus-independent and should not be modified.
    w_nw = 0.83
    w_sdchars = 0.49
    w_oovs = 6.95
    w_lprob = -1.00
    
    As you can see, we need to specify two input files and the mean and SD of each feature for representativeness scoring. For the input files, we will just point to our Domain files (europarl-v7.es-en.es and feats-es.txt). For the mean and SD values, we will run the following command:
    ./ftrep.sh feats-es.txt
    # Output: 26.261@16.3112 3.25248@0.644104 0.0139927@0.0598896 -7.8097@1.04948
    # These are the Mean@SD tuples for the 4 features (NW, SDChr, OOV, LProb) in Domain dataset.
    
    So we can now save our config-es.ini file. We will indicate that we want 10 phrases, just for testing purposes:
      [input]
      sentences_file = /tmp/europarl-v7.es-en.es
      features_file  = /tmp/feats-es.txt
    
      [output]
      num_phrases = 10
    
      [representativeness]
      m_nw = 26.261
      s_nw = 16.3112
      m_sdchars = 3.25248
      s_sdchars = 0.644104
      m_oovs = 0.0139927
      s_oovs = 0.0598896
      m_lprob = -7.8097
      s_lprob = 1.04948
    
      [memorability]
      w_nw = 0.83
      w_sdchars = 0.49
      w_oovs = 6.95
      w_lprob = -1.00
    
  5. Finally, perform sampling:
    ./sample.py config-es.ini > phrases-es.txt
    
    As you can see, the sampled phrases are very long. This is because the Domain dataset has 26.261 words per sentence on average! We can therefore specify a more suitable mean and SD for the number of words in order to conduct a better text entry experiment. In the [representativeness] section of config-es.ini, use these values:
    m_nw = 6.261
    s_nw = 1.3112
    
    Then run the sampling code again.
    Much better, huh? In general, text entry experiments should prefer short sentences. The rest of the parameters can also be tuned at will, but for now we leave them as is. FYI, the well-known MacKenzie & Soukoreff dataset has the following representativeness scores according to the English version of the News Crawl dataset (articles from 2012):
    m_nw = 5.41317
    s_nw = 1.12024
    m_sdchars = 2.50494
    s_sdchars = 0.92618
    m_oovs = 0.309881
    s_oovs = 0.149455
    m_lprob = -9.17064
    s_lprob = 2.41214
    
    So you could use these representativeness scores to build a custom English phrase set whose phrases would perform similarly to MacKenzie & Soukoreff's, though in practice you may want to change only m_nw and s_nw and keep the rest of the computed scores.
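
Copying the Mean@SD tuples from step 4 into the config file by hand is error-prone, so the mapping can be automated. The sketch below is not part of the provided software: parse_ftrep and build_config are hypothetical names, the feature order (NW, SDChr, OOV, LProb) is taken from the comment under the ftrep.sh output, and it is an assumption that sample.py accepts an INI file as written by Python's configparser.

```python
import configparser
import io

# Feature order as reported for the ftrep.sh output in step 4.
FEATURES = ("nw", "sdchars", "oovs", "lprob")
# Memorability weights copied verbatim from the config template in step 4.
WEIGHTS = {"w_nw": "0.83", "w_sdchars": "0.49", "w_oovs": "6.95", "w_lprob": "-1.00"}


def parse_ftrep(output):
    """Turn 'mean@sd mean@sd ...' into {'m_nw': ..., 's_nw': ..., ...}."""
    pairs = output.split()
    if len(pairs) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} Mean@SD tuples, got {len(pairs)}")
    values = {}
    for name, pair in zip(FEATURES, pairs):
        mean, sd = pair.split("@")
        # Validate that both parts are numbers, but keep the original digits.
        float(mean)
        float(sd)
        values[f"m_{name}"] = mean
        values[f"s_{name}"] = sd
    return values


def build_config(sentences_file, features_file, ftrep_output, num_phrases=10):
    """Return the text of an INI config with the four sections from step 4."""
    config = configparser.ConfigParser()
    config["input"] = {"sentences_file": sentences_file, "features_file": features_file}
    config["output"] = {"num_phrases": str(num_phrases)}
    config["representativeness"] = parse_ftrep(ftrep_output)
    config["memorability"] = dict(WEIGHTS)
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

Calling build_config("/tmp/europarl-v7.es-en.es", "/tmp/feats-es.txt", "26.261@16.3112 3.25248@0.644104 0.0139927@0.0598896 -7.8097@1.04948") returns text that can be saved as config-es.ini. Values are kept as strings so that no digits are lost between the ftrep.sh output and the config file.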

(cc) 2024 Luis A. Leiva, Germán Sanchis-Trilles