Representative and Memorable Phrase Sets for Text Entry Experiments

In text entry experiments, memorability is a desired property of the phrases used as stimuli. However, to date there is no automated method to achieve this effect. Further, today's reference datasets are available only in English, which may lead to misleading results when testing text entry methods with non-native English speakers. In response to this need, we present memrep, a novel sampling method based on two core ideas: a multiple regression model over language-independent features, and a statistical analysis of the corpus from which phrases will be drawn. Our results show that researchers can now use memrep to curate their own stimuli, targeting potentially any language or domain.
We have recently compiled a dataset collection across 10 major languages using our sampling method. We also compared our method with two automated alternatives based on random sampling, and found that memrep performs best according to both statistical analysis and qualitative observations.
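The exact features and regression weights used by memrep are described in the papers cited below, not here. As a rough, illustrative sketch of the underlying idea (representative sampling against corpus-level statistics), one could score candidate phrases by how close their language-independent feature vectors lie to the corpus mean; all function names and features below are assumptions for illustration only:

```python
import math

def features(phrase):
    """Toy language-independent features: character length,
    word count, and mean word length."""
    words = phrase.split()
    return (len(phrase), len(words), sum(map(len, words)) / len(words))

def representative_sample(corpus, n):
    """Return the n phrases whose feature vectors lie closest
    (Euclidean distance) to the corpus-wide feature mean."""
    feats = [features(p) for p in corpus]
    dims = len(feats[0])
    mean = [sum(f[d] for f in feats) / len(feats) for d in range(dims)]

    def dist(f):
        return math.sqrt(sum((f[d] - mean[d]) ** 2 for d in range(dims)))

    return sorted(corpus, key=lambda p: dist(features(p)))[:n]

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "hello world",
    "a",
    "text entry experiments need representative phrases",
    "memorability matters in user studies",
]
print(representative_sample(corpus, 2))
```

Note that memrep additionally models memorability via regression, which this sketch omits; the point is only that phrases are selected against corpus statistics rather than drawn at random.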


Please follow the how-to guide for a concise but thorough use-case example.


This software is free for scientific use (dual-licensed under the MIT and GPL2 licenses). Please contact us if you plan to use the software for commercial purposes. The authors are not responsible for any consequences derived from the use of this software.

Experimental Datasets

The following are the phrase sets we used in our CHI'14 paper. Each set has 500 phrases.



OpenSubtitles Datasets

The following are the phrase sets we used in our MobileHCI'14 paper. Each set has 2000 phrases.

Full Sentences

No punctuation

No punctuation & lowercased-tokenized
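The exact preprocessing pipeline used to derive these variants is documented in the papers, not here. As a minimal sketch of what the two derived forms could look like (function names and the precise punctuation set are illustrative assumptions):

```python
import string

def no_punct(phrase):
    """'No punctuation' variant: strip punctuation characters
    and collapse any resulting extra whitespace."""
    stripped = phrase.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def no_punct_lctok(phrase):
    """'No punctuation & lowercased-tokenized' variant: strip
    punctuation, lowercase, and split on whitespace."""
    return no_punct(phrase).lower().split()

sentence = "Hello, World! This is a test."
print(no_punct(sentence))        # Hello World This is a test
print(no_punct_lctok(sentence))  # ['hello', 'world', 'this', 'is', 'a', 'test']
```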

For academic purposes, we also have alternate versions of the OpenSubtitles datasets generated with two random sampling methods (nopunct + lctok form only, sorry). If you are interested in these alternate versions, just send us an email.

Update: Contribute to improve the OpenSubtitles Datasets!

The OpenSubtitles datasets were automatically processed, so human supervision is still needed. We are looking for native speakers of any of the 10 languages shown above. If you want to collaborate, please go to this page. If you do contribute, you will be thanked on this page and possibly in the follow-up academic papers that make use of these datasets.


If you use the code (or a derivation thereof) or either of the two experimental datasets, please cite the following paper:

BibTeX entry:

@inproceedings{Leiva14CHI,
  author    = {Luis A. Leiva and Germán Sanchis-Trilles},
  title     = {Representatively Memorable: Sampling the Right Phrase Set to Get the Text Entry Experiment Right},
  booktitle = {Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI)},
  year      = {2014},
}

Also, if you use any of the OpenSubtitles datasets, please cite the following paper:

BibTeX entry:

@inproceedings{Sanchis14MobileHCI,
  author    = {Germán Sanchis-Trilles and Luis A. Leiva},
  title     = {A Systematic Comparison of 3 Phrase Sampling Methods for Text Entry Experiments in 10 Languages},
  booktitle = {Proceedings of the International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)},
  year      = {2014},
}

(cc) 2018 Luis A. Leiva, Germán Sanchis-Trilles