Sentence frequency norms for psycholinguistic studies


  • Dufau Stéphane
  • Armando Marjorie
  • Grainger Jonathan

Capitalizing on Google’s Ngram corpus, we examined whether frequency norms for sentences can be established in a format suitable for psycholinguistics. The Ngram corpus is based on over 8 million digitized books and reflects how often sequences of N words are used in a particular language (8 languages available to date). Although the raw data are publicly available, they are presented in a form that is difficult to use as is: frequencies are split by year, and the corpus is split into multiple files. In addition, sequences tagged with part of speech are mixed with untagged ones. Here, we propose a simplified and curated version of the Ngram frequency norms that will help in stimulus selection for psycholinguistic studies. These curated norms are used in a pilot study to assess whether sentence frequency plays a role in sentence recognition.
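
The curation steps mentioned above (summing per-year counts and setting aside part-of-speech-tagged sequences) could be sketched as follows. This is a minimal illustration, not the procedure used by the authors. It assumes the tab-separated layout of the 2012 release of the Ngram files (ngram, year, match count, volume count), and the tag list, the function names `is_tagged` and `aggregate`, and the `first_year` parameter are all hypothetical.

```python
import re
from collections import Counter

# Assumed part-of-speech tag set; adjust to the corpus release actually used.
_TAGS = "NOUN|VERB|ADJ|ADV|PRON|DET|ADP|NUM|CONJ|PRT|X"
# A token counts as tagged if it ends in "_TAG" (e.g. "cat_NOUN")
# or is a standalone placeholder such as "_DET_" or "_START_".
_TAGGED = re.compile(rf"^(?:.+_(?:{_TAGS}|\.)|_(?:{_TAGS}|\.|START|END|ROOT)_)$")


def is_tagged(ngram):
    """Return True if any token of the n-gram carries a part-of-speech tag."""
    return any(_TAGGED.match(token) for token in ngram.split(" "))


def aggregate(lines, first_year=None):
    """Sum match counts over years for untagged n-grams.

    Each line is assumed to be: ngram<TAB>year<TAB>match_count<TAB>volume_count.
    Lines that do not have exactly four fields are skipped.
    """
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue
        ngram, year, match_count, _volume_count = parts
        if is_tagged(ngram):
            continue
        if first_year is not None and int(year) < first_year:
            continue
        totals[ngram] += int(match_count)
    return totals


# Hypothetical sample rows, for illustration only.
sample = [
    "the cat sat\t1990\t5\t3",
    "the cat sat\t1991\t7\t4",
    "the cat_NOUN sat\t1990\t5\t3",
    "_DET_ cat sat\t1990\t9\t2",
]
totals = aggregate(sample)
```

Because the corpus is split into multiple files, the same function can be called on each file in turn and the resulting counters added together. A simple pattern test like this one will also discard the rare untagged token that happens to end in something like "_X", which may or may not be acceptable for a given study.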

