What are the most frequently used words in the English language? You’re about to find out.
In 1965, researcher Mark Mayzner published Tables of Single-letter and Digram Frequency Counts for Various Word-length and Letter-position Combinations. His work, which studied the frequency of letter combinations in English words using a corpus of 20,000 words, has been cited in hundreds of articles.
In December 2012, when he was 85 years old, Mayzner contacted Peter Norvig, a research director at Google. He wanted to see if “perhaps your group at Google might be interested in using the computing power that is now available to significantly expand and produce such tables as I constructed some 50 years ago, but now using the Google Corpus Data, not the tiny 20,000 word sample that I used.”
Norvig did exactly that, and today, YouTube user Abacaba created a brilliant visualization of the results. Before you start watching, try to guess the three most frequently used words in the English language.
Got your three words? Good. Here we go:
I guessed “the” correctly, but my second- and third-place guesses were “a” and “I,” neither of which is even in the top five. It turns out that “a” is sixth and “I” is 19th!
Here are the top 50 most frequently used words in English:
If the video piqued your interest but didn’t quench your thirst for knowledge, you might want to know how Norvig put it all together. In his own words:
- I consulted the Google Books Ngrams raw data set, which gives word counts of the number of times each word is mentioned (broken down by year of publication) in the books that have been scanned by Google.
- I downloaded the English Version 20120701 “1-grams” (that is, word counts) from that data set given as the files “a” to “z” (that is, http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-a.gz to http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-z.gz). I unzipped each file; the result is 23 GB of text (so don’t try to download them on your phone).
- I then condensed these entries, combining the counts for all years, and for different capitalizations: “word”, “Word” and “WORD” were all recorded under “WORD.” I discarded any entry that used a character other than the 26 letters A-Z. I also discarded any word with fewer than 100,000 mentions. (If you want, you can download the word count file; note that it is 1.5 MB.)
- I generated tables of counts, first for words, then for letters and letter sequences, and keyed off of the positions and word lengths.
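If you'd like to try the condensing step yourself, here is a minimal Python sketch of the idea. It is not Norvig's actual code. It assumes each line of the unzipped 1-gram files is tab-separated as word, year, match count, volume count; the function names `condense` and `letter_counts` and the tiny `sample` data are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a word "counts" only if it is made of the 26 letters A-Z
# (in any capitalization), per the filtering rule described above.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")


def condense(lines, min_count=100_000):
    """Sum match counts over all years and capitalizations.

    Assumes each line looks like: word<TAB>year<TAB>match_count<TAB>volume_count
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, match_count = fields[0], fields[2]
        if not LETTERS_ONLY.fullmatch(word):
            continue  # discard entries with any character outside A-Z
        totals[word.upper()] += int(match_count)  # "word", "Word", "WORD" -> "WORD"
    # Discard words with fewer than min_count mentions in total.
    return Counter({w: c for w, c in totals.items() if c >= min_count})


def letter_counts(word_counts):
    """Weight each letter by the number of mentions of the word it appears in."""
    letters = Counter()
    for word, count in word_counts.items():
        for ch in word:
            letters[ch] += count
    return letters


# Hypothetical sample lines, not real data from the Ngrams files.
sample = [
    "the\t1999\t500\t40\n",
    "The\t2000\t300\t35\n",
    "THE\t2000\t200\t10\n",
    "can't\t2000\t900\t50\n",  # dropped: contains an apostrophe
    "rare\t2000\t5\t1\n",      # dropped: below the threshold used here
]
word_counts = condense(sample, min_count=100)
print(word_counts.most_common(3))
print(letter_counts(word_counts).most_common(3))
```

On the real files you would pass an open file object instead of `sample` (Python's `gzip.open(path, "rt", encoding="utf-8")` can read the `.gz` files line by line without unzipping them first) and keep the default threshold of 100,000.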
Keep in mind that these results are based on Google Books data covering 97,565 distinct words, which were mentioned a total of 743,842,922,321 times. That is about 37 million times as many mentions as in Mayzner’s 20,000-mention collection.
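If you want to check that "37 million" figure, the arithmetic is a one-liner. The variable names below are just for illustration; the numbers come straight from the paragraph above.

```python
total_mentions = 743_842_922_321   # mentions in the Google Books data
mayzner_mentions = 20_000          # size of Mayzner's original sample

ratio = total_mentions / mayzner_mentions
print(f"{ratio:,.0f}")  # prints 37,192,146
```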
For even more details, you can read Norvig’s full analysis here: English Letter Frequency Counts: Mayzner Revisited.