Duolingo's AI drives its English proficiency tests

Language learning startup Duolingo uses AI and machine learning to create and score English proficiency tests automatically, according to a paper published in the journal Transactions of the Association for Computational Linguistics. In it, researchers peel back the curtain on the family of algorithms underlying the Duolingo English Test, a $49, one-hour, at-home assessment that's now accepted by over 2,000 university programs including Columbia, McGill, New York University, University College London, and Williams.

AI-generated tests like Duolingo's could be a godsend for employers looking to hire English-as-a-second-language (ESL) candidates during the pandemic. Proficiency assessments like the Test of English as a Foreign Language (TOEFL) require that examinees travel to a proctored location, a tough ask in countries where executive orders have mandated the closure of non-essential businesses. Perhaps unsurprisingly, a Duolingo spokesperson says that test volume is up 300% globally and 375% in China, and that 500 new programs have begun accepting the Duolingo English Test since the pandemic began.

As the coauthors of the paper explain, the Duolingo English Test draws on item response theory from psychometrics to design and score measures of test-taker ability. It's the basis for most modern high-stakes standardized tests, and it assumes that a response to a test item (i.e., a question) is modeled by a function with separate parameters for the examinee's ability and the question's difficulty. Fortuitously for Duolingo, this paradigm is well-suited to tasks where the goal is to estimate variables like ability and difficulty; questions can be created and tested with subjects to produce (examinee, question) pairs graded "correct" or "incorrect," from which parameters can be derived that anticipate future examinees' abilities.
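To make the idea concrete, here is a minimal sketch of one common item response function, the one-parameter logistic (Rasch) model. The article does not say which exact function the paper uses, so the choice of model and the names `p_correct`, `ability`, and `difficulty` are assumptions made for illustration.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct response under a Rasch-style model.

    Assumption: ability and difficulty share one scale, and only their
    difference matters. This is an illustrative choice, not necessarily
    the function used in the Duolingo paper.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# An examinee whose ability matches the item's difficulty has even odds.
even_odds = p_correct(ability=0.0, difficulty=0.0)
# A stronger examinee is more likely to answer the same item correctly.
stronger = p_correct(ability=2.0, difficulty=0.0)
```

Under this form, raising ability or lowering difficulty both push the probability toward 1, which is what lets graded (examinee, question) pairs be used to fit both kinds of parameters.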

Computer-adaptive testing (CAT) techniques enabled Duolingo to design a more efficient language test by assigning harder questions to test-takers of higher ability and vice versa. An iterative adaptive algorithm observes examinees' responses to questions during testing and makes an estimate of their abilities. Based on a utility function of the current estimate, it then selects the next question, at which point the process repeats until the test is completed.
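The loop described above can be sketched roughly as follows. This is not Duolingo's algorithm: it assumes a Rasch-style response model, a grid search with a standard normal prior for the ability estimate, and a "pick the unused item closest to the current estimate" rule standing in for the utility function. All function names are hypothetical.

```python
import math


def p_correct(ability, difficulty):
    # Assumed Rasch-style response model (illustrative).
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses):
    """Grid-search estimate of ability from (difficulty, correct) pairs.

    A standard normal prior (an assumption) keeps the estimate finite
    even when every answer so far is correct or every answer is wrong.
    """
    grid = [x / 10.0 for x in range(-40, 41)]

    def log_posterior(theta):
        total = -0.5 * theta ** 2
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_posterior)


def next_item(ability, item_bank, used):
    # Stand-in utility: prefer the unused item nearest the current estimate.
    candidates = [i for i in range(len(item_bank)) if i not in used]
    return min(candidates, key=lambda i: abs(item_bank[i] - ability))


def run_cat(item_bank, answer_fn, max_items=5):
    """Administer up to max_items items, re-estimating after each answer."""
    responses, used, ability = [], set(), 0.0
    for _ in range(min(max_items, len(item_bank))):
        i = next_item(ability, item_bank, used)
        used.add(i)
        responses.append((item_bank[i], answer_fn(item_bank[i])))
        ability = estimate_ability(responses)
    return ability, responses
```

Calling `run_cat([-2.0, -1.0, 0.0, 1.0, 2.0], lambda d: True)` simulates an examinee who answers everything correctly; the estimate climbs, and harder items are served as a result.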

For the Duolingo English Test, Duolingo designed a 100-point scoring system corresponding to the Common European Framework of Reference (CEFR), an international standard for describing the reading, writing, listening, and speaking proficiency of foreign-language learners. Then, the company's researchers incorporated a range of different test formats, including:

Yes/no vocabulary tests that vary in modality (text versus audio) to assess vocabulary breadth, where examinees are given both text and audio answers and must distinguish English words from English-like pseudowords (words that are morphologically and phonologically plausible, but have no meaning in English).
The c-test format, which measures reading ability by providing examinees passages of text in which some words have been "damaged" (by deleting the second half of every other word) and tasking them with filling in missing letters.
Dictation tests that tap both listening and writing skills by having examinees transcribe an audio recording.
Elicited speech tasks that require examinees to say a sentence out loud.
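As an illustration of the c-test "damage" step described in the list above, the sketch below blanks the second half of every other word. It is a simplification under stated assumptions: words are split on whitespace, punctuation is not handled, odd-length words keep the extra letter, and the names `damage_word` and `make_c_test` are hypothetical rather than taken from the paper.

```python
def damage_word(word: str) -> str:
    """Replace the second half of a word with underscores.

    Assumption: for odd-length words the first half is rounded up,
    so "quick" becomes "qui__".
    """
    keep = (len(word) + 1) // 2
    return word[:keep] + "_" * (len(word) - keep)


def make_c_test(passage: str) -> str:
    """Damage every other word (the 2nd, 4th, ...) of a passage."""
    words = passage.split()
    return " ".join(
        damage_word(word) if index % 2 == 1 else word
        for index, word in enumerate(words)
    )


print(make_c_test("the quick brown fox"))  # prints: the qui__ brown fo_
```

A production generator would also need rules for punctuation, very short words, and which sentences to leave intact, none of which are specified in this article.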

To build algorithms that could rank vocabulary questions by difficulty, so that the sequence of questions in the overall proficiency test could be tailored to ability, Duolingo had a panel of linguistics Ph.D.s with English teaching experience compile an inventory of words labeled by CEFR level (which ranges from "Beginner/Breakthrough" to "Proficient/Mastery"). The researchers trained AI models on this corpus, and they report that the models eventually learned that advanced words -- even pseudowords -- are rarer and mostly have Greco-Latin etymologies, whereas basic words are common and have mostly Anglo-Saxon origins.

For the c-tests, Duolingo drew on a range of corpora gleaned from online sources -- including English language self-study websites, test preparation resources for English proficiency exams, English Wikipedia articles that had been rewritten for Simple English, and the crowdsourced English sentence database Tatoeba -- together with regression and ranking techniques to build AI models for longer-form text. The models in question, which were trained on labeled texts and then on unlabeled texts with similar linguistic features, learned to predict not only the difficulty of a given c-test but also the difficulty of dictation and elicited speech tests.

In fact, Duolingo reports that the trained model correctly ranked more difficult passages above simpler ones 85% of the time, and that its predictions mirrored those of a panel of four experts. The researchers used these predictions to automatically generate c-test items from paragraphs in the corpora and over 400 passages written by the experts.
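A figure like "ranked more difficult passages above simpler ones 85% of the time" is typically a pairwise ranking accuracy. The sketch below shows one way such a number could be computed; whether the paper uses exactly this definition is an assumption, and the function name and toy data are hypothetical.

```python
from itertools import combinations


def pairwise_ranking_accuracy(true_levels, predicted_scores):
    """Fraction of passage pairs whose predicted order matches the labels.

    Pairs with the same true level are skipped because they carry no
    ordering information. Predicted ties count as misses.
    """
    correct = 0
    total = 0
    labeled = zip(true_levels, predicted_scores)
    for (t1, p1), (t2, p2) in combinations(labeled, 2):
        if t1 == t2:
            continue
        total += 1
        if (t1 - t2) * (p1 - p2) > 0:
            correct += 1
    return correct / total if total else float("nan")


# Toy data: three passages with true levels 1 < 2 < 3; the model swaps
# the two hardest, so two of the three pairs are ordered correctly.
accuracy = pairwise_ranking_accuracy([1, 2, 3], [0.1, 0.5, 0.3])
```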

Ultimately, automating the serving of all questions to Duolingo English Test examinees required creating a CAT administration algorithm, which was trained on over 25,000 test items to intelligently cycle through formats (e.g., yes/no vocabulary in text or audio, c-test, dictation, and elicited speech). After choosing the first four questions at random, the algorithm estimates the test score and selects the difficulty of the next question to sample accordingly, a process that repeats until the test exceeds 25 items (or 40 minutes in length).
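The control flow of that administration loop (four random items, then difficulty-targeted selection, stopping on an item or time limit) might look roughly like the sketch below. Only the numbers 4, 25, and 40 come from the article; the item dictionaries, the fixed `minutes_per_item`, the placeholder `naive_estimate` scorer, and the reading of the stopping rule as "at most 25 items or 40 minutes" are all assumptions, and format cycling is not modeled.

```python
import random

MAX_ITEMS = 25        # item limit mentioned in the article
MAX_MINUTES = 40.0    # time limit mentioned in the article
RANDOM_WARMUP = 4     # first four questions are chosen at random


def naive_estimate(responses):
    """Placeholder scorer on a 0-100 scale (hypothetical, not Duolingo's)."""
    score = 50.0
    for _item, correct in responses:
        score += 5.0 if correct else -5.0
    return max(0.0, min(100.0, score))


def administer(item_bank, answer_fn, estimate_fn,
               minutes_per_item=1.5, rng=None):
    """Serve items until the bank, the item limit, or the time limit runs out."""
    rng = rng or random.Random()
    remaining = list(item_bank)
    responses, elapsed, score = [], 0.0, None
    while remaining and len(responses) < MAX_ITEMS and elapsed < MAX_MINUTES:
        if len(responses) < RANDOM_WARMUP:
            item = rng.choice(remaining)
        else:
            # Target the item whose difficulty is nearest the current score.
            item = min(remaining, key=lambda it: abs(it["difficulty"] - score))
        remaining.remove(item)
        responses.append((item, answer_fn(item)))
        elapsed += minutes_per_item  # assumed constant time per item
        score = estimate_fn(responses)
    return score, responses
```

In a real system the elapsed time would come from a clock rather than a constant, and the scorer would be the psychometric model rather than a fixed step up or down.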

In real test scenarios, human proctors review each test session for roughly 75 behaviors over multiple rounds, with the help of AI trained on millions of data points collected daily to detect rule-breaking. Beyond this, during test sessions, computer vision algorithms verify examinees' identities (via their webcams) and tests are automatically canceled if they attempt to access external apps or plugins.

Analyses of over 500,000 examinee-question pairs from over 21,000 tests administered in 2018 revealed that the Duolingo English Test produced rankings nearly identical to what traditional human pilot testing would provide, according to the paper's coauthors. The test moreover correlated "significantly" (0.73) with English assessments like TOEFL and International English Language Testing System (IELTS) and satisfied industry standards for reliability (the degree to which a test is consistent and stable) and test security. (Duolingo found that test-takers could take the test about 1,000 times before seeing the same test item again, on average.)

In future work, Duolingo researchers plan to investigate the extent to which people of equal ability but from different subgroups (such as gender or age) have unequal probability of success on test questions. In addition, they hope to study whether other indices, such as narrativity and word concreteness, could be incorporated into the Duolingo English Test's models to predict text difficulty and comprehension.

To this end, a recently released version of the test includes more nuanced speaking and writing exercises and has higher test score reliability.

"English is the most popular language to learn on Duolingo, and many learners also asked if we could certify their English skills formally, in order to help them gain access to higher education and better job opportunities," wrote Duolingo machine learning scientist Burr Settles and assessment scientist Geoffrey LaFlair in a blog post published today. "Duolingo is a mission-driven company, and we created the Duolingo English Test to break down barriers to higher education. As a result, we've learned that an online, personalized approach to testing is not only important for increasing access -- it's an essential innovation that is reshaping the education system as we know it, and we are excited to be leading the way."

Duolingo's investment in AI-enabled English testing coincides with improvements to the AI at the core of its language learning platform, which aims to make lessons more engaging by automatically tailoring them to each individual language learner. Statistical and machine learning models like half-life regression analyze the error patterns of millions of users to predict the "half-life" for each word in a person's long-term memory, and to help content creators behind the scenes tailor beginner, intermediate, and advanced level material, Settles told VentureBeat in an interview last July.
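For readers unfamiliar with half-life regression, the core idea is that the probability of recalling a word halves every `h` days, with `h` predicted from features of the learner's practice history. The sketch below follows that general formulation; the specific features and weights are hypothetical, and nothing here is claimed to match Duolingo's production model.

```python
def recall_probability(days_since_practice: float, half_life_days: float) -> float:
    """Exponential forgetting curve: recall halves every half_life_days."""
    return 2.0 ** (-days_since_practice / half_life_days)


def predicted_half_life(weights, features) -> float:
    """Half-life as 2 raised to a weighted sum of features.

    Assumption: features might be things like counts of past correct and
    incorrect answers for a word; the weights would be learned from data.
    """
    return 2.0 ** sum(w * x for w, x in zip(weights, features))


# Hypothetical learner: one feature with weight 1.0 and value 2.0 gives a
# 4-day half-life, so recall after 4 days is predicted to be 50%.
half_life = predicted_half_life([1.0], [2.0])
recall = recall_probability(4.0, half_life)
```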

"There are millions of words in the English language, and maybe 10,000 high-frequency words -- what order do you teach them? How do you string them together?" he said. "The core part of our AI strategy is to get as close as possible to having a human-to-human experience."
