Stanford University AI researchers have created a “socially equitable” natural language processing (NLP) tool they say improves upon off-the-shelf AI solutions used today that fail to account for things like regional dialects, slang, or the natural way people talk when they regularly speak more than one language.
In a paper published late last week, the researchers found the tool, called Equilid, to be more accurate than commonly used language identification tools like langid.py and Google’s CLD2. Popular language identification tools, the paper argues, draw on “European-centric corpora” of the written word, as well as websites, Wikipedia, and newswires, sources that may not best represent the way people actually talk.
Language identification is a form of NLP used for things like serving up Google search results or even tracking social media chatter to make predictions. Equilid was made to better understand slang, regional dialects, and the natural way people communicate online when they speak more than one language, like, say, the 90 million English speakers in the Philippines who may regularly switch between English and Tagalog.
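To make the idea of word-by-word language identification concrete, here is a minimal sketch in Python. It is not the method used in the paper or in any of the tools named above. It assumes a simple character-trigram model with add-one smoothing, and the training sentences, function names, and example input are all invented for illustration.

```python
import math
from collections import Counter

# Toy training text, invented for this sketch.
# Real systems train on millions of examples per language.
SAMPLES = {
    "en": "the weather is really nice today and i want to go out with my friends",
    "tl": "ang ganda ng panahon ngayon at gusto kong lumabas kasama ang mga kaibigan ko",
}


def char_trigrams(word):
    """Return the character trigrams of a word, padded with a space on each side."""
    padded = f" {word.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Count character trigrams separately for each language."""
    models = {}
    for lang, text in samples.items():
        counts = Counter()
        for word in text.split():
            counts.update(char_trigrams(word))
        models[lang] = counts
    return models


def log_score(word, counts, vocab_size):
    """Add-one smoothed log-likelihood of a word's trigrams under one language."""
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + 1) / (total + vocab_size))
        for g in char_trigrams(word)
    )


def identify_per_word(text, models):
    """Label each whitespace-separated word with its most likely language."""
    vocab_size = len(set().union(*models.values()))
    return [
        (word, max(models, key=lambda lang: log_score(word, models[lang], vocab_size)))
        for word in text.split()
    ]


models = train(SAMPLES)
print(identify_per_word("gusto kong go out with my kaibigan", models))
```

Every word in the example sentence also appears in the toy training text, so each one is tagged with the language it came from, and the output switches between "tl" and "en" mid-sentence. Words the model has never seen would be labeled far less reliably, which is the kind of gap that broader training data is meant to close.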
The report finds that more effective identification of language in underrepresented dialects could “help reveal dangerous trends in infectious diseases in the areas that need it most.”
“Flu tracking, election predictions, anytime you’re trying to do social sensing or you want to use social media to predict outcomes, that’s kind of where we see our application having the biggest impact,” lead author David Jurgens told VentureBeat in a phone interview. “If you’re undersampling your population by quite a bit and it’s a community that only speaks a local dialect, you may actually omit — I mean, think about all the people that speak Louisiana Cajun English that’s a little tougher to understand, and they have their own spellings. What if we omitted Louisiana [from flu tracking]? That seems like a pretty bad outcome.”
AI trained with slang
To make Equilid work, language and text were drawn from a variety of sources, like European legislation and Wikipedia, but also Urban Dictionary, conversations about articles in the Talk section of the Wikipedia website, and African American Vernacular English, also known as Ebonics. Interpretations of the Bible and Quran were also used; the Watchtower magazine from Jehovah’s Witnesses, which is translated into hundreds of languages, was also a rich resource.
By far, Jurgens said, the majority of language used to train Equilid and strengthen its ability to recognize specific geographic regions came from Twitter. Equilid draws upon language from nearly 98 million tweets from 1.5 million users in 53 languages.
“It’s [Twitter data] probably about a quarter of the dataset, but in terms of social representativeness like getting these local populations, it’s like 100 percent. I don’t know how else we would do it unless we had social media to kind of guide us,” he said.
Equality through accuracy
The goal with Equilid, Jurgens said, was not just to make a more socially equitable product, but to improve the accuracy and overall quality of NLP.
Equilid was inspired by the work of Dirk Hovy, who found, for example, that NLP tools built on language drawn from the Wall Street Journal and a German newspaper skew toward older men and away from young people or women.
An article by Caliskan, Bryson, and Narayanan published this spring in the journal Science, titled “Semantics derived automatically from language corpora contain human-like biases,” also found bias based on race or gender.
The average person working with NLP today may consider language identification a solved problem. Recent work by others and the Equilid results make it clear that’s just not the case, Jurgens said.
“If it’s not working here, it’s probably working even less in the other [NLP] tools that we have,” he said. “Hopefully this is the first in a long line of research looking at natural language processing or computational linguistics from an equality point of view, where our tools make mistakes and where we’re actually missing people.”