Researchers in Seattle have introduced what they call a new AI grand challenge, TuringAdvice, which centers on creating language models that generate helpful advice for humans using real-world language.
The TuringAdvice challenge is based on the dynamic RedditAdvice data set. Created for the challenge, RedditAdvice is a crowdsourced data set of advice shared in the past two weeks that got the most upvotes in Reddit subcommunities. To pass the challenge, a machine must deliver advice as helpful as or better than popular human advice.
As part of the TuringAdvice launch, the researchers also released a static RedditAdvice 2019 data set for training advice-giving AI models, which includes 616,000 pieces of advice from 188,000 situations shared by people in Reddit subcommunities.
Initial analysis indicates that even advanced models such as Google’s T5, a model with 11 billion parameters introduced last fall, write advice that human evaluators found at least as helpful as human advice in only 9% of cases. The researchers also evaluated versions of the Grover Transformer model and TF-IDF. The study does not evaluate popular bidirectional NLP models like Google’s BERT, since they’re generally considered worse at generating text than left-to-right models. Demonstrations of human versus machine advice on relationships, legal matters, and life in general are available online.
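To make the 9% figure concrete, here is a minimal sketch of how a score of this kind could be computed from human ratings. It is an illustration, not the researchers' code: the function names, the toy data, and the use of a majority vote across several raters per situation are all assumptions.

```python
def machine_wins(votes):
    """Return True if a strict majority of raters judged the machine
    advice at least as helpful as the human advice.

    `votes` is a list of booleans, one per rater (hypothetical format).
    Majority voting is an assumption made for this sketch.
    """
    return sum(votes) * 2 > len(votes)


def machine_win_rate(judgments):
    """Fraction of situations in which the machine advice won the vote."""
    if not judgments:
        raise ValueError("need at least one judged situation")
    wins = sum(machine_wins(votes) for votes in judgments)
    return wins / len(judgments)


# Hypothetical toy data: four situations, three raters each.
# True means the rater found the machine advice at least as helpful.
toy_judgments = [
    [False, False, True],
    [True, True, False],
    [False, False, False],
    [False, True, False],
]

print(f"{machine_win_rate(toy_judgments):.0%}")  # prints 25%
```

In this toy example the machine advice wins one situation out of four. A model that cleared the challenge would need a rate reflecting advice that is as helpful as, or better than, the popular human advice.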
“Today’s largest models struggle on REDDITADVICE, so we are excited to see what new models get developed,” a recently released paper about TuringAdvice reads. “We argue that there is a deep underlying issue: a gap between how humans use language in the real world, and what our evaluation methodology can measure. Today’s dominant paradigm is to study static datasets, and to grade machines by the similarity of their output with predefined correct answers.”
“However, when we use language in the real world to communicate with each other — such as when we give advice, or teach a concept to someone — there is rarely a universal correct answer to compare with, just a loose goal we want to achieve. We introduce a framework to narrow this gap between benchmarks and real-world language use.”
Progress on the TuringAdvice challenge could lead to AI that is better at delivering advice to humans or acting as a virtual therapist, the authors said.
To ensure results remain in line with real-world language use, the team chose a dynamic evaluation method in which they gathered 200 situations from Reddit subcommunities in a recent two-week period. They chose advice as a testing scenario because it’s something all people are inherently familiar with and it overlaps with core NLP tasks like reading comprehension.
The TuringAdvice challenge is the work of the University of Washington and the Allen Institute for AI and was detailed last week in a research paper on the preprint repository arXiv titled “Evaluating Machines by their Real-World Language Use.” University of Washington associate professor Ali Farhadi, whose AI startup Xnor was recently acquired by Apple, is a coauthor and leads the PRIOR team at the Allen Institute.
All evaluations of model performance come from humans hired through Amazon’s Mechanical Turk. Hiring Mechanical Turk workers was once a frowned-upon way to obtain data for training AI models, but the paper calls it more ethical than posting automated machine advice in response to humans in need of help. The paper acknowledges, however, that getting paid to complete the task introduces extrinsic motivation. Workers who tended to choose machine advice over human advice were let go.
Lead researcher Rowan Zellers told VentureBeat that the challenge’s second round of leaderboard results is expected in the months ahead, after researchers get the chance to create and fine-tune their models to take on the TuringAdvice challenge.
By choosing popular advice shared in Reddit subcommunities, researchers said they attempted to create intrinsic motivation like the kind experienced by humans responding to calls for help on Reddit.
One concern going forward for the TuringAdvice challenge is price; evaluation of 200 pieces of advice on Mechanical Turk costs about $370. Future participants in the TuringAdvice challenge will be asked to pay Mechanical Turk fees in order for their model to be evaluated or potentially appear on the TuringAdvice leaderboard.
TuringAdvice is the latest challenge created in the past year to encourage more robust natural language models. Last fall, the University of Washington’s NLP lab joined researchers from New York University, Facebook AI Research, and Samsung Research to introduce the SuperGLUE challenge and leaderboard, a more complex series of tasks to evaluate performance.