Allen Institute launches GENIE, a leaderboard for human-in-the-loop language model benchmarking

Recent years have seen an explosion of natural language processing (NLP) datasets aimed at testing various AI capabilities. Many of these datasets have accompanying leaderboards, which provide a means of ranking and comparing models. But the adoption of leaderboards has thus far been limited to setups with automatic evaluation, like classification and knowledge retrieval. Open-ended tasks requiring natural language generation, such as language translation, often have many correct solutions and lack techniques that can reliably evaluate a model's quality automatically.

To remedy this, researchers at the Allen Institute for Artificial Intelligence, the Hebrew University of Jerusalem, and the University of Washington created GENIE, a leaderboard for human-in-the-loop evaluation of text generation. GENIE posts model predictions to a crowdsourcing platform (Amazon Mechanical Turk), where human annotators evaluate them according to predefined, dataset-specific guidelines for fluency, correctness, conciseness, and more. In addition, GENIE incorporates automatic metrics for machine translation, question answering, summarization, and common-sense reasoning, including BLEU and ROUGE, to show how well they correlate with the human assessment scores.
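To make the idea of checking an automatic metric against human judgment concrete, here is a minimal sketch in Python. It is not GENIE's code: the function names and the score values are hypothetical, chosen only to illustrate how a Pearson or Spearman correlation between per-model metric scores and human ratings could be computed.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical numbers, one entry per submitted model (not real GENIE data).
metric_scores = [0.21, 0.35, 0.40, 0.52, 0.60]  # e.g. an automatic metric
human_scores = [2.1, 3.0, 2.8, 4.1, 4.4]        # e.g. mean annotator rating

print("Pearson: ", pearson(metric_scores, human_scores))
print("Spearman:", spearman(metric_scores, human_scores))
```

A high rank correlation would suggest the automatic metric orders models much as human annotators do; a low one would suggest the metric is a poor stand-in for human evaluation on that task.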

As the researchers note, human-evaluation leaderboards raise a couple of novel challenges, first and foremost potentially high crowdsourcing fees. To avoid deterring submissions from researchers with limited resources, GENIE aims to keep submission costs around $100, with initial submissions to be paid by academic groups. In the future, the coauthors plan to explore other payment models including requesting payment from tech companies while subsidizing the cost for smaller organizations.

To mitigate another potential issue -- the reproducibility of human annotations over time and across annotators -- the researchers use techniques including estimating annotator variance and spreading the annotations over several days. They claim their experiments show that GENIE achieves "reliable scores" on the included tasks.
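The article does not describe how annotator variance is estimated. As one illustration of the general idea, the sketch below uses a percentile bootstrap, a common generic technique that is an assumption here and not necessarily what GENIE does. The function name, parameters, and ratings are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_interval(ratings, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the mean of a list of ratings.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not ratings:
        raise ValueError("ratings must be non-empty")
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(ratings)
    # Resample the ratings with replacement and record each resample's mean.
    resampled_means = sorted(
        mean(rng.choices(ratings, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(ratings), lower, upper


# Hypothetical 1-5 ratings from different annotators for one model's outputs.
ratings = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5, 4, 3, 4, 4, 5, 2, 4, 3, 4]

point, lower, upper = bootstrap_interval(ratings)
print("mean rating:", point, "interval:", (lower, upper))
```

A wide interval would indicate that the score depends heavily on which annotators happened to rate the outputs, which is the kind of instability a leaderboard would want to detect before ranking models.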

"[GENIE] standardizes high-quality human evaluation of generative tasks, which is currently done in a case-by-case manner with model developers using hard-to-compare approaches," Daniel Khashabi, a lead developer on the GENIE project, explained in a Medium post. "It frees model developers from the burden of designing, building, and running crowdsourced human model evaluations. [It also] provides researchers interested in either human-computer interaction for human evaluation or in automatic metric creation with a central, updating hub of model submissions and associated human-annotated evaluations."

The coauthors believe that the GENIE infrastructure, if widely adopted, could alleviate the evaluation burden for researchers while ensuring high-quality, standardized comparison against previous models. Moreover, they anticipate that GENIE will facilitate the study of human evaluation approaches, addressing challenges like annotator training, inter-annotator agreement, and reproducibility -- all of which could be integrated into GENIE to compare against other evaluation metrics on past and future submissions.

"We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation," the coauthors wrote in a paper describing their work. "This is a novel deviation from how text generation is currently evaluated, and we hope that GENIE contributes to further development of natural language generation technology."