You've probably seen some of the dazzling images conjured up by artificial intelligence recently, like an astronaut riding a horse or an avocado sitting in a therapist's chair. These fantastical pictures come from AI models that aim to translate any text you give them into a visual representation. But are these systems really as good at understanding our prompts as those impressive cherry-picked examples suggest?
A new study from the minds at Google DeepMind exposes the hidden limitations in how we currently evaluate the performance of these text-to-image AI models. In research released on the preprint server arXiv, they introduce a fresh approach called "Gecko" that promises a more comprehensive and reliable way to benchmark this blossoming technology.
"While text-to-image generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt," warns the DeepMind team in their paper, titled "Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings."
They point out that the datasets and automatic metrics predominantly used today to assess the capabilities of models like DALL-E, Midjourney, and Stable Diffusion don't tell the full story. Small-scale human evaluations give limited insight, while automatic metrics can miss important nuances and even disagree with human judges.
Introducing Gecko: A new benchmark for text-to-image models
To shine a light on these issues, the researchers developed Gecko—a new benchmark suite that cranks up the difficulty for text-to-image models. Gecko bombards them with 2,000 text prompts that probe a wide range of skills and complexity levels. It carves up these prompts into specific sub-skills, going beyond vague categories to pinpoint the exact weaknesses holding a model back.
"This skills-based benchmark categorizes prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging," explains co-lead author Olivia Wiles.

The Gecko framework introduced by Google DeepMind researchers addresses shortcomings in the evaluation of text-to-image AI models by providing (a) a comprehensive skills-based benchmark dataset, (b) extensive human annotations across different templates, (c) an improved automatic evaluation metric, and (d) insights into model performance across various criteria. The study aims to enable more accurate and robust benchmarking of these increasingly popular AI systems. (Credit: arxiv.org)
A more accurate picture of AI capabilities
The researchers also gathered over 100,000 human ratings on images generated by several leading models in response to the Gecko prompts. By collecting this unprecedented volume of feedback data across different models and evaluation frameworks, the benchmark can tease apart whether gaps in performance stem from models' true limitations, ambiguous prompts, or inconsistent evaluation methods.
"We gather human ratings across four templates and four text-to-image models for a total of over 100,000 annotations," highlights the study. "This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality."
Finally, Gecko features an enhanced automatic evaluation metric based on question-answering that aligns more closely with human judgments compared to existing metrics. When used to compare state-of-the-art models on the new benchmark, this combination revealed previously undetected differences in their strengths and weaknesses.
"We introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160," notes the paper. Overall, DeepMind's own Muse model came out on top when subjected to Gecko's gauntlet.
The researchers hope their work demonstrates the importance of using diverse benchmarks and evaluation approaches to truly understand what text-to-image AI can and can't do before deploying it in the real world. They plan to make the Gecko code and data freely available to spur further progress.
"Our work shows that the choice of dataset and metric has a big impact on perceived performance," says Wiles. "We hope Gecko enables more accurate benchmarking and diagnosis of model capabilities going forward."
So while that horse-riding astronaut might seem impressive at first glance, we still need rigorous testing to separate the real deal from fool's gold. Gecko offers a glimpse at how to get there.
