They say a picture is worth a thousand words. But an image can’t “speak” to people who are blind or have low vision (BLV) without a little help. In a world driven by visual imagery, especially online, this creates a barrier to access.
The good news: When screen readers — software that reads the content of web pages to BLV people — come across an image, they will read any “alt-text” descriptions that the website creator added to the underlying HTML code, rendering the image accessible.
The bad news: Few images are accompanied by adequate alt-text descriptions.
In fact, according to one study, alt-text descriptions are included with fewer than 6% of English-language Wikipedia images. And even in instances where websites do provide descriptions, they may be of no help to the BLV community. Imagine, for example, alt-text descriptions that list only the name of the photographer, the image’s file name, or a few keywords to aid with search. Or picture a home button that has the shape of a house but no alt-text saying “home.”
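The failure modes described above (no alt-text at all, or alt-text that is just a file name) can be checked mechanically. The following is a minimal sketch using only the Python standard library. The class name, the function name and the file-name heuristic are illustrative assumptions, not part of any study or tool mentioned in this article. An empty alt attribute is a legitimate way to mark a purely decorative image, so the sketch flags it for human review rather than treating it as an error.

```python
from html.parser import HTMLParser
import re

# Assumed heuristic: alt text that consists only of a file name is unhelpful.
FILENAME_PATTERN = re.compile(r"^[\w\-. ]+\.(jpe?g|png|gif|webp|svg)$", re.IGNORECASE)


class AltTextAuditor(HTMLParser):
    """Collects <img> tags whose alt attribute is absent, empty, or file-name-like."""

    def __init__(self):
        super().__init__()
        self.findings = []  # list of (src, problem) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        alt = attributes.get("alt")
        if alt is None:
            self.findings.append((src, "missing alt"))
        elif not alt.strip():
            # Valid for decorative images, so this needs a human decision.
            self.findings.append((src, "empty alt (review: decorative?)"))
        elif FILENAME_PATTERN.match(alt.strip()):
            self.findings.append((src, "alt looks like a file name"))


def audit_alt_text(html):
    """Return a list of (src, problem) tuples for the images in an HTML string."""
    auditor = AltTextAuditor()
    auditor.feed(html)
    return auditor.findings
```

A check like this can only catch missing or obviously unhelpful descriptions. It cannot tell whether a description that is present actually serves a reader in the context where the image appears.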
As a result of missing or unhelpful image descriptions, members of the BLV community are frequently left out of valuable social media interactions or unable to access essential information on websites that use images for site navigation or to convey meaning.
Can AI aid those with blindness and low vision?
We should encourage better tooling and interfaces that nudge people toward making images accessible, says Elisa Kreiss, a graduate student in linguistics at Stanford University and a member of the Stanford Natural Language Processing Group. But society’s failure so far to provide useful, accessible alt-text descriptions for every image on the internet points to the potential for an AI solution, she says.
However, image descriptions produced by natural language generation (NLG) haven’t yet proven beneficial to the BLV community. “There’s a disconnect between the models we have in computer science that are supposed to generate text from images and what actual users find to be useful,” Kreiss says.
In a recent paper, Kreiss and her study co-authors (including scholars from Stanford, Google Brain and Columbia University) found that BLV users prefer image descriptions that take context into account.
Because context can dramatically change the meaning of an image — e.g., a football player in a Nike ad versus in a story about traumatic brain injury — contextual information is vital for crafting alt-text descriptions that are useful.
Yet existing metrics of image description quality don’t take context into account. These metrics are therefore steering the development of NLG image descriptions in a direction that will not improve image accessibility, Kreiss says.
Read the paper, “Context Matters for Image Descriptions for Accessibility: Challenges for Referenceless Evaluation Metrics”
Kreiss and her team also found that BLV users prefer longer alt-text descriptions over the concise descriptions typically promoted by prominent accessibility guidelines, a result that runs counter to expectations.
These findings highlight the need not only for new ways of training sophisticated language models, Kreiss says, but also for new ways of evaluating them to ensure they serve the needs of the communities they’ve been designed to help.
Measuring image descriptions’ usefulness in context
Computer scientists have long assumed that image descriptions should be objective and context-independent, Kreiss says. But human-computer interaction research shows BLV users tend to prefer descriptions that are both subjective and context-appropriate. “If the dog is cute or the sunny day is beautiful, depending on the context, the description might need to say so,” she says. And if the image appears on a shopping website versus a news blog, the alt-text description should reflect the particular context to help clarify its meaning.
Yet existing metrics for evaluating the quality of image descriptions focus on whether a description is a reasonable fit for the image regardless of the context in which it appears, Kreiss says.
For example, current metrics might give a high rating to a soccer team photo description that reads “a soccer team playing on a field,” regardless of whether the image accompanies an article about cooperation (in which case the alt-text should include something about how the team cooperates), a story about the athletes’ unusual hairstyles (in which case the hairstyles should be described) or a report on the prevalence of advertising in soccer stadiums (in which case the advertising in the arena might be mentioned). If image descriptions are to better serve the needs of BLV users, Kreiss says, they must have greater context-awareness.
To explore the importance of context, Kreiss and her colleagues hired Amazon Mechanical Turk workers to write image descriptions for 18 images, each of which appeared in three different Wikipedia articles. In addition to the soccer example cited above, the dataset included images such as a church spire linked to articles about roofs, building materials and Christian crosses; and a mountain range and lake view associated with articles about montane (mountain slope) ecosystems, a body of water, and orogeny (a specific way that mountains are formed).
The researchers then showed the images to both sighted and BLV study participants and asked them to evaluate each description’s overall quality; imaginability (how well it helped users imagine the image); relevance (how well it captured relevant information); irrelevance (how much irrelevant information it added); and general “fit” (how well the image fit within the article).
The study revealed that BLV and sighted participants’ ratings were highly correlated.
Knowing that the two populations were aligned in their assessments will be helpful when designing future NLG systems for generating image descriptions, Kreiss says. “The perspectives of people in the BLV community are essential, but often during system development we need much more data than we can get from the low-incidence BLV population.”
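Agreement between two groups of raters, such as the BLV and sighted participants discussed above, is typically summarized with a correlation coefficient. The sketch below computes a Pearson coefficient in plain Python. The article does not say which statistic the researchers used, so the choice of Pearson is an assumption, and the ratings in the usage lines are invented placeholders rather than data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder ratings (NOT study data): one mean rating per description.
group_a_ratings = [4.0, 2.5, 3.5, 1.0, 5.0]
group_b_ratings = [3.5, 2.0, 4.0, 1.5, 4.5]
agreement = pearson(group_a_ratings, group_b_ratings)
```

A value near 1 means the two groups rank descriptions similarly, which is what would justify using ratings from the larger sighted population as a supplement during system development.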
Another finding: Context matters. Participants’ ratings of an image description’s overall quality closely aligned with their ratings for relevance.
When it came to description length, BLV participants rated the quality of longer descriptions more highly than did sighted participants, a finding Kreiss considers surprising and worthy of further research. “Users’ preference for shorter or longer image descriptions might also depend on the context,” she notes. Figures in scientific papers, for example, might merit longer descriptions.
Steering toward better metrics of image description quality
Kreiss hopes her team’s research will promote metrics of image description quality that will better serve the needs of BLV users. She and her colleagues found that two of the current methods (CLIPScore and SPURTS) were not capable of capturing context.
CLIPScore, for example, only provides a compatibility score for an image and its description. And SPURTS evaluates the quality of the description text without reference to the image.
While these metrics can evaluate the truthfulness of an image description, that is only a first step toward driving “useful” description generation, which also requires relevance (i.e., context dependence), Kreiss says.
It was therefore unsurprising that CLIPScore’s ratings of the image descriptions in the researchers’ dataset did not correlate with the ratings by the BLV and sighted participants. Essentially, CLIPScore rated the description’s quality the same regardless of context.
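To see why the score cannot vary with context, it helps to look at how CLIPScore is defined in its original publication: a rescaled cosine similarity between the CLIP embedding of the image and the CLIP embedding of the description, clipped at zero. The sketch below reproduces that formula on precomputed embedding vectors. It does not load a CLIP model, the function names are illustrative, and the short vectors in the usage lines are toy stand-ins for real embeddings.

```python
from math import sqrt

CLIPSCORE_WEIGHT = 2.5  # rescaling constant from the published CLIPScore definition


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def clipscore(image_embedding, text_embedding):
    """Image-description compatibility: weight * max(cos(image, text), 0)."""
    return CLIPSCORE_WEIGHT * max(cosine(image_embedding, text_embedding), 0.0)


# Toy embeddings (assumed values, not real CLIP output).
image = [0.6, 0.8, 0.0]
description = [0.6, 0.8, 0.0]
score = clipscore(image, description)
```

The function takes only an image and a description. The surrounding article never enters the computation, so the same pair receives the same score wherever it appears.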
When the team added the text of the various Wikipedia articles to alter the way CLIPScore is computed, the correlation with human ratings improved somewhat — a proof of concept, Kreiss says, that reference-less evaluation metrics can be made context-aware.
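One way to picture a context-aware variant is to blend the image-description similarity with a similarity between the description and the article text. The sketch below is a hypothetical illustration of that idea only. The article does not give the researchers’ formula, so the blending rule, the `alpha` parameter and all names here are assumptions, and the vectors are toy stand-ins for real embeddings.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def context_aware_score(description, image, context, alpha=0.5, weight=2.5):
    """Hypothetical blend of image fit and context fit (not the paper's formula).

    alpha=1.0 ignores the context entirely; alpha=0.0 ignores the image.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    image_fit = cosine(description, image)
    context_fit = cosine(description, context)
    blended = alpha * image_fit + (1 - alpha) * context_fit
    return weight * max(blended, 0.0)


# Toy embeddings (assumed values): same image and description, two articles.
description = [1.0, 0.0]
image = [1.0, 0.0]
article_on_topic = [1.0, 0.0]
article_off_topic = [0.0, 1.0]

score_on = context_aware_score(description, image, article_on_topic)
score_off = context_aware_score(description, image, article_off_topic)
```

With a context term in the computation, the same image and description score higher alongside the article they are relevant to than alongside an unrelated one, which is the behavior a context-free compatibility score cannot produce.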
She and her team are now working to create a metric that takes context into account from the get-go to make descriptions more accessible and more responsive to the community of people they are meant to serve.
“We want to work toward metrics that can lead us toward success in this very important social domain,” Kreiss says. “If we’re not starting with the right metrics, we’re not driving progress in the direction we want to go.”
Katharine Miller is a contributing writer for the Stanford Institute for Human-Centered AI.
This story originally appeared on Hai.stanford.edu. Copyright 2023