Writing photo captions is a monotonous but necessary chore, begrudgingly undertaken by editors everywhere. Fortunately for them, AI might soon be able to handle the bulk of the work. In a paper (“Adversarial Semantic Alignment for Improved Image Captions”) appearing at the 2019 Conference on Computer Vision and Pattern Recognition (CVPR) in Long Beach, California, this week, a team of scientists at IBM Research describes a model capable of autonomously crafting diverse, creative, and convincingly humanlike captions.
Architecting the system required addressing a chief shortcoming of automatic captioning systems: sequential language generation tends to produce structures that are syntactically correct but homogeneous, unnatural, and semantically irrelevant. The coauthors’ approach gets around this with an attention captioning model, which allows the captioner to use fragments of scenes in the photos it’s observing to compose sentences. At every generation step, the team’s AI model has the choice of attending to either visual or textual cues from the last step.
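To make the idea of choosing between visual and textual cues concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function `attend`, the dot-product scoring, and the toy vectors are all assumptions made for illustration. Image regions and a single textual cue compete in one softmax, so the weight on the last entry shows how much the step leans on text rather than the image.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, region_feats, text_feat):
    """Hypothetical attention step (illustrative only).

    query        -- the decoder state at this generation step
    region_feats -- one feature vector per image region (visual cues)
    text_feat    -- a feature vector summarizing the previous step's text
    Returns (context, weights); the last weight belongs to the textual cue.
    """
    candidates = region_feats + [text_feat]
    weights = softmax([dot(query, c) for c in candidates])
    dim = len(query)
    # The context vector is the attention-weighted average of all candidates.
    context = [
        sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)
    ]
    return context, weights


# Toy example: the query lines up with the textual cue, so text dominates.
context, weights = attend(
    query=[0.0, 0.0, 1.0],
    region_feats=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    text_feat=[0.0, 0.0, 2.0],
)
```

In a trained captioner the features and the scoring function would be learned; the sketch only shows how a single softmax can arbitrate between the two kinds of cue.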
To ensure the generated captions didn’t sound too robotic, the research team employed generative adversarial networks (GANs), two-part neural networks consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples, in training the captioner. A co-attention discriminator scored the “naturalness” of novel sentences via a model that matched scenes at the pixel level with generated words, enabling the captioner to compose by judging the image and sentence pairs.
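The generator-versus-discriminator setup can be illustrated with the textbook GAN discriminator objective. The sketch below is an assumption for illustration, not the loss used in the paper, and the name `discriminator_loss` is hypothetical. It takes raw scores the discriminator assigned to real (image, human caption) pairs and to generated pairs, and returns a binary cross-entropy loss that is low when real pairs score high and generated pairs score low.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_loss(real_scores, fake_scores):
    """Textbook GAN discriminator loss over raw scores (logits).

    real_scores -- scores for (image, human-written caption) pairs
    fake_scores -- scores for (image, generated caption) pairs
    Illustrative only; the paper's actual objective may differ.
    """
    eps = 1e-12  # guards against log(0)
    real_term = sum(-math.log(sigmoid(s) + eps) for s in real_scores)
    fake_term = sum(-math.log(1.0 - sigmoid(s) + eps) for s in fake_scores)
    return real_term / len(real_scores) + fake_term / len(fake_scores)


# A discriminator that separates the two groups well has a lower loss
# than one that scores everything the same.
confident = discriminator_loss([3.0, 2.5], [-3.0, -2.0])
undecided = discriminator_loss([0.0, 0.0], [0.0, 0.0])
```

The captioner is trained in the opposite direction: it is rewarded when its captions receive scores that the discriminator would normally give to human-written ones.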
Avoiding bias in the training data set, another common problem in captioning systems, required building a diagnostic tool. Such systems often suffer from overfitting (i.e., analysis that corresponds too closely to a particular set of data) and subsequently generalize poorly to scenes where learned objects (e.g., “bed,” “bedroom”) appear in unseen contexts (“bed and forest”). To that end, the researchers propose a test corpus of captioned images designed in such a way that poor model performance indicates overfitting.
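One way to picture such a diagnostic is sketched below. It is a hypothetical illustration and not how the researchers built their corpus: the functions, the `objects` and `score` fields, and the rule of treating an object pair as "seen" only if it co-occurred in training are all assumptions. Test images are split by whether they contain an object pairing absent from training, and the drop in caption score on the unseen group is reported.

```python
from itertools import combinations


def object_pairs(objects):
    """All unordered pairs of distinct objects, as sorted tuples."""
    return {tuple(sorted(p)) for p in combinations(set(objects), 2)}


def split_by_context(train_object_sets, test_items):
    """Split test items into in-context and out-of-context groups.

    An item is out-of-context if any of its object pairs never
    co-occurred in the training data (an assumed criterion).
    """
    seen = set()
    for objs in train_object_sets:
        seen |= object_pairs(objs)
    in_ctx, out_ctx = [], []
    for item in test_items:
        unseen = object_pairs(item["objects"]) - seen
        (out_ctx if unseen else in_ctx).append(item)
    return in_ctx, out_ctx


def generalization_gap(in_ctx, out_ctx):
    """Mean score on familiar scenes minus mean score on unfamiliar ones.

    Both groups must be non-empty.
    """
    def mean(items):
        return sum(i["score"] for i in items) / len(items)
    return mean(in_ctx) - mean(out_ctx)


train = [["bed", "bedroom"], ["tree", "forest"]]
test_items = [
    {"objects": ["bed", "bedroom"], "score": 0.9},  # familiar pairing
    {"objects": ["bed", "forest"], "score": 0.4},   # unseen pairing
]
in_ctx, out_ctx = split_by_context(train, test_items)
gap = generalization_gap(in_ctx, out_ctx)
```

A large gap would suggest the model has memorized which objects usually appear together rather than learned to describe what is actually in the image.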
In experiments that tasked human evaluators from Amazon’s Mechanical Turk with identifying which captions were generated by the AI model and judging how well each caption described the corresponding image, given several real and synthetic samples, the researchers report that their captioner achieved “good” performance on the whole. They believe that their work lays the groundwork for powerful new computer vision systems, which they intend to explore in future work.
“Progress on automatic image captioning and scene understanding will make [AI] systems more reliable for use as personal assistants for visually impaired people and in improving their day-to-day life,” wrote the researchers. “The semantic gap in bridging language and vision points to the need for incorporating common sense and reasoning into scene understanding.”