Segmentation — partitioning an image or scan into multiple segments, or sets of pixels — is a task at which artificial intelligence (AI) excels. Case in point: Researchers at Google parent company Alphabet’s DeepMind recently revealed in an academic paper that they’d developed a system capable of segmenting CT scans with “near-human performance.” Now, scientists at the University of Potsdam in Germany have developed an AI segmentation tool for a slightly more cartoony medium: comics.
In a paper published on the preprint server arXiv.org (“Deep CNN-based Speech Balloon Detection and Segmentation for Comic Books”), they describe a neural network (i.e., layers of mathematical functions modeled after biological neurons) that can detect and isolate speech bubbles in graphic novels and comic books. During tests involving a dataset containing speech bubbles with “wiggly tails” and “curved corners,” it achieved an F1 score (a measure of a test’s accuracy) of 0.94, which the researchers claim is state-of-the-art.
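For readers unfamiliar with the metric: the F1 score is the harmonic mean of precision and recall. A minimal sketch of how it could be computed from pixel-level counts for a binary (balloon vs. background) task; the counts below are illustrative, not taken from the paper:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)  # fraction of predicted balloon pixels that are correct
    recall = tp / (tp + fn)     # fraction of actual balloon pixels that were found
    return 2 * precision * recall / (precision + recall)

# Toy example: 94 correct balloon pixels, 6 false alarms, 6 misses
score = f1_score(tp=94, fp=6, fn=6)  # precision = recall = 0.94, so F1 = 0.94
```

An F1 of 0.94 thus means the system balances finding nearly all balloon pixels with raising few false alarms.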
“Speech balloons usually consist of a carrier, [a symbolic device used to hold the text,] and a tail connecting the carrier to its root character from which the text emerges. Both tails and carriers come in a variety of shapes, outlines, and degrees of wiggliness,” the researchers explain. “It … pays to classify [speech bubbles] as different classes, because they serve different functions: In contrast to captions, which are normally used for narrative purposes, speech balloons typically contain direct speech or thoughts of characters in the comic.”
The team tapped a fully convolutional neural network — a class of AI commonly used to analyze visual imagery — originally architected for medical image segmentation and trained for classification of “natural images.” They modified it slightly and fed it 750 annotated pages from 90 comic books in the Graphic Narrative Corpus, a digital library of graphic novels, memoirs, and nonfiction written in English.
Over time, it learned to classify whether each pixel in a comic strip belonged to a speech balloon or not.
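The paper’s full architecture isn’t reproduced here, but the per-pixel decision it learns can be illustrated with a minimal sketch: a fully convolutional network outputs a probability map the same size as the input page, which is then thresholded into a binary balloon mask. The 0.5 threshold and the toy map below are illustrative assumptions, not values from the paper:

```python
def balloon_mask(prob_map, threshold=0.5):
    """Turn a per-pixel probability map (rows of values in [0, 1],
    as a network might output) into a binary speech-balloon mask:
    1 = balloon pixel, 0 = background."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

# Toy 2x3 "network output" standing in for a full comic page
probs = [[0.9, 0.2, 0.7],
         [0.1, 0.6, 0.4]]
mask = balloon_mask(probs)  # [[1, 0, 1], [0, 1, 0]]
```

In the real system the probability map comes from the modified segmentation network itself; the thresholding step is what turns its soft output into the hard balloon/not-balloon labeling described above.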
To validate their approach, the researchers tested the trained AI system on a subset (15 percent) of the 750 images they sourced from the Graphic Narrative Corpus. Impressively, it managed to approximate illusory contours — boundaries of speech balloons not outlined by physical lines, but by “imaginary” continuations of the lines defining the space between panels.
The researchers posit that their AI speech balloon detection system could be used to create corpora of annotated comic books, or as a first step in a general segmentation pipeline for historical manuscripts, scientific articles, figures and tables, and newspaper articles. And they say that it one day might aid in the development of assistive technologies for people with poor vision.
That’s not to suggest it’s perfect. It performed poorly with speech bubbles in Japanese manga, which the researchers say could be the result of “culture-specific” features encoded in the training dataset, such as the Latin alphabet and the horizontal orientation of text lines in speech balloons. But work has already begun on an updated model with more manga samples, and on a model extended to segment captions, characters, and other elements.
“Of course, human-assisted verification is needed, but given the fact there are now several computer vision domains where the performance of [some AI] models is at least close to human performance, we expect to be able to solve several tedious annotation tasks, freeing human resources for more interesting endeavours,” they wrote.