In a paper published on the preprint server Arxiv.org, researchers at IBM propose StarNet, an end-to-end trainable image classifier that can localize what it believes to be the key regions supporting its predictions. Besides addressing the task of visual classification, StarNet supports weakly supervised few-shot object detection, meaning it needs only a small amount of noisy data to achieve reasonable accuracy.

StarNet could increase transparency and reduce the amount of training data needed in new visual domains, such as self-driving cars and autonomous industrial robots. By extension, it could cut deployment time for AI projects involving classifiers, which surveys show ranges from 8 to 90 days.

StarNet consists of a few-shot classifier module attached to an extractor, both of which are trained in a meta-learning fashion where episodes are randomly sampled from classes. Each episode comprises support samples and random query samples for a given base class of image, like “turtle,” “parrot,” “chicken,” and “dog.”
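To make the episodic setup concrete, here is a minimal Python sketch of how one episode might be sampled. It is not taken from the paper: the function name `sample_episode`, the parameters `n_way`, `k_shot`, and `n_query`, and the toy image IDs are all assumptions made for illustration.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=2, rng=None):
    """Sample one episode: n_way classes, k_shot support and n_query query images each.

    images_by_class is a hypothetical dict mapping a class name to a list of image IDs.
    """
    rng = rng or random.Random()
    # Pick the classes that take part in this episode.
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint support and query images for the class.
        picked = rng.sample(images_by_class[label], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    rng.shuffle(query)  # query order should not reveal the class
    return support, query

# Toy data standing in for real images (hypothetical IDs).
toy = {name: [f"{name}_{i}" for i in range(10)]
       for name in ["turtle", "parrot", "chicken", "dog"]}
support, query = sample_episode(toy, n_way=3, k_shot=1, n_query=2,
                                rng=random.Random(0))
```

Each call yields a small labeled support set and a shuffled query set drawn from the same classes, which is the unit a meta-learning loop would train on.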

StarNet tries to geometrically match every pair of support and query images, matching regions of arbitrary shape between the two images up to local deformations (to accommodate changes in shape). Training drives the matched regions to correspond to the locations of the class instances present in image pairs that share the same class label, thereby localizing the instances. As they’re localized, StarNet highlights the common image regions, giving insight into how it made its predictions.
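The idea of highlighting regions shared by a support image and a query image can be illustrated with a much simpler stand-in. The sketch below is not StarNet's geometric matching: it is plain nearest-neighbor matching of feature vectors by cosine similarity, with no geometric constraints. The function names and the tiny hand-made 2x2 feature grids are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_heatmap(query_grid, support_grid):
    """For each query cell, return its best cosine similarity to any support cell."""
    support_cells = [vec for row in support_grid for vec in row]
    return [[max(cosine(q, s) for s in support_cells) for q in row]
            for row in query_grid]

# Hand-made features: the object looks the same in both images,
# while the two backgrounds differ from each other and from the object.
OBJ, BG_S, BG_Q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
support_grid = [[OBJ, BG_S],
                [BG_S, BG_S]]
query_grid = [[BG_Q, BG_Q],
              [BG_Q, OBJ]]

heat = match_heatmap(query_grid, support_grid)
print(heat)  # [[0.0, 0.0], [0.0, 1.0]]
```

Only the query cell containing the shared object scores highly, which is the kind of heatmap a reader could overlay on the query image to see what drove a match.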

In experiments, the researchers used only the class labels for training, validation, and all of the support images, drawing on data sets including miniImageNet, CIFAR-FS, and FC100, all of which have 100 randomly chosen classes; CUB, which has 11,788 images of birds of 200 species; and ImageNetLOC-FS, which comprises 331 animal categories. They used 2,000 episodes for validation and 1,000 for testing on a single Nvidia K40 graphics card, resulting in average running times of 1.15 to 2.2 seconds per batch.

On few-shot classification tasks, StarNet managed to perform up to 5% better than the state-of-the-art baselines. And with respect to weakly supervised few-shot object detection, the model obtained results “higher by a large margin” than results obtained by all compared baselines. The team attributes this strong performance to StarNet’s knack for classifying objects through localization.

“Future work directions include extending StarNet towards efficient end-to-end differentiable multi-scale processing for better handling very small and very large objects; iterative refinement utilizing StarNet’s locations predictions made during training; and applying StarNet for other applications requiring accurate localization using only a few examples, such as visual tracking,” the researchers wrote.

It’s often assumed that as the complexity of an AI system increases, it invariably becomes less interpretable. But researchers have begun to challenge that notion with libraries like Facebook’s Captum, which explains decisions made by neural networks built with the deep learning framework PyTorch, as well as IBM’s AI Explainability 360 toolkit and Microsoft’s InterpretML. For its part, Google recently detailed a system that explains how image classifiers make predictions, and OpenAI detailed a technique for visualizing AI decision-making.

