Researchers propose data set to measure few-shot learning performance

A plethora of AI models have been tailored to tackle few-shot classification, which refers to learning a classifier for new classes given only a few examples (an ability humans naturally possess). Improving on it could lead to more efficient algorithms capable of expanding their knowledge without requiring large labeled data sets, but to date, many of the procedures and corpora used to assess progress here are lacking.

That's why researchers at Google AI, the University of California at Berkeley, and the University of Toronto propose in a preprint paper a benchmark for training and evaluating large-scale, diverse, and more "realistic" few-shot classification models. They say it improves upon previous approaches by incorporating multiple data sets of "diverse" distributions and by introducing realistic class imbalance, which they say allows the testing of robustness across a spectrum from low-shot learning onward.

The work was published in May 2019, but it was recently accepted to the International Conference on Learning Representations (ICLR) that will be held in Addis Ababa, Ethiopia in April.

As the team explains, as opposed to synthetic environments, real-life learning experiences are heterogeneous in that they vary by the number of classes and the examples per class. They also measure only within-corpus generalization, and they ignore the relationships between classes when forming episodes -- i.e., the coarse-grained classification of dogs and chairs may present different difficulties than the fine-grained classification of dog breeds. (An "episode" encompasses states that come between an initial-state and a terminal-state, such as a game of chess.)

By contrast, the researchers' data set -- the Meta-Dataset -- leverages data from 10 different corpora, which span a variety of visual concepts natural and human-made and vary in the specificity of the class definition. Two are reserved for evaluation, meaning that no two classes from them participate in the training set, while the remaining ones contribute some classes to each of the training, validation, and test splits of classes.

Meta-Dataset separately employs an algorithm for sampling episodes, which aims to yield imbalanced episodes of variable shots (class precision) and ways (accuracy). A prescribed number of examples of each chosen class are chosen uniformly at random to populate the support and query sets.

In experiments, the team trained meta-learning models via training episodes sampled using the same algorithm as used for Meta-Dataset's evaluation episodes. They say that, tested against Meta-Dataset, the models generally didn't improve when provided multiple data sources and that they didn't benefit from meta-learning across the data sets. Moreover, they report that the models weren't robust to the amount of data in test episodes; rather, each excelled in a different part of the spectrum.

"We believe that our exploration of various models on Meta-Dataset has uncovered interesting directions for future work pertaining to meta-learning across heterogeneous data," wrote the coauthors, who added that addressing the uncovered shortcomings constitutes an important research goal. "[I]t remains unclear what is the best strategy for creating training episodes, the most appropriate validation creation and the most appropriate initialization."