In a preprint paper, researchers at Amazon, Carnegie Mellon, and the University of Texas at Austin describe X-Transformer, an approach to fine-tuning language models for the problem of returning relevant labels from very large label collections. They say that it achieves state-of-the-art results on several benchmarks, as well as on a product query data set from Amazon.

X-Transformer targets what the researchers call “extreme” multi-label text classification (XMC): Given an input text instance, it attempts to return the most relevant labels from a collection where the number of labels could be in the millions (or more). XMC is essentially a text classification challenge on an industrial scale — a challenge that requires overcoming hardware limitations in addition to a lack of training data.

“Many challenging problems at Amazon amount to finding relevant results from an enormous output space of potential candidates: for example, suggesting keywords to advertisers starting new campaigns on Amazon, predicting next queries a customer will type based on the previous queries he or she typed,” wrote the coauthors. “Keyword recommendation systems provide keyword suggestions for advertisers to create campaigns … An XMC model, when trained on a product-to-query dataset such as product-query customer purchase records, can suggest queries that are relevant to any given product by utilizing product information, like title, description, [or] brand.”

X-Transformer, which builds on Google’s existing Transformer architecture, consists of three components: semantic label indexing (SLI), deep neural matching, and ensemble ranking. SLI decomposes the original XMC problem into a set of smaller sub-problems via a process called label clustering. The deep neural matching component then fine-tunes a Transformer model for each SLI-induced XMC sub-problem. Finally, the ensemble ranking component combines the scores from the various sub-problems, which in theory bolsters performance further.
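The three stages can be illustrated with a toy sketch. This is not the researchers' code: plain k-means stands in for semantic label indexing, a dot product against cluster centers stands in for the fine-tuned Transformer matcher, and summing scores across two clusterings stands in for ensemble ranking. All function names, the `beam` parameter, and the example vectors are hypothetical.

```python
import numpy as np

def nearest_center(vecs, centers):
    # Index of the closest center (squared Euclidean distance) for each vector.
    dists = ((vecs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def cluster_labels(label_vecs, n_clusters, n_iter=10, seed=0):
    # Stage 1 (stand-in for semantic label indexing): toy k-means over labels.
    rng = np.random.default_rng(seed)
    start = rng.choice(len(label_vecs), size=n_clusters, replace=False)
    centers = label_vecs[start].astype(float)
    for _ in range(n_iter):
        assign = nearest_center(label_vecs, centers)
        for c in range(n_clusters):
            members = label_vecs[assign == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return nearest_center(label_vecs, centers), centers

def predict(x, label_vecs, indexes, beam=1, k=3):
    # `indexes` holds one (assignment, centers) pair per clustering.
    totals = np.zeros(len(label_vecs))
    seen = np.zeros(len(label_vecs), dtype=bool)
    for assign, centers in indexes:
        # Stage 2 (stand-in for neural matching): pick the best clusters.
        top_clusters = np.argsort(-(centers @ x))[:beam]
        candidates = np.flatnonzero(np.isin(assign, top_clusters))
        # Stage 3 (stand-in for ensemble ranking): sum label scores.
        totals[candidates] += label_vecs[candidates] @ x
        seen[candidates] = True
    ranked = [int(i) for i in np.argsort(-totals) if seen[i]]
    return ranked[:k]

# Six made-up label vectors forming two loose groups.
label_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                       [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
indexes = [cluster_labels(label_vecs, n_clusters=2, seed=s) for s in (0, 1)]
query = np.array([1.0, 0.0])
print(predict(query, label_vecs, indexes, beam=1, k=2))
```

The point of the decomposition is that the matcher only scores a handful of clusters rather than every label, so the expensive model never has to face the full label space directly.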

In experiments, the researchers claim the proposed X-Transformer achieved new state-of-the-art results on four XMC benchmarks and led to improvements on real-world XMC applications.

For example, on a Wikipedia data set with a half-million labels, X-Transformer achieved a “prec@1” (a metric indicating how often the highest-ranked label is relevant) of 77.28%, a “substantial” improvement over the well-established hierarchical label tree approach Parabel (which achieves 68.70%) and the competing machine learning method AttentionXML (76.95%). When applied to an internal Amazon data set dubbed Prod2Query-1M, which consisted of 14 million products and 1 million labels (queries), X-Transformer showed a 10.7% relative improvement over Parabel.
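For readers unfamiliar with the metric, here is a minimal sketch of how precision at k and a relative improvement are commonly computed. The function names and the toy predictions are hypothetical, and the relative-improvement formula is the usual (new - baseline) / baseline definition, assumed here rather than taken from the paper.

```python
def precision_at_k(ranked_labels, relevant, k=1):
    # Fraction of the top-k predicted labels that are truly relevant.
    top = ranked_labels[:k]
    hits = sum(1 for label in top if label in relevant)
    return hits / k

def mean_precision_at_k(predictions, ground_truth, k=1):
    # Average precision@k over all test instances.
    scores = [precision_at_k(p, r, k) for p, r in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)

def relative_improvement(new, baseline):
    # Gain of `new` over `baseline`, expressed as a fraction of the baseline.
    return (new - baseline) / baseline

# Toy example: two instances, each with a ranked prediction list.
predictions = [["a", "b"], ["c", "d"]]
ground_truth = [{"a"}, {"d"}]
print(mean_precision_at_k(predictions, ground_truth, k=1))

# Wikipedia prec@1 figures quoted above: X-Transformer vs. Parabel.
print(relative_improvement(77.28, 68.70))
```

Note that a relative improvement differs from the absolute gap in percentage points, which is why the two kinds of figures should not be compared directly.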

The X-Transformer data sets, code, and models are available in open source on GitHub.