Amazon's X-Transformer tackles industrial-scale text classification

In a preprint paper published on arXiv.org, researchers at Amazon, Carnegie Mellon, and the University of Texas at Austin describe X-Transformer, an approach to fine-tuning Transformer language models for the problem of picking relevant labels out of very large label collections. They say that it achieves state-of-the-art results on several benchmarks, as well as on a product query data set from Amazon.

X-Transformer targets what the researchers call "extreme" multi-label text classification (XMC): Given an input text instance, it attempts to return the most relevant labels from a collection where the number of labels could be in the millions (or more). XMC is essentially a text classification challenge on an industrial scale -- a challenge that requires overcoming hardware limitations in addition to a lack of training data.

"Many challenging problems at Amazon amount to finding relevant results from an enormous output space of potential candidates: for example, suggesting keywords to advertisers starting new campaigns on Amazon, predicting next queries a customer will type based on the previous queries he or she typed," wrote the coauthors. "Keyword recommendation systems provide keyword suggestions for advertisers to create campaigns ... An XMC model, when trained on a product-to-query dataset such as product-query customer purchase records, can suggest queries that are relevant to any given product by utilizing product information, like title, description, [or] brand."

X-Transformer, which builds on Google's existing Transformer architecture, consists of three components: semantic label indexing (SLI), deep neural matching, and ensemble ranking. Semantic label indexing decomposes the original XMC problem into a set of smaller sub-problems via a process called label clustering. Next, the deep neural matching component fine-tunes a Transformer model for each SLI-induced XMC sub-problem. The ensemble ranking component is then used to combine scores from the various sub-problems, theoretically bolstering performance further.
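To make the three stages concrete, here is a toy Python sketch of the index-match-rank flow. It is not the authors' implementation: simple token overlap stands in for the fine-tuned Transformer, the clustering functions (`by_last_word`, `by_first_letter`) are hypothetical rules rather than learned label clusters, and every function name and example label is invented for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercased whitespace tokens as a set (a stand-in for real text features)."""
    return set(text.lower().split())


def index_labels(labels, cluster_of):
    """Stage 1, label indexing: group labels into clusters.

    `cluster_of` is a hypothetical rule mapping a label to a cluster id.
    """
    clusters = defaultdict(list)
    for label in labels:
        clusters[cluster_of(label)].append(label)
    return dict(clusters)


def match_clusters(text, clusters, top_b=1):
    """Stage 2, matching: score every cluster against the input, keep the top_b.

    Token overlap replaces the fine-tuned Transformer described in the article.
    """
    tokens = tokenize(text)
    scores = {}
    for cluster_id, members in clusters.items():
        vocab = set().union(*(tokenize(m) for m in members))
        scores[cluster_id] = len(tokens & vocab) / len(vocab)
    return sorted(scores, key=scores.get, reverse=True)[:top_b]


def rank_labels(text, clusters, shortlisted):
    """Stage 3, ranking: score only the labels inside the shortlisted clusters."""
    tokens = tokenize(text)
    scores = {}
    for cluster_id in shortlisted:
        for label in clusters[cluster_id]:
            label_tokens = tokenize(label)
            # Jaccard similarity between the input and the label text.
            scores[label] = len(tokens & label_tokens) / len(tokens | label_tokens)
    return scores


def ensemble(score_dicts):
    """Average label scores across runs; a label missing from a run counts as 0."""
    totals = defaultdict(float)
    for scores in score_dicts:
        for label, score in scores.items():
            totals[label] += score
    n = len(score_dicts)
    averaged = ((label, total / n) for label, total in totals.items())
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


def predict(text, labels, clusterings, top_b=1):
    """Run the three stages once per clustering, then ensemble the results."""
    runs = []
    for cluster_of in clusterings:
        clusters = index_labels(labels, cluster_of)
        shortlisted = match_clusters(text, clusters, top_b)
        runs.append(rank_labels(text, clusters, shortlisted))
    return ensemble(runs)


def by_last_word(label):
    return label.split()[-1]


def by_first_letter(label):
    return label[0]


# Invented example labels (queries) and product title.
LABELS = ["wireless headphones", "bluetooth headphones", "running shoes", "trail shoes"]
result = predict(
    "sony wireless noise cancelling headphones",
    LABELS,
    clusterings=[by_last_word, by_first_letter],
)
```

The point of the decomposition is visible even at this scale: the matching stage discards the shoe-related clusters before any individual label is scored, so the ranking stage only touches a small shortlist rather than the full label set.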

In experiments, the researchers claim the proposed X-Transformer achieved new state-of-the-art results on four XMC benchmarks and led to improvements on real-world XMC applications.

For example, on a Wikipedia data set with a half-million labels, X-Transformer achieved a "prec@1" (a metric indicating how often the highest-ranked label is relevant) of 77.28%, a "substantial" improvement over the well-established hierarchical label tree approach Parabel (which achieves 68.70%) and the competing machine learning method AttentionXML (76.95%). When applied to an internal Amazon data set dubbed Prod2Query-1M, which consisted of 14 million products on Amazon.com and 1 million labels (queries), X-Transformer showed a 10.7% relative improvement over Parabel.
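For readers unfamiliar with the metric, the following Python sketch shows one common way to compute precision at k and a relative improvement. The function names and the toy predictions are illustrative assumptions, not taken from the paper's code; only the 77.28 and 68.70 figures come from the results quoted above.

```python
def precision_at_k(ranked_labels, relevant, k=1):
    """Fraction of the top-k predicted labels that are relevant, for one instance."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for label in ranked_labels[:k] if label in relevant)
    return hits / k


def mean_precision_at_k(predictions, ground_truth, k=1):
    """Average precision@k over all test instances."""
    scores = [precision_at_k(p, g, k) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)


def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# Toy data: each inner list is one instance's labels, best guess first.
predictions = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
ground_truth = [{"a"}, {"d"}, {"e", "f"}, {"x"}]

prec_at_1 = mean_precision_at_k(predictions, ground_truth, k=1)

# Wikipedia figures quoted above: X-Transformer 77.28 vs. Parabel 68.70.
wiki_gain = relative_improvement(77.28, 68.70)
```

In the toy data the top guess is correct for two of the four instances, so `prec_at_1` is 0.5. Applied to the Wikipedia figures, the same arithmetic gives a relative gain of roughly 12.5% over Parabel; the 10.7% figure reported for Prod2Query-1M is presumably computed the same way, though the article does not give the underlying scores for that data set.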

The X-Transformer data sets, code, and models are available in open source on GitHub.
