Amazon's AI generates images of clothing to match text queries

Generative adversarial networks (GANs) -- two-part AI models consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples -- have been applied to tasks ranging from video, artwork, and music synthesis to drug discovery and misleading media detection. They've also made their way into ecommerce, as Amazon revealed in a blog post this morning. Scientists at Amazon describe a GAN that generates clothing examples to match product descriptions, which they say could be used to refine customer text queries. For instance, if a shopper searched for "women's black pants" and then added the word "petite" and then the word "capri," the on-screen images would adjust accordingly with each new word.

It's not unlike the GAN model commercialized by startup Vue.ai, which susses out clothing characteristics and learns to produce realistic poses, skin colors, and other features. From snapshots of apparel, it's able to generate model images in every size, up to five times faster than a traditional photoshoot.

Amazon's proposed system, ReStGAN, is a modification of an existing system, StackGAN, that produces images in two stages. Using one GAN, it first generates a low-resolution image directly from text, after which it upsamples that image with a second GAN to a higher-resolution version with textures and natural coloration. The GANs are trained with a long short-term memory (LSTM) AI model that processes sequential inputs in order, enabling them to refine images as successive words are added to the inputs. And to make the task of synthesizing from the descriptions easier, the system is restricted to three product classes -- pants, jeans, and shorts -- for which the training images are standardized (i.e., the backgrounds are removed and the images are cropped and resized so that they're alike in shape and scale).
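To make the sequential-input idea concrete, here is a minimal sketch of a single LSTM cell in plain Python. It is not Amazon's code: the state size, the embedding size, the random weights, and the helper names (`embed`, `lstm_step`, `encode`) are all assumptions chosen for illustration. It shows only that the state built from earlier words can be carried forward when a new word arrives, rather than recomputed from scratch.

```python
import math
import random

HIDDEN = 4  # hypothetical state size
EMBED = 3   # hypothetical word-embedding size

rng = random.Random(0)  # fixed seed so the toy weights are reproducible


def _matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# One weight matrix and bias per gate: input (i), forget (f), output (o),
# and candidate (g). Untrained random values, for illustration only.
WEIGHTS = {gate: _matrix(HIDDEN, EMBED + HIDDEN) for gate in "ifog"}
BIASES = {gate: [0.0] * HIDDEN for gate in "ifog"}
_EMBEDDINGS = {}


def embed(word):
    """Toy embedding: a random vector, cached so a word always maps to it."""
    if word not in _EMBEDDINGS:
        _EMBEDDINGS[word] = [rng.uniform(-1, 1) for _ in range(EMBED)]
    return _EMBEDDINGS[word]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def _gate(name, xh, activation):
    return [activation(sum(w * v for w, v in zip(row, xh)) + b)
            for row, b in zip(WEIGHTS[name], BIASES[name])]


def lstm_step(state, word):
    """Consume one word and return the updated (hidden, cell) state."""
    h, c = state
    xh = embed(word) + h  # concatenate the word vector and previous hidden state
    i = _gate("i", xh, _sigmoid)
    f = _gate("f", xh, _sigmoid)
    o = _gate("o", xh, _sigmoid)
    g = _gate("g", xh, math.tanh)
    # The forget gate keeps part of the old cell state; the input gate adds new content.
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode(words):
    """Run the cell over a word sequence, starting from an all-zero state."""
    state = ([0.0] * HIDDEN, [0.0] * HIDDEN)
    for word in words:
        state = lstm_step(state, word)
    return state


base = encode(["women's", "black", "pants"])
refined = lstm_step(base, "petite")  # reuse the earlier state for the new word
```

In a system like the one described, a state of this kind would condition the image generator; here it is just a list of floats.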

The research team trained the system in an unsupervised fashion, meaning the training data consisted of product titles and images that didn’t require any additional human annotation. The team increased the system's stability using an auxiliary classifier that categorized images generated by the model according to three properties: apparel type (pants, jeans, or shorts), color, and whether they depicted men's, women's, or unisex clothing. The researchers also grouped colors in a representational space called LAB, which was designed so that the distance between points corresponded to perceived color differences, forming the basis for a lookup table that maps visually similar colors to the same features of the textual descriptions.
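As a rough illustration of why LAB is convenient for grouping colors, the sketch below converts sRGB values to CIELAB using the standard D65 formulas and assigns each color to the nearest of a few named anchors by Euclidean distance. The anchor list and the function names are hypothetical, and nearest-anchor matching is a stand-in for whatever clustering the researchers actually used.

```python
import math

# D65 reference white, used to normalize XYZ before the LAB transform
_WHITE = (0.95047, 1.0, 1.08883)


def _srgb_to_linear(channel):
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def rgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (L, a, b)."""
    r, g, b = (_srgb_to_linear(v) for v in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(v / w) for v, w in zip((x, y, z), _WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


# Hypothetical anchor colors; a real lookup table would be learned from data.
ANCHORS = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "navy": (0, 0, 128),
    "khaki": (195, 176, 145),
}
ANCHORS_LAB = {name: rgb_to_lab(rgb) for name, rgb in ANCHORS.items()}


def nearest_color_group(rgb):
    """Return the anchor name whose LAB point is closest to the given color."""
    lab = rgb_to_lab(rgb)
    return min(ANCHORS_LAB, key=lambda name: math.dist(lab, ANCHORS_LAB[name]))
```

Because distances in LAB roughly track perceived difference, a dark blue such as `(0, 0, 100)` lands in the same group as navy rather than black, which is the property a color lookup table relies on.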

The ability to retain old visual features while adding new ones is one of the novelties of the system, according to the researchers. The other is the color model, which yields images whose colors better match textual inputs. In experiments, the team reports that ReStGAN improved product classification by up to 22% for type and up to 27% for gender, compared with the previous best-performing models based on the StackGAN architecture. On color, the reported improvement was 100%.
