Amazon's new AI technique lets users virtually try on outfits

In a series of papers scheduled to be presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Amazon researchers propose complementary AI algorithms that could form the foundation of an assistant that helps customers shop for clothes. One lets people fine-tune search queries by describing variations on a product image, while another suggests products that go with items a customer has already selected. Meanwhile, a third synthesizes an image of a model wearing clothes from different product pages to demonstrate how items work together as an outfit.

Amazon already leverages AI to power Style by Alexa, a feature of the Amazon Shopping app that suggests, compares, and rates apparel using algorithms and human curation. With style recommendations and programs like Prime Wardrobe, which allows users to try on clothes and return what they don't want to buy, the retailer is vying for a larger slice of sales in a declining apparel market while surfacing products that customers might not normally choose. It's a win for businesses on its face -- excepting cases where the recommended accessories are Amazon's own, of course.

Virtual try-on network

Researchers at Lab126, the Amazon hardware lab that spawned products like Fire TV, Kindle Fire, and Echo, developed an image-based virtual try-on system called Outfit-VITON, designed to help visualize how clothing items in reference photos might look on an image of a person. Amazon says it can be trained on single images, rather than paired photos of the same garment, using a generative adversarial network (GAN), a type of model with a component called a discriminator that learns to distinguish generated images from real ones.

"Online apparel shopping offers the convenience of shopping from the comfort of one's home, a large selection of items to choose from, and access to the very latest products. However, online shopping does not enable physical try-on, thereby limiting customer understanding of how a garment will actually look on them," the researchers wrote. "This critical limitation encouraged the development of virtual fitting rooms, where images of a customer wearing selected garments are generated synthetically to help compare and choose the most desired look."

Outfit-VITON comprises several parts. The first is a shape generation model that takes two kinds of input: a query image, which serves as the template for the final image, and any number of reference images, which depict the clothes that will be transferred onto the person in the query image.

In preprocessing, established techniques segment the input images and compute the query person's body model, representing their pose and shape. The segments selected for inclusion in the final image pass to the shape generation model, which combines them with the body model and updates the query image's shape representation. This shape representation moves to a second model -- the appearance generation model -- that encodes information about texture and color, producing a representation that's combined with the shape representation to create a photo of the person wearing the garments.

Outfit-VITON's third model fine-tunes the variables of the appearance generation model to preserve features like logos or distinctive patterns without compromising the silhouette, resulting in what Amazon claims are "more natural" outputs than those of previous systems. "Our approach generates a geometrically correct segmentation map that alters the shape of the selected reference garments to conform to the target person," the researchers explained. "The algorithm accurately synthesizes fine garment features such as textures, logos, and embroidery using an online optimization scheme that iteratively fine-tunes the synthesized image."

Visiolinguistic product discovery

One of the other papers tackles the challenge of using text to refine an image that matches a customer-provided query. The Amazon engineers' approach fuses textual descriptions and image features into representations at different levels of granularity, so that a customer can say something as abstract as "Something more formal" or as precise as "Change the neck style." The system then preserves some image features while following the customer's instructions to change others.

The system consists of models trained on triples of inputs: a source image, a textual revision, and a target image that matches the revision. The inputs pass through three different sub-models in parallel, and at distinct points in the pipeline, the representation of the source image is fused with the representation of the text before it's correlated with the representation of the target image. Because the lower levels of the model tend to represent lower-level input features (e.g., textures and colors) and the higher levels represent higher-level features (e.g., sleeve length or tightness of fit), this hierarchical matching trains the system to handle textual modifications of different resolutions, according to Amazon.

Each fusion of linguistic and visual representations is performed by a separate two-component model. One uses a joint attention mechanism to identify visual features that should be the same in the source and target images, while the other identifies features that should change. In tests, the researchers say that it helped to find valid matches to textual modifications 58% more frequently than its best-performing predecessor.
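As a rough illustration of the idea, and not the researchers' implementation, the fusion at each level can be pictured as a per-feature gate that decides whether to keep the source image's value or take a text-driven one, with the match to a target image scored at every level. Everything in this Python sketch is an assumption made for illustration: the sigmoid gate, the cosine scoring, and the names `fuse` and `hierarchical_score` do not come from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(image_feats, text_feats, preserve_logits):
    """Blend features: a high gate keeps the image value, a low gate takes the text-driven value.

    preserve_logits stands in for the output of the "what should stay the same"
    component; in a real model it would be learned, here it is supplied by hand.
    """
    fused = []
    for img, txt, logit in zip(image_feats, text_feats, preserve_logits):
        keep = sigmoid(logit)
        fused.append(keep * img + (1.0 - keep) * txt)
    return fused

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def hierarchical_score(fused_levels, target_levels):
    """Average the similarity between fused and target features over all levels."""
    scores = [cosine(f, t) for f, t in zip(fused_levels, target_levels)]
    return sum(scores) / len(scores)

# Toy example (hypothetical numbers): keep feature 0 from the image,
# replace feature 1 with what the text asks for.
source_image = [1.0, 0.0]
text_request = [0.0, 1.0]
fused = fuse(source_image, text_request, preserve_logits=[10.0, -10.0])

# One fused vector per level; both levels reuse the same toy vector here.
score = hierarchical_score([fused, fused], [[1.0, 1.0], [1.0, 1.0]])
```

In this toy case the fused vector ends up close to `[1.0, 1.0]`, so it scores highly against a target image whose features are `[1.0, 1.0]` at both levels.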

"Image search is a fundamental task in computer vision. In this work, we investigate the task of image search with text feedback, which entitles users to interact with the system by selecting a reference image and providing additional text to refine or modify the retrieval results," the coauthors wrote. "Unlike the prior works that mostly focus on one type of text feedback, we consider the more general form of text, which can be either attribute-like description, or natural language expression."

Complementary-item retrieval

The last paper investigates a technique for large-scale fashion data retrieval, where a system predicts an outfit item's compatibility with other clothing, wardrobe, and accessory items. It takes as inputs any number of garment images together with a numerical representation called a vector indicating the category of each, along with a category vector of the customer's sought-after item, allowing a customer to select things like shirts and jackets and receive recommendations for shoes.

"Customers frequently shop for clothing items that fit well with what has been selected or purchased before," the researchers wrote. "Being able to recommend compatible items at the right moment would improve their shopping experience ... Our system is designed for large-scale retrieval and outperforms the state-of-the-art on compatibility prediction, fill-in-the-blank, and outfit complementary item retrieval."

Images pass through a model that produces a vector representation of each, and each representation passes through a set of masks that de-emphasize some representation features and amplify others. (The masks are learned during training, and the resulting representations, which are called subspace representations, encode product information like color and style that's relevant only to a subset of complementary items, such as shoes, handbags, and hats.) Another model takes as input the category for each image and the category of the target item and outputs values for prioritizing the subspace representations.
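The masking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Amazon's code: the masks and scores are hard-coded toy values rather than learned ones, and combining the subspace representations as a softmax-weighted sum is an assumed reading of "prioritizing." The function names are hypothetical.

```python
import math

def apply_mask(embedding, mask):
    """Element-wise mask: amplifies some features and de-emphasizes others."""
    return [e * m for e, m in zip(embedding, mask)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def conditioned_embedding(embedding, masks, mask_scores):
    """Weighted sum of the masked (subspace) representations of one item."""
    weights = softmax(mask_scores)
    subspaces = [apply_mask(embedding, m) for m in masks]
    return [
        sum(w * sub[i] for w, sub in zip(weights, subspaces))
        for i in range(len(embedding))
    ]

# Toy values, all hypothetical.
shirt = [0.8, 0.1, 0.5, 0.3]
masks = [
    [1.0, 0.0, 1.0, 0.0],  # e.g. a subspace emphasizing color-like features
    [0.0, 1.0, 0.0, 1.0],  # e.g. a subspace emphasizing style-like features
]

# In the described system a second model would map the pair
# (category of the input item, category of the target item) to these scores.
scores_for_shoes = [2.0, 0.0]
shirt_for_shoes = conditioned_embedding(shirt, masks, scores_for_shoes)
```

Changing the scores changes which features of the same shirt dominate the result, which is how one item can be represented differently depending on whether the customer is looking for shoes or a hat.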

The whole system is trained using an evaluation criterion that accounts for the outfit. Each training sample includes an outfit as well as items that go well with that outfit and a group of items that don't, such that post-training, the system produces vector representations of every item in a catalog. Finding the best complement for a particular outfit then becomes a matter of looking up the corresponding vectors.
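A simplified Python sketch of that setup follows. The article does not give the exact training criterion, so the margin-based ranking loss, the Euclidean distance, and the averaging over outfit items are all assumptions chosen for illustration; the catalog entries and function names are hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two item vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def outfit_distance(outfit, candidate):
    """Mean distance from a candidate item to every item in the outfit."""
    return sum(distance(item, candidate) for item in outfit) / len(outfit)

def ranking_loss(outfit, positive, negatives, margin=0.2):
    """Penalize negatives that are not at least `margin` farther than the positive."""
    pos = outfit_distance(outfit, positive)
    penalties = [
        max(0.0, pos - outfit_distance(outfit, neg) + margin)
        for neg in negatives
    ]
    return sum(penalties) / len(penalties)

def retrieve(outfit, catalog, k=1):
    """Return the names of the k catalog items closest to the outfit."""
    ranked = sorted(catalog, key=lambda name: outfit_distance(outfit, catalog[name]))
    return ranked[:k]

# Toy two-dimensional vectors, all hypothetical.
outfit = [[0.0, 0.0], [0.2, 0.0]]
catalog = {
    "sneakers": [0.1, 0.1],
    "sandals": [0.9, 0.8],
    "boots": [0.3, 0.2],
}

best = retrieve(outfit, catalog, k=2)
loss = ranking_loss(outfit, catalog["sneakers"], [catalog["sandals"]])
```

Once the vectors exist, retrieval is only a nearest-neighbor lookup. A production system would likely use an approximate nearest-neighbor index rather than sorting the whole catalog.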

In tests that use two standard measures of garment complementarity, the system outperformed its three top predecessors, with 56.19% fill-in-the-blank accuracy and 87% compatibility area under the curve, while enabling more efficient item retrieval. It also achieved state-of-the-art results on data sets crawled from multiple online shopping websites (including Amazon and Like.com).
