Facebook today announced improvements to the shopping experiences across its platform, including Facebook Shops, a new way for businesses to set up a single online store for customers to access on Facebook and Instagram. The company characterized the new products — all of which are powered by a family of new AI and machine learning systems — as a step toward its vision of an all-in-one AI assistant that can search and rank products while personalizing its recommendations to individual tastes.
Ecommerce platforms like Facebook Marketplace lean on AI to automate a host of behind-the-scenes tasks, from learning preferences and body types to understanding the factors that might influence purchase decisions. McKinsey estimates that Amazon, which recently deployed AI to handle incoming shopper inquiries, generates 35% of all sales from its product recommendation engine. Beyond ranking, AI from startups like ModiFace, Vue.ai, Edited, Syte, and Adverity enables customers to try on shades of lipstick virtually, see model images in every size, and spot trends and sales over time.
“We’re seeing a lot of small businesses that never had online presences get online for the first time [as a result of the pandemic],” said Facebook CEO Mark Zuckerberg during a livestream this afternoon, revealing that over 160 million small businesses around the world use the platform’s services. “This isn’t going to make up for all of their lost business, but it can help. And for lots of small businesses during this period, this is the difference between staying afloat or going under … Facebook is uniquely positioned to be a champion for small businesses and what helps them grow and what keeps them healthy.”
Facebook says its AI-powered shopping systems segment, detect, and classify images to know where products appear and deliver shopping suggestions. One of those systems — GrokNet — was trained on seven data sets containing images of products that millions of users post, buy, and sell in dozens of categories, ranging from SUVs to stiletto heels to side tables. Another creates 3D views from 2D videos of products, even those obscured by dim or overly bright lighting, while a third spotlights apparel like scarves, ties, and more that might be partially obscured by their surroundings.
Facebook says that GrokNet, which can detect exact, similar (via related attributes), and co-occurring products across billions of photos, performs searches and filtering on Marketplace at least twice as accurately as the algorithm it replaced. For instance, it’s able to identify 90% of home and garden listings compared with Facebook’s text-based attribution systems, which can only identify 33%. In addition to generating tags for colors and materials from images before Marketplace sellers list an item, as part of a limited test, it’s tagging products on Facebook Pages when Page admins upload a photo.
In the course of pre-training GrokNet on 3.5 billion images and 17,000 hashtags and fine-tuning it on internal data sets across 96 Nvidia V100 graphics cards, Facebook says it used real-world seller photos with “challenging” angles along with catalog-style spreads. To make it as inclusive as possible for all countries, languages, ages, sizes, and cultures, it sampled examples of different body types, skin tones, locations, socioeconomic classes, ages, and poses.
Rather than manually annotate each image with product identifiers, which would have taken ages — there are 3 million possible identifiers — Facebook developed a technique to automatically generate additional identifiers using GrokNet as a feedback loop. Leveraging an object detector, the approach identifies boxes in images surrounding likely products, after which it matches the boxes against a list of known products to keep matches within a similarity threshold. The resulting matches are added to the training set.
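Facebook hasn’t published the details of this feedback loop, but the described approach — detect boxes, embed them, match against known products, and keep only confident matches — can be sketched roughly as follows. The `SIMILARITY_THRESHOLD` value, the cosine-similarity metric, and the function names are all assumptions for illustration, not Facebook’s actual implementation.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # assumed cutoff; the real value isn't public

def cosine_similarity(a, b):
    """Standard cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def auto_label(detected_boxes, box_embeddings, catalog):
    """Match detector boxes against known product embeddings and return
    (box, product_id) pairs confident enough to add to the training set."""
    new_examples = []
    for box, emb in zip(detected_boxes, box_embeddings):
        best_id, best_sim = None, 0.0
        for product_id, product_emb in catalog.items():
            sim = cosine_similarity(emb, product_emb)
            if sim > best_sim:
                best_id, best_sim = product_id, sim
        if best_sim >= SIMILARITY_THRESHOLD:
            new_examples.append((box, best_id))
    return new_examples
```

In the article’s terms, `detected_boxes` come from the object detector, `catalog` holds embeddings of known products, and the returned matches feed back into GrokNet’s training data.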
Facebook also took advantage of the fact that each training data set has an inherent level of difficulty. Easier tasks don’t need that many images or annotations, while more difficult tasks require more. Company engineers improved GrokNet’s accuracy on tasks simultaneously by allocating most of the training to challenging sets and only a few images per batch to simpler ones.
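That allocation strategy amounts to weighted sampling when composing each training batch. Here is a minimal sketch of the idea; the dataset names, weights, and batch size are hypothetical, since Facebook hasn’t disclosed the actual proportions.

```python
import random

# Assumed per-dataset difficulty weights; harder sets get more slots per batch.
DATASETS = {
    "exact_product_match": {"examples": list(range(1000)), "weight": 6},
    "category_tags":       {"examples": list(range(1000)), "weight": 1},
    "color_attributes":    {"examples": list(range(1000)), "weight": 1},
}

def build_batch(batch_size=64, seed=0):
    """Fill one training batch proportionally to each dataset's weight."""
    rng = random.Random(seed)
    total = sum(d["weight"] for d in DATASETS.values())
    batch = []
    for name, d in DATASETS.items():
        n = round(batch_size * d["weight"] / total)
        batch += [(name, ex) for ex in rng.sample(d["examples"], n)]
    return batch
```

With these weights, a 64-example batch devotes 48 slots to the difficult exact-match set and 8 each to the two simpler attribute sets, so all tasks improve in the same training run.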
The productized GrokNet, which has 83 loss functions — i.e., functions that map predicted values onto numbers representing the cost of those predictions — can predict a range of properties for a given image, including its category, attributes, and likely search queries. Using just 256 bits to represent each product, it produces embeddings akin to fingerprints that can be used in tasks like product recognition, visual search, visually similar product recommendations, ranking, personalization, price suggestions, and canonicalization.
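One common way to compress a float embedding to a fixed number of bits is sign-based random projection, where similar products end up with fingerprints that differ in few bits. Facebook hasn’t said which binarization scheme GrokNet uses, so the sketch below — including the 512-dimensional input embedding — is an assumption; only the 256-bit output size comes from the article.

```python
import numpy as np

EMBED_DIM = 512   # assumed size of the float embedding
NUM_BITS = 256    # per the article: 256 bits per product

rng = np.random.default_rng(42)
projection = rng.standard_normal((EMBED_DIM, NUM_BITS))  # random hyperplanes

def binarize(embedding):
    """Hash a float embedding down to 256 bits (sign of random projections)."""
    return (embedding @ projection > 0).astype(np.uint8)

def hamming_distance(a, b):
    """Bit-level distance; near-duplicate products score close to zero."""
    return int(np.count_nonzero(a != b))
```

Comparing two products then costs a 256-bit XOR-and-count rather than a full float dot product, which is what makes fingerprint lookups feasible across billions of photos.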
In the future, Facebook says it will employ GrokNet to power storefronts on Marketplace so that customers can more easily find products, see how those products are being worn, and receive relevant accessory recommendations. “This universal model allows us to leverage many more sources of information, which increases our accuracy and outperforms our single vertical-focused models,” the company wrote. “Considering [all these] kinds of issues from the start ensures that our attribute models work well for everyone.”
3D views and AR try-on
A complementary AI model powers Facebook’s 3D views feature, which is now available on Marketplace for iOS in a test. Building on the 3D Photos tool Facebook introduced in February, it takes a video shot with a smartphone and post-processes it to create an interactive, pseudo-3D representation that can be spun and moved up to 360 degrees.
Facebook uses a method called simultaneous localization and mapping (SLAM) for the reconstruction, where a map of an unknown environment or object is created and updated while an agent’s (smartphone’s) location is simultaneously tracked. The smartphone’s poses are reconstructed in 3D space, and its paths are smoothed with a system that detects abnormal gaps and maps each pose into a coordinate space that corrects for discontinuities. To maintain consistency, the smoothed camera paths are mapped back into the original space; this reintroduces the original discontinuities but ensures that objects remain recognizable.
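The gap-detection-plus-smoothing step can be illustrated with a much simpler stand-in: split the camera path wherever consecutive poses jump by more than a threshold, smooth each segment independently, and rejoin them so the discontinuities themselves aren’t blurred away. The threshold, window size, and moving-average filter below are illustrative assumptions, not Facebook’s actual pipeline.

```python
import numpy as np

GAP_THRESHOLD = 0.5  # assumed: jumps larger than this count as discontinuities

def _moving_average(x, window):
    """Smooth a 1D signal with edge padding so endpoints aren't pulled to zero."""
    w = min(window, len(x))
    padded = np.pad(x, (w // 2, w - 1 - w // 2), mode="edge")
    return np.convolve(padded, np.ones(w) / w, mode="valid")

def smooth_path(positions, window=5):
    """Split a camera path at abnormal gaps, smooth each segment separately,
    and rejoin, preserving the discontinuities between segments."""
    positions = np.asarray(positions, dtype=float)
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    cuts = [0] + [i + 1 for i, s in enumerate(steps) if s > GAP_THRESHOLD]
    cuts.append(len(positions))
    segments = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        seg = positions[a:b]
        segments.append(np.column_stack(
            [_moving_average(seg[:, d], window) for d in range(seg.shape[1])]
        ))
    return np.vstack(segments)
```

A constant or already-smooth segment passes through unchanged, while jitter within a segment is averaged out without smearing motion across a cut.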
Facebook’s SLAM technique also combines observations from frames to obtain a sparse point cloud, which consists of the most prominent features from any given captured scene. This cloud serves as guidance to the camera poses that correspond to viewpoints best representing objects in 3D; images are distorted in such a way that they look like they were taken from the viewpoints. A heuristic outlier detector finds key points that could introduce distortions and discards them, while similarity constraints make the featureless parts of the reconstructions more rigid and out-of-focus areas look more natural.
Beyond 3D reconstructions, Facebook says that it will soon draw on its Spark AR platform to allow customers to see how items look in various places. (Already, brands like Nyx, Nars, and Ray-Ban use it in Facebook Ads and Instagram to power augmented reality “try-on” experiences.) The company plans to support try-on for a wider variety of items — including home decor and furniture — across apps and services including Shops, Facebook’s feature that enables businesses to sell directly through the network.
To imbue services like Marketplace with the ability to automatically isolate clothing products within images, Facebook developed a segmentation technology it claims achieves state-of-the-art performance compared with several baselines. The tech — an “operator” called Instance Mask Projection — can spot items like wristbands, necklaces, skirts, and sweaters photographed in uneven lighting or partially obscured, or even shown in different poses and layered under other items like shirts and jackets.
Instance Mask Projection detects a clothing product as a whole and roughly predicts its shape. This prediction serves as a guide to refine the estimate for each pixel, allowing global information from the detection to be incorporated. The predicted instance maps are projected into a feature map that’s used as input for semantic segmentation. According to Facebook, this design makes the operator suitable for clothing parsing (which involves complex layering, large deformations, and non-convex objects) as well as street-scene segmentation (overlapping instances and small objects).
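The projection step — turning coarse per-instance predictions into a dense map the segmentation head can consume — can be sketched with plain arrays. Here each detection is reduced to a class, a confidence score, and a box; the real operator projects predicted instance masks, so this box-based version, the class list, and the max-score overlap rule are simplifying assumptions.

```python
import numpy as np

NUM_CLASSES = 3  # assumed label set, e.g. 0=skirt, 1=sweater, 2=necklace

def project_instances(detections, height, width):
    """Rasterize coarse instance predictions into a (C, H, W) feature map
    that a semantic-segmentation head can consume alongside CNN features.

    detections: list of (class_index, score, (x0, y0, x1, y1)) tuples.
    """
    feature_map = np.zeros((NUM_CLASSES, height, width), dtype=np.float32)
    for cls, score, (x0, y0, x1, y1) in detections:
        # Paint the detection's score into its class channel; where
        # instances overlap, keep the maximum score per pixel.
        region = feature_map[cls, y0:y1, x0:x1]
        feature_map[cls, y0:y1, x0:x1] = np.maximum(region, score)
    return feature_map
```

The segmentation network then refines these coarse channels pixel by pixel, which is how global detection evidence — “there is a sweater roughly here” — guides the fine-grained per-pixel labels.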
Facebook says it’s training its product recognition systems with the operator across dozens of product categories, patterns, textures, styles, and occasions, including lighting and tableware. It’s also enhancing the tech to detect objects in 3D photos, and in a related effort, it’s developing a body-aware embedding to detect clothing that might be flattering for a person’s shape.
“Today, we can understand that a person is wearing a suede-collared polka-dot dress, even if half of her is hidden behind her office desk. We can also understand whether that desk is made of wood or metal,” said Facebook in a statement. “As we work toward our long-term goal of teaching these systems to understand a person’s taste and style — and the context that matters when that person searches for a product — we need to push additional breakthroughs.”
Toward an AI fashion assistant
Facebook says its goal is to one day combine these disparate approaches into a system that can serve up product recommendations on the fly, matched to individual tastes and styles. It envisions an assistant that can learn preferences by analyzing images of what’s in a person’s wardrobe, for instance, and that allows the person to try favorites on self-replicas and sell apparel that others can preview.
To this end, Facebook says its researchers are prototyping an “intelligent digital closet” that provides not only outfit suggestions based on planned activities or weather, but also fashion inspiration informed by individual products and aesthetics.
It’s like a hardware-free, ostensibly more sophisticated take on the Echo Look, Amazon’s discontinued AI-powered camera that told customers how their outfits looked and kept track of what was in their wardrobe while recommending clothes to buy from Amazon.com. Companies like Stitch Fix, too, use algorithms to help pick out clothes sent to customers, choose the clothes kept in inventory, and keep track of things customers found online that they love.
Facebook anticipates that new systems will ultimately be required to adapt to changing trends and preferences, ideally systems that learn from feedback on images of potentially desirable products. It recently made progress with Fashion++, which uses AI to suggest personalized style advice like adding a belt or half-tucking a shirt. But the company says advancements in language understanding, personalization, and “social-first” experiences must emerge before a truly predictive fashion assistant becomes a possibility.
“We envision a future in which [a] system could … incorporate your friends’ recommendations on museums, restaurants, or the best ceramics class in the city — enabling you to more easily shop for those types of experiences,” said Facebook. “Our long-term vision is to build an all-in-one AI lifestyle assistant that can accurately search and rank billions of products, while personalizing to individual tastes. That same system would make online shopping just as social as shopping with friends in real life. Going one step further, it would advance visual search to make your real-world environment shoppable. If you see something you like (clothing, furniture, electronics, etc.), you could snap a photo of it and the system would find that exact item, as well as several similar ones to purchase right then and there.”
Facebook’s renewed focus on ecommerce comes as the company contends with flattening ad sales resulting from the pandemic. Even as online sales skyrocketed over the past few months, Facebook declined to increase Marketplace’s commission — 5%, compared to Amazon and Walmart’s 15% — likely to maintain a competitive edge. Some analysts estimate Marketplace will become a $5 billion-plus annual revenue stream for Facebook in the long term, all else being equal.