Facebook researchers this week introduced Situated Interactive MultiModal Conversations (SIMMC), a novel research direction aimed at training AI chatbots that take actions like showing an object and explaining what it’s made of in response to images, memories of previous interactions, and individual requests. In a technical paper, they detail new data sets designed for this purpose containing around 13,000 human-to-human dialogs across two domains — furniture and fashion — along with several tasks framed as objective evaluation protocols.
Facebook would appear to be working toward an assistant capable of processing data a user and the assistant co-observe, and then outputting replies beyond just plain text based on this data. The hope is that this assistant emulates human chat partners by responding to images, messages, and messages about images as naturally as a person might. For example, given the prompt “I want to buy some chairs — show me brown ones and tell me about the materials,” the assistant might reply with an image of brown chairs and the text “How do you like these? They have a solid brown color with a foam fitting.”
SIMMC supports the development of such an assistant with the aforementioned data sets and new technical tasks, which address task-oriented dialogs encompassing multimodal user contexts in the form of a co-observed image or a virtual reality environment. This context gets updated dynamically based on the dialog flow and the assistant’s actions.
In SIMMC-Furniture, the furniture-focused data set, a user interacts with a conversational assistant to get recommendations for items like couches and side tables. To create it, the Facebook researchers built a virtual environment within Unity where volunteers were connected randomly with humans posing as a virtual, full-featured assistant. The users could ask to see a particular type of furniture, and the assistant could filter a catalog of 3D Wayfair assets by price, color, material, and more while navigating through the filtered results to share their view in focused (i.e., zoomed-in) or carousel (three slots containing three different items) presentations.
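To make the assistant-side tooling concrete, here is a minimal sketch of filtering a catalog and paging the results into three-slot carousel views. The `Item` fields, function names, and sample catalog are all hypothetical; the researchers' actual Unity tooling is not described at this level of detail.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    # Hypothetical catalog record; the real asset schema is not given in the article.
    name: str
    price: float
    color: str
    material: str


def filter_catalog(catalog: List[Item], color: Optional[str] = None,
                   material: Optional[str] = None,
                   max_price: Optional[float] = None) -> List[Item]:
    """Keep items that satisfy every constraint that was actually given."""
    return [
        item for item in catalog
        if (color is None or item.color == color)
        and (material is None or item.material == material)
        and (max_price is None or item.price <= max_price)
    ]


def carousel_pages(items: List[Item], slots: int = 3) -> List[List[Item]]:
    """Split results into carousel views of up to `slots` items each."""
    return [items[i:i + slots] for i in range(0, len(items), slots)]


# Made-up sample data for illustration only.
CATALOG = [
    Item("armchair", 220.0, "brown", "leather"),
    Item("side table", 90.0, "brown", "wood"),
    Item("couch", 480.0, "black", "fabric"),
    Item("stool", 45.0, "brown", "wood"),
    Item("bench", 150.0, "brown", "wood"),
]

brown = filter_catalog(CATALOG, color="brown")
pages = carousel_pages(brown)  # carousel view: pages of up to three items
focused = pages[0][0]          # focused view: a single item
```

In this toy version, the carousel is just a paged list and the focused view is a single element of it; the shared 3D view in the actual environment is of course richer.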
Meanwhile, in the SIMMC-Fashion data set, users asked humans posing as virtual assistants for jacket, dress, and other clothing and accessory suggestions. Within the same Unity environment, assistants could sort by price, brand, color, and more as the users browsed and explored options informed by preferences and visual scenes, memories, and assistant-recommended items.
For both corpora, the researchers noted which items appeared in each view. They also developed an ontology to capture the multimodal interactions within dialog flows and provide semantics for assistant and user utterances. It consists of four primary components: objects, activities (e.g., “add to cart”), attributes (e.g., “brand”), and dialog acts (e.g., “ask”). To complement this, they derived a labeling language for annotating dialog exchanges, so that the SIMMC annotations tie the objects in a scene to their corresponding dialog annotations.
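One way to picture an annotated turn is as a small record that combines the four ontology components with the list of items in view. The sketch below is an illustrative assumption, not the paper's labeling language: the class names, field names, and the example turn are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    # The four ontology components named in the article.
    dialog_act: str                                       # e.g. "ask"
    activity: str                                         # e.g. "add to cart"
    attributes: Dict[str, str] = field(default_factory=dict)
    object_ids: List[int] = field(default_factory=list)   # objects referred to


@dataclass
class Turn:
    speaker: str                   # "user" or "assistant"
    utterance: str
    annotation: Annotation
    visible_object_ids: List[int]  # items that appeared in the shared view


def referenced_but_not_visible(turn: Turn) -> List[int]:
    """Objects the utterance refers to that are not in the current view
    (for example, items remembered from earlier in the dialog)."""
    visible = set(turn.visible_object_ids)
    return [oid for oid in turn.annotation.object_ids if oid not in visible]


# Invented example turn for illustration only.
turn = Turn(
    speaker="user",
    utterance="Add the brown one and the chair from before to my cart",
    annotation=Annotation(
        dialog_act="ask",
        activity="add to cart",
        attributes={"color": "brown"},
        object_ids=[2, 7],
    ),
    visible_object_ids=[1, 2, 3],
)
```

Keeping object references separate from the visible set makes it easy to tell apart items on screen from items recalled from earlier turns.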
Building on these data sets, the Facebook researchers built a basic assistant consisting of four components: an utterance and history encoder, multimodal fusion, an action predictor, and a response generator.
- The utterance and history encoder creates encodings (numerical representations) from user replies and the dialog history.
- The multimodal fusion step combines information from the text and multimodal context into a mathematical object called a tensor.
- The action predictor predicts actions to be taken by the assistant by transforming the tensor into another object called a vector, and then by predicting an API the assistant might need to call.
- The response generator generates an assistant response that’s semantically relevant to users’ requests. For example, given the request “Show me black couches less than $500,” the generator might reply “Here are some” or “Sorry, we do not have any black couches cheaper than $500” based on available inventory.
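The four components above can be sketched as a toy pipeline. This is a deliberately simplified stand-in, assuming bag-of-words encodings, concatenation as the fusion step, hand-set linear scores, and template responses; the researchers' models are learned neural networks, and the vocabulary, action names, and weights here are all hypothetical.

```python
import re
from typing import List

VOCAB = ["show", "couches", "black", "add", "cart"]         # toy vocabulary (assumption)
ACTIONS = ["search_catalog", "add_to_cart", "no_action"]    # hypothetical API names
# One weight per fused feature: five word counts plus "items in view".
WEIGHTS = {
    "search_catalog": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    "add_to_cart":    [0.0, 0.0, 0.0, 1.0, 1.0, 0.5],
    "no_action":      [0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
}


def encode(utterance: str, history: List[str]) -> List[float]:
    """Utterance and history encoder: word counts over the toy vocabulary."""
    tokens = re.findall(r"[a-z]+", " ".join(history + [utterance]).lower())
    return [float(tokens.count(word)) for word in VOCAB]


def fuse(text_encoding: List[float], context_features: List[float]) -> List[float]:
    """Multimodal fusion: here, simple concatenation of text and context."""
    return text_encoding + context_features


def predict_action(fused: List[float]) -> str:
    """Action predictor: pick the API with the highest linear score."""
    def score(action: str) -> float:
        return sum(w * x for w, x in zip(WEIGHTS[action], fused))
    return max(ACTIONS, key=score)


def respond(action: str, n_matches: int) -> str:
    """Response generator: template text conditioned on action and inventory."""
    if action == "search_catalog":
        if n_matches > 0:
            return "Here are some."
        return "Sorry, we do not have any matching items."
    if action == "add_to_cart":
        return "Added to your cart."
    return "How else can I help?"


# No items are in view yet, so the single context feature is 0.0.
fused = fuse(encode("Show me black couches less than $500", []), [0.0])
action = predict_action(fused)
reply = respond(action, n_matches=0)
```

The point of the sketch is the data flow: text and context are encoded separately, merged, mapped to an API call, and only then turned into a reply that depends on what the catalog returns.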
After training the models on both SIMMC-Fashion and SIMMC-Furniture, the researchers found that they outperformed two baseline AI systems across a number of metrics. Despite not leveraging the fine-grained annotations, the best-performing action predictor chose the right API 79.6% of the time for the SIMMC-Furniture corpus and 85.1% of the time for SIMMC-Fashion. Facebook says that it will publicly release the data, annotations, code, and models in the future.
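A figure like "chose the right API 79.6% of the time" is a plain accuracy: the share of turns where the predicted API matches the annotated one. A minimal sketch follows, with made-up labels; the function name is hypothetical and the paper's full evaluation covers more metrics than this.

```python
from typing import Sequence


def api_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of turns where the predicted API equals the annotated API."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy over zero turns")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Invented labels for illustration only.
predicted = ["search", "add_to_cart", "search", "none"]
gold = ["search", "add_to_cart", "none", "none"]
accuracy = api_accuracy(predicted, gold)
```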
The research follows Facebook’s detailing of the AI systems behind its shopping experiences, which continue to evolve across Instagram, WhatsApp, and Facebook proper. The company says its goal is to one day combine its approaches into a system that can serve up product recommendations on the fly, matched to individual tastes and styles — a sort of hardware-free take on the recently discontinued Echo Look, Amazon’s AI-powered camera that told customers how their outfits looked and kept track of their wardrobe.