Researchers design AI that can infer whole floor plans from short video clips

Floor plans are useful for visualizing spaces, planning routes, and communicating architectural designs. A robot entering a new building, for instance, can use a floor plan to quickly sense the overall layout. Creating floor plans typically requires a full walkthrough so 3D sensors and cameras can capture the entirety of a space. But researchers at Facebook, the University of Texas at Austin, and Carnegie Mellon University are exploring an AI technique that leverages visuals and audio to reconstruct a floor plan from a short video clip.

The researchers assert that audio provides spatial and semantic signals complementing the mapping capabilities of images. They say this is because sound is inherently driven by the geometry of objects. Audio reflections bounce off surfaces and reveal the shape of a room, far beyond a camera's field of view. Sounds heard from afar -- even multiple rooms away -- can reveal the existence of "free spaces" where sounding objects might exist (e.g., a dog barking in another room). Moreover, hearing sounds from different directions exposes layouts based on the activities or things those sounds represent. A shower running might suggest the direction of the bathroom, for example, while microwave beeps suggest a kitchen.

The researchers' approach, which they call AV-Map, aims to convert short videos with multichannel audio into 2D floor plans. A machine learning model leverages sequences of audio and visual data to reason about the structure and semantics of the floor plan, finally fusing information from audio and video using a decoder component. The floor plans AV-Map generates, which extend significantly beyond the area directly observable in the video, show free space and occupied regions divided into a discrete set of semantic room labels (e.g., family room and kitchen).

The team experimented with two settings, active and passive, in digital environments from the popular Matternet3D and SoundSpaces datasets loaded into Facebook's AI Habitat. In the first, they used a virtual camera to emit a known sound while it moved throughout the room of a model home. In the second, they relied only on naturally occurring sounds made by objects and people inside the home.

Across videos recorded in 85 large, real-world, multiroom environments within AI Habitat, the researchers say AV-Map not only consistently outperformed traditional vision-based mapping but improved the state-of-the-art technique for extrapolating occupancy maps beyond visible regions. With just a few glimpses spanning 26% of an area, AV-Map could estimate the whole area with 66% accuracy.

"A short video walk through a house can reconstruct the visible portions of the floorplan but is blind to many areas. We introduce audio-visual floor plan reconstruction, where sounds in the environment help infer both the geometric properties of the hidden areas as well as the semantic labels of the unobserved rooms (e.g., sounds of a person cooking behind a wall to the camera's left suggest the kitchen)," the researchers wrote in a paper detailing AV-Map. "In future work, we plan to consider extensions to multi-level floor plans and connect our mapping idea to a robotic agent actively controlling the camera ... To our knowledge, ours is the first attempt to infer floor plans from audio-visual data."