State-of-the-art machine learning algorithms can extract two-dimensional objects from photographs and render them faithfully in three dimensions. It’s a technique that’s applicable to augmented reality apps and robotics as well as navigation, which is why it’s an active area of research for Facebook.
In a blog post today ahead of the International Conference on Computer Vision (ICCV) in Seoul, Facebook highlighted its latest advancements with respect to intelligent content-understanding. It says that together, its systems can be used to detect even complex foreground and background objects, like the legs of a chair or overlapping furniture.
“[Our] research builds on recent advances in using deep learning to predict and localize objects in an image, as well as new tools and architectures for 3D shape understanding, like voxels, point clouds, and meshes,” wrote Facebook researchers Georgia Gkioxari, Shubham Tulsiani, and David Novotny in a blog post. “Three-dimensional understanding will play a central role in advancing the ability of AI systems to more closely understand, interpret, and operate in the real world.”
One of the works spotlighted is Mesh R-CNN, a method that’s able to predict three-dimensional shapes from images of cluttered and occluded objects.
Facebook researchers say they augmented the open source Mask R-CNN’s two-dimensional object segmentation system with a mesh prediction branch, which they further bolstered with a library, Torch3d, containing highly optimized three-dimensional operators. Mesh R-CNN effectively uses Mask R-CNN to detect and classify the various objects in an image, after which it infers three-dimensional shapes with the aforementioned predictor.
Facebook says that, evaluated on the publicly available Pix3D corpus, Mesh R-CNN successfully detects objects of all categories and estimates their full three-dimensional shape across scenes of furniture. On a separate data set — ShapeNet — Mesh R-CNN outperformed prior work by a 7% relative margin.
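To make the phrase "relative margin" concrete, here is a minimal sketch of how a relative improvement differs from an absolute one. The scores below are hypothetical placeholders chosen for easy arithmetic, not figures reported by Facebook, and the helper name `relative_margin` is an assumption for illustration. It also assumes a metric where higher is better.

```python
def relative_margin(new_score, baseline_score):
    """Improvement expressed as a fraction of the baseline score."""
    return (new_score - baseline_score) / baseline_score


# Hypothetical scores, for illustration only (not reported results).
baseline = 50.0
improved = 53.5

margin = relative_margin(improved, baseline)   # fraction of the baseline
absolute_gain = improved - baseline            # gain in raw metric points

print(f"relative margin: {margin:.1%}, absolute gain: {absolute_gain} points")
```

Under these made-up numbers, a 7% relative margin corresponds to a gain of 3.5 points, which is why a relative figure cannot be read as an absolute one without knowing the baseline.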
Another Facebook-developed system — Canonical 3D Pose Networks, cheekily shortened to C3DPO — addresses scenarios where meshes and corresponding images aren’t available for training. It builds reconstructions of three-dimensional keypoint models, achieving state-of-the-art reconstruction results using two-dimensional keypoint supervision. (Keypoints in this context refer to tracked parts of objects that provide a set of clues around the geometry and its viewpoint changes.)
C3DPO taps a reconstruction model that predicts the parameters of the corresponding camera viewpoint and the three-dimensional keypoint locations. An auxiliary component learns alongside the model to address the ambiguity introduced in the factorization of three-dimensional viewpoints and shapes.
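The ambiguity mentioned above can be illustrated with a small, self-contained sketch. It is not C3DPO's code: it assumes a simple orthographic camera, uses a made-up four-keypoint shape, and all function names are hypothetical. It shows that two different pairs of camera rotation and three-dimensional shape can produce identical two-dimensional keypoints, so the two-dimensional observations alone do not pin down a unique factorization.

```python
import math


def rot_x(theta):
    """3x3 rotation about the x axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_z(theta):
    """3x3 rotation about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def project(rotation, shape):
    """Orthographic projection (an assumption): rotate, then keep x and y."""
    return matmul(rotation, shape)[:2]


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical 3 x 4 shape matrix: each column is one 3D keypoint.
shape = [[0.0, 1.0, 1.0, 0.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.5, 1.0, 1.5]]
camera = rot_x(0.4)

# Rotate the shape by an arbitrary Q and compensate in the camera with Q^T.
q = rot_z(1.1)
alt_shape = matmul(q, shape)
alt_camera = matmul(camera, transpose(q))

keypoints_a = project(camera, shape)
keypoints_b = project(alt_camera, alt_shape)

projection_diff = max_abs_diff(keypoints_a, keypoints_b)
shape_diff = max_abs_diff(shape, alt_shape)
print(f"2D keypoints differ by {projection_diff:.2e}")
print(f"3D shapes differ by {shape_diff:.2f}")
```

Because the compensating rotation cancels out, the projected keypoints agree up to floating-point error even though the two three-dimensional shapes differ. Choosing one consistent, canonical orientation for the shape is one way to remove this freedom.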
Facebook notes that such reconstructions were previously unachievable in part because of memory constraints. C3DPO’s architecture enables three-dimensional reconstruction where hardware for capture isn’t feasible, such as with large-scale objects.
“[Three-dimensional] computer vision has many open research questions, and we are experimenting with multiple problem statements, techniques, and methods of supervision as we explore the best way to push the field forward as we did for two-dimensional understanding,” wrote Gkioxari, Tulsiani, and Novotny. “As the digital world adapts and shifts to use products like 3D Photos and immersive AR and VR experiences, we need to keep pushing sophisticated systems to more accurately understand and interact with objects in a visual scene.”