Google's Objectron uses AI to track 3D objects in 2D video

Coinciding with the kickoff of the 2020 TensorFlow Developer Summit, Google today published a pipeline -- Objectron -- that spots objects in 2D images and estimates their poses and sizes through an AI model. The company says it has implications for robotics, self-driving vehicles, image retrieval, and augmented reality -- for instance, it could help a factory floor robot avoid obstacles in real time.

Tracking 3D objects is a tricky prospect, particularly when dealing with limited compute resources (like a smartphone system-on-chip). It becomes tougher still when the only imagery available (usually video) is 2D, because training data is scarce and objects vary widely in appearance and shape.

The Google team behind Objectron, then, developed a toolset that allowed annotators to label 3D bounding boxes (i.e., box-shaped borders) for objects using a split-screen view. One side displayed 2D video frames with the 3D bounding boxes overlaid atop them, and the other showed a 3D view with point clouds, camera positions, and detected planes. Annotators drew 3D bounding boxes in the 3D view and verified their locations by reviewing the projections in the 2D video frames. For static objects, they only had to annotate the target object in a single frame; the tool propagated the object's location to all other frames using ground truth camera pose information from AR session data.
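The propagation step can be pictured with a small sketch. This is not the annotation tool's code, which the team has not detailed here; it assumes a simple pinhole camera, an axis-aligned box, and hypothetical helper names (`box_corners`, `project_point`, `propagate_box`). The idea is that a box fixed in world coordinates only needs each frame's camera pose to be redrawn in that frame.

```python
from itertools import product


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box in world coordinates."""
    cx, cy, cz = center
    sx, sy, sz = size
    return [
        (cx + dx * sx / 2, cy + dy * sy / 2, cz + dz * sz / 2)
        for dx, dy, dz in product((-1, 1), repeat=3)
    ]


def project_point(point, rotation, translation, fx, fy, px, py):
    """Project a world point to pixel coordinates with a pinhole camera.

    rotation (3x3 nested lists) and translation map world coordinates
    into camera coordinates; fx, fy, px, py are assumed intrinsics.
    """
    xc, yc, zc = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        return None  # the point is behind the camera in this frame
    return (fx * xc / zc + px, fy * yc / zc + py)


def propagate_box(center, size, camera_poses, intrinsics):
    """Project one annotated box into every frame, one pose per frame."""
    corners = box_corners(center, size)
    return [
        [project_point(c, rot, trans, *intrinsics) for c in corners]
        for rot, trans in camera_poses
    ]


# Hypothetical session: a static box annotated once, seen from two poses.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, (0.0, 0.0, 0.0)), (identity, (1.0, 0.0, 0.0))]
frames = propagate_box((0.0, 0.0, 4.0), (2.0, 2.0, 2.0), poses,
                       (300.0, 300.0, 320.0, 240.0))
```

Each entry in `frames` holds the eight projected corners for one video frame, which is what an annotator would check against the 2D image.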

To supplement the real-world data and boost the accuracy of the AI model's predictions, the team developed an engine that placed virtual objects into scenes containing AR session data. The engine used camera poses, detected planar surfaces, and estimated lighting to generate physically plausible placements with lighting that matched the scene. The result was high-quality synthetic data in which rendered objects respected the scene geometry and fit seamlessly into real backgrounds. In validation tests, accuracy increased by about 10% with the synthetic data.
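One ingredient of such a placement, resting an object on a detected plane, can be sketched as follows. This is an illustration under stated assumptions rather than the team's engine: the plane is assumed to be given as a point and a normal, and `place_on_plane` and its parameters are hypothetical names. Rendering and lighting are left out.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def place_on_plane(plane_point, plane_normal, offset_uv, object_height):
    """Return an object center whose base rests on the detected plane.

    offset_uv slides the object within the plane; the center is then
    lifted by half the object's height along the plane normal.
    """
    n = normalize(plane_normal)
    # Pick any helper axis not parallel to the normal to build in-plane axes.
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(n, helper))
    v = cross(n, u)
    du, dv = offset_uv
    lift = object_height / 2
    return tuple(
        plane_point[i] + du * u[i] + dv * v[i] + lift * n[i]
        for i in range(3)
    )


# Hypothetical floor plane through the origin with an upward normal.
center = place_on_plane((0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 2.0), 1.0)
```

Because the object is moved only within the plane and lifted along its normal, it neither floats above nor sinks into the detected surface, which is the minimum a placement needs to look physically plausible.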

Better still, the team says the current version of the Objectron model is lightweight enough to run in real time on flagship mobile devices. With the Adreno 650 mobile graphics chip found in phones like the LG V60 ThinQ, Samsung Galaxy S20+, and Sony Xperia 1 II, it's able to process around 26 frames per second.

Objectron is available in MediaPipe, a framework for building cross-platform AI pipelines that combine fast inference with media processing (like video decoding). Models trained to recognize shoes and chairs are available, as well as an end-to-end demo app.

The team says that in the future, it plans to share additional solutions with the research and development community to stimulate new use cases, applications, and research efforts. Additionally, it intends to scale the Objectron model to more categories of objects and further improve its on-device performance.
