Real-time hand shape and motion trackers are an invaluable part of sign language recognition and gesture control systems, not to mention a number of augmented reality experiences. But they’re often hobbled by occlusion and a lack of contrast patterns, preventing them from performing reliably or robustly.

Those challenges and others motivated scientists at Google to investigate a new computer vision approach to hand perception — one bolstered by machine learning. They say that in experiments, it managed to infer up to 21 3D points of a hand (or multiple hands) on a mobile phone from just a single frame.

Google previewed the new technique at the 2019 Conference on Computer Vision and Pattern Recognition in June and recently implemented it in MediaPipe, a cross-platform framework for building multimodal applied machine learning pipelines to process perceptual data of different modalities (such as video and audio). Both the source code and an end-to-end usage scenario are available on GitHub.

“The ability to perceive the shape and motion of hands can be a vital component in improving the user experience across a variety of technological domains and platforms,” wrote research engineers Valentin Bazarevsky and Fan Zhang in a blog post. “We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.”

Google’s technique comprises three AI models working in tandem: a palm detector called BlazePalm that analyzes a full frame and returns a hand bounding box; a hand landmark model that looks at the cropped image region defined by the palm detector and returns 3D hand points; and a gesture recognizer that classifies the previously computed point configuration into one of a set of gestures.

Recognizing hands isn’t an easy task; BlazePalm has to contend with a lack of features while spotting occluded and self-occluded hands. To clear those roadblocks, the team trained a palm detector instead of a hand detector, since estimating bounding boxes of rigid objects like palms and fists tends to be easier than detecting hands with articulated fingers. As an added benefit, it generalizes well to edge cases like handshakes, and it can model palms using only square bounding boxes, ignoring other aspect ratios, which reduces the number of anchors (the candidate boxes the detector evaluates) by a factor of 3-5.
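To see where a reduction of that size can come from, it helps to count the candidate boxes (often called anchors) a detector has to score. The sketch below is only an illustration: the grid sizes and aspect ratios are hypothetical values chosen for the arithmetic, not BlazePalm's actual configuration.

```python
def count_anchors(grid_sizes, aspect_ratios):
    """Count candidate boxes: one per grid cell per aspect ratio."""
    return sum(size * size * len(aspect_ratios) for size in grid_sizes)


# Hypothetical feature-map grids and box shapes, for illustration only.
GRID_SIZES = [32, 16, 8]
MULTI_RATIO = [0.5, 1.0, 2.0]  # a generic detector with three box shapes
SQUARE_ONLY = [1.0]            # square boxes only, as described for palms

multi = count_anchors(GRID_SIZES, MULTI_RATIO)
square = count_anchors(GRID_SIZES, SQUARE_ONLY)
reduction = multi / square

print(f"{multi} anchors vs. {square} square anchors: {reduction:.0f}x fewer")
```

With three aspect ratios the saving is threefold; a detector that would otherwise use five ratios would see a fivefold saving, which is consistent with the 3-5 range quoted above.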

After palm detection, the hand landmark model takes over, localizing 21 3D hand-knuckle coordinates inside the detected hand regions. Training it took roughly 30,000 real-world images manually annotated with coordinates, plus a high-quality synthetic hand model rendered over various backgrounds and mapped to the corresponding coordinates.

The last step in the pipeline is the gesture recognition system, which determines the state of each finger from joint angles and maps the set of finger states to predefined gestures. Bazarevsky and Zhang say that it’s able to recognize counting gestures from multiple cultures (e.g. American, European, and Chinese) and various hand signs including a closed fist, “OK”, “rock”, and “Spiderman”.
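A rough sketch of that last step might look like the following. It is not Google's implementation: the 150-degree threshold, the four-points-per-finger layout, and the table mapping finger states to gesture names are all assumptions made for illustration.

```python
import math

STRAIGHT_DEG = 150.0  # assumed threshold: joints straighter than this count as extended


def joint_angle(a, b, c):
    """Angle in degrees at point b between segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    norm = math.sqrt(sum(p * p for p in v1)) * math.sqrt(sum(q * q for q in v2))
    if norm == 0:
        return 0.0  # degenerate input: treat coincident points as fully bent
    cos = sum(p * q for p, q in zip(v1, v2)) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def is_extended(joints):
    """joints: four 3D points running from the base of a finger to its tip."""
    return all(
        joint_angle(joints[i], joints[i + 1], joints[i + 2]) > STRAIGHT_DEG
        for i in range(len(joints) - 2)
    )


# Hypothetical lookup table; states are ordered thumb, index, middle, ring, pinky.
GESTURES = {
    (False, False, False, False, False): "fist",
    (False, True, False, False, False): "one",
    (False, True, True, False, False): "two",
    (False, True, False, False, True): "rock",
    (True, True, False, False, True): "Spiderman",
    (True, True, True, True, True): "five",
}


def classify(fingers):
    """fingers: five joint chains ordered thumb, index, middle, ring, pinky."""
    states = tuple(is_extended(f) for f in fingers)
    return GESTURES.get(states, "unknown")
```

Passing in five joint chains, with the index and pinky straight and the rest curled, would return "rock" under this table; any combination not listed falls through to "unknown".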

The models can perform individual tasks like cropping and rendering exclusively on graphics cards to save on computation, and the palm detector only runs as necessary — the bulk of the time, hand location in subsequent video frames is inferred from the computed hand key points in the current frame. Only when the inference confidence falls below a certain threshold is the hand detection model reapplied to the whole frame.
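The detect-once, then-track behavior can be sketched as a simple loop. Everything here is a stand-in: `detect_palm` and `predict_landmarks` are hypothetical callables supplied by the caller, the 0.5 threshold and 10% box margin are assumed values, and this version re-runs the detector on the frame after confidence drops, which may differ from how the real pipeline schedules it.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max
Point = Tuple[float, float, float]


@dataclass
class HandResult:
    landmarks: List[Point]
    confidence: float


def box_from_landmarks(landmarks, margin=0.1):
    """Bounding box around the key points, padded by an assumed margin."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    return (min(xs) - pad_x, min(ys) - pad_y, max(xs) + pad_x, max(ys) + pad_y)


def track(frames, detect_palm, predict_landmarks, min_confidence=0.5):
    """Run the palm detector only when there is no trusted hand region."""
    box = None
    results = []
    detector_calls = 0
    for frame in frames:
        if box is None:
            box = detect_palm(frame)  # full-frame detection (the expensive step)
            detector_calls += 1
            if box is None:
                results.append(None)  # no hand found in this frame
                continue
        hand = predict_landmarks(frame, box)
        if hand.confidence < min_confidence:
            box = None                # tracking lost: force re-detection next frame
            results.append(None)
        else:
            box = box_from_landmarks(hand.landmarks)  # reuse for the next frame
            results.append(hand)
    return results, detector_calls
```

In a five-frame clip where confidence dips once in the middle, this loop would call the detector twice rather than five times, which is the saving the approach is after.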

In the future, Bazarevsky, Zhang, and colleagues plan to extend the technology with more robust and stable tracking, to increase the number of gestures it can reliably detect, and to support dynamic gestures that unfold over time. “We believe that publishing this technology can give an impulse to new creative ideas and applications by the members of the research and developer community at large,” they added.