Facebook and Google's AI generates 3D human poses

Predicting 3D human poses might not fall within most people's purview, but robotics, computer graphics, and other fields chiefly concerned with kinematics -- the branch of mechanics concerned with the motion of objects -- stand to benefit from systems that can do just that. Pose prediction is a task to which artificial intelligence (AI) has been applied before, somewhat recently by Google, but some prior work hit a roadblock: It stretched digital joints and bones in unnatural directions, particularly when the joints rotated.

New research by Facebook's AI Research division, Google Brain, and ETH Zurich promises to address the problem. In a paper ("Modeling Human Motion with Quaternion-based Neural Networks") published on the preprint server arXiv.org this week, researchers describe an AI system -- QuaterNet -- that improves pose generation by representing joint rotations as quaternions, a four-component number system that extends the complex numbers, and by penalizing joint position errors.
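
To make the quaternion idea concrete, here is a minimal Python sketch of how a unit quaternion encodes a rotation and applies it to a 3D vector. It is illustrative only and is not taken from the QuaterNet code; the function names (quat_from_axis_angle, quat_multiply, quat_rotate) are hypothetical helpers chosen for this example.

```python
import math

def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle / 2.0) / norm
    return (math.cos(angle / 2.0), ax * s, ay * s, az * s)

def quat_multiply(q, r):
    """Hamilton product q * r of two quaternions."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )

def quat_rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q via q * (0, v) * conj(q)."""
    w, x, y, z = q
    conj = (w, -x, -y, -z)
    _, rx, ry, rz = quat_multiply(quat_multiply(q, (0.0, *v)), conj)
    return (rx, ry, rz)

# A 90-degree rotation about the z-axis carries the x-axis onto the y-axis.
q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
rotated = quat_rotate(q, (1.0, 0.0, 0.0))
```

Four numbers per joint are enough to describe any 3D rotation this way, and the representation has no gimbal lock, a known weakness of Euler angles.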

As the coauthors of the paper explain, recurrent neural networks -- a type of AI algorithm capable of learning long-term dependencies -- have historically been used to perform both short- and long-term pose prediction, while convolutional neural networks -- algorithms highly adept at analyzing visual imagery -- have been successfully applied to long-term generation of locomotion (movement from one place to another). But a perfect model remains elusive, owing to the inherent randomness of human poses.

"Human motion is a stochastic process with a high level of uncertainty," the researchers wrote. "For a given past, there will be multiple likely sequences of future frames and uncertainty grows with duration."

Most models employ transition operators to predict next poses given previous poses. During training, they ingest recorded frames and are asked to output the recorded frames that follow, which works well for the most part. But because the models never see their own predictions during training, they aren't exposed to their own errors and so can't learn to recover from them.
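
The effect can be seen in a toy Python sketch. Everything here is hypothetical and deliberately simplistic: a single "joint angle" that truly advances by 1.0 per frame, and an imperfect model (model_step) that advances it by 1.1. When the model is always fed recorded frames, its error looks small and constant; when it is fed its own outputs, the same small error accumulates.

```python
def true_step(angle):
    """Ground-truth motion: the angle advances by 1.0 per frame (toy assumption)."""
    return angle + 1.0

def model_step(angle):
    """Hypothetical imperfect transition operator with a small systematic error."""
    return angle + 1.1

def teacher_forced_errors(start, steps):
    """Per-frame error when the model's input is always the recorded frame."""
    errors = []
    truth = start
    for _ in range(steps):
        prediction = model_step(truth)  # input: recorded frame
        truth = true_step(truth)
        errors.append(abs(prediction - truth))
    return errors

def free_running_errors(start, steps):
    """Per-frame error when the model's input is its own previous output."""
    errors = []
    truth = start
    state = start
    for _ in range(steps):
        state = model_step(state)  # input: the model's own prediction
        truth = true_step(truth)
        errors.append(abs(state - truth))
    return errors

forced = teacher_forced_errors(0.0, 10)
free = free_running_errors(0.0, 10)
```

In this sketch the teacher-forced error stays near 0.1 on every frame, while the free-running error grows by roughly 0.1 per frame. That drift stays hidden from any model that only ever sees recorded inputs.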

The researchers' proposed system, by contrast, employs a convolutional neural network that looks at past frames, learning over time to make long-term predictions as it's progressively exposed to its own predictions. Meanwhile, the loss function -- a function that scores how far a prediction is from its target -- takes joint rotations as input and computes the position of each joint, so that errors are measured in joint positions rather than in raw rotation values. This both improves the model's stability and reduces error, the coauthors say.
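
A rough Python sketch of a position-based loss of this kind follows. It is a simplified illustration rather than the paper's implementation: it assumes a single chain of joints instead of a full skeleton, unit quaternions in (w, x, y, z) order, and fixed bone offsets. The names forward_kinematics, position_loss, and quat_about_z are hypothetical.

```python
import math

def quat_multiply(q, r):
    """Hamilton product q * r of two quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )

def quat_rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q."""
    w, x, y, z = q
    rotated = quat_multiply(quat_multiply(q, (0.0, *v)), (w, -x, -y, -z))
    return rotated[1:]

def quat_about_z(angle):
    """Unit quaternion for a rotation of `angle` radians about the z-axis."""
    return (math.cos(angle / 2.0), 0.0, 0.0, math.sin(angle / 2.0))

def forward_kinematics(rotations, offsets):
    """Joint positions for a simple chain (assumed layout, not a full skeleton).

    rotations[i] is joint i's rotation relative to its parent;
    offsets[i] is the bone vector leaving joint i, in joint i's local frame.
    """
    positions = []
    world_rotation = (1.0, 0.0, 0.0, 0.0)  # identity
    position = (0.0, 0.0, 0.0)
    for rotation, offset in zip(rotations, offsets):
        world_rotation = quat_multiply(world_rotation, rotation)
        step = quat_rotate(world_rotation, offset)
        position = tuple(p + s for p, s in zip(position, step))
        positions.append(position)
    return positions

def position_loss(predicted_rotations, target_rotations, offsets):
    """Mean Euclidean distance between predicted and target joint positions."""
    predicted = forward_kinematics(predicted_rotations, offsets)
    target = forward_kinematics(target_rotations, offsets)
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)

identity = (1.0, 0.0, 0.0, 0.0)
offsets = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # two bones of length 1
target = [identity, identity]

# The same 90-degree mistake, made at the root joint versus the last joint.
root_error = position_loss([quat_about_z(math.pi / 2), identity], target, offsets)
leaf_error = position_loss([identity, quat_about_z(math.pi / 2)], target, offsets)
```

In this sketch the same rotation mistake costs more at the root than at the end of the chain, because a root error displaces every joint downstream. A loss computed on rotation values alone would treat the two mistakes as equally bad.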

To validate the model's short-term pose prediction prowess, the researchers used Human3.6M, an open-source 3D human pose dataset containing 3.6 million human poses from seven actors performing 15 actions, along with corresponding images. Long-term generation was evaluated on a different dataset containing locomotion samples.

In the short-term prediction task, the coauthors report improvement over the Human3.6M baseline. And in the case of long-term pose generation, where the goal was generating a pose sequence given an average speed and ground trajectory, they characterize the model as "qualitatively" comparable to recent work while allowing for better control of time and space constraints.

They leave to future work extending QuaterNet to other motion-related tasks, such as action recognition or pose estimation from video, and using neural networks that "perform computations directly in quaternionic domain."
