Motion capture, the process of recording people’s movements, traditionally requires special equipment, cameras, and software. But researchers at the Max Planck Institute and Facebook Reality Labs claim they’ve developed a machine learning algorithm, PhysCap, that works with any off-the-shelf DSLR camera running at 25 frames per second. In a paper expected to be published in the journal ACM Transactions on Graphics in November 2020, the team details what they say is the first method of its kind for real-time, physically plausible 3D motion capture that accounts for environmental constraints like floor placement. PhysCap ostensibly achieves state-of-the-art accuracy on existing benchmarks and qualitatively improves stability at training time.

Motion capture is a core part of modern film, game, and app development. Attempts to make motion capture practical for amateur videographers have ranged from a $2,500 suit to a commercially available framework that leverages Microsoft’s depth-sensing Kinect. But these are imperfect. Even the best human pose estimation systems struggle to produce smooth animations, yielding 3D models with improper balance, inaccurate body leaning, and other artifacts of instability. PhysCap, on the other hand, reportedly captures physically and anatomically correct poses that adhere to physics constraints.

In its first stage, PhysCap estimates 3D body poses in a purely kinematic way with a convolutional neural network (CNN) that infers combined 2D and 3D joint positions from a video. After some refinement, the second stage begins, in which a second CNN predicts foot contact and motion states for every frame. (This CNN detects heel and forefoot placement on the ground and classifies the observed poses into “stationary” or “non-stationary” categories.) In the final stage, the kinematic pose estimates from the first stage (in both 2D and 3D) are reproduced as closely as possible while accounting for physical constraints such as gravity, collisions, and foot placement.
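
As a rough illustration of how the contact labels from the second stage can constrain the poses from the first, here is a minimal Python sketch. It is not PhysCap’s actual physics-based optimization: it only pins a foot in place while it is labeled as in contact and keeps it from sinking below the floor. The names `FootFrame`, `apply_floor_constraints`, and `FLOOR_HEIGHT`, and the choice of y as the vertical axis, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

# Assumption for this sketch: the calibrated floor plane sits at y = 0.
FLOOR_HEIGHT = 0.0


@dataclass
class FootFrame:
    position: Vec3    # kinematic foot position (x, y, z), as from stage one
    in_contact: bool  # foot-contact label, as from stage two


def apply_floor_constraints(frames: List[FootFrame]) -> List[Vec3]:
    """Toy stand-in for the final stage: pin planted feet, forbid floor penetration."""
    corrected: List[Vec3] = []
    anchor: Optional[Vec3] = None  # where the foot was planted when contact began
    for frame in frames:
        x, y, z = frame.position
        if frame.in_contact:
            # On the first contact frame, plant the foot on the floor plane.
            if anchor is None:
                anchor = (x, FLOOR_HEIGHT, z)
            # While contact lasts, the foot stays at the anchor (no sliding).
            corrected.append(anchor)
        else:
            # Foot is free to move, but may not go below the floor.
            anchor = None
            corrected.append((x, max(y, FLOOR_HEIGHT), z))
    return corrected


if __name__ == "__main__":
    frames = [
        FootFrame((0.00, 0.10, 0.0), False),
        FootFrame((0.10, -0.02, 0.0), True),
        FootFrame((0.13, 0.01, 0.0), True),
        FootFrame((0.20, -0.05, 0.0), False),
    ]
    for position in apply_floor_constraints(frames):
        print(position)
```

A hard clamp like this can introduce visible jumps between frames, which is one reason to prefer an approach that fits the kinematic estimate as closely as possible under physical constraints rather than overwriting it.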

In experiments, the researchers tested PhysCap with a Sony DSC-RX0 camera and a PC with 32GB of RAM, a GeForce RTX 2070 graphics card, and an eight-core Ryzen 7 processor, which they used to capture and process six motion sequences in scenes acted out by two performers. The study coauthors found that while PhysCap generalized well across scenes with different backgrounds, it sometimes mispredicted foot contact and therefore foot velocity. Other limitations were the need for a calibrated floor plane and for a ground plane in the scene, which the researchers note is harder to find outdoors.

To address these limitations, the team plans to investigate modeling hand-scene interactions and contacts between a person’s legs and body in seated and reclining poses. “Since the output of PhysCap is environment-aware and the returned root position is global, it is directly suitable for virtual character animation, without any further post-processing,” the researchers wrote. “Here, applications in character animation, virtual and augmented reality, telepresence, or human-computer interaction are only a few examples of high importance for graphics.”

