Carnegie Mellon University AI researchers have created an AI agent that can translate words into physical movement. Called Joint Language-to-Pose, or JL2P, the approach combines natural language with 3D pose models. The joint embedding used for pose forecasting is trained end-to-end with curriculum learning, an approach in which the model masters shorter, easier sequences before moving on to harder objectives.

JL2P animations are limited to stick figures today, but the ability to translate words into human-like movement could someday help humanoid robots perform physical tasks in the real world or assist creatives in animating virtual characters for video games or movies.

JL2P is in line with previous works that turn words into imagery, such as Microsoft’s ObjGAN, which sketches images and storyboards from captions; Disney’s AI, which uses words in a script to create storyboards; and Nvidia’s GauGAN, which lets users paint landscapes using paintbrushes labeled with words like “trees,” “mountain,” or “sky.”

JL2P can animate actions like walking or running, playing musical instruments (such as a guitar or violin), following directional instructions (left or right), and controlling speed (fast or slow). The work, originally detailed in a July 2 paper on arXiv.org, will be presented by coauthor and CMU Language Technologies Institute researcher Chaitanya Ahuja on September 19 at the International Conference on 3D Vision in Quebec.

“We first optimize the model to predict 2 time steps conditioned on the complete sentence,” the paper reads. “This easy task helps the model learn very short pose sequences, like leg motions for walking, hand motions for waving, and torso motions for bending. Once the loss on the validation set starts increasing, we move on to the next stage in the curriculum. The model is now given twice the [number] of poses for prediction.”
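The schedule described in that passage can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' training code: the function name, the `max_len` cap, and the loss values in the usage example are all hypothetical, and "starts increasing" is interpreted here as a rise over the previous epoch in the same stage.

```python
from typing import Iterable, List


def curriculum_lengths(val_losses: Iterable[float],
                       start_len: int = 2,
                       max_len: int = 32) -> List[int]:
    """Return the number of poses predicted at each epoch.

    Training starts with `start_len` time steps. Whenever the validation
    loss rises relative to the previous epoch of the same stage, the
    prediction length doubles for the following epochs.
    `max_len` is an assumed cap, not a value taken from the paper.
    """
    length = start_len
    prev_loss = None
    schedule = []
    for loss in val_losses:
        schedule.append(length)
        if prev_loss is not None and loss > prev_loss:
            # Validation loss started increasing: move to the next stage.
            length = min(length * 2, max_len)
            prev_loss = None  # restart the comparison for the new stage
        else:
            prev_loss = loss
    return schedule


# Made-up validation losses, one per epoch, purely for illustration.
example_losses = [1.0, 0.8, 0.9, 1.5, 1.2, 1.3, 2.0]
example_schedule = curriculum_lengths(example_losses)
print(example_schedule)  # [2, 2, 2, 4, 4, 4, 8]
```

With these placeholder losses, the model would spend three epochs predicting 2 poses, three predicting 4, and then move on to 8.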

The researchers claim JL2P delivers a 9% improvement in human motion modeling over a state-of-the-art approach proposed by SRI International researchers in 2018.

JL2P is trained using the KIT Motion-Language Dataset.

Introduced in 2016 by the High Performance Humanoid Technologies lab in Germany, the data set pairs human motion with natural language descriptions, mapping 11 hours of recorded human movement to more than 6,200 English sentences of approximately eight words each.