This Nvidia neural network can apply slow motion to any video

Many high-end DSLRs and smartphones can shoot in slow motion, but not all of them can. That's because doing so is quite data-intensive: Super Slow Motion mode on Sony's Xperia XZ2 smartphone, for example, shoots 960 frames per second (fps), which is 32 times as many frames as it captures at the default 30 fps. That requires a lot of storage, not to mention a processor that's fast enough to handle every frame.

Nvidia's novel algorithm, which will be presented at the 2018 Conference on Computer Vision and Pattern Recognition in Salt Lake City this week, can slow down footage after the fact. But unlike the jittery slow motion filters that fill gaps in footage with time-stretched frames, the research team's solution uses machine learning to hallucinate new frames.

Scientists from Nvidia, the University of Massachusetts Amherst, and the University of California, Merced engineered an unsupervised, end-to-end neural network that can generate an arbitrary number of intermediate frames to create smooth slow-motion footage. They call the technique "variable-length multi-frame interpolation."

"We're taking a slow-motion effect and applying it to existing video," Jan Kautz, who leads the learning and perception team at Nvidia, told VentureBeat in a phone interview. "You can slow it down by a factor of eight or 15 -- there's no upper limit."

Here's how it works: One convolutional neural network (CNN) estimates the optical flow -- the pattern of motion of the objects, surfaces, and edges in the scene -- both forward and backward in the timeline between the two input frames. It then predicts how the pixels will move from one frame to the next, generating what's known as a flow field -- a grid of 2D vectors of predicted motion, one per pixel -- for each frame, which it fuses together to approximate a flow field for the intermediate frame.
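To make the fusing step concrete, here is a minimal sketch in plain Python. It assumes that motion between the two input frames is roughly linear and blends the forward and backward flow fields with time-dependent weights. The function name, the list-of-rows data layout, and the specific weights are illustrative assumptions rather than Nvidia's code, and in the real system a CNN produces the input flows.

```python
def approx_intermediate_flows(flow_01, flow_10, t):
    """Approximate the flow from an intermediate time t back to each input frame.

    flow_01: forward flow (frame 0 -> frame 1), a list of rows of (dx, dy) tuples.
    flow_10: backward flow (frame 1 -> frame 0), same layout.
    t:       position of the new frame, strictly between 0 (frame 0) and 1 (frame 1).

    Assumption: motion is locally linear, so the two flows can be blended
    with weights that depend only on t.
    """
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")

    flow_t0, flow_t1 = [], []
    for row_fwd, row_bwd in zip(flow_01, flow_10):
        out_t0, out_t1 = [], []
        for (fx, fy), (bx, by) in zip(row_fwd, row_bwd):
            # Flow from time t back to frame 0.
            out_t0.append((-(1 - t) * t * fx + t * t * bx,
                           -(1 - t) * t * fy + t * t * by))
            # Flow from time t forward to frame 1.
            out_t1.append(((1 - t) ** 2 * fx - t * (1 - t) * bx,
                           (1 - t) ** 2 * fy - t * (1 - t) * by))
        flow_t0.append(out_t0)
        flow_t1.append(out_t1)
    return flow_t0, flow_t1


# Toy scene: every pixel moves 4 pixels to the right between the two frames.
forward = [[(4.0, 0.0)] * 3 for _ in range(2)]
backward = [[(-4.0, 0.0)] * 3 for _ in range(2)]
to_frame0, to_frame1 = approx_intermediate_flows(forward, backward, 0.5)
print(to_frame0[0][0], to_frame1[0][0])
```

For the toy scene, the halfway frame ends up with a flow of 2 pixels to the left toward the first frame and 2 pixels to the right toward the second, which is what uniform motion should give.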

A second CNN then interpolates the optical flow, refining the approximated flow field and predicting visibility maps in order to exclude pixels occluded by objects in the frame and subsequently reduce artifacts in and around objects in motion. Finally, the visibility map is applied to the two input images, and the intermediate optical flow field is used to warp (distort) them in such a way that one frame transitions smoothly to the next.
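The warp-and-blend step can be sketched the same way. The example below is a simplified, assumption-laden illustration: it works on tiny grayscale images stored as lists of rows, uses bilinear sampling with border clamping, and treats the visibility maps as per-pixel weights between 0 (occluded) and 1 (visible). The helper names (`sample_bilinear`, `backward_warp`, `synthesize_frame`) are hypothetical, and the learned refinement that the second CNN performs is not modeled here.

```python
def sample_bilinear(img, x, y):
    """Sample a grayscale image at a fractional position, clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def backward_warp(img, flow):
    """For each output pixel, fetch the input pixel that its flow vector points to."""
    return [[sample_bilinear(img, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(len(img[0]))]
            for y in range(len(img))]


def synthesize_frame(img0, img1, flow_t0, flow_t1, vis0, vis1, t):
    """Blend the two warped inputs, weighting by time and by visibility."""
    warped0 = backward_warp(img0, flow_t0)
    warped1 = backward_warp(img1, flow_t1)
    frame = []
    for y in range(len(img0)):
        row = []
        for x in range(len(img0[0])):
            w0 = (1 - t) * vis0[y][x]  # closer in time and visible -> more weight
            w1 = t * vis1[y][x]
            total = w0 + w1
            value = (w0 * warped0[y][x] + w1 * warped1[y][x]) / total if total > 0 else 0.0
            row.append(value)
        frame.append(row)
    return frame


# A bright pixel moves from x=1 (first frame) to x=3 (second frame).
first = [[0.0, 1.0, 0.0, 0.0, 0.0]]
second = [[0.0, 0.0, 0.0, 1.0, 0.0]]
to_first = [[(-1.0, 0.0)] * 5]   # halfway frame looks 1 px left into the first frame
to_second = [[(1.0, 0.0)] * 5]   # and 1 px right into the second frame
visible = [[1.0] * 5]
print(synthesize_frame(first, second, to_first, to_second, visible, visible, 0.5))
```

In this toy case the bright pixel lands at x=2 in the halfway frame. Setting a visibility weight to 0 for a pixel makes the blend ignore that input there, which is the role the article describes for occluded pixels.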

The researchers trained the system on 240 fps videos from YouTube and handheld cameras -- including a series of clips from The Slow Mo Guys, for a corpus of 11,000 videos total -- and used Nvidia Tesla V100 GPUs and the cuDNN-accelerated PyTorch deep learning framework.

The results are impressive, to say the least -- the output videos don't exhibit the hallmark jitteriness and blurriness of slow-motion software filters. With the exception of a few jagged edges around the borders of fast-moving objects, it's tough to tell them apart from footage shot natively at high frame rates.

Another advantage of the system is that the parameters of both CNNs are independent of the specific time step being interpolated, allowing the neural net to generate as many intermediate frames as needed in parallel.
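Because the time step is just an input, a slowdown factor translates directly into a list of time steps to fill in between each pair of original frames. The sketch below assumes evenly spaced steps and an integer slowdown factor; both function names are hypothetical and only illustrate the bookkeeping, not the network itself.

```python
def intermediate_time_steps(slowdown_factor):
    """Evenly spaced time steps strictly between two consecutive input frames."""
    if slowdown_factor < 1:
        raise ValueError("slowdown_factor must be at least 1")
    return [i / slowdown_factor for i in range(1, slowdown_factor)]


def output_frame_count(input_frames, slowdown_factor):
    """Total frames after filling every gap between consecutive input frames."""
    if input_frames < 1:
        raise ValueError("need at least one input frame")
    gaps = input_frames - 1
    return input_frames + gaps * (slowdown_factor - 1)


# Slowing footage down by a factor of eight needs seven new frames per gap.
steps = intermediate_time_steps(8)
print(len(steps), steps)
print(output_frame_count(30, 8))
```

Each time step can be handed to the interpolation independently of the others, which is why the frames for one gap can be computed in parallel.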

"[O]ur approach achieves state-of-the-art results over all datasets, generating single or multiple intermediate frames," the researchers write. "It is remarkable, considering the fact that our model can be directly applied to different scenarios without any modification."

Unfortunately, it's unlikely to be commercialized anytime soon. Kautz said that the system isn't highly optimized and that getting it to run in real time will be a challenge. And he expects that when it does appear in consumer devices and apps, it will perform most of the processing in the cloud.

Still, it's a promising step forward for machine learning -- and for slow-motion enthusiasts everywhere. Here's to more overly dramatic backflips, skateboard tricks, and dogs catching balls in midair.
