Google's AutoFlip uses AI to crop videos for you

Video filmed and edited for TV is typically created and viewed in landscape, but aspect ratios like 16:9 and 4:3 don't always fit the display being used for viewing. Fortunately, Google is on the case. Today it detailed AutoFlip, an open source tool for intelligent video reframing. Given a video and a target dimension, it analyzes the video content and develops optimal tracking and cropping strategies, after which it produces an output video with the same duration in the desired aspect ratio.

As Google Research senior software engineer Nathan Frey and senior software engineer Zheng Sun note in a blog post, traditional approaches for reframing video usually involve static cropping, which often leads to unsatisfactory results. More bespoke approaches are superior, but they typically require video curators to manually identify salient content in each frame, track its transitions from frame to frame, and adjust crop regions accordingly throughout the video.

By contrast, AutoFlip is completely automatic thanks to AI object detection and tracking technologies that intelligently understand video content. The system detects changes in the composition that signify scene changes in order to isolate scenes for processing. And within each shot, it uses video analysis to identify salient content before reframing the scene, chiefly by selecting an optimized camera mode and path.

To detect when a shot in a video changes, AutoFlip computes the color histogram of each frame and compares this with prior frames. If the distribution of frame colors changes at a different rate than a sliding historical window, a shot change is signaled. AutoFlip buffers the video until the scene is complete before making reframing decisions in order to optimize the reframing for the entire scene.
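
The histogram comparison described above can be sketched in a few lines of Python. This is an illustrative approximation, not AutoFlip's actual code: the function names, the coarse 4-bins-per-channel histogram, the L1 distance, and the `window`, `ratio`, and `min_jump` thresholds are all assumptions chosen for the example.

```python
from collections import deque


def color_histogram(frame, bins_per_channel=4):
    """Normalized coarse RGB histogram; frame is an iterable of (r, g, b) in 0..255."""
    hist = [0.0] * (bins_per_channel ** 3)
    step = 256 // bins_per_channel
    count = 0
    for r, g, b in frame:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_changes(frames, window=5, ratio=3.0, min_jump=0.2):
    """Return the indices of frames that appear to start a new shot.

    A change is flagged when the frame-to-frame distance is both large in
    absolute terms (min_jump) and much larger than the average distance over
    a sliding window of recent frames (ratio). Both thresholds are assumed.
    """
    history = deque(maxlen=window)  # recent frame-to-frame distances
    changes = []
    prev = None
    for i, frame in enumerate(frames):
        hist = color_histogram(frame)
        if prev is not None:
            d = histogram_distance(prev, hist)
            baseline = sum(history) / len(history) if history else 0.0
            if d > min_jump and d > ratio * baseline:
                changes.append(i)
                history.clear()  # start a fresh window for the new shot
            else:
                history.append(d)
        prev = hist
    return changes
```

In this sketch, frames between two flagged indices would form one scene, which could then be buffered and reframed as a unit.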

AutoFlip also taps AI-based object detection models to find interesting content in the frame, like people, animals, text overlays, logos, and motion. Face and object detection models are integrated with AutoFlip through MediaPipe, a framework for building pipelines that process multimodal data; MediaPipe runs the models with Google's TensorFlow Lite machine learning framework on CPUs. This structure allows AutoFlip to be extensible, according to Google, so developers can add detection algorithms for different use cases and video content.

AutoFlip automatically chooses a reframing strategy -- stationary, panning, or tracking -- depending on the way objects behave during the scene. In stationary mode, the reframed camera viewport is fixed in a position (like a stationary tripod) where important content can be viewed throughout the majority of the scene. Panning mode, by contrast, moves the viewport at a constant velocity, while tracking mode provides continuous and steady tracking of objects as they move around within the frame.
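
Google's post does not spell out the exact criteria for picking a mode, so the following Python sketch is only one plausible heuristic. It assumes each frame's salient content has been reduced to a normalized horizontal center between 0 and 1; the function name `choose_strategy` and the `stationary_span` and `linear_tolerance` thresholds are hypothetical.

```python
def choose_strategy(centers, stationary_span=0.1, linear_tolerance=0.05):
    """Pick 'stationary', 'panning', or 'tracking' for one scene.

    centers: per-frame horizontal centers of the salient content (0..1).
    Heuristic (an assumption, not AutoFlip's rule):
      - content barely moves              -> stationary
      - motion fits a straight line       -> panning (constant velocity)
      - anything else                     -> tracking
    """
    if not centers:
        raise ValueError("need at least one frame")

    span = max(centers) - min(centers)
    if span <= stationary_span:
        return "stationary"

    # Least-squares line through (frame index, center).
    n = len(centers)
    mean_t = (n - 1) / 2
    mean_c = sum(centers) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (c - mean_c) for t, c in enumerate(centers)) / var_t

    # Largest deviation of the observed path from the constant-velocity fit.
    max_residual = max(
        abs(c - (mean_c + slope * (t - mean_t))) for t, c in enumerate(centers)
    )
    return "panning" if max_residual <= linear_tolerance else "tracking"
```

A subject standing still yields "stationary", one walking steadily across the frame yields "panning", and one moving erratically yields "tracking".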

Based on which reframing strategy is selected, AutoFlip determines a cropping window for each frame while preserving the content of interest. A configuration graph provides settings for reframing such that if it becomes impossible to cover all of the required regions, the system will automatically switch to a less aggressive strategy by applying a letterbox effect, padding the image to fill the frame. AutoFlip will draw on the background color (if it's a solid color) to ensure the padding blends in, or otherwise use a blurred version of the original frame.
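
The arithmetic behind the crop-or-letterbox decision can be illustrated with a small Python sketch. It is not AutoFlip's configuration graph: the function `crop_or_pad` is hypothetical, and it assumes a landscape source, a target aspect ratio no wider than the source, and a required region expressed as a horizontal pixel range.

```python
def crop_or_pad(src_w, src_h, target_aspect, req_x0, req_x1):
    """Plan a full-height crop for one frame.

    target_aspect is width / height of the output (e.g. 9 / 16).
    Returns (x0, x1, pad_top, pad_bottom) in source pixels.
    """
    if not (0 <= req_x0 < req_x1 <= src_w):
        raise ValueError("required region must lie inside the frame")
    crop_w = src_h * target_aspect
    if crop_w > src_w:
        raise ValueError("this sketch only handles targets no wider than the source")

    req_w = req_x1 - req_x0
    if req_w <= crop_w:
        # The required region fits: center the window on it, then clamp
        # the window so it stays inside the source frame.
        center = (req_x0 + req_x1) / 2
        x0 = min(max(center - crop_w / 2, 0), src_w - crop_w)
        return (x0, x0 + crop_w, 0.0, 0.0)

    # The region is too wide for the target: keep all of it and letterbox,
    # padding top and bottom until the padded image has the target aspect.
    padded_h = req_w / target_aspect
    pad = (padded_h - src_h) / 2
    return (float(req_x0), float(req_x1), pad, pad)
```

For a 1920x1080 source and a 9:16 target, a required region spanning pixels 800 to 1000 fits inside the 607.5-pixel-wide crop window, while a region spanning 0 to 1215 forces 540 pixels of padding above and below. How that padding is filled (solid color or blurred frame) is left out of the sketch.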

The researchers leave to future work improving AutoFlip's ability to detect "objects relevant to the intent of the video," such as speaker detection for interviews or animated face detection on cartoons, and ensuring that overlays on the edges of the screen (such as text or logos) in input video aren't cropped from view. But they assert that even in its current form, AutoFlip will "reduce the barriers to ... design creativity."

"By combining text/logo detection and image inpainting technology, we hope that future versions of AutoFlip can reposition foreground objects to better fit the new aspect ratios. [And] in situations where padding is required, deep uncrop technology could provide improved ability to expand beyond the original viewable area," wrote Frey and Sun. "We are excited to release this tool directly to developers and filmmakers, reducing the barriers to their design creativity and reach through the automation of video editing. The ability to adapt any video format to various aspect ratios is becoming increasingly important as the diversity of devices for video content consumption continues to rapidly increase."