Machine learning helps Microsoft's AI realistically colorize video from a single image

Film colorization might be an art form, but it's one that AI models are slowly getting the hang of. In a paper published on the preprint server arXiv.org ("Deep Exemplar-based Video Colorization"), scientists at Microsoft Research Asia, Microsoft's AI Perception and Mixed Reality Division, Hamad Bin Khalifa University, and USC's Institute for Creative Technologies detail what they claim is the first end-to-end system for autonomous exemplar-based (i.e., derived from a reference image) video colorization. They say that in both quantitative and qualitative experiments, it achieves results superior to the state of the art.

"The main challenge is to achieve temporal consistency while remaining faithful to the reference style," wrote the coauthors. "All of the [model's] components, learned end-to-end, help produce realistic videos with good temporal stability."

The paper's authors note that AI capable of converting monochrome clips into color isn't novel. Indeed, researchers at Nvidia last September described a framework that infers colors from just one colorized and annotated video frame, and Google AI in June introduced an algorithm that colorizes grayscale videos without manual human supervision. But the output of these and most other models contains artifacts and errors, which accumulate as the input video gets longer.

To address the shortcomings, the researchers' method takes the result of a previous video frame as input (to preserve consistency) and performs colorization using a reference image, allowing this image to guide colorization frame by frame and cut down on accumulated error. (If the reference is a colorized frame in the video, it'll perform the same function as most other color propagation methods but in a "more robust" way.) As a result, it's able to predict "natural" colors based on the semantics of input grayscale images, even when no proper match is available in either a given reference image or previous frame.

This required architecting an end-to-end convolutional network -- a type of AI system that's commonly used to analyze visual imagery -- with a recurrent structure that retains historical information. Each state comprises two modules: a correspondence model that aligns the reference image to an input frame based on dense semantic correspondences, and a colorization model that colorizes a frame guided both by the colorized result of the previous frame and the aligned reference.
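To make the recurrence concrete, here is a minimal sketch of the loop described above, in which each frame is colorized from the aligned reference plus the previous frame's result. It is not the paper's network: `toy_correspond`, `toy_colorize`, and `colorize_video` are hypothetical stand-ins, frames are flat lists of luminance values, and "color" is reduced to a single chroma number per pixel.

```python
from typing import Callable, List, Optional, Tuple

Gray = List[float]                      # one frame as a flat list of luminance values
Chroma = List[float]                    # one toy chroma value per pixel
Reference = List[Tuple[float, float]]   # (luminance, chroma) pairs from the reference image


def toy_correspond(gray: Gray, reference: Reference) -> Chroma:
    """Stand-in for the correspondence module (assumption): for each pixel,
    borrow the chroma of the reference pixel with the closest luminance."""
    return [min(reference, key=lambda rc: abs(rc[0] - lum))[1] for lum in gray]


def toy_colorize(gray: Gray, aligned: Chroma, previous: Optional[Chroma]) -> Chroma:
    """Stand-in for the colorization module (assumption): blend the aligned
    reference chroma with the previous frame's result, when one exists.
    The toy ignores `gray`; a learned model would condition on it."""
    if previous is None:
        return list(aligned)
    return [0.5 * a + 0.5 * p for a, p in zip(aligned, previous)]


def colorize_video(
    frames: List[Gray],
    reference: Reference,
    correspond: Callable[[Gray, Reference], Chroma] = toy_correspond,
    colorize: Callable[[Gray, Chroma, Optional[Chroma]], Chroma] = toy_colorize,
) -> List[Chroma]:
    """Recurrent loop: every frame sees the same reference image and the
    colorized output of the frame before it."""
    results: List[Chroma] = []
    previous: Optional[Chroma] = None
    for gray in frames:
        aligned = correspond(gray, reference)        # align reference to this frame
        current = colorize(gray, aligned, previous)  # guided by reference and history
        results.append(current)
        previous = current                           # carry state to the next frame
    return results
```

Because the reference is consulted again at every step, a mistake in one frame is not the only source of color for the next, which is the intuition behind the reduced error accumulation the researchers describe.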

The team compiled a training data set from the open source Videvo corpus, which contains mostly animals and landscapes. They supplemented it with portrait videos from a separate corpus (Hollywood 2) and filtered out videos that were too dark or faded in color, leaving 768 videos in total. From each video they extracted 25 frames, and they broadened the range of categories with photos from ImageNet, applying random geometric distortion and luminance noise to each photo to generate frames. The end result: 70,000 augmented videos in "diverse categories."
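The augmentation step, turning a still photo into a short run of frames, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the team's pipeline: a small pixel shift stands in for "geometric distortion," Gaussian noise stands in for "luminance noise," and the function names, the frame count, and the `max_shift` and `sigma` defaults are all hypothetical.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image as rows of values in [0, 1]


def shift_image(image: Image, dx: int, dy: int) -> Image:
    """Translate the image by (dx, dy) pixels, repeating edge pixels at the borders."""
    h, w = len(image), len(image[0])
    return [
        [image[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)] for x in range(w)]
        for y in range(h)
    ]


def add_luminance_noise(image: Image, rng: random.Random, sigma: float) -> Image:
    """Add zero-mean Gaussian noise to every pixel and clamp back to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row] for row in image]


def pseudo_video_from_still(
    image: Image,
    num_frames: int = 4,    # assumed value; the article does not give one for stills
    max_shift: int = 1,     # assumed maximum translation in pixels
    sigma: float = 0.02,    # assumed noise strength
    seed: int = 0,
) -> List[Image]:
    """Generate frames by independently distorting the same still image."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        frames.append(add_luminance_noise(shift_image(image, dx, dy), rng, sigma))
    return frames
```

Seeding the generator keeps the synthetic clips reproducible, which makes it easier to compare training runs on the same augmented data.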

In tests, the coauthors report that their system gave the best Top-5 and Top-1 class accuracy on ImageNet, suggesting it produced semantically meaningful results. Moreover, it managed the lowest Fréchet Inception Distance (FID) score (lower is better) of the methods it was compared against, indicating its output was "highly" realistic.
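For readers unfamiliar with FID: it fits a Gaussian to deep-network features of real images and another to features of generated images, then measures the Fréchet distance between the two. The sketch below shows only the one-dimensional special case, where the distance reduces to the squared difference of the means plus the squared difference of the standard deviations. The function name and the sample numbers are illustrative assumptions; the real metric works on high-dimensional Inception features with full covariance matrices.

```python
import statistics
from typing import Sequence


def frechet_distance_1d(real: Sequence[float], generated: Sequence[float]) -> float:
    """Frechet distance between two 1-D Gaussians fitted to the samples:
    (mu_r - mu_g)^2 + (sd_r - sd_g)^2. Zero means the fitted Gaussians match."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Hypothetical feature values for real frames and two colorization outputs.
real_features = [0.0, 1.0, 2.0, 3.0]
close_output = [0.1, 1.1, 2.1, 3.1]   # nearly the same distribution
far_output = [2.0, 3.0, 4.0, 5.0]     # same spread, shifted mean

print(frechet_distance_1d(real_features, close_output) <
      frechet_distance_1d(real_features, far_output))
```

A lower score means the generated distribution sits closer to the real one, which is why the lowest FID among the compared methods is read as a sign of realism.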

"Overall, the results of our method, though slightly less vibrant, exhibit similar colorfulness to the ground truth. The qualitative comparison also indicates that our method produces the most realistic, vibrant colorization results," wrote the researchers. "[O]ur method exhibits vibrant colors in each frame with significantly fewer artifacts compared to other methods. Meanwhile, the successively colorized frames demonstrate good temporal consistency."