Researchers at Google say they've developed an AI system that can automatically convert webpages into short videos. It extracts assets such as text and images, along with their design styles (fonts, colors, and graphical layouts), from HTML sources and organizes those assets into a sequence of shots, maintaining a look and feel similar to the source page as it does so.
Google envisions the system being useful to businesses that host websites containing rich visual representations of their services or products. These assets, the company says, could be repurposed for videos, potentially enabling those without extensive resources to reach a broader audience. A typical video costs between $880 and $1,200 and can take days to weeks to produce.
URL2Video, which was presented at the 2020 User Interface Software and Technology Symposium (UIST), automatically selects key content from a page and decides the temporal and visual presentation of each asset. These presentations follow a set of heuristics identified through a study with designers; the heuristics capture common video editing styles, including content hierarchy, constraints on how much information appears in a shot and for how long, and consistent color and typography for branding. Using this information, URL2Video parses a webpage, analyzes the content, selects visually salient text or images, and preserves design styles, which it organizes according to a user's specifications.
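The end-to-end flow described above can be sketched as a chain of stages. Every name below is hypothetical: URL2Video itself has not been released, so the stubs simply illustrate how data might move from parsing to shot arrangement.

```python
# Hypothetical sketch of the parse -> select -> arrange pipeline the
# article describes. None of these functions are from URL2Video itself.

def parse_webpage(url: str) -> dict:
    """Stand-in for extracting DOM elements and styles from a rendered page."""
    return {"url": url, "elements": [
        {"kind": "heading", "text": "Acme Widgets"},
        {"kind": "image", "src": "hero.png"},
        {"kind": "cta", "text": "Buy now"},
    ]}

def select_salient_assets(page: dict) -> list:
    """Keep visually salient text and images (here: everything but CTA buttons)."""
    return [e for e in page["elements"] if e["kind"] != "cta"]

def arrange_shots(assets: list, max_shots: int = 2) -> list:
    """Order the selected assets into a bounded sequence of shots."""
    return assets[:max_shots]

def url_to_shot_list(url: str) -> list:
    page = parse_webpage(url)
    assets = select_salient_assets(page)
    return arrange_shots(assets)

shots = url_to_shot_list("https://example.com")
```

In a real system each stage would involve rendering the page, computing visual salience, and applying the designer-derived heuristics; the sketch only shows the shape of the data flow.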
URL2Video extracts document object model (DOM) information and multimedia materials on a per-webpage basis, identifying visually distinguishable elements as a candidate list of asset groups containing headings, product images, descriptions, and call-to-action buttons. The system captures both the raw assets (i.e., text and multimedia files) and detailed design specifications (HTML tags, CSS styles, and rendered locations) for each element, then ranks the asset groups by assigning each a priority score based on its visual appearance and annotations. In this way, an asset group that occupies a larger area at the top of the page receives a higher score.
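The scoring heuristic the article describes — larger elements nearer the top of the page rank higher — could look roughly like the following. The `AssetGroup` fields and the exact scoring formula are assumptions for illustration; Google has not published URL2Video's actual scoring function.

```python
from dataclasses import dataclass

@dataclass
class AssetGroup:
    """A visually distinguishable element extracted from the page DOM."""
    kind: str           # e.g. "heading", "image", "description", "cta"
    area: float         # rendered area in pixels
    top: float          # vertical offset of the element on the page
    page_height: float  # total rendered page height

def priority_score(group: AssetGroup) -> float:
    # Hypothetical formula: larger elements near the top of the page
    # score higher, mirroring the heuristic described in the article.
    position_weight = 1.0 - (group.top / group.page_height)
    return group.area * position_weight

groups = [
    AssetGroup("heading", area=40_000, top=100, page_height=4000),
    AssetGroup("image", area=90_000, top=1200, page_height=4000),
    AssetGroup("cta", area=8_000, top=300, page_height=4000),
]
ranked = sorted(groups, key=priority_score, reverse=True)
```

Here the large hero image outranks the heading despite sitting lower on the page, because area and position trade off against each other in the score.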
URL2Video then automatically selects and orders the asset groups to optimize the total priority score. To keep videos concise, the system presents only dominant elements from a page, such as a headline and a few multimedia assets, and constrains the duration of each element. Given an ordered list of assets based on the DOM hierarchy, URL2Video follows the heuristics obtained from the design study to make decisions about both the temporal and spatial arrangement. The system transfers the layout of elements into the video's aspect ratio and applies style choices, including fonts and colors, adjusting the presentation timing of assets and rendering the content into an MPEG-4 video.
Google says that in a user study with designers at Google, URL2Video effectively extracted elements from a webpage and supported the designers by bootstrapping the video creation process. “While this current research focuses on the visual presentation, we are developing new techniques that support the audio track and a voiceover in video editing,” Google research scientists Peggy Chi and Irfan Essa wrote in a blog post. “All in all, we envision a future where creators focus on making high-level decisions and an ML model interactively suggests detailed temporal and graphical edits for a final video creation on multiple platforms.”