In a paper accepted to the 2020 International Conference on Machine Learning (ICML), researchers at OpenAI demonstrate that certain AI language models trained on pixel sequences can generate coherent images. They say it’s a small but significant step toward understanding and bridging the gap between computer vision and language understanding techniques.
Self-supervised learning, or learning without human-labeled data, is a longstanding challenge in machine learning. Recently, models like Google’s BERT, Facebook’s RoBERTa, and OpenAI’s GPT-3 have achieved leading performance on a range of language tasks, but this same emerging class of models hasn’t seen comparable success when applied to image generation or classification.
Fortunately, Transformer-based models like GPT-3 are domain-agnostic, meaning they can be applied to sequences of any form. OpenAI exploited this to train a smaller version of its language model, GPT-2, on image data. The results indicate the model understands characteristics like object appearances and categories even without hand-coded knowledge; features from the model achieve state-of-the-art performance on a number of classification corpora and near state-of-the-art unsupervised accuracy.
OpenAI trained three versions of image-generating GPT-2 models — iGPT-S (which contained 76 million parameters), iGPT-M (455 million parameters), and iGPT-L (1.4 billion parameters) — on the popular benchmark corpus ImageNet, and an even larger model dubbed iGPT-XL (6.8 billion parameters) on a mix of ImageNet and images from the web. They then reduced the images’ resolutions and created their own 9-bit color palette to represent pixels, yielding an input sequence one-third the length of the standard three-channel RGB encoding without sacrificing accuracy.
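To make the pixel-to-token step concrete, here is a minimal illustrative sketch of how a 9-bit palette can collapse each RGB pixel (three values) into a single token, shrinking the sequence threefold. Note this uses a fixed uniform 3-3-3-bit palette purely for illustration; the function name and the uniform quantization scheme are assumptions, not OpenAI's actual method (which derived its palette from the data).

```python
import numpy as np

def quantize_to_tokens(image):
    """Map each RGB pixel (3 channel values) to one palette index (1 token).

    image: uint8 array of shape (H, W, 3)
    returns: int array of shape (H * W,) with values in [0, 512)
    """
    # Keep the top 3 bits of each channel: 3 + 3 + 3 = 9 bits -> 512 colors.
    # Cast up before shifting so (r << 6) doesn't overflow uint8.
    r = (image[..., 0] >> 5).astype(np.int32)
    g = (image[..., 1] >> 5).astype(np.int32)
    b = (image[..., 2] >> 5).astype(np.int32)
    tokens = (r << 6) | (g << 3) | b
    return tokens.reshape(-1)

# A 32x32 RGB image: 32 * 32 * 3 = 3,072 channel values become
# 32 * 32 = 1,024 tokens -- a sequence one-third as long.
img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
seq = quantize_to_tokens(img)
assert seq.shape == (1024,)
assert 0 <= seq.min() and seq.max() < 512
```

The point of the reduction is practical: a transformer's cost grows quickly with sequence length, so treating each pixel as one token rather than three makes low-resolution images tractable.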
According to OpenAI, the results show that image feature quality sharply increased with depth before mildly decreasing. The researchers posit this might have occurred because Transformer-based models operate in two phases. In the first phase, the model gathers information from its surrounding context to build contextualized image features, and in the second phase, the contextualized features are used to predict the next pixels in images.
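The prediction objective described above can be sketched as a standard autoregressive factorization: the model assigns a probability to each possible next pixel token given the pixels seen so far, and training minimizes the negative log-likelihood of the sequence. The toy model below stands in for the transformer (it just returns a uniform distribution); the function names are hypothetical and the code is an illustration of the objective, not OpenAI's implementation.

```python
import numpy as np

VOCAB = 512  # size of the 9-bit color palette

def toy_model(context):
    """Stand-in for a transformer: returns a distribution over the next
    pixel token. A real model would condition on the context it gathered
    (phase one) to sharpen this prediction (phase two); here it is
    simply uniform over the palette."""
    logits = np.zeros(VOCAB)
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

def sequence_nll(tokens):
    """Negative log-likelihood of a pixel-token sequence:
    -sum_i log p(token_i | tokens_<i)."""
    nll = 0.0
    for i in range(len(tokens)):
        probs = toy_model(tokens[:i])
        nll -= np.log(probs[tokens[i]])
    return nll

# Under the uniform toy model, each token costs log(512) nats.
seq = [3, 17, 511, 0]
assert abs(sequence_nll(seq) - 4 * np.log(512)) < 1e-9
```

The same objective is what lets the trained model generate images: sampling one pixel at a time from the predicted distribution, conditioned on everything generated so far.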
OpenAI also found that both increasing the scale of its models and training for more iterations resulted in better image quality. When the features were evaluated on the benchmarks CIFAR-10, CIFAR-100, and STL-10, they outperformed those from all supervised and unsupervised transfer algorithms.
However, OpenAI notes that the approach has limitations. Its iGPT models handle only low-resolution images and exhibit biases that are a consequence of the data they’ve been trained on — for instance, perhaps developing associations between genders and roles (e.g., “male scientist”). Moreover, they take large amounts of time and compute to train — roughly 2,500 days for iGPT-L on an Nvidia V100 graphics card.
For this reason, the work primarily serves as a proof-of-concept demonstration, according to the researchers. “The significant resource cost to train these models and the greater accuracy of [existing] methods precludes these representations from practical real-world applications in the vision domain … [and] expect that developers will need to pay increasing attention to the data that they feed into their systems and to better understand how it relates to biases in trained models,” they wrote. “[However, our] results suggest that due to its simplicity and generality, a sequence transformer given sufficient compute might ultimately be an effective way to learn excellent features in many domains.”
OpenAI has long asserted that powerful computers in conjunction with reinforcement learning and other techniques can achieve paradigm-shifting AI advances. As MIT Technology Review reported earlier this year, a team within OpenAI called Foresight runs experiments to test how far they can push AI capabilities by training algorithms with increasingly large amounts of data and compute. According to that same report, OpenAI is developing a system trained on images, text, and other data using massive computational resources that the company’s leadership believes is the most promising path toward artificial general intelligence (AGI), or AI that can learn any task a human can.