Stuck in GPT-3’s waitlist? Try out the AI21 Jurassic-1

In January 2020, OpenAI laid out the scaling law of language models: You can improve the performance of any neural language model by adding more training data, more model parameters, and more compute. Since then, there has been an arms race to train ever-larger neural networks for natural language processing (NLP). The latest to join the list is AI21, with its 178-billion-parameter model.

AI21 background and founding team

AI21 is an Israeli company founded in 2017 by Yoav Shoham, Ori Goshen, and Amnon Shashua. Before this, Shashua founded Mobileye, the NYSE-listed self-driving tech company that Intel acquired for $15.4 billion. After being in stealth for years, AI21 launched its first product, Wordtune, in 2020 to help people write better.

Last month, the company announced it has trained and released two large NLP models, Jurassic-1 Large and Jurassic-1 Jumbo, via an interactive web UI called AI21 Studio.

In contrast to OpenAI’s closed beta access, AI21 makes its models available for anyone to try out -- without any waitlist.

Model sizes and performance benchmarks

Larger models exist -- like the Chinese Wu Dao 2.0, which is 10x the size, with 1.75 trillion parameters. But AI21’s J-1 Jumbo is the largest English language model available to the general public so far.

Caption: GPT-3 parameter sizes as estimated here, GPT-Neo as reported by EleutherAI, J-1 as reported by AI21. * denotes the models are open source.

The zero-shot model performance on known benchmarks for J-1 Jumbo is on par with GPT-3 Davinci, the largest OpenAI GPT-3 model. “Zero-shot” is when the model is not given any special prompt and is not fine-tuned on any sort of training data specific to the task.

Caption: Benchmark results as reported by AI21.

Examples

In a previous article, I walked through a number of examples to show GPT-Neo’s real-world performance. Let us examine how well AI21’s models perform in practice.

Fact completion. Let’s start by asking Jurassic-1 some basic general knowledge questions. My prompts to the model are given in italics and the model’s response in bold.

How many medals did USA win in 2012 Olympics? 104 ##How many golds did USA win in 2016 Olympics? 46 ##

That is the correct answer!

What stood out:

The model is smart enough to figure out what we mean by "golds" in the question, while the prompt was talking about medals.
J-1 Jumbo 178B gets this right, but J-1 Large 7.5B does not!
Trying the same question with the 2021 Olympics does not work (probably because the model is not continuously trained with fresh data).
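The fact-completion prompt above can be sketched in a few lines of Python. This is only an illustration, not the exact setup used for the article: it assumes the prompt ended at the second question and that "##" acts as the separator between entries. The helper names build_prompt and first_answer are hypothetical.

```python
SEPARATOR = "##"  # separator seen in the example prompt; its role as a stop marker is an assumption


def build_prompt(solved, question):
    """Join solved (question, answer) pairs and one open question into a single prompt."""
    parts = [f"{q} {a} {SEPARATOR}" for q, a in solved]
    parts.append(question)
    return "".join(parts)


def first_answer(completion):
    """Keep only the text the model produced before the first separator."""
    return completion.split(SEPARATOR, 1)[0].strip()


prompt = build_prompt(
    [("How many medals did USA win in 2012 Olympics?", "104")],
    "How many golds did USA win in 2016 Olympics?",
)
print(prompt)
print(first_answer(" 46 ##"))
```

Cutting the completion at the separator keeps the model from running on into a new, self-invented question.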

Neural Jeopardy! Taking it one step further, how about a Jeopardy-style question-answer dialog. Thanks to the good folks at Water Cooler Trivia (WCT), we already have a question-answer set, a human benchmark, and a benchmark for GPT-3.

Running through the 157 Jeopardy-style WCT questions, the J-1 Jumbo model was able to answer with 55.4% accuracy. This compares favorably to the 52% average of humans who had taken the WCT. However, it is significantly worse than the 73% accuracy rate of GPT-3.

On “easy” questions, J-1 Jumbo did three times worse than GPT-3. Among the 58 questions that 90+% of the human participants got right, GPT-3 was wrong 4 times, but J-1 was wrong 12 times.
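The "three times worse" figure follows directly from the counts above. A quick check, using only the numbers quoted in this section:

```python
easy_questions = 58  # questions that 90+% of human participants got right
gpt3_wrong = 4
j1_wrong = 12

gpt3_error_rate = gpt3_wrong / easy_questions
j1_error_rate = j1_wrong / easy_questions
ratio = j1_wrong / gpt3_wrong

print(f"GPT-3 error rate on easy questions: {gpt3_error_rate:.1%}")
print(f"J-1 Jumbo error rate on easy questions: {j1_error_rate:.1%}")
print(f"J-1 Jumbo made {ratio:.0f}x as many errors")
```

In other words, GPT-3 missed about 7% of the easy questions, while J-1 Jumbo missed about 21%.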

Here are some of the questions GPT-3 got right but Jurassic-1 Jumbo did not:

Q: Originally, they were called Baby Gays. Then they rebranded as Q-Tips, with the Q standing for what word related to high standards?
GPT-3: Quality
AI21: Quarts

Q: Although the most popular song on the film's soundtrack is "This Is Halloween" and the main character's name is Jack Skellington, the 1993 cult-classic Tim Burton film that was nominated for a visual effects Oscar includes what OTHER holiday in the film's title?
GPT-3: The Nightmare Before Christmas
AI21: SLEEPY HOLLOW. The 1999 film, starring Johnny Depp, Christina Ricci, Miranda Richardson, Casper

As the questions got tougher, both models performed similarly. Among the 20 toughest questions, GPT-3 was wrong 13 times, and Jurassic-1 was wrong 14 times. And nine of the incorrect answers by both models were the same -- i.e., the models were wrong in the same way!

Q: Spoonerism: two pairs of words with the first sounds or letters swapped, such as blushing crow and crushing blow. What spoonerism describes these two objects? A feigned animal and large meal cooked in an oven.
GPT-3: A feigned animal and large meal cooked in an oven.
AI21: (1) A feigned animal and large meal cooked in an oven. (2) A feigned anima
Correct Answer: faked beast

Q: Mario Lopez hosted the first seven seasons of what dance competition show from 2008 - 2012?
GPT-3: Dancing with the Stars
AI21: Dancing with the Stars.
Correct Answer: America's Best Dance Crew

The J-1 model sometimes seemed to think it was on a game show, leading to hilarious completions that would fit right into a game-show script. (It got the second answer wrong after the drum roll):

Q: What "shedding-type" card game has been a Mattel brand since 1992 and features cards with titles such as "Wild" and "Reverse"?
AI21: UNO! UNO! UNO! UNO! UNO! UNO! UNO!
Correct Answer: Uno

Q: In Rolling Stone's latest version of the greatest guitarists of all time, No. 5 was a man with a monosyllabic first and last name. This man was one of the three guitarists for The Yardbirds, and he spent decades exploring innovative guitar sounds. Who is he?
AI21: (Drum roll, please) Eric Clapton
Correct Answer: Jeff Beck

Possible reason for J-1 Jumbo's underperformance

I ran a few modifications of the experiment (some with GPT-3's Q&A prompts and some without), and the best scores for J-1 Jumbo came from a "zero-shot setting." However, the GPT-3 numbers reported here are from a "few-shot" setting with an OpenAI-provided example prompt.

“Zero-shot” is when the model is not given any special prompt, and “few-shot” is when there is an example prompt given to the model that exhibits the task. An example prompt in this case would be a bunch of random question/answer pairs. This helps the model understand the task it is trying to perform -- i.e., answer the question.
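The difference between the two settings is easiest to see in the prompt text itself. The sketch below is a minimal illustration, not the OpenAI-provided prompt or the prompts used in this experiment: the "Q:"/"A:" layout, the function names, and the example pairs are all assumptions made for the example.

```python
def zero_shot_prompt(question):
    """Only the question itself, with no worked examples in front of it."""
    return f"Q: {question}\nA:"


def few_shot_prompt(examples, question):
    """Prepend solved question/answer pairs so the model can infer the task."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(zero_shot_prompt(question))
    return "\n\n".join(blocks)


# Hypothetical example pairs; any unrelated question/answer pairs would do.
examples = [
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "Seven"),
]

question = "Mario Lopez hosted the first seven seasons of what dance competition show?"
print(zero_shot_prompt(question))
print("-----")
print(few_shot_prompt(examples, question))
```

In both cases the prompt ends with an open "A:" so that the model's completion is the answer.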

The Q&A prompt used for the GPT-3 benchmark was likely tuned and engineered for optimal performance with GPT-3's models, but a similar "official" tuned prompt for Q&A does not exist yet for AI21's models. Engineering a prompt that would compete with the GPT-3 benchmark does not seem like a trivial task. I observed that the generated output was sensitive to seemingly trivial changes to the prompt text -- like an empty line in the beginning of the prompt or an extra space at the end of each line of the prompt. Hopefully, AI21 will put out an exhaustive set of "best-performing" prompts to help mitigate this.
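One way to keep such whitespace differences from creeping into an experiment is to normalize every prompt before sending it. The helper below is a hypothetical sketch of that idea, not something AI21 provides and not the procedure used for the numbers above. It removes the two variations mentioned here: blank lines at the start of the prompt and trailing spaces at the end of each line.

```python
def normalize_prompt(prompt):
    """Strip trailing whitespace from every line and drop leading blank lines."""
    lines = [line.rstrip() for line in prompt.splitlines()]
    while lines and lines[0] == "":
        lines.pop(0)
    return "\n".join(lines)


messy = "\nQ: What is the capital of France? \nA:  "
clean = normalize_prompt(messy)
print(repr(clean))
```

Running every variant of a prompt through the same normalization makes it easier to tell whether a change in output comes from the wording or from stray whitespace.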

You can explore a detailed model comparison, experiment setting, and raw data here.

Creative writing

Now let’s see how the model does when it comes to creative writing. I picked a somewhat difficult task -- a period piece about Twitter written as the late Jerome K. Jerome. Mario Klingemann tried this prompt first with GPT-3, and you can see the GPT-3 response in this thread.

Below is the response by AI21 J-1 Jumbo (178B). Again, the input text is in italics and J-1's response is in bold:

The importance of being on twitter
by Jerome K. Jerome
London, Summer 1897

It's possible that I have confused the sequence of events.
Having learned the rudiments of twittering, I pressed on,
and soon found myself moving among the twittering classes.

I recognized at once that these people were different from
those I had known before. They were called “twitterers,” and
they twittered. That was all I knew about them.

They twittered, and they twittered.

At first I twittered too. But I soon found that I
could not live by twittering alone, and that, moreover,
the twittering interfered with the serious business of life.

So, having first secured the goodwill of the people
who had known me before, I began to look round for some
new form of life. And when I met a man, well dressed
and sleek, with a cigar in his mouth, and a gold watch
chain across his waistcoat, and an eye-glass in his
eye, I said to him, “Sir, what do you do for a living?”
He said, “I am a professional gentleman.”
I said, “What, sir, do you call yourself a gentleman?”
He said, “I am a barrister-at-law.”

I also tried this prompt with the J-1 Large 7.5B model. After a few tries, the model produced what, in my opinion, is a genuinely creative piece:

The importance of being on twitter
by Jerome K. Jerome
London, Summer 1897

It’s Sunday, and we’re having tea.
“I suppose,” says I, “that you’re twittering.”
“Oh, yes,” says she, “I twitter all the time.
“I twitter about the weather,
“I twitter about people,
“I twitter about books,
“I twitter about politics,
“I twitter about art
“I twitter about anything and everything

I found the piece impressive because it uses dated language and rhythmic prose while preserving the overall theme of social networks.

How to try out Jurassic-1

Unlike GPT-3, Jurassic-1 is accessible to everybody. You can access it from AI21 Studio (account creation does require phone-number authentication).

The free tier allows 10K tokens per day for the Jurassic-1 178B model and three times as much for the smaller Jurassic-1 7.5B model. That is enough to try out using the web UI, but not enough to use the API to run any sort of tests or benchmarks.
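A back-of-the-envelope calculation shows why the quota is tight for benchmarking. The daily limits come from the paragraph above (10K tokens, and three times that for the smaller model). The 80 tokens per request is a made-up figure for a short question plus a short answer, and the sketch assumes that both prompt and completion tokens count against the quota.

```python
import math

DAILY_QUOTA = {
    "J-1 Jumbo (178B)": 10_000,
    "J-1 Large (7.5B)": 30_000,  # three times the Jumbo quota
}

NUM_QUESTIONS = 157       # size of the WCT question set discussed earlier
TOKENS_PER_REQUEST = 80   # hypothetical: prompt plus completion for one question


def days_needed(num_requests, tokens_per_request, daily_quota):
    """Number of days of free quota needed to run every request once."""
    total_tokens = num_requests * tokens_per_request
    return math.ceil(total_tokens / daily_quota)


for model, quota in DAILY_QUOTA.items():
    days = days_needed(NUM_QUESTIONS, TOKENS_PER_REQUEST, quota)
    print(f"{model}: {days} day(s) for one pass over {NUM_QUESTIONS} questions")
```

Under these assumptions, a single zero-shot pass over the question set already exceeds one day of Jumbo quota, and a few-shot prompt that repeats several examples with every question would multiply the cost.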

AI21 will be commercializing its models through an offering called AI21 Studio, which is currently in "limited open beta." The company hasn't announced a pricing model for this commercial usage yet.

The bottom line

Issues surrounding AI safety, ethics, and bias have long been a concern with neural language models, and they remain a concern with AI21's models. Setting those issues aside for a moment, AI21's models seem to be a promising substitute for GPT-3. However, they lag behind on a few fronts:

AI21 offers no specialized models comparable to “GPT-3 davinci-instruct,” which spurs GPT-3 to follow instructions given as prompts, or “GPT-3 codex,” which specializes in writing code.
The "prompt" ecosystem is still not as mature as GPT-3's. Many of GPT-3's prompts do not directly translate to AI21, and an exhaustive "official" list of prompts is not yet available.
AI21's free token quota is too restrictive, and no usage-based pricing has been announced yet. This makes it difficult to run benchmarks or do prompt engineering. Still, you can always write to them with an explanation of the requirement, and they are happy to bump up the quota (as they did for me).

However, it’s still very early days for AI21. With time, we can expect the AI21 language models to be a viable alternative to the OpenAI language models.

Abhishek Iyer is the founder of FreeText AI, a company specializing in text mining and Amazon review analysis.