Stability AI unveils new FreeWilly language models trained using minimal — and highly synthetic — data

There's a new large language model (LLM) in town — two of them, in fact — and '90s kids will immediately recognize their names: FreeWilly1 and FreeWilly2.

The models were unveiled on Friday by Stability AI, the company behind the Stable Diffusion image generation AI. The company was founded by former UK hedge funder Emad Mostaque, who has been accused of exaggerating his resume. The two new LLMs are based on versions of Meta's LLaMA and LLaMA 2 open-source models, respectively, but were trained on an entirely new, smaller dataset that includes synthetic data.

According to Stability AI, both models excel at intricate reasoning, understanding linguistic subtleties, and answering complex questions in specialized domains like law and mathematics.

Stability's subsidiary CarperAI released the FreeWillys under a "non-commercial license," meaning they cannot be used for commercial or enterprise purposes and are instead aimed at advancing research and promoting open access in the AI community.

Smaller whales, more environmentally friendly

The names of the models are a play on the "Orca" AI training methodology developed by researchers at Microsoft, which allows "smaller" models (those exposed to more limited data) to achieve the performance of large foundational models exposed to more massive datasets. (Not a reference to the IRL boat-sinking orcas.)

Specifically, FreeWilly1 and FreeWilly2 were trained with 600,000 data points, just 10% of the size of the original Orca dataset, using instructions from four datasets created by Enrico Shippole. That made them far less costly and far more environmentally friendly (using less energy and having a lower carbon footprint) than the original Orca model and most leading LLMs. The models still performed strongly, with results comparable to, and in some cases exceeding, ChatGPT running on GPT-3.5.

Training on synthetic data shows promise

One question that has come up as LLMs proliferate: What happens when more and more content is generated with them, and future versions of these models, as well as entirely new models, are trained on that AI-generated data?

An open-access paper described a process of "model collapse," wherein LLMs trained on increasing amounts of AI-generated data performed more poorly than predecessors trained on human-generated data.

However, when training the FreeWillys, Stability AI used two other LLMs to generate 500,000 and 100,000 synthetic examples, respectively (600,000 in total), and found that the FreeWillys still performed well. That suggests synthetic data may be an answer to model collapse, and a way to avoid using copyrighted or proprietary data.

Swimming into the future with Stability AI

Stability AI envisions these models setting new standards in the field of open access LLMs, empowering natural language understanding and enabling complex tasks.

"We are excited about the endless possibilities that these models will bring to the AI community and the new applications they will inspire," said the Stability AI team. They expressed their gratitude to the researchers, engineers and collaborators whose dedication made this milestone possible.

Researchers and developers can access the weights for FreeWilly2 as-is, while FreeWilly1's weights are released as deltas over the original model.
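
For readers unfamiliar with delta releases, the idea can be sketched in a few lines of Python. This is only an illustration, not Stability AI's actual tooling: it assumes each delta is the fine-tuned weight minus the base weight, and the function name `apply_delta`, the parameter name `layer.weight`, and the plain-list representation of weights are all hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_delta(base: Weights, delta: Weights) -> Weights:
    """Rebuild fine-tuned weights, ASSUMING delta = fine_tuned - base."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise addition restores the fine-tuned value for each weight.
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [1.0, 2.0]}
    delta = {"layer.weight": [0.5, -1.0]}
    print(apply_delta(base, delta))
```

In practice, this means anyone who wants to run FreeWilly1 needs their own copy of the original model's weights to combine with the released deltas.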
