Stability AI unveils its first LLM, as open-source AI race continues

Stability AI, the company funding the development of open-source generative AI models like Stable Diffusion and Dance Diffusion, today announced the launch of its StableLM suite of language models.

After developing models for multiple domains, including image, audio, video, 3D and biology, this is the first time the developer is jumping into the language model game currently dominated by tech heavyweights such as OpenAI, Meta and Stanford.

The suite’s first offering, the StableLM open-source language model, is now available in alpha in 3-billion and 7-billion-parameter versions, both trained on 800 billion tokens, with larger models of 15 billion to 65 billion parameters to follow.

In 2022, Stability AI introduced Stable Diffusion, a groundbreaking open-source image model that offers a transparent and scalable alternative to proprietary AI. With the release of the StableLM suite, the company aims to demonstrate how small, efficient models can provide high performance with the appropriate training.

StableLM is an extension of the company’s foundational AI technology, which promotes transparency, accessibility and support in AI design. Stability AI believes that the release represents another significant step toward making foundational AI technology accessible to all, with numerous applications, including generating text and code.

Open-source is the new cool

The StableLM suite builds on Stability AI’s prior work, including the groundbreaking Stable Diffusion image model, which offered an open-source alternative to proprietary generative AI image models such as DALL-E. StableLM itself can generate text and code, which makes it suitable for a range of downstream applications.

Despite its small size, the model is surprisingly effective in conversational and coding tasks, much like OpenAI’s ChatGPT, thanks to its training on an experimental dataset. Stability AI has a track record of open-sourcing earlier language models with the nonprofit EleutherAI, such as GPT-J, GPT-NeoX and the Pythia suite, all trained on The Pile open-source dataset.

StableLM-Alpha models are trained on a new dataset that builds on The Pile. This new “experimental dataset” contains 1.5 trillion tokens, which Stability AI says makes it roughly three times the size of The Pile. The context length for the StableLM models is 4,096 tokens.
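To put those sizes in perspective, the arithmetic can be sketched in a few lines of Python. The figures come from this article. The implied size of The Pile is derived from the "three times" claim rather than stated by Stability AI, and the count of context windows assumes, purely for illustration, that the training text is cut into back-to-back 4,096-token chunks. The variable names are hypothetical.

```python
# Figures quoted in this article.
DATASET_TOKENS = 1.5e12   # size of the new experimental dataset
PILE_MULTIPLE = 3         # "three times" The Pile (read here as 3x its size)
TRAINED_TOKENS = 800e9    # tokens each alpha model was trained on
CONTEXT_LENGTH = 4096     # tokens per context window

# Size of The Pile implied by the "three times" claim (an inference, not a quoted figure).
implied_pile_tokens = DATASET_TOKENS / PILE_MULTIPLE

# Number of full-length context windows that 800 billion tokens would fill,
# assuming (for illustration only) back-to-back, non-overlapping chunks.
full_windows = int(TRAINED_TOKENS // CONTEXT_LENGTH)

if __name__ == "__main__":
    print(f"Implied size of The Pile: {implied_pile_tokens:.2e} tokens")
    print(f"Full 4,096-token windows in 800B tokens: {full_windows:,}")
```

On these assumptions, The Pile would hold about 500 billion tokens, and 800 billion training tokens would fill roughly 195 million full-length context windows.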

Stability AI is strongly committed to transparency and accessibility in AI design, and the StableLM suite is no exception. Developers are encouraged to freely inspect, use and adapt the StableLM base models for commercial or research purposes, subject to the terms of the CC BY-SA 4.0 license. Under the license, you must give credit to Stability AI, provide a link to the license, and indicate if changes were made.

According to the license document, users may give that credit in any reasonable manner, but not in any way that suggests Stability AI endorses them or their use.

Image source: Stability AI

In a post, the company announced that the StableLM suite also includes a set of research models that are instruction fine-tuned. As a proof of concept, the company fine-tuned the StableLM model with Stanford Alpaca’s procedure, using a combination of five recent open-source datasets for conversational agents: Stanford’s Alpaca, Nomic-AI’s gpt4all, RyokoAI’s ShareGPT52K, Databricks Labs’ Dolly and Anthropic’s HH. It will release these models as StableLM-Tuned-Alpha.

Stability AI said an upcoming technical report would document the model’s specifications and the training settings.

The fine-tuned models are intended for research use only and are released under the noncommercial CC BY-NC-SA 4.0 license, in line with Stanford’s Alpaca license.

The LLM race just got bigger

StableLM’s 800 billion training tokens put it in the same range as Meta’s LLaMA language model, whose 7-billion-parameter version was trained on 1 trillion tokens.
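One rough way to read these figures is as training tokens per model parameter. The sketch below uses only the numbers quoted in this article. The dictionary layout, the model labels and the function name `tokens_per_parameter` are illustrative choices, not anything published by Stability AI or Meta.

```python
# Parameter and training-token counts as quoted in this article.
MODELS = {
    "StableLM-Alpha 3B": {"params": 3e9, "tokens": 800e9},
    "StableLM-Alpha 7B": {"params": 7e9, "tokens": 800e9},
    "LLaMA 7B": {"params": 7e9, "tokens": 1e12},
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return tokens / params


# Ratio for each model, keyed by the same labels as MODELS.
RATIOS = {name: tokens_per_parameter(**spec) for name, spec in MODELS.items()}

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name}: about {ratio:.0f} tokens per parameter")
```

By this measure the 7-billion-parameter StableLM saw roughly 114 tokens per parameter, against roughly 143 for LLaMA at the same size, while the 3-billion-parameter model saw about 267.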

Recently, Menlo Park-based firm Together announced the launch of RedPajama, an open-source project developed in collaboration with several AI institutions including Ontocord AI, ETH DS3Lab, Stanford CRFM, Hazy Research and MILA Québec AI Institute.

That project is quite similar to Stability AI’s approach, aiming to create large language models (LLMs) that are fully open source and lead the industry in performance. The initial dataset released by RedPajama contains 1.2 trillion tokens and adheres to the LLaMA recipe. The dataset is publicly available on Hugging Face, and Apache 2.0-licensed scripts on GitHub can be used to reproduce the results.

According to Stability AI, language models are the backbone of the digital economy, and everyone should have a voice in their design. By offering fine-grained access to the models, the company hopes to encourage the development of interpretability and safety techniques beyond what is possible with closed models. The company’s models are now available in its GitHub repository, and Stability AI plans to publish a full technical report in the near future.

Stability AI is also seeking to grow its team and is looking for individuals passionate about democratizing access to this technology and experienced in LLMs. For those interested, the company is accepting applications on its website.

In addition to its work on the StableLM suite, Stability AI is kicking off its crowd-sourced RLHF program and working with community efforts such as Open Assistant, an initiative to create an open-source dataset for AI assistants.

The company plans to release more models soon and says it is excited to collaborate with developers and researchers to roll out the StableLM suite.
