RedPajama replicates LLaMA dataset to build open source, state-of-the-art LLMs

Thought the open source AI references to camelids were finished? Think again: Yesterday, Together, a Menlo Park, California-based company focused on building a decentralized cloud and open source models, announced RedPajama (yes, like Llama Llama Red Pajama).

"In many ways, AI is having its Linux moment," the company said in a blog post, linking to a January post written by Chris Re, co-founder of Together, Stanford associate professor and co-founder of SambaNova, Snorkel.ai and Factory.

RedPajama is a collaborative project between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute to create leading, fully open-source large language models (LLMs). The effort began with yesterday's release of a 1.2 trillion token dataset that follows the LLaMA recipe. The data enables any organization to pre-train models that can be permissively licensed. The full dataset is available on Hugging Face, and users can reproduce results with Apache 2.0 scripts available on GitHub.

LLaMA is a state-of-the-art foundation LLM released in February by Meta with gated access to researchers. Several other models based on LLaMA have come out in recent weeks, including Alpaca, Vicuna and Koala — but those models have not been available for commercial use. There was also some LLaMA-drama when the LLaMA model was leaked on 4chan.

In the coming weeks, Together will release a full suite of LLMs and instruction-tuned versions based on the RedPajama dataset. The company emphasized that the forthcoming models will be fully open-source and commercially viable. In a tweet, the company said, "We hope this can be a clean-room, drama-free version. The RedPajama models we release, starting in the coming weeks, will be released under the Apache 2.0 license."

RedPajama part of a wave of open source AI

As VentureBeat reported last week, open source AI has been having a moment over the past few weeks, following the wave of LLM releases and an effort by startups, collectives and academics to push back on the shift in AI to closed, proprietary LLMs.

And a camelid-adjacent model, Dolly 2.0 (as in Dolly the Sheep), also made headlines last week when its developer, Databricks, called it the first open, instruction-following LLM for commercial use.

But the largest, state-of-the-art open source LLMs like LLaMA have been limited to the research community. "They are limited in that you can't build real applications and ship them," said Vipul Ved Prakash, founder and CEO of Together and previously co-founder of Cloudmark and Topsy. "We think having permissively licensed models is a critical aspect of open source AI."

Replicating the LLaMA dataset was no small task

The company started with LLaMA, which it called the "leading suite of open base models," because it was trained on a "very large dataset that was carefully filtered for quality." Also, the 7 billion parameter LLaMA model is "trained for much longer, well beyond the Chinchilla-optimal point, to ensure the best quality at that model size."
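For a rough sense of what "well beyond the Chinchilla-optimal point" means, the sketch below uses two numbers that are not from this article and should be read as assumptions: the commonly cited rule of thumb of about 20 training tokens per parameter associated with the Chinchilla work, and the 1 trillion training tokens reported in the LLaMA paper for the 7 billion parameter model. The helper name is hypothetical.

```python
# Assumption (not from this article): a widely quoted rule of thumb from the
# Chinchilla work is roughly 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20


def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model with n_params."""
    return TOKENS_PER_PARAM * n_params


llama_7b_params = 7e9
# Assumption: training tokens for the 7B model as reported in the LLaMA paper.
llama_7b_tokens = 1.0e12

optimal = chinchilla_optimal_tokens(llama_7b_params)
overtraining_factor = llama_7b_tokens / optimal

print(f"approximate compute-optimal budget: {optimal:.2e} tokens")
print(f"actual training run: about {overtraining_factor:.1f} times that budget")
```

Under these assumptions, the 7 billion parameter model sees several times more tokens than the rule of thumb would suggest, which is the trade-off the quote describes: extra training compute in exchange for better quality at a fixed model size.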

While neither the dataset nor the model will be identical, the developers aim to create a fully open source reproduction of LLaMA which would be available for commercial applications, and provide a "more transparent pipeline for research."

The developers did not have access to the LLaMA dataset but had enough of a recipe to go on. "We followed the recipe very carefully to essentially recreate [the LLaMA dataset] from scratch," said Prakash. The dataset consists of seven data slices, including data from Common Crawl, arXiv, GitHub, Wikipedia and a corpus of open books.

"For each data slice, we conduct careful data pre-processing and filtering, and tune our quality filters to roughly match the number of tokens as reported by Meta AI in the LLaMA paper," read the blog post.

"All of the data LLaMA was trained on is openly available data, but the challenge was that they didn't provide the actual data set; there's a lot of work to go from the overview to the actual data set," said Prakash. For example, he explained, the paper might describe how they picked the best 10,000 from a million documents, but they didn't give you the 10,000. "So we followed the recipe to repeat all that work to create an equivalent dataset," he said.
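Prakash's example of picking the best 10,000 documents out of a million can be pictured as a score-and-select step. The sketch below is only an illustration of that idea, scaled down to a small synthetic corpus; the function names and the word-count "quality score" are hypothetical stand-ins and are not the filters used by Meta or by the RedPajama team.

```python
import heapq
import random


def select_top_k(docs, score_fn, k):
    """Return the k highest-scoring documents, best first."""
    return heapq.nlargest(k, docs, key=score_fn)


def toy_quality_score(doc: str) -> int:
    """Hypothetical stand-in for a real quality filter: count the words."""
    return len(doc.split())


# Synthetic corpus of 1,000 documents with random lengths (illustration only).
random.seed(0)
corpus = [" ".join(["word"] * random.randint(1, 500)) for _ in range(1_000)]

best = select_top_k(corpus, toy_quality_score, k=10)
print(f"kept {len(best)} of {len(corpus)} documents")
```

Knowing that such a step exists is not the same as having its output: without the original scoring function and source documents, the selected set has to be rebuilt, which is the work the quote refers to.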

The debate over building transparent systems

Prakash said that the RedPajama project collaborators believe it's important that systems are transparent. "You know exactly how this model was built, what went into it," he said. "If you're trying to improve it, you can start from the dataset."

The project also brings a larger community to these models, he added. "I would say academia has really been cut out of foundation model research because of the level of resources required, starting from data to the compute," he said. He added that there is a small number of people in the world working on these large models today, and if there were broader access, "a lot of brilliant people" around the world would be able to explore different directions of neural architectures, training algorithms and safety research.

"Also, this is one of the first really general AI which can be adapted to different tasks, and we think the applicability is very broad," he said. "But many different applications are possible only if you have access to the model, the model weights, and adapt them to different computing environments. We see a lot of this happen because of open source AI."

There is another side to the open source AI debate, however. For example, Ilya Sutskever, OpenAI’s chief scientist and co-founder, recently said it was “wrong” to share research so openly, saying the reasons for not doing so (fear of competition and fears over safety) were “self-evident.” He added that “at some point it will be quite easy, if one wanted, to cause a great deal of harm with those models.”

And in a recent interview with VentureBeat, Joelle Pineau, VP of AI research at Meta, said that while accountability and transparency in AI models are essential, the key for Meta is to balance the level of access, which can vary depending on the potential harm of the model.

“My hope, and it’s reflected in our strategy for data access, is to figure out how to allow transparency for verifiability audits of these models,” she said, adding that access could be decided based on the level of potential harm of the model.

On the other hand, she said that some levels of openness go too far. “That’s why the LLaMA model had a gated release,” she explained. “Many people would have been very happy to go totally open. I don’t think that’s the responsible thing to do today.”

Debates around ethical datasets as well

There have also been debates about the ethics of the datasets themselves, whether the models are open or closed. An article last week in The Guardian said that the "enormous datasets used to train the latest generation of these AI systems, like those behind ChatGPT and Stable Diffusion, are likely to contain billions of images scraped from the internet, millions of pirated ebooks, the entire proceedings of 16 years of the European parliament and the whole of English-language Wikipedia."

But Prakash says that he thinks "these models capture in some ways the output of human society and there is a sort of obligation to make them open and usable by everyone." He added that "most of the magic" of these models comes from the fact that they are trained on "really broad and vast" data.

He also pointed out that the original data is compressed significantly in the actual model. The RedPajama dataset is 5 terabytes, and the models can be as small as 14 GB, several hundred times smaller than the original data they are modeling.
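As a back-of-the-envelope check on the two sizes quoted above, the sketch below divides one by the other. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and, for the second calculation, 2 bytes per parameter (16-bit weights); both are assumptions, and the exact ratio shifts with the units chosen and with whether compressed or uncompressed dataset sizes are compared.

```python
# Sizes quoted in the article, converted with decimal units (an assumption).
dataset_bytes = 5 * 10**12  # 5 terabytes
model_bytes = 14 * 10**9    # 14 GB

# How many times larger the dataset is than the smallest model.
ratio = dataset_bytes / model_bytes

# Assumption: weights stored as 16-bit floats, i.e. 2 bytes per parameter.
bytes_per_param = 2
implied_params = model_bytes // bytes_per_param

print(f"dataset is about {ratio:.0f} times larger than the model")
print(f"14 GB at 2 bytes per parameter holds {implied_params:,} parameters")
```

The second figure lines up with the 7 billion parameter model size mentioned earlier, which suggests the 14 GB number refers to a model of that size stored at 16-bit precision.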

"This means that knowledge from the data is abstracted, transformed and modeled in a very different representation of weights and biases of parameters in the neural network model, and not stored and used in its original form," said Prakash. So, it is "not reproducing the training data — it is derivative work on top of that. From our understanding, it is considered fair use as long as the model is not reproducing the data — it's learning from it."

There is no doubt that the open source AI debates are highly complex. But when asked why the company called the new project RedPajama, the answer was far simpler. "A lot of us have small children," said Prakash. "It just seemed fun."
