With a wave of new LLMs, open-source AI is having a moment — and a red-hot debate

The open-source technology movement has been having a moment over the past few weeks thanks to AI — following a wave of recent large language model (LLM) releases and an effort by startups, collectives and academics to push back on the shift in AI to closed, proprietary LLMs.

State-of-the-art LLMs require huge compute budgets (OpenAI reportedly used 10,000 Nvidia GPUs to train ChatGPT) and deep ML expertise, so few organizations can train them from scratch. Yet, increasingly, those that have the resources and expertise are not opening up their models (the data, source code, or deep learning's secret sauce, the model weights) to public scrutiny, relying on API distribution instead.

That is where open-source AI is stepping into the void to democratize access to LLMs. For example, two weeks ago Databricks announced the ChatGPT-like Dolly, which was inspired by Alpaca, another open-source LLM released by Stanford in mid-March. Alpaca, in turn, used the weights from Meta’s LLaMA model, which was released in late February. LLaMA was immediately hailed for its superior performance over models such as GPT-3, despite having 10 times fewer parameters.

Meta is known as a particularly “open” Big Tech company (thanks to FAIR, the Fundamental AI Research Team founded by Meta's chief AI scientist Yann LeCun in 2013). It had made LLaMA’s model weights available for academics and researchers on a case-by-case basis — including Stanford for the Alpaca project — but those weights were subsequently leaked on 4chan. This allowed developers around the world to fully access a GPT-level LLM for the first time.

Other open-source LLaMA-inspired models have been released in recent weeks, including Vicuna, a fine-tuned version of LLaMA that its developers say reaches roughly 90% of ChatGPT's quality, as judged by GPT-4; Koala, a model from the Berkeley Artificial Intelligence Research (BAIR) lab; and ColossalChat, a ChatGPT-type model that is part of the Colossal-AI project from UC Berkeley. Some of these open-source models have even been optimized to run on the lowest-powered devices, from a MacBook Pro down to a Raspberry Pi and an old iPhone.

It's important to note, however, that none of these open-source LLMs is available yet for commercial use, because the LLaMA model is not released for commercial use, and the OpenAI GPT-3.5 terms of use prohibit using the model to develop AI models that compete with OpenAI.

An open-source debate as old as software

Nonprofits have also stepped into the open-source AI fray: Last week the German nonprofit LAION (Large-scale Artificial Intelligence Open Network) proposed to democratize AI research and build a publicly funded supercomputer with 100,000 powerful accelerators, such as GPUs. It would be used to create open-source replicas of models as large and powerful as GPT-4 as quickly as possible.

And two weeks ago, the free-software community Mozilla announced an open-source initiative for developing AI, saying they “intend to create a decentralized AI community that can serve as a ‘counterweight’ against the large profit-focused companies.”

All of this has stirred up a debate as old as software: Should AI models be freely available so anyone can modify, personalize and distribute them without restrictions? Or should they be protected by copyright and require the purchase of a license? And what are the ethical and security implications of using these open-source LLMs — or, on the other hand, their closed, costly counterparts?

The open-source software movement of the late ’90s and early ’00s produced iconic innovations like Mozilla’s Firefox web browser, Apache server software and the Linux operating system, which was the foundation of the Android OS that powers the majority of the world’s smartphones.

But in the academia-focused, research-heavy world of AI, open source has been particularly influential. “Most of the progress in the past five years in AI came from open science and open source,” Hugging Face CEO Clement Delangue told VentureBeat in an interview a couple of weeks before the company drew more than 5,000 people to an open-source AI event that turned into what many called the “Woodstock of AI.”

For example, he explained, most of today’s most popular LLMs, including ChatGPT, are built on Transformers, a neural network architecture that was announced in 2017 with the “Attention Is All You Need” research paper (written by eight co-authors at Google, several of whom went on to found LLM startups including Cohere and Character AI).

After Transformers were developed and shared openly, “people built on top of that with scaffolds like RoBERTa, GPT-2 and GPT-3,” said Delangue. “People were building on top of one another using the same kind of architecture and technique.”

But over the past year and a half, more and more companies have transitioned to more proprietary commercial models, he explained, models that may lack even a research paper. “Now, we don't know if [a model] is 200 billion or 10 billion parameters,” he said. “The research community is left speculating about the details, and it creates less transparency.”

The many shades of the open-source AI spectrum

There are many shades on the spectrum of open-source AI, said Moses Guttmann, founder and CEO of ClearML, an MLOps platform that is available as a hosted service or as an open-source tool. Even if a company is unwilling to share source code, he explained, it can offer some level of openness that helps people understand the model’s process, “whether you anonymize data or sample the data so people just understand what it was trained on.”

Big Tech companies have historically sat at various points on the openness spectrum. Google CEO Sundar Pichai recently told the Wall Street Journal that Google has open-sourced models before, but would have to evaluate whether to do so going forward.

“I think it has an important role to play,” he said of open source, adding that the future ecosystem will likely be more diverse than people think.

“Over time, you will have access to open-source models,” he said. “You’ll be able to run models on-device. Companies will be able to build their own models, as well as people who use models through large cloud providers. I think you’ll have a whole diverse range of options.”

But Yann LeCun tweeted in February about his concerns for the future of open-source AI.

In an interview with VentureBeat, Joelle Pineau, VP of AI research at Meta, said that accountability and transparency in AI models are essential.

“The pivots in AI are huge, and we are asking society to come along for the ride,” she said. “That’s why, more than ever, we need to invite people to see the technology more transparently and lean into transparency.”

She pointed out that there will always be open- and closed-source AI, with some models designed to contribute to pushing research in an open way, while others are products with the potential to transform people's lives.

However, Pineau doesn’t fully align herself with statements from OpenAI that cite safety concerns as a reason to keep models closed. “I think these are valid concerns, but the only way to have conversations in a way that really helps us progress is by affording some level of transparency,” she said.

She pointed to Stanford’s Alpaca project as an example of “gated access” — where Meta made the LLaMA weights available for academic researchers, who fine-tuned the weights to create a model with slightly different characteristics.

“We welcome this kind of investment from the ecosystem to help with our progress,” she said. But while she did not comment to VentureBeat on the 4chan leak that led to the wave of other LLaMA models, she told the Verge in a press statement, “While the [LLaMA] model is not accessible to all … some have tried to circumvent the approval process.”

Pineau did emphasize that Meta received complaints on both sides of the debate regarding its decision to partially open LLaMA. “On the one hand, we have many people who are complaining it's not nearly open enough, they wish we would have enabled commercial use for these models,” she said. “But the data we train on doesn't allow commercial usage of this data. We are respecting the data.”

However, there are also concerns that Meta was too open and that these models are fundamentally dangerous. “If people are equally complaining on both sides, maybe we didn't do too bad in terms of making it a reasonable model,” she said. “I will say this is something we always monitor and with each of our releases, we carefully look at the trade-offs in terms of benefits and potential harm.”

GPT-4 release led to an increasingly fiery open-source debate

When GPT-4 was released on March 14, there was a raft of online criticism about what accompanied the announcement: a 98-page technical report that did not include any details about the model’s “architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”

One noteworthy critic of GPT-4’s closed-source release was William Falcon, CEO of Lightning AI and creator of PyTorch Lightning, an open-source Python library that provides a high-level interface for the popular deep learning framework PyTorch.

“I think what’s bothering everyone is that OpenAI made a whole paper that’s like 90-something pages long,” he told VentureBeat. “That makes it feel like it’s open-source and academic, but it’s not.” OpenAI had been supportive of open source in the past, he added. “They’ve played along nicely. Now, because they have this pressure to monetize … they just divorced themselves from the community.”

Though OpenAI was founded as an open-source company in 2015, it has clearly shifted its focus. In a recent interview with The Verge, Ilya Sutskever, OpenAI’s chief scientist and co-founder, said it was “wrong” to share research so openly. OpenAI’s reasons for not sharing more information about GPT-4 (fear of competition and fears over safety) were “self-evident,” he said, adding that “at some point it will be quite easy, if one wanted, to cause a great deal of harm with those models. And as the capabilities get higher it makes sense that you don't want to disclose them.”

In a statement to VentureBeat, Sandhini Agarwal, researcher, policy research at OpenAI, said that the company makes its technology available to external researchers “who work closely with us on important issues,” adding that open-source software plays a “crucial role in our research efforts” and their significance “cannot be understated — we would not have been able to scale ChatGPT without it. We’re dedicated to continually supporting and contributing to the open-source community.”

The balance between open and closed AI

While there is debate about the pros and cons of specific instances, most agree that there should be a balance between open and closed AI, said Stella Biderman, a mathematician and artificial intelligence researcher at Booz Allen Hamilton and EleutherAI.

Those who say models are too dangerous to release openly create frustrations for external researchers who want to understand the behaviors of these products, she said.

“In general, I think that we should respect what individuals think is the best way to disseminate their research,” she said. “But I'm sympathetic to the concern that there is a disconnect in rhetoric between, we can't show this information and also we can sell it to you.”

Still, Biderman emphasized that there are definitely models that should not be released. Booz Allen, for example, is one of the largest providers of AI services to the government, and mostly focuses on the national security applications of those models. “For national security and other reasons, those people very much don’t want those models to be released,” she said.

However, having open-source research is essential, she said: “If we don't have organizations that have both the technical expertise, as well as the funding, to train an open-source model, there isn't going to be the ability for people to study them outside of the organizations that have a financial interest in them.”

The latest wave of open-source LLMs has pros and cons

The latest open-source LLMs are much smaller and not as cutting-edge as ChatGPT, but “they get the job done,” said Simon Willison, an open-source developer and co-creator of Django, a free and open-source Python-based web framework.

“Before LLaMA came along, I think lots of people thought that in order to run a language model that was of any use at all, you needed $16,000 worth of video cards and a stack of 100 GPUs,” he told VentureBeat. “So the only way to access these models was through OpenAI or other organizations.”

But now, he explained, open-source LLMs can run on a laptop. “It turns out maybe we don't need the cutting edge for a lot of things,” he said.

ClearML’s Guttmann agreed, saying his customers don’t necessarily need a solution at the scale of an OpenAI. “Enterprise companies may [want] to solve a very specific problem that doesn’t require a nice UI,” he said.

However, the ethical implications of using these open-source LLM models are complicated and difficult to navigate, said Willison. OpenAI, for example, has extra filters and rules in place to prevent writing things like a Hitler manifesto, he explained. “But once you can run it on your own laptop and do your own additional training, you could potentially train a fascist language model — in fact, there are already projects on platforms like 4chan that aim to train ‘anti-woke’ language models,” he said.

This is concerning because it opens the door to harmful content creation at scale. Willison pointed to romance scams as an example: With language models, scammers could convince people to fall in love and then steal their money on a massive scale, he said.

Currently, Willison says he leans towards open-source AI. “As an individual programmer, I use these tools on a daily basis and my productivity has increased, allowing me to tackle more ambitious problems,” he said. “I don't want this technology to be controlled by just a few giant companies; [that] feels inherently wrong to me given its impact.”

But he still expressed concern. “What if I'm wrong?” he said. “What if the risks of misuse outweigh the benefits of openness? It's difficult to balance the pros and cons.”

The future of AI must strike the right balance, say experts

At its heart, open-source software should be a good thing, wrote Alex Engler, research fellow at the Brookings Institution, in a 2021 article in IEEE Spectrum.

But one of the scary parts of open-source AI is how “intensely easy it is to use,” he wrote. “The barrier is so low … that almost anyone who has a programming background can figure out how to do it, even if they don't understand, really, what they're doing.”

According to Meta’s Pineau, the key is to balance the level of access, which can vary depending on the potential harm of the model.

“My hope, and it’s reflected in our strategy for data access, is to figure out how to allow transparency for verifiability audits of these models,” she said.

On the other hand, she said that some levels of openness go too far. “That’s why the LLaMA model had a gated release,” she explained. “Many people would have been very happy to go totally open. I don’t think that’s the responsible thing to do today.”