Linux Foundation unveils new permissive license for open data collaboration

The Linux Foundation has announced a new permissive license designed to help foster collaboration around open data for artificial intelligence (AI) and machine learning (ML) projects.

Data may be the new oil, but for AI and ML projects, having access to expansive and diverse datasets is key to reducing bias and building powerful models capable of all manner of intelligent tasks. For machines, data is a little like "experience" is for humans -- the more of it you have, the better decisions you are likely to make.

With CDLA-Permissive-2.0, the Linux Foundation is building on its previous efforts to encourage data-sharing through licensing arrangements that clearly define how the data -- and any derivative datasets -- can and can't be used.

Data pools

The Linux Foundation introduced the Community Data License Agreement (CDLA) in 2017 to entice organizations to open up their vast pools of (underused) data to third parties. There were two original licenses. The first was a sharing license with a "copyleft" reciprocal commitment borrowed from the open source software sphere, stipulating that any derivative datasets built from the original dataset must be shared under a similar license. The second was a permissive license (1.0) without any such obligations in place (much as "true" open source software might be defined).

Licenses are basically legal documents that outline how a piece of work (in this case datasets) can be used or modified, but specific phrases, ambiguities, or exceptions can often be enough to spook companies if they think releasing content under a specific license could cause them problems down the line. This is where the CDLA-Permissive-2.0 license comes into play -- it's essentially a rewrite of version 1.0 that is shorter and simpler to follow. It also removes certain provisions that were deemed unnecessary or burdensome and that may have hindered broader use of the license.

For example, version 1.0 of the license included obligations that data recipients preserve attribution notices in the datasets. For context, attribution notices or statements are standard in the software sphere, where a company that releases software built on open source components has to credit the creators of these components in its own software license. But the Linux Foundation said feedback it received from the community and lawyers representing companies involved in open data projects pointed to challenges around associating attributions with data (or versions of datasets).

So while data source attribution is still an option, and might make sense for specific projects -- particularly where transparency is paramount -- it is no longer a condition for businesses looking to share data under the new permissive license. The chief remaining obligation is that the main community data license agreement text be included with the new datasets.

Data versus software

This highlights how transposing a concept from a software license to a dataset license doesn't always make sense, in part because laws and regulations usually treat data differently from software and other similar creative content.

"Data is different from software," Linux Foundation compliance and legal VP Steve Winslow told VentureBeat. "Open source software is typically made of copyrightable works, where authorship is important. By contrast, data may frequently have little or no applicable intellectual property rights, and authorship and attribution are often less important."

But isn't attribution still desirable, even if it's not always going to be applicable or relevant? According to Winslow, enforcing data attribution could have some negative consequences in terms of organizations' willingness to collaborate around data.

"Some data recipients may still choose to attribute the data, to show that the data is trustworthy based on its source," Winslow said. "But it will be their call, and not a requirement, as it could impose limitations on how to organize and analyze the data or force unintended burdens on data collaboration."

For example, let's assume data from multiple contributors -- which could run into the thousands -- is pooled into a single dataset. If the dataset is only ever used in that combined form, attribution isn't a huge burden. But if the dataset is subsequently split into subsets that are redistributed separately or combined with a different dataset, determining which attributions apply to the new dataset becomes a lot of work.
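
To make the bookkeeping concrete, here is a minimal Python sketch of what mandatory attribution implies once a pooled dataset is split or merged. Everything in it is a made-up illustration: the record layout, the `contributor` field, the organization names, and the `attributions_for` helper are assumptions for this example, not anything defined by the CDLA or used by the projects mentioned here.

```python
# Hypothetical pooled dataset: every record has to carry its contributor
# so that attribution can be recomputed later.
pooled = [
    {"id": 1, "contributor": "org_a", "value": 0.4},
    {"id": 2, "contributor": "org_b", "value": 1.7},
    {"id": 3, "contributor": "org_a", "value": 2.2},
    {"id": 4, "contributor": "org_c", "value": 0.9},
]

# A second hypothetical dataset from a different pool.
other = [
    {"id": 5, "contributor": "org_d", "value": 3.1},
]


def attributions_for(records):
    """Return the sorted list of contributors present in `records`."""
    return sorted({record["contributor"] for record in records})


# Used as a whole, the attribution list is computed once and never changes.
full_notice = attributions_for(pooled)

# Splitting off a subset means recomputing which contributors still apply.
subset = [record for record in pooled if record["value"] > 1.0]
subset_notice = attributions_for(subset)

# Merging the subset with another dataset means recomputing yet again.
merged_notice = attributions_for(subset + other)

print(full_notice)
print(subset_notice)
print(merged_notice)
```

The sketch only works because each record keeps its contributor tag. If that per-record provenance is dropped at any step, such as during aggregation or anonymization, the correct attribution list for a later subset can no longer be derived, which is the kind of burden the feedback to the Linux Foundation described.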

Transition

Several companies have already revealed plans to make their existing open datasets available under the new CDLA-Permissive-2.0 license, including Microsoft's research arm, which will now transition some of its open datasets, including Hippocorpus, Public Perception of Artificial Intelligence, Xbox Avatars Descriptions, Dual Word Embeddings, and GPS Trajectory.

IBM's Project CodeNet, NOAA JFK, Airline Reporting Carrier On-Time Performance, PubLayNet, and Fashion-MNIST will also transition to the new license.
