Roboflow expands open-source datasets for better computer vision AI models

All machine learning libraries and projects rely on data to learn, train and operate.

In an effort to help developers more easily benefit from labeled datasets and machine learning models for computer vision, Roboflow today announced an expansion of its datasets and AI models as part of its Roboflow Universe initiative, which could well be one of the largest such open-source repositories available. Roboflow claims that it now has over 90,000 datasets that include over 66 million images in the Roboflow Universe service launched in August 2021.

Roboflow was founded in 2019 and raised $20 million in a Series A funding round in September 2021. The company provides the open-source Universe repository of datasets and models for computer vision, as well as data labeling, model development and hosting capabilities. Its business model is to offer free tiers of service for entry-level users; as usage grows, or for organizations working with proprietary datasets, the company provides paid support and service options.

The Roboflow Universe isn't about simply providing images that a developer can use; it's about providing images that are curated in an approach that enables datasets to be used for AI-powered applications.

"A project is basically something that contains both a dataset someone could use and a trained model on top of that data set," Joseph Nelson, co-founder and CEO of Roboflow, told VentureBeat. "The dataset is both the images as well as the annotations."

Data is nice, labeled data is nicer

Nelson said that usually organizations spend a substantial amount of time preparing machine learning data.

The data preparation process involves data labeling and classification, such that a model can effectively be trained. Nelson said that the labeling in Roboflow Universe isn't just a description of an image either.

Labels that Roboflow Universe can include for a given dataset are things like a bounding box, which draws a rectangle around an object and can be helpful for object detection in a crowded landscape. Another type of labeling that Roboflow supports is instance segmentation, which provides a polygon shape that neatly maps around the object of interest.
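To make the difference between the two label types concrete, here is a minimal Python sketch of how a segmentation polygon relates to a bounding box: the box is simply the smallest axis-aligned rectangle that encloses the polygon's points. The function name `polygon_to_bbox` and the sample coordinates are hypothetical and used only for illustration; this is not Roboflow's code.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def polygon_to_bbox(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return the tightest axis-aligned box around a polygon.

    The result is (x_min, y_min, x_max, y_max) in the same pixel
    units as the input points. Hypothetical helper for illustration.
    """
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


# A made-up triangular outline of an object in an image.
outline = [(10, 20), (50, 15), (40, 60)]
box = polygon_to_bbox(outline)
```

The polygon carries strictly more information than the box: a box can always be derived from a polygon, but not the other way around.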

Data-labeling formats used in machine learning are also often complex and varied. To that end, Nelson said that Roboflow supports exporting datasets in 36 annotation formats. Among the supported formats are COCO JSON, VOC XML and the YOLO Darknet TXT format.
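As an illustration of why format conversion matters, the same bounding box is written differently in each format. COCO JSON stores a box as `[x_min, y_min, width, height]` in pixels, while a YOLO Darknet TXT line stores `class x_center y_center width height` with coordinates normalized by the image size. The following is a minimal Python sketch of that conversion; the function names and sample values are hypothetical, and it does not represent how Roboflow implements its exports.

```python
from typing import Sequence, Tuple


def coco_bbox_to_yolo(
    bbox: Sequence[float], image_width: int, image_height: int
) -> Tuple[float, float, float, float]:
    """Convert a COCO box [x_min, y_min, width, height] (pixels) into
    YOLO form (x_center, y_center, width, height), each in 0..1."""
    x_min, y_min, width, height = bbox
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return (x_center, y_center, width / image_width, height / image_height)


def yolo_line(
    class_id: int, bbox: Sequence[float], image_width: int, image_height: int
) -> str:
    """Format one annotation as a line of a YOLO Darknet TXT label file."""
    xc, yc, w, h = coco_bbox_to_yolo(bbox, image_width, image_height)
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Made-up example: a 200x100 pixel box in a 400x200 pixel image.
line = yolo_line(0, [100, 50, 200, 100], image_width=400, image_height=200)
```

Pascal VOC XML uses yet another convention, storing the corner coordinates `xmin`, `ymin`, `xmax` and `ymax` in pixels, which is why a training pipeline typically needs the dataset exported in the exact format its framework expects.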

"Making the image data broadly available and usable means that someone can immediately find a dataset, pull it into their training pipeline, and get up and going," Nelson said.

How developers integrate Roboflow Universe datasets into applications

Bringing computer vision datasets and models into AI-powered applications can often be a complex integration.

Nelson's goal with Roboflow is to help minimize the complexity. He said that Roboflow Universe datasets can be accessed via open APIs. For example, he noted that Roboflow has a Python package hosted on the Python Package Index (PyPI) that enables developers to programmatically pull down images, annotations and models and then embed those components directly into an application.

Deploying a Roboflow Universe model into popular cloud machine learning services, including Amazon SageMaker or Google's Vertex AI, is also a straightforward operation via an API call, according to Nelson. Additionally, Roboflow makes datasets and models available as Docker containers, enabling deployment on edge devices. There is also a software development kit (SDK) for Apple iOS devices.

"If we make it very easy to use a model wherever you want to use it, then ideally, an engineer focuses their time on the thing that their business logic actually does," Nelson said.

The intersection of open source models and AI bias

Making it easier to access datasets and models for computer vision to build applications is a key goal for Roboflow. Another effect of having such a large corpus of open-source data is that it can help address concerns about AI bias.

"Bias in AI is never a solved problem," Nelson said. "But providing explainability, accessibility and discoverability can help."

Nelson explained that AI bias is often about trying to understand why a model made a particular decision. Fundamentally, the way that models make decisions is based on data the models are trained on. By having a larger dataset that includes more diversity, a model can potentially become more representative, with less risk of bias.

"Ultimately a lot of AI bias problems stem from under-representation," Nelson said. "The way to fix under representation is by enabling active collection of data sets of the underrepresented class, and making that data accessible, searchable and usable."
