Enterprises today have vast amounts of unstructured data scattered across numerous environments.

The “dirty secret,” according to Unstructured.io founder and CEO Brian Raymond, is that data scientists are often still processing all that data exactly as they did 20 years ago, typically by manually building pre-processing pipelines.

“Data scientists hate pre-processing,” he told the audience at VentureBeat Transform 2023. “It’s like going to the dentist.”

Unstructured.io, which uses natural language processing to transform data from its raw form into a machine learning-ready form, was selected as Most Likely to Succeed at the Innovation Showcase at VentureBeat Transform 2023.

Connecting data to LLMs

Raymond described his company’s platform as an ETL — extract, transform and load — for large language models (LLMs).

“We like to think of ourselves as top of funnel,” he said.

Unstructured.io connects data to LLMs and uses a variety of technologies, including computer vision, natural language processing (NLP) and Python scripts, to abstract away complexity.

The unstructured data is curated, cleaned of artifacts and made LLM-ready, Raymond explained. This approach is simpler and faster, and data scientists don’t have to write hundreds of lines of parsing code.
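To give a sense of the hand-written parsing work described above, here is a minimal sketch of one such pre-processing step: turning a raw HTML page into clean paragraphs of text. It uses only the Python standard library and is not Unstructured.io's code or API; the names `TextExtractor` and `html_to_clean_paragraphs` are hypothetical, chosen for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script and style blocks (illustrative only)."""

    SKIP_TAGS = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "br", "tr"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")  # block elements start a new paragraph

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_clean_paragraphs(html):
    """Return the visible text of an HTML string as a list of tidy paragraphs."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    paragraphs = []
    for line in "".join(parser.chunks).split("\n"):
        # Collapse runs of whitespace, including non-breaking spaces.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            paragraphs.append(line)
    return paragraphs


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Q2  Report</h1><p>Revenue   grew.</p>"
    "<script>track()</script><p>Costs&nbsp;fell.</p></body></html>"
)
print(html_to_clean_paragraphs(sample))
```

Even this toy version handles only one file type and a handful of tags. Covering PDFs, Word documents, tables and scanned images by hand is where the hundreds of lines of parsing code come from.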

Clean, structured data can be elusive

The tool’s enterprise API enables browser workflows for all types of developers, and supports pre-processing of more than 25 file types and thousands of formats in more than 100 languages, said Raymond. It is available as a free API, as a Google Colab notebook and on GitHub, where its library provides open-source components for pre-processing text documents such as PDFs, HTML and Word documents.

Raymond said he came up with the idea for the company after being “stuck in data engineering hell” at a previous employer. Just getting clean, structured data took years, he said.

Unstructured.io was founded in 2022 and the company is now “hard at work” on enterprise-grade data connectors that are resistant to interruptions and can detect new file versions and easily parallelize, said Raymond. The company currently has 15 data connectors, and plans to increase to more than 30.

The Innovation Showcase at this year’s VentureBeat Transform highlighted 10 companies in the generative AI, machine learning (ML) and analytics spaces. The three winners were Unstructured.io (Most Likely to Succeed), Arize AI (Best Technology) and Skyflow (Best Presentation Style); the other seven companies received Honorable Mentions.

