What is unstructured data in AI?

Many databases are filled with information that’s carefully organized into rows and columns. The type and role of each part is predefined and often enforced by software that checks the data before and after it’s stored. Studying these tables for insights is relatively straightforward for data scientists.

Some data sources, though, lack predictable order, but this doesn’t mean that they can’t be useful. The most common sources in this vein are texts written in human languages. Aside from the basic rules of grammar and some conventions of storytelling and journalism, there is little obvious structure that can be used to make sense of the information and turn it into solid data.

Other potential sources of unstructured information come from automatic collection, often the telemetry from smart devices. The burgeoning world of the internet of things (IoT) is producing petabytes of information that are largely unstructured. These files may have a basic format with some predefined fields for timestamps, but the readings from the sensors frequently arrive in raw form with little or no classification or interpretation.

Some artificial intelligence (AI) scientists specialize in making sense of what is known as unstructured data. In some sense, all data files come with a certain amount of structure or rules, and the challenge is to look beyond this structure for more in-depth insights.

How is unstructured data analyzed?

The approaches are largely statistical. The algorithms look for patterns or relationships between various entries. Are the same words typically found in the same sentence or paragraph? Does some value of a sensor spike just before another one? Are some colors common in an image?
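As a rough illustration of the first question, the sketch below counts how often two words appear in the same sentence. The function name, the naive sentence splitting and the sample text are assumptions made for this example, not part of any particular product.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text):
    """Count how often each pair of distinct words shares a sentence."""
    counts = Counter()
    # Naive sentence split on ., ! and ?; real tokenizers are more careful.
    for sentence in re.split(r"[.!?]+", text):
        # Sort the unique words so each pair is always stored in the same order.
        words = sorted(set(re.findall(r"[a-z']+", sentence.lower())))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical sample text for illustration.
sample = "The pump failed. The pump overheated before the valve failed."
pairs = cooccurrence_counts(sample)
print(pairs[("failed", "pump")])
```

Pairs with high counts are candidates for a relationship worth studying. The same counting idea could be applied to sensor events that fall inside the same time window.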

Many modern algorithms impose an extra basic layer of structure on the data source, a process that’s frequently called embedding the data or building an embedding. A text, for instance, may be searched for the 10,000 most common words that aren’t common in other books or sources. An image may be broken into sections. This rough structure becomes the foundation for later statistical analysis.
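One simple way to picture this step is a word-count vector. The sketch below picks the most frequent words that are not on a list of common words and then turns a text into a list of counts. The common-word list, the tiny vocabulary size and the sample sentences are all hypothetical. Many production systems use learned, dense embeddings instead of raw counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(text, common_words, size):
    """Pick the `size` most frequent words that are not in `common_words`."""
    counts = Counter(w for w in tokenize(text) if w not in common_words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def embed(text, vocabulary):
    """Represent a text as a vector of counts, one slot per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical stand-in for a list of words that are common everywhere.
COMMON = {"the", "a", "and", "of", "is", "in", "to"}
corpus = "The sensor reading is high and the sensor alarm is in the log"
vocab = build_vocabulary(corpus, COMMON, size=3)
vector = embed("The alarm and the sensor", vocab)
print(vocab, vector)
```

Once every document is reduced to a vector of the same length, standard statistical tools can compare them.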

The creation of these embeddings is often as much an art as it is a science. Much of the work done by data scientists involves designing and testing various strategies for building the rough embedding.

In many cases, domain expertise can make it possible for a human to transfer their understanding from the area to the algorithm. For instance, a doctor may decide that all blood pressure readings above a certain value should be classified as “high.” An insurance adjuster may decide that all rear-end collisions are the fault of the trailing car. These rules bring structure to the embeddings and the data to help classify it.
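Expert rules like these are often the easiest part to express in code. The sketch below is a minimal example. The threshold of 140 is a hypothetical cutoff chosen only for illustration, not medical guidance, and the function names are invented for this example.

```python
HIGH_SYSTOLIC_THRESHOLD = 140  # hypothetical cutoff chosen for illustration only


def label_blood_pressure(systolic, threshold=HIGH_SYSTOLIC_THRESHOLD):
    """Apply the expert's rule: readings above the threshold are 'high'."""
    return "high" if systolic > threshold else "not high"


def at_fault_party(collision_type):
    """Apply the adjuster's rule; other collision types stay undecided."""
    if collision_type == "rear-end":
        return "trailing car"
    return "undetermined"


readings = [118, 152, 140]
labels = [label_blood_pressure(r) for r in readings]
print(labels)
print(at_fault_party("rear-end"))
```

The labels produced by rules like these can then be stored alongside the raw data and used as features or training targets.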

What are the goals for unstructured AI?

The goals vary from domain to domain. A common request is to find similar items in a database. Is a similar face found in this collection of photographs? Is this text plagiarized from a book? Is there another person with a similar resume?
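Similarity search is commonly framed as comparing vectors. The sketch below uses cosine similarity, one widely used measure, to find the stored item closest to a query. The resume vectors are made-up numbers standing in for whatever embedding a real system would produce.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, items):
    """Return the key of the stored vector closest to `query`."""
    return max(items, key=lambda name: cosine_similarity(query, items[name]))


# Hypothetical embedded resumes; the numbers are invented for illustration.
resumes = {
    "resume_a": [3, 0, 1],
    "resume_b": [0, 2, 2],
    "resume_c": [1, 1, 0],
}
best = most_similar([2, 0, 1], resumes)
print(best)
```

This linear scan is fine for a small collection. Large collections usually rely on specialized indexes that avoid comparing the query with every item.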

Others try to make predictions for the future to help an enterprise plan. This may mean predicting how many cars might be sold next year or how weather conditions might affect demand. This work is often much more challenging than searching for similar entries.

Some work solely to classify data. Security researchers, for example, want to use AI to look for anomalies in the log files that should be investigated. Bank programmers, on the other hand, may need to flag potentially fraudulent or suspicious transactions because of rules imposed by regulators. Some classification algorithms simply work to codify the data. Machine vision algorithms, for instance, may look at faces and try to classify whether the people are happy, sad, angry, worried or experiencing any of a large set of emotions.
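A very simple form of anomaly detection flags values that sit far from the average. The sketch below applies that idea to hourly counts pulled from a log file. The counts, the variable names and the threshold of two standard deviations are assumptions for illustration. Real security tools use far richer models.

```python
import statistics


def flag_anomalies(counts, threshold=2.0):
    """Return indexes of values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(counts)
    spread = statistics.pstdev(counts)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mean) / spread > threshold]


# Hypothetical failed-login counts for eight consecutive hours.
failed_logins_per_hour = [3, 4, 2, 5, 3, 4, 40, 3]
suspicious_hours = flag_anomalies(failed_logins_per_hour)
print(suspicious_hours)
```

A single large outlier also inflates the standard deviation, so this approach can miss smaller anomalies that occur in the same data.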

How do some major companies work with unstructured data?

The major cloud companies have expanded their cloud services to support creating data lakes from unstructured data. The providers all offer various storage solutions that are tightly coupled with their various AI services to turn the data into meaningful insights.

Microsoft’s Azure AI uses a mixture of text analysis, optical character recognition, voice recognition and machine vision to make sense of an unstructured collection of files that may be texts or images. Its Cognitive Search Service will build a language-aware index of the data to guide searching and find the most relevant documents. Machine learning algorithms are integrated with traditional text searching to focus on significant terms like personal names or key phrases. Its knowledge mining algorithms are tunable by data scientists to unlock more profound studies of the data. The Cognitive Search Service is a bundled product, but the various algorithms for machine learning and search are also available independently.

Google offers a wide range of tools for storing data and applying various artificial intelligence algorithms to it. Many of the tools are well suited to unstructured data. AutoML, for example, is designed to simplify the construction of machine learning models, and it’s integrated directly with a number of Google’s data storage options to enable data lakes. Vision AI can analyze images, decode text and even classify the emotions of people in the images. Cloud Natural Language can find key passages, identify domain-specific words and translate words. All are sold as cloud products and billed according to usage.

IBM also supports building data warehouses and data lakes with tools for both data storage and analysis that encompass the major algorithms from statistical analysis and artificial intelligence. Some of its products bundle together several of these options into task-centered tools. Teams looking for predictive analytics, for example, could use its SPSS Statistics package together with Watson AI Studio to create models of future behavior. The technologies are integrated with IBM’s storage options like the Db2 database and can be either installed on premises or used in the cloud.

AWS supports creating data lakes for unstructured data with a variety of products. The company's Redshift tool, for example, can search and analyze data from a variety of sources, ranging from S3 object storage to more structured SQL databases. It simplifies working with complex architectures by providing a single interface. Amazon also offers a variety of machine learning, machine vision and artificial intelligence services that will work with all of its data storage options. These are generally available as either dedicated instances or sometimes as serverless options that are billed only when used.

Oracle also offers a wide range of artificial intelligence tools. The Oracle Cloud Infrastructure (OCI) for Language is optimized for classifying unstructured text by looking for important phrases and entities. It can detect languages, begin translation and classify the sentiment of the writer. The Data Integration tool brings all the power of artificial intelligence to a code-free tool for data analysis and reporting. A collection of pre-built models can work with standard languages, while some teams may want to create their own models.

How are startups targeting unstructured data?

Making some sense of unstructured data is the focus for many startups specializing in artificial intelligence, machine learning and natural language processing. Some are focused on building better algorithms with deeper insight, and others are creating better models that can be applied directly to problems.

The field has a natural overlap with data science and predictive analytics. The process of finding insight in text and visual data is a natural complement to creating reports and generating predictions from more structured data.

Some startups focus on providing the tools so that developers can create their own models by working with the data directly. Firms like Squirro, TeX AI, RapidMiner, Indico, Dataiku, Alteryx and H2O AI are just some companies building the foundation for conducting AI experiments with your own data.

One particular focus is natural language processing. Hugging Face has created a platform where companies can share their models with others, a process that encourages the development of sophisticated, more general models with broad ability.

Basis Technology is also creating tools that identify significant names and entities in unstructured text. Their product Rosette searches for relationships between the identities and creates semantic maps between them.

Others are commercializing their own models and reselling them directly. OpenAI is creating a large model of human language, GPT-3, and opening up access through an API so developers can add its features to their own products. It is well suited to work like copywriting, text classification and text summarization. The company is also building a collection of book summaries. GitHub, for instance, uses OpenAI technology in its Copilot tool, which acts as a smart assistant that helps programmers write code faster.

Cohere AI is also building its own model and opening it up via an API. Some developers are using the model to classify documents for projects like litigation support. Others are using the model to help writers find the right words and create better documents.

Some are focusing natural language models on specific tasks. You.com, for instance, is building a new search engine that offers more control to users while also relying on smarter AI to extract meaning and find the best answers. Others are packaging similar approaches as APIs for developers. ZIR and Algolia are building pluggable search engines with semantic models that can perform better than pure keyword search.

A number of the startups want to bring the power of the algorithms to particular industries or niches. They can tap into unstructured data as part of a larger focus on solving clear-cut problems for their targeted market. Viz AI, for instance, is creating an intelligent care coordinator for tracking patients in various stages of recovery. Socure hopes to improve identity verification and fraud detection for banks and other industries trying to distinguish between authentic and inauthentic behavior. Exceed AI is creating virtual sales assistants that help customers find answers and products.

What AI and unstructured data can’t do

The biggest limitation for the algorithms is the quality of any signal in the data. Occasionally, the data — structured or unstructured — doesn’t offer much correlation that can lead to a solid answer to a particular question. If there’s no significant connection or there’s too much random noise, there will be no signal for the algorithms to identify.
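One rough way to check for a signal before investing in a model is to measure the correlation between two series. The sketch below computes the Pearson correlation coefficient by hand. The temperature and demand numbers are invented for illustration, and a value near zero only rules out a linear relationship, not every possible pattern.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no usable signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one demand series tracks temperature, the other does not.
temperature = [1, 2, 3, 4, 5]
demand_with_signal = [2, 4, 6, 8, 10]
demand_without_signal = [5, 1, 4, 1, 5]

strong = pearson(temperature, demand_with_signal)
weak = pearson(temperature, demand_without_signal)
print(round(strong, 3), round(weak, 3))
```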

This challenge is more significant for unstructured data because extra, unhelpful bits are more likely to be part of the information. While the algorithms are designed to sift through the information and exclude the unhelpful parts, there are still limits to their power. There is typically much more noise in the unstructured data.

The problem is compounded by the limited value of finding a weak signal. If an event doesn’t happen very frequently, detecting it may not yield much profit. Even when the algorithms are successful, some unstructured data analysis does not pay off because the success is too infrequent.

Often, poorly defined questions produce ambiguous results. Some teams approach unstructured data searching for insights, but without clearly written definitions of what they want to learn, the answers may be just as ambiguous as the questions. A big challenge for many unstructured projects is simply defining a clear goal, so the models can be trained accurately.