Supervised vs. unsupervised learning: What's the difference?

At the advent of the modern AI era, when it was discovered that powerful hardware and datasets could yield strong predictive results, the dominant form of machine learning fell into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data, predict outcomes, and more. But while supervised learning can, for example, anticipate the volume of sales for a given future date, it has limitations in cases where data falls outside the context of a specific question.

That's where semi-supervised and unsupervised learning come in. With unsupervised learning, an algorithm is given "unknown" data for which no previously defined categories or labels exist. The machine learning system must teach itself to classify the data, processing the unlabeled data to learn from its inherent structure. In the case of semi-supervised learning -- a bridge between supervised and unsupervised learning -- an algorithm determines the correlations between data points and then uses a small amount of labeled data to mark those points. The system is then trained based on the newly applied data labels.

Unsupervised learning excels in domains where labeled data is scarce, but it's not without its own weaknesses -- nor is semi-supervised learning. That's why, particularly in the enterprise, it helps to define the business problem in need of solving before deciding which machine learning approach to take. While supervised learning might be suited for classification tasks, like sorting business documents and spreadsheets, it would adapt poorly in a field like health care if used to identify anomalies in unannotated data, like test results.

Supervised learning

Supervised learning is the most common form of machine learning used in the enterprise. In a recent O'Reilly report, 82% of respondents said that their organization opted to adopt supervised learning versus unsupervised or semi-supervised learning. And according to Gartner, supervised learning will remain the type of machine learning that organizations leverage most through 2022.

Why the preference for supervised learning? It's perhaps because it's effective in a number of business scenarios, including fraud detection, sales forecasting, and inventory optimization. For example, a model could be fed data from thousands of bank transactions, with each transaction labeled as fraudulent or not, and learn to identify the patterns that lead to a "fraudulent" or "not fraudulent" output.

Supervised learning algorithms are trained on input data annotated for a particular output until they can detect the underlying relationships between the inputs and output results. During the training phase, the system is fed with labeled datasets, which tell it which output is related to each specific input value. The supervised learning process progresses by constantly measuring the resulting outputs and fine-tuning the system to get closer to the target accuracy.
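
To make that training loop concrete, here is a minimal sketch in Python, assuming a tiny logistic regression model fitted by gradient descent. The transactions, their two features (amount in thousands of dollars and a "foreign" flag), and the learning rate are all hypothetical, chosen only to illustrate the measure-and-adjust cycle described above.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, epochs=2000, lr=0.1):
    """Fit weights by repeatedly comparing predictions to labels and nudging the model."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # how far the output is from the labeled target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0

# Hypothetical labeled data: [amount in thousands, foreign flag] -> 1 if fraudulent
transactions = [[0.2, 0], [0.5, 0], [0.3, 1], [4.0, 1], [5.5, 1], [6.0, 0]]
is_fraud = [0, 0, 0, 1, 1, 1]

weights, bias = train(transactions, is_fraud)
accuracy = sum(
    predict(weights, bias, x) == y for x, y in zip(transactions, is_fraud)
) / len(is_fraud)
print(f"training accuracy: {accuracy:.2f}")
```

A production fraud model would use far more data and features, but the shape of the loop (predict, measure the error against the label, adjust) is the same.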

Supervised learning requires high-quality, balanced, normalized, and thoroughly cleaned training data. Biased or duplicate data will skew the system's understanding, with data diversity usually determining how well it performs when presented with new cases. But high accuracy isn't necessarily a good indication of performance -- it might also mean the model is suffering from overfitting, where it's overtuned to a particular dataset. In this case, the system will perform well in test scenarios but fail when presented with a real-world challenge.
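
One common way to spot overfitting is to compare accuracy on the training data with accuracy on data the model has never seen. The sketch below assumes a deliberately extreme case: a hypothetical "memorizer" that stores every training example verbatim, set against a simple threshold rule. The data and both models are invented for illustration.

```python
def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

def fit_memorizer(train):
    """Overfit on purpose: remember each training input exactly."""
    table = {x: y for x, y in train}
    return lambda x: table.get(x, 0)  # unseen inputs fall back to class 0

def fit_threshold(train):
    """A simpler model: one cutoff halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cutoff = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x >= cutoff else 0

# Hypothetical data: the true label is 1 whenever the value is 50 or more
train = [(x, int(x >= 50)) for x in range(0, 100, 10)]     # 0, 10, ..., 90
held_out = [(x, int(x >= 50)) for x in range(5, 100, 10)]  # 5, 15, ..., 95

memorizer = fit_memorizer(train)
threshold = fit_threshold(train)

for name, model in [("memorizer", memorizer), ("threshold", threshold)]:
    print(name, accuracy(model, train), accuracy(model, held_out))
```

Both models score perfectly on the training set, but only the threshold rule holds up on the held-out values. The gap between the two numbers, not the training accuracy alone, is what signals a problem.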

One downside of supervised learning is that a failure to carefully vet the training datasets can lead to catastrophic results. An earlier version of ImageNet, a dataset used to train AI systems around the world, was found to contain photos of naked children, porn actresses, college parties, and more -- all scraped from the web without those individuals' consent. Another computer vision corpus, 80 Million Tiny Images, was found to have a range of racist, sexist, and otherwise offensive annotations, such as nearly 2,000 images labeled with the N-word, and labels like "rape suspect" and "child molester."

Semi-supervised learning

In machine learning problems where supervised learning might be a good fit but there's a lack of quality data available, semi-supervised learning offers a potential solution. Residing between supervised and unsupervised learning, semi-supervised learning accepts data that's partially labeled or where the majority of the data lacks labels.
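
As a rough illustration of how a handful of labels can be stretched across unlabeled data, the sketch below assumes one simple strategy, pseudo-labeling by nearest labeled neighbor, followed by training a nearest-centroid classifier on the enlarged set. The points, the labels "A" and "B", and the choice of strategy are all assumptions; real semi-supervised systems use a variety of more sophisticated methods.

```python
import math

def nearest_label(point, labeled):
    """Borrow the label of the closest labeled example."""
    return min(labeled, key=lambda item: math.dist(point, item[0]))[1]

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

# Hypothetical data: only two points carry labels
labeled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]
unlabeled = [(1.5, 0.5), (2.0, 1.5), (0.5, 2.0), (8.0, 9.5), (9.5, 8.0), (8.5, 8.5)]

# Step 1: use the few labels to mark the unlabeled points (pseudo-labels)
pseudo = [(p, nearest_label(p, labeled)) for p in unlabeled]

# Step 2: train on the original labels plus the newly applied ones
training = labeled + pseudo
centroids = {
    label: centroid([p for p, l in training if l == label])
    for label in {l for _, l in training}
}

def classify(point):
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

print(classify((2.0, 2.0)), classify((7.0, 7.0)))
```

The obvious risk is that a wrong pseudo-label gets baked into the trained model, which is why the quality of the few real labels matters so much.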

The ability to work with limited data is a key benefit of semi-supervised learning, because data scientists spend the bulk of their time cleaning and organizing data. In a recent report from Alation, a clear majority of respondents (87%) pegged data quality issues as the reason their organizations failed to successfully implement AI.

Semi-supervised learning is also applicable to real-world problems where having only a small amount of labeled data would prevent supervised learning algorithms from functioning well. For example, it can alleviate the data prep burden in speech analysis, where labeling audio files is typically very labor-intensive. Web classification is another potential application; organizing the knowledge available in billions of webpages would take an inordinate amount of time and resources if approached from a supervised learning perspective.

Unsupervised learning

Where labeled datasets don't exist, unsupervised learning -- a category often taken to include the closely related technique of self-supervised learning -- can help to fill the gaps in domain knowledge. Clustering is the most common process used to identify similar items in unsupervised learning. The task is performed with the goal of finding similarities in data points and grouping similar data together.
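
For a sense of how clustering groups similar points without any labels, here is a minimal sketch assuming k-means, one widely used clustering algorithm (the article doesn't name a specific one). The points are hypothetical, and the centers are naively initialized from the first k points to keep the example deterministic.

```python
import math

def k_means(points, k, iterations=10):
    centers = list(points[:k])  # naive initialization: an assumption for this sketch
    clusters = []
    for _ in range(iterations):
        # Assign each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its assigned points
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabeled data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)

for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

Note that the algorithm never sees a label. It only discovers that the points fall into two groups; deciding what those groups mean is left to a person.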

Clustering similar data points helps to create more accurate profiles and attributes for different groups. Clustering can also be used to reduce the dimensionality of very large datasets.

Reducing dimensions, a process that isn't unique to unsupervised learning, decreases the number of attributes in datasets so that the data generated is more relevant to the problem being solved. Reducing dimensions also helps cut down on the storage space required to store datasets and can potentially improve performance.
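
One of the simplest ways to trim attributes is a low-variance filter, which drops columns whose values barely change and therefore carry little information. The sketch below assumes that technique; the attribute names, rows, and threshold are hypothetical, and other approaches such as principal component analysis are more common in practice.

```python
from statistics import pvariance

def drop_low_variance(rows, names, threshold=0.01):
    """Keep only the attributes whose values vary by more than the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, [names[i] for i in keep]

# Hypothetical customer records; country_code is identical in every row
names = ["age", "country_code", "spend"]
rows = [
    [34, 1, 120.0],
    [51, 1, 80.5],
    [27, 1, 310.0],
    [45, 1, 45.25],
]

reduced, kept = drop_low_variance(rows, names)
print(kept)
print(reduced)
```

Here the constant column is removed and each record shrinks from three attributes to two, without losing anything that could distinguish one customer from another.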

Unsupervised learning can be used to flag high-risk gamblers, for example, by determining which players spend more than a certain amount on casino websites. It can also help with characterizing interactions on social media by learning the relationships between things like likes, dislikes, shares, and comments.

Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. In a paper, researchers at the company detail SoftNER, a framework that Microsoft deployed internally to collate information regarding storage, compute, and outages. They claim that it eliminated the need to annotate a large amount of training data while scaling to a high volume of timeouts, slow connections, and other product interruptions.

More recently, Facebook announced SEER, an unsupervised model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks. SEER learned to make predictions from random pictures found on Instagram profile pages.

Unfortunately, unsupervised learning doesn't eliminate the potential for bias in the system's predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Some experts, including Facebook chief scientist Yann LeCun, theorize that removing these biases might require a specialized training of unsupervised models with additional, smaller datasets curated to "unteach" specific biases. But more research must be done to figure out the best way to accomplish this.

Choosing the right approach

Between supervised, semi-supervised, and unsupervised learning, there's no flawless approach. So which is the right method to choose? Ultimately, it depends on the use case.

Supervised learning is best for tasks like forecasting, classification, performance comparison, predictive analytics, pricing, and risk assessment. Semi-supervised learning often makes sense for general data creation and natural language processing. As for unsupervised learning, it has a place in performance monitoring, sales functions, search intent, and potentially far more.

As new research emerges addressing the shortcomings of existing training approaches, the optimal mix of supervised, semi-supervised, and unsupervised learning is likely to change. But identifying where these techniques bring the most value -- and do the least harm -- to customers will always be the wisest starting point.