The business value of clustering algorithms

A single type of machine learning algorithm can be used to identify fake news, filter spam, and personalize marketing materials. Known as clustering algorithms, or "clustering" for short, they can automatically discover natural groupings of events, people, and objects in large datasets.

Operating on the theory that data points in groups should have similar features, clustering algorithms have been adopted widely across enterprises to detect fraud, recommend content to users, and more. But they come with challenges that can be difficult for businesses to overcome without the right approaches in place. For example, before a clustering algorithm can be used, data has to be in a standardized format. And the number of clusters sometimes must be decided ahead of deployment, because too many clusters could lead to process inefficiencies while too few could sacrifice accuracy.

Clustering algorithms

Clustering algorithms are a form of unsupervised learning. With unsupervised learning, an algorithm is given "unknown" data for which no previously defined categories or labels exist. The machine learning system must teach itself to group the data, learning from the inherent structure of the unlabeled examples.

This means that clustering algorithms can be used to automatically identify patterns and structures in data. A grocer could employ clustering to segment its loyalty card customers into different groups based on their buying behavior, for example, while an email provider could apply clustering for spam filtering by looking at the different sections of the email (e.g., the header and sender) and grouping together similar messages.

Another example of clustering algorithms in use is recommender systems, which group together users with similar viewing, browsing, or shopping patterns to recommend similar content. In manufacturing, clustering enables anomaly detection, helping to spot defective parts. And in the life sciences, clustering has been applied in evolutionary biology to surface patterns in DNA.

Choosing a clustering algorithm

A key step in deploying clustering is deciding which algorithm to use. One of the most common is k-means, which works by computing the "distances" (a measure of how dissimilar two items are) between data points and "group centers" (the average of the points in each group) and assigning each point to its nearest center. But there's also mean-shift clustering, which attempts to find dense areas of data points; density-based spatial clustering of applications with noise (DBSCAN); and agglomerative hierarchical clustering, to name a few algorithms.

K-means has the advantage of speed, but it requires that someone select the number of groups in advance, and it starts from a random choice of group centers. Because of this, k-means clustering can yield different results on different runs of the algorithm -- which isn't ideal in mission-critical domains like finance.
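To make the dependence on the starting centers concrete, here is a minimal sketch of k-means (Lloyd's algorithm) using only the Python standard library. The toy points, the helper names `assign` and `kmeans`, and the choice of k=3 are illustrative assumptions, not taken from any production system. The sketch runs the algorithm from several random seeds and records the within-cluster sum of squared distances for each run; a common mitigation, assumed here, is to keep the run with the lowest cost.

```python
import math
import random


def assign(points, centers):
    """Give each point the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        for p in points
    ]


def kmeans(points, k, seed, max_iter=100):
    """Lloyd's algorithm; returns (centers, labels, within-cluster cost)."""
    rng = random.Random(seed)
    # Random starting centers drawn from the data: the source of
    # run-to-run variation described in the text.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the center to the mean of its members, axis by axis.
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty group keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:
            break  # converged: the centers stopped moving
        centers = new_centers
    labels = assign(points, centers)
    cost = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, cost


# Hypothetical 2-D data, e.g. two behavioral features per customer.
points = [
    (1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0),
    (4.5, 5.0), (3.5, 4.5), (9.0, 9.0), (8.5, 9.5),
]

# Same data, same k, different random starts.
runs = {seed: kmeans(points, k=3, seed=seed) for seed in range(5)}
for seed, (centers, labels, cost) in runs.items():
    print(seed, labels, round(cost, 2))

# Keep the restart with the lowest cost.
best_seed = min(runs, key=lambda s: runs[s][2])
```

Fixing the seed makes a single run reproducible, while comparing several seeds gives a rough sense of how stable the grouping is for a given dataset.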

By contrast, mean-shift clustering doesn't need a person to select the number of groups -- it discovers this automatically as it runs. DBSCAN doesn't require a preset number of groups, either, and helpfully flags outliers as noise. But both algorithms can be slow.
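The way DBSCAN finds groups without a preset count, and sets outliers aside as noise, can be sketched in a few lines of standard-library Python. This is a simplified illustration rather than a tuned implementation: the parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense region, counting the point itself), the `NOISE` marker, and the sample points are all assumptions made for the example.

```python
import math

NOISE = -1  # label used here for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1  # point i is dense enough to start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is dense too, so keep expanding
    return labels


points = [
    (1.0, 1.0), (1.1, 1.2), (0.9, 0.8), (1.2, 0.9),  # dense group
    (5.0, 5.0), (5.1, 5.2), (4.9, 4.8), (5.2, 5.1),  # second dense group
    (10.0, 0.0),                                      # isolated point
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The number of clusters (two) falls out of the data and the two parameters. The neighbor search in this sketch compares every pair of points, which is one reason naive versions of the method slow down on large datasets.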

As for hierarchical clustering, it's useful when the underlying data has a hierarchical structure as it can often recover the hierarchy. However, it's less efficient than k-means clustering.
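As a rough illustration of how agglomerative clustering recovers a hierarchy, the sketch below starts with every point in its own group and repeatedly merges the two closest groups, recording each merge. The function name `agglomerative`, the `linkage` parameter, and the one-dimensional sample data are assumptions for the example. Passing `min` gives single linkage (distance between the closest members); passing `max` would give complete linkage.

```python
import math


def agglomerative(points, linkage=min):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        # Compare every pair of clusters and remember the closest pair.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical one-feature data: two tight pairs and one distant point.
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = agglomerative(points)
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

Read from first merge to last, the history forms a tree: the two tight pairs join first, then the pairs join each other, and the distant point is absorbed last. The repeated all-pairs comparison also shows why this naive form costs far more than k-means as the dataset grows.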

Using clustering

Despite its potential, clustering isn't appropriate for every business scenario. It's best applied when starting from a large, unstructured dataset divided into an unknown number of classes, which would be too labor-intensive to segment manually.

As the engineering team at data science platform Explorium wrote in a recent blog post, clustering should be deployed where and when it will deliver the greatest impact and insights. In some cases, clustering might serve as a starting point rather than an end-to-end solution, shedding light on important features in a dataset that can then be explored with deeper, richer analyses.

"Much like with other useful algorithms and data science models, you'll get the most out of clustering when you deploy it not as a standalone, but as part of a broader data discovery strategy," the team wrote. "Cluster analysis can help you segment your customers, classify your data better, and generally structure your datasets, but it won't do much more if you don't give your data a broader context."

The road to implementation can be tricky, but successful clustering projects can yield sizeable returns on investment. As McKinsey wrote in a 2020 report, it's possible for any company to get a good amount of value from AI -- including clustering algorithms -- if it's applied effectively in a repeatable way.
