Join top executives in San Francisco on July 11-12, to hear how leaders are integrating and optimizing AI investments for success. Learn More
AI clustering is the machine learning (ML) process of organizing data into subgroups with similar attributes or elements. Clustering algorithms tend to work well in environments where the answer does not need to be perfect; it just needs to be similar or close enough to be an acceptable match. AI clustering can be particularly effective in identifying patterns in unsupervised learning. Some common applications are in human resources, data analysis, recommendation systems and social science.
Data scientists, statisticians and AI scientists use clustering algorithms to seek answers that are close to other answers. They first use a training dataset to define the problem and then look for potential solutions that are similar to those generated with the training data.
One challenge is defining “closeness,” because the desired answer is usually generated with the training data. When the data has several dimensions, data scientists can also guide the algorithm by assigning weights to the different data columns in the equation used to define closeness. It is not uncommon to work with several different functions that define closeness.
When the closeness function, also called the similarity metric or distance measure, is defined, much of the work is storing the data in a way that it can be searched quickly. Some database designers create special layers to simplify that search. A key part of many algorithms is the distance metric that defines how far apart two data points may be.
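As a rough illustration of how column weights can shape a closeness function, the sketch below computes a weighted Euclidean distance. The function name, the customer records and the weight values are all hypothetical, chosen only to show the effect of the weights.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two equal-length numeric records."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("records and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

# Hypothetical customer records: (age, purchases per year)
alice = (30.0, 2.0)
bob = (34.0, 10.0)

# Equal weights treat both columns the same way.
equal = weighted_distance(alice, bob, (1.0, 1.0))

# Down-weighting the purchases column makes age dominate the comparison.
age_heavy = weighted_distance(alice, bob, (1.0, 0.1))

print(equal, age_heavy)
```

With the second set of weights, the two customers look much closer because the large difference in purchases counts for less.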
Another approach involves turning the problem on its head and deliberately searching for the worst possible match. This is suited to problems such as anomaly detection in security applications, where the goal is to identify data elements that don’t fit in with the others.
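One simple way to search for the worst match, sketched here under the assumption that plain Euclidean distance is a suitable measure, is to score each point by the distance to its nearest neighbor and flag the point with the largest score. The helper names and the sample readings are invented for illustration; real anomaly detectors are usually more elaborate.

```python
import math

def nearest_neighbor_distance(index, points):
    """Distance from points[index] to its closest other point."""
    return min(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )

def most_isolated(points):
    """Return the index of the point farthest from its nearest neighbor."""
    return max(
        range(len(points)),
        key=lambda i: nearest_neighbor_distance(i, points),
    )

# Hypothetical readings: four similar values and one that does not fit in.
readings = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 7.5)]
outlier_index = most_isolated(readings)
print(readings[outlier_index])
```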
What are some examples of clustering algorithms?
Scientists and mathematicians have created different algorithms for detecting various types of clusters. Choosing the right solution for a specific problem is a common challenge.
The algorithms are not always definitive. Scientists may use methods that fall into only one classification, or they might employ hybrid algorithms that use techniques from multiple categories.
Categories of clustering algorithms include the following:
- Bottom-up: These algorithms, also known as agglomerative, are one form of hierarchical clustering. They begin by treating each data element as its own cluster and pairing it up with its closest neighbor. Then the pairs are themselves paired up. The clusters grow, and the algorithm continues until a threshold on the number of clusters or the distance between them is reached.
- Divisive: These algorithms are the top-down counterpart of the bottom-up or agglomerative approach. They begin with all points in one cluster and then look for a way to split it into two smaller clusters. This often means searching for a plane or other function that will cleanly divide the cluster into separate parts.
- K-means: This popular approach searches for k different clusters by first assigning the points randomly to k different groups. The mean of each cluster is calculated, and then each point is examined to see if it is closest to the mean of its own cluster. If not, it is moved to the cluster whose mean is nearest. The means are recalculated, and the results usually converge after several iterations.
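- K-medoids: This is similar to k-means, but the center of each cluster is a medoid, an actual data point chosen to minimize the total distance to the other points in the cluster, rather than a calculated mean.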
- Fuzzy: Each point can be a member of multiple clusters that are calculated using any type of algorithm. This can be useful when some points are equally distant from each center.
- Grid: The algorithms begin with a grid that is pre-defined by the scientists to slice up the data space into parts. The points are assigned to clusters based upon which grid block they fit.
- Wave: The points are first compressed or transformed with a function called a wavelet. The clustering algorithm is then applied using the compressed or transformed version of the data, not the original one.
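The k-means steps described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the fixed random seed, the sample points and the way an empty group is re-seeded with a random point are all assumptions made for the example.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    """Return (labels, centers) for a basic k-means run."""
    rng = random.Random(seed)
    # Start by assigning every point to a random group.
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for _ in range(max_iterations):
        centers = []
        for cluster in range(k):
            members = [p for p, label in zip(points, labels) if label == cluster]
            # Assumption: an empty group is re-seeded with a random point.
            centers.append(mean(members) if members else rng.choice(points))
        # Move each point to the group whose mean is nearest.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no point moved, so the run has converged
            break
        labels = new_labels
    return labels, centers

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
print(labels)
```

On data this cleanly separated, the first three points should share one label and the last three the other. Which group receives which label number depends on the random start.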
Note: Database companies often use the word “clustering” in a different way. The word also can be used to describe a group of machines that work together to store data and answer queries. In that context, the clustering algorithms make decisions about which machines will handle the workload. To make matters more confusing, sometimes these data systems will also apply AI clustering algorithms to classify data elements.
How are clustering algorithms used in specific applications?
Clustering algorithms are deployed as part of a wide array of technologies. Data scientists rely upon algorithms to help with classification and sorting.
For instance, a large number of applications for working with people can be more successful with better clustering algorithms. Schools may want to place students in class sections based on their talents and abilities. Clustering algorithms will put students with similar interests and needs together.
Some businesses want to separate their potential customers into different categories so that they can give the customers more appropriate service. Neophyte buyers can be offered extensive help so they can understand the products and the options. Experienced customers can be taken immediately to the offerings, and perhaps be given special pricing that’s worked for similar buyers.
There are many other examples from a diverse range of industries, like manufacturing, banking and shipping. All rely on the algorithms to separate the workload into smaller subsets that can get similar treatment. All of these options depend heavily on data collection.
How do distance metrics define the clustering algorithms?
If a cluster is defined by the distances between data elements, the measurement of the distance is an essential part of the process. Many algorithms rely on standard ways to calculate the distance, but some rely on different formulas with different advantages.
Many find the idea of a “distance” itself confusing. We use the term so often to measure how far we must travel in a room or around the globe that it can feel odd to consider two data points — like describing a user’s preferences for ice cream or paint color — as being separated by any distance. But the word is a natural way to describe a number that measures how close the elements may be to each other.
Scientists and mathematicians generally rely on formulas that satisfy what they call the “triangle inequality.” That is, the distance between points A and B plus the distance between B and C is greater than or equal to the distance between A and C. When the formula guarantees this, the process gains more consistency. Some also rely on stricter definitions like “ultrametrics” that offer stronger guarantees. The clustering algorithms do not, strictly speaking, need to insist upon this rule because any formula that returns a number might do, but the results are generally better when the rule holds.
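To make the triangle inequality concrete, the sketch below tests it on a small set of sample points for two formulas: ordinary Euclidean distance and its square. The helper names and sample points are hypothetical, and passing the check on a few points is only a spot test, not a proof that a formula always obeys the rule.

```python
import itertools
import math

def satisfies_triangle_inequality(distance, points, tolerance=1e-12):
    """Check d(a, c) <= d(a, b) + d(b, c) for every ordered triple of points."""
    return all(
        distance(a, c) <= distance(a, b) + distance(b, c) + tolerance
        for a, b, c in itertools.permutations(points, 3)
    )

def squared_euclidean(a, b):
    """Squared straight-line distance: returns a number, but breaks the rule."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Three evenly spaced points on a line.
samples = [(0.0,), (1.0,), (2.0,)]

euclidean_ok = satisfies_triangle_inequality(math.dist, samples)
squared_ok = satisfies_triangle_inequality(squared_euclidean, samples)
print(euclidean_ok, squared_ok)
```

The squared version fails because going from 0 to 2 directly costs 4, while going through 1 costs only 1 + 1 = 2.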
How are major companies approaching AI clustering?
The statistics, data science and AI services offered by leading tech vendors include many of the most common clustering algorithms. The algorithms are implemented in the languages that form the foundation of many of these platforms, often Python. Vendors include:
- SageMaker: Amazon’s turnkey solution for building AI models supports a number of approaches, like K-means clustering. These can be tested in notebooks and deployed after the software builds the model.
- Google includes a variety of clustering algorithms that can be deployed, including density-based, centroid-based and hierarchical algorithms. Their Colaboratory offers a good opportunity to explore the potential before deploying an algorithm.
- Microsoft’s Azure tools, like its Machine Learning designer, offer all of the major clustering algorithms in a form that’s open to experimentation. Its systems aim to handle many of the configuration details for building a pipeline that turns data into models.
- IBM offers clustering under both its data science and its AI tools. Both implement the major algorithms and provide tools like the Cloud Pak for Data or the Watson Studio.
- Oracle also offers clustering technology in all of its AI and data science applications. It has also built algorithms into its flagship database so that the clusters can be built inside the data storage without exporting them.
How are challengers and startups handling AI clustering?
Established data specialists and a raft of startups are challenging the major vendors by offering clustering algorithms as part of broader data analysis packages and AI tools.
Teradata, Snowflake and Databricks are leading niche companies focused on helping enterprises manage the often relentless flows of data by building data lakes or data warehouses. Their machine learning tools support some of the standard clustering algorithms so data analysts can begin classification work as soon as the data enters the system.
Startups such as the Chinese firm Zilliz, with its Milvus open-source vector database, and Pinecone, with its SaaS vector database, are gaining traction as efficient ways to search for matches that can be very useful in clustering applications.
Some are also bundling algorithms with tools focused on particular vertical segments. They pre-tune the models and algorithms to work well with the type of problems common in that segment. Zest.ai and Affirm are two examples of startups that are building models for guiding lending. They don’t sell algorithms directly but rely on algorithms’ decisions to guide their product.
A number of companies use clustering algorithms to segment their customers and provide more direct and personalized solutions. You.com is a search engine company that relies on customized algorithms to provide users with personalized recommendations and search results. Observe AI aims to improve call centers by helping companies recognize the opportunities in offering more personalized options.
Is there anything that AI clustering can’t do?
As with all AI, the success of clustering algorithms often depends on the quality and suitability of the data used. If the numbers yield tight clusters with large gaps in between, the clustering algorithm will find them and use them to classify new data with relative success.
The problems occur when there are not tight clusters, or the data elements end up in some gap where they are relatively equidistant between clusters. The solutions are often unsatisfactory because there’s no easy way to choose one cluster over another. One may be slightly closer according to the distance metric, but that may not be the answer that people want.
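One way to spot these ambiguous cases, sketched here with hypothetical names and made-up cluster centers, is to compare the distance to the nearest center with the distance to the runner-up. A small gap signals that the assignment is close to a coin toss. The sketch assumes Euclidean distance and at least two centers.

```python
import math

def assignment_margin(point, centers):
    """Return (index of nearest center, gap to the second-nearest center)."""
    distances = sorted((math.dist(point, c), i) for i, c in enumerate(centers))
    (nearest, index), (runner_up, _) = distances[0], distances[1]
    return index, runner_up - nearest

# Hypothetical cluster centers.
centers = [(0.0, 0.0), (10.0, 0.0)]

# A point deep inside the first cluster has a wide margin.
clear_index, clear_margin = assignment_margin((1.0, 0.0), centers)

# A point near the midpoint is only slightly closer to the second center.
unsure_index, unsure_margin = assignment_margin((5.1, 0.0), centers)

print(clear_index, clear_margin)
print(unsure_index, unsure_margin)
```

An application could route points with a small margin to a person for review instead of accepting the nearest cluster automatically.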
In many cases, the algorithms aren’t smart enough or flexible enough to accept a partial answer or one that chooses multiple classifications. While there are many real-world examples of people or things that can’t be easily classified, computer algorithms often have one field that can only accept one answer.
The biggest problems arise, though, when the data is too spread out and there are no clearly defined clusters. The algorithms may still run and generate results, but the answers will seem random and the findings will lack cohesion.
Sometimes it is possible to enhance the clusters or make them more distinct by adjusting the distance metric. Adding different weights for some fields or using a different formula may emphasize some parts of the data enough to make the clusters more clearly defined. But if these distinctions are artificial, the users may not be satisfied with the results.