Understanding the real capabilities of unsupervised machine learning

"Hey, Siri. What is the capital of New York?" We all know what happens next -- Siri provides the answer. How Siri knows the correct answer is not a mystery (we have the internet to thank for that), but what is more interesting is the fact that Siri is able to understand the question at all.

Siri can understand and respond to human speech for the same reason Facebook knows which friend to tag in a photo before you even type their name. This "knowledge" comes from a technology called machine learning.

Trained machine learning

There are two types of machine learning: trained and untrained. Most of us experience trained, or supervised, machine learning in our everyday lives, from weather forecasts and sports outcome predictions to Siri and Facebook. These examples are considered trained machine learning because the machine learns from examples that pair input data with known output data.

Trained machine learning performs either classification or regression. Classification is when the machine predicts discrete responses, such as whether an email is spam or legitimate. After a person has manually labeled enough examples, the machine begins to learn. It uses the information it has collected over time (input data) to determine the outcome, which must be one of the known output categories.
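To make the spam example concrete, here is a minimal sketch of a classifier that learns from a handful of hand-labeled messages. It assumes a simple word-counting approach (naive Bayes with add-one smoothing); the training messages and the names `train` and `classify` are made up for illustration, and a real spam filter would use far more data and a more careful model.

```python
from collections import Counter
import math

# Hypothetical hand-labeled examples: the "manual distinction" step.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday with the team", "legitimate"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log score."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total)
        n_words = sum(counts.values())
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, label_counts = train(TRAINING)
print(classify("free prize inside", word_counts, label_counts))
print(classify("agenda for the team lunch", word_counts, label_counts))
```

Whatever message it is given, the classifier can only answer with a label it saw during training, which is the sense in which the outcome must fall among the output data.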

Regression is when the machine predicts continuous responses. We see this form of trained machine learning in stock market predictions. Imagine you were asked to determine the missing number in this sequence: 3-9, 4-16, 5-25, 8-? -- what would you say? Your answer would likely be 64, and if so, you would be correct. It's safe to assume you came to that conclusion by studying the sequence and recognizing that each number was followed by its square. You determined the outcome by studying known pairs and identifying a pattern.
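The sequence puzzle can be sketched as a small regression. This example assumes the relationship has the form y = a * x^p and estimates a and p by least squares on the logarithms of the three known pairs; the function name `fit_power_law` and the choice of model are illustrative assumptions, not the only way to fit this data.

```python
import math

# The known pairs from the sequence: (input, output).
pairs = [(3, 9), (4, 16), (5, 25)]

def fit_power_law(points):
    """Least-squares fit of y = a * x**p, done as a straight line in log-log space."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope  # a, p

a, p = fit_power_law(pairs)
prediction = a * 8 ** p
print(round(p, 3), round(prediction, 3))
```

The fitted exponent comes out at approximately 2 and the prediction for 8 at approximately 64, a value that never appeared among the training outputs.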

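In both classification and regression, the machine uses input data to determine the output. In classification, the output must be one of the provided categories; in regression, it is a value that follows the pattern learned from the provided output data.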

For a more relatable example, let's look at the way Facebook suggests users to tag in your photos. Facebook does not know what you or your friends look like; it simply collects data from previously tagged photos and "learns," by repetition, how to identify each person. The more photos someone is tagged in, the more likely Facebook is to make an accurate suggestion. In general, the more input data a machine is fed, the more accurate the outcomes it can deliver.

Untrained machine learning

Untrained, or unsupervised, machine learning is different from trained in that it requires only input data. Most untrained machine learning is a form of cluster analysis, in which a set of data is grouped in a way so that the items in each group (or cluster) are more similar to each other than to those in other clusters.
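As a rough illustration of cluster analysis, the sketch below groups a list of numbers with a bare-bones one-dimensional k-means. The data (hypothetical login hours on a 24-hour clock), the starting centers, and the function name `k_means_1d` are all assumptions made for the example. Note that no labels are supplied; the groups emerge from the data alone.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical login hours (24-hour clock); no labels are provided.
login_hours = [8, 9, 9, 10, 8, 21, 22, 23, 22]
centers, clusters = k_means_1d(login_hours, centers=[0.0, 24.0])
print(centers)
print(clusters)
```

With this data the values settle into a morning group and a late-evening group. The sketch treats hours as plain numbers, so it ignores the fact that 23:00 and 00:00 are close together, and the result depends on the starting centers; a production system would need to handle both.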

With untrained machine learning, there aren't necessarily outcomes. Instead, we feed data into a machine learning algorithm to determine what is "normal" for a particular set of data. We don't tell the machine what is normal; rather, it goes through the data, determines what is typical, and creates groups based on behaviors. The system does not identify anything as bad. It determines what is interesting or different from the rest of a set.

Organizations can leverage untrained machine learning to protect against potential threats. The machine does this by examining a user's behavior (e.g., login times) to determine if there has been unusual activity. By tracking when, where, and from what device each user logs into a system, the machine can begin to create clusters. Over time, the machine will be able to predict a particular user's login behavior, so if something is far enough outside of the model, it will be flagged as strange behavior.
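One simple way to picture "far enough outside of the model" is a distance from a user's usual behavior. The sketch below assumes the model is nothing more than the mean and standard deviation of a user's past login hours, and it flags a login that falls more than three standard deviations away. The data, the threshold, and the function name `is_unusual` are hypothetical; real systems combine many more signals, such as location and device.

```python
from statistics import mean, stdev

def is_unusual(history, new_value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the user's mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Every past value was identical, so any different value stands out.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold

# Hypothetical login hours for one user (24-hour clock).
past_logins = [8, 9, 9, 10, 8, 9, 10, 9]
print(is_unusual(past_logins, 9))   # a typical morning login
print(is_unusual(past_logins, 3))   # a login at 3 a.m.
```

The function never labels a login as bad. It only reports that the 3 a.m. login is different from what this user normally does, which matches the point that the system identifies what is interesting rather than what is malicious.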

For example, let's say an employee mainly logs into the company system from work and home but is now logging in from a new location. Though this person has never logged in from the new location before, other users in their group have. Therefore, it is not normal for that particular person to be there, but since it is normal for other users in their group, it may not be abnormal enough to cause concern.
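The reasoning in this example can be written as a short decision rule. In the sketch below the group is supplied by hand, whereas the article describes the machine forming groups on its own through clustering; the user names, locations, and the function name `assess_login` are all invented for illustration.

```python
def assess_login(user, location, user_locations, group_members):
    """Rate a login location against the user's own history and their group's history."""
    if location in user_locations.get(user, set()):
        return "normal"
    peers = group_members - {user}
    if any(location in user_locations.get(p, set()) for p in peers):
        return "new for user, seen in group"
    return "flag for review"

# Hypothetical data: locations each user has logged in from, and one group.
user_locations = {
    "alice": {"office", "home"},
    "bob": {"office", "home", "branch"},
    "carol": {"office", "branch"},
}
group = {"alice", "bob", "carol"}

print(assess_login("alice", "home", user_locations, group))
print(assess_login("alice", "branch", user_locations, group))
print(assess_login("alice", "airport", user_locations, group))
```

The middle case is the one described above: the location is new for this user but familiar within the group, so it is treated as less concerning than a location nobody in the group has used.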

With untrained machine learning, the groups (output) are not manually selected. The system creates clusters by behavior and then compares new activity against those clusters.

The human element

As technology becomes increasingly sophisticated and machine learning becomes more and more integrated into our daily lives, many people fear machines will take over for humans. But the reality is machines are currently not viable for most applications without an added human element. Whether trained or untrained, machine learning will never completely eliminate the need for human participation.

Remember that the machine only learns from the data it is fed. When utilizing machine learning technology, it is important to understand which data points are meaningful. Determining the riskiness of a login behavior or confirming the identity of a person in a Facebook photo still depends on human verification.

So instead of fearing machine learning, organizations should learn how to use the technology to the best advantage while also understanding its limitations. It is critical to know the input data you are feeding into the system and have a clear understanding of the output data it produces. After all, in order for the machine to have actual "knowledge," it needs your intelligence.

David Ross is vice president of research at SecureAuth, a provider of access control solutions for premises, cloud and web applications, and resources.
