
Data is a human invention. Humans define the phenomenon they want to measure, design systems to collect data about it, clean and pre-process it before analysis, and finally, choose how to interpret the results. Even with the same dataset, two people can form vastly different conclusions. This is because data alone is not "ground truth": observable, provable, and objective data that reflects reality. If researchers infer data from other information, rely on subjective judgment, fail to collect data in a rigorous and careful manner, or use sources of questionable authenticity, then the data they produce is not ground truth.

How you choose to conceptualize a phenomenon, determine what to measure, and decide how to take measurements will affect the data that you collect. Your ability to solve a problem with artificial intelligence depends heavily on how you frame your problem and whether you can establish ground truth without ambiguity. We use ground truth as a benchmark to assess the performance of algorithms. If your gold standard is wrong, then your results will not only be wrong but also potentially harmful to your business.

Unless you were directly involved with defining and monitoring your original data collection goals, instruments, and strategy, you are likely missing critical knowledge that may result in incorrect processing, interpretation, and use of that data.

What people call “data” can actually be things like carefully curated measurements selected purely to support an agenda; haphazard collections of random information with no correspondence to reality; or information that looks reasonable but resulted from unconsciously biased collection efforts.



Here’s a crash course on nine common statistical errors that every executive should be familiar with.

1. Undefined goals

Failing to pin down the reason for collecting data means that you’ll miss the opportunity to articulate assumptions and to determine what to collect. The result is that you’ll likely collect the wrong data or incomplete data. A common trend in big data is for enterprises to gather heaps of information without any understanding of why they need it and how they want to use it. Gathering huge but messy volumes of data will only impede your future analytics, since you’ll have to wade through much more junk to find what you actually want.

2. Definition error

Let’s say you want to know how much your customers spent on your services last quarter. Seems like an easy task, right? Unfortunately, even a simple goal like this will require defining a number of assumptions before you can get the information that you want.

First, how are you defining “customer”? Depending on your goals, you might not want to lump everyone into one bucket. You may want to segment customers by their purchasing behaviors in order to adjust your marketing efforts or product features accordingly. If that’s the case, then you’ll need to be sure that you’re including useful information about the customer, such as demographic information or spending history.

There are also tactical considerations, such as how you define quarters. Will you use fiscal quarters or calendar quarters? Many organizations’ fiscal years do not correspond with calendar years. Fiscal years also differ internationally, with Australia’s fiscal year starting on July 1 and India’s fiscal year starting on April 1. You will also need to develop a strategy to account for returns or exchanges. What if a customer bought your product in one quarter but returned it in another? What if they filed a quality complaint against you and received a refund? Do you net these in the last quarter or this one?

As you can see, definitions are not so simple. You will need to discuss your expectations and set appropriate parameters in order to collect the information you actually want.
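Even the fiscal-versus-calendar-quarter decision above has to be made explicit before any revenue query will return consistent numbers. As a minimal sketch, the hypothetical helper below maps a purchase date to a fiscal quarter given the month the fiscal year starts (July 1 for Australia, April 1 for India, as noted above); the labeling convention of naming the fiscal year after the calendar year in which it ends is an assumption, not a universal rule.

```python
from datetime import date

def fiscal_quarter(d: date, fy_start_month: int = 7) -> str:
    """Map a date to a fiscal quarter, given the month the fiscal year starts.

    fy_start_month=7 matches Australia's July 1 fiscal year; use 4 for India,
    or 1 for plain calendar quarters.
    """
    # Months elapsed since the fiscal year began (0-11)
    offset = (d.month - fy_start_month) % 12
    quarter = offset // 3 + 1
    # Convention assumed here: label the fiscal year by the calendar
    # year in which it ends (so Aug 2018 falls in FY2019 under a July start)
    fy_end_year = d.year + (1 if fy_start_month != 1 and d.month >= fy_start_month else 0)
    return f"FY{fy_end_year} Q{quarter}"

print(fiscal_quarter(date(2018, 8, 15)))                    # FY2019 Q1 (July 1 start)
print(fiscal_quarter(date(2018, 8, 15), fy_start_month=1))  # FY2018 Q3 (calendar)
```

The same purchase lands in a different quarter depending on a single parameter, which is exactly why these definitions must be agreed on before, not after, the data is collected.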

3. Capture error

Once you’ve identified the type of data that you wish to collect, you’ll need to design a mechanism to capture it. Mistakes here can result in capturing incorrect or accidentally biased data. For example, if you want to test whether product A is more compelling than product B, but you always display product A first on your website, then users may not see or purchase product B as frequently, leading you to the wrong conclusion.
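One common remedy for the position bias described above is to randomize display order on each page view, so neither product systematically benefits from being shown first. The snippet below is an illustrative sketch (the `display_order` helper is hypothetical, not a real API) showing that over many views each product leads roughly half the time.

```python
import random

def display_order(products):
    """Return products in a random order for this page view, so position
    bias does not systematically favor one of them."""
    shuffled = list(products)
    random.shuffle(shuffled)
    return shuffled

# Simulate 10,000 page views: each product should appear first about
# half the time, so any difference in purchases reflects the products
# themselves rather than their position on the page.
first_counts = {"A": 0, "B": 0}
for _ in range(10_000):
    first_counts[display_order(["A", "B"])[0]] += 1
print(first_counts)
```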

4. Measurement error

Measurement errors occur when the software or hardware you use to capture data goes awry, either failing to capture usable data or producing spurious data. For example, you might lose information about user behavior on your mobile app if the user experiences connectivity issues and the usage logs are not synchronized with your servers. Similarly, if you are using hardware sensors like a microphone, your audio recordings may capture background noise or interference from other electrical signals.

5. Processing error

As you can see from our simple attempt to calculate customer sales earlier, many errors can occur even before you look at your data. Many enterprises own data that is decades old, where the original team capable of explaining their data decisions is long gone. Many of their assumptions and issues are likely not documented and will be up to you to deduce, which can be a daunting task.

You and your team may make assumptions that differ from the original ones made during data collection and achieve wildly different results. Common errors include missing a particular filter that researchers may have used on the data, using different accounting standards, and simply making methodological mistakes.

6. Coverage error

Coverage error describes what happens with survey data when there is insufficient opportunity for all targeted respondents to participate. For example, if you are collecting data on the elderly but only offer a website survey, then you’ll probably miss out on many respondents.

In the case of digital products, your marketing teams may be interested in projecting how all mobile smartphone users might behave with a prospective product. However, if you only offer an iOS app but not an Android app, the iOS user data will give you limited insight into how Android users may behave.

7. Sampling error

Sampling errors occur when you analyze data from a smaller sample that is not representative of your target population. This is unavoidable when data only exists for some groups within a population. The conclusions that you draw from the unrepresentative sample will probably not apply to the whole.

A classic example of a sampling error would be to ask only your friends or peers for opinions about your company’s products, then assume the user population will feel similarly.
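The friends-and-peers trap can be simulated directly. The sketch below uses invented numbers: a hypothetical population where 20% are enthusiastic "power" users and 80% are lukewarm "casual" users, and a convenience sample drawn only from the power-user segment. The sample mean overstates overall satisfaction by a wide margin.

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

# Hypothetical population: power users rate the product around 8/10,
# casual users around 5/10 (purely illustrative figures)
population = (
    [{"segment": "power", "rating": random.gauss(8, 1)} for _ in range(2_000)]
    + [{"segment": "casual", "rating": random.gauss(5, 1)} for _ in range(8_000)]
)

true_mean = sum(p["rating"] for p in population) / len(population)

# Convenience sample: asking only friends and peers, who happen to be
# drawn entirely from the enthusiastic power-user segment
friends = [p for p in population if p["segment"] == "power"][:200]
sample_mean = sum(p["rating"] for p in friends) / len(friends)

print(f"population mean: {true_mean:.1f}, convenience-sample mean: {sample_mean:.1f}")
```

The gap between the two means is the sampling error: conclusions drawn from the unrepresentative sample do not transfer to the whole population.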

8. Inference error

Statistical or machine learning models make inference errors when they make incorrect predictions from the available ground truth. False negatives and false positives are the two types of inference errors that can occur. False positives occur when you incorrectly predict that an item belongs in a category when it does not. False negatives occur when an item is in a category, but you predict that it is not.

Assuming you have a clean record of ground truth, calculating inference errors will help you assess the performance of your machine learning models. However, the reality is that many real-world datasets are noisy and may be mislabeled, which means you may not have clarity on the exact inference errors your AI system makes.
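Given clean ground-truth labels, counting the two error types is straightforward. Below is a minimal sketch with toy labels (1 meaning "the item belongs in the category"); the `inference_errors` helper is illustrative, not a standard library function.

```python
def inference_errors(y_true, y_pred):
    """Count false positives and false negatives against ground-truth labels.

    A false positive: predicted in the category (1) when it is not (0).
    A false negative: predicted not in the category (0) when it is (1).
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == 0 and t == 1)
    return fp, fn

# Toy ground truth and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]
fp, fn = inference_errors(y_true, y_pred)
print(f"false positives: {fp}, false negatives: {fn}")
```

Note that these counts are only as trustworthy as `y_true` itself: if the ground-truth labels are noisy or mislabeled, the error tallies inherit that noise.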

9. Unknown error

Reality can be elusive, and you cannot always establish ground truth with ease. In many cases, such as with digital products, you can capture tons of data about what a user did on your platform but not their motivation for those actions. You may know that a user clicked on an advertisement, but you don’t know how annoyed they were with it.

In addition to many known types of errors, there are unknowns about the universe that leave a gap between your representation of reality, in the form of data, and reality itself.

Executives without a data science or machine learning background often make these nine major errors, but many subtler issues can also undermine the performance of any AI technology you build that makes predictions from data.

Mariya Yao is the CTO of Metamaven, an applied AI firm building custom automation solutions for marketing and sales, and the coauthor of Applied Artificial Intelligence, a book for business leaders.

This story originally appeared on Copyright 2018

VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.