9 pitfalls to avoid in building a successful machine learning program

During my past two decades working in the IT field, I've seen artificial intelligence technologies move from conceptual to practical -- with machine learning techniques at the forefront, becoming more accessible, even for teams without specialized expertise.

With increased use of predictive modeling across a wide variety of teams, it's critical for leaders and managers to be aware of common issues that can distort the results of their teams' work. Here are nine common pitfalls to avoid, and best practices to follow, for a reliable machine learning process.

Pitfall 1: Sampling bias

The starting point of any machine learning program is to select the training data. Typically, organizations have some data available or can identify relevant external suppliers, such as government entities or industry associations. This is where the problem starts.

Training teams and their business sponsors must define which dataset to use. When choosing the data sample, it is easy to introduce bias by selecting a dataset that misrepresents or underrepresents the real cases, which will distort the results. For example, a survey that interviews only people who walk by a certain location will overrepresent healthy individuals.

Solution: To avoid sampling bias, teams must ensure they truly select their data at random and do not just use a particular slice simply because it's the easiest to access. Clear definitions of the ideal dataset and the logic of the model are essential to guide effective data selection. By working with business owners and having several reviewers validate selection criteria at this early stage, machine learning teams can ensure their data sampling approach is solid.
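The difference between a convenient slice and a random draw can be sketched in a few lines of Python. This is only an illustration: the population of ages, the sample size, and the function names are hypothetical, and the fixed seed is there just to make the run repeatable.

```python
import random
from statistics import mean

# Hypothetical population: 600 people aged 20..79, stored sorted by age,
# so the records that are "easiest to access" are also the youngest.
population = sorted(age for age in range(20, 80) for _ in range(10))

def convenience_sample(records, n):
    """Take the first n records: easy, but biased when the data is ordered."""
    return records[:n]

def random_sample(records, n, seed=0):
    """Draw n records uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(records, n)

easy = convenience_sample(population, 100)
fair = random_sample(population, 100)

print(f"population mean age:  {mean(population):.1f}")
print(f"convenience mean age: {mean(easy):.1f}")
print(f"random mean age:      {mean(fair):.1f}")
```

The convenience slice contains only the youngest records, so its mean age sits far below the population mean, while the random draw should land close to it.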

Pitfall 2: Irrelevant feature selection

In many situations, trainers encounter difficulties due to the nuances of variable selection. Many techniques need a large number of features to fuel the learning process. But in the effort to gather enough training data, it can be challenging to ensure you include only the right and relevant features.

Solution: Building a well-performing model requires careful exploration and analysis to ensure you select and engineer the appropriate features. Understanding the domain and involving subject matter experts are the two most important drivers for selecting the right features. In addition, techniques such as recursive feature elimination (RFE), random forest feature importance, principal component analysis (PCA), and autoencoders help focus the training effort on a smaller number of more effective features.

Pitfall 3: Data leakage

Machine learning teams may accidentally gather data for training using criteria that are part of the outcome the team is trying to predict. As a result, models will show performance that is too good to be true. For example, a team might mistakenly include a variable that indicates treatment for a certain illness in a model designed to predict that illness.

Solution: The training team must thoughtfully construct their datasets using only data that in practice will be available at the time of prediction, before the outcome the model estimates is known.
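One way to screen for leakage is to compare when each field becomes known with the moment the model would be asked for a prediction. The sketch below assumes a hypothetical catalog that records the date each field is first captured; the field names, dates, and the `usable_features` helper are all illustrative.

```python
from datetime import date

# Hypothetical catalog: the date each field is first recorded for a patient.
feature_recorded_on = {
    "age": date(2023, 1, 5),
    "blood_pressure": date(2023, 1, 5),
    "treatment_prescribed": date(2023, 3, 20),  # only known after diagnosis
}

def usable_features(recorded_on, prediction_date):
    """Keep only fields whose values already exist when the prediction is made."""
    return sorted(
        name
        for name, recorded in recorded_on.items()
        if recorded <= prediction_date
    )

prediction_date = date(2023, 2, 1)
print(usable_features(feature_recorded_on, prediction_date))
# → ['age', 'blood_pressure']
```

The treatment field is dropped because it is recorded only after the illness has been diagnosed, which is exactly the kind of variable that makes results look too good to be true.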

Pitfall 4: Missing data

In some cases, datasets will be incomplete due to some records missing values. By incorrectly adjusting for that condition or assuming there are no missing values, trainers may introduce significant bias into the results. For example, missing data may not always be random, such as when survey respondents are less likely to answer a particular question. As a result, mean imputation may mislead the model.

Solution: If you cannot design the training program to ensure use of complete datasets, you can apply statistical techniques, such as discarding records with missing values or using a proper imputation strategy to estimate values for the missing data.
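Both options can be sketched with the standard library. The example below is a minimal illustration, not a recommendation for any particular dataset: it uses median imputation (an assumption, chosen because the median is less sensitive to extreme values than the mean) and keeps a flag for each record so the fact that a value was absent is not lost. The income figures and function names are hypothetical.

```python
from statistics import median

def drop_missing(values):
    """Option 1: discard records with no value."""
    return [v for v in values if v is not None]

def impute_with_median(values):
    """Option 2: replace None with the median of the observed values.

    Also returns a was_missing flag per record, which matters when
    values are not missing at random.
    """
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing

# Hypothetical survey column with two unanswered records.
incomes = [42_000, None, 58_000, 61_000, None, 250_000]
imputed, was_missing = impute_with_median(incomes)
print(imputed)
print(was_missing)
```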

Pitfall 5: Inaccurate scaling and normalization

Constructing a dataset for machine learning work often requires the team to gather diverse types of inputs that can have different scales of measurement. When you fail to bring the variables onto a common scale before training a model, algorithms such as linear regression, support vector machines (SVM), or k-nearest neighbors (KNN) can suffer greatly. The problems occur because features with large ranges carry large variance and may therefore become unduly important. For example, salary data may get more weight than age if you use both as unprocessed inputs.

Solution: You must be careful to normalize the dataset prior to beginning model training. You can transform the dataset through common statistical techniques such as standardization or feature scaling, depending on the type of data and your team's preferred algorithm.
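As a rough sketch, the two transformations can be written as follows. Here "feature scaling" is taken to mean min-max scaling to the range 0 to 1, which is an assumption about the intended technique; the salary and age values are made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical inputs on very different scales.
salaries = [40_000, 55_000, 70_000, 120_000]
ages = [25, 32, 41, 58]

print([round(z, 2) for z in standardize(salaries)])
print([round(z, 2) for z in standardize(ages)])
print([round(s, 2) for s in min_max_scale(ages)])
```

In practice the mean, standard deviation, minimum, and maximum should be computed on the training data only and then reused for validation and production inputs.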

Pitfall 6: Neglecting outliers

Forgetting about outliers can have a major impact on a model's performance. For instance, algorithms such as AdaBoost treat outliers as hard cases and put disproportionate weight on fitting them, while decision trees are more forgiving. Furthermore, different use cases require different outlier handling. For example, in fraud detection, a focus on outliers in deposits should be a requirement, but in temperature sensor inputs, outliers might be ignored.

Solution: To address this issue, your team should either use a modeling algorithm that handles outliers properly or filter the outliers out before training. A good starting point is for your team to do an initial check to identify whether there are outliers in the data. The simplest approaches are reviewing plots of the data or examining any values that are several standard deviations or more away from the mean.
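A first-pass check of this kind can be sketched as a z-score filter. The threshold of three standard deviations is a common convention rather than a rule, and the temperature readings are invented for illustration.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical temperature readings with one faulty spike.
readings = [20, 21, 19, 20, 22, 20, 21, 19, 20, 21,
            20, 22, 19, 20, 21, 20, 19, 21, 20, 95]

print(flag_outliers(readings))  # → [95]
```

Note that the outlier itself inflates the mean and standard deviation, so on very small samples an extreme value may fail to cross the threshold; a plot remains a useful complement.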

Pitfall 7: Miscalculated features

When a team engineers inputs for model training, any errors in the derivation process can feed the model misleading inputs. As a result, the model behaves unexpectedly and produces unreliable results, no matter how well the team performs the training. One example is a credit score prediction model that relies on calculated utilization and is weakened because the team included inactive tradelines from credit reports in the calculation.

Solution: Trainers must examine exactly how the team acquired the data. A critical starting point is to understand which features are in raw format and which have been engineered. From there, trainers would be well-served to check the assumptions and calculations of the derived features prior to conducting the training.
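The utilization example can be made concrete with a small calculation. The sketch assumes utilization is total balance divided by total credit limit, and that each tradeline carries a hypothetical `active` flag; the record layout and numbers are illustrative, not a real credit report format.

```python
# Hypothetical tradelines for one consumer.
tradelines = [
    {"balance": 500, "limit": 1_000, "active": True},
    {"balance": 1_500, "limit": 4_000, "active": True},
    {"balance": 0, "limit": 5_000, "active": False},  # closed account
]

def utilization_all(lines):
    """Faulty derivation: counts the limit of inactive tradelines too."""
    total_limit = sum(t["limit"] for t in lines)
    return sum(t["balance"] for t in lines) / total_limit if total_limit else 0.0

def utilization_active(lines):
    """Intended derivation: only active tradelines contribute."""
    active = [t for t in lines if t["active"]]
    return utilization_all(active)

print(utilization_all(tradelines))     # 2000 / 10000
print(utilization_active(tradelines))  # 2000 / 5000
```

Including the closed account halves the apparent utilization, so the model would see this consumer as far less leveraged than they are.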

Pitfall 8: Ignoring multi-collinear inputs

Using a dataset without considering multi-collinear predictors is another way of misleading model training (the presence of multi-collinear inputs means there is a high correlation between two or more variables). Multi-collinearity makes it difficult to identify the impact of any one variable. In this situation, small changes in the selected features can have significant impacts on the outcomes. An illustration of this problem is when ad budget and traffic, used together as predictor variables, are strongly correlated with each other.

Solution: An easy way to detect multi-collinearity is to calculate correlation coefficients for all pairs of variables. You then have a number of options for resolving any identified collinearity, such as combining the correlated variables into a composite feature or dropping the redundant variables.
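The pairwise check might look like the sketch below, which computes Pearson correlation for every pair of columns and reports pairs above a cutoff. The 0.9 cutoff, the column names, and the data are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(vx * vy)

def collinear_pairs(columns, threshold=0.9):
    """List (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical predictors: traffic closely tracks ad budget.
columns = {
    "ad_budget": [10, 20, 30, 40, 50],
    "traffic": [110, 190, 310, 405, 490],
    "day_of_week": [3, 1, 4, 1, 5],
}

print(collinear_pairs(columns))
```

Pairwise correlation only catches relationships between two variables at a time; a variable that is a combination of several others can slip through this check.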

Pitfall 9: Ineffective performance KPIs

Most modeling algorithms perform best when the training data has a balanced representation of the various classes. When there is a significant imbalance in the data, choosing the right metrics for measuring model performance becomes critical. For example, with an average default rate of 1.2 percent, a model that predicts no default in every case would still score 98.8 percent accuracy while never identifying a single default.

Solution: Unless there are options to generate a more balanced training set or use a cost-based learning algorithm, picking business-driven performance metrics is the best solution. There are various measures for performance of a model beyond accuracy, such as precision, recall, F1 score, and receiver operating characteristic (ROC) curve. Choosing the most appropriate metric will guide the training algorithm to minimize error.
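The default-rate example can be reproduced directly. The sketch below builds a hypothetical portfolio of 1,000 loans with 12 defaults (1.2 percent) and scores a model that always predicts "no default"; the helper name and the convention of reporting 0.0 when a ratio is undefined are choices made here for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical portfolio: 12 defaults among 1,000 loans.
y_true = [1] * 12 + [0] * 988
always_no_default = [0] * 1000

print(classification_metrics(y_true, always_no_default))
```

Accuracy comes out at 0.988 while recall and F1 are both zero, which is why accuracy alone is a poor guide on imbalanced data.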

Start with a solid foundation

Machine learning training programs are easier to execute than ever, thanks to advances in technology and tools. However, generating reliable results requires a firm understanding of data science and statistical principles to ensure teams start with a solid underlying dataset that forms the foundation for success.

Pejman Makhfi is chief technology officer of Credit Sesame, an educational credit and personal finance website that provides consumers with a free credit score.