This is a guest post by Michael G. Solomon, PhD, CISSP, PMP, CISM.
Predictive analytics is a general approach to analyzing historical data and attempting to predict what will happen next. Many organizations use this type of analysis to predict future sales, expected costs, equipment failures or any number of other potential events.
Testing data analytics models is different from testing most other types of software. When you test a model, you're evaluating not only the algorithm itself but also how well it performs on the data it's given. Any model's accuracy depends on both the model and the data that feeds it. A good model performs well on different sets of data without favoring one set over another. Insufficient testing, however, can leave unresolved bias and variance in a model that reduce its overall value.
Let’s look at how to test predictive analytics models and what pitfalls to avoid.
Understanding Analytics Models
The volume and variety of data available to organizations of all sizes is growing faster than the ability to fully interpret it all. In short, there is just too much data to process. Today’s organizations realize how valuable their data is, but they often need help making sense out of it.
For example, what product should a Walmart store stock up on when a hurricane approaches? The surprising answer is Strawberry Pop-Tarts. Walmart used a data analytics model to review how an approaching hurricane affects the sales of different products and found that those toaster pastries were flying off the shelves. Walmart learned it could use the output of such models to prepare for customer demand and increase profits.
Predictive analytics models come in many flavors. Some of the more common types of models are linear regression, logistic regression, time series, decision trees, neural networks, support vector machines and naïve Bayes classifiers. Each model works best for specific types of data and desired output.
One of the most common predictive model types is linear regression. In simple terms, linear regression fits a line to a data set, making it easy to predict an output value, y, based on one or more input values, x1 through xn. It doesn't map well to all types of data, but in the right circumstances, it can predict an accurate outcome from multiple data samples.
For example, suppose a data analyst examines monthly sales and advertising dollars spent. Without going into any details of how we build the model, let’s say the data analyst finds that the linear regression equation y = 125.2 + 43.4x best represents the relationship between advertising dollars spent, x, and the resulting sales, y. The regression equation plots a line that most accurately matches the observed data.
Once you have an equation, you can use it to make predictions. What monthly sales would you expect to see if you invested $10,000 in advertising? To answer the question, simply plug the advertising investment into the equation: y = 125.2 + 43.4 * 10,000. In this case, your expected sales would be $434,125.20.
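If you want to experiment with this yourself, here's a minimal sketch in Python using scikit-learn. The data points are made up to fall exactly along the article's example line, so the fitted intercept and coefficient come out at 125.2 and 43.4; any real data set would produce different numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical observations: advertising spend (x) and monthly
# sales (y), constructed to fall on the article's example line.
ad_spend = np.array([[1000], [2500], [5000], [7500], [9000]])
sales = 125.2 + 43.4 * ad_spend.ravel()

model = LinearRegression().fit(ad_spend, sales)
print(f"intercept: {model.intercept_:.1f}, coefficient: {model.coef_[0]:.1f}")

# Predict expected sales for a $10,000 advertising investment.
prediction = model.predict([[10_000]])
print(f"predicted sales: ${prediction[0]:,.2f}")  # about $434,125.20
```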
So, does that mean $20,000 invested in advertising would result in $868,125.20 in sales? Maybe, or maybe not. All analytics models have limitations. Your model may not have been built on advertising investments as high as $20,000. Part of the model-building process is testing how well your model works.
Testing analytics models goes far beyond assessing the algorithm's accuracy; that thoroughness is required to determine how well the model meets its original objectives. Model testing must cover both technical accuracy and alignment with business goals.
Balancing Bias and Variance
Testing predictive analytics models is more than just running iterations over data. It must include examining how well — or poorly — the model performs when provided with different input datasets.
The idea behind predictive data analytics models is to use them on current data to predict future outcomes. That means you want your model to work on today’s data as well as next week’s and next month’s data. But because a model that works extremely well on today’s data may not do so well at other times, it is up to testing to reveal how much of a risk this will be.
Two types of model error that you can control are bias and variance. These two types of errors can work against one another and require a delicate balance to result in a stable model.
Bias errors
Bias errors are introduced when a model relies too heavily on one or more features, the inputs that a model analyzes to predict an output value. In algebra, you learned to call these variables.
In our linear regression model above, x is the input feature and y is the output. In real-life analytics, most models use several different features, or variables, as input. Instead of just looking at advertising investment, a more realistic model may consider month of the year, associated consumer confidence indexes and perhaps even geographic location. Most real-life models have many x values to consider to predict an output value, y.
The example regression equation shown above has very high bias. The whole prediction relies on a single input value, which limits its accuracy. A better prediction would consider several different key features. The trick is to determine which features are important and which aren't.
That's where testing plays a crucial part in model accuracy and usability. In general, high-bias models, such as linear regression and logistic regression, are easy to train but are very sensitive to feature selection. On the other hand, low-bias models, such as decision trees and support vector machines, require more data and time to train but are less sensitive to feature selection.
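To make bias concrete, here's a hedged sketch on synthetic data. The quadratic relationship and noise level are invented for the example; the point is that a high-bias model such as linear regression underfits a curved relationship that a lower-bias model such as a decision tree can follow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() ** 2 + rng.normal(0, 5, size=200)  # curved, noisy relationship

linear = LinearRegression().fit(X, y)                # high bias: forced to fit a straight line
tree = DecisionTreeRegressor(max_depth=5).fit(X, y)  # lower bias: can follow the curve

print("linear regression MSE:", mean_squared_error(y, linear.predict(X)))
print("decision tree MSE:    ", mean_squared_error(y, tree.predict(X)))
```

On data like this, the tree's error should come out noticeably lower, because the straight line simply can't bend to fit the curve.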
Variance errors
Variance errors are those encountered when a model is run with different data sets. A perfect model would perform consistently on any data set, but no model is 100% perfect. The goal is to reduce variance as much as possible so the model can focus on exposing hidden knowledge in data sets.
The problem with considering variance in conjunction with bias is that the two types of errors tend to be inversely related. That means that in many cases, low-bias models exhibit high variance, and vice versa. The key is to balance these two types of errors as much as possible through comprehensive testing.
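One way to see this trade-off, sketched below under the same made-up data assumptions as the previous example, is to retrain each model on many random subsets and measure how much its prediction at a single fixed point jumps around. The deep decision tree (low bias) should show a wider prediction spread, that is, higher variance, than the linear model (high bias).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 1))
y = 3.0 * X.ravel() ** 2 + rng.normal(0, 5, size=500)

point = [[5.0]]  # fixed input where we compare predictions
linear_preds, tree_preds = [], []
for _ in range(50):
    idx = rng.choice(len(X), size=100, replace=False)  # a different random subset each run
    linear_preds.append(LinearRegression().fit(X[idx], y[idx]).predict(point)[0])
    tree_preds.append(DecisionTreeRegressor(max_depth=10).fit(X[idx], y[idx]).predict(point)[0])

print("linear prediction spread (std):", np.std(linear_preds))
print("tree prediction spread (std):  ", np.std(tree_preds))
```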
Testing Analytics Models
Testing analytics models is a critical component of overall model development. Even a well-designed model may perform poorly and miss its design goals if it is not thoroughly trained and tested on data that aligns with operational data.
It isn't good enough to just run a model multiple times. Predictive analytics models must first be trained and then executed over operational data. The training step, which is really the process of building the model, runs the model-fitting algorithm over one data set. Remember our example regression equation?
The constant, 125.2, and the x coefficient, 43.4, came from the model building, or training, step.
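Here's a small sketch of what that training step produces, using numpy's least-squares fit on made-up observations scattered around the example line. The recovered intercept and slope should land near 125.2 and 43.4.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1_000, 10_000, size=50)               # advertising spend
y = 125.2 + 43.4 * x + rng.normal(0, 5_000, size=50)  # noisy observed sales

slope, intercept = np.polyfit(x, y, deg=1)            # "training" = fitting the line
print(f"trained model: y = {intercept:.1f} + {slope:.1f}x")
```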
To build and test a model, analysts separate an input data set into two or more segments. One segment is used to train the model, and then the model is run using the non-training segments.
Another common technique is to separate a data set into at least three sections. The third segment is the validation data set. That gives analysts the ability to train a model, run it on operational data and then run it using the validation data set. You can compare the results of the operational and validation data sets to determine how consistently the model performs.
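Here's a minimal sketch of that three-way split using scikit-learn's train_test_split on placeholder data (the feature matrix and target are synthetic stand-ins). Two successive splits carve the data into roughly 60% training, 20% operational test and 20% validation segments.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(500, 1))                       # placeholder features
y = 125.2 + 43.4 * X.ravel() + rng.normal(0, 20, size=500)  # placeholder target

# First split off 40%, then halve it: ~60% train, ~20% test, ~20% validation.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

model = LinearRegression().fit(X_train, y_train)     # train on one segment
print("test R^2:      ", model.score(X_test, y_test))
print("validation R^2:", model.score(X_val, y_val))  # compare for consistency
```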
This process of using multiple data sets can be extended into a testing technique called cross-validation, which helps validate a model's results. Cross-validation segments a data set into some number of subsets, or folds. The model test then runs multiple times: each run holds out a different fold, trains the model on the remaining folds and evaluates it on the held-out fold. By comparing the output of each run, it is easy to determine whether your model is consistent or unreliable.
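In Python, scikit-learn's cross_val_score wraps that whole loop. The sketch below runs 5-fold cross-validation on synthetic stand-in data; a low spread across the per-fold scores suggests a consistent model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(500, 1))                       # synthetic stand-in data
y = 125.2 + 43.4 * X.ravel() + rng.normal(0, 20, size=500)

scores = cross_val_score(LinearRegression(), X, y, cv=5)    # 5-fold cross-validation
print("per-fold R^2 scores:", scores)
print("mean:", scores.mean(), "std:", scores.std())         # a low std suggests consistency
```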
All analytics model tests are dependent on the quality of your data. If you have a good starting model, substantive data and a rigorous testing regimen, you can build models that help predict future events. And that can help your organization meet its strategic goals.
Article written by Michael Solomon, PhD, CISSP, PMP, CISM, Professor of Information Systems Security and Information Technology at University of the Cumberlands.