What it is and why it matters
Predictive analytics is the use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data. The goal is to go beyond knowing what has happened to providing a best assessment of what will happen in the future.
Predictive Analytics History & Current Advances
Though predictive analytics has been around for decades, it's a technology whose time has come. More and more organizations are turning to predictive analytics to increase their bottom line and competitive advantage. Why now?
- Growing volumes and types of data, and more interest in using data to produce valuable insights.
- Faster, cheaper computers.
- Easier-to-use software.
- Tougher economic conditions and a need for competitive differentiation.
With interactive and easy-to-use software becoming more prevalent, predictive analytics is no longer just the domain of mathematicians and statisticians. Business analysts and line-of-business experts are using these technologies as well.
Why is predictive analytics important?
Organizations are turning to predictive analytics to help solve difficult problems and uncover new opportunities. Common uses include:
Detecting fraud. Combining multiple analytics methods can improve pattern detection and prevent criminal behavior. As cybersecurity becomes a growing concern, high-performance behavioral analytics examines all actions on a network in real time to spot abnormalities that may indicate fraud, zero-day vulnerabilities and advanced persistent threats.
Optimizing marketing campaigns. Predictive analytics is used to determine customer responses or purchases, as well as promote cross-sell opportunities. Predictive models help businesses attract, retain and grow their most profitable customers.
Improving operations. Many companies use predictive models to forecast inventory and manage resources. Airlines use predictive analytics to set ticket prices. Hotels try to predict the number of guests for any given night to maximize occupancy and increase revenue. Predictive analytics enables organizations to function more efficiently.
Reducing risk. Credit scores are used to assess a buyer’s likelihood of default for purchases and are a well-known example of predictive analytics. A credit score is a number generated by a predictive model that incorporates all data relevant to a person’s creditworthiness. Other risk-related uses include insurance claims and collections.
Who's using it?
Any industry can use predictive analytics to reduce risks, optimize operations and increase revenue. Here are a few examples.
The financial industry, with huge amounts of data and money at stake, has long embraced predictive analytics to detect and reduce fraud, measure credit risk, maximize cross-sell/up-sell opportunities and retain valuable customers. Commonwealth Bank uses analytics to predict the likelihood of fraud activity for any given transaction before it is authorized – within 40 milliseconds of the transaction initiation.
Since the now-infamous study that showed men who buy diapers often buy beer at the same time, retailers everywhere are using predictive analytics for merchandise planning and price optimization, to analyze the effectiveness of promotional events and to determine which offers are most appropriate for consumers. Staples gained customer insight by analyzing behavior to build a complete picture of its customers, realizing a 137% ROI.
Whether it is predicting equipment failures and future resource needs, mitigating safety and reliability risks, or improving overall performance, the energy industry has embraced predictive analytics with vigor. Salt River Project is the second-largest public power utility in the US and one of Arizona's largest water suppliers. Analyses of machine sensor data predict when power-generating turbines need maintenance.
Governments have been key players in the advancement of computer technologies. The US Census Bureau has been analyzing data to understand population trends for decades. Governments now use predictive analytics like many other industries – to improve service and performance; detect and prevent fraud; and better understand consumer behavior. They also use predictive analytics to enhance cybersecurity.
In addition to detecting claims fraud, the health insurance industry is taking steps to identify patients most at risk of chronic disease and find what interventions are best. Express Scripts, a large pharmacy benefits company, uses analytics to identify those not adhering to prescribed treatments, resulting in savings of $1,500 to $9,000 per patient.
For manufacturers it's very important to identify factors leading to reduced quality and production failures, as well as to optimize parts, service resources and distribution. Lenovo is just one manufacturer that has used predictive analytics to better understand warranty claims – an initiative that led to a 10 to 15 percent reduction in warranty costs.
Putting the Magic in the Magic
Sports analytics is a hot area, thanks in part to Nate Silver and tournament predictions. The NBA’s Orlando Magic uses SAS predictive analytics to improve revenue and determine starting lineups. Business users across the Orlando Magic organization have instant access to information. The Magic can now visually explore the freshest data, right down to the game and seat.
How It Works
Predictive models use known results to develop (or train) a model that can be used to predict values for different or new data. Modeling provides results in the form of predictions that represent a probability of the target variable (for example, revenue) based on estimated significance from a set of input variables.
This is different from descriptive models that help you understand what happened, or diagnostic models that help you understand key relationships and determine why something happened. Entire books are devoted to analytical methods and techniques. Complete college curriculums delve deeply into this subject. But for starters, here are a few basics.
There are two types of predictive models. Classification models predict class membership. For instance, you try to classify whether someone is likely to leave, whether they will respond to a solicitation, whether they’re a good or bad credit risk, etc. Usually, the model results are in the form of 0 or 1, with 1 being the event you are targeting. Regression models predict a number – for example, how much revenue a customer will generate over the next year or the number of months before a component will fail on a machine.
Three of the most widely used predictive modeling techniques are decision trees, regression and neural networks.
Regression (linear and logistic) is one of the most popular methods in statistics. Regression analysis estimates relationships among variables. Linear regression, intended for continuous data that can be assumed to follow a normal distribution, finds key patterns in large data sets and is often used to determine how much specific factors, such as the price, influence the movement of an asset. With linear regression analysis, we want to predict a number, called the response or Y variable. With simple linear regression, one independent variable is used to explain and/or predict the outcome of Y. Multiple regression uses two or more independent variables to predict the outcome. With logistic regression, unknown values of a discrete variable are predicted based on known values of other variables. The response variable is categorical, meaning it can assume only a limited number of values. With binary logistic regression, a response variable has only two values such as 0 or 1. In multinomial logistic regression, a response variable can have several levels, such as low, medium and high, or 1, 2 and 3.
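As a rough illustration of the two regression flavors described above, here is a minimal Python sketch using only the standard library. The function names and the tiny data sets are hypothetical, made up for this example; real projects would normally use a statistics package instead.

```python
import math

def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares with one input; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_probability(intercept, slope, x):
    """Logistic function: maps a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def fit_binary_logistic_regression(xs, ys, learning_rate=0.1, steps=5000):
    """Batch gradient descent on the average log loss; returns (intercept, slope)."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        probs = [predict_probability(intercept, slope, x) for x in xs]
        grad_intercept = sum(p - y for p, y in zip(probs, ys)) / n
        grad_slope = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Hypothetical data: the response is a number (linear regression).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue = [3.0, 5.0, 7.0, 9.0, 11.0]
lin_intercept, lin_slope = fit_simple_linear_regression(spend, revenue)
predicted_revenue = lin_intercept + lin_slope * 6.0

# Hypothetical data: the response is 0 or 1 (binary logistic regression).
support_calls = [1, 2, 3, 4, 5, 6]
churned = [0, 0, 0, 1, 1, 1]
log_intercept, log_slope = fit_binary_logistic_regression(support_calls, churned)
low_risk = predict_probability(log_intercept, log_slope, 1)
high_risk = predict_probability(log_intercept, log_slope, 6)
```

The linear model returns a number directly, while the logistic model returns a probability that is usually compared with a cutoff such as 0.5 to get the 0 or 1 result.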
Decision trees are classification models that partition data into subsets based on categories of input variables. This helps you understand someone's path of decisions. A decision tree looks like a tree with each branch representing a choice between a number of alternatives, and each leaf representing a classification or decision. This model looks at the data and tries to find the one variable that splits the data into logical groups that are the most different. Decision trees are popular because they are easy to understand and interpret. They also handle missing values well and are useful for preliminary variable selection. So, if you have a lot of missing values or want a quick and easily interpretable answer, you can start with a tree.
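The search for "the one variable that splits the data into logical groups that are the most different" can be sketched as a single split step. This is a simplified illustration, assuming Gini impurity as the measure of how mixed a group is (one common choice, not necessarily the one any given product uses); the customer data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Try every variable and threshold; return (variable_index, threshold, impurity)."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty groups
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best

# Hypothetical customers: [tenure in months, support calls]; label 1 = left.
customers = [[2, 1], [4, 3], [6, 0], [18, 2], [24, 1], [36, 3]]
left_company = [1, 1, 1, 0, 0, 0]
variable, threshold, impurity = best_split(customers, left_company)
print(variable, threshold, impurity)
```

A full tree repeats this step on each resulting group until the groups are pure enough or too small to split further.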
Neural networks are sophisticated techniques capable of modeling extremely complex relationships. They’re popular because they’re powerful and flexible. The power comes in their ability to handle nonlinear relationships in data, which is increasingly common as we collect more data. They are often used to confirm findings from simple techniques like regression and decision trees. Neural networks are based on pattern recognition and some AI processes that graphically “model” parameters. They work well when no mathematical formula is known that relates inputs to outputs, prediction is more important than explanation or there is a lot of training data. Artificial neural networks were originally developed by researchers who were trying to mimic the neurophysiology of the human brain.
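To make the point about nonlinear relationships concrete, here is a tiny network with one hidden layer that reproduces the "exclusive or" pattern, which no single straight-line model can represent. The weights are hand-picked for illustration rather than learned from data, and the function names are hypothetical.

```python
def relu(value):
    """Rectified linear unit: a simple nonlinear activation."""
    return max(0.0, value)

def forward(x1, x2):
    """Two inputs -> two hidden units -> one output, with hand-set weights."""
    hidden_1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    hidden_2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    return 1.0 * hidden_1 - 2.0 * hidden_2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, forward(a, b))
```

In practice the weights are found by training on known results, and the network usually has many more units than this.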
Other Popular Techniques You May Hear About
Bayesian analysis. Bayesian methods treat parameters as random variables and define probability as "degrees of belief" (that is, the probability of an event is the degree to which you believe the event is true). When performing a Bayesian analysis, you begin with a prior belief regarding the probability distribution of an unknown parameter. After learning information from data you have, you change or update your belief about the unknown parameter.
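A minimal sketch of that update, assuming the unknown parameter is a response rate, the prior belief is a Beta distribution and the data are counts of responses and non-responses (a standard textbook pairing). The numbers are made up.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Beta prior + binomial data -> Beta posterior (conjugate update)."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Prior belief: response rate is probably near 50% (Beta(2, 2)).
prior = (2, 2)
# Hypothetical data: 8 customers responded, 2 did not.
posterior = update_beta(prior[0], prior[1], successes=8, failures=2)

print(beta_mean(*prior))      # 0.5
print(beta_mean(*posterior))  # about 0.71
```

The belief moves toward the observed rate of 80% but does not reach it, because the prior still carries some weight.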
Ensemble models. Ensemble models are produced by training several similar models and combining their results to improve accuracy, reduce bias, reduce variance and identify the best model to use with new data.
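One simple way to combine results is a majority vote. The sketch below assumes three hypothetical rule-based "models" standing in for trained ones; the rules and field names are invented for illustration.

```python
from collections import Counter

def short_tenure_model(customer):
    return 1 if customer["tenure_months"] < 6 else 0

def many_calls_model(customer):
    return 1 if customer["support_calls"] >= 3 else 0

def low_usage_model(customer):
    return 1 if customer["logins_per_month"] < 2 else 0

def ensemble_predict(models, customer):
    """Each model votes 0 or 1; the most common vote wins."""
    votes = [model(customer) for model in models]
    return Counter(votes).most_common(1)[0][0]

models = [short_tenure_model, many_calls_model, low_usage_model]
customer = {"tenure_months": 3, "support_calls": 1, "logins_per_month": 1}
prediction = ensemble_predict(models, customer)  # votes are [1, 0, 1]
print(prediction)
```

For models that predict numbers, the same idea applies with an average in place of a vote.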
Gradient boosting. This is a boosting approach that builds a sequence of simple models, typically small decision trees, each fit to the errors left by the models before it (often on a resampled version of your data set), and combines them into a weighted sum. Like decision trees, boosting makes no assumptions about the distribution of the data. Boosting is less prone to overfitting the data than a single decision tree, and if a decision tree fits the data fairly well, then boosting often improves the fit. (Overfitting data means you are using too many variables and the model is too complex. Underfitting means the opposite – not enough variables and the model is too simple. Both reduce prediction accuracy.)
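The following is a stripped-down sketch of boosting for predicting a number, assuming squared error as the loss and one-split trees ("stumps") as the simple models. It omits resampling and the other refinements found in real implementations; all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def predict(model, x):
    """Start from the overall mean, then add each stump's scaled correction."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total

def fit_boosted_stumps(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current model still gets wrong."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        current = (base, learning_rate, stumps)
        residuals = [y - predict(current, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, learning_rate, stumps

def mean_squared_error(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data with three rough levels.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
baseline = (sum(ys) / len(ys), 0.5, [])  # predicts the mean everywhere
boosted = fit_boosted_stumps(xs, ys)
baseline_mse = mean_squared_error(baseline, xs, ys)
boosted_mse = mean_squared_error(boosted, xs, ys)
print(baseline_mse, boosted_mse)
```

The error here is measured on the training data only; judging overfitting requires checking the model against data it has not seen.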
Incremental response (also called net lift or uplift models). These model the change in probability caused by an action. They are widely used to reduce churn and to discover the effects of different marketing programs.
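In its simplest form, the change in probability caused by an action can be estimated by comparing a group that received the action with a similar group that did not. This sketch assumes randomly assigned groups and uses invented outcomes; real uplift models estimate this difference per customer or segment.

```python
def response_rate(outcomes):
    """Share of 1s in a list of 0/1 outcomes."""
    return sum(outcomes) / len(outcomes)

def uplift(treated_outcomes, control_outcomes):
    """Estimated change in response probability caused by the action."""
    return response_rate(treated_outcomes) - response_rate(control_outcomes)

# Hypothetical campaign: 1 = customer renewed, 0 = customer did not.
received_offer = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # 7 of 10 renewed
no_offer = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]        # 3 of 10 renewed
estimated_uplift = uplift(received_offer, no_offer)
print(round(estimated_uplift, 2))
```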
K-nearest neighbor (KNN). This is a nonparametric method for classification and regression that predicts an object’s value or class membership based on the k closest training examples.
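A minimal classification sketch, assuming straight-line (Euclidean) distance and a simple majority vote among the neighbors. The points, labels and function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query with the most common label among its k closest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical customers described by two scaled measurements.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["stay", "stay", "stay", "leave", "leave", "leave"]

print(knn_predict(points, labels, (2, 2)))  # stay
print(knn_predict(points, labels, (6, 5)))  # leave
```

Because the method relies on distances, inputs measured on very different scales are usually rescaled first.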
Memory-based reasoning. Memory-based reasoning is a k-nearest neighbor technique for categorizing or predicting observations.
Partial least squares. This flexible statistical technique can be applied to data of any shape. It models relationships between inputs and outputs even when the inputs are correlated and noisy, there are multiple outputs or there are more inputs than observations. The method of partial least squares looks for factors that explain both response and predictor variations.
Principal component analysis. The purpose of principal component analysis is to derive a small number of independent linear combinations (principal components) of a set of variables that retain as much of the information in the original variables as possible.
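As an illustration, the first principal component can be approximated by repeatedly multiplying a starting vector by the covariance matrix (power iteration). This sketch assumes the data are not constant and that the starting vector is not perpendicular to the answer; libraries use more robust methods. Names and data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return the unit-length direction along which the data vary the most."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    covariance = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    vector = [1.0 / math.sqrt(p)] * p  # starting guess
    for _ in range(iterations):
        product = [sum(covariance[a][b] * vector[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector

# Hypothetical data in which the second variable is twice the first.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component = first_principal_component(data)
print(component)
```

Here a single component captures all the variation, so the two original variables could be replaced by one combined score without losing information.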
Support vector machine. This supervised machine learning technique analyzes data and recognizes patterns by finding the boundary that separates classes with the widest possible margin. It can be used for both classification and regression.
Time series data mining. Time series data is time-stamped and collected over time at a particular interval (sales in a month, calls per day, web visits per hour, etc.). Time series data mining combines traditional data mining and forecasting techniques. Data mining techniques such as sampling, clustering and decision trees are applied to data collected over time with the goal of improving predictions.
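For the forecasting half of that combination, one of the simplest techniques is simple exponential smoothing, shown here as an example rather than as the specific method the text has in mind. The smoothing weight alpha and the monthly sales figures are assumptions chosen for illustration.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast: recent values count more than older ones."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical monthly sales.
monthly_sales = [100, 110, 120, 130]
forecast = exponential_smoothing_forecast(monthly_sales)
print(forecast)  # 121.25
```

The forecast lags behind the upward trend, which is why series with a trend or seasonality usually call for richer methods.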
What do you need to get started using predictive analytics?
The first thing you need to get started using predictive analytics is a problem to solve. What do you want to know about the future based on the past? What do you want to understand and predict? You’ll also want to consider what will be done with the predictions. What decisions will be driven by the insights? What actions will be taken?
Second, you’ll need data. In today’s world, that means data from a lot of places: transactional systems, data collected by sensors, third-party information, call center notes, web logs, etc. You’ll need a data wrangler, or someone with data management experience, to help you cleanse and prep the data for analysis. Preparing the data for a predictive modeling exercise also requires someone who understands both the data and the business problem. How you define your target is essential to how you can interpret the outcome. (Data preparation is considered one of the most time-consuming aspects of the analysis process. So be prepared for that.)
After that, the predictive model building begins. Increasingly easy-to-use software means more people can build analytical models. But you’ll still likely need some sort of data analyst who can help you refine your models and come up with the best performer. And then you might need someone in IT who can help deploy your models. That means putting the models to work on your chosen data – and that’s where you get your results.
Predictive modeling requires a team approach. You need people who understand the business problem to be solved. Someone who knows how to prepare data for analysis. Someone who can build and refine the models. Someone in IT to ensure that you have the right analytics infrastructure for model building and deployment. And an executive sponsor can help make your analytic hopes a reality.