The term machine learning is often used interchangeably with artificial intelligence, but this is incorrect: machine learning is a subfield of AI. Machine learning is also sometimes confused with predictive analytics, or predictive modelling. Machine learning can be used for predictive modelling, but it is just one type of predictive analytics, and its uses are wider than predictive modelling.
Coined by American computer scientist Arthur Samuel in 1959, the term machine learning is defined as a “computer’s ability to learn without being explicitly programmed.”
At its most basic, machine learning uses programmed algorithms that receive and analyze input data to predict output values within an acceptable range. As new data is fed to these algorithms, they learn and optimize their operations to improve performance, developing intelligence over time.
There are four types of machine learning algorithms: supervised, semi-supervised, unsupervised and reinforcement.
In supervised learning, the machine is taught by example. The operator provides the machine learning algorithm with a known dataset that includes inputs and their desired outputs, and the algorithm must find a method for arriving at those outputs from the inputs. While the operator knows the correct answers to the problem, the algorithm identifies patterns in data, learns from observations and makes predictions. Those predictions are corrected by the operator, and this process continues until the algorithm achieves a high level of accuracy/performance.
Under the umbrella of supervised learning fall: classification, regression and forecasting.
- Classification: In classification tasks, the machine learning program must draw a conclusion from observed values and determine to what category new observations belong. For example, when filtering emails as spam or not spam, the program looks at existing observational data and filters the emails accordingly.
- Regression: In regression tasks, the machine learning program must estimate – and understand – the relationships among variables. Regression analysis focuses on one dependent variable and a series of other changing variables – making it particularly useful for prediction and forecasting.
- Forecasting: Forecasting is the process of making predictions about the future based on past and present data, and is commonly used to analyze trends.
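The predict-and-correct loop described above can be sketched with a perceptron, one of the simplest supervised classifiers. This is only an illustration: the perceptron is our choice of technique, and the two features (a count of suspicious words and a count of links per email) and the tiny dataset are hypothetical.

```python
def train_perceptron(examples, epochs=20, lr=1.0):
    """examples: list of (features, label) pairs, label is 0 or 1."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in examples:
            activation = bias + sum(w * x for w, x in zip(weights, features))
            prediction = 1 if activation > 0 else 0
            error = label - prediction  # the "correction" from the known answer
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
        if mistakes == 0:  # every training example is now predicted correctly
            break
    return weights, bias


def perceptron_predict(model, features):
    weights, bias = model
    return 1 if bias + sum(w * x for w, x in zip(weights, features)) > 0 else 0


# Hypothetical labelled emails: (suspicious_words, links) -> 1 = spam, 0 = not spam
emails = [((3, 2), 1), ((0, 0), 0), ((4, 1), 1), ((1, 0), 0), ((5, 3), 1), ((0, 1), 0)]
spam_model = train_perceptron(emails)
```

A perceptron only converges when the two classes can be separated by a straight line, which holds for this toy data but not for real email.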
Supervised Learning vs Unsupervised Learning
Senior Data Scientist Brett Wujek gives a clear explanation of these two popular types of machine learning, and when to use each.
"The difference really boils down to whether the observations in your data set represent things you know about or things you're trying to learn relationships about," explains Wujek.
Semi-supervised learning is similar to supervised learning but instead uses both labelled and unlabelled data. Labelled data is essentially information that has meaningful tags so that the algorithm can understand the data, while unlabelled data lacks that information. By combining these techniques, machine learning algorithms can learn to label unlabelled data.
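One simple way to combine the two kinds of data is self-training (pseudo-labelling). The sketch below is an assumption-laden illustration rather than a standard recipe: it uses a nearest-centroid rule on one-dimensional values, and the function names and data are hypothetical.

```python
def nearest_label(x, centroids):
    # Pick the label whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def self_train(labelled, unlabelled, max_rounds=10):
    """labelled: list of (value, label); unlabelled: list of values."""
    pseudo = {}  # value -> label guessed by the model
    for _ in range(max_rounds):
        groups = {}
        for x, label in labelled + list(pseudo.items()):
            groups.setdefault(label, []).append(x)
        centroids = {label: sum(xs) / len(xs) for label, xs in groups.items()}
        updated = {x: nearest_label(x, centroids) for x in unlabelled}
        if updated == pseudo:  # pseudo-labels stopped changing
            break
        pseudo = updated
    return pseudo


labelled_points = [(1.0, "a"), (9.0, "b")]
unlabelled_points = [2.0, 3.0, 4.0, 6.0, 7.0, 8.0]
pseudo_labels = self_train(labelled_points, unlabelled_points)
```

Practical self-training usually keeps only the predictions the model is confident about in each round; this sketch labels every point to stay short.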
In unsupervised learning, the machine learning algorithm studies data to identify patterns. There is no answer key or human operator to provide instruction. Instead, the machine determines the correlations and relationships by analyzing available data. The algorithm is left to interpret large data sets and tries to organize that data in some way to describe its structure. This might mean grouping the data into clusters or arranging it in a way that looks more organized.
As it assesses more data, its ability to make decisions on that data gradually improves and becomes more refined.
Unsupervised learning techniques include:
- Clustering: Clustering involves grouping sets of similar data (based on defined criteria). It’s useful for segmenting data into several groups and performing analysis on each data set to find patterns.
- Dimension reduction: Dimension reduction reduces the number of variables being considered, keeping only those needed to capture the information required.
Reinforcement learning focuses on regimented learning processes, where a machine learning algorithm is provided with a set of actions, parameters and end values. Once the rules are defined, the machine learning algorithm tries to explore different options and possibilities, monitoring and evaluating each result to determine which one is optimal. Reinforcement learning teaches the machine through trial and error. It learns from past experiences and begins to adapt its approach in response to the situation to achieve the best possible result.
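As a rough illustration of trial-and-error learning, the sketch below uses tabular Q-learning, one common reinforcement learning method chosen here for brevity. The environment is a hypothetical five-cell corridor in which the agent starts at the left end and is rewarded only on reaching the right end; all names and parameter values are assumptions.

```python
import random


def choose_action(q_row, rng, epsilon):
    # Explore with probability epsilon; otherwise exploit, breaking ties at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_learning(n_states=5, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]; 0 = left, 1 = right
    for _ in range(episodes):
        state, steps = 0, 0
        while state != goal and steps < 1000:
            action = choose_action(q[state], rng, epsilon)
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Move the estimate toward reward plus discounted best future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            steps += 1
    return q


q_table = q_learning()
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
```

After training, the greedy policy picks "right" in every non-terminal cell, which is the shortest route to the reward.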
Deciding which machine learning algorithms to use
Choosing the right machine learning algorithm depends on several factors, including, but not limited to: data size, quality and diversity, as well as what answers businesses want to derive from that data. Additional considerations include accuracy, training time, parameters, data points and much more. Therefore, choosing the right algorithm is a combination of business need, specification, experimentation and time available.
Even the most experienced data scientists cannot tell you which algorithm will perform the best before experimenting with others. We have, however, compiled a machine learning algorithm cheat sheet, which will help you find the most appropriate one for your specific challenges.
What are the most common and popular machine learning algorithms?
The most commonly used machine learning algorithms are described below. This list is not meant to be exhaustive, but it does include the algorithms that data scientists are most likely to run into when solving business problems.
Keep in mind that many of these techniques are combined and used together, and often you have to experiment by trying out different algorithms and comparing the results.
Clearly, there are a lot of things to consider when it comes to choosing the right machine learning algorithms for your business’ analytics. However, you don’t need to be a data scientist or expert statistician to use these models for your business. At SAS, our products and solutions utilize a comprehensive selection of machine learning algorithms, helping you to develop a process that can continuously deliver value from your data.
Naïve Bayes Classifier Algorithm
(Supervised Learning - Classification)
The Naïve Bayes classifier is based on Bayes’ theorem and treats every feature as independent of every other feature, given the class. It allows us to predict a class/category, based on a given set of features, using probability.
Despite its simplicity, the classifier does surprisingly well and is often used because, on some problems, it can match or outperform more sophisticated classification methods.
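A minimal sketch of the idea, assuming a bag-of-words text problem with add-one (Laplace) smoothing; the function names and the four training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (list_of_words, label)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(model, words):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Each word contributes independently: the "naive" assumption.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


messages = [
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "now"], "ham"),
]
nb_model = train_naive_bayes(messages)
```

Working in log-probabilities avoids numeric underflow when many small probabilities are multiplied together.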
K Means Clustering Algorithm
(Unsupervised Learning - Clustering)
The K Means Clustering algorithm is a type of unsupervised learning, which is used to categorize unlabelled data, i.e. data without defined categories or groups. The algorithm works by finding groups within the data, with the number of groups represented by the variable K. It then works iteratively to assign each data point to one of K groups based on the features provided.
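The iterative assignment described above can be sketched as follows. This is a simplified version that seeds the centroids with the first K points (real implementations usually pick starting points more carefully); the data is hypothetical.

```python
def k_means(points, k, iterations=100):
    centroids = [list(p) for p in points[:k]]  # simplification: first k points as seeds
    assignments = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved, so we have converged
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


data_points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centres, cluster_ids = k_means(data_points, k=2)
```

Even though both seeds start in the same corner of the data, the update step pulls one centroid toward the second group within a couple of iterations.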
Support Vector Machine Algorithm
(Supervised Learning - Classification)
Support Vector Machine algorithms are supervised learning models that analyze data used for classification and regression analysis. They essentially filter data into categories, which is achieved by providing a set of training examples, each marked as belonging to one or the other of the two categories. The algorithm then works to build a model that assigns new values to one category or the other.
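A rough sketch of a linear support vector machine, assuming the two categories are encoded as +1 and -1. Production solvers typically use quadratic programming or specialised optimisers; here plain subgradient descent on the regularised hinge loss stands in for them, and the data and parameter values are hypothetical.

```python
def train_linear_svm(points, labels, lr=0.1, reg=0.01, epochs=200):
    n = len(points)
    n_features = len(points[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(n_features):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


svm_points = [(2, 2), (3, 3), (3, 2), (-2, -2), (-3, -3), (-2, -3)]
svm_labels = [1, 1, 1, -1, -1, -1]
svm_model = train_linear_svm(svm_points, svm_labels)
```

Only points that fall inside the margin contribute to the update, which is why the boundary ends up determined by the examples closest to it.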
Linear Regression
(Supervised Learning - Regression)
Linear regression is the most basic type of regression. Simple linear regression allows us to understand the relationship between two continuous variables.
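For two variables, the best-fitting straight line has a closed-form least-squares solution. The sketch below computes it directly; the function name and sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # slope 2.0, intercept 1.0
```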
Logistic Regression
(Supervised learning – Classification)
Logistic regression focuses on estimating the probability of an event occurring based on the previous data provided. It is used to model a binary dependent variable, that is, one where only two values, 0 and 1, represent the outcomes.
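A minimal sketch of a one-variable logistic regression fitted by gradient descent. The input is assumed to be a standardised feature (centred on zero); the data, learning rate and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Difference between predicted probability and the actual 0/1 outcome.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def probability(model, x):
    w, b = model
    return sigmoid(w * x + b)


feature = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
outcome = [0, 0, 0, 1, 1, 1]
logit_model = fit_logistic(feature, outcome)
```

The model outputs a probability between 0 and 1; thresholding it at 0.5 turns that probability into a binary prediction.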
Artificial Neural Networks
An artificial neural network (ANN) is essentially a large number of interconnected processing elements, working in unison to solve specific problems. ANNs are inspired by biological systems, such as the brain, and how they process information.
ANNs also learn by example and through experience, and they are extremely useful for modelling non-linear relationships in high-dimensional data or where the relationship among the input variables is difficult to understand.
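To show how interconnected processing elements can capture a non-linear relationship, the sketch below wires up a tiny two-layer network that computes XOR, a function no single straight line can separate. The weights are hand-picked for illustration only; a real ANN would learn them from examples.

```python
def relu(v):
    return max(0.0, v)


def dense(inputs, weights, biases, activation):
    # Each output unit takes a weighted sum of all inputs, then applies an activation.
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def xor_network(x1, x2):
    # Hidden layer: two units with hand-picked (not learned) weights.
    hidden = dense([x1, x2], [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu)
    # Output layer: linear combination of the hidden units.
    output = dense(hidden, [[1.0, -2.0]], [0.0], lambda v: v)
    return output[0]


xor_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

The non-linear activation in the hidden layer is what makes this possible: without it, the two layers would collapse into a single linear function.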
Decision Trees
(Supervised Learning – Classification/Regression)
A decision tree is a flow-chart-like tree structure that uses a branching method to illustrate every possible outcome of a decision. Each node within the tree represents a test on a specific variable – and each branch is the outcome of that test.
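The node-and-branch structure can be sketched as nested dictionaries. The tree below is hand-built and entirely hypothetical (a toy loan decision); in practice the tests and thresholds are learned from training data.

```python
# Internal nodes test one variable; leaves carry the final label.
loan_tree = {
    "feature": "income",
    "threshold": 50000,
    "left": {"label": "decline"},  # income <= 50000
    "right": {  # income > 50000
        "feature": "debt",
        "threshold": 10000,
        "left": {"label": "approve"},  # debt <= 10000
        "right": {"label": "decline"},  # debt > 10000
    },
}


def tree_predict(node, sample):
    # Follow one branch per test until a leaf is reached.
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


decision = tree_predict(loan_tree, {"income": 60000, "debt": 5000})  # "approve"
```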
Random Forests
(Supervised Learning – Classification/Regression)
Random forests, or random decision forests, are an ensemble learning method that combines multiple decision trees to generate better results for classification, regression and other tasks. Each individual classifier is weak, but when combined with others, can produce excellent results. Each tree in the forest is a decision tree (a tree-like graph or model of decisions): an input is entered at the top and travels down the tree, with data being segmented into smaller and smaller sets based on specific variables.
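A much-reduced sketch of the ensemble idea: each "tree" here is a one-split stump trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. Real random forests grow deeper trees; the names, data and settings below are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    values = sorted(set(r[feature] for r in rows))
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(rows) + 1, None, majority, majority)  # errors, threshold, left, right
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return feature, best[1], best[2], best[3]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    if threshold is None or row[feature] <= threshold:
        return left_label
    return right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))  # random feature for this stump
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


forest_rows = [(1, 1), (2, 1), (3, 2), (7, 8), (8, 9), (9, 7)]
forest_labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(forest_rows, forest_labels)
```

Because each stump sees a different resample of the data, an occasional poor stump is outvoted by the rest.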
K-Nearest Neighbours
The K-Nearest-Neighbour algorithm estimates how likely a data point is to be a member of one group or another. It essentially looks at the data points around a single data point to determine what group it is actually in. For example, if one point is on a grid and the algorithm is trying to determine what group that data point is in (group A or group B, for example), it would look at the data points near it to see what group the majority of the points are in.
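The grid example can be sketched in a few lines. The points and group names are hypothetical, and Euclidean distance is assumed as the measure of nearness.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """training: list of (point, group); returns the majority group of the k nearest points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(group for _, group in nearest).most_common(1)[0][0]


grid = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
group = knn_predict(grid, (2, 2))  # "A"
```

Choosing an odd K avoids tied votes when there are only two groups.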