Introduction to machine learning: Five things the quants wish we knew
By Kimberly Nevala, Best Practices, SAS
The profusion of new data sources, along with analytic platforms that allow processing at scale and in real time, has brought machine learning (not a new concept; it dates to the 1950s) to Main Street. Along with a rash of misconceptions, misperceptions and, yes, even fear.
While we can’t tackle the whole problem and provide a comprehensive introduction to machine learning in a single article, we can start by highlighting five common misconceptions that will keep non-quants in the conversation.
1. It’s a Black Box to Everybody
Unlike most traditional statistical models, the models created by machine learning algorithms are often nonlinear and can have thousands, or even billions, of rules or parameters that define the model. So A plus B does not always equal C.
The truth is, the exact processing pathways are a black box, even to the data scientist. It's like dealing with a person whose logic you don't understand: with familiarity you can often predict their actions, even when their motivations and thought processes remain obscure (or at least appear to be). There is a method to the madness, but it may not be obvious or linear; the exact path through a neural network, for example, is not easy to trace. The more important question is whether the algorithm or method is being applied appropriately to the problem at hand. Which leads us to…
2. The Proof Is in the Pudding (aka, Trust but Verify)
Don’t confuse black box processing with blind faith. If the analytic mechanisms, or more specifically the processing pathways, are not clear or easily reproducible, how do you validate results? When it comes to machine learning, the answer is deceptively simple: does the algorithm accurately predict future events or produce the desired outcomes? Are the outputs useful?
That’s it. No more, no less. Machine learning done right can be characterized by the tag line: complicated methods, consumable results. The other nugget here? Machine learning should be integral to analytic discovery, not an adjunct activity.
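The "trust but verify" test above can be sketched in a few lines of code. This is a minimal illustration, not anything from the article itself: it assumes scikit-learn and a synthetic dataset, and uses a gradient boosting model to stand in for any "black box." The verification step is simply measuring accuracy on data the model never saw.

```python
# Minimal sketch of "trust but verify": judge a black-box model purely on
# how well it predicts data held out from training. Synthetic data and
# model choice are illustrative assumptions, not from the article.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A "black box" to most readers: hundreds of rules across boosted trees
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# The verification step: does it predict well on unseen data?
holdout_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {holdout_accuracy:.3f}")
```

Nothing about the model's internals needs to be traced; the holdout score is the verdict.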
3. Hammer, Meet Nail
Machine learning is a tool in the analytics toolbox. Like any tool, it must be thoughtfully applied lest it become the proverbial hammer looking for a nail. As machine learning emerged from academia, early adopters often found themselves expending significant time and effort on problems that could have been solved easily with traditional statistical algorithms.
Certain problems and data lend themselves to machine learning: in simple terms, problems where accuracy is more important than interpretation, and data that presents problems for traditional analysis techniques. For example, consider object recognition in images. We may not care to understand how the model works; we just care that the model identifies certain characters or objects in new images. Image datasets can be wider than they are deep (because of the high number of pixels in HD images) and can contain many correlated variables (pixels close to one another often have very similar values). Wide data and correlated data can present problems for traditional regression analysis.
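The correlated-data problem above can be made concrete with a tiny NumPy experiment. This is a hedged illustration under assumed synthetic data: two nearly identical columns (think neighboring pixels) make the normal equations of ordinary regression ill-conditioned, so the individual coefficients become unstable even though their combined effect is pinned down.

```python
# Why near-duplicate columns (like neighboring pixels) trouble ordinary
# regression: the normal equations become ill-conditioned. Synthetic data;
# purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)  # "neighboring pixel": almost identical
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)    # outcome truly depends on x1 only

# Condition number of X^T X explodes when columns are nearly collinear
cond = np.linalg.cond(X.T @ X)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"condition number: {cond:.2e}")
# Individual coefficients are unstable; only their sum is reliably near 1
print(f"OLS coefficients: {coef}, sum = {coef.sum():.3f}")
```

Many machine learning methods (trees, regularized models) shrug off this kind of redundancy, which is one reason they suit wide, correlated data.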
4. Less Is Sometimes More
When it comes to machine learning, a simple algorithm with more data can often beat a complicated algorithm with less data, even when the bigger data set is slightly dirtier. (No, I’m not arguing that data does not need to be processed before being used in machine learning algorithms.) Regardless, a cautionary note is in order here.
For the inexperienced data scientist, more complicated might seem better. Or the higher the accuracy, the better. For many practical applications, however, minute improvements in model accuracy do not translate into meaningful operational improvements. More data and features may also complicate the algorithm unnecessarily. There is a big difference between the real world and a Kaggle contest! The balancing act here is between complexity and the ability to consume the results. When should you call time? See “The Proof Is in the Pudding” above.
(OK. Maybe the quants don’t really want you to know this, but I think it’s important.)
5. Humans Welcome to Apply
Yes, the methods can seem obscure and are, in fact, often inscrutable. But while machine learning algorithms are black boxes, machine learning in practice requires human application of the scientific method and human communication skills. The recipe is not as simple as: "add data and stir." Humans, above and beyond the data scientist programming the algorithm, are required to answer questions such as:
- What are we trying to predict? (Which influences feature engineering – deciding what data to incorporate and analyze.)
- How can results be applied? Machine learning is great at determining what to do, but not necessarily so good at defining how (a challenge that has dogged early robotics).
- What is the proper response? For instance, when a pattern emerges that has global health or political ramifications, what is the proper next step?
- Are results in line with expectations? Are there exceptions to be addressed? Consider Stanford and Google’s work in computer vision. While dang good, it was not foolproof: goats were characterized as dogs, a field of tulips as hot air balloons. And, yes, these are inconsequential gaffes compared to more recently publicized mistakes, but you get the gist.
- Does the model need to be tuned excessively for realistic usage?
The bottom line? While this was just a short introduction to machine learning, one thing we know for sure: it is still a collaboration between man and “machine.”
Kimberly Nevala is the Director of Business Strategies for SAS Best Practices, responsible for industry education, key client strategies and market analysis in the areas of business intelligence and analytics, data governance and master data management. She has more than 15 years of experience advising clients on the development and implementation of strategic customer and data management programs and managing mission-critical projects.