Analytics meets Hadoop: Harnessing the Power
By: Pat Finerty, VP of Alliances and Business Development at SAS Canada
Stream it, score it, store it. Data is the foundation of enterprise systems, and managing it so it's accessible to all business users is a key competitive advantage.
That was one of the takeaways of a recent SAS Live event in Toronto focusing on the relationship between analytics and Big Data platform Hadoop. The event included a demonstration of how visualization and Big Data technologies combine to create the “citizen data scientist”—the business user who can perform analytics on vast data sets without relying on the information technology department to structure queries to run against that data.
“Visualization is the gateway to analytics,” said Bob Messier, senior director of business analytics for SAS Canada. Tools that democratize analytics help bridge the gap between business users modeling data relationships and the IT professionals charged with structuring the query.
“For that to work, you have to get the data right,” said Steve Holder, national practice lead for analytics at SAS Canada. Proper management makes data consumable for analytics, and it's no small feat given the variety of data sources being captured: structured point-of-sale data, machine-generated telemetry (the Internet of Things), and unstructured social media content. That management begins with data collection, the three “S”s:
* Stream it: Capture data, at volume, in real-time. Analytics are applied at the gateway to drive real-time decision-making.
* Score it: Assess the value of the data and apply metadata that makes it usable in an analytical environment. At this volume, scoring is an impossible task for human analysts alone; it requires a large degree of automation, and alongside that, machine learning, to drive decisions at scale.
* Store it: Keep the data you need according to the scoring process in an environment that’s accessible for business users.
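The three “S”s above can be sketched as a toy pipeline. This is plain Python with invented event fields and a made-up scoring rule standing in for real streaming analytics; the function names are illustrative, not any SAS or Hadoop API:

```python
import random

def stream_events(n):
    """Stream it: simulate a real-time event feed, e.g. IoT telemetry."""
    for i in range(n):
        yield {"id": i, "reading": random.uniform(0.0, 100.0)}

def score_event(event, threshold=50.0):
    """Score it: attach metadata that makes the event usable for analytics."""
    event["score"] = event["reading"] / 100.0      # made-up scoring rule
    event["keep"] = event["reading"] >= threshold  # what's worth keeping
    return event

def run_pipeline(n):
    """Stream, score, and store only the events the scoring step keeps."""
    store = []                        # stand-in for an analytics-ready store
    for event in stream_events(n):    # Stream it
        scored = score_event(event)   # Score it
        if scored["keep"]:
            store.append(scored)      # Store it
    return store

kept = run_pipeline(200)
print(f"stored {len(kept)} of 200 events")
```

The point of the sketch is the shape of the loop: scoring happens as the data arrives, and only data the scoring step deems valuable reaches the store business users query.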
This data accessibility allows a new approach to analytics, said Vamsi Chemitiganti, general manager of Hadoop open platform provider Hortonworks Inc. In earlier analytics environments, business staff would base models on hard business cases—confirmed relationships within data sets. The combination of visual analytical tools and an open storage platform enables a process of discovery—exploring the connections among various data sets. The data can now actually inform the opportunity.
In a traditional analytics environment, Chemitiganti said, the query process was sequential and heavily dependent on IT: design a single query based on a list of questions, collect the structured data to run the query against, run the query, and determine additional questions to run through the same process again. Using Hadoop and visual analytics tools, the process becomes one of real-time discovery and exploration, with iterative reasoning and flexible data schemas; the business user can re-architect the query process on the fly, without relying on IT.
A common misconception about Hadoop is to view it as an application, Chemitiganti said. It’s actually a platform, built on a number of open source components.
HDFS: The Hadoop Distributed File System is the basic building block, splitting data and files into smaller blocks that are distributed across a cluster of servers and streamed back in response to user requests.
YARN: Sometimes referred to as MapReduce 2.0, YARN (Yet Another Resource Negotiator) acts as the cluster's operating system, allowing a variety of approaches to data processing and supporting the applications that sit on top of it.
Applications: A number of open source applications run on top of YARN to handle analytics workloads, including Solr, Spark, Storm, Hive, Pig and several others. Other applications take care of data governance, data security, and operations. Isolating users from this complexity is key to enabling the citizen data scientist.
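As a rough illustration of what the HDFS layer described above does, here is a toy sketch in plain Python, not the real HDFS API: a file is split into fixed-size blocks and each block is assigned to several nodes. The block size and replication factor simply mirror common HDFS defaults.

```python
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default block size
REPLICATION = 3                  # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Yield fixed-size chunks of the data, the way HDFS blocks a file."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, _block in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node-a", "node-b", "node-c", "node-d"]
data = b"x" * 300                # pretend file, tiny block size for the demo
blocks = list(split_into_blocks(data, block_size=128))
print(place_blocks(blocks, nodes))   # block index -> list of replica nodes
```

Replication is what makes the scheme fault-tolerant: if one server in the cluster fails, every block it held still exists on other nodes, so reads can be streamed from the surviving replicas.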
“Data is an asset,” Chemitiganti said, and leveraging it through analytics can make a company a disruptive influence in its market. The alternative is to allow your competitors to disrupt yours.
Chemitiganti described information as a “data lake,” with huge volumes of data in a variety of formats.
To drink from that lake, consistent data management practices are necessary, according to Tim Trussell, data science specialist with SAS. A lack of data standards is a roadblock to the discovery process. For example, something as simple as geographic information can conflict depending on the source; is it CA, Calif., or California?
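A minimal sketch of that kind of standardization, using the geography example from the text (the mapping table and function name here are hypothetical; a real data-quality pipeline would draw on maintained reference data): map each source spelling to one canonical value before it lands in the lake.

```python
# Hypothetical lookup table: every spelling a source system might use,
# keyed lowercase, mapped to the one canonical value analytics will see.
CANONICAL_REGION = {
    "ca": "California",
    "calif.": "California",
    "calif": "California",
    "california": "California",
}

def normalize_region(value: str) -> str:
    """Return the canonical spelling, or the cleaned input if unknown."""
    key = value.strip().lower()
    return CANONICAL_REGION.get(key, value.strip())

for raw in ["CA", "Calif.", "California"]:
    print(normalize_region(raw))   # all three normalize to "California"
```

Without a step like this, a discovery tool would treat "CA" and "California" as different regions and silently split their numbers apart.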
A stable body of data is the first step in the loop of data analysis: data enables the discovery process, and discovery leads to deployment of models. Add those three “D”s to the three “S”s.
It’s also effective to push the modeling to where the data lives, and it lives in Hadoop. In a visual analytics environment, there will be many competing models; push the best of them to where the data is, Trussell said.
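A toy sketch of that champion/challenger idea (the threshold "models" and their names are invented for illustration): score the competing models against held-out validation data, and only the winner gets pushed to where the data lives.

```python
# Each "model" is just a threshold rule mapping a value to True/False;
# real competing models would be trained classifiers.
def make_threshold_model(t):
    return lambda x: x >= t

def accuracy(model, examples):
    """Fraction of (value, label) pairs the model classifies correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)

def pick_champion(models, validation):
    """Return (name, score) of the best-performing competing model."""
    scored = {name: accuracy(m, validation) for name, m in models.items()}
    return max(scored.items(), key=lambda kv: kv[1])

validation = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
models = {f"threshold_{t}": make_threshold_model(t) for t in (0.3, 0.5, 0.7)}
print(pick_champion(models, validation))   # the champion model and its score
```

Only the champion is deployed into Hadoop alongside the data; the challengers stay in the visual analytics environment for the next round of iteration.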
But for an enterprise just starting down the road to a real-time analytics culture, what does the roadmap look like? Chris Dingle, director of customer intelligence for communications giant Rogers, was on hand for a panel discussion of analytics in the real enterprise world.
“Analytics is a team sport,” said Dingle, alluding to Rogers’s role as co-owner of Maple Leaf Sports and Entertainment (MLSE), parent company of sports franchises including the NHL’s Toronto Maple Leafs, the NBA’s Toronto Raptors, and Major League Soccer’s Toronto FC (TFC).
Cross-functional teams of players from various business units—marketing, logistics, executive, IT, etc.—begin work on small, proof-of-concept projects. Repeatable patterns are important for developing an analytics culture: awareness of the problem or opportunity that the relationships among data sets can help resolve; small-scale proof-of-concept projects; optimization of the results of those projects; then deployment in the enterprise.
But there’s a two-track analytics process within the business. One track supports the stability of the business: how do we optimize the way we’re running now? The other drives innovation, powerful leverage in any vertical: how can we create new products and services based on our data stores that will drive new demand and create new customers?
The citizen data scientist can manage short-term analytics that provide tactical advantage—pricing, merchandising, inventory, and much more. This frees the MBA-trained data scientist talent to focus on more strategic data relationships.
Much of the material covered at the event (using cross-functional teams, starting with proofs of concept, drawing on disparate data sources, and so on) qualifies as what SAS considers best practice for advanced analytics. Fern Halper of The Data Warehousing Institute enumerates many best practices for advanced analytics on the SAS Insights blog, and in the TDWI report Next-Generation Analytics and Platforms for Business Success. It’s a good starting point for the strategic journey toward advanced analytics that will help your organization wring more value out of the data it collects.