How openness can supercharge event stream analytics

By David Loshin, President, Knowledge Integrity Inc.

Many organizations place laser focus on tools that can help them collect and store data streams from internal and external sources. To that end, big data platforms are designed to ingest massive amounts of data. And several ecosystem architectures (such as Spark, Kafka and Flume) are engineered specifically to manage and process multiple data streams simultaneously. At the same time, data scientists are fueling organizations’ desires to uncover profitable opportunities that lie dormant in the conglomeration of data landing in their growing data lakes.

The implication: It’s time to strike a balance between the operational work of rapidly absorbing data streams and the analytical work of integrating those streams with predictive and prescriptive models that can drive profitable actions.

This is where openness comes into play.

Four types of analysts who work with streaming big data

To establish practical methods for blending operational activities and streaming data with integrated analytics, we must first understand who the actors are and the challenges that impede their progress.

  • Exploratory analysts – generally known as “data scientists” – apply a variety of analytics algorithms to big data to mine predictive patterns related to business opportunities. But their actions are focused on discovery, which requires others to adapt what they’ve learned and integrate it into the operational environment. These analysts need to access raw data in the data lake to apply algorithms.
  • Applied analysts validate patterns discovered by data scientists and integrate the predictive pattern assessment into the operational environments that ingest streaming data. These analysts work with development teams to configure event stream analytics techniques bound to incoming data feeds – ensuring that those tools can access both the big data environment and conventional data warehouses.
  • Business analysts want to assess the success of integrated predictive analytics. Business analysts often scan real-time dashboards reflecting scores of key performance indicators, review periodic (e.g., hourly or daily) reports, and occasionally execute ad hoc queries to drill down on particular areas of concern. They require access to users’ business intelligence tools that are fed by real-time operational analytics.
  • Technical analysts are tasked with ensuring the fidelity of all these processes.
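The hand-off from exploratory to applied analysts described above can be pictured with a minimal Python sketch. It assumes a hypothetical scoring rule found during discovery and a hypothetical event shape; the names `Event`, `discovered_rule` and `score_stream` are illustrative and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; real feeds will carry different fields."""
    customer_id: str
    amount: float


def discovered_rule(event: Event) -> float:
    """Assumed output of discovery work: a score in [0, 1] that grows with amount."""
    return min(event.amount / 1000.0, 1.0)


def score_stream(
    events: Iterable[Event],
    model: Callable[[Event], float],
    threshold: float = 0.8,
) -> Iterator[Tuple[Event, float]]:
    """Yield (event, score) for each incoming event whose score meets the threshold."""
    for event in events:
        score = model(event)
        if score >= threshold:
            yield event, score


# A small in-memory list stands in for a live data feed.
feed = [Event("a", 120.0), Event("b", 950.0), Event("c", 2400.0)]
alerts = list(score_stream(feed, discovered_rule))
```

Because the model is passed in as a plain function, the applied analyst can swap in a newly validated pattern without touching the code that reads the feed.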

Challenges created by organic system sprawl

Even on the most agile teams, each of these archetypes has to contend with a multitude of issues. For example:

  • Computational constraints. No matter the size, big computing clusters are eventually going to be bogged down by data latency issues. At the same time, they’ll be strained to keep up with increasing numbers of incoming data feeds.
  • Cost constraints. Organizations often invest in systems that are much larger than what they currently need. That’s because maintaining a scalable solution demands investing in infrastructure that can maintain expected performance at peak usage times.
  • Integration with analytics. Each type of analyst uses a different mix of reporting and analytics tools. As more analysts are engaged, the need to connect different toolsets to the data rises. That means increased complexity in establishing connections between applications, especially with proprietary tools.
  • Ecosystem governance. The complexity and confusion introduced by the diversity of tools increase the administrative and operational demands of keeping tools and versions compatible across the organization.
  • Analytics governance. With each type of analyst working in a virtual vacuum, it’s hard to ensure consistency of interpretation and adaptation of different types of analyses. You don’t want two analysts who use the same methodologies and the same algorithms to come up with conflicting results.

The organic enterprise system sprawl that has been the typical modus operandi for system design and development is a root cause of these challenges. Such organic development has created enterprises that are rife with cross-generational hardware and software. As system incompatibilities increase, newer systems compete for operations support with legacy applications desperately clinging to outdated technology.

Overcoming incompatibility with openness

One approach to breaking the implicit system oligarchy is to embrace openness as a fundamental practice for data management and analytics. The approach has been successful in the operating system (OS) and platform tools arena – note the success of Linux and its disruptive effect on the stranglehold of proprietary OS software.

The evolution of the open source Hadoop ecosystem reflects a modern approach to openness that addresses the challenges of event stream analytics. How? The Apache ecosystem, already used for many emerging high-performance applications, pairs a scalable, massively parallel hardware configuration with a suite of extensible components that can be standardized from both an operations and a content-directed perspective.

What openness means for event stream analytics

From the perspective of streaming analytics, openness implies standards for development, implementation, access and utilization. But openness also embraces two critical facets:

  • It provides a single virtual framework that simultaneously supports the needs of different types of analysts without requiring significant hacks to ensure interoperability. This means being able to access the data, use analytics tools, integrate event stream analytics, and deliver reports and populate dashboards – all from the same environment.
  • It simplifies governance and day-to-day oversight of operations.

Finally, an open environment must address performance and cost challenges. One approach is to employ a high-performance, in-memory data architecture layered on top of a scalable-yet-elastic cloud computing environment. This type of scalable, in-memory configuration addresses the need for computational performance. At the same time, deployment on an elastic cloud allows the system to grow and shrink as necessary, addressing corresponding cost constraints.
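As a rough illustration of the grow-and-shrink behavior described above, the Python sketch below sizes a pool of workers to the incoming event rate. The function name, the per-worker capacity and the pool limits are all assumptions made for the example, not figures from any real platform.

```python
import math


def workers_needed(
    events_per_sec: float,
    per_worker_capacity: float,
    min_workers: int = 1,
    max_workers: int = 64,
) -> int:
    """Return how many workers to run for the current event rate.

    All parameters are hypothetical: per_worker_capacity is the number of
    events per second one worker is assumed to handle, and the pool is
    clamped between min_workers and max_workers.
    """
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    needed = math.ceil(events_per_sec / per_worker_capacity)
    return max(min_workers, min(needed, max_workers))


# Quiet period, normal load, and a spike beyond the pool limit.
quiet = workers_needed(0, 1000)
normal = workers_needed(2500, 1000)
spike = workers_needed(1_000_000, 1000)
```

Paying only for the workers that the current rate requires is what lets an elastic deployment avoid sizing the whole system for peak usage.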

Creating a common, consolidated environment to develop, deploy and manage tasks across the entire analytical life cycle streamlines operational governance by ensuring tool alignment. Standard APIs simplify discovery analytics as well as application development, especially streaming analytics.
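One way to picture a standard API shared by discovery and streaming work is the Python sketch below. It assumes a hypothetical `Scorer` contract with a single `score` method; the `WeightedSum` model and the field names are invented for illustration only.

```python
from typing import Dict, Iterable, Iterator, List, Protocol


class Scorer(Protocol):
    """Hypothetical shared contract: any model exposing score() fits."""

    def score(self, record: Dict[str, float]) -> float: ...


class WeightedSum:
    """Illustrative model: a weighted sum of named fields."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self.weights = weights

    def score(self, record: Dict[str, float]) -> float:
        # Fields missing from a record contribute zero.
        return sum(w * record.get(name, 0.0) for name, w in self.weights.items())


def score_batch(model: Scorer, rows: List[Dict[str, float]]) -> List[float]:
    """Discovery path: score historical rows all at once."""
    return [model.score(row) for row in rows]


def score_events(model: Scorer, events: Iterable[Dict[str, float]]) -> Iterator[float]:
    """Streaming path: score events one at a time as they arrive."""
    for event in events:
        yield model.score(event)


model = WeightedSum({"clicks": 0.5, "purchases": 2.0})
history = [{"clicks": 4.0, "purchases": 1.0}, {"clicks": 10.0}]
batch_scores = score_batch(model, history)
stream_scores = list(score_events(model, iter(history)))
```

Since both paths call the same `score` method on the same model object, the results seen during discovery match those produced in the operational stream.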

More importantly, effective communication – coupled with the combination of a common environment and standard APIs – provides a means for different types of analysts to continually collaborate about end-to-end application design, development and deployment processes. Analysts can work together to standardize definitions, evaluate analytical models and ensure consistency when adapting discovered models in operational streaming processes.

The key takeaway is this: An open, big data environment facilitates an agile development life cycle. That, in turn, helps you speed, manage and govern the full streaming analytics life cycle.

David Loshin

David Loshin is the president of Knowledge Integrity Inc., a consulting, training and development services company that works with clients on business intelligence, big data, data quality, data governance and master data management initiatives. Loshin writes for many industry publications, including the Data Roundtable, TDWI Upside, and several TechTarget websites; he also speaks frequently at conferences and creates and teaches courses for The Data Warehousing Institute and other educational organizations. In addition, Loshin is the author of numerous books, including Big Data Analytics and Business Intelligence, Second Edition: The Savvy Manager's Guide.