The seven steps of big data delivery
Everything you need to know to get started
Remember Tim Allen's character on the 1990s hit show, Home Improvement? When Allen – outfitted in his hard hat and tool belt – starts talking shop, he makes a simian grunt that represents the particular pleasure men take from high-performance machines: "Arghh." Belt sander? Arghh. Sliding miter saw? Arghh. Classic Ford truck with a blown and injected 426 Hemi? Arrrgggggghhhh!
The big data trend represents the evolving need to process large amounts of data with a new crop of technology solutions that aren't necessarily your father's database. So, what does a company need to consider when contemplating getting started with big data?
Before we go too far, here's my definition of big data: The emerging technologies and practices that enable the collection, processing, discovery and storage of large volumes of structured and unstructured data quickly and cost-effectively.
Big data – from financial trades to human genomes to telemetry sensors in cars to social media interactions to Web logs and beyond – is expensive to process and store in traditional databases. To solve that problem, new technologies use open source solutions and commodity hardware to store data efficiently, parallelize workloads and deliver screaming-fast processing power.
As more IT departments research big data alternatives, the discussion centers on stacks, processing speeds and platforms. Although IT departments are savvy enough to grasp the limitations of their incumbent technologies, many can't articulate the business value of these alternative solutions, let alone how they will classify and prioritize the data once they identify it. Enter big data governance.
As companies develop their big data business cases, the platform and speed discussions are only part of the overall conversation about big data delivery. In reality, we're seeing seven steps necessary for realizing the full potential of big data:
- Collect: Data is collected from the data sources and distributed across multiple nodes – often a grid – each of which processes a subset of data in parallel.
- Process: The system then uses that same high-powered parallelism to perform fast computations against the data on each node. Next, the nodes reduce the resulting data findings into more consumable data sets to be used by either a human being (in the case of analytics) or machine (in the case of large-scale interpretation of results).
- Manage: Often the big data being processed is heterogeneous, originating from different transactional systems. Nearly all of that data needs to be understood, defined, annotated, cleansed and audited for security purposes.
- Measure: Companies will often measure the rate at which data can be integrated with other customer behaviors or records, and whether the rate of integration or correction is increasing over time. Business requirements should determine the type of measurement and the ongoing tracking.
- Consume: The resulting use of the data should fit in with the original requirement for the processing. For instance, if the goal of bringing in a few hundred terabytes of social media interactions is to determine whether and how social media data drives additional product purchases, then there should be rules for how that data is accessed and updated. This is equally important for machine-to-machine data access.
- Store: As the "data-as-a-service" trend takes shape, increasingly the data stays in a single location, while the programs that access it move around. Whether the data is stored for short-term batch processing or longer-term retention, storage solutions should be deliberately addressed.
- Govern: Data governance encompasses the policies and oversight of data from a business perspective. As defined, data governance applies to each of the six preceding stages of big data delivery. By establishing processes and guiding principles, governance sanctions behaviors around data. And big data needs to be governed according to its intended consumption. Otherwise, the risk is disaffection of constituents, not to mention overinvestment.
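To make the Collect and Process steps above a little more concrete, here is a minimal sketch in Python. It is an illustration only, not any particular platform's implementation: the function names (`collect`, `process_partition`, `process`), the sample `web_log` records and the round-robin distribution are all assumptions made for the example, and a thread pool on one machine stands in for the grid of nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def collect(records, node_count):
    """Collect: distribute records round-robin across node_count partitions.

    Each partition stands in for the subset of data held by one node.
    """
    partitions = [[] for _ in range(node_count)]
    for index, record in enumerate(records):
        partitions[index % node_count].append(record)
    return partitions


def process_partition(partition):
    """Per-node computation: count events by type within one partition."""
    return Counter(record["event"] for record in partition)


def process(partitions):
    """Process: compute on every partition in parallel, then reduce the
    partial results into one smaller, more consumable summary."""
    # Threads are a stand-in for nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(process_partition, partitions))
    # Reduce step: merge the per-node counters into a single counter.
    return reduce(lambda left, right: left + right, partial_counts, Counter())


# Hypothetical Web log records, invented for illustration.
web_log = [
    {"user": "a", "event": "view"},
    {"user": "b", "event": "purchase"},
    {"user": "a", "event": "view"},
    {"user": "c", "event": "view"},
    {"user": "b", "event": "share"},
]

partitions = collect(web_log, node_count=3)
summary = process(partitions)
print(dict(summary))
```

The point of the sketch is the shape of the work rather than the tooling: the data is split up first, each piece is computed on independently, and only the small merged summary reaches the person or machine that consumes it.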
Most of the early adopters charged with researching and acquiring big data solutions focus on the Collect and Store steps at the expense of the others. The question is implicit: "How do we gather all these petabytes of data, and where do we put 'em all once we have 'em?"
But the processes for defining discrete business requirements for big data still elude many IT departments. Business people often see the big data trend as just another pretext for IT résumé building with no clear endgame. Such an environment of mutual cynicism is the single biggest reason big data never transcends the tire-kicking phase.
As Lorraine Lawson, a writer for IT Business Edge, said in a recent blog post, "The only way to ensure your analysis is sound is to ensure you have a governance program in place for big data."
Entrenching data governance processes on behalf of a big data effort ensures that:
- Business value and desired outcomes are clear.
- Policies for the treatment of key data have been sanctioned.
- The right subject matter expertise is applied to the big data problem.
- Definitions and rules for key data are clear.
- There is an escalation process for conflict and questions.
- Data management – the tactical execution of data governance policies – is deliberate and relevant.
- There are decision rights for key issues during development.
- The results of big data analytics are useful and can be put into action.
- Data privacy policies are enforced.
In short, data governance means that the application of big data spurs business results. It's an insurance policy that guarantees that the right questions are being asked. That way, the immense power of new big data technologies is truly harnessed to make processing, storage and delivery faster, more cost-effective and more nimble than ever.
Bio: Jill Dyche is Vice President of Thought Leadership at SAS, a popular blogger, and the author of three books on the business value of information technology.