Tips for harnessing Hadoop the right way
There are some strong reasons to use Hadoop – and some limitations to consider
The cost of storing data has plunged dramatically. A gigabyte that cost $19 to store in 2000 costs 7 cents today. Pair cheap storage with data source growth, and organizations have a potential wealth of information at their fingertips. The key word is "potential," because simply storing data does nothing if you can't effectively use it. Data is useless until you analyze it.
Most enterprise data warehouses are constrained by the cost and scalability of relational databases. Storage might be cheap, but without a relational database to make sense of it, it is nothing more than a storage unit stuffed to the rafters with random artifacts.
Relational databases, designed to organize those artifacts, can only deal with hundreds of gigabytes to a few terabytes, and can only support at most 1,000-2,000 variables. The problem is that when you're doing true predictive analytics you often want to test a few thousand variables, or analyze tens of thousands of terms in a text mining example.
In the case of bioinformatics, analyzing something like a DNA microarray might require 64,000 data points or more. In examples like these, the limits of relational database technology become readily apparent.
A number of technologies and solutions have emerged to work around the constraints of relational databases. Foremost among those is the open-source distributed application framework Hadoop. If you believe its most enthusiastic supporters, Hadoop storage is nearly free and gets around the messy limitations of relational databases. Make no mistake: There are some great reasons to use Hadoop, but there are also some limitations to consider.
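To make the framework concrete: Hadoop's core programming model is MapReduce, in which work is split into a map step that emits key-value pairs and a reduce step that aggregates them by key across the cluster. The sketch below simulates that model in plain Python on a single machine for illustration only; it is not Hadoop's actual Java API, and the word-count task and sample lines are my own stand-ins.

```python
from collections import defaultdict

def map_phase(records):
    # Map step: each input line emits (word, 1) pairs,
    # as a Hadoop mapper would for a word-count job
    for line in records:
        for word in line.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle/reduce step: group pairs by key and sum the counts;
    # Hadoop distributes this grouping across nodes automatically
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big insight", "data at scale"]
print(reduce_phase(map_phase(lines)))
```

The point of the model is that neither function knows how many machines are involved; the framework handles partitioning, which is what lets the same logic scale from two lines of text to terabytes.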
More than just a pretty face
There is a way to get value from Hadoop without hiring an army of Hadoop programmers. Look for solutions that are built to take advantage of Hadoop storage through access software and other features, and then seek analytic projects that could benefit from Hadoop's ability to corral big data sets. Here are three of my favorite examples that show why Hadoop is more than just a technology to drive down storage costs:
Take the example of a major bank that needed to perform personally identifiable information redaction, a procedure growing in popularity given concerns about financial institutions inadvertently exposing customer names alongside Social Security numbers, addresses or taxpayer IDs.
The bank needed to crawl through 40 terabytes of data and not only redact names (when connected to other identifiers) but also track people on government watch lists, such as the politically exposed persons list. High-performance analytics coupled with Hadoop makes it possible to regularly sweep through data – ensuring nothing gets missed.
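The logic of such a sweep can be sketched in a few lines. This is a minimal, hypothetical illustration of the two rules described above: redact a name only when it appears alongside another identifier, and separately flag watch-list matches. The regex, watch-list entry, and record format are all invented for the example; a production system would run logic like this as a distributed job over the full data set with vetted identifier patterns.

```python
import re

# Hypothetical identifier pattern; a real sweep would use vetted detectors
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
WATCH_LIST = {"Jane Roe"}  # illustrative politically exposed persons entry

def sweep(record, known_names):
    """Redact known names only when the record also carries an identifier,
    and flag any watch-list matches for review."""
    flagged = [n for n in WATCH_LIST if n in record]
    if SSN.search(record):  # name + identifier together triggers redaction
        for name in known_names:
            record = record.replace(name, "[REDACTED]")
        record = SSN.sub("[REDACTED-SSN]", record)
    return record, flagged

rec = "Jane Roe, SSN 123-45-6789, opened account 8841"
print(sweep(rec, ["Jane Roe"]))
```

A name on its own passes through untouched; it is the co-occurrence with an identifier that makes a record sensitive, which is why the sweep has to see whole records rather than scan for names in isolation.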
A few cheap storage caveats
Two issues to weigh are security and enterprise-class management. As an enterprise technology, Hadoop is still green. To deal with security and data management concerns, it makes more sense to pair Hadoop with a solution-based access engine that provides a security layer and a level of control over the movement of data.
Anything that helps companies manage their growing data is a good thing. Research shows that companies that successfully use data to make decisions do better than those that work on instinctive decision making.
Today, storage technology like Hadoop is all the rage; the real value, however, comes from your ability to harvest meaningful information from the piles of data. The only way to do this is through the application of advanced analytics.
Bio: Michael Ames is the SAS Data Integration Product Manager.
This story appears in the Third Quarter 2012 issue of