Tips for harnessing Hadoop the right way

There are some strong reasons to use Hadoop – and some limitations to consider

The cost of storing data has plunged dramatically. A gigabyte that cost $19 to store in 2000 costs 7 cents today. Pair cheap storage with data source growth, and organizations have a potential wealth of information at their fingertips. The key word is "potential," because simply storing data does nothing if you can't effectively use it. Data is useless until you analyze it.

Most enterprise data warehouses are constrained by the cost and scalability of relational databases. Storage might be cheap, but without a relational database to make sense of it, stored data is nothing more than a storage unit stuffed to the rafters with random artifacts.

Relational databases, designed to organize those artifacts, can typically handle only hundreds of gigabytes to a few terabytes and support at most 1,000 to 2,000 variables. The problem is that when you're doing true predictive analytics, you often want to test a few thousand variables, or analyze tens of thousands of terms in a text mining example.

In the case of bioinformatics, analyzing something like a DNA microarray might require 64,000 data points or more. In these examples, the limits of relational database technology become readily apparent.

A number of technologies and solutions have emerged to work around the constraints of relational databases. Foremost among those is the open-source distributed application framework Hadoop. If you believe its most enthusiastic supporters, Hadoop storage is nearly free and gets around the messy limitations of relational databases. Make no mistake: There are some great reasons to use Hadoop, but there are also some limitations to consider.

More than just a pretty face
You can do a lot of neat things with Hadoop. You can treat it like a database and use Hive to query it. It offers massively parallel processing built on a deliberately simple MapReduce design. Hadoop, however, is not convenient: it requires high-end programming support from staff members fluent in Java and UNIX. The scripting language Pig is a partial solution to the skill-set needs, but not a complete one.
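To give a feel for how simple the MapReduce design is, here is a minimal sketch of the idea in plain Python. It runs in a single process and does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and word counting is simply the customary toy problem.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, the job a framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big storage"]))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data. The programmer supplies only the two small functions, which is what makes the model simple, but writing and tuning them at scale is where the Java and UNIX skills come in.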

But there is a way to get value from Hadoop without acquiring an army of Hadoop programmers. Look for solutions that are built to take advantage of Hadoop storage through access software and other features, and then seek analytic projects that could benefit from Hadoop's ability to corral big data sets. Here are three of my favorite examples that show why Hadoop is more than just a technology to drive down storage costs:

  • In Arizona, a bank uses Hadoop to prepare 10 years of historical data for stress test models. Hadoop is a terrific tool for quickly extracting, transforming and loading data. Prior to using Hadoop, the bank had the analytic capabilities to do the stress testing but didn't have the staff to prepare the data. Hadoop can do in 20 minutes what it would take a traditional ETL process two days to do.
  • Hadoop is an outstanding tool when you have a small number of very large records – common in the bioinformatics world. A traditional relational database might be able to handle up to 2,000 variables. In bioinformatics microarray analysis, you often have records with 64,000 variables. Hadoop can handle this.
  • If you need to mash structured and semistructured data with binary data, you can do it quickly with Hadoop.

Fig. 1: SAS provides two options for accessing and operating on data stored in Hadoop's HDFS, which is the primary storage system used by Hadoop applications.

Take the example of a major bank that needed to redact personally identifiable information, a procedure growing in popularity given concerns about financial institutions inadvertently exposing customer names alongside Social Security numbers, addresses or taxpayer IDs.

The bank needed to crawl through 40 terabytes of data and not only redact names (when connected to other identifiers) but also track people on government watch lists, such as the politically exposed persons list. High-performance analytics coupled with Hadoop makes it possible to regularly sweep through data – ensuring nothing gets missed.
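The per-record logic of such a sweep might look roughly like the Python sketch below. It is not the bank's method or any SAS or Hadoop API: the Social Security number pattern, the placeholder text and the function names (`redact_record`, `flag_watch_list`) are all assumptions made for illustration. A name is redacted only when the same record also carries an identifier, mirroring the rule described above.

```python
import re
from typing import List

# Assumed identifier format: U.S. Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_record(text: str, known_names: List[str]) -> str:
    """Redact known names, but only when the record also contains an SSN."""
    if not SSN_PATTERN.search(text):
        return text  # a name with no linked identifier is left untouched
    redacted = text
    for name in known_names:
        # re.escape keeps any punctuation in a name from being read as regex syntax
        redacted = re.sub(re.escape(name), "[REDACTED-NAME]", redacted,
                          flags=re.IGNORECASE)
    return redacted


def flag_watch_list(text: str, watch_list: List[str]) -> List[str]:
    """Return the watch-list names that appear in the record, ignoring case."""
    lowered = text.lower()
    return [name for name in watch_list if name.lower() in lowered]


if __name__ == "__main__":
    print(redact_record("Jane Doe, SSN 123-45-6789", ["Jane Doe"]))
    print(flag_watch_list("Wire transfer to John Roe", ["john roe", "Ann Poe"]))
```

Because each record is checked independently, this kind of function fits naturally into a map step, which is why a job like this can be spread across a Hadoop cluster and rerun regularly over tens of terabytes.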

A few cheap storage caveats
For all the potential, there are some issues with Hadoop. For example, Hadoop is already the de facto standard for analyzing e-commerce blogs. That's a good thing, as it showcases Hadoop's ability to work with large data sets. But the analysis is typically simple – basic recommendations as people browse through a website. Like those gray pants? Here's a blue shirt to match. But these recommendations aren't built on in-depth analysis. They aren't based on factors like age, geographic region and previous purchase behavior. For that, you need analytics on top of a Hadoop-driven database.

Two other issues involve security and enterprise-class management. As an enterprise technology, Hadoop is still green. To deal with security and data management issues, it makes more sense to pair Hadoop with a solution-based access engine that provides a security layer and level of control over the movement of data.

Anything that helps companies manage their growing data is a good thing. Research shows that companies that successfully use data to make decisions do better than those that work on instinctive decision making.

Today, storage technology like Hadoop is all the rage; the real value, however, comes from your ability to harvest meaningful information from the piles of data. The only way to do this is through the application of advanced analytics.

Bio: Michael Ames is the SAS Data Integration Product Manager.

 

Michael Ames, SAS

This story appears in the Third Quarter 2012 issue of