Three scenarios where Hadoop can help – even if you don’t have big data
By Tamara Dull, Director of Emerging Technologies at SAS
Hadoop was originally developed to address the big data needs of Web and media companies, but today it’s being used around the world to address a wider set of data needs, big and small, in practically every industry. When the Apache Hadoop project was initially released, it had two primary components:
1. A storage component called HDFS (Hadoop Distributed File System) that works on low-cost, commodity hardware.
2. A resource management and processing component called MapReduce.
Although MapReduce can process large data sets much faster than more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently. With the recent release of Hadoop 2.0, however, the resource management functionality has been split out of MapReduce into a separate component called YARN, so that MapReduce can stay focused on what it does best: processing data.
Keeping these two Hadoop components in mind, HDFS and MapReduce, let’s take a quick look at how Hadoop addresses three business scenarios that aren’t necessarily related to big data:
1. Data staging: Corporate data is growing, and it’s going to grow even faster. It’s just getting too expensive to extend and maintain a data warehouse.
2. Data processing: Organizations are having so much trouble processing and analyzing normal data that they can’t even think about dealing with big data.
3. Data archiving: Businesses must keep their data for seven years for compliance reasons, but would like to store and analyze decades of data – without breaking the bank (or the server).
Do any of these scenarios ring a bell? If so, Hadoop may be able to help.
Start with the first scenario, data staging. Today, many organizations have a traditional data warehouse setup that looks something like this:
- Application data, such as ERP or CRM, is captured in one or more relational databases.
- ETL tools then extract, transform and load this data into a data warehouse ecosystem (EDW, data marts, operational data stores, analytic sandboxes, etc.).
- Users then interact with the data warehouse ecosystem via BI and analytical tools.
What if you used Hadoop to handle your ETL processing? You could write MapReduce jobs to load the application data into HDFS, transform it and then send the transformed data to the data warehouse. The bonus? Because of the low cost of Hadoop storage, you could store both versions of the data in HDFS: the “before” application data and the “after” transformed data. Your data would all be in one place, making it easier to manage, reprocess, and possibly analyze at a later date.
This particular Hadoop scenario was quite popular early on. Some went so far as to call Hadoop an “ETL killer,” a label that put ETL vendors at risk and on the defensive. Fortunately, many of these vendors quickly responded with new HDFS connectors, making it easier for organizations to leverage their ETL investments in this new Hadoop world.
This strategy is a good alternative if you’re experiencing rapid application data growth or you’re having trouble getting all your ETL jobs to finish in a timely manner. Consider handing off some of this work to Hadoop – using your ETL vendor’s Hadoop/HDFS connector or MapReduce – to get ahead of your data, not behind it.
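The staging flow described above can be sketched in a few lines. This is a local, in-memory imitation of the map, shuffle and reduce phases, not the Hadoop API: the sample rows, the `customer_id,region,amount` layout and the function names (`map_clean`, `reduce_sum`, `run_job`) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw application export, one "customer_id,region,amount" per line.
RAW_ROWS = [
    "c1, east ,100.50",
    "c2,WEST,20",
    "c1,East,9.50",
    "bad row",
]


def map_clean(line):
    """Map step: parse one raw line and emit ((customer, region), amount)."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return  # malformed rows are skipped; a real job might log them
    customer_id, region, amount = parts
    try:
        value = float(amount)
    except ValueError:
        return  # non-numeric amounts are skipped as well
    yield (customer_id, region.lower()), value


def reduce_sum(key, values):
    """Reduce step: total the amounts collected for one key."""
    return key, sum(values)


def run_job(rows):
    # Shuffle: group mapped values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for line in rows:
        for key, value in map_clean(line):
            grouped[key].append(value)
    return dict(reduce_sum(k, v) for k, v in sorted(grouped.items()))


staged_before = list(RAW_ROWS)    # the "before" copy that would stay in HDFS
staged_after = run_job(RAW_ROWS)  # the "after" copy bound for the warehouse
print(staged_after)
```

Keeping both `staged_before` and `staged_after` mirrors the idea of storing the raw and the transformed data side by side, so the transformation can be rerun later if the rules change.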
For the second scenario, data processing, here is a simple example from a Facebook presentation a few years ago:
Instead of using costly data warehouse resources to update data in the warehouse, why not send the necessary data to Hadoop, let MapReduce do its thing, and then send the updated data back to the warehouse? The example Facebook used was updating users’ mutual friends lists on a regular basis. As you can imagine, this would be a resource-intensive process involving a lot of data – a job that is easily handled by Hadoop.
This example applies not only to the processing of data stored in your data warehouse, but also to data in any of your operational or analytical systems. Take advantage of Hadoop’s low-cost processing power so that your relational systems are freed up to do what they do best.
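To make the mutual-friends job more concrete, here is a minimal sketch of how such a computation is commonly expressed in map and reduce steps. It runs locally in plain Python rather than on Hadoop, it is not taken from the Facebook presentation, and the sample friend lists and function names are hypothetical. It also assumes friendships are symmetric, so every pair of friends receives exactly two friend lists.

```python
from collections import defaultdict

# Hypothetical, symmetric friend lists; the names are illustrative only.
FRIENDS = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "cat"},
}


def map_pairs(person, friends):
    """Map step: for each friendship, emit (sorted pair, this person's friends)."""
    for friend in friends:
        pair = tuple(sorted((person, friend)))
        yield pair, friends


def reduce_mutual(pair, friend_sets):
    """Reduce step: the mutual friends are the overlap of the two friend lists."""
    return pair, set.intersection(*friend_sets)


def mutual_friends(graph):
    # Shuffle: collect the friend lists emitted for each pair of friends.
    grouped = defaultdict(list)
    for person, friends in graph.items():
        for pair, friend_set in map_pairs(person, friends):
            grouped[pair].append(friend_set)
    return dict(reduce_mutual(p, sets) for p, sets in grouped.items())


result = mutual_friends(FRIENDS)
for pair in sorted(result):
    print(pair, sorted(result[pair]))
```

Because each pair is processed independently in the reduce step, the work can be spread across many machines, which is what makes this kind of job a natural candidate to move off the warehouse.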
This third scenario, data archiving, is very common and pretty straightforward. Since Hadoop runs on commodity hardware that scales easily and quickly, organizations can now store and archive a lot more data at a much lower cost.
For example, what if you didn’t have to destroy data after its regulatory life to save on storage costs? What if you could easily and cost-effectively keep all your data? Or maybe it’s not just about keeping the data on hand, but rather, a need to analyze more data. Why limit your analysis to the last three, five or seven years when you can easily store and analyze decades of data? Isn’t this a data geek’s paradise?
The bottom line
Don’t fall into the trap of believing that Hadoop is a big-data-only solution. It’s much more than that. Hadoop is a powerful open source technology that is fully capable of supporting and managing one of your organization’s greatest assets: your data. Hadoop is ready for the challenge. Are you?
- Download the white paper: A Non-Geek’s Big Data Playbook: Hadoop and the Enterprise Data Warehouse
Want to get more from Hadoop?
SAS® In-Memory Statistics for Hadoop provides an interactive programming environment that gives you access to powerful statistical and machine learning techniques. The technology helps your data scientists manage data, transform variables, explore data and build and score models.
Why you need SAS® and Hadoop
- Comprehensive support for Hadoop. SAS/ACCESS® not only retrieves big data stored in HDFS, but also allows you to incorporate and use other capabilities, such as the Pig and Hive languages and the MapReduce framework.
- Flexible architecture. Because SAS is focused on analytics, not storage, we offer a flexible approach to choosing hardware and database vendors. We work with our customers to deploy the right mix of technologies, including the ability to deploy Hadoop with other data warehouse technologies.
- Complete data-to-decision support. SAS supports the entire analytics life cycle, from data preparation and exploration to model development, production deployment and monitoring.
- Transparent, collaborative, interactive and iterative. SAS enables you to analyze large, diverse and complex data sets in Hadoop within a single environment – instead of using a mix of languages and products from different vendors.
Read more: sas.com/hadoop