Busting 10 myths about Hadoop
By Philip Russom, TDWI
Although Hadoop and related technologies such as MapReduce have been with us for more than five years now, most BI professionals and their business counterparts still harbor a few misconceptions that need correcting.
The following list of 10 facts will clarify what Hadoop is and does relative to BI/DW, as well as in which business and technology situations Hadoop-based business intelligence (BI), data warehousing (DW), data integration (DI), and analytics can be useful.
Fact No. 1: Hadoop consists of multiple products
We talk about Hadoop as if it's one monolithic thing, but it's actually a family of open-source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.)
The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL engine for low-latency data access to HDFS and Hive data.
Fact No. 2: Hadoop is open source but available from vendors, too
Apache Hadoop's open-source software library is available from ASF at apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools, maintenance, and technical support. A handful of vendors offer their own non-Hadoop-based implementations of MapReduce.
Fact No. 3: Hadoop is an ecosystem, not a single product
In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products (e.g., database management systems and tools for analytics, reporting, and DI) that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
Fact No. 4: HDFS is a file system, not a database management system (DBMS)
HDFS is a distributed file system, not a database management system, so it lacks capabilities we associate with a DBMS, such as indexing, random access to data, support for standard SQL, and query optimization.
That's OK, because HDFS does things DBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.
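The practical difference is easy to see in miniature. The sketch below is purely illustrative (a Python list of text lines stands in for a file in HDFS, and a dict stands in for a DBMS index; all names are hypothetical): without an index, finding one record means scanning the whole file, which is exactly the trade-off a layer such as HBase addresses.

```python
# Illustrative only: a list of comma-delimited lines stands in for a
# file stored in HDFS; a dict stands in for a DBMS index.
records = [
    "101,alice,engineering",
    "102,bob,marketing",
    "103,carol,finance",
]

def scan_lookup(lines, key):
    """File-system style: no index, so find a record by scanning every line."""
    for line in lines:
        if line.split(",")[0] == key:
            return line
    return None

# DBMS style: build an index once, then answer point lookups directly.
index = {line.split(",")[0]: line for line in records}

assert scan_lookup(records, "102") == index["102"]
```

The scan is fine (and parallelizes well) when you touch every record anyway, which is why HDFS suits batch analytics; the index wins when you need one record fast, which is why users add HBase or a SQL engine on top.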
Fact No. 5: Hive resembles SQL but is not standard SQL
Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand code Hive, but that doesn't solve compatibility issues with SQL-based tools. TDWI believes that over time, Hadoop products will support standard SQL and SQL-based vendor tools will support Hadoop, so this issue will eventually be moot.
Fact No. 6: Hadoop and MapReduce are related but don't require each other
Some variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some relational DBMSs. Some users deploy HDFS with Hive or HBase, but not MapReduce.
Fact No. 7: MapReduce provides control for analytics, not analytics per se
MapReduce handles the complexities of network communication, parallel programming, and fault tolerance for a wide variety of hand-coded logic and other applications - not just analytics.
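To make the division of labor concrete, here is a single-process Python sketch of the MapReduce programming model applied to a word count. This is illustrative only: a real Hadoop job would express the map and reduce functions against the Hadoop API and the framework would run them in parallel across a cluster, handling the shuffle, network communication, and fault tolerance shown here as plain function calls.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (key, value) pairs -- here, (word, 1) for each word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key -- here, summing the counts."""
    return key, sum(values)

lines = ["big data big deal", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"big": 3, "data": 2, "deal": 1}
```

Note that the analytic logic itself (counting, or segmentation, or scoring) lives entirely in the user-supplied map and reduce functions; MapReduce merely orchestrates them, which is the point of Fact No. 7.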
Fact No. 8: Hadoop is about data diversity, not just data volume
Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it's largely true, and it's exactly what brings many users to Apache HDFS and related Hadoop products. After all, many types of big data that require analysis are inherently file based, such as Web logs, XML files, and personal productivity documents.
Fact No. 9: Hadoop complements a DW; it's rarely a replacement
Most organizations have designed their DWs for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multistructured data types most DWs simply weren't designed for.
Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adopters offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads at which they excel.
Fact No. 10: Hadoop enables many types of analytics, not just Web analytics
Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data, but other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples - such as customer base segmentation, fraud detection, and risk analysis - can benefit from the additional big data managed by Hadoop. Likewise, Hadoop's additional data can expand 360-degree views to create a more complete and granular view of customers, financials, partners, and other business entities.
* Excerpted from the TDWI Best Practices Report Integrating Hadoop into Business Intelligence and Data Warehousing, Q2 2013. © 2013 by TDWI (The Data Warehousing Institute™), a division of 1105 Media Inc. Reprinted with permission. Visit tdwi.org for more information.
Bio: Philip Russom is the Research Director for Data Management at The Data Warehousing Institute (TDWI). He has been an industry analyst at Forrester Research, Giga Information Group and Hurwitz Group, specializing in BI issues.
What is SAS® doing with Hadoop?
- Out of the box, SAS can read and write data to and from Hadoop, as well as execute MapReduce programs.
- With SAS/ACCESS® to Hadoop, you can connect SAS to Hive and HiveServer2 as well as manage and manipulate data via SAS procedures and HiveQL.
- SAS has developed specialized Hadoop (Hive, Pig, MapReduce, and HDFS) transformations to manage, load, transform, and prepare data within Hadoop.
- SAS Visual Analytics can explore data stored in Hadoop and take advantage of the distributed processing capabilities of a Hadoop cluster for exploratory analysis.
- SAS high-performance analytic procedures can also be run on a Hadoop cluster.