Busting 10 myths about Hadoop
By Philip Russom, TDWI
Although Hadoop and related technologies such as MapReduce have been with us for more than five years now, most BI professionals and their business counterparts still harbor misconceptions about them that need correcting.
The following list of 10 facts will clarify what Hadoop is and does relative to BI/DW, as well as in which business and technology situations Hadoop-based business intelligence (BI), data warehousing (DW), data integration (DI), and analytics can be useful.
Fact No. 1: Hadoop consists of multiple products
The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, HCatalog, Ambari, Mahout, Flume, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with Pig, Hive, and HBase) constitute a useful technology stack for applications in BI, DW, DI, and analytics. More Hadoop projects are coming that will apply to BI/DW, including Impala, which is a much-needed SQL engine for low-latency data access to HDFS and Hive data.
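The core pattern MapReduce executes over HDFS files can be illustrated at toy scale. The sketch below is a local Python simulation of the map, shuffle, and reduce phases, with the classic word-count example; it is an illustration of the programming model, not actual Hadoop code.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) pairs for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key (here, sum counts)."""
    return key, sum(values)

def word_count(lines):
    # Shuffle: group mapped pairs by key, as the framework does
    # between the map and reduce phases across cluster nodes.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = word_count(["big data", "big analytics"])
print(counts)  # → {'big': 2, 'data': 1, 'analytics': 1}
```

In Hadoop, the same map and reduce functions run in parallel across the nodes holding the HDFS blocks, which is what makes the pattern scale to massive files.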
Fact No. 2: Hadoop is open source but available from vendors, too
Fact No. 3: Hadoop is an ecosystem, not a single product
Fact No. 4: HDFS is a file system, not a database management system (DBMS)
That's OK, because HDFS does things DBMSs do not do as well, such as managing and processing massive volumes of file-based, unstructured data. For minimal DBMS functionality, users can layer HBase over HDFS and layer a query framework such as Hive or SQL-based Impala over HDFS or HBase.
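The file-system-versus-DBMS distinction comes down to access patterns: a file system supports whole-file scans, while a keyed store layered on top (HBase-style) adds random access by row key. A minimal Python sketch of the two patterns, using hypothetical user-event records rather than the actual Hadoop or HBase APIs:

```python
# Flat-file access (HDFS-style): records are just lines in a file,
# so finding one means scanning them all.
log_lines = [
    "u001\tlogin",
    "u002\tpurchase",
    "u003\tlogin",
]

def scan_for_user(lines, user_id):
    """O(n) scan: the only way to locate a record in an unindexed file."""
    return [line for line in lines if line.startswith(user_id + "\t")]

# Keyed access (HBase-style): the same data indexed by row key.
table = {line.split("\t")[0]: line.split("\t")[1] for line in log_lines}

def get_user(table, user_id):
    """O(1) lookup: what a keyed store layered over the files provides."""
    return table.get(user_id)

print(scan_for_user(log_lines, "u002"))  # → ['u002\tpurchase']
print(get_user(table, "u002"))           # → 'purchase'
```

Query frameworks such as Hive and Impala sit above either layer and translate SQL-like statements into these underlying scans and lookups.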
Fact No. 5: Hive resembles SQL but is not standard SQL
Fact No. 6: Hadoop and MapReduce are related but don't require each other
Fact No. 7: MapReduce provides control for analytics, not analytics per se
Fact No. 8: Hadoop is about data diversity, not just data volume
Fact No. 9: Hadoop complements a DW; it's rarely a replacement
Furthermore, Hadoop can enable certain pieces of a modern DW architecture, such as massive data staging areas, archives for detailed source data, and analytic sandboxes. Some early adopters offload as many workloads as they can to HDFS and other Hadoop technologies because they are less expensive than the average DW platform. The result is that DW resources are freed for the workloads at which they excel.
Fact No. 10: Hadoop enables many types of analytics, not just Web analytics
* Excerpted from the TDWI Best Practices Report Integrating Hadoop into Business Intelligence and Data Warehousing, Q2 2013. © 2013 by TDWI (The Data Warehousing Institute™), a division of 1105 Media Inc. Reprinted with permission. Visit tdwi.org for more information.
What is SAS® doing with Hadoop?
* With SAS/ACCESS® to Hadoop, you can connect SAS to Hive and HiveServer2 as well as manage and manipulate data via SAS procedures and HiveQL.
* SAS has developed specialized Hadoop (Hive, Pig, MapReduce, and HDFS) transformations to manage, load, transform, and prepare data within Hadoop.
* SAS Visual Analytics can explore data stored in Hadoop and take advantage of the distributed processing capabilities of a Hadoop cluster for exploratory analysis.
* SAS high-performance analytic procedures can also be run on a Hadoop cluster.
This story appears in the Fourth Quarter 2013 issue of