What it is and why it matters
Developed as an Apache top-level project, Hadoop is an open-source programming framework that allows data to be spread over large clusters of commodity servers and processed in parallel. The software also detects and handles failures, which is critical for distributed processing.
With the release of Hadoop 2.0 in October 2013, there are now three major components:
- HDFS (Hadoop Distributed File System) – a distributed file system that acts as a storage system for big data, both structured and unstructured. Users load files to the file system using simple commands and HDFS takes care of making multiple copies of data blocks and distributing those blocks over multiple nodes in the Hadoop system.
- MapReduce – a parallel programming model for distributed processing of large data sets. The Map phase performs operations such as filtering, transforming and sorting. The Reduce phase takes that output and aggregates it. MapReduce programs are typically written in Java.
- YARN (Yet Another Resource Negotiator) – a general-purpose resource management framework. It handles and schedules resource requests from distributed applications (MapReduce and others) and supervises their execution.
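To make the Map and Reduce phases described above more concrete, here is a minimal single-machine sketch of a word count in Python. It only imitates the programming model. It does not use Hadoop, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration. On a real cluster, the framework would run many map and reduce tasks in parallel on different nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (done by the framework between the two phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["big data big clusters", "data in parallel"]
    print(word_count(sample))
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` looks at only one key, the work can be divided across machines without the tasks needing to coordinate with one another.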
Other components augment HDFS, MapReduce and YARN. Some of these include:
- Pig – a high-level procedural language that helps manipulate data stored in HDFS. It provides a way to do ETL and basic analysis without having to write MapReduce programs.
- Hive – a declarative SQL-like language that presents data in the form of tables. Hive incorporates HiveQL (Hive Query Language) for declaring source tables, target tables, joins and other SQL-like functions that are applied to a file or set of files available in HDFS. Hive programming is similar to database programming.
The advantages of Hadoop – and there are several
- Inexpensive. It uses lower-cost commodity hardware to reliably store large quantities of data.
- Parallel processing power. Its distributed computing model can process very large volumes of data. The more computing nodes you use, the more processing power you have.
- Scalability. You can easily scale your system simply by adding more nodes. This requires very little administration.
- Inherent data redundancy. Data and application processing are protected against hardware failure. If a node goes down, jobs are automatically redirected to other working servers. And HDFS keeps multiple copies of each data block (three by default) on different nodes in the cluster.
- Storage flexibility. Unlike traditional relational databases, you don’t have to preprocess data before storing it. And you can easily store unstructured data.
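The redundancy described above, where HDFS splits files into blocks and copies each block to several nodes, can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not how HDFS is implemented. The helper names (`split_into_blocks`, `place_replicas`, `readable_blocks`), the node names and the tiny block size are all hypothetical, and the round-robin placement stands in for the rack-aware placement policy a real cluster uses.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    placement = place_replicas(len(blocks), nodes)
    # With three replicas per block, losing a single node leaves every block readable.
    print(readable_blocks(placement, failed_nodes={"node2"}))
```

In this model a block becomes unreadable only when every node holding one of its replicas has failed, which is why a higher replication count tolerates more simultaneous failures at the cost of more storage.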
Big Data, Hadoop and SAS
Combining the benefits of Hadoop with the business analytics power of SAS helps your data scientists transform big data into big knowledge. But big data efforts aren’t confined to just accessing information. That’s why SAS products and services create seamless, transparent access to the Pig and Hive languages and the MapReduce framework.
The SAS environment also provides a visual, interactive Hadoop experience, making it easier to gain insights and discover trends. Powerful analytical algorithms help extract valuable information from the data, while in-memory technology lets you process all of this data faster. And with integrated, automated deployment of analytical models, you can score data directly in Hadoop for faster results.
How SAS Can Help
SAS support for Hadoop spans the entire data-to-decision process and centers on a singular goal – helping you know more, faster, so you can make better decisions. A new interactive programming environment (coming July 2014) will let multiple users concurrently manage data, transform variables, perform exploratory analysis, build and compare models, and score – with virtually no limits on the size of the data stored in Hadoop.
- Easily access and use data stored in Hadoop. SAS/ACCESS software provides fast, efficient access to data stored in Hadoop via HiveQL. You can access Hive tables as if they were native SAS data sets. Then apply text mining and predictive analytics to the data to gain and share new insights.
- Maximize Hadoop’s distributed processing capabilities. Your SAS programmers can submit MapReduce, scripting and HDFS commands from within Base SAS. SAS also supports external file references, allowing you to conveniently find and use Hadoop files from any SAS product.
- Better manage data stored in Hadoop. One issue plaguing Hadoop implementations is the lack of – or immaturity of – tools for managing deployments. SAS Data Management Advanced provides an intuitive GUI so you can easily build job flows that use Pig, MapReduce and HDFS commands and Hive queries. SAS Data Management streamlines Pig and MapReduce code generation through visual editing tools and a built-in syntax checker. You also get the added advantages of metadata management, data lineage and security features.
Explore and Visualize
- Quickly visualize your data stored in Hadoop, discover new patterns and publish reports. To get value from vast and diverse data stored in Hadoop, organizations often start with data exploration. Now you can explore all types of data stored in Hadoop in an interactive and very visual way. SAS Visual Analytics is an in-memory solution that can help identify relevant variables, trends and relationships that weren’t evident before. It helps you identify opportunities for further analysis and share results via Web reports or mobile devices.
Analyze and Model
- Apply domain-specific high-performance analytics to data stored in Hadoop. SAS High-Performance Analytics products provide in-memory capabilities that let you develop analytical models using all data, not just a subset. These products quickly load large amounts of data from HDFS into memory, where threaded analytical algorithms process the collocated data. Run frequent modeling iterations. And use sophisticated analytics to get answers to questions you never thought of – or had time to ask.
- Uncover patterns and trends in Hadoop data with an interactive and visual environment for analytics. SAS Visual Statistics (coming July 2014) enables multiple users to concurrently solve complex problems and identify new opportunities by uncovering patterns and trends faster than ever. This interactive, non-programming environment will provide access to powerful predictive modeling and machine learning techniques. So you can base your decisions on fact-based insights derived from all of your data.
Deploy and Execute
- Automatically deploy and score analytic models in the parallel environment. SAS Scoring Accelerator for Cloudera automates model deployment inside the Hadoop Distributed File System and allows you to score new data in Hadoop without moving the data. This speeds ad hoc modeling and scoring of new data for faster results.
Why You Need SAS and Hadoop
Comprehensive support for Hadoop.
SAS/ACCESS not only retrieves big data stored in HDFS, but also allows you to incorporate and use other capabilities, such as the Pig and Hive languages and the MapReduce framework.
Because SAS is focused on analytics, not storage, we offer a flexible approach to choosing hardware and database vendors. We work with you to deploy the right mix of technologies, including the ability to deploy Hadoop with other data warehouse technologies.
Complete data-to-decision support.
SAS supports the entire analytics life cycle, from data preparation and exploration to model development, production deployment and monitoring.
Transparent, collaborative, interactive and iterative.
SAS enables you to analyze large, diverse and complex data sets in Hadoop within a single environment – instead of using a mix of languages and products from different vendors.