How It Works and a Hadoop Glossary
Currently, four core modules are included in the basic framework from the Apache Software Foundation:
Hadoop Common – the libraries and utilities used by other Hadoop modules.
Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without requiring prior organization (a brief read/write sketch follows this list).
YARN – (Yet Another Resource Negotiator) provides resource management for the processes running on Hadoop.
MapReduce – a parallel processing software framework. It comprises two steps. In the map step, a master node takes the input, partitions it into smaller subproblems and distributes them to worker nodes. In the reduce step, the master node collects the answers to all of the subproblems and combines them to produce the output (a word-count sketch also follows this list).
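To make the HDFS entry concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. It assumes a reachable HDFS cluster whose address is picked up from the standard core-site.xml on the classpath; the path /user/demo/hello.txt is a hypothetical example location, not anything mandated by Hadoop.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and other settings from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical example path.
        Path path = new Path("/user/demo/hello.txt");

        // Write a file; HDFS transparently splits it into blocks and
        // replicates them across the machines in the cluster.
        try (FSDataOutputStream out = fs.create(path, true)) {
          out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back; the client fetches blocks from whichever
        // nodes currently hold them.
        try (BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
          System.out.println(in.readLine());
        }
      }
    }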
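The canonical illustration of the map and reduce steps is word counting: the map step emits a (word, 1) pair for every word it sees, and the reduce step sums those pairs for each word. The sketch below is a version of the word-count example from the standard Hadoop Java MapReduce tutorial; input and output paths are supplied as command-line arguments.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: split each input line into words and emit (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce step: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, it would be launched with something like "hadoop jar wordcount.jar WordCount /input /output", where the paths are illustrative HDFS directories.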
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include Hive, Pig, HBase, Spark, ZooKeeper, Oozie, Sqoop and Ambari.
Commercial Hadoop distributions
Open-source software is created and maintained by a network of developers from around the world. It's free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available (these are often called "distros"). With distributions from software vendors, you pay for their version of the Hadoop framework and receive additional capabilities related to security, governance, SQL and management/administration consoles, as well as training, documentation and other services. Popular distros include Cloudera, Hortonworks, MapR, IBM BigInsights and PivotalHD.
Big Data, Hadoop and SAS
SAS support for big data implementations, including Hadoop, centers on a singular goal – helping you know more, faster, so you can make better decisions. Regardless of how you use the technology, every project should go through an iterative and continuous improvement cycle that includes data preparation and management, data visualization and exploration, analytical model development, and model deployment and monitoring, so you can derive insights and quickly turn your big Hadoop data into bigger opportunities.
Because SAS is focused on analytics, not storage, we offer a flexible approach to choosing hardware and database vendors. We can help you deploy the right mix of technologies, including Hadoop and other data warehouse technologies.
And remember, the success of any project is determined by the value it brings. So metrics built around revenue generation, margins, risk reduction and process improvements will help pilot projects gain wider acceptance and garner more interest from other departments. We've found that many organizations are looking at how they can implement a project or two in Hadoop, with plans to add more in the future.
More on SAS and Hadoop