Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Hadoop and MapReduce

This section explains the roles of Apache Hadoop and MapReduce in Amazon Elastic MapReduce (Amazon EMR) and how these two methodologies work together to process data.

What Is Hadoop?

Apache Hadoop is an open-source Java software framework that supports massive data processing across a cluster of servers. Hadoop uses a programming model called MapReduce that divides a large data set into many small fragments. Hadoop distributes a data fragment and a copy of the MapReduce executable to each of the slave nodes in a Hadoop cluster. Each slave node runs the MapReduce executable on its subset of the data. Hadoop then combines the results from all of the nodes into a finished output. Amazon EMR enables you to upload that output into an Amazon S3 bucket you designate.

For more information about Hadoop, go to http://hadoop.apache.org.

What Is MapReduce?

MapReduce is a combination of mapper and reducer executables that work together to process data. The mapper executable processes the raw data into key/value pairs, called intermediate results. The reducer executable combines the intermediate results, applies additional algorithms, and produces the final output, as described in the following process.

MapReduce Process

1 Amazon Elastic MapReduce (Amazon EMR) starts your instances in two security groups: one for the master node and another for the core node and task nodes.
2 Hadoop breaks a data set into multiple sets if the data set is too large to process quickly on a single cluster node.
3

Hadoop distributes the data files and the MapReduce executable to the core and task nodes of the cluster.

Hadoop handles machine failures and manages network communication between the master, core, and task nodes. In this way, developers do not need to know how to perform distributed programming or handle the details of data redundancy and fail over.

4

The mapper function uses an algorithm that you supply to parse the data into key/value pairs. These key/value pairs are passed to the reducer.

As an example, for a job flow that counts the number of times a word appears in a document, the mapper might take each word in a document and assign it a value of 1. Each word is a key in this case, and all values are 1.

5

The reducer function collects the results from all of the mapper functions in the cluster, eliminates redundant keys by combining values of all like keys, then performs the designated operation on all the values for each key, and then outputs the results.

Continuing with the previous example, the reducer takes all of the word counts from all of the mappers functions running in a cluster, adds up the number of times each word was found, and then outputs that result to Amazon S3.


You can write the executables in any programming language. Mapper and reducer applications written in Java are compiled into a JAR file. Executables written in other programming languages use the Hadoop streaming utility to implement the mapper and reducer algorithms.

The mapper executable reads the input from standard input and the reducer outputs data through standard output. By default, each line of input/output represents a record and the first tab on each line of the output separates the key and value.

For more information about MapReduce, go to How Map and Reduce operations are actually carried out (http://wiki.apache.org/hadoop/HadoopMapReduce).

Instance Groups

Amazon EMR runs a managed version of Apache Hadoop, handling the details of creating the cloud-server infrastructure to run the Hadoop cluster. Amazon EMR refers to this cluster as a job flow, and defines the concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task.

Each Amazon EMR job flow includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes.

If the job flow is run on a single node, then that instance is simultaneously a master and a core node. For job flows running on more than one node, one instance is the master node and the remaining are core or task nodes.

For more information about instance groups, see Resizing Running Job Flows.

Master Instance Group

The master instance group manages the job flow: coordinating the distribution of the MapReduce executable and subsets of the raw data, to the core and task instance groups. It also tracks the status of each task performed, and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. For more information, see How to View Logs Using SSH.

As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node.

[Note]Note

The instance controller on the master node uses MySQL. If MySQL becomes unavailable, the instance controller will be unable to launch and manage instances.

Core Instance Group

The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.

The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Because core nodes store data, you can't remove them from a job flow. However, you can add more core nodes to a running job flow. Core nodes run both the DataNodes and TaskTracker Hadoop daemons.

[Caution]Caution

Removing HDFS from a running node runs the risk of losing data.

For more information about core instance groups, see Resizing Running Job Flows.

Task Instance Group

The task instance group contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress.

Task nodes are managed by the master node. While a job flow is running you can increase and decrease the number of task nodes. Because they don't store data and can be added and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to handle peak loads and decreasing it later. Task nodes only run a TaskTracker Hadoop daemon.

For more information about task instance groups, see Resizing Running Job Flows.

Supported Hadoop Versions

Amazon Elastic MapReduce (Amazon EMR) allows you to choose to run either Hadoop version 0.18, Hadoop version 0.20, or Hadoop version 0.20.205.

For more information on Hadoop configuration, see Hadoop Configuration

Supported File Systems

Amazon EMR and Hadoop typically use two or more of the following file systems when processing a job flow:

  • Hadoop Distributed File System (HDFS)

  • Amazon S3 Native File System (S3N)

  • Local file system

  • Legacy Amazon S3 Block File System

HDFS and S3N are the two main file systems used with Amazon EMR

HDFS is a distributed, scalable, and portable file system for Hadoop. An advantage of HDFS is data awareness between the Hadoop cluster nodes managing the job flows and the Hadoop cluster nodes managing the individual steps. For more information on how HDFS works, see http://hadoop.apache.org/hdfs/.

The Amazon S3 Native File System (S3N) is a file system for reading and writing regular files on Amazon S3. The advantage of this file system is that you can access files on Amazon S3 that were written with other tools. For information on how Amazon S3 and Hadoop work together, see http://wiki.apache.org/hadoop/AmazonS3.

The local file system refers to a locally connected disk. When a Hadoop cluster is created, each node is created from an Amazon EC2 instance which comes with a preconfigured block of preattached disk storage called an Amazon EC2 local instance store. Data on instance store volumes persists only during the life of the associated Amazon EC2 instance. The amount of this disk storage varies by Amazon EC2 instance type. It is ideal for temporary storage of information that is continually changing, such as buffers, caches, scratch data, and other temporary content. For more information about Amazon EC2 instances, see Amazon Elastic Compute Cloud.

The Amazon S3 Block File System Files is a legacy file storage system. We strongly discourage the use of this system.

For more information on how to use and configure file systems in Amazon EMR, see File System Configuration.