Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Architectural Overview of Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) works in conjunction with Amazon EC2 to create a Hadoop cluster, and with Amazon S3 to store scripts, input data, log files, and output results. The Amazon EMR process is outlined in the following figure and table.

Amazon EMR Process

1 Upload the data you want to process, along with the mapper and reducer executables that process it, to Amazon S3, and then send a request to Amazon EMR to start a job flow.
2 Amazon EMR starts a Hadoop cluster, which runs any specified bootstrap actions and then starts Hadoop on each node.
3 Hadoop executes a job flow by downloading data from Amazon S3 to core and task nodes. Alternatively, the data is loaded dynamically at run time by mapper tasks.
4 Hadoop processes the data and then uploads the results from the cluster to Amazon S3.
5 The job flow is completed and you retrieve the processed data from Amazon S3.
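
To make the mapper and reducer roles in the steps above more concrete, the following is a minimal local sketch of a word count. It assumes a Hadoop streaming style contract, in which the mapper emits tab-separated key/value lines and the reducer receives them sorted by key. The function names (map_lines, reduce_records, run_locally) are hypothetical and are not part of Amazon EMR or Hadoop. The sketch runs entirely in memory and does not contact Amazon S3 or Amazon EMR.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper (hypothetical name): emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_records(records):
    """Reducer (hypothetical name): sum the counts for each word.

    Assumes the records arrive sorted by key, as they would after
    Hadoop's sort phase between map and reduce.
    """
    parsed = (record.split("\t", 1) for record in records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Local stand-in for steps 3 and 4: map, sort by key, then reduce."""
    return list(reduce_records(sorted(map_lines(lines))))


if __name__ == "__main__":
    # In a real job flow the input would come from Amazon S3 (step 1)
    # and the output would be written back to Amazon S3 (step 4).
    for record in run_locally(["the quick fox", "The lazy dog"]):
        print(record)
```

In an actual job flow, the mapper and reducer would typically be separate executables that read from standard input and write to standard output. Keeping them as plain functions here only makes the data flow easier to follow and to test before anything is uploaded.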

For details on mapping legacy job flows to instance groups, see Mapping Legacy Job Flows to Instance Groups.