Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Overview of Amazon EMR

Amazon Elastic MapReduce (Amazon EMR) is a data analysis tool that simplifies the setup and management of a computer cluster, the source data, and the computational tools that help you implement sophisticated data processing jobs quickly.

Typically, data processing involves performing a series of relatively simple operations on large amounts of data. In Amazon EMR, each operation is called a step and a sequence of steps is a job flow. A job flow that processes encrypted data might look like the following example.

Step 1 Decrypt data
Step 2 Process data
Step 3 Encrypt data
Step 4 Save data
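
The four steps above can be pictured as functions applied in order, each receiving the output of the previous one. The following Python sketch is only an illustration of that idea and does not use the Amazon EMR API. Every name in it (run_job_flow, decrypt, process, encrypt, save, KEY) is hypothetical, and the XOR "cipher" is a toy stand-in for real encryption.

```python
from typing import Callable, List

# Hypothetical toy key; XOR is used only so the example is self-contained.
KEY = 0x2A

def decrypt(data: bytes) -> bytes:
    # Step 1: XOR with the key reverses the toy encryption.
    return bytes(b ^ KEY for b in data)

def process(data: bytes) -> bytes:
    # Step 2: a placeholder transformation (uppercase ASCII letters).
    return data.upper()

def encrypt(data: bytes) -> bytes:
    # Step 3: XOR with the same key re-encrypts the data.
    return bytes(b ^ KEY for b in data)

# Hypothetical in-memory "storage" standing in for a real data store.
saved: List[bytes] = []

def save(data: bytes) -> bytes:
    # Step 4: record the result and pass it through unchanged.
    saved.append(data)
    return data

def run_job_flow(steps: List[Callable[[bytes], bytes]], data: bytes) -> bytes:
    # A job flow is an ordered sequence of steps; each step's output
    # becomes the next step's input.
    for step in steps:
        data = step(data)
    return data

job_flow = [decrypt, process, encrypt, save]
result = run_job_flow(job_flow, encrypt(b"hello"))
```

In this sketch the job flow is an ordinary list, so reordering, adding, or removing a step only changes the list and not the code that runs it.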

Amazon EMR uses Hadoop to divide up the work among the instances in the cluster, track status, and combine the individual results into one output. For an overview of Hadoop, see What Is Hadoop?.
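
The divide-and-combine pattern can be sketched in a few lines of plain Python. This is a single-process illustration under simplifying assumptions, not Hadoop or Amazon EMR code: the "instances" are just list slices, and the function names (split_work, count_words, combine) and the sample records are hypothetical.

```python
from collections import Counter
from typing import List

def split_work(records: List[str], workers: int) -> List[List[str]]:
    # Divide the input into one chunk per worker (round-robin).
    return [records[i::workers] for i in range(workers)]

def count_words(chunk: List[str]) -> Counter:
    # Work done independently on each chunk: count word occurrences.
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.split())
    return counts

def combine(partials: List[Counter]) -> Counter:
    # Merge the individual results into one output.
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total

records = ["a b a", "b c", "a c c"]          # made-up sample input
chunks = split_work(records, workers=2)      # divide the work
partials = [count_words(c) for c in chunks]  # process each chunk separately
total = combine(partials)                    # combine into one result
```

Because each chunk is counted without reference to the others, the per-chunk work could run on separate machines; only the final merge needs to see all partial results.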

Amazon EMR takes care of provisioning a Hadoop cluster, running the job flow, terminating the job flow, moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. Amazon EMR removes most of the cumbersome details of running a Hadoop cluster: setting up the hardware and networking, monitoring the setup, configuring Hadoop, and executing the job flow. Together, Amazon EMR and Hadoop provide all of the power of Hadoop processing with the ease, low cost, and scalability that Amazon S3 and Amazon EC2 offer.