This section describes AWS concepts and terminology you need to understand to use Amazon Elastic MapReduce (Amazon EMR) effectively.
The following sections describe Amazon EC2 features used by Amazon EMR.
Amazon EMR enables you to choose the number and kind of Amazon EC2 instances that make up the cluster that processes your job flow. Amazon EC2 offers four basic instance types.
Standard—You can use Amazon EC2 standard instances for most applications.
High-CPU—These instances have proportionally more CPU resources than memory (RAM) for compute-intensive applications.
High-Memory—These instances offer large memory sizes for high throughput applications, including database and memory caching applications.
Cluster Compute—These instances provide proportionally high CPU resources with increased network performance. They are well suited for demanding network-bound applications.
Note: Amazon EMR does not support micro instances at this time.
The following table describes all the instance types that Amazon EMR supports.
| Instance Type | RAM (GB) | Compute Units | Disk Drive (GB) | Platform (bits) | I/O Performance | Name |
|---|---|---|---|---|---|---|
| Small (default) | 1.7 | 1 | 160 | 32 | Moderate | m1.small |
| Large | 7.5 | 4 | 850 | 64 | High | m1.large |
| Extra Large | 15 | 8 | 1690 | 64 | High | m1.xlarge |
| High-CPU Medium | 1.7 | 5 | 350 | 32 | Moderate | c1.medium |
| High-CPU Extra Large | 7 | 20 | 1690 | 64 | High | c1.xlarge |
| High-Memory Extra Large | 17.1 | 6.5 | 420 | 64 | Moderate | m2.xlarge |
| High-Memory Double Extra Large | 34.2 | 13 | 850 | 64 | Moderate | m2.2xlarge |
| High-Memory Quadruple Extra Large | 68.4 | 26 | 1690 | 64 | High | m2.4xlarge |
| Cluster Compute Quadruple Extra Large Instance* | 23 | 33.5 | 1690 | 64 | Very High (10 Gigabit Ethernet) | cc1.4xlarge |
| Cluster Compute Eight Extra Large* | 60.5 | 88 | 3370 | 64 | Very High (10 Gigabit Ethernet) | cc2.8xlarge |
| Cluster GPU Instance* | 23** | 33.5 | 1690 | 64 | Very High (10 Gigabit Ethernet) | cg1.4xlarge |
*Cluster Compute Instances are available only in the US-East (Northern Virginia) Region.
**Cluster GPU instances have 22 GB, with 1 GB reserved for GPU operation.
The practical limit on the amount of data you can process depends on the number and type of Amazon EC2 instances selected as your cluster nodes, and on the size of your intermediate and final data, because the input, intermediate, and output data sets all reside on the cluster nodes while your job flow runs. For example, a 20-node cluster of Extra Large instances can hold approximately 34 TB of data (20 instances × 1.69 TB of hard disk per Amazon EC2 instance = 33.8 TB).
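The capacity arithmetic can be sketched in a few lines of Python. This is an illustrative sketch only: the disk sizes are copied from the instance table above, the helper name `cluster_disk_tb` is hypothetical, and 1 TB is assumed to equal 1000 GB.

```python
# Local disk per instance in GB, copied from the instance table above.
DISK_GB = {
    "m1.small": 160,
    "m1.large": 850,
    "m1.xlarge": 1690,
    "c1.medium": 350,
    "c1.xlarge": 1690,
    "m2.xlarge": 420,
    "m2.2xlarge": 850,
    "m2.4xlarge": 1690,
    "cc1.4xlarge": 1690,
    "cc2.8xlarge": 3370,
    "cg1.4xlarge": 1690,
}


def cluster_disk_tb(instance_type: str, node_count: int) -> float:
    """Return the total local disk of the cluster in TB (assuming 1 TB = 1000 GB)."""
    return DISK_GB[instance_type] * node_count / 1000


# 20 Extra Large (m1.xlarge) nodes, as in the example above.
print(cluster_disk_tb("m1.xlarge", 20))  # 33.8
```

The result is an upper bound on the combined size of input, intermediate, and output data, not on the input alone.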
The default maximum number of Amazon EC2 instances you can specify is 20. If you need more instances, you can make a formal request. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.
Reserved Instances provide guaranteed capacity and are an additional Amazon EC2 pricing option. You make a one-time payment for an instance to reserve capacity and reduce hourly usage charges. Reserved Instances complement existing Amazon EC2 On-Demand Instances and provide an option to reduce computing costs. As with On-Demand Instances, you pay only for the compute capacity that you actually consume, and if you don't use an instance, you don't pay usage charges for it.
To use a Reserved Instance with Amazon EMR, launch your job flow in the same Availability Zone as your Reserved Instance. For example, let's say you purchase one m1.small Reserved Instance in US-East. If you launch a job flow that uses two m1.small instances in the same Availability Zone in Region US-East, one instance is billed at the Reserved Instance rate and the other is billed at the On-Demand rate. If you have a sufficient number of available Reserved Instances for the total number of instances you want to launch, you are guaranteed capacity. Your Reserved Instances are used before any On-Demand Instances are created.
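The billing split in this example can be expressed as a small calculation. The sketch below is an assumption-laden illustration, not an AWS billing API: the function name `split_billing` is hypothetical, and it assumes every launched instance matches the instance type and Availability Zone of the available Reserved Instances.

```python
from typing import Tuple


def split_billing(launched: int, reserved_available: int) -> Tuple[int, int]:
    """Return (instances billed at the Reserved rate, instances billed On-Demand).

    Reserved Instances are used first; any remainder is billed On-Demand.
    Assumes matching instance type and Availability Zone.
    """
    reserved = min(launched, reserved_available)
    return reserved, launched - reserved


# Two m1.small instances launched with one m1.small Reserved Instance purchased.
print(split_billing(2, 1))  # (1, 1)
```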
You can use Reserved Instances through the Amazon EMR console, the command line interface (CLI), Amazon EMR API actions, or the AWS SDKs.
Elastic IP addresses are static IP addresses designed for dynamic cloud computing. An Elastic IP address is associated with your account, not a particular instance. You control the addresses associated with your account until you choose to explicitly release them.
You can associate one Elastic IP address with only one job flow at a time. To ensure our customers are efficiently using Elastic IP addresses, we impose a small hourly charge when IP addresses associated with your account are not mapped to a job flow or Amazon EC2 instance. When Elastic IP addresses are mapped to an instance, they are free of charge.
For more information about enabling Elastic IP addresses with Amazon EMR, see Using Elastic IP Addresses. For more information about using IP addresses in AWS, go to the Using Elastic IP Addresses section in the Amazon Elastic Compute Cloud User Guide.
When Amazon EMR starts an Amazon EC2 instance, it uses a 2048-bit RSA key pair that you have named. Amazon EC2 stores the public key. Amazon EMR stores the private key and uses the private key to validate all requests.
The key pair ensures that only you can access your job flows. When you launch an instance using your key pair name, the public key becomes part of the instance metadata. This allows you to access the cluster node securely.
Although specifying the key pair is optional, we strongly recommend that you use key pairs. This key pair becomes associated with all of the nodes created to process your job flow. The key pair name creates a handle you can use to access the master node in the Hadoop cluster. With the key pair name, you can log in to the master node without using a password, enabling you to monitor the progress of your job flows. On the master node, you can retrieve detailed job flow processing status and statistics.
For more information on how to create and use an Amazon EC2 key pair with Amazon EMR, see "Creating an Amazon EC2 Key Pair" in the Getting Started Guide.
The following sections describe Amazon S3 features used by Amazon EMR.
Amazon EMR requires Amazon S3 buckets to hold the input and output data of your Hadoop processing. Amazon EMR uses the Amazon S3 Native File System for Hadoop processing. Amazon S3 uses the hostname method for accessing data, which places restrictions on bucket names used in Amazon EMR job flows.
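Because the bucket name becomes part of a hostname, a name that is not DNS-compliant cannot be addressed this way. The following sketch checks a name against the general Amazon S3 DNS-compliant naming rules (3 to 63 characters; lowercase letters, digits, hyphens, and periods; each period-separated label starts and ends with a letter or digit; not formatted like an IP address). Treating these rules as the restrictions meant here is an assumption, since this page does not list them, and Amazon EMR may impose further limits. The function name is hypothetical.

```python
import re

# One DNS label: lowercase letters, digits, and hyphens,
# starting and ending with a letter or digit.
_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?")
# Four dot-separated numeric groups, for example 192.168.0.1.
_IP_LIKE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def is_dns_compliant_bucket_name(name: str) -> bool:
    """Return True if name follows the general S3 DNS-compliant naming rules."""
    if not 3 <= len(name) <= 63:
        return False
    if _IP_LIKE.fullmatch(name):
        return False
    # Empty labels (leading, trailing, or adjacent periods) fail the match.
    return all(_LABEL.fullmatch(label) for label in name.split("."))


print(is_dns_compliant_bucket_name("my-emr-input"))  # True
print(is_dns_compliant_bucket_name("My_Bucket"))     # False
```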
For more information on creating Amazon S3 buckets for use with Amazon EMR, see Setting Up Your Environment to Run a Job Flow. For more information on Amazon S3 buckets, go to Working with Amazon S3 Buckets in the Amazon S3 Developer Guide.
Amazon Elastic MapReduce (Amazon EMR) supports Amazon S3 multipart upload through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of parts. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles the parts and creates the object.
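The behavior described above (independent parts, any order, per-part retransmission, final assembly) can be illustrated with a local simulation. This sketch does not call Amazon S3 or the AWS SDK for Java; `send` is a hypothetical transport callable that raises `IOError` on a failed transmission, and all function names are invented for illustration.

```python
import random


def split_into_parts(data: bytes, part_size: int) -> dict:
    """Map 1-based part numbers to consecutive chunks of data."""
    return {
        offset // part_size + 1: data[offset:offset + part_size]
        for offset in range(0, len(data), part_size)
    }


def upload_parts(parts: dict, send, max_attempts: int = 3) -> dict:
    """Send every part in random order, retrying only a part that failed."""
    received = {}
    for number in random.sample(sorted(parts), len(parts)):
        for _ in range(max_attempts):
            try:
                send(number, parts[number])
            except IOError:
                continue  # retransmit this part only; other parts are unaffected
            received[number] = parts[number]
            break
        else:
            raise RuntimeError(f"part {number} failed {max_attempts} times")
    return received


def assemble(received: dict) -> bytes:
    """Concatenate the parts by part number, as in the final assembly step."""
    return b"".join(received[n] for n in sorted(received))
```

Because each part carries its own number, the upload order does not matter: `assemble` restores the original byte sequence from the part numbers alone.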
For more information about enabling multipart uploads with Amazon EMR, see Multipart Upload. For more information on Amazon S3 multipart uploads, go to Uploading Objects Using Multipart Upload in the Amazon S3 Developer Guide.
Amazon Elastic MapReduce (Amazon EMR) supports AWS Identity and Access Management (IAM) policies. IAM is a web service that enables AWS customers to manage users and user permissions. For more information about enabling IAM policies with Amazon EMR, see Configuring User Permissions. For more information on IAM, go to Using IAM in the Using AWS Identity and Access Management guide.
You can choose the geographical region where Amazon EC2 creates the cluster to process your data. You might choose a region to optimize latency, minimize costs, or address regulatory requirements. Setting a region-specific endpoint guarantees where your data resides. Amazon EMR currently works in the following regions.
To send an API request to a region, use the endpoint listed for that region.

| Region | Endpoint |
|---|---|
| US-East (Northern Virginia) | us-east-1.elasticmapreduce.amazonaws.com |
| US-West (Oregon) | us-west-2.elasticmapreduce.amazonaws.com |
| US-West (Northern California) | us-west-1.elasticmapreduce.amazonaws.com |
| EU (Ireland) | eu-west-1.elasticmapreduce.amazonaws.com |
| Asia Pacific (Singapore) | ap-southeast-1.elasticmapreduce.amazonaws.com |
| Asia Pacific (Tokyo) | ap-northeast-1.elasticmapreduce.amazonaws.com |
| South America (Sao Paulo) | sa-east-1.elasticmapreduce.amazonaws.com |
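When a client needs to choose an endpoint programmatically, the regions above can be held in a lookup table. This is a minimal sketch: the endpoints are copied from this page, while the dictionary and the `emr_endpoint` helper are hypothetical names, not part of any AWS SDK.

```python
# Region code -> Amazon EMR API endpoint, copied from the list above.
EMR_ENDPOINTS = {
    "us-east-1": "us-east-1.elasticmapreduce.amazonaws.com",
    "us-west-2": "us-west-2.elasticmapreduce.amazonaws.com",
    "us-west-1": "us-west-1.elasticmapreduce.amazonaws.com",
    "eu-west-1": "eu-west-1.elasticmapreduce.amazonaws.com",
    "ap-southeast-1": "ap-southeast-1.elasticmapreduce.amazonaws.com",
    "ap-northeast-1": "ap-northeast-1.elasticmapreduce.amazonaws.com",
    "sa-east-1": "sa-east-1.elasticmapreduce.amazonaws.com",
}


def emr_endpoint(region: str) -> str:
    """Return the endpoint for a listed region, or raise ValueError."""
    try:
        return EMR_ENDPOINTS[region]
    except KeyError:
        raise ValueError(f"no Amazon EMR endpoint listed for region {region!r}") from None


print(emr_endpoint("eu-west-1"))  # eu-west-1.elasticmapreduce.amazonaws.com
```

Raising an error for an unlisted region, rather than constructing a hostname from the region code, avoids sending requests to an endpoint this page does not confirm.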
Amazon EMR uses Amazon S3 and Amazon SimpleDB data storage systems when processing a job flow. For more information about using Amazon S3 with Hadoop, go to http://wiki.apache.org/hadoop/AmazonS3. For more information about Amazon SimpleDB, go to the Amazon SimpleDB product description page.