Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Setting Up Your Environment to Run a Job Flow

This section walks you through how to set up required resources and permissions to run a job flow. The tasks that follow show you how to create the resources that your job flow uses to process data. Once created, you can reuse these resources for other job flows. Depending on your application, however, it may make operational sense to create new resources for each job flow.

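The tasks that must be completed before you create a job flow are as follows:

  • Choose a Region

  • Create and Configure an Amazon S3 Bucket

  • Create an Amazon EC2 Key Pair and PEM File

  • Modify Your PEM File

  • Get Security Credentials

  • Create a Credentials File

The following sections provide instructions on how to perform each of the tasks.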

Choose a Region

AWS enables you to place resources in multiple locations. Locations are composed of Regions and Availability Zones within those Regions. Availability Zones are distinct geographical locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region.

All Amazon EC2 instances, key pairs, security groups, and Amazon Elastic MapReduce (Amazon EMR) job flows must be located in the same Region. To optimize performance and reduce latency, all other resources (such as Amazon S3 buckets) should be located in the same Region as your job flows.

For more information about Regions and Availability Zones, go to Using Regions and Availability Zones in the Amazon Elastic Compute Cloud User Guide.

Note

Not all AWS products offer the same support in all Regions. For example, Cluster Compute instances are available only in the US-East (Northern Virginia) Region. Confirm that you are working in the appropriate Region for the resources you want to use.

You must ensure that you use the same Region for each resource you create. Use the table below to identify the correct Region name.

Amazon EMR Region          Amazon EMR CLI and API Region  Amazon S3 Region     Amazon EC2 Region
US East (Virginia)         us-east-1                      US Standard          US East (Virginia)
US West (Oregon)           us-west-2                      Oregon               US West (Oregon)
US West (N. California)    us-west-1                      Northern California  US West (N. California)
EU West (Ireland)          eu-west-1                      Ireland              EU West (Ireland)
Asia Pacific (Singapore)   ap-southeast-1                 Singapore            Asia Pacific (Singapore)
Asia Pacific (Tokyo)       ap-northeast-1                 Tokyo                Asia Pacific (Tokyo)
South America (Sao Paulo)  sa-east-1                      Sao Paulo            South America (Sao Paulo)
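
If you script your setup, the Region table above can be kept as a small lookup so that every resource uses the matching name. The following Python sketch copies the values from the table; the names REGIONS, s3_region_name, and cli_region_for are hypothetical helpers, not part of any AWS tool.

```python
# Region names copied from the table above, keyed by the
# Amazon EMR CLI and API Region identifier.
REGIONS = {
    "us-east-1": {"emr": "US East (Virginia)", "s3": "US Standard"},
    "us-west-2": {"emr": "US West (Oregon)", "s3": "Oregon"},
    "us-west-1": {"emr": "US West (N. California)", "s3": "Northern California"},
    "eu-west-1": {"emr": "EU West (Ireland)", "s3": "Ireland"},
    "ap-southeast-1": {"emr": "Asia Pacific (Singapore)", "s3": "Singapore"},
    "ap-northeast-1": {"emr": "Asia Pacific (Tokyo)", "s3": "Tokyo"},
    "sa-east-1": {"emr": "South America (Sao Paulo)", "s3": "Sao Paulo"},
}


def s3_region_name(cli_region):
    """Return the Amazon S3 Region name that matches a CLI/API Region."""
    return REGIONS[cli_region]["s3"]


def cli_region_for(emr_region_name):
    """Return the CLI/API Region identifier for an Amazon EMR Region name."""
    for cli_region, names in REGIONS.items():
        if names["emr"] == emr_region_name:
            return cli_region
    raise KeyError(emr_region_name)


print(s3_region_name("eu-west-1"))
print(cli_region_for("US West (Oregon)"))
```

In the table, the Amazon EC2 Region name is the same as the Amazon EMR Region name, so the sketch does not store it separately.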

Using the Amazon EMR Console to Specify a Region

To select a region in Amazon EMR

  • In the Amazon EMR console, select the Region from the drop-down list in the upper-left part of the page.

    Amazon EMR Regions

Using the CLI to Specify a Region

Specify the Region with the --region parameter, as in the following example. If you do not specify the --region parameter, the job flow is created in the us-east-1 Region.

$ ./elastic-mapreduce --create --alive --stream --input myawsbucket \
 --output myawsbucket --log-uri myawsbucket --region eu-west-1

Tip

To reduce the number of parameters required each time you issue a command from the CLI, you can store information such as the Region in your credentials.json file. For more information on creating a credentials.json file, see Create a Credentials File.
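
The default-Region behavior described above can be sketched in a few lines of Python. This is only an illustration: build_create_command and effective_region are hypothetical helpers, the sketch uses only the CLI options that appear in the example above, and it builds the argument list without running the CLI.

```python
import shlex

# Default stated in the text above: without --region, us-east-1 is used.
DEFAULT_REGION = "us-east-1"


def build_create_command(input_uri, output_uri, log_uri, region=None):
    """Build the argument list for a streaming job flow creation command."""
    args = [
        "./elastic-mapreduce", "--create", "--alive", "--stream",
        "--input", input_uri,
        "--output", output_uri,
        "--log-uri", log_uri,
    ]
    if region is not None:
        # Only pass --region when the caller chose one explicitly.
        args += ["--region", region]
    return args


def effective_region(region=None):
    """Return the Region the job flow would be created in."""
    return region or DEFAULT_REGION


command = build_create_command("myawsbucket", "myawsbucket", "myawsbucket",
                               region="eu-west-1")
print(shlex.join(command))
print(effective_region())
```

Keeping the arguments in a list rather than one string avoids quoting problems if you later pass them to a process launcher.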

Using the API to Specify a Region

To select a region, configure your application to use that Region's endpoint. If you are creating a client application using an AWS SDK, you can change the client endpoint by calling setEndpoint, as shown in the following example:

client.setEndpoint("eu-west-1.elasticmapreduce.amazonaws.com");

Once your application has specified a Region by setting the endpoint, you can set the Availability Zone for your job flow's Amazon EC2 instances with a query request that contains an Instances.Placement.AvailabilityZone parameter, as in the following example. If you do not specify the Availability Zone for your job flow, Amazon EMR launches the job flow instances in the best Availability Zone in that Region based on system health and available capacity.

https://eu-west-1.elasticmapreduce.amazonaws.com?
Operation=
...
Instances.Placement.AvailabilityZone=eu-west-1a&
...
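
As a rough sketch of how the endpoint and the Availability Zone parameter fit together, the following Python code assembles a query URL using the Region-specific endpoint pattern from the setEndpoint example. The helper names are hypothetical. The Operation value RunJobFlow is an assumption, because the example above leaves it blank. The result is not a complete request: it omits the signature and the other required parameters.

```python
from urllib.parse import urlencode


def emr_endpoint(region):
    """Endpoint host, following the pattern in the setEndpoint example."""
    return f"{region}.elasticmapreduce.amazonaws.com"


def zone_in_region(availability_zone, region):
    """An Availability Zone name is its Region name plus one letter."""
    return availability_zone[:-1] == region


def build_query_url(region, availability_zone):
    if not zone_in_region(availability_zone, region):
        raise ValueError(f"{availability_zone} is not in {region}")
    params = {
        "Operation": "RunJobFlow",  # assumption: left blank in the source
        "Instances.Placement.AvailabilityZone": availability_zone,
    }
    # Unsigned and incomplete: for illustration only.
    return f"https://{emr_endpoint(region)}?{urlencode(params)}"


print(build_query_url("eu-west-1", "eu-west-1a"))
```

The check in zone_in_region catches the easy mistake of pairing an Availability Zone with the endpoint of a different Region.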

For more information about the parameters in an Amazon EMR request, see API Reference.

Note

For more information on specifying Regions from the CLI and API, see Available Region Endpoints for the AWS SDKs.

Create and Configure an Amazon S3 Bucket

Amazon Elastic MapReduce (Amazon EMR) uses Amazon S3 to store input data, log files, and output data. Amazon S3 refers to these storage locations as buckets. To conform with Amazon S3 requirements, DNS requirements, and restrictions in the supported data analysis tools, we recommend the following guidelines for bucket names. All bucket names must:

  • Be between 3 and 63 characters long

  • Contain only lowercase letters, numbers, or periods (.)

  • Not contain a dash (-) or underscore (_)
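
The three guidelines above can be checked mechanically before you create a bucket. The following Python sketch is one way to do it; follows_bucket_guidelines is a hypothetical helper, and it checks only the guidelines listed here, not the full set of Amazon S3 naming rules.

```python
import re

# 3 to 63 characters, drawn only from lowercase letters, digits, and periods.
# Dashes and underscores are rejected because they are not in the set.
_GUIDELINE_PATTERN = re.compile(r"[a-z0-9.]{3,63}")


def follows_bucket_guidelines(name):
    """Return True if the name meets the three guidelines listed above."""
    return _GUIDELINE_PATTERN.fullmatch(name) is not None


for candidate in ("myawsbucket", "my.bucket.01", "My_Bucket", "ab"):
    print(candidate, follows_bucket_guidelines(candidate))
```

Because the allowed character set already excludes the dash and the underscore, the third guideline needs no separate test.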

For additional details on valid bucket names, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.

This section shows you how to use the AWS Management Console to create and then set permissions for an Amazon S3 bucket. However, you can also create and set permissions for an Amazon S3 bucket using the Amazon S3 API or the third-party Curl command line tool. For information about Curl, go to Amazon S3 Authentication Tool for Curl. For information about using the Amazon S3 API to create and configure an Amazon S3 bucket, go to the Amazon Simple Storage Service API Reference.

Using the AWS Management Console to Create an Amazon S3 Bucket

To create an Amazon S3 bucket

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Click Create Bucket.

    The Create a Bucket dialog box opens.

  3. Enter a bucket name, such as mylog-uri.

    This name must be globally unique; it cannot be the same as the name of any other bucket.

  4. Select the Region for your bucket. To avoid paying cross-region bandwidth charges, create the Amazon S3 bucket in the same region as your job flow.

    Refer to Choose a Region for guidance on choosing a Region.

  5. Click Create.

You created a bucket with the URI s3n://mylog-uri/.

Note

If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR job flow logs.

Note

For more information on specifying Region-specific buckets, refer to Buckets and Regions in the Amazon Simple Storage Service Developer Guide and Available Region Endpoints for the AWS SDKs.

After you create your bucket you can set the appropriate permissions on it. Typically, you give yourself (the owner) read and write access and authenticated users read access.

Using the AWS Management Console to Configure an Amazon S3 Bucket

To set permissions on an Amazon S3 bucket

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. In the Buckets pane, right-click the bucket you just created.

  3. Select Properties.

  4. In the Properties pane, select the Permissions tab.

  5. Click Add more permissions.

  6. Select Authenticated Users in the Grantee field.

  7. To the right of the Grantee drop-down list, select List.

  8. Click Save.

You have created a bucket and restricted permissions to authenticated users.

Create an Amazon EC2 Key Pair and PEM File

Amazon EMR uses an Amazon Elastic Compute Cloud (Amazon EC2) key pair to ensure that you alone have access to the instances that you launch. The PEM file associated with this key pair is required to ssh directly to the master node of the cluster running your job flow.

To create an Amazon EC2 key pair

  1. Sign in to the AWS Management Console and open the Amazon EC2 console at https://console.aws.amazon.com/ec2/.

  2. From the Amazon EC2 console, select a Region.

  3. In the Navigation pane, click Key Pairs.

  4. On the Key Pairs page, click Create Key Pair.

  5. In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair.

  6. Click Create.

  7. Save the resulting PEM file in a safe location.

Your Amazon EC2 key pair and an associated PEM file are created.

Modify Your PEM File

Amazon Elastic MapReduce (Amazon EMR) enables you to work interactively with your job flow, allowing you to test job flow steps or troubleshoot your cluster environment. To log in directly to the master node of your running job flow, you can use ssh or PuTTY. You use your PEM file to authenticate to the master node. The PEM file must first be modified, and the modification depends on the tool your operating system supports. On Linux or UNIX computers, you use the CLI to connect. On Microsoft Windows computers, you use PuTTY. For more information on how to install the Amazon EMR CLI or how to install PuTTY, go to the Getting Started Guide.

To modify your PEM file

  • Modify the PEM file as appropriate for your operating system:

    If you are using Linux or UNIX, do this:

    Set the permissions on the PEM file for your Amazon EC2 key pair. For example, if you saved the file as mykeypair.pem, the command looks like the following:

    $ chmod og-rwx mykeypair.pem

    If you are using Microsoft Windows, do this:

    1. Download PuTTYgen.exe to your computer from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.

    2. Launch PuTTYgen.

    3. Click Load. Select the PEM file you created earlier.

    4. Click Open.

    5. Click OK on the PuTTYgen Notice telling you the key was successfully imported.

    6. Click Save private key to save the key in the PPK format.

    7. When PuTTYgen prompts you to save the key without a pass phrase, click Yes.

    8. Enter a name for your PuTTY private key, such as mykeypair.ppk.

    9. Click Save.

    10. Exit the PuTTYgen application.

Your key file is now prepared so that you can log in directly to the master node of your running job flow.
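
For the Linux or UNIX case, the effect of chmod og-rwx can also be reproduced from a script. The following Python sketch assumes a POSIX file system; restrict_to_owner is a hypothetical helper, and the demonstration uses a throwaway file that stands in for mykeypair.pem.

```python
import os
import stat
import tempfile


def restrict_to_owner(path):
    """Clear every group and other permission bit, like `chmod og-rwx`."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current & ~(stat.S_IRWXG | stat.S_IRWXO))
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstration on a placeholder file, not a real key.
with tempfile.TemporaryDirectory() as workdir:
    pem_path = os.path.join(workdir, "mykeypair.pem")
    with open(pem_path, "w") as handle:
        handle.write("placeholder, not a real key\n")
    os.chmod(pem_path, 0o644)  # readable by group and others
    mode_after = restrict_to_owner(pem_path)

print(oct(mode_after))
```

As with the chmod command, the owner's own permission bits are left unchanged; only the group and other bits are removed.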

Get Security Credentials

AWS assigns you an Access Key ID and a Secret Access Key to identify you as the sender of your request. AWS uses these security credentials to help protect your data. You include your Access Key ID in all AWS requests made through the CLI or API. The AWS Management Console provides these security credentials automatically.

Note

Your Secret Access Key is a shared secret between you and AWS. Keep this key secret. Amazon uses this key to bill you for the AWS services you use. Never include your Secret Access Key in your requests to AWS, and never email it to anyone, even if an inquiry appears to originate from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key.
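
To see why a shared secret never needs to travel with a request, consider the following generic Python sketch of HMAC-SHA256 signing. It is not the exact AWS signing procedure, which also defines how the request is canonicalized; the function names and sample values are hypothetical. The sender transmits only the message and the signature, and the receiver, who holds the same secret, recomputes the signature to check it.

```python
import base64
import hashlib
import hmac


def sign(secret_access_key, string_to_sign):
    """Return a Base64-encoded HMAC-SHA256 signature of the message."""
    digest = hmac.new(
        secret_access_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(secret_access_key, string_to_sign, signature):
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret_access_key, string_to_sign)
    return hmac.compare_digest(expected, signature)


# Placeholder values, not real credentials.
signature = sign("example-secret", "example request text")
print(verify("example-secret", "example request text", signature))
print(verify("wrong-secret", "example request text", signature))
```

Anyone who intercepts the message and signature still cannot produce a valid signature for a different message without the secret.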

To get your Access Key ID and Secret Access Key

  1. Go to the AWS website.

  2. Click My Account to display a list of options.

  3. Click Security Credentials and log in to your AWS Account. Your Access Key ID is displayed in the Access Credentials section. Your Secret Access Key remains hidden as a further precaution.

  4. To display your Secret Access Key, click Show in the Your Secret Access Key area, as shown in the following figure.

    AWS Security Credentials

You have your Access Key ID and a Secret Access Key to securely identify yourself to AWS. You need this information to create a credentials file, as described in the following section.

Create a Credentials File

You can use an Amazon EMR credentials file to simplify job flow creation and authentication of requests. The credentials file provides information required for many commands. The credentials file is a convenient place for you to store command parameters so you don't have to repeatedly enter the information.

Your credentials are used to calculate the signature value for every request you make. The Amazon EMR CLI automatically looks for these credentials in the file credentials.json. You can edit the credentials.json file to include your AWS credentials. If you do not have a credentials.json file, you must include your credentials in every request you make.

To create your credentials file

  1. Create a file named credentials.json on your computer.

  2. Add the following lines to your credentials file:

    {
      "access-id": "AccessKeyID",
      "private-key": "PrivateKey",
      "key-pair": "KeyName",
      "key-pair-file": "location of key pair file",
      "region": "Region",
      "log-uri": "location of bucket on Amazon S3"    
    }

The access-id and private-key are the AWS Access Key ID and Secret Access Key described in Get Security Credentials. The key-pair and key-pair-file are the name of the Amazon EC2 key pair and the path and name of the PEM file you created in Create an Amazon EC2 Key Pair and PEM File. The region is the Region you selected in Choose a Region. The log-uri is the path to the bucket you created in Create and Configure an Amazon S3 Bucket, using the format s3n://BucketName/FolderName.

Your credentials.json file is configured.
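
If you prefer to generate the file from a script, the following Python sketch writes a credentials.json with the six fields shown above and builds log-uri in the s3n://BucketName/FolderName format. The helper names are hypothetical, all values are placeholders, and the file is written to a temporary directory for demonstration only.

```python
import json
import os
import tempfile

# Field names copied from the credentials.json example above.
EXPECTED_KEYS = ("access-id", "private-key", "key-pair",
                 "key-pair-file", "region", "log-uri")


def build_credentials(access_id, private_key, key_pair, key_pair_file,
                      region, bucket, folder):
    """Assemble the credentials dictionary, including the s3n:// log URI."""
    return {
        "access-id": access_id,
        "private-key": private_key,
        "key-pair": key_pair,
        "key-pair-file": key_pair_file,
        "region": region,
        "log-uri": f"s3n://{bucket}/{folder}",
    }


def missing_keys(credentials):
    """List the expected fields that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not credentials.get(key)]


# Placeholder values, not real credentials.
creds = build_credentials("AccessKeyID", "PrivateKey", "mykeypair",
                          "/path/to/mykeypair.pem", "eu-west-1",
                          "myawsbucket", "logs")

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "credentials.json")
    with open(path, "w") as handle:
        json.dump(creds, handle, indent=2)
    with open(path) as handle:
        loaded = json.load(handle)

print(missing_keys(loaded))
```

Because the file contains your Secret Access Key, store the real file somewhere that only you can read.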

Each of the preceding tasks guided you through the steps to set up the objects and permissions required for a job flow. You are now ready to create your job flow. For instructions, see Creating a Job Flow.