Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Running Job Flows on an Amazon VPC

Amazon Virtual Private Cloud (Amazon VPC) enables you to provision a private area within AWS where you can configure a virtual network, controlling aspects such as private IP address ranges, subnets, routing tables and network gateways. For more information about Amazon VPC, go to the Amazon Virtual Private Cloud User Guide.

When you launch an Amazon Elastic MapReduce (Amazon EMR) job flow, you can choose to launch it either on the AWS cloud (the default) or on an Amazon VPC.

Reasons why you might choose to launch your job flow on Amazon VPC include:

  • Processing sensitive data

    Launching a job flow on Amazon VPC is similar to launching the job flow on a private network with additional tools, such as routing tables and Network ACLs, for defining who has access to the network. If you are processing sensitive data in your job flow, you may want the additional access control that launching your job flow on Amazon VPC provides.

  • Accessing resources on an internal network

    If your data store is located on a private network, it may be impractical or undesirable to upload that data to AWS for import into Amazon EMR, either because of the amount of data to transfer or because of the sensitive nature of the data. Instead, you can launch the job flow on an Amazon VPC and connect to your data center through a VPN connection, enabling the job flow to access resources on your internal network. For example, if you have an Oracle database on a private VPN, launching your job flow on an Amazon VPC connected to that VPN makes it possible for the job flow to access the Oracle database.

The following diagram illustrates how an Amazon EMR job flow runs on Amazon VPC. The job flow is launched within a VPC subnet. Through the internet gateway the job flow is able to contact resources on the AWS cloud such as Amazon S3 buckets.

Because access to the AWS cloud is a requirement of the job flow, you must add an internet gateway to the VPC subnet hosting the job flow. If your application requires subnets without an internet gateway, you can launch those components in additional VPC subnets.

The following diagram shows how to set up an Amazon VPC so that the job flow can access resources on a local VPN.

Restricting Permissions with IAM on Amazon VPC

When you launch a job flow on Amazon VPC, you can use IAM to control access to job flows and restrict actions via policies just as you would with job flows launched on the AWS cloud. For more information about how IAM works with Amazon EMR, go to Using AWS Identity and Access Management.

You can also use IAM to control who can create and administer VPC subnets. For more information about administering policies and actions, go to Configuring User Permissions in the IAM User’s Guide.

By default, all IAM users can see all of the VPC subnets for the account, and any user can launch a job flow in any subnet.

You can restrict who can administer the VPC subnet while still allowing users to launch job flows into VPC subnets. To do so, create one user account that has permissions to create and configure VPC subnets, and a second user account that can launch job flows but cannot modify Amazon VPC settings.

To allow users to launch job flows in an Amazon VPC without the ability to modify the Amazon VPC

  1. Create the Amazon VPC and launch Amazon EMR into a subnet of that Amazon VPC using an account with permissions to administer Amazon VPC and Amazon EMR.

  2. Create a second user account with permissions to call the RunJobFlow, DescribeJobFlows, TerminateJobFlows, and AddJobFlowSteps actions in the Amazon EMR API. You should also create an IAM policy that allows this user to launch EC2 instances. An example of this is shown below.

    {
      "Statement": [
        {
          "Action": [
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateSecurityGroup",
            "ec2:CreateTags",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeInstances",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:ModifyImageAttribute",
            "ec2:ModifyInstanceAttribute",
            "ec2:RequestSpotInstances",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Effect": "Allow",
          "Resource": "*"
        },
        {
          "Action": [
            "elasticmapreduce:AddInstanceGroups",
            "elasticmapreduce:AddJobFlowSteps",
            "elasticmapreduce:DescribeJobFlows",
            "elasticmapreduce:ModifyInstanceGroups",
            "elasticmapreduce:RunJobFlow",
            "elasticmapreduce:TerminateJobFlows"
          ],
          "Effect": "Allow",
          "Resource": "*"
        }
      ]
    }

    Users with the IAM permissions set above will be able to launch job flows within the VPC subnet, but will not be able to change the Amazon VPC configuration.

    Note

    You should be cautious when granting ec2:TerminateInstances permissions because this action gives the recipient the ability to shut down any Amazon EC2 instance in the account, including those outside of Amazon EMR.
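
As a quick sanity check on a policy like the one in the preceding procedure, the following Python sketch parses a policy document and reports whether it allows any action that changes VPC configuration. The function names, the abbreviated policy text, and the list of VPC-related action prefixes are illustrative assumptions, not an official or exhaustive list.

```python
import json

# Abbreviated copy of the example policy (assumption: trimmed for brevity).
POLICY_JSON = """
{
  "Statement": [
    {"Action": ["ec2:RunInstances", "ec2:TerminateInstances",
                "ec2:DescribeInstances", "ec2:CreateSecurityGroup"],
     "Effect": "Allow", "Resource": "*"},
    {"Action": ["elasticmapreduce:RunJobFlow",
                "elasticmapreduce:DescribeJobFlows",
                "elasticmapreduce:AddJobFlowSteps",
                "elasticmapreduce:TerminateJobFlows"],
     "Effect": "Allow", "Resource": "*"}
  ]
}
"""

# Hypothetical, non-exhaustive prefixes of EC2 actions that modify a VPC.
VPC_ADMIN_PREFIXES = (
    "ec2:CreateVpc", "ec2:DeleteVpc",
    "ec2:CreateSubnet", "ec2:DeleteSubnet",
    "ec2:CreateRoute", "ec2:DeleteRoute",
    "ec2:AttachInternetGateway", "ec2:DetachInternetGateway",
)


def allowed_actions(policy_text):
    """Return the set of actions granted by 'Allow' statements."""
    policy = json.loads(policy_text)
    actions = set()
    for statement in policy["Statement"]:
        if statement.get("Effect") != "Allow":
            continue
        listed = statement["Action"]
        if isinstance(listed, str):  # "Action" may be a single string
            listed = [listed]
        actions.update(listed)
    return actions


def vpc_admin_actions(actions):
    """Return the granted actions that look like VPC administration."""
    return sorted(
        a for a in actions
        if a in ("*", "ec2:*") or a.startswith(VPC_ADMIN_PREFIXES)
    )


if __name__ == "__main__":
    granted = allowed_actions(POLICY_JSON)
    print("VPC admin actions granted:", vpc_admin_actions(granted))
```

The check is deliberately conservative: it only matches literal action names plus the two broadest wildcards, so a policy that uses other wildcard patterns would need a more careful review.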

Setting up an Amazon VPC to Host Job Flows

Before you can launch job flows on an Amazon VPC, you must create an Amazon VPC, a VPC subnet, and an internet gateway. The following instructions describe how to create an Amazon VPC capable of hosting Amazon EMR job flows using the Amazon VPC console.

To create a VPC subnet to run Amazon EMR job flows

  1. Sign in to the AWS Management Console and open the Amazon VPC console at https://console.aws.amazon.com/vpc/.

  2. Create an Amazon VPC by clicking Get started creating a VPC. Make sure that the Region drop-down box is set to the same Region where you'll be running your job flow. In this example, we're creating an Amazon VPC in the US East (Virginia) Region.

  3. Choose the VPC configuration by selecting one of the radio buttons.

    If the data used in the job flow is available on the Internet (for example, in Amazon S3 or Amazon RDS), select VPC with a Single Public Subnet Only.

    If the data used in the job flow is stored locally (for example, in an Oracle database), select VPC with Public and Private subnets and Hardware VPN Access.

  4. Confirm the Amazon VPC settings. To work with Amazon EMR, the Amazon VPC must have both an internet gateway and a subnet.

  5. A dialog box confirms that the Amazon VPC was successfully created. Click Close.

Once you've created an Amazon VPC you need to locate its subnet identifier; you'll use this value to launch the Amazon EMR job flow on the Amazon VPC.

To find the Amazon VPC subnet identifier

  • Click on Subnets in the navigation menu of the Amazon VPC console. The right pane displays information about the Amazon VPC, including its subnet identifier.
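
The same lookup can be scripted. The following Python sketch assumes you already have the parsed result of the Amazon EC2 DescribeSubnets action as a dictionary in the shape that current AWS SDKs return (a "Subnets" list whose entries carry "SubnetId" and "VpcId"). The function name, the sample data, and the identifiers in it are hypothetical.

```python
def subnet_ids_for_vpc(describe_subnets_response, vpc_id):
    """Return the identifiers of all subnets that belong to vpc_id."""
    return [
        subnet["SubnetId"]
        for subnet in describe_subnets_response.get("Subnets", [])
        if subnet.get("VpcId") == vpc_id
    ]


# Hypothetical sample data standing in for a real DescribeSubnets result.
SAMPLE_RESPONSE = {
    "Subnets": [
        {"SubnetId": "subnet-11111111", "VpcId": "vpc-aaaaaaaa",
         "CidrBlock": "10.0.0.0/24"},
        {"SubnetId": "subnet-22222222", "VpcId": "vpc-bbbbbbbb",
         "CidrBlock": "10.1.0.0/24"},
    ]
}

if __name__ == "__main__":
    print(subnet_ids_for_vpc(SAMPLE_RESPONSE, "vpc-aaaaaaaa"))
```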

Launching job flows on Amazon VPC

Once you have a VPC subnet that is configured to host Amazon EMR job flows, launching job flows on that VPC subnet is as simple as specifying the subnet identifier during the job flow creation.

If the VPC subnet does not have an internet gateway, the job flow creation call will fail with the error: "Subnet not correctly configured, missing route to an internet gateway."
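
To catch this condition before calling RunJobFlow, you could inspect the subnet's route table yourself. The following Python sketch assumes you have the parsed result of the Amazon EC2 DescribeRouteTables action for a single VPC, in the dictionary shape that current AWS SDKs return ("RouteTables", each with "Associations" and "Routes"). The function names and the sample data are hypothetical, and the check is an approximation of whatever validation Amazon EMR performs, not a copy of it.

```python
def route_table_for_subnet(route_tables, subnet_id):
    """Pick the table explicitly associated with the subnet,
    falling back to the VPC's main route table."""
    main_table = None
    for table in route_tables:
        for association in table.get("Associations", []):
            if association.get("SubnetId") == subnet_id:
                return table
            if association.get("Main"):
                main_table = table
    return main_table


def has_internet_gateway_route(response, subnet_id):
    """Return True if the subnet's route table has a route whose
    target is an internet gateway (identifier starting with 'igw-')."""
    table = route_table_for_subnet(response.get("RouteTables", []), subnet_id)
    if table is None:
        return False
    return any(
        (route.get("GatewayId") or "").startswith("igw-")
        for route in table.get("Routes", [])
    )


# Hypothetical sample data for one VPC with two route tables.
SAMPLE_ROUTE_TABLES = {
    "RouteTables": [
        {   # main table: local traffic only
            "Associations": [{"Main": True}],
            "Routes": [{"DestinationCidrBlock": "10.0.0.0/16",
                        "GatewayId": "local"}],
        },
        {   # public table: default route through an internet gateway
            "Associations": [{"Main": False, "SubnetId": "subnet-11111111"}],
            "Routes": [{"DestinationCidrBlock": "10.0.0.0/16",
                        "GatewayId": "local"},
                       {"DestinationCidrBlock": "0.0.0.0/0",
                        "GatewayId": "igw-12345678"}],
        },
    ]
}

if __name__ == "__main__":
    for subnet in ("subnet-11111111", "subnet-22222222"):
        print(subnet, has_internet_gateway_route(SAMPLE_ROUTE_TABLES, subnet))
```

In the sample data, the second subnet has no explicit association, so it falls back to the main route table, which has no internet gateway route.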

When the job flow is launched, Amazon EMR adds two security groups to the Amazon VPC: ElasticMapReduce-slave and ElasticMapReduce-master. By default, the ElasticMapReduce-master security group does not allow inbound SSH connections. If you require this functionality, you can add it to the security group.

To manage the job flow on an Amazon VPC, Amazon EMR attaches a second network device to the master node and manages the job flow through this device. You can view this device using the DescribeInstances action in the Amazon EC2 API. If you disconnect this device, the job flow will fail.

Once the job flow is created, it will be able to access AWS services such as Amazon S3 to connect to data stores.

Note

Amazon VPC currently does not support cluster compute instances. Thus you cannot specify a cc1.4xlarge, cc2.8xlarge, or cg1.4xlarge instance type for nodes of a job flow launched in an Amazon VPC.

To launch a job flow on an Amazon VPC using the Amazon EMR console

  1. In the Amazon EMR console, click the Create New Job Flow button.

  2. Follow the instructions in the Create a New Job Flow wizard, selecting options that match the job flow you want to launch.

  3. When you reach the ADVANCED OPTIONS page, choose the Amazon VPC subnet you created previously from the Amazon VPC Subnet Id drop-down box. If you have not created an Amazon VPC subnet, click the Create a VPC link underneath the drop-down box to open the Amazon VPC console and create an Amazon VPC and subnet.

  4. Continue through the Create a New Job Flow wizard until it is complete and the job flow is launched. The job flow is launched within the Amazon VPC subnet you specified in Step 3.

To launch a job flow on an Amazon VPC using the CLI

  • Once your Amazon VPC is configured, you can launch Amazon EMR job flows on it by using the --subnet argument and specifying the subnet identifier. This is illustrated in the following example, which creates a long-running job flow on the specified VPC subnet.

    elastic-mapreduce --create --alive --subnet subnet-identifier
    					

To launch a job flow on an Amazon VPC using the API

  • Once your Amazon VPC is configured, you can launch Amazon EMR job flows on it by providing the VPC subnet identifier as the value for Ec2SubnetId, an optional String parameter on the JobFlowInstancesConfig structure.

    https://elasticmapreduce.amazonaws.com?
                Operation=RunJobFlow&
                Name=MyJobFlowName&
                LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir&
                Instances.MasterInstanceType=m1.small&
                Instances.SlaveInstanceType=m1.small&
                Instances.InstanceCount=4&
                Instances.Ec2KeyName=myec2keyname&
                Instances.Placement.AvailabilityZone=us-east-1a&
                Instances.KeepJobFlowAliveWhenNoSteps=true&
                Instances.Ec2SubnetId=subnet-identifier&
                Steps.member.1.Name=MyStepName&
                Steps.member.1.ActionOnFailure=CONTINUE&
                Steps.member.1.HadoopJarStep.Jar=MyJarFile&
                Steps.member.1.HadoopJarStep.MainClass=MyMainClass&
                Steps.member.1.HadoopJarStep.Args.member.1=arg1&
                Steps.member.1.HadoopJarStep.Args.member.2=arg2&
                AWSAccessKeyId=AWS Access Key ID&
                SignatureVersion=2&
                SignatureMethod=HmacSHA256&
                Timestamp=2009-01-28T21%3A48%3A32.000Z&
                Signature=calculated value
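
The request above leaves the Signature value as a placeholder. The following Python sketch shows one way to assemble a RunJobFlow query string that includes Instances.Ec2SubnetId and to compute the signature, assuming the AWS Signature Version 2 scheme that the request names (SignatureVersion=2, SignatureMethod=HmacSHA256). The function name sign_v2, the credential strings, and the reduced parameter set are hypothetical; treat this as a sketch of the signing steps rather than a complete client.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def sign_v2(params, secret_key,
            host="elasticmapreduce.amazonaws.com", path="/", method="GET"):
    """Return the canonical query string with a Signature parameter appended."""
    def encode(value):
        # RFC 3986 encoding: leave only unreserved characters unescaped.
        return quote(str(value), safe="-_.~")

    # Sort parameters by name, then percent-encode names and values.
    canonical_query = "&".join(
        f"{encode(name)}={encode(value)}"
        for name, value in sorted(params.items())
    )
    string_to_sign = "\n".join([method, host.lower(), path, canonical_query])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return canonical_query + "&Signature=" + encode(signature)


# Placeholder values only; substitute your own identifiers and credentials.
params = {
    "Operation": "RunJobFlow",
    "Name": "MyJobFlowName",
    "Instances.MasterInstanceType": "m1.small",
    "Instances.SlaveInstanceType": "m1.small",
    "Instances.InstanceCount": "4",
    "Instances.KeepJobFlowAliveWhenNoSteps": "true",
    "Instances.Ec2SubnetId": "subnet-identifier",
    "AWSAccessKeyId": "EXAMPLE-ACCESS-KEY-ID",
    "SignatureVersion": "2",
    "SignatureMethod": "HmacSHA256",
    "Timestamp": "2009-01-28T21:48:32.000Z",
}

query = sign_v2(params, "EXAMPLE-SECRET-KEY")

if __name__ == "__main__":
    print("https://elasticmapreduce.amazonaws.com/?" + query)
```

Because the signature covers every parameter, changing the subnet identifier (or any other value) requires recomputing the signature before sending the request.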