Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

How to Create a Job Flow Using a Custom JAR

This section covers the basics of creating a job flow using a custom JAR file in Amazon Elastic MapReduce (Amazon EMR). You'll step through how to create a job flow using a custom JAR with the Amazon EMR console, the CLI, or the Query API. Before you create your job flow, you'll need to create objects and permissions; for more information, see Setting Up Your Environment to Run a Job Flow.

A job flow using a custom JAR file enables you to write a script to process your data using the Java programming language. The example that follows is based on the Amazon EMR sample: CloudBurst.

In this example, the JAR file is located in an Amazon S3 bucket at s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar. All of the data processing instructions are located in the JAR file and the script is referenced by the main class org.myorg.WordCount. The input data is located in the Amazon S3 bucket s3n://elasticmapreduce/samples/cloudburst/input. The output is saved to an Amazon S3 bucket you created as part of Setting Up Your Environment to Run a Job Flow.
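
The locations above are written as full URIs (s3n://bucket/path), while the console fields described later ask for the form BucketName/path/ScriptName. The following is a minimal sketch of converting between the two, assuming Python 3; the helper names split_s3_uri and console_form are hypothetical and not part of any Amazon EMR tool.

```python
from urllib.parse import urlsplit


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key); accepts the s3:// and s3n:// schemes."""
    parts = urlsplit(uri)
    if parts.scheme not in ("s3", "s3n") or not parts.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parts.netloc, parts.path.lstrip("/")


def console_form(uri):
    """Return the scheme-less BucketName/path form (hypothetical helper)."""
    bucket, key = split_s3_uri(uri)
    return f"{bucket}/{key}" if key else bucket


# The JAR location used in this example.
jar_uri = "s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar"
print(split_s3_uri(jar_uri))
print(console_form(jar_uri))
```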

Amazon EMR Console

This example describes how to use the Amazon EMR console to create a job flow using a custom JAR file.

To create a job flow using a custom JAR file

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Create New Job Flow.

  3. In the DEFINE JOB FLOW page, enter the following in the Define Job Flow section of the Create a New Job Flow dialog box:

    1. Enter a name in the Job Flow Name field.

      We recommend that you use a descriptive name. It does not need to be unique.

    2. Select Run your own application.

    3. Select Custom JAR in the drop-down list.

    4. Click Continue.

  4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following table as a guide, and then click Continue.

    Field            Action
    JAR Location*    Specify the URI where your script resides in Amazon S3. The value must be in the form BucketName/path/ScriptName.
    JAR Arguments*   Enter a list of arguments (space-separated strings) to pass to the JAR file.

    * Required parameter

  5. In the CONFIGURE EC2 INSTANCES page, select the type and number of instances, using the following table as a guide, and then click Continue.

    Note

    Twenty is the default maximum number of nodes per AWS account. For example, if you have two job flows running, the total number of nodes running for both job flows must be 20 or less. If you need more than 20 nodes, you must submit a request to increase your Amazon EC2 instance limit. For more information, go to the Request to Increase Amazon EC2 Instance Limit Form.

    Field                    Action
    Instance Count           Specify the number of nodes to use in the Hadoop cluster. There is always one master node in each job flow. You can specify the number of core and task nodes.
    Instance Type            Specify the Amazon EC2 instance types to use as master, core, and task nodes. Valid types are m1.small (default), m1.large, m1.xlarge, c1.medium, c1.xlarge, m2.xlarge, m2.2xlarge, m2.4xlarge, cc1.4xlarge, and cg1.4xlarge.
    Request Spot Instances   Select this check box to run master, core, or task nodes on Spot Instances. For more information, see Lowering Costs with Spot Instances.

    * Required parameter

  6. In the ADVANCED OPTIONS page, set additional configuration options, using the following table as a guide, and then click Continue.

    Field                  Action
    Amazon EC2 Key Pair    Optionally, specify a key pair that you created previously. For more information, see Create an Amazon EC2 Key Pair and PEM File. If you do not enter a value in this field, you cannot SSH into the master node.
    Amazon VPC Subnet Id   Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC. For more information, see Running Job Flows on an Amazon VPC.
    Amazon S3 Log Path     Optionally, specify a path in Amazon S3 to store the Amazon EMR log files. The value must be in the form BucketName/path. If you do not supply a location, Amazon EMR does not log any files.
    Enable Debugging       Select Yes to store Amazon Elastic MapReduce-generated log files. You must enable debugging at this level if you want to store the log files generated by Amazon EMR.

                           If you select Yes, you must supply an Amazon S3 bucket name where Amazon Elastic MapReduce can upload your log files.

                           For more information, see Troubleshooting.

                           Important

                           You can enable debugging for a job flow only when you initially create the job flow.

    Keep Alive             Select Yes to keep the job flow running after all processing is completed.

  7. In the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue.

    For more information about bootstrap actions, see Bootstrap Actions.

  8. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct.

    After you click Create Job Flow your request is processed; when it succeeds, a message appears.

  9. Click Close.

    The Amazon EMR console shows the new job flow starting.


CLI

This section explains how to run a job flow that uses a custom JAR file.

To create a job flow using a Custom JAR

  • Use the information in the following table to create your job flow:

    If you are using... Enter the following...
    Linux or UNIX
    $ ./elastic-mapreduce --create --name "Test custom JAR" \
      --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
        --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
        --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
        --arg s3n://myawsbucket/cloud \
        --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \
        --arg 24 --arg 128 --arg 16
    Microsoft Windows
    c:\> ruby elastic-mapreduce --create --name "Test custom JAR" --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br --arg s3n://myawsbucket/cloud --arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 --arg 24 --arg 128 --arg 16
Note

The individual --arg values above can also be passed as a single --args parameter followed by a comma-separated list of the same values.

The output looks similar to the following.

Created job flow JobFlowID

By default, this command launches a job flow to run on a single-node cluster using an Amazon EC2 m1.small instance. Later, when your steps are running correctly on a small set of sample data, you can launch job flows to run on multiple nodes. You can specify the number of nodes and the type of instance to run with the --num-instances and --instance-type parameters, respectively.
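
As an illustration of the parameters mentioned above, the following Python sketch assembles the same command line with --num-instances and --instance-type added, and with the JAR arguments collapsed into one --args list. It only builds and prints the command; it does not run it. The function name build_create_command is hypothetical, and the values 4 and m1.large are assumptions chosen for illustration.

```python
import shlex


def build_create_command(name, jar, jar_args, num_instances=None,
                         instance_type=None, use_args_list=False):
    """Return the argument list for an elastic-mapreduce --create call."""
    argv = ["./elastic-mapreduce", "--create", "--name", name, "--jar", jar]
    if use_args_list:
        # One --args parameter followed by a comma-separated list.
        argv += ["--args", ",".join(str(a) for a in jar_args)]
    else:
        # One --arg parameter per value, as in the table above.
        for a in jar_args:
            argv += ["--arg", str(a)]
    if num_instances is not None:
        argv += ["--num-instances", str(num_instances)]
    if instance_type is not None:
        argv += ["--instance-type", instance_type]
    return argv


# The CloudBurst arguments from the example above.
cloudburst_args = [
    "s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br",
    "s3n://elasticmapreduce/samples/cloudburst/input/100k.br",
    "s3n://myawsbucket/cloud",
    36, 3, 0, 1, 240, 48, 24, 24, 128, 16,
]

argv = build_create_command(
    "Test custom JAR",
    "s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar",
    cloudburst_args,
    num_instances=4,           # assumed value, for illustration only
    instance_type="m1.large",  # assumed value, for illustration only
    use_args_list=True,
)
print(shlex.join(argv))  # shlex.join requires Python 3.8 or later
```

Building the command as a list and quoting it with shlex.join keeps the job flow name, which contains spaces, as a single argument.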

API

This section describes the Amazon EMR API Query request parameters you need to create a job flow using a custom JAR file. For an explanation of the parameters unique to RunJobFlow, see RunJobFlow. The response includes a <JobFlowID>, which you use in other Amazon EMR operations, such as when describing or terminating a job flow. For this reason, it is important to store job flow IDs.

To start a job flow specifying a JAR file, send a RunJobFlow request similar to the following.

https://elasticmapreduce.amazonaws.com?
Operation=RunJobFlow&
Name=Test custom JAR&
LogUri=s3://myawsbucket/subdir&
Instances.MasterInstanceType=m1.small&
Instances.SlaveInstanceType=m1.small&
Instances.InstanceCount=4&
Instances.Ec2KeyName=myec2keyname&
Instances.Placement.AvailabilityZone=us-east-1a&
Instances.KeepJobFlowAliveWhenNoSteps=true&
Steps.member.1.Name=MyStepName&
Steps.member.1.ActionOnFailure=CONTINUE&
Steps.member.1.HadoopJarStep.Jar=s3://elasticmapreduce/samples/cloudburst/cloudburst.jar&
Steps.member.1.HadoopJarStep.MainClass=MyMainClass&
Steps.member.1.HadoopJarStep.Args.member.1=arg1&
Steps.member.1.HadoopJarStep.Args.member.2=arg2&
AWSAccessKeyId=AccessKeyID&
SignatureVersion=2&
SignatureMethod=HmacSHA256&
Timestamp=2009-01-28T21%3A48%3A32.000Z&
Signature=calculated value
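
The Signature value in the request above has to be calculated from the other parameters. The following Python sketch shows one way this might be done, assuming the standard AWS Signature Version 2 procedure (sorted, percent-encoded parameters signed with HMAC-SHA256). The secret key is a placeholder, the function names are hypothetical, and the parameter values are the illustrative ones from the request above; treat it as a sketch rather than a reference implementation.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

HOST = "elasticmapreduce.amazonaws.com"
PATH = "/"


def rfc3986(value):
    """Percent-encode everything except A-Z, a-z, 0-9, '-', '_', '.', '~'."""
    return quote(str(value), safe="-_.~")


def canonical_query(params):
    """Sort parameters by name and join them as encoded name=value pairs."""
    return "&".join(f"{rfc3986(k)}={rfc3986(v)}"
                    for k, v in sorted(params.items()))


def sign_v2(params, secret_key, method="GET"):
    """Return the base64-encoded HMAC-SHA256 signature of the request."""
    string_to_sign = "\n".join([method, HOST, PATH, canonical_query(params)])
    digest = hmac.new(secret_key.encode("utf-8"),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


params = {
    "Operation": "RunJobFlow",
    "Name": "Test custom JAR",
    "LogUri": "s3://myawsbucket/subdir",
    "Instances.MasterInstanceType": "m1.small",
    "Instances.SlaveInstanceType": "m1.small",
    "Instances.InstanceCount": "4",
    "Instances.Ec2KeyName": "myec2keyname",
    "Instances.Placement.AvailabilityZone": "us-east-1a",
    "Instances.KeepJobFlowAliveWhenNoSteps": "true",
    "Steps.member.1.Name": "MyStepName",
    "Steps.member.1.ActionOnFailure": "CONTINUE",
    "Steps.member.1.HadoopJarStep.Jar":
        "s3://elasticmapreduce/samples/cloudburst/cloudburst.jar",
    "Steps.member.1.HadoopJarStep.MainClass": "MyMainClass",
    "AWSAccessKeyId": "AccessKeyID",
    "SignatureVersion": "2",
    "SignatureMethod": "HmacSHA256",
    "Timestamp": "2009-01-28T21:48:32.000Z",
}

# JAR arguments become numbered Args.member.N parameters.
for i, arg in enumerate(["arg1", "arg2"], start=1):
    params[f"Steps.member.1.HadoopJarStep.Args.member.{i}"] = arg

secret_key = "EXAMPLE-SECRET-KEY"  # placeholder, not a real credential
signed = dict(params, Signature=sign_v2(params, secret_key))
url = f"https://{HOST}{PATH}?{canonical_query(signed)}"
print(url)
```

The signature is computed over the parameters before Signature itself is added, and the final URL percent-encodes every value, including the space-containing job flow name and the timestamp.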