Amazon Elastic MapReduce
Getting Started Guide (API Version 2009-11-30)

Job Flow Essentials

This section provides general information on how to create and manage job flows using the Amazon EMR command line interface (CLI).

Amazon Elastic MapReduce (Amazon EMR) takes care of provisioning an Amazon EC2 cluster, terminating it, moving the data between it and Amazon S3, and optimizing Hadoop. Amazon EMR handles most of the details of setting up the hardware and networking that the cluster requires, including monitoring the setup, configuring Hadoop, and executing the job flow.

Creating a Job Flow

Using the Amazon EMR CLI, you can construct a job flow that will continue to run until you terminate it. This process is useful for debugging. When a step fails, you can add another step to your active job flow without having to incur the shutdown and startup cost of a new job flow.

A job flow is a sequence of one or more steps. Typically, a step performs a relatively simple operation on a very large amount of data and corresponds roughly to one algorithm that manipulates the data. Most job flows consist of multiple steps, and the output of one step often becomes the input of the next.

The following command starts a job flow that consumes resources until you terminate it.

To create a job flow

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --create --alive
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --alive

The output will look similar to:

Created job flow JobFlowID

This command launches a job flow running on a single m1.small instance. The --alive option tells the job flow to keep running even when it has finished all its steps.

A unique job flow ID is assigned to each newly created job flow. You use the job flow ID to identify and manage your job flow.
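
Because later commands need the job flow ID, it can be convenient to capture it when the job flow is created. The following is a minimal sketch, assuming the --create output has the form "Created job flow JobFlowID" as shown above. The function name extract_jobflow_id and the variable JOBFLOW_ID are illustrative and are not part of the Amazon EMR CLI.

```shell
# Sketch: read the output of --create on standard input and print the
# job flow ID. Assumes a line of the form "Created job flow <ID>".
extract_jobflow_id() {
  # The ID is the fourth whitespace-separated field of the matching line.
  awk '/^Created job flow / { print $4; exit }'
}

# Hypothetical usage (requires the Amazon EMR CLI and valid credentials):
#   JOBFLOW_ID=$(./elastic-mapreduce --create --alive | extract_jobflow_id)
#   echo "Started job flow: $JOBFLOW_ID"
```

If the output does not contain the expected line, the function prints nothing, so check that the captured value is not empty before you use it.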

Managing a Job Flow

This section presents several methods to identify and manage your job flows.

List All Amazon EMR Commands

You can use the --help parameter to list all of the commands available in the Amazon EMR CLI.

To list all Amazon EMR commands

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --help
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --help

For more information on each of the Amazon EMR commands, see the Amazon Elastic MapReduce Developer Guide.

List All Job Flows

You can use the --list parameter to list all of your job flows for the past two weeks.

To list all job flows

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --list
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --list

    The response looks similar to the following:

    JobFlowID     STARTING     Development Job Flow (requires manual termination)

For details on job flow states and additional methods to list job flows, see the Amazon Elastic MapReduce Developer Guide.
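
When you have several job flows, you might want only those in a particular state. The following sketch filters the --list output, assuming that each job flow line begins with the job flow ID followed by its state, as in the sample response above. The function name jobflows_in_state is illustrative and is not part of the Amazon EMR CLI.

```shell
# Sketch: read --list output on standard input and print the IDs of the
# job flows whose state matches the first argument.
# Assumes the first field of a job flow line is the ID and the second
# field is the state; other lines are ignored because their second field
# is not a state name.
jobflows_in_state() {
  awk -v state="$1" '$2 == state { print $1 }'
}

# Hypothetical usage (requires the Amazon EMR CLI and valid credentials):
#   ./elastic-mapreduce --list | jobflows_in_state STARTING
```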

Retrieve Information About a Specific Job Flow

You can get information about a job flow using the --describe option and the associated job flow ID.

To get information about your job flow

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --describe --jobflow [JobFlowID]
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --describe --jobflow [JobFlowID]

The response looks similar to the following:

{
  "JobFlows": [
    {
      "Name": "Development Job Flow (requires manual termination)",
      "LogUri": "s3n:\/\/myawsbucket\/FileName\/",
      "ExecutionStatusDetail": {
        "StartDateTime": null,
        "EndDateTime": null,
        "LastStateChangeReason": "Starting instances",
        "CreationDateTime": DateTimeStamp,
        "State": "STARTING",
        "ReadyDateTime": null
      },
      "Steps": [],
      "Instances": {
        "MasterInstanceId": null,
        "Ec2KeyName": "KeyName",
        "NormalizedInstanceHours": 0,
        "InstanceCount": 5,
        "Placement": {
          "AvailabilityZone": "us-east-1a"
        },
        "SlaveInstanceType": "m1.small",
        "HadoopVersion": "0.20",
        "MasterPublicDnsName": null,
        "KeepJobFlowAliveWhenNoSteps": true,
        "InstanceGroups": [
          {
            "StartDateTime": null,
            "SpotPrice": null,
            "Name": "Master Instance Group",
            "InstanceRole": "MASTER",
            "EndDateTime": null,
            "LastStateChangeReason": "",
            "CreationDateTime": DateTimeStamp,
            "LaunchGroup": null,
            "InstanceGroupId": "InstanceGroupID",
            "State": "PROVISIONING",
            "Market": "ON_DEMAND",
            "ReadyDateTime": null,
            "InstanceType": "m1.small",
            "InstanceRunningCount": 0,
            "InstanceRequestCount": 1
          },
          {
            "StartDateTime": null,
            "SpotPrice": null,
            "Name": "Task Instance Group",
            "InstanceRole": "TASK",
            "EndDateTime": null,
            "LastStateChangeReason": "",
            "CreationDateTime": DateTimeStamp,
            "LaunchGroup": null,
            "InstanceGroupId": "InstanceGroupID",
            "State": "PROVISIONING",
            "Market": "ON_DEMAND",
            "ReadyDateTime": null,
            "InstanceType": "m1.small",
            "InstanceRunningCount": 0,
            "InstanceRequestCount": 2
          },
          {
            "StartDateTime": null,
            "SpotPrice": null,
            "Name": "Core Instance Group",
            "InstanceRole": "CORE",
            "EndDateTime": null,
            "LastStateChangeReason": "",
            "CreationDateTime": DateTimeStamp,
            "LaunchGroup": null,
            "InstanceGroupId": "InstanceGroupID",
            "State": "PROVISIONING",
            "Market": "ON_DEMAND",
            "ReadyDateTime": null,
            "InstanceType": "m1.small",
            "InstanceRunningCount": 0,
            "InstanceRequestCount": 2
          }
        ],
        "MasterInstanceType": "m1.small"
      },
      "BootstrapActions": [],
      "JobFlowId": "JobFlowID"
    }
  ]
}

For details on job flow parameter names and values, see the Amazon Elastic MapReduce Developer Guide and the Amazon Elastic MapReduce API Reference.
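
In the sample response above, InstanceCount (5) equals the sum of the InstanceRequestCount values of the master, task, and core instance groups (1 + 2 + 2). The following sketch computes that sum from the --describe output. It assumes one key per line, as in the sample, and is a quick text filter rather than a full JSON parser. The function name requested_instances is illustrative.

```shell
# Sketch: read --describe output on standard input and print the total
# number of instances requested across all instance groups.
# Assumes each "InstanceRequestCount" key is on its own line.
requested_instances() {
  awk -F': *' '
    /"InstanceRequestCount"/ {
      gsub(/[^0-9]/, "", $2)   # keep only the digits of the value
      sum += $2
    }
    END { print sum + 0 }      # print 0 when no groups are found
  '
}

# Hypothetical usage (requires the Amazon EMR CLI and valid credentials):
#   ./elastic-mapreduce --describe --jobflow "$JOBFLOW_ID" | requested_instances
```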

Debugging Job Flows

To use Amazon EMR debugging, you must specify an Amazon S3 bucket location in your credentials.json file. You did this when you set the log_uri parameter in the file you created as part of the Configuring Credentials step.
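
Before you start a job flow that you intend to debug, you can confirm that the log location is set. The following sketch checks that a credentials file contains a non-empty log_uri value. It assumes that the entry is written on a single line as "log_uri": "value". The function name has_log_uri is illustrative.

```shell
# Sketch: succeed (exit status 0) when the file named by the first
# argument contains a non-empty "log_uri" entry.
# Assumes the key and its quoted value appear on the same line.
has_log_uri() {
  grep -q '"log_uri" *: *"[^"][^"]*"' "$1"
}

# Hypothetical usage:
#   if has_log_uri credentials.json; then
#     echo "log_uri is set"
#   else
#     echo "log_uri is missing; log files will not be available in Amazon S3" >&2
#   fi
```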

You access Amazon EMR log files either by using the Amazon EMR console or by viewing them directly from the Amazon S3 console.

Note

A five-minute delay occurs between when the log files stop being written and when they are available on Amazon S3.

Hadoop debugging is also available to identify issues and problems in your job flows. For details on how to enable and configure Hadoop debugging, see the Amazon Elastic MapReduce Developer Guide.

Adding Steps to a Streaming Job Flow

You can add steps to a job flow if the RunJobFlow parameter KeepJobFlowAliveWhenNoSteps is set to true. This value keeps the Amazon EC2 cluster running even after all steps in the job flow have completed successfully. A job flow created with the --alive option, as described in the preceding Creating a Job Flow section, has KeepJobFlowAliveWhenNoSteps set to true. You can verify the value using the --describe --jobflow [JobFlowID] command. To identify your job flow ID, refer to the preceding Retrieve Information About a Specific Job Flow section.
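
You can automate the check of KeepJobFlowAliveWhenNoSteps before you add a step. The following sketch tests the --describe output for the value true, assuming the key and value appear on one line as in the sample response shown earlier. The function name is_kept_alive and the variable JOBFLOW_ID are illustrative.

```shell
# Sketch: read --describe output on standard input and succeed
# (exit status 0) when KeepJobFlowAliveWhenNoSteps is reported as true.
is_kept_alive() {
  grep -q '"KeepJobFlowAliveWhenNoSteps": *true'
}

# Hypothetical usage (requires the Amazon EMR CLI and valid credentials):
#   if ./elastic-mapreduce --describe --jobflow "$JOBFLOW_ID" | is_kept_alive; then
#     ./elastic-mapreduce -j "$JOBFLOW_ID" --stream
#   else
#     echo "Job flow $JOBFLOW_ID is not kept alive; steps cannot be added" >&2
#   fi
```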

To add a step using default parameter values to a job flow

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce -j JobFlowID --stream
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce -j JobFlowID --stream

The --stream option adds a streaming step using default parameters. Hadoop streaming is a feature of Hadoop that lets you create and run job flows using any executable program or script as Hadoop mappers and reducers. You can view the step you just added from either the CLI or the Amazon EMR console.

To view a job flow from the Amazon EMR console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Refresh.

  3. Click the job flow with the added step.

  4. In the Details pane at the bottom of the window, click the Steps tab.

Information about the step you added is displayed in the Steps tab.

Terminate a Job Flow

When you finish working with a job flow, terminate it so that you are no longer charged for using AWS resources.

To terminate a job flow

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --terminate JobFlowID
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --terminate JobFlowID
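
Because a job flow created with --alive consumes resources until you terminate it, it can help to pair creation and termination in one script. The following sketch creates a job flow, passes its ID to a command that you supply, and terminates the job flow when that command returns. The emr wrapper, the with_jobflow function, and the path ./elastic-mapreduce are assumptions for illustration; only the options shown in this section are used.

```shell
# Wrapper around the CLI. The path is an assumption about where the
# Amazon EMR CLI is installed; adjust it for your environment.
emr() {
  ./elastic-mapreduce "$@"
}

# Sketch: create a job flow, run the given command with the job flow ID
# appended as its last argument, then terminate the job flow.
with_jobflow() {
  jobflow_id=$(emr --create --alive | awk '/^Created job flow / { print $4; exit }')
  if [ -z "$jobflow_id" ]; then
    echo "Could not read a job flow ID from the --create output" >&2
    return 1
  fi

  # Run the caller's command and remember its exit status.
  status=0
  "$@" "$jobflow_id" || status=$?

  # Terminate the job flow whether or not the command succeeded.
  emr --terminate "$jobflow_id"
  return "$status"
}

# Hypothetical usage: my_debug_session is a function or script of your own
# that receives the job flow ID and returns when your work is done.
#   with_jobflow my_debug_session
```

The job flow is terminated as soon as the supplied command returns, so that command should not return until any steps it added have finished.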

Congratulations! You have successfully created and terminated an Amazon EMR job flow and learned about a few of the options available to you.

Now that you know how to create, debug, and terminate a job flow, you are ready to process actual data.

Click one of the following buttons to create either a streaming job flow or a job flow using Hive.

Streaming Job Flow
Job Flow Using Hive