Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Add More than 256 Steps to a Job Flow

Amazon Elastic MapReduce (Amazon EMR) currently limits the number of steps in a job flow to 256. If your job flow is long-running (such as a Hive data warehouse) or complex, you may require more than 256 steps to process your data.

You can employ several methods to get around this limitation:

  1. Have each step submit several jobs to Hadoop. This does not allow you unlimited steps, but it is the easiest solution if you need a fixed number of steps greater than 256.

  2. Write a workflow program that runs in a step on a long-running job flow and submits jobs to Hadoop. You could have the workflow program either:

    • Listen to an Amazon SQS queue to receive information about new steps to run.

    • Check an Amazon S3 bucket on a regular schedule for files containing information about the new steps to run.

  3. Write a workflow program that runs on an Amazon EC2 instance outside of Amazon EMR and submits jobs to your job flows using SSH.

  4. Manually SSH into the master node and submit jobs to Hadoop.
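As a rough illustration of the second approach, the following Python sketch shows the core loop of a workflow program that runs on the master node. The names workflow_loop, fetch_next, and run_hadoop_job are hypothetical and not part of Amazon EMR. fetch_next stands in for whatever code reads your Amazon SQS queue or Amazon S3 bucket, and the step format (a list of arguments for the hadoop command) is an assumption made for the example.

```python
import subprocess
import time
from typing import Callable, List, Optional, Tuple

# A step description here is simply the argument list passed to the
# `hadoop` command, for example ["jar", "myjar.jar"]. This format is an
# assumption made for the sketch.
StepArgs = List[str]


def run_hadoop_job(args: StepArgs) -> int:
    """Run `hadoop <args>` on the local node and return its exit code."""
    return subprocess.call(["hadoop"] + args)


def workflow_loop(
    fetch_next: Callable[[], Optional[StepArgs]],
    submit: Callable[[StepArgs], int] = run_hadoop_job,
    poll_seconds: float = 30.0,
    max_idle_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[StepArgs, int]]:
    """Poll for step descriptions and submit each one to Hadoop in order.

    fetch_next returns the next step, or None when nothing is waiting.
    max_idle_polls=None keeps polling forever; a number stops the loop
    after that many consecutive empty polls (useful for testing).
    """
    results: List[Tuple[StepArgs, int]] = []
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        step = fetch_next()
        if step is None:
            # Nothing to do yet: wait before checking again.
            idle_polls += 1
            sleep(poll_seconds)
            continue
        idle_polls = 0
        results.append((step, submit(step)))
    return results
```

Because the loop itself runs inside a single step, the jobs it submits do not count against the 256-step limit. A real program would also need to decide what to do when a job returns a nonzero exit code.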

You can add more steps to a job flow by using the SSH shell to connect to the master node and submitting queries directly to the software running on the master node, such as Hive and Hadoop.

You can SSH directly into the master node using a conventional SSH connection, as outlined in How to Monitor Hadoop on a Master Node. Alternatively, you can use the --ssh command-line argument to pass commands in, which saves you from establishing a new SSH connection yourself.

CLI

To manually submit steps to Hadoop on the master node

  • From a terminal or command-line window, call the CLI client, specifying the --ssh parameter, and set its value to the command you want to run on the master node. The CLI uses its connection to the master node to run the command.

    elastic-mapreduce --jobflow JobFlowID --scp myjar.jar \
    --ssh "hadoop jar myjar.jar"
    				

    The preceding example uses the --scp parameter to copy the JAR file myjar.jar from your local directory to the master node of job flow JobFlowID. It then uses the --ssh parameter to have Hadoop on the master node run myjar.jar.
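    If you need to submit many JAR files this way, the call can be scripted. The following Python sketch builds the same command line as the preceding example for each JAR in a list. The helper names build_cli_command and submit_jars are hypothetical. The sketch assumes that elastic-mapreduce is on your PATH and that each copied file can be run by its file name on the master node, as myjar.jar is in the example.

```python
import os
import subprocess
from typing import Callable, Iterable, List


def build_cli_command(jobflow_id: str, local_jar: str) -> List[str]:
    """Return the argument list that copies a JAR and runs it with Hadoop."""
    # Assumption: the copied file keeps its base name on the master node.
    jar_name = os.path.basename(local_jar)
    return [
        "elastic-mapreduce",
        "--jobflow", jobflow_id,
        "--scp", local_jar,
        "--ssh", "hadoop jar " + jar_name,
    ]


def submit_jars(
    jobflow_id: str,
    local_jars: Iterable[str],
    run: Callable[[List[str]], object] = subprocess.check_call,
) -> int:
    """Submit each JAR in order and return how many were submitted.

    With the default runner, a failing command raises CalledProcessError
    and stops the remaining submissions.
    """
    count = 0
    for jar in local_jars:
        run(build_cli_command(jobflow_id, jar))
        count += 1
    return count
```

    Passing the command as an argument list keeps the --ssh value as a single argument, which is what the quotation marks do in the shell example.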

CLI

To manually submit queries to Hive on the master node

  1. If Hive is not already installed, use the following command to install it.

    elastic-mapreduce --jobflow JobFlowID --hive-interactive
               	    
  2. Create a Hive script file containing the query or command you wish to run. The following example script creates two tables, aTable and anotherTable, and copies the contents of one table to another, replacing all data.

    ---- sample Hive script file: my-hive.q ----
    create table aTable (aColumn string) ;
    create table anotherTable like aTable;
    insert overwrite table anotherTable select * from aTable;
    			
  3. Call the CLI client, specifying the --scp parameter to copy the script file to the master node and the --ssh parameter set to the hive command that runs the script. The CLI uses its connection to the master node and your .pem credentials file to run the command.

    elastic-mapreduce --jobflow JobFlowID --scp my-hive.q \
    --ssh "hive -f my-hive.q"
    				

    The preceding example copies the script file my-hive.q to the master node of the JobFlowID job flow and has Hive run the queries it contains.

To manually submit tasks based on Python files to Hadoop while connected using SSH

  • Use the Hadoop streaming jar, as shown in the example below.

    hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
      -input s3n://elasticmapreduce/samples/wordcount/input \
      -output hdfs:///rubish/1 \
      -mapper s3n://elasticmapreduce/samples/wordcount/wordSplitter.py \
      -reducer aggregate
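    The aggregate reducer used in this example relies on the mapper to say how each key should be combined, by prefixing the key with an aggregator name such as LongValueSum. The following Python sketch is a minimal word-splitting mapper written in that style. It is an illustration only and is not necessarily identical to the wordSplitter.py sample; the names WORD_PATTERN, map_lines, and main are hypothetical.

```python
import re
import sys
from typing import Iterable, Iterator, TextIO

# Assumption for this sketch: a word is a letter followed by letters or digits.
WORD_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'LongValueSum:<word>\\t1' record per word in the input."""
    for line in lines:
        for word in WORD_PATTERN.findall(line):
            # The LongValueSum prefix asks the aggregate reducer to add up
            # the values emitted for this key.
            yield "LongValueSum:" + word.lower() + "\t1"


def main(stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Read input lines from stdin and write mapper records to stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")

# When saving this as a streaming mapper script, call main() at the end
# of the file so that Hadoop streaming can pipe input through it.
```

    Hadoop streaming feeds each input line to the mapper on standard input and reads tab-separated key and value pairs from standard output, so the mapper needs no Hadoop-specific imports.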