Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Bootstrap Actions

Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data.

Bootstrap Action Basics

Bootstrap actions execute as the Hadoop user by default. A bootstrap action can execute with root privileges if you use sudo.

[Note]Note

If the bootstrap action returns a nonzero error code, Amazon Elastic MapReduce (Amazon EMR) treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, then Amazon EMR terminates the job flow. If just a few instances fail, then an attempt is made to reallocate the failed instances and continue. Refer to the job flow lastStateChangeReason error code to identify failures caused by a bootstrap action.

All three Amazon EMR interfaces support bootstrap actions. You can specify up to 16 bootstrap actions per job flow by providing multiple --bootstrap-action parameters from the CLI or API.

From the CLI, references to bootstrap action scripts are passed to Elastic MapReduce by adding the bootstrap-action parameter after the create parameter. The syntax for a bootstrap-action parameter is as follows:

--bootstrap-action "s3://myawsbucket/FileName" --args "arg1","arg2"            

From the Amazon EMR console, you can specify a bootstrap action optionally while creating a job flow on the Bootstrap Actions page in the Job Flow Creation Wizard.

For more information on how to reference a bootstrap action from the API, go to the Amazon Elastic MapReduce API Reference.

Using Predefined Bootstrap Actions

Amazon provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. This section describes the available predefined bootstrap actions. References to predefined bootstrap action scripts are passed to Elastic MapReduce by using the bootstrap-action parameter.

You can specify up to 16 bootstrap actions per job flow by providing multiple bootstrap-action parameters.

Configure Daemons

This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-daemons.

The following example sets the heap size to 2048 and configures the Java namenode option.

Example

$ ./elastic-mapreduce –create –alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
  --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19   

Configure Hadoop

This bootstrap action allows you to set cluster-wide Hadoop settings. This script provides two types of command line options:

  • Option 1—Enables you to upload an XML file containing configuration settings to Amazon S3. The bootstrap action merges the new configuration settings with the existing Hadoop configuration.

  • Option 2—Allows you to specify a Hadoop key value pair from the command line that overrides the existing Hadoop configuration.

The location of the script is s3://elasticmapreduce/bootstrap-actions/configure-hadoop.

The following example demonstrates how to change the configuration for the maximum number of map tasks in the hadoop-config-file.xml file.

Example

$ ./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop
--args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"       

The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.

[Note]Note

The configuration file you supply in the Amazon S3 bucket must be a valid Hadoop configuration file, for example:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
  <configuration>
  <property><name>mapred.userlog.retain.hours</name><value>4</value></property>
  </configuration> 

The configuration file for Hadoop 0.18 is hadoop-site.xml. In Hadoop 0.20 and later, the old configuration file is replaced with three new files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

For Hadoop 0.18, the name and location of the configuration file is /conf/hadoop-site.xml. The default hadoop-site.xml properties are as follows.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
  <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
  <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
  <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
  <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
  <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
  <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
  <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
  <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
  <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
  <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
  <property><name>dfs.namenode.handler.count</name><value>20</value></property>
  <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
  <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
  <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>io.sort.factor</name><value>40</value></property>
  <property><name>fs.default.name</name><value>hdfs://domU-12-31-39-06-7E-53.compute-1.internal:9000</value></property>
  <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
  <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
  <property><name>tasktracker.http.threads</name><value>20</value></property>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
  <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
  <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
  <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
  <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>io.file.buffer.size</name><value>65536</value></property>
  <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
  <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
  <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
  <property><name>mapred.job.tracker</name><value>domU-12-31-39-06-7E-53.compute-1.internal:9001</value></property>
  <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
  <property><name>io.sort.mb</name><value>150</value></property>
  <property><name>hadoop.job.history.user.location</name><value>none</value></property>
  <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
  <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
</configuration>			

In Hadoop 0.20, the configuration file names and locations are core-site.xml, hdfs-site.xml, and mapred-site.xml.

The default core-site.xml properties are as follows.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property>
  <property><name>fs.default.name</name><value>hdfs://ip-10-116-159-127.ec2.internal:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property>
  <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property>
  <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value></property>
  <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property>
  <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property>
  <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property>
  <property><name>io.compression.codec.lzo.class</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
  <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property>
</configuration>

The default hdfs-site.xml properties are listed below.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property>
  <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property>
  <property><name>dfs.namenode.handler.count</name><value>20</value></property>
  <property><name>io.file.buffer.size</name><value>65536</value></property>
  <property><name>dfs.block.size</name><value>134217728</value></property>
  <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property>
  <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property>
  <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property>
  <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property>
  <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property>
  <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property>
  <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property>
</configuration>

The default mapred-site.xml properties are listed below.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property>
  <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property>
  <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property>
  <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property>
  <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property>
  <property><name>mapred.userlog.retain.hours</name><value>48</value></property>
  <property><name>mapred.job.reuse.jvm.num.tasks</name><value>20</value></property>
  <property><name>io.sort.factor</name><value>40</value></property>
  <property><name>mapred.reduce.tasks</name><value>1</value></property>
  <property><name>tasktracker.http.threads</name><value>20</value></property>
  <property><name>mapred.reduce.parallel.copies</name><value>20</value></property>
  <property><name>hadoop.job.history.user.location</name><value>none</value></property>
  <property><name>mapred.job.tracker.handler.count</name><value>20</value></property>
  <property><name>mapred.map.output.compression.codec</name><value>com.hadoop.compression.lzo.LzoCodec</value></property>
  <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
  <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property>
  <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property>
  <property><name>mapred.compress.map.output</name><value>true</value></property>
  <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property>
  <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property>
  <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property>
  <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property>
  <property><name>mapred.job.tracker</name><value>ip-10-116-159-127.ec2.internal:9001</value></property>
  <property><name>io.sort.mb</name><value>150</value></property>
</configuration>

Configure Memory-Intensive Workloads

This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads.

The following Hadoop configuration parameters are set:

Parameters modified in hadoop.env.sh

  • HADOOP_JOBTRACKER_HEAPSIZE

  • HADOOP_NAMENODE_HEAPSIZE

  • HADOOP_TASKTRACKER_HEAPSIZE

  • HADOOP_DATANODE_HEAPSIZE

Parameters modified in mapred-site.xml

  • mapred.child.java.opts

  • mapred.tasktracker.map.tasks.maximum

  • mapred.tasktracker.reduce.tasks.maximum

The bootstrap script is located at s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive.

The default configurations for cc1.4xlarge, cc2.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.

For information about the configuration values for each supported Amazon EC2 instance type, see Hadoop Memory-Intensive Configuration Settings.

The following example creates a default job flow with the memory-intensive bootstrap action. The bootstrap action modifies the Hadoop cluster configuration settings to the recommended configuration for an Amazon EC2 m1.small instance.

Example

$ ./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive

Run If

You can use this predefined bootstrap action to conditionally run a command when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that MapReduce can download and execute.

The location of the script is s3://elasticmapreduce/bootstrap-actions/run-if.

The following example echoes the string running on master node if the node is a master.

Example

$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if 
--args "instance.isMaster=true echo running on master node"

Shutdown Actions

A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a job flow is terminated, all the scripts in this directory are executed in parallel. Each script must run and complete within 60 seconds.

[Note]Note

Shutdown action scripts are not guaranteed to run if the node terminates with an error.

Using Custom Bootstrap Actions

In addition to predefined bootstrap action, you can write a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.

Running Custom Bootstrap Actions from the CLI

The following example uses a bootstrap action script to download and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at: http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

The sample script looks like the following:

#!/bin/bash
set -e
bucket=elasticmapreduce
path=samples/bootstrap-actions/file.tar.gz
wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path
mkdir -p /home/hadoop/contents
tar -C /home/hadoop/contents -xzf file.tar.gz                         

To create a job flow with a custom bootstrap action

  • Create the job flow.

    If you are using...Enter the following...
    Linux or UNIX
    & ./elastic-mapreduce --create --stream --alive \
    --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --output s3n://myawsbucket 
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh  
    Microsoft Windowsc:\ ruby elastic-mapreduce --create --stream --alive \ --input s3n://elasticmapreduce/samples/wordcount/input \ --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \ --output s3n://myawsbucket --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

Running Custom Bootstrap Actions from the Amazon EMR Console

The example in the following procedure creates a predefined word count sample job flow with a bootstrap action script that downloads and extracts a compressed tar archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.

To create a job flow with a custom bootstrap action

  1. Start a new job flow:

    1. From the Amazon EMR console select a Region.

    2. Click Create a New Job Flow.

      The Create a New Job Flow page appears.

  2. In the DEFINE JOB FLOW page, enter the following information:

    1. Enter a name in the Job Flow Name field.

      We recommend that you use a descriptive name. It does not need to be unique.

    2. Select Run a sample application.

    3. Select Word Count (Streaming) from the menu and click Continue.

    Create a New Job Flow: Define Job Flow
  3. In the SPECIFY PARAMETERS page, replace the <myawsbucket> text in the Output Location text field with the name of a valid Amazon S3 bucket an then click Continue.

  4. On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.

  5. On the ADVANCED OPTIONS page, accept the default parameters and click Continue.

  6. On the BOOTSTRAP ACTIONS page, select Configure your Bootstrap Actions.

    Enter the following information:

    1. Select Custom Action from the Action Type drop-down list box.

    2. Enter the following text in the Amazon S3 Location text box:

      s3://elasticmapreduce/bootstrap-actions/download.sh
    3. Click Continue.

  7. In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct.

    After you click Create Job Flow your request is processed; when it succeeds, a message appears.

    Amazon EMR console
  8. Click Close.

    The Amazon EMR console shows the new job flow starting.

    Amazon EMR console

While the job flow master node is running, you can connect to the master node and see the log files the that the bootstrap action script generated stored in the /mnt/var/log/bootstrap-actions/1 directory.