| Did this page help you? Yes No Tell us about it... |
Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data.
Bootstrap actions execute as the Hadoop user by default. A bootstrap action can execute
with root privileges if you use sudo.
![]() | Note |
|---|---|
If the bootstrap action returns a nonzero error code, Amazon Elastic MapReduce (Amazon EMR) treats it as
a failure and terminates the instance. If too many instances fail their bootstrap
actions, then Amazon EMR terminates the job flow. If just a few instances fail,
then an attempt is made to reallocate the failed instances and continue. Refer to the
job flow |
All three Amazon EMR interfaces support bootstrap actions. You can specify up to
16 bootstrap actions per job flow by providing multiple
--bootstrap-action parameters from the CLI or API.
From the CLI, references to bootstrap action scripts are passed to Elastic MapReduce by
adding the bootstrap-action parameter after the
create parameter. The syntax for a
bootstrap-action parameter is as follows:
--bootstrap-action "s3://myawsbucket/FileName" --args "arg1","arg2"
From the Amazon EMR console, you can specify a bootstrap action optionally while creating a job flow on the Bootstrap Actions page in the Job Flow Creation Wizard.
For more information on how to reference a bootstrap action from the API, go to the Amazon Elastic MapReduce API Reference.
Amazon provides a number of predefined bootstrap action scripts that you can use to
customize Hadoop settings. This section describes the available predefined bootstrap
actions. References to predefined bootstrap action scripts are passed to Elastic MapReduce
by using the bootstrap-action parameter.
You can specify up to 16 bootstrap actions per job flow by providing multiple
bootstrap-action parameters.
This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.
The location of the script is
s3://elasticmapreduce/bootstrap-actions/configure-daemons.
The following example sets the heap size to 2048 and configures the Java
namenode option.
Example
$ ./elastic-mapreduce –create –alive \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \ --args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
This bootstrap action allows you to set cluster-wide Hadoop settings. This script provides two types of command line options:
Option 1—Enables you to upload an XML file containing configuration settings to Amazon S3. The bootstrap action merges the new configuration settings with the existing Hadoop configuration.
Option 2—Allows you to specify a Hadoop key value pair from the command line that overrides the existing Hadoop configuration.
The location of the script is
s3://elasticmapreduce/bootstrap-actions/configure-hadoop.
The following example demonstrates how to change the configuration for the maximum
number of map tasks in the hadoop-config-file.xml file.
Example
$ ./elastic-mapreduce --create \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.tasktracker.map.tasks.maximum=2"
The configuration options are applied in the order described in the bootstrap action script. Settings specified later in the sequence override those specified earlier.
![]() | Note |
|---|---|
The configuration file you supply in the Amazon S3 bucket must be a valid Hadoop configuration file, for example: <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>mapred.userlog.retain.hours</name><value>4</value></property> </configuration> |
The configuration file for Hadoop 0.18 is hadoop-site.xml. In
Hadoop 0.20 and later, the old configuration file is replaced with three new files:
core-site.xml, mapred-site.xml, and
hdfs-site.xml.
For Hadoop 0.18, the name and location of the configuration file is
/conf/hadoop-site.xml. The default
hadoop-site.xml properties are as follows.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property> <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property> <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property> <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property> <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property> <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property> <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property> <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property> <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property> <property><name>mapred.userlog.retain.hours</name><value>48</value></property> <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property> <property><name>dfs.namenode.handler.count</name><value>20</value></property> <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property> <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property> <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>io.sort.factor</name><value>40</value></property> <property><name>fs.default.name</name><value>hdfs://domU-12-31-39-06-7E-53.compute-1.internal:9000</value></property> <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property> <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property> <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property> <property><name>mapred.reduce.parallel.copies</name><value>20</value></property> <property><name>tasktracker.http.threads</name><value>20</value></property> <property><name>mapred.reduce.tasks</name><value>1</value></property> <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property> <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property> <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property> <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>io.file.buffer.size</name><value>65536</value></property> <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property> <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property> <property><name>dfs.block.size</name><value>134217728</value></property> <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property> <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property> <property><name>mapred.job.tracker</name><value>domU-12-31-39-06-7E-53.compute-1.internal:9001</value></property> <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property> <property><name>io.sort.mb</name><value>150</value></property> <property><name>hadoop.job.history.user.location</name><value>none</value></property> <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property> <property><name>dfs.replication</name><value>1</value></property> <property><name>mapred.job.tracker.handler.count</name><value>20</value></property> <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property> </configuration>
In Hadoop 0.20, the configuration file names and locations are
core-site.xml, hdfs-site.xml, and
mapred-site.xml.
The default core-site.xml properties are as follows.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value></property> <property><name>fs.default.name</name><value>hdfs://ip-10-116-159-127.ec2.internal:9000</value></property> <property><name>hadoop.tmp.dir</name><value>/mnt/var/lib/hadoop/tmp</value></property> <property><name>fs.s3n.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>fs.s3n.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>fs.s3.buffer.dir</name><value>/mnt/var/lib/hadoop/s3</value></property> <property><name>io.compression.codecs</name><value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value></property> <property><name>fs.s3bfs.impl</name><value>org.apache.hadoop.fs.s3.S3FileSystem</value></property> <property><name>fs.s3.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>fs.s3bfs.awsAccessKeyId</name><value>AccessKeyID</value></property> <property><name>fs.s3bfs.awsSecretAccessKey</name><value>SecretKey</value></property> <property><name>io.compression.codec.lzo.class</name><value>com.hadoop.compression.lzo.LzoCodec</value></property> <property><name>fs.s3.awsAccessKeyId</name><value>AccessKeyID</value></property> </configuration>
The default hdfs-site.xml properties are listed below.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>dfs.datanode.https.address</name><value>0.0.0.0:9402</value></property> <property><name>dfs.datanode.du.reserved</name><value>536870912</value></property> <property><name>dfs.namenode.handler.count</name><value>20</value></property> <property><name>io.file.buffer.size</name><value>65536</value></property> <property><name>dfs.block.size</name><value>134217728</value></property> <property><name>dfs.data.dir</name><value>/mnt/var/lib/hadoop/dfs</value></property> <property><name>dfs.replication</name><value>1</value></property> <property><name>dfs.secondary.http.address</name><value>0.0.0.0:9104</value></property> <property><name>dfs.https.address</name><value>0.0.0.0:9202</value></property> <property><name>dfs.http.address</name><value>0.0.0.0:9101</value></property> <property><name>dfs.datanode.http.address</name><value>0.0.0.0:9102</value></property> <property><name>dfs.datanode.address</name><value>0.0.0.0:9200</value></property> <property><name>dfs.name.dir</name><value>/mnt/var/lib/hadoop/dfs-name</value></property> <property><name>dfs.datanode.ipc.address</name><value>0.0.0.0:9201</value></property> </configuration>
The default mapred-site.xml properties are listed below.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>mapred.output.committer.class</name><value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value></property> <property><name>mapred.reduce.tasks.speculative.execution</name><value>false</value></property> <property><name>mapred.tasktracker.map.tasks.maximum</name><value>2</value></property> <property><name>mapred.task.tracker.http.address</name><value>0.0.0.0:9103</value></property> <property><name>mapred.map.tasks.speculative.execution</name><value>true</value></property> <property><name>mapred.userlog.retain.hours</name><value>48</value></property> <property><name>mapred.job.reuse.jvm.num.tasks</name><value>20</value></property> <property><name>io.sort.factor</name><value>40</value></property> <property><name>mapred.reduce.tasks</name><value>1</value></property> <property><name>tasktracker.http.threads</name><value>20</value></property> <property><name>mapred.reduce.parallel.copies</name><value>20</value></property> <property><name>hadoop.job.history.user.location</name><value>none</value></property> <property><name>mapred.job.tracker.handler.count</name><value>20</value></property> <property><name>mapred.map.output.compression.codec</name><value>com.hadoop.compression.lzo.LzoCodec</value></property> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property> <property><name>mapred.reduce.tasksperslot</name><value>1.75</value></property> <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1</value></property> <property><name>mapred.compress.map.output</name><value>true</value></property> <property><name>mapred.output.compression.codec</name><value>org.apache.hadoop.io.compress.GzipCodec</value></property> <property><name>mapred.job.tracker.http.address</name><value>0.0.0.0:9100</value></property> <property><name>mapred.child.java.opts</name><value>-Xmx725m</value></property> <property><name>mapred.local.dir</name><value>/mnt/var/lib/hadoop/mapred</value></property> <property><name>mapred.job.tracker</name><value>ip-10-116-159-127.ec2.internal:9001</value></property> <property><name>io.sort.mb</name><value>150</value></property> </configuration>
This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads.
The following Hadoop configuration parameters are set:
Parameters modified in hadoop.env.sh
HADOOP_JOBTRACKER_HEAPSIZE
HADOOP_NAMENODE_HEAPSIZE
HADOOP_TASKTRACKER_HEAPSIZE
HADOOP_DATANODE_HEAPSIZE
Parameters modified in mapred-site.xml
mapred.child.java.opts
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
The bootstrap script is located at
s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive.
The default configurations for cc1.4xlarge, cc2.8xlarge, and cg1.4xlarge instances are sufficient for memory-intensive workloads. This bootstrap action does not modify the settings for these instance types.
For information about the configuration values for each supported Amazon EC2 instance type, see Hadoop Memory-Intensive Configuration Settings.
The following example creates a default job flow with the memory-intensive bootstrap action. The bootstrap action modifies the Hadoop cluster configuration settings to the recommended configuration for an Amazon EC2 m1.small instance.
Example
$ ./elastic-mapreduce --create \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
You can use this predefined bootstrap action to conditionally run a command when an
instance-specific value is found in the instance.json or
job-flow.json files. The command can refer to a file in Amazon
S3 that MapReduce can download and execute.
The location of the script is
s3://elasticmapreduce/bootstrap-actions/run-if.
The following example echoes the string running on master node if the node is a master.
Example
$ ./elastic-mapreduce --create --alive \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/run-if --args "instance.isMaster=true echo running on master node"
A bootstrap action script can create one or more shutdown
actions by writing scripts to the
/mnt/var/lib/instance-controller/public/shutdown-actions/ directory.
When a job flow is terminated, all the scripts in this directory are executed in
parallel. Each script must run and complete within 60 seconds.
![]() | Note |
|---|---|
Shutdown action scripts are not guaranteed to run if the node terminates with an error. |
Topics
In addition to predefined bootstrap action, you can write a custom script to perform a customized bootstrap action. Any of the Amazon EMR interfaces can reference a custom bootstrap action.
The following example uses a bootstrap action script to download and extracts a compressed TAR archive from Amazon S3. The sample script is stored in Amazon S3 at: http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.
The sample script looks like the following:
#!/bin/bash set -e bucket=elasticmapreduce path=samples/bootstrap-actions/file.tar.gz wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path mkdir -p /home/hadoop/contents tar -C /home/hadoop/contents -xzf file.tar.gz
To create a job flow with a custom bootstrap action
Create the job flow.
| If you are using... | Enter the following... |
|---|---|
| Linux or UNIX |
& ./elastic-mapreduce --create --stream --alive \ --input |
| Microsoft Windows | c:\ ruby elastic-mapreduce --create --stream --alive \ --input
s3n://elasticmapreduce/samples/wordcount/input
\ --mapper
s3://elasticmapreduce/samples/wordcount/wordSplitter.py
\ --output s3n://myawsbucket
--bootstrap-action
"s3://elasticmapreduce/bootstrap-actions/download.sh"
|
The example in the following procedure creates a predefined word count sample job flow with a bootstrap action script that downloads and extracts a compressed tar archive from Amazon S3. The sample script is stored in Amazon S3 at http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh.
To create a job flow with a custom bootstrap action
Start a new job flow:
From the Amazon EMR console select a Region.

Click Create a New Job Flow.
The Create a New Job Flow page appears.
In the DEFINE JOB FLOW page, enter the following information:
Enter a name in the Job Flow Name field.
We recommend that you use a descriptive name. It does not need to be unique.
Select Run a sample application.
Select Word Count (Streaming) from the menu and click Continue.

In the SPECIFY PARAMETERS page, replace the
<myawsbucket> text in the
Output Location text field with the name of a valid
Amazon S3 bucket an then click Continue.

On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.
On the ADVANCED OPTIONS page, accept the default parameters and click Continue.
On the BOOTSTRAP ACTIONS page, select Configure your Bootstrap Actions.
Enter the following information:
Select Custom Action from the Action Type drop-down list box.
Enter the following text in the Amazon S3 Location text box:
s3://elasticmapreduce/bootstrap-actions/download.sh
Click Continue.

In the REVIEW page, review the information, edit as necessary to correct any of the values, and then click Create Job Flow when the information is correct.
After you click Create Job Flow your request is processed; when it succeeds, a message appears.

Click Close.
The Amazon EMR console shows the new job flow starting.

While the job flow master node is running, you can connect to the master node and see
the log files the that the bootstrap action script generated stored in the
/mnt/var/log/bootstrap-actions/1 directory.
Related Topics