Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Supported Hadoop Versions

You can run one of three Hadoop versions. Set the version with the --hadoop-version parameter, together with the matching --ami-version parameter, as shown in the following table.

Hadoop Version    Configuration Parameters
0.18*             --hadoop-version 0.18 --ami-version 1.0
0.20              --hadoop-version 0.20 --ami-version 1.0
0.20.205*         --hadoop-version 0.20.205 --ami-version 2.0

*The default configuration depends on the tool you use. The Amazon EMR console and copies of the CLI downloaded after 11 December 2011 default to the latest AMI version. The SDK, the API, and copies of the CLI downloaded before 11 December 2011 default to AMI version 1.0 with Hadoop 0.18. For details about the configuration and software available on the AMIs used by Amazon Elastic MapReduce (Amazon EMR), see Specify the Amazon EMR AMI Version.

Note

We recommend using Hadoop 0.20.205 to take advantage of the latest performance enhancements and functionality.

Specify the Hadoop version when you create the job flow, as in the following example:

$ ./elastic-mapreduce --create --alive --name "Test Hadoop" \
  --hadoop-version 0.20.205 \
  --num-instances 5 --instance-type m1.small

This command creates a waiting job flow running Hadoop 0.20.205.
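Because each Hadoop version in the table above is paired with a specific AMI version, it can help to derive one from the other rather than typing both by hand. The following is a minimal sketch under that assumption. The function names ami_for_hadoop and create_cmd are hypothetical, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Map a Hadoop version to the AMI version paired with it in the table above.
ami_for_hadoop() {
  case "$1" in
    0.18|0.20) echo "1.0" ;;
    0.20.205)  echo "2.0" ;;
    *)         echo "unsupported Hadoop version: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) a create command for the given Hadoop version.
# The job flow name and instance settings mirror the example above.
create_cmd() {
  ami=$(ami_for_hadoop "$1") || return 1
  echo "./elastic-mapreduce --create --alive --name \"Test Hadoop\"" \
       "--hadoop-version $1 --ami-version $ami" \
       "--num-instances 5 --instance-type m1.small"
}

create_cmd 0.20.205
```

Printing the command first makes it easy to review the version pairing before you launch billable instances.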

Hadoop 0.18 was not designed to handle large numbers of small files efficiently. The following enhancements in Hadoop 0.20 and later improve the performance of processing small files:

  • Hadoop 0.20 and later assigns multiple tasks per heartbeat. A heartbeat is a periodic message that each slave node sends to the master node to report that it is still alive and ready for work. By assigning multiple tasks in response to each heartbeat, Hadoop distributes tasks to slave nodes faster. This improves performance because task distribution accounts for a significant share of total processing time.

  • Historically, Hadoop processed each task in its own Java Virtual Machine (JVM). If you have many small files that each take only a second to process, starting a new JVM for every task adds substantial overhead. Hadoop 0.20 and later can share one JVM among multiple tasks, which significantly reduces your processing time.

  • Hadoop 0.20 and later allows you to process multiple files in a single map task, which reduces the per-task setup overhead when a job reads many small files.
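JVM sharing is controlled per job. As a minimal sketch, and assuming the standard Hadoop 0.20 property mapred.job.reuse.jvm.num.tasks (not named in this guide; 1 runs one task per JVM and -1 removes the limit), the following hypothetical helper builds the matching -D option and rejects values that make no sense.

```shell
#!/bin/sh
# Build the -D option that sets how many tasks may share one JVM.
# Assumption: mapred.job.reuse.jvm.num.tasks is the Hadoop 0.20 property name.
jvm_reuse_opt() {
  case "$1" in
    -1) ;;                      # -1 means no limit on tasks per JVM
    ''|*[!0-9]*|0*)             # reject empty, non-numeric, or zero values
      echo "expected -1 or a positive integer, got: $1" >&2
      return 1 ;;
  esac
  printf '%s\n' "-D mapred.job.reuse.jvm.num.tasks=$1"
}

jvm_reuse_opt -1
```

The printed option is intended to be passed on the job's command line with the -D parameter described later in this section.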

Hadoop 0.20 and later also supports the following features:

  • A new command line option, -libjars, enables you to include a specified JAR file in the class path of every task.

  • The ability to skip individual records rather than entire files. In previous versions of Hadoop, a failure in record processing caused the entire file containing the bad record to be skipped. Jobs that previously failed can now return partial results.
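To illustrate the -libjars option, here is a minimal sketch. It assumes that -libjars accepts a comma-separated list when more than one JAR is needed and that the job's main class parses Hadoop's generic options. The names myjob.jar, com.example.MyJob, and the lib/ paths are placeholders, and the script only prints the command.

```shell
#!/bin/sh
# Join JAR paths into the comma-separated list passed to -libjars.
join_jars() {
  list=""
  for jar in "$@"; do
    list="${list:+$list,}$jar"   # add a comma only between entries
  done
  printf '%s\n' "$list"
}

# Print (do not run) a job command; all names here are placeholders.
libjars_cmd() {
  echo "hadoop jar myjob.jar com.example.MyJob" \
       "-libjars $(join_jars "$@") input output"
}

libjars_cmd lib/a.jar lib/b.jar
```

In this sketch, -libjars sits after the main class and before the job's own arguments.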

In addition to the Hadoop 0.18 streaming parameters, Hadoop 0.20 and later introduces the three new streaming parameters listed in the following table:

Parameter    Definition
-files       Specifies comma-separated files to copy to the MapReduce cluster.
-archives    Specifies comma-separated archives to restore to the compute machines.
-D           Specifies a value for the key you enter, in the form <key>=<value>.
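The three parameters can be combined in one streaming invocation. The following is a minimal sketch, not a command taken from this guide. The streaming JAR location, script names, archive name, and input and output paths are all placeholders, and mapred.reduce.tasks is used only as an example key for -D. The script prints the command rather than running it.

```shell
#!/bin/sh
# Placeholder location of the streaming JAR; override it for your cluster.
STREAMING_JAR="${STREAMING_JAR:-hadoop-streaming.jar}"

# Print (do not run) a streaming command that uses -D, -files, and -archives.
# $1 = input path, $2 = output path. File names are placeholders.
streaming_cmd() {
  echo "hadoop jar $STREAMING_JAR" \
       "-D mapred.reduce.tasks=2" \
       "-files mapper.py,reducer.py" \
       "-archives dictionary.zip" \
       "-input $1 -output $2" \
       "-mapper mapper.py -reducer reducer.py"
}

streaming_cmd s3n://mybucket/input s3n://mybucket/output
```

The sketch places -D, -files, and -archives ahead of the streaming-specific options such as -input and -mapper, because Hadoop streaming expects generic options to come first.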