You can choose to run one of three Hadoop versions. Set the --hadoop-version and --ami-version parameters as shown in the following table.
| Hadoop Version | Configuration Parameters |
|---|---|
| 0.18* | --hadoop-version 0.18 --ami-version 1.0 |
| 0.20 | --hadoop-version 0.20 --ami-version 1.0 |
| 0.20.205* | --hadoop-version 0.20.205 --ami-version 2.0 |
*The Amazon EMR console and copies of the CLI downloaded after 11 December 2011 default to the latest AMI version. The SDK, the API, and copies of the CLI downloaded before 11 December 2011 default to AMI version 1.0 with Hadoop 0.18. For details about the configuration and software available on the AMIs used by Amazon Elastic MapReduce (Amazon EMR), see Specify the Amazon EMR AMI Version.
Note: We recommend using Hadoop 0.20.205 to take advantage of the latest performance enhancements and functionality.
Specify the Hadoop version when creating the job flow, similar to the following:
$ ./elastic-mapreduce --create --alive --name "Test Hadoop" \
--hadoop-version 0.20.205 \
--num-instances 5 --instance-type m1.small

This input creates a waiting job flow running Hadoop 0.20.205.
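The version table pairs each Hadoop version with an AMI version, and a small helper can keep the two parameters consistent. The following is a minimal sketch, not part of the Amazon EMR CLI: the function name ami_for_hadoop and the variable names are hypothetical, and the script only prints the command as a dry run rather than creating a job flow.

```shell
# Hypothetical helper: return the AMI version that the table pairs
# with a given Hadoop version.
ami_for_hadoop() {
  case "$1" in
    0.18|0.20) printf '%s\n' "1.0" ;;
    0.20.205)  printf '%s\n' "2.0" ;;
    *) printf 'unsupported Hadoop version: %s\n' "$1" >&2; return 1 ;;
  esac
}

HADOOP_VERSION="0.20"
AMI_VERSION="$(ami_for_hadoop "$HADOOP_VERSION")" || exit 1

# Dry run: print the command instead of creating a job flow.
echo "./elastic-mapreduce --create --alive --name \"Test Hadoop\"" \
     "--hadoop-version $HADOOP_VERSION --ami-version $AMI_VERSION" \
     "--num-instances 5 --instance-type m1.small"
```

Printing the command first makes it easy to review the version pairing before any instances are launched.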
Hadoop 0.18 was not designed to efficiently handle many small files. The following enhancements in Hadoop 0.20 and later improve the performance of processing small files:
Hadoop 0.20 and later assigns multiple tasks per heartbeat. A heartbeat is a periodic message that each slave node sends to the master node to report that it is still alive; the master node replies with task assignments. By assigning multiple tasks in each reply, Hadoop can distribute tasks to slave nodes faster. When tasks are short, the time taken to distribute them is a significant part of the total processing time.
Historically, Hadoop processed each task in its own Java Virtual Machine (JVM). If you have many small files that each take only a second to process, the overhead of starting a JVM for every task is large. Hadoop 0.20 and later can reuse one JVM for multiple tasks, which significantly reduces processing time.
Hadoop 0.20 and later allows a single map task to process multiple small files, which reduces the overhead associated with setting up a task for each file.
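The JVM sharing described above is controlled per job. The following is a minimal sketch, assuming the Hadoop 0.20 property mapred.job.reuse.jvm.num.tasks, where -1 places no limit on the number of tasks per JVM. The helper name jvm_reuse_option is hypothetical; it only prints the option so that you can append it to a job's arguments.

```shell
# Assumption: mapred.job.reuse.jvm.num.tasks is the job property that sets
# how many tasks one JVM may run in sequence (-1 means no limit).
jvm_reuse_option() {
  tasks_per_jvm="${1:--1}"
  printf '%s\n' "-D mapred.job.reuse.jvm.num.tasks=${tasks_per_jvm}"
}

jvm_reuse_option      # reuse each JVM for an unlimited number of tasks
jvm_reuse_option 10   # reuse each JVM for at most 10 tasks
```

Reuse helps most when individual tasks are short, because the JVM startup cost is then a large share of each task's run time.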
Hadoop 0.20 and later also supports the following features:
A new command line option, -libjars, enables you to include a specified JAR file in the class path of every task.
The ability to skip individual records rather than entire files. In previous versions of Hadoop, a failure in record processing caused the entire file containing the bad record to be skipped. Jobs that previously failed can now return partial results.
In addition to the Hadoop 0.18 streaming parameters, Hadoop 0.20 and later introduces the three new streaming parameters listed in the following table:
| Parameter | Definition |
|---|---|
| -files | Specifies comma-separated files to copy to the MapReduce cluster. |
| -archives | Specifies comma-separated archives to unarchive on the compute machines. |
| -D | Sets a value for a configuration key, in the form <key>=<value>. |
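The three parameters can be combined in a single streaming invocation. The following is a minimal sketch, assuming a streaming JAR at the path shown and hypothetical file names (mapper.py, reducer.py, dictionaries.zip) and HDFS paths. It prints the command as a dry run so that nothing is submitted to a cluster.

```shell
# Assumed location of the streaming JAR; adjust for your cluster.
STREAMING_JAR="${STREAMING_JAR:-/home/hadoop/contrib/streaming/hadoop-streaming.jar}"

# Print a streaming command that uses the three parameters.
# All file names and paths below are hypothetical.
streaming_cmd() {
  # Generic options (-D, -files, -archives) come before streaming options.
  printf '%s ' hadoop jar "$STREAMING_JAR" \
    -D mapred.reduce.tasks=2 \
    -files mapper.py,reducer.py \
    -archives dictionaries.zip \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py
  printf '\n'
}

# Dry run: review the command before running it on a cluster.
streaming_cmd
```

Hadoop parses -D, -files, and -archives as generic options, so they are placed ahead of streaming-specific options such as -input and -mapper.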