Amazon Elastic MapReduce (Amazon EMR) enables you to specify the number and kind of Amazon EC2 instances in the cluster. These specifications are the primary means of affecting the speed with which your job flow completes. There are, however, a number of Hadoop parameter values that govern the operation of Amazon EC2 instances at a much finer level of granularity.
By default, Amazon EMR sets many Hadoop parameters. Some of
these parameter values can be overridden by parameter values set in a
RunJobFlow request. For more information, see
RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API
Reference. Hadoop parameters govern such things as the
number of mapper and reducer tasks assigned to each node in the cluster,
the amount of memory allocated for these tasks, the number of threads,
timeouts, and other configuration parameters for the various Hadoop
components.
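
As a rough illustration of how per-node task settings scale across a cluster, the following Python sketch multiplies hypothetical per-node values by the number of nodes. The function name and all figures are assumptions made for this example. In classic Hadoop releases the per-node limits are typically controlled by properties such as mapred.tasktracker.map.tasks.maximum, mapred.tasktracker.reduce.tasks.maximum, and mapred.child.java.opts, but check the names against the Hadoop version your job flow runs.

```python
def cluster_capacity(nodes, map_slots_per_node, reduce_slots_per_node, child_heap_mb):
    """Estimate concurrent task slots and per-node heap demand.

    All inputs are hypothetical values chosen by the caller; nothing here
    is read from a real cluster.
    """
    slots_per_node = map_slots_per_node + reduce_slots_per_node
    return {
        # Tasks that can run at the same time across the whole cluster.
        "map_slots": nodes * map_slots_per_node,
        "reduce_slots": nodes * reduce_slots_per_node,
        # Heap needed on one node if every slot runs a task at once.
        "heap_mb_per_node": slots_per_node * child_heap_mb,
    }


# Example: 4 nodes, 2 mappers and 1 reducer per node, 512 MB heap per task.
capacity = cluster_capacity(
    nodes=4, map_slots_per_node=2, reduce_slots_per_node=1, child_heap_mb=512
)
print(capacity)
```

An estimate like this shows why the parameters interact: raising the number of tasks per node increases concurrency but also increases the memory each node must supply.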
Hadoop configuration parameters reside in Hadoop's
JobConf file. You set the Hadoop configuration
parameters by including them in your JAR file. For
streaming jobs, you can specify JobConf parameters using the
--jobconf option. For more information, go to the
Hadoop Map/Reduce Tutorial on the Hadoop website.
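
The following Python sketch shows one way the --jobconf arguments for a streaming job might be assembled from a dictionary of settings. The helper name jobconf_args and the two parameter values are assumptions chosen for illustration. How the resulting arguments are passed to a streaming step depends on the client you use and is not shown.

```python
def jobconf_args(params):
    """Turn a dict of JobConf settings into repeated --jobconf arguments.

    Hypothetical helper for illustration: each setting becomes a
    "--jobconf" flag followed by a "name=value" pair.
    """
    args = []
    # Sort by name so the generated argument list is deterministic.
    for name, value in sorted(params.items()):
        args.extend(["--jobconf", f"{name}={value}"])
    return args


# Example values only; choose settings appropriate for your job.
args = jobconf_args({
    "mapred.reduce.tasks": 4,
    "mapred.task.timeout": 600000,
})
print(" ".join(args))
```

Keeping the settings in a dictionary makes it easy to review them in one place before a job is submitted.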
JobConf parameters often act in concert with
related parameters or the entire framework, and are therefore more
difficult to set than the number and kind of instances. For more information, go to Job Configuration on the Hadoop website.
To assist with debugging and performance tuning,
Amazon EMR keeps a log of the Hadoop settings (from the Hadoop
JobConf) that were used to execute each job flow. These XML files are
stored under jobs/ in Amazon S3 or at
/mnt/var/log/hadoop/history/ on the master
node.
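
When reviewing these logs, it can help to load the settings into a dictionary so that two job flows can be compared. The Python sketch below assumes the logged file uses the standard Hadoop configuration layout, a configuration element containing property elements that each have a name and a value. The sample XML is made up for this example and is not taken from a real job flow.

```python
import xml.etree.ElementTree as ET


def read_job_conf(xml_text):
    """Parse Hadoop-style configuration XML into a {name: value} dict.

    Assumes the <configuration><property><name/><value/></property>
    layout; properties without a name are skipped.
    """
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            settings[name] = prop.findtext("value", default="")
    return settings


# Made-up sample standing in for a logged job configuration file.
SAMPLE = """<configuration>
  <property><name>mapred.reduce.tasks</name><value>4</value></property>
  <property><name>mapred.task.timeout</name><value>600000</value></property>
</configuration>"""

settings = read_job_conf(SAMPLE)
for name in sorted(settings):
    print(name, "=", settings[name])
```

To inspect a real log, read the file contents from the master node or from Amazon S3 and pass the text to read_job_conf.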