Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Performance Tuning

Amazon Elastic MapReduce (Amazon EMR) enables you to specify the number and kind of Amazon EC2 instances in the cluster. These specifications are the primary means of affecting the speed with which your job flow completes. There are, however, a number of Hadoop parameter values that govern the operation of Amazon EC2 instances at a much finer level of granularity.

By default, Amazon EMR sets many Hadoop parameters. Some of these parameter values can be overridden by parameter values set in a RunJobFlow request. For more information, see RunJobFlow in the Amazon Elastic MapReduce (Amazon EMR) API Reference. Hadoop parameters govern such things as the number of mapper and reducer tasks assigned to each node in the cluster, the amount of memory allocated for these tasks, the number of threads, timeouts, and other configuration parameters for the various Hadoop components.

Hadoop configuration parameters reside in Hadoop's JobConf file. You set the Hadoop configuration parameters by including them in your JAR file. For streaming jobs, you can specify JobConf parameters using the --jobconf option. For more information, go to the Hadoop Map/Reduce Tutorial on the Hadoop website.
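As a rough sketch of how --jobconf options might be assembled for a streaming job, the following Python helper turns a dictionary of parameters into repeated --jobconf name=value arguments. The helper name jobconf_args and the example values are hypothetical. mapred.reduce.tasks and mapred.task.timeout are standard Hadoop property names, but confirm that they apply to the Hadoop version your job flow runs.

```python
def jobconf_args(params):
    """Build a flat list of --jobconf arguments from a dict of Hadoop parameters.

    Hypothetical helper for illustration only: it formats arguments and does
    not validate that a parameter name is known to Hadoop.
    """
    args = []
    # Sort by name so the generated argument list is stable between runs.
    for name, value in sorted(params.items()):
        if not name or "=" in name:
            raise ValueError("invalid parameter name: %r" % (name,))
        args.extend(["--jobconf", "%s=%s" % (name, value)])
    return args


if __name__ == "__main__":
    # Example values are assumptions chosen for illustration.
    overrides = {
        "mapred.reduce.tasks": 4,
        "mapred.task.timeout": 600000,
    }
    print(" ".join(jobconf_args(overrides)))
```

The resulting list can be appended to the arguments of a streaming step. Keeping the overrides in one dictionary makes it easier to compare them later with the settings that Amazon EMR logs for the job flow.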

JobConf parameters often interact with related parameters or with the framework as a whole, so they are more difficult to set correctly than the number and kind of instances. For more information, go to Job Configuration on the Hadoop website.

To assist with debugging and performance tuning, Amazon EMR keeps a log of the Hadoop settings (from the Hadoop JobConf) that were used to execute each job flow. These XML files are stored under jobs/ in Amazon S3 or at /mnt/var/log/hadoop/history/ on the master node.
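The logged settings can be inspected with a short script. The sketch below assumes the logged file follows the standard Hadoop configuration layout (a configuration element containing property elements, each with a name and a value). It parses an inline sample rather than a real log file, and the function names parse_jobconf and changed_settings are hypothetical.

```python
import xml.etree.ElementTree as ET

# Inline sample in the standard Hadoop configuration layout. The values are
# assumptions for illustration, not output copied from a real job flow.
SAMPLE_JOBCONF = """<configuration>
  <property><name>mapred.reduce.tasks</name><value>4</value></property>
  <property><name>mapred.task.timeout</name><value>600000</value></property>
</configuration>"""


def parse_jobconf(xml_text):
    """Return a dict mapping each property name to its value (as strings)."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries that have no name
        settings[name.strip()] = (prop.findtext("value") or "").strip()
    return settings


def changed_settings(baseline, current):
    """Return {name: (baseline_value, current_value)} for values that differ."""
    return {
        name: (baseline.get(name), value)
        for name, value in current.items()
        if baseline.get(name) != value
    }


if __name__ == "__main__":
    logged = parse_jobconf(SAMPLE_JOBCONF)
    expected = {"mapred.reduce.tasks": "2", "mapred.task.timeout": "600000"}
    for name, (old, new) in sorted(changed_settings(expected, logged).items()):
        print("%s: expected %s, job flow used %s" % (name, old, new))
```

To use this with a real job flow, read the contents of a logged XML file into a string and pass it to parse_jobconf. Comparing two runs this way shows which settings actually differed between them.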