Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Specify the Amazon EMR AMI Version

Amazon Elastic MapReduce (Amazon EMR) uses Amazon Machine Images (AMIs) to initialize the Amazon EC2 instances it launches to run a job flow. The AMIs contain the Linux operating system, Hadoop, and other software used to run the job flow. These AMIs are specific to Amazon EMR and can be used only in the context of running a job flow. Periodically, Amazon EMR updates these AMIs with new versions of Hadoop and other software, so users can take advantage of improvements and new features.

For general information about AMIs, go to Using AMIs in the Amazon Elastic Compute Cloud User Guide. For details about the software versions included in the Amazon EMR AMIs, go to the section called “AMI Versions Supported in Amazon EMR”.

If your application depends on a specific version or configuration of Hadoop, you might want to delay upgrading to a new AMI until you have tested your application on it. AMI versioning lets you specify which AMI version your job flow uses to launch Amazon EC2 instances.

Specifying the AMI version during job flow creation is optional; if you do not provide an AMI-version parameter, and you are using the CLI, your job flows will run on the most recent AMI version. This means you always have the latest software running on your job flows, but you must ensure that your application will work with new changes as they are released.

If you specify an AMI version when you create a job flow, your instances will be created using that AMI. This provides stability for long-running or mission-critical applications. The trade-off is that your application will not have access to new features on more up-to-date AMI versions.

Note

The default configuration for the Amazon EMR console and copies of the CLI downloaded after 11 December 2011 is the latest AMI version. The default for the SDK, the API, and CLIs downloaded prior to 11 December 2011 is AMI version 1.0, Hadoop 0.18. For details about the configuration and applications available on AMI versions, see AMI Versions Supported in Amazon EMR.
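
The default rules in this note can be summarized as a small lookup. The following Python sketch is illustrative only: the function name and the client labels ("console", "cli-new", "cli-old", "api", "sdk") are hypothetical and are not part of any Amazon EMR tool.

```python
from typing import Optional

# Clients that default to the latest AMI version, per the note above.
# "cli-new" stands for a CLI copy downloaded after 11 December 2011 (hypothetical label).
LATEST_BY_DEFAULT = {"console", "cli-new"}

# Clients that default to AMI version 1.0 (with Hadoop 0.18), per the note above.
# "cli-old" stands for a CLI copy downloaded before 11 December 2011 (hypothetical label).
LEGACY_BY_DEFAULT = {"api", "sdk", "cli-old"}


def effective_ami_version(client: str, requested: Optional[str] = None) -> str:
    """Return the AMI version a new job flow would use under the documented defaults."""
    if client not in LATEST_BY_DEFAULT | LEGACY_BY_DEFAULT:
        raise ValueError(f"unknown client type: {client!r}")
    if requested is not None:
        # An explicitly requested version always overrides the default.
        return requested
    return "latest" if client in LATEST_BY_DEFAULT else "1.0"


print(effective_ami_version("api"))           # no version given, legacy default
print(effective_ami_version("cli-new"))       # no version given, latest default
print(effective_ami_version("sdk", "2.0.4"))  # explicit version wins
```

The sketch only restates the defaults; it does not contact Amazon EMR or check that a requested version exists.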

Specifying the AMI Version for a New Job Flow

You can specify which AMI version a new job flow should use when you create it.

Note

AMI versioning is not currently supported in the Amazon EMR console. Job flows created through the Amazon EMR console will use the most current version available.

To specify an AMI version using the CLI

  • When creating a job flow using the CLI, add the --ami-version parameter, as shown in the following example. If you do not specify this parameter, or if you specify --ami-version latest, the most recent AMI version is used.

    $ ./elastic-mapreduce --create --alive --name "Test AMI Versioning" \
      --ami-version 1.0 --hadoop-version 0.20 \
      --num-instances 5 --instance-type m1.small
    

To specify an AMI version using the API

  • When creating a job flow using the API, add the AmiVersion and HadoopVersion parameters to the request string, as shown in the following example. If you do not specify these parameters, Amazon EMR creates the job flow using AMI version 1.0 and Hadoop 0.18. For more information, go to RunJobFlow in the Amazon Elastic MapReduce API Reference.

    https://elasticmapreduce.amazonaws.com?Operation=RunJobFlow
    &Name=MyJobFlowName
    &LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
    &AmiVersion=1.0
    &HadoopVersion=0.20	
    &Instances.MasterInstanceType=m1.small
    &Instances.SlaveInstanceType=m1.small
    &Instances.InstanceCount=4
    &Instances.Ec2KeyName=myec2keyname
    &Instances.Placement.AvailabilityZone=us-east-1a
    &Instances.KeepJobFlowAliveWhenNoSteps=true
    &Steps.member.1.Name=MyStepName
    &Steps.member.1.ActionOnFailure=CONTINUE
    &Steps.member.1.HadoopJarStep.Jar=MyJarFile
    &Steps.member.1.HadoopJarStep.MainClass=MyMainClass
    &Steps.member.1.HadoopJarStep.Args.member.1=arg1
    &Steps.member.1.HadoopJarStep.Args.member.2=arg2
    &AuthParams	
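
The request string above can also be assembled programmatically. The following Python sketch shows one way to do that with the standard library. The helper name build_run_job_flow_query and its arguments are hypothetical, and the sketch deliberately omits the authentication parameters (AuthParams), so the resulting URL is not a complete, signed request.

```python
from urllib.parse import urlencode

# Endpoint as shown in the example request above.
ENDPOINT = "https://elasticmapreduce.amazonaws.com"


def build_run_job_flow_query(name, ami_version=None, hadoop_version=None, extra=None):
    """Build an unsigned RunJobFlow query string (hypothetical helper)."""
    params = {"Operation": "RunJobFlow", "Name": name}
    # Add the versioning parameters only when the caller supplies them,
    # so that omitting them falls back to the service defaults.
    if ami_version is not None:
        params["AmiVersion"] = ami_version
    if hadoop_version is not None:
        params["HadoopVersion"] = hadoop_version
    params.update(extra or {})
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return ENDPOINT + "?" + urlencode(params)


url = build_run_job_flow_query(
    "MyJobFlowName",
    ami_version="1.0",
    hadoop_version="0.20",
    extra={
        "LogUri": "s3n://mybucket/subdir",
        "Instances.MasterInstanceType": "m1.small",
        "Instances.SlaveInstanceType": "m1.small",
        "Instances.InstanceCount": 4,
    },
)
print(url)
```

Keeping AmiVersion and HadoopVersion optional mirrors the API behavior described above: when they are left out, the service applies its own defaults.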
    

Check the AMI Version of a Running Job Flow

If you need to find out which AMI version a job flow is running, you can retrieve this information using either the CLI or the API.

Note

AMI versioning is not currently supported in the Amazon EMR console. Job flows created through the Amazon EMR console will use the most current version available.

To check the current AMI version using the CLI

  • Use the --describe parameter to retrieve the AMI version of a job flow. In the following example, JobFlowID is the identifier of the job flow. The AMI version is returned along with other information about the job flow.

    $ ./elastic-mapreduce --describe --jobflow JobFlowID
    

To check the current AMI version using the API

  • Call DescribeJobFlows to check which AMI version a job flow is using. The version will be returned as part of the response data, as shown in the following example. For the complete response syntax, go to DescribeJobFlows in the Amazon Elastic MapReduce API Reference.

    <DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
       <DescribeJobFlowsResult> 
          <JobFlows> 
             <member>
    		...
                <AmiVersion>
                   1.0
                </AmiVersion>
    		...
             </member>
          </JobFlows> 
       </DescribeJobFlowsResult> 
       <ResponseMetadata>
          <RequestId> 
             9cea3229-ed85-11dd-9877-6fad448a8419 
          </RequestId>
       </ResponseMetadata> 
    </DescribeJobFlowsResponse> 	
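
A response like the one above can be read with any XML parser. The following Python sketch uses the standard library to pull the AmiVersion value out of each job flow entry. The function name ami_versions and the trimmed sample document are illustrative assumptions. The namespace URI is copied from the example response and may differ for other API versions.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the example response above (an assumption for other API versions).
NS = {"emr": "http://elasticmapreduce.amazonaws.com/doc/2009-03-31"}

# Trimmed copy of the example response, keeping only the AmiVersion element.
SAMPLE = """<DescribeJobFlowsResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
   <DescribeJobFlowsResult>
      <JobFlows>
         <member>
            <AmiVersion>
               1.0
            </AmiVersion>
         </member>
      </JobFlows>
   </DescribeJobFlowsResult>
</DescribeJobFlowsResponse>"""


def ami_versions(response_xml):
    """Return the AmiVersion text of every job flow in a DescribeJobFlows response."""
    root = ET.fromstring(response_xml)
    members = root.findall("emr:DescribeJobFlowsResult/emr:JobFlows/emr:member", NS)
    # The element text is surrounded by whitespace in the example, so strip it.
    return [m.findtext("emr:AmiVersion", default="", namespaces=NS).strip() for m in members]


print(ami_versions(SAMPLE))
```

An entry that lacks an AmiVersion element yields an empty string rather than an error, which keeps the list aligned with the job flows in the response.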
    

Amazon EMR AMIs and Hadoop Versions

An AMI can contain multiple versions of Hadoop. If the AMI you specify has multiple versions of Hadoop available, you can select the version of Hadoop you want to run as described in the section called “Hadoop Configuration”. You cannot specify a Hadoop version that is not available on the AMI. For a list of the versions of Hadoop supported on each AMI, go to AMI Versions Supported in Amazon EMR.
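
The rule that a Hadoop version must be available on the chosen AMI can be checked before a job flow is submitted. The following Python sketch hard-codes the Hadoop versions listed for AMI 1.0 and AMI 2.0 in AMI Versions Supported in Amazon EMR. The mapping and the function name are illustrative assumptions, not an Amazon EMR API, and the mapping would need updating as AMIs change.

```python
# Hadoop versions per AMI, transcribed from the table on this page (illustrative only).
HADOOP_VERSIONS_BY_AMI = {
    "1.0": ["0.18", "0.20"],
    "2.0": ["0.20.205"],
}


def is_hadoop_available(ami_version, hadoop_version):
    """Return True if the Hadoop version is listed for the given AMI version."""
    available = HADOOP_VERSIONS_BY_AMI.get(ami_version)
    if available is None:
        raise KeyError(f"no Hadoop versions recorded for AMI {ami_version!r}")
    return hadoop_version in available


print(is_hadoop_available("1.0", "0.20"))  # listed for AMI 1.0
print(is_hadoop_available("2.0", "0.18"))  # not listed for AMI 2.0
```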

Amazon EMR AMI Deprecation

Eighteen months after an AMI version is released, the Amazon EMR team might choose to deprecate that AMI version and no longer support it. In addition, the Amazon EMR team might deprecate an AMI before eighteen months have elapsed if a security risk or other issue is identified in the software or operating system of the AMI. If a job flow is running when its AMI is deprecated, the job flow is not affected. You cannot, however, create new job flows with the deprecated AMI version. The best practice is to plan for AMI obsolescence and move to new AMI versions as soon as is practical for your application.

Before an AMI is deprecated, the Amazon EMR team will send out an announcement specifying the date on which the AMI version will no longer be supported.
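
For planning purposes, you can compute the date on which an AMI version reaches the eighteen-month mark. The following Python sketch does this with the standard library. The function names are hypothetical, and the result is only the point after which deprecation becomes possible under the policy above. It is not an announced deprecation date, and an AMI might be deprecated earlier.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return the date the given number of months after d, clamping the day to the month length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def eighteen_month_mark(release_date):
    """Return the date eighteen months after an AMI release date."""
    return add_months(release_date, 18)


# AMI 2.0 was released on 11 December 2011 according to the table on this page.
print(eighteen_month_mark(date(2011, 12, 11)))
```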

AMI Versions Supported in Amazon EMR

Amazon EMR supports the AMI versions listed in the following table. You can specify the AMI version you want to use when you create a job flow.

If you do not specify an AMI version, the default version is used. For the Amazon EMR console and versions of the CLI released after the AMI versioning release (12-08-11), the default version is the latest version. For the API, SDK, and versions of the CLI downloaded before AMI versioning was released, the default version is AMI 1.0.

AMI Version | Description | Release Date

AMI Version: 2.0.5

Note

Because of an issue with AMI 2.0.5, AMI 2.0.4 is currently set as the "latest" version for the purposes of job flows launched by the Amazon EMR console and versions of the CLI released after the AMI versioning release (12-08-11). Although you can explicitly set the AMI version to 2.0.5, we recommend that you use AMI 2.0.4 instead.

Same as AMI 2.0.4, with the following additions:

  • Improves Hadoop performance by reinitializing the recycled compressor object for mappers only if they are configured to use the GZip compression codec for output.

  • Adds a configuration variable to Hadoop called mapreduce.jobtracker.system.dir.permission that can be used to set permissions on the system directory. For more information, see Setting Permissions on the System Directory.

  • Changes InstanceController to use an embedded database rather than the MySQL instance running on the box. MySQL remains installed and running by default.

  • Improves the collectd configuration. For more information about collectd, go to http://collectd.org/.

  • Fixes a rare race condition in InstanceController.

  • Changes the default shell from dash to bash.

Release Date: 19 April 2012

AMI Version: 2.0.4

Same as AMI 2.0.3, with the following additions:

  • Changes the default for fs.s3n.blockSize to 33554432 (32MiB).

  • Fixes a bug in reading zero-length files from Amazon S3.

Release Date: 30 January 2012

AMI Version: 2.0.3

Same as AMI 2.0.2, with the following additions:

  • Adds support for Amazon EMR metrics in Amazon CloudWatch.

  • Improves performance of seek operations in Amazon S3.

Release Date: 24 January 2012

AMI Version: 2.0.2

Same as AMI 2.0.1, with the following additions:

  • Adds support for the Python API Dumbo. For more information about Dumbo, go to https://github.com/klbostee/dumbo/wiki/.

  • The AMI now runs the Network Time Protocol Daemon (NTPD) by default. For more information about NTPD, go to http://en.wikipedia.org/wiki/Ntpd.

  • Updates the Amazon Web Services SDK to version 1.2.16.

  • Improves the way Amazon S3 file system initialization checks for the existence of Amazon S3 buckets.

  • Adds support for configuring the Amazon S3 block size to facilitate splitting files in Amazon S3. You set this in the fs.s3n.blockSize parameter. You set this parameter by using the configure-hadoop bootstrap action. The default value is 9223372036854775807 (8 EiB).

  • Adds a /dev/sd symlink for each /dev/xvd device. For example, /dev/xvdb now has a symlink pointing to it called /dev/sdb. Now you can use the same device names for AMI 1.0 and 2.0.

Release Date: 17 January 2012

AMI Version: 2.0.1

Same as AMI 2.0 except for the following bug fixes:

  • Task attempt logs are pushed to Amazon S3.

  • Fixed /mnt mounting on 32-bit AMIs.

Release Date: 19 December 2011

AMI Version: 2.0

Operating system: Debian 6.0.2 (Squeeze)

Applications: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1

Languages: Perl 5.10.1, PHP 5.3.3, Python 2.6.6, R 2.11.1, Ruby 1.8.7

File system: ext3 for root, xfs for ephemeral

Kernel: Amazon Linux

Note: Added support for the Snappy compression/decompression library.

Release Date: 11 December 2011

AMI Version: 1.0.1

Same as AMI 1.0 except for the following change:

  • Updates sources.list to the new location of the Lenny distribution in archive.debian.org.

Release Date: 3 April 2012

AMI Version: 1.0

Operating system: Debian 5.0 (Lenny)

Applications: Hadoop 0.20 and 0.18 (default); Hive 0.5, 0.7 (default), 0.7.1; Pig 0.3 (on Hadoop 0.18), 0.6 (on Hadoop 0.20)

Languages: Perl 5.10.0, PHP 5.2.6, Python 2.5.2, R 2.7.1, Ruby 1.8.7

File system: ext3 for root and ephemeral

Kernel: Red Hat

Note: This was the last AMI released before the CLI was updated to support AMI versioning. For backward compatibility, job flows launched with versions of the CLI downloaded before 11 December 2011 use this version.

Release Date: 26 April 2011