Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Multipart Upload

Multipart upload allows you to upload a single file to Amazon S3 as a set of parts. Using the AWS Java SDK, you can upload these parts incrementally and in any order. Using the multipart upload method can result in faster uploads and shorter retries than when uploading a single large file.
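
To make the idea of parts concrete, the following sketch computes how many parts are needed to cover a file of a given size. The function name part_count is hypothetical and is not part of any AWS tool. It assumes that both sizes are given in bytes and that every part except the last is the same size.

```shell
#!/bin/sh
# part_count FILE_SIZE PART_SIZE   (hypothetical helper, sizes in bytes)
# Prints the number of parts needed to cover FILE_SIZE bytes when each
# part holds at most PART_SIZE bytes. The last part may be smaller.
part_count() {
    file_size=$1
    part_size=$2
    # Ceiling division using integer arithmetic.
    echo $(( (file_size + part_size - 1) / part_size ))
}

# Example: a 1,200,000,000-byte file split into 524,288,000-byte parts
# needs two full parts plus one smaller final part.
part_count 1200000000 524288000
```

Because each part is uploaded on its own, a failed part can be retried without resending the parts that already succeeded.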

Amazon Elastic MapReduce (Amazon EMR) supports multipart upload, but disables the feature by default. When multipart upload is enabled and a cluster node fails during an upload, the parts already uploaded remain in Amazon S3, and you are charged for storing that partial data. You must remove failed uploads from Amazon S3 yourself. The AWS SDK for Java provides a helper method, abortMultipartUploads, on its TransferManager class that makes it easy to clean up failed uploads.

The Amazon EMR configuration parameters for multipart upload are described in the following table.

Configuration Parameter Name        Default Value    Description
fs.s3n.multipart.uploads.enabled    false            A boolean type that indicates whether to enable multipart uploads.
fs.s3n.ssl.enabled                  true             A boolean type that indicates whether to use HTTPS (true) or HTTP (false).

You modify the configuration parameters for multipart uploads using a bootstrap action.

Amazon EMR Console

This procedure explains how to enable multipart upload using the Amazon EMR console.

To enable multipart uploads with a predefined bootstrap action

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click the Create New Job Flow button and fill out the Create a New Job Flow wizard. For more information about creating job flows, see Creating a Job Flow.

  3. On the BOOTSTRAP ACTIONS pane of the wizard, select Configure your Bootstrap Actions.

  4. For Action Type select Configure Hadoop.

  5. In Optional Arguments, replace the default value with the following:

       -c fs.s3n.multipart.uploads.enabled=true -c fs.s3n.multipart.uploads.split.size=524288000

  6. If you have more bootstrap actions to add, click Add another Bootstrap Action. When all of your bootstrap actions are added, click Continue to go to the REVIEW pane of the Create a New Job Flow wizard.
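
The split size entered in step 5, 524288000, equals 500 x 1024 x 1024. The following sketch shows one way to derive such a value and assemble the Optional Arguments string. It assumes the split size is expressed in bytes, and the helper name mib_to_bytes is hypothetical.

```shell
#!/bin/sh
# mib_to_bytes MIB   (hypothetical helper)
# Converts a size in mebibytes to bytes, assuming the split size
# parameter expects a byte count.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

split_size=$(mib_to_bytes 500)

# Print the string to paste into the Optional Arguments field.
echo "-c fs.s3n.multipart.uploads.enabled=true -c fs.s3n.multipart.uploads.split.size=${split_size}"
```

Changing the argument to mib_to_bytes gives the value for a different part size.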

CLI

This procedure explains how to enable multipart upload using the CLI. The command creates a job flow in a waiting state with multipart upload enabled.

If you are using Linux or UNIX, enter the following:

$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--bootstrap-name "enable multipart upload" \
--args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"

If you are using Microsoft Windows, enter the following:

c:\> ruby elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name "enable multipart upload" --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"

This job flow remains in the WAITING state until it is terminated.
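
The --args value is a comma-separated list in which each setting is preceded by -c. When scripting job flow creation, it can be convenient to build that list from individual key=value pairs so that no stray whitespace ends up inside it. The following is a minimal sketch; the function name hadoop_config_args is hypothetical, and the script only prints the resulting command for review rather than running it.

```shell
#!/bin/sh
# hadoop_config_args KEY=VALUE...   (hypothetical helper)
# Joins each KEY=VALUE pair into the form "-c,KEY=VALUE,-c,KEY=VALUE"
# used by the --args option in the commands above.
hadoop_config_args() {
    args=""
    for pair in "$@"; do
        # Separate entries with a comma, but not before the first one.
        if [ -n "$args" ]; then
            args="${args},"
        fi
        args="${args}-c,${pair}"
    done
    printf '%s\n' "$args"
}

ARGS=$(hadoop_config_args \
    fs.s3n.multipart.uploads.enabled=true \
    fs.s3n.multipart.uploads.split.size=524288000)

# Print the full command instead of executing it.
printf '%s\n' "./elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --bootstrap-name \"enable multipart upload\" --args \"$ARGS\""
```

Once the printed command looks right, it can be copied to the prompt and run from the directory that contains the elastic-mapreduce client.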

Using the API

For information on using Amazon S3 multipart uploads programmatically, go to Using the AWS SDK for Java for Multipart Upload in the Amazon S3 Developer Guide.

For more information on the AWS SDK for Java, go to the AWS SDK for Java detail page.