Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

File System Configuration

Amazon Elastic MapReduce (Amazon EMR) and Hadoop provide a variety of file systems you can use when processing job flow steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using the S3 native file system. The following table lists the available file systems, with recommendations on when it’s best to use them.

File System: HDFS
Prefix: hdfs:// or no prefix
Description: HDFS is used by the master and core nodes. Its advantage is that it’s fast; its disadvantage is that it’s ephemeral storage which is reclaimed when the job flow ends. It’s best used for caching the results produced by intermediate job-flow steps.

File System: Amazon S3 native
Prefix: s3:// or s3n://
Description: Amazon S3 native is a persistent and fault-tolerant file system. It continues to exist after the job flow ends. Its disadvantage is that it’s slower than HDFS because of the round-trip to Amazon S3. It’s best used for storing the input to a job flow, the output of the job flow, and the results of intermediate job flow steps where re-computing the step would be onerous.

    Note

    Paths that specify only a bucket name must end with a terminating slash. In other words, all Amazon S3 URIs must have at least three slashes. For example, specify s3n://myawsbucket/ instead of s3n://myawsbucket. The URI s3n://myawsbucket/myfolder, however, is also valid.

File System: Amazon S3 block
Prefix: s3bfs://
Description: Amazon S3 block is a deprecated file system that is not recommended because it can trigger a race condition that might cause your job flow to fail. It may be required by legacy applications.
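
The trailing-slash rule from the note on Amazon S3 URIs can be checked mechanically. The following is a minimal sketch in Python; the helper name normalize_s3_uri is hypothetical and is not part of Amazon EMR or Hadoop. It only appends the slash that a bucket-only URI is missing.

```python
S3_PREFIXES = ("s3", "s3n", "s3bfs")


def normalize_s3_uri(uri: str) -> str:
    """Return the URI with a trailing slash added if it names only a bucket.

    Hypothetical helper for illustration; it does not contact Amazon S3.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in S3_PREFIXES:
        raise ValueError(f"not an Amazon S3 URI: {uri!r}")
    if not rest:
        raise ValueError("missing bucket name")
    # A bucket-only URI has no slash after the bucket name, so the URI
    # contains only two slashes; add the third.
    if "/" not in rest:
        return uri + "/"
    return uri


print(normalize_s3_uri("s3n://myawsbucket"))           # s3n://myawsbucket/
print(normalize_s3_uri("s3n://myawsbucket/myfolder"))  # s3n://myawsbucket/myfolder
```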

Note

The configuration of Hadoop running on Amazon EMR differs from the default configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system, while in the default configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3 block storage system.
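
To make the prefix mapping concrete, here is a small Python sketch of how a URI prefix selects a file system on Amazon EMR, as described in this section. The table and the function name file_system_for are illustrative assumptions, not an Amazon EMR or Hadoop API. The sketch models the Amazon EMR mapping only; under the default Apache Hadoop configuration, s3:// would map to the Amazon S3 block storage system instead.

```python
# Prefix-to-file-system mapping on Amazon EMR, taken from the text above.
EMR_FILE_SYSTEMS = {
    "hdfs": "HDFS",
    "s3": "Amazon S3 native",
    "s3n": "Amazon S3 native",
    "s3bfs": "Amazon S3 block",
}


def file_system_for(uri: str) -> str:
    """Name the file system that a URI refers to on Amazon EMR.

    Hypothetical helper for illustration only.
    """
    scheme, sep, _ = uri.partition("://")
    if not sep:
        # Paths without a prefix resolve to HDFS.
        return "HDFS"
    try:
        return EMR_FILE_SYSTEMS[scheme]
    except KeyError:
        raise ValueError(f"unrecognized prefix: {scheme}://") from None


print(file_system_for("s3://myawsbucket/path"))   # Amazon S3 native
print(file_system_for("s3n://myawsbucket/path"))  # Amazon S3 native
print(file_system_for("/path-to-data"))           # HDFS
```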

Upload Large Files with the S3 Native File System

The S3 native file system imposes a 5 GiB file-size limit. You might need to upload or store files larger than 5 GiB with Amazon S3. Amazon EMR makes this possible by extending the S3 file system through the AWS Java SDK to support multipart uploads. Using this feature of Amazon EMR you can upload files of up to 5 TiB in size. Multipart upload is disabled by default; to learn how to enable it for your job flow, see the section called “Multipart Upload”.
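
As a rough illustration of the two size limits, the following Python sketch classifies a file by size. The function name upload_mode and its return labels are hypothetical, and treating both limits as inclusive is an assumption; the sketch does not perform an upload or enable multipart upload.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

SINGLE_UPLOAD_LIMIT = 5 * GIB  # limit of the S3 native file system
MULTIPART_LIMIT = 5 * TIB      # limit with multipart upload enabled


def upload_mode(size_bytes: int) -> str:
    """Classify a file size against the limits described above.

    Assumption: both limits are inclusive.
    """
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= SINGLE_UPLOAD_LIMIT:
        return "single"
    if size_bytes <= MULTIPART_LIMIT:
        return "multipart"
    return "too large"


print(upload_mode(1 * GIB))  # single
print(upload_mode(6 * GIB))  # multipart
print(upload_mode(6 * TIB))  # too large
```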

Access File Systems

You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.

To access a local HDFS

  • Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS.

    hdfs:///path-to-data
    							
    /path-to-data
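
The equivalence of the two forms can be sketched in Python as follows. The helper to_local_hdfs_uri is hypothetical; it only rewrites strings and assumes that an unprefixed path is absolute.

```python
def to_local_hdfs_uri(path: str) -> str:
    """Rewrite an unprefixed absolute path as an explicit hdfs:/// URI.

    Hypothetical helper; mirrors the rule that paths without a prefix
    resolve to the local HDFS.
    """
    if path.startswith("hdfs:///"):
        return path
    if "://" in path:
        raise ValueError(f"not a local HDFS path: {path!r}")
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    return "hdfs://" + path


# Both forms name the same location.
assert to_local_hdfs_uri("/path-to-data") == to_local_hdfs_uri("hdfs:///path-to-data")
print(to_local_hdfs_uri("/path-to-data"))  # hdfs:///path-to-data
```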
                

To access a remote HDFS

  • Include the IP address of the master node in the URI as shown in the following examples.

    hdfs://master-ip-address/path-to-data
    						
    master-ip-address/path-to-data
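
A remote HDFS URI in the first form above can be assembled from its two parts. This is a minimal Python sketch; the function name remote_hdfs_uri is hypothetical, and the address 192.0.2.10 is a placeholder from the documentation address range, not a real master node.

```python
def remote_hdfs_uri(master_ip: str, path: str) -> str:
    """Build hdfs://master-ip-address/path-to-data from its parts.

    Hypothetical helper; it does not check that the node is reachable.
    """
    if not master_ip:
        raise ValueError("master IP address is required")
    if not path.startswith("/"):
        path = "/" + path
    return f"hdfs://{master_ip}{path}"


print(remote_hdfs_uri("192.0.2.10", "path-to-data"))  # hdfs://192.0.2.10/path-to-data
```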
                

To access the Amazon S3 native file system

  • Use the s3n:// or s3:// prefix. Amazon EMR resolves both of the URIs below to the same location.

    s3n://bucket-name/path-to-file-in-bucket
    						
    s3://bucket-name/path-to-file-in-bucket
    					

    Note

    Because the s3:// prefix maps to different file systems in Hadoop running on Amazon EMR and in standard Apache Hadoop, we recommend that you use the s3n:// prefix to make it clear that you are using the S3 native file system.

To access the Amazon S3 block file system

  • Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.

    The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GiB in size. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TiB in size to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated.

    Caution

    Because this legacy file system can create race conditions that can corrupt the file system, you should avoid it and use the Amazon S3 native file system instead.

    s3bfs://bucket-name/path-to-file-in-bucket