| Did this page help you? Yes No Tell us about it... |
Amazon Elastic MapReduce (Amazon EMR) and Hadoop provide a variety of file systems you can use when processing job flow steps. You specify which file system to use by the prefix of the URI used to access the data. For example, s3://myawsbucket/path references an Amazon S3 bucket using the S3 native file system. The following table lists the available file systems, with recommendations on when it’s best to use them.
| File System | Prefix | Description | |||
|---|---|---|---|---|---|
| HDFS | hdfs:// or no prefix | HDFS is used by the master and core nodes. Its advantage is that it’s fast; its disadvantage is that it’s ephemeral storage which is reclaimed when the job flow ends. It’s best used for caching the results produced by intermediate job-flow steps. | |||
| Amazon S3 native | s3:// or s3n:// | Amazon S3 native is a persistent and fault-tolerant file system. It continues to exist after the job flow ends. Its disadvantage is that it’s slower than HDFS because of the round-trip to Amazon S3. It’s best used for storing the input to a job flow, the output of the job flow, and the results of intermediate job flow steps where re-computing the step would be onerous.
| |||
| Amazon S3 block | s3bfs:// | Amazon S3 block is a deprecated file system that is not recommended because it can trigger a race condition that might cause your job flow to fail. It may be required by legacy applications. |
![]() | Note |
|---|---|
The configuration of Hadoop running on Amazon EMR differs from the default configuration provided by Apache Hadoop. On Amazon EMR, s3n:// and s3:// both map to the Amazon S3 native file system, while in the default configuration provided by Apache Hadoop s3:// is mapped to the Amazon S3 block storage system. |
The S3 native file system imposes a 5 GiB file-size limit. You might need to upload or store files larger than 5 GiB with Amazon S3. Amazon EMR makes this possible by extending the S3 file system through the AWS Java SDK to support multipart uploads. Using this feature of Amazon EMR you can upload files of up to 5 TiB in size. Multipart upload is disabled by default; to learn how to enable it for your job flow, see the section called “Multipart Upload”.
You specify which file system to use by the prefix of the uniform resource identifier (URI) used to access the data. The following procedures illustrate how to reference several different types of file systems.
To access a local HDFS
Specify the hdfs:/// prefix in the URI. Amazon EMR resolves paths that do not specify a prefix in the URI to the local HDFS. For example, both of the following URIs would resolve to the same location in HDFS.
hdfs:///path-to-data/path-to-data
To access a remote HDFS
Include the IP address of the master node in the URI as shown in the following examples.
hdfs://master-ip-address/path-to-datamaster-ip-address/path-to-data
To access the Amazon S3 native file system
Use the s3n:// or s3:// prefix. Amazon EMR resolves both of the URIs below to the same location.
s3n://bucket-name/path-to-file-in-buckets3://bucket-name/path-to-file-in-bucket
![]() | Note |
|---|---|
Because of the file syntax difference between Hadoop running on Amazon EMR and standard Apache Hadoop, it is recommended that you use the s3n:// prefix to highlight the fact that you are using the S3 native file system. |
To access the Amazon S3 block file system
Use only for legacy applications that require the Amazon S3 block file system. To access or store data with this file system, use the s3bfs:// prefix in the URI.
The Amazon S3 block file system is a legacy file system that was used to support uploads to Amazon S3 that were larger than 5 GiB in size. With the multipart upload functionality Amazon EMR provides through the AWS Java SDK, you can upload files of up to 5 TiB in size to the Amazon S3 native file system, and the Amazon S3 block file system is deprecated.
![]() | Caution |
|---|---|
Because this legacy file system can create race conditions that can corrupt the file system, you should avoid this format and use the Amazon S3 native file system instead. |
s3bfs://bucket-name/path-to-file-in-bucket