Distributed Cache is a Hadoop feature that allows you to transfer files from a distributed file system to the local file system. It can distribute data and text files as well as more complex types such as archives and jars. If your job flow depends on applications or binaries that are not installed when the cluster is created, you can use Distributed Cache to import these files. Using Distributed Cache can boost efficiency when a map or a reduce task needs access to common data, because a cluster node can read files from its local file system instead of retrieving them from other cluster nodes.
You invoke Distributed Cache when you create the job flow. The files are cached just
before starting the Hadoop job and the files remain cached for the duration of the job. You
can cache files stored on any Hadoop-compatible file system, for example HDFS or S3 native.
The default size of the file cache is 10GB. To change the size of the cache, reconfigure
the Hadoop parameter local.cache.size using the Configure Hadoop
bootstrap action.
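The size limit can be checked before you launch a job flow. The following is a minimal sketch, assuming that the 10GB default corresponds to 10 * 1024^3 bytes; the function name and the sample sizes are illustrative and are not part of Amazon EMR or Hadoop.

```python
# Sketch: check whether a file (or the files inside one archive) fits in the
# Distributed Cache. Assumption: the 10GB default equals 10 * 1024**3 bytes.
DEFAULT_CACHE_BYTES = 10 * 1024**3


def fits_in_cache(file_sizes, cache_bytes=DEFAULT_CACHE_BYTES):
    """Return True if the combined size is strictly below the cache size.

    file_sizes holds the size in bytes of a single file, or of every file
    packaged in one archive.
    """
    return sum(file_sizes) < cache_bytes


if __name__ == "__main__":
    archive_contents = [4 * 1024**3, 5 * 1024**3]  # hypothetical 4GB and 5GB files
    print(fits_in_cache(archive_contents))
```

If you change local.cache.size, pass the new value as cache_bytes so that the check matches the cluster configuration.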
Distributed Cache allows both single files and archives. Individual files are cached as read only. Executables and binary files have execution permissions set.
Archives are one or more files packaged using a utility, such as gzip.
Distributed Cache passes the compressed files to each slave node and decompresses the
archive as part of caching. Distributed Cache supports the following compression
formats:
zip
tgz
tar.gz
tar
jar
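As an illustration, a launch script could confirm that a file name ends in one of the listed formats before passing it as an archive. This is a sketch only; the helper name is hypothetical, and the comparison is a plain, case-sensitive suffix match.

```python
# Sketch: test a file name against the archive formats listed above.
SUPPORTED_ARCHIVE_SUFFIXES = (".zip", ".tgz", ".tar.gz", ".tar", ".jar")


def is_supported_archive(file_name):
    """Return True if file_name ends with one of the listed extensions."""
    # str.endswith accepts a tuple and matches any of its members.
    return file_name.endswith(SUPPORTED_ARCHIVE_SUFFIXES)


if __name__ == "__main__":
    for name in ("sample_dataset.tgz", "sample_dataset.dat"):
        print(name, is_supported_archive(name))
```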
Distributed Cache copies files to slave nodes only. If there are no slave nodes in the cluster, Distributed Cache copies the files to the master node.
Distributed Cache associates the cache files to the current working directory of the
mapper and reducer using symlinks. A symlink is an alias to a file location, not the
actual file location. The value of the Hadoop parameter,
mapred.local.dir, specifies the location of temporary files.
Amazon Elastic MapReduce (Amazon EMR) sets this parameter to
/mnt/var/lib/hadoop/mapred/. Cache files are located in a
subdirectory of the temporary file location at
/mnt/var/lib/hadoop/mapred/taskTracker/archive/.
If you cache a single file, Distributed Cache puts the file in the
archive directory. If you cache an archive, Distributed Cache
decompresses the file and creates a subdirectory in /archive with the same
name as the archive file name. The individual files are located in the new
subdirectory.
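Because the entries in the working directory are symlinks, a task can inspect where a cached file actually lives. The following sketch uses only the Python standard library; the function name is hypothetical, and the link name is whatever you chose when you added the file to the cache.

```python
# Sketch: resolve a cache symlink in the task's working directory.
import os


def describe_cached_entry(link_name, work_dir="."):
    """Return (is_symlink, real_path) for an entry in the working directory."""
    path = os.path.join(work_dir, link_name)
    # realpath follows the symlink to the underlying cache location.
    return os.path.islink(path), os.path.realpath(path)


if __name__ == "__main__":
    # Hypothetical link name; prints (False, <absolute path>) if it is absent.
    print(describe_cached_entry("sample_dataset_cached.dat"))
```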
You can use Distributed Cache only when creating streaming job flows.
To access the cached files from your mapper or reducer applications, make sure that you have added the current working directory (./) into your application path and referenced the cached files as though they are present in the current working directory.
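For example, a streaming mapper can open a cached file by its cache name as if it were in the current working directory. The following is a minimal sketch, not a definitive implementation: the file name matches the cache name used in the CLI example on this page, and the one-key-per-line file format and the filtering logic are assumptions made for illustration.

```python
#!/usr/bin/env python
# Sketch of a streaming mapper that reads a cached file from its working
# directory. Assumption: the cached file holds one key per line.
import sys

CACHED_FILE = "./sample_dataset_cached.dat"  # the name given after the # sign


def load_keys(path):
    """Read the non-empty lines of the cached file into a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def map_lines(lines, keys):
    """Yield 'word<TAB>1' for each input word found in the cached keys."""
    for line in lines:
        for word in line.split():
            if word in keys:
                yield "%s\t1" % word


if __name__ == "__main__":
    for record in map_lines(sys.stdin, load_keys(CACHED_FILE)):
        print(record)
```

The mapper never refers to the Amazon S3 location or to the cache directory; it relies only on the symlink in its working directory.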
For more information, go to Hadoop Distributed Cache (http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#DistributedCache).
You can use the Amazon EMR console to create job flows that use Distributed Cache.
To specify Distributed Cache files
Launch the Create New Job Flow wizard, specify a streaming job flow, and click Continue.
For information on how to launch the Create New Job Flow wizard and specify a streaming job flow, go to How to Create a Streaming Job Flow.
The Specify Parameters page opens.
In the Extra Args field, include the files and archives to save to the cache.
The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.
| If you want to ... | Action | Example |
|---|---|---|
| Add an individual file to the Distributed Cache | Specify -cacheFile followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. | -cacheFile s3n://bucket_name/file_name#cache_file_name |
| Add an archive file to the Distributed Cache | Enter -cacheArchive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. | -cacheArchive s3n://bucket_name/archive_name#cache_archive_name |
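The location#name pattern used in the Extra Args field can be assembled programmatically. The following sketch builds the argument text from pairs of location and cache name. The helper names are hypothetical, and the check that rejects cache names containing # or / is an assumption added for safety, not a documented rule.

```python
# Sketch: build the Extra Args text for -cacheFile and -cacheArchive.
def cache_spec(location, cache_name):
    """Join a file location and its local cache name with the # sign."""
    # Assumption: the cache name should be a plain file name.
    if not cache_name or "#" in cache_name or "/" in cache_name:
        raise ValueError("cache name must be a plain file name")
    return "%s#%s" % (location, cache_name)


def extra_args(files=(), archives=()):
    """Return the Extra Args text for (location, cache_name) pairs."""
    parts = []
    for location, name in files:
        parts += ["-cacheFile", cache_spec(location, name)]
    for location, name in archives:
        parts += ["-cacheArchive", cache_spec(location, name)]
    return " ".join(parts)


if __name__ == "__main__":
    print(extra_args(
        files=[("s3n://bucket_name/file_name", "cache_file_name")],
        archives=[("s3n://bucket_name/archive_name", "cache_archive_name")],
    ))
```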

Proceed with configuring and launching your streaming job flow.
Your job flow copies the files to the cache location before processing any job flow steps.
You can use the CLI to create job flows that use Distributed Cache. To add files
or archives to the Distributed Cache, specify the --cache or --cache-archive
option on the CLI command line.
To specify Distributed Cache files
Create a streaming job flow and add the following parameters:
For information on how to create a streaming job flow using the CLI, go to How to Create a Streaming Job Flow.
The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.
| If you want to ... | Add the following parameter to the job flow ... |
|---|---|
| Add an individual file to the Distributed Cache | Specify --cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. |
| Add an archive file to the Distributed Cache | Enter --cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. |
The output looks similar to the following.
Created jobflow JobFlowID
Your job flow copies the files to the cache location before processing any job flow steps.
Example 1
The following command shows the creation of a streaming job flow and uses
--cache to add one file,
sample_dataset_cached.dat, to the cache.
./elastic-mapreduce --create --stream \
  --input s3n://my_bucket/my_input \
  --output s3n://my_bucket/my_output \
  --mapper s3n://my_bucket/my_mapper.py \
  --reducer s3n://my_bucket/my_reducer.py \
  --cache s3n://my_bucket/sample_dataset.dat#sample_dataset_cached.dat
Example 2
The following command shows the creation of a streaming job flow and uses
--cache-archive to add an archive of files to the cache.
./elastic-mapreduce --create --stream \
  --input s3n://my_bucket/my_input \
  --output s3n://my_bucket/my_output \
  --mapper s3n://my_bucket/my_mapper.py \
  --reducer s3n://my_bucket/my_reducer.py \
  --cache-archive s3n://my_bucket/sample_dataset.tgz#sample_dataset_cached
This section describes the Amazon EMR API Query request parameters needed to use Distributed Cache.
To specify Distributed Cache files
Create a streaming job flow and add the following parameters:
For information on how to create a streaming job flow using the API, go to How to Create a Streaming Job Flow.
The size of the file (or total size of the files in an archive file) must be less than the allocated cache size.
| If you want to ... | Add the following parameter to the job flow ... |
|---|---|
| Add an individual file to the Distributed Cache | Specify -cache followed by the name and location of the file, the pound (#) sign, and then the name you want to give the file when it's placed in the local cache. |
| Add an archive file to the Distributed Cache | Enter -cache-archive followed by the location of the files in Amazon S3, the pound (#) sign, and then the name you want to give the collection of files in the local cache. |
The following JSON example describes a simple streaming job flow that uses the
Distributed Cache to store the file sample_data.dat.
[
  {
    "Name": "streaming job flow referencing distributed cache",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/wordcount/input",
        "-output", "s3n://myawsbucket",
        "-mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "-reducer", "aggregate",
        "-cache", "s3n://my_bucket/sample_data.dat#sample_data_cached.dat"
      ]
    }
  }
]
All paths are prefixed with their location. “s3://” refers to the Amazon S3 file
system. “s3n://” refers to the Amazon S3 native file system. If you use HDFS, prepend
the path with hdfs:///. Make sure to use three slashes (///), as in
hdfs:///home/hadoop/sampleInput2/.
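The three-slash rule is easy to get wrong, because with only two slashes the first path component is parsed as a host name. The following sketch uses the Python standard library to catch that mistake before a job flow is submitted. The function name is hypothetical, and treating any host part in an hdfs location as an error follows the advice above but is stricter than HDFS itself, which also accepts an explicit name node address.

```python
# Sketch: validate the prefix of an input, output, or cache location.
from urllib.parse import urlsplit

KNOWN_PREFIXES = ("s3", "s3n", "hdfs")


def check_location(location):
    """Return (scheme, path), or raise ValueError for a malformed location."""
    parts = urlsplit(location)
    if parts.scheme not in KNOWN_PREFIXES:
        raise ValueError("unexpected prefix: %r" % parts.scheme)
    if parts.scheme == "hdfs" and parts.netloc:
        # hdfs://home/hadoop/... parses 'home' as a host, not as a directory.
        raise ValueError("use hdfs:/// with three slashes")
    return parts.scheme, parts.path


if __name__ == "__main__":
    print(check_location("hdfs:///home/hadoop/sampleInput2/"))
```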