Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Streaming

Hadoop streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file, so you can run it from the Amazon Elastic MapReduce (Amazon EMR) API or command line just like a standard JAR file.

This section describes how to stream Hadoop.

[Note]Note

Apache Hadoop Streaming is an independent tool. As such, we do not describe all of its functions and parameters. For a complete description of Apache Hadoop Streaming, go to http://hadoop.apache.org/common/docs/r0.20.0/streaming.html.

Using the Hadoop Stream Utility

This section describes how use to Hadoop's streaming utility.

Hadoop Process

1

Write your mapper and reducer executable in the programming language of your choice.

Follow the directions in Hadoop's documentation to write your streaming executables. The programs should read their input from standard input and output data through standard output. By default, each line of input/output represents a record and the first tab on each line is used as a separator between the key and value.

2

Test your executables locally and upload them to Amazon S3.

3

Use the Amazon EMR command line interface or Amazon EMR console to run your program.


Example Running a Streaming Job Flow from the Command Line Interface

The following example shows a standard invocation of the hadoop-streaming utility.

$ ./elastic-mapreduce --create --stream \
     --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input   s3://elasticmapreduce/samples/wordcount/input \
     --output  [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
     --reducer aggregate

In this example, both the mapper and the reducer are executables that read the input from an Amazon S3 bucket and write the output of the job flow to the Amazon S3 bucket specified by output. The mapper parameter specifies the python executable to turn into a JAR file.

Each mapper script launches as a separate process in the Hadoop cluster. Each reducer executable turns the output of the mapper executable into the data output by the job flow.

The input, output, mapper, and reducer parameters are required by most Hadoop stream job flows. The following table describes these and other, optional parameters.

ParameterDescriptionRequired
-input

Location on Amazon S3 of the input data.

Type: String

Default: None

Constraint: URI. If no protocol is specified then it uses the cluster's default file system.

Yes
-output

Location on Amazon S3 where Amazon EMR uploads the processed data.

Type: String

Default: None

Constraint: URI

Default: If a location is not specified, Amazon EMR uploads the data to the location specified by input.

Yes
-mapper

Name of the mapper executable.

Type: String

Default: None

Yes
-reducer

Name of the reducer executable.

Type: String

Default: None

Yes
-cacheFile

Location on Amazon S3 of the mapper executable.

Type: String

Default: None

Constraints: [URI]#[symlink name to create in working directory]

No
-cacheArchive

JAR file to extract into the working directory

Type: String

Default: None

Constraints: [URI]#[symlink directory name to create in working directory

No
-combiner

Combines results

Type: String

Default: None

Constraints: Java class name

No

The following code sample shows the wordSplitter.py executable identified in the previous hadoop command.

#!/usr/bin/python
import sys

def main(argv):
  line = sys.stdin.readline()
  try:
    while line:
      line = line.rstrip()
      words = line.split()
      for word in words:
        print "LongValueSum:" + word + "\t" + "1"
      line = sys.stdin.readline()
  except "end of file":
    return None
if __name__ == "__main__":
  main(sys.argv)

For a sample mapper application, see Running Hadoop MapReduce on Amazon EC2 and Amazon S3.