Amazon Elastic MapReduce
Getting Started Guide (API Version 2009-11-30)
Create a Streaming Job Flow

This example shows how to use Hadoop streaming to count the number of times that a word occurs in a data set. This type of job flow is appropriate if you want to search a large number of logs for a particular error or you want to know the number of blog posts made for each user name. Hadoop streaming enables you to execute MapReduce programs written in languages such as Python, Ruby, and PHP.

To count the occurrence of words, you need a mapper function that iterates through the input data and outputs word-count pairs. You can create a mapper function in Python as shown in the following example:

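#!/usr/bin/python
import re
import sys


def main(argv):
    # A word is a letter followed by any number of letters or digits.
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    for line in sys.stdin:
        for word in pattern.findall(line):
            # Emit "LongValueSum:<word><TAB>1"; the prefix tells the
            # aggregate reducer to sum the values as long integers.
            print("LongValueSum:" + word.lower() + "\t" + "1")


if __name__ == "__main__":
    main(sys.argv)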
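Before uploading the mapper, it can help to check locally what it emits. The following is a minimal sketch, not part of the sample on Amazon S3: it reuses the mapper's regular expression in a hypothetical helper named map_line and prints the records produced for one made-up input line.

```python
import re

# Same token pattern as the mapper: a letter followed by letters or digits.
WORD_PATTERN = re.compile("[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Return the records the mapper would emit for one input line.

    map_line is a hypothetical helper for local testing only.
    """
    return ["LongValueSum:" + word.lower() + "\t1"
            for word in WORD_PATTERN.findall(line)]


# Made-up input line, used only for illustration.
records = map_line("The cat saw the 2nd cat.")
for record in records:
    print(record)
```

Because a word must start with a letter, the token "2nd" in the sample line is emitted as "nd"; the leading digit is skipped.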

To run the Hadoop streaming job with Amazon Elastic MapReduce (Amazon EMR), this mapper function must be uploaded to Amazon S3.

You can save this Python script to your own Amazon S3 location. For your convenience, this example is stored on Amazon S3 at the location s3://elasticmapreduce/samples/wordcount/wordSplitter.py.

The sample input for this job flow is available at s3://elasticmapreduce/samples/wordcount/input.

This example uses the built-in reducer called aggregate. This reducer adds up the counts emitted for each word by the wordSplitter mapper function. The LongValueSum: prefix that the mapper puts on each word tells the reducer to sum the values as data type Long.
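To make the reducer's role concrete, the following sketch emulates in plain Python what a LongValueSum aggregation does with the mapper's records. It is an illustration under stated assumptions, not Hadoop's implementation: the function name aggregate_long_value_sum and the sample records are hypothetical, and the sketch ignores any key that lacks the LongValueSum: prefix.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def aggregate_long_value_sum(records):
    """Sum the values of tab-separated records keyed with LongValueSum:.

    Hypothetical local emulation; the real aggregate reducer runs in Hadoop.
    """
    totals = defaultdict(int)
    for record in records:
        key, value = record.rstrip("\n").split("\t", 1)
        if not key.startswith(PREFIX):
            continue  # this sketch handles only the LongValueSum prefix
        totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


# Made-up mapper output, used only for illustration.
sample = ["LongValueSum:the\t1", "LongValueSum:cat\t1", "LongValueSum:the\t1"]
totals = aggregate_long_value_sum(sample)
for word in sorted(totals):
    print(word + "\t" + str(totals[word]))
```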

To run a streaming job flow

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

       $ ./elastic-mapreduce --create --stream \
           --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
           --input   s3://elasticmapreduce/samples/wordcount/input \
           --output  [A path to a bucket you own on Amazon S3, such as s3n://myawsbucket] \
           --reducer aggregate

    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream ^
           --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py ^
           --input   s3://elasticmapreduce/samples/wordcount/input ^
           --output  [A path to a bucket you own on Amazon S3, such as s3n://myawsbucket] ^
           --reducer aggregate

The output will look similar to:

Created job flow JobFlowID
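If you call the CLI from a script, you may want to capture the job flow ID for later commands. The following sketch assumes only that the output contains a line of the form shown above; the function name parse_job_flow_id and the ID j-EXAMPLE123 are hypothetical.

```python
def parse_job_flow_id(cli_output):
    """Return the ID from a "Created job flow <ID>" line, or None if absent."""
    prefix = "Created job flow "
    for line in cli_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


# Hypothetical CLI output; a real job flow ID will differ.
job_flow_id = parse_job_flow_id("Created job flow j-EXAMPLE123\n")
print(job_flow_id)
```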

This sample may take several minutes to run. You can monitor the job flow from the CLI as described in the Retrieve Information About a Specific Job Flow step or from the Amazon EMR console.

To view the streaming job flow

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Click Refresh.

  3. Click the Hadoop streaming job flow. The job flow is listed with its current STATE.

  4. Click Debug.

    If the job flow STATE is COMPLETED, links to the Amazon EMR log files are displayed.

  5. If the job flow is not completed, click Close, wait a minute, and then attempt Step 4 again.

    Note

    The Actions column has a link to View Jobs. Clicking this link displays an alert, because Jobs, Tasks, and Task Attempts are not available: you did not enable debugging when you created this job flow. To get these additional results, you must enable and configure Hadoop debugging.

  6. After you have viewed the Amazon EMR log files, click Close.

You can find additional Amazon EMR log files in the Amazon S3 bucket you specified in your credentials.json file.

For information about the contents of these logs, see the Amazon Elastic MapReduce Developer Guide.

Tip

Each time you run a Hadoop streaming job flow, you must specify a new --output location, or the job flow will fail. The new location can be a folder within an existing bucket or a newly created bucket.
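One way to get a fresh --output location for every run is to append a timestamp to a folder in a bucket you own. The following is a minimal sketch of that idea: the function name unique_output_location and the folder prefix wordcount/output are hypothetical, and myawsbucket is the placeholder bucket name used above.

```python
from datetime import datetime, timezone


def unique_output_location(bucket, prefix="wordcount/output", now=None):
    """Build an s3n:// path that ends in a UTC timestamp.

    Hypothetical helper; the prefix is an arbitrary folder name.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    return "s3n://{}/{}/{}".format(bucket, prefix, stamp)


print(unique_output_location("myawsbucket"))
```

The timestamp has one-second resolution, so two runs started within the same second would still collide.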

To view job flow results

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Navigate to the Amazon S3 bucket you referenced in --output.

Your job flow results are stored in a text file. The results file contains a list of all words found with the number of times the word occurred in the data set.
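After you download a results file, you can inspect it with a short script. The following sketch assumes each line holds a word and its count separated by a tab; check this against your actual output before relying on it. The function names parse_results and top_words and the sample data are hypothetical.

```python
def parse_results(text):
    """Parse "word<TAB>count" lines into a dict (assumed file layout)."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        counts[word] = int(count)
    return counts


def top_words(counts, n):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up results, used only for illustration.
sample_text = "a\t3\ncat\t2\nthe\t5\n"
counts = parse_results(sample_text)
for word, count in top_words(counts, 2):
    print(word, count)
```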

Now that you have completed a Hadoop streaming job flow, you can clean up your resources so that you do not incur unnecessary charges. To learn how, see Restore Environment.