
This example shows how to use Hadoop streaming to count the number of times that a word occurs in a data set. This type of job flow is appropriate if you want to search a large number of logs for a particular error or you want to know the number of blog posts made for each user name. Hadoop streaming enables you to execute MapReduce programs written in languages such as Python, Ruby, and PHP.
To count the occurrence of words, you need a mapper function that iterates through the input data and outputs word-count pairs. You can create a mapper function in Python as shown in the following example:
#!/usr/bin/python
import sys
import re

def main(argv):
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    for line in sys.stdin:
        for word in pattern.findall(line):
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
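Before uploading the script, it can help to check the mapper logic locally. The following is a minimal sketch and not part of the original sample: it reuses the same regular expression inside a hypothetical helper, `map_line`, written for Python 3, and returns the records the mapper would emit instead of printing them directly.

```python
import re

# Same pattern as the wordSplitter mapper: a letter followed by letters or digits.
WORD_PATTERN = re.compile("[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Return the key/value records the mapper would emit for one input line.

    map_line is a hypothetical helper intended for local testing only.
    """
    return ["LongValueSum:" + word.lower() + "\t" + "1"
            for word in WORD_PATTERN.findall(line)]


if __name__ == "__main__":
    # "42" is skipped because the pattern requires a leading letter.
    for record in map_line("Error: disk full, error code 42"):
        print(record)
```

Each word produces one record with a count of 1, so a repeated word appears more than once in the mapper output; summing those counts is left to the reducer.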
To run the Hadoop streaming job with Amazon Elastic MapReduce (Amazon EMR), this mapper function must be uploaded to Amazon S3.
You can save this Python script to your own Amazon S3 location. For your convenience, this
example is stored on Amazon S3 at the location
s3://elasticmapreduce/samples/wordcount/wordSplitter.py.
The sample input for this job flow is available at
s3://elasticmapreduce/samples/wordcount/input.
This example uses the built-in reducer called aggregate. This
reducer adds up the counts emitted for each word by the wordSplitter mapper function. The
LongValueSum: prefix on each key tells the reducer to sum the values as data type Long.
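To make the reducer's role concrete, the following sketch approximates locally what happens to the mapper output. It is an illustration under stated assumptions, not the Hadoop implementation: the function name `aggregate_long_value_sum` is hypothetical, and the sketch handles only the LongValueSum: prefix used in this example.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def aggregate_long_value_sum(records):
    """Sum the values of tab-separated records whose key starts with PREFIX.

    Local approximation only; the real aggregate reducer runs inside Hadoop.
    """
    totals = defaultdict(int)
    for record in records:
        key, value = record.rstrip("\n").split("\t", 1)
        if key.startswith(PREFIX):
            # Strip the prefix so the result is keyed by the bare word.
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "LongValueSum:error\t1",
        "LongValueSum:disk\t1",
        "LongValueSum:error\t1",
    ]
    for word, count in sorted(aggregate_long_value_sum(sample).items()):
        print(word + "\t" + str(count))
```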
To run a streaming job flow
Enter the following commands from the command-line prompt:
Linux and UNIX users:
$ ./elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
--reducer aggregate
Windows users:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream ^
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py ^
--input s3://elasticmapreduce/samples/wordcount/input ^
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] ^
--reducer aggregate
The output will look similar to:
Created job flow JobFlowID
This sample may take several minutes to run. You can monitor the job flow from the CLI as described in the Retrieve Information About a Specific Job Flow step or from the Amazon EMR console.
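If you launch this job flow from a script rather than by hand, the arguments of the Linux and UNIX command shown above can be assembled programmatically. The sketch below is only an illustration: `build_streaming_command` is a hypothetical helper, it uses only the flags that appear in the command above, and it does not check that the output location exists or that you own it.

```python
def build_streaming_command(output_path):
    """Assemble the argument list for the elastic-mapreduce call shown above.

    output_path is an Amazon S3 location you own; nothing is validated here.
    """
    return [
        "./elastic-mapreduce", "--create", "--stream",
        "--mapper", "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
        "--input", "s3://elasticmapreduce/samples/wordcount/input",
        "--output", output_path,
        "--reducer", "aggregate",
    ]


if __name__ == "__main__":
    # s3n://myawsbucket is the placeholder bucket name used in this example.
    print(" ".join(build_streaming_command("s3n://myawsbucket")))
```

The resulting list could be passed to `subprocess.run` from the directory that contains the elastic-mapreduce CLI.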
To view the streaming job flow
1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
2. Click Refresh.
3. Click the Hadoop streaming job flow. The job flow is listed with its STATE.
4. Click Debug.
   If the job flow STATE is COMPLETED, links to the Amazon EMR log files are displayed.
   If the job flow is not completed, click Close, wait a minute, and then attempt Step 4 again.
   Note: The Actions column has a link to View Jobs. Clicking this link displays an alert stating that Jobs, Tasks, and Task Attempts are not available because you did not enable debugging when you created this job flow. You must enable and configure Hadoop debugging to create these additional results.
5. After you have viewed the Amazon EMR log files, click Close.
You can find additional Amazon EMR log files in the Amazon S3 bucket you specified
in your credentials.json file.
For information about the contents of these logs, see the Amazon Elastic MapReduce Developer Guide.
Tip: Each time you run a Hadoop streaming job flow you must specify a new
To view job flow results
Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
Navigate to the Amazon S3 bucket you referenced in --output.
Your job flow results are stored in a text file. The results file contains a list of all words found with the number of times the word occurred in the data set.
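Once you have downloaded a results file, a few lines of Python can summarize it. The sketch below assumes that each line holds a word and its count separated by a tab; that layout is an assumption, so check it against your own output. The helper names `parse_results` and `top_words` are hypothetical.

```python
def parse_results(lines):
    """Parse 'word<TAB>count' lines into a dict (assumed file layout)."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        # Add to any existing total in case several result files are combined.
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def top_words(counts, n):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = ["error\t2\n", "disk\t1\n", "full\t1\n"]
    for word, count in top_words(parse_results(sample), 2):
        print(word, count)
```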
Now that you have completed a Hadoop streaming job flow, you can clean up your resources so you do not incur any unnecessary charges.
