Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Pig

Amazon Elastic MapReduce (Amazon EMR) enables you to run Pig scripts in two modes:

  • Interactive

  • Batch

Typically, you use interactive mode to troubleshoot your job flow and batch mode in production. After you revise the Pig Latin script in interactive mode, upload it to Amazon S3 and use batch mode to run job flows. For more information about using SSH, go to How to View Logs Using SSH.

In interactive mode, you SSH as the Hadoop user into the master node of the Hadoop cluster and run the Pig Latin script there so that you can debug it. The interactivity of this mode enables you to revise the Pig Latin script more quickly than you could in batch mode.

In batch mode, you create a Pig script using Pig Latin and then upload that script to Amazon S3. When you run a job flow using Pig, the first step downloads that script from Amazon S3 so that it can be used in the MapReduce job flow. When you use the Amazon EMR console, this download is done automatically for you. You use batch mode to run job flows in production.

Creating a Job Flow Using Pig

To run Pig in interactive mode, add the --alive option to the --create command so that the job flow remains active until you terminate it.

$ ./elastic-mapreduce --create --alive --name "Testing Pig -- $USER" \
  --num-instances 5 --instance-type instanceType \
  --pig-interactive

The return is similar to the following:

Created jobflow JobFlowID

You are now running Pig in interactive mode and can execute Pig queries.

Running Pig in Batch Mode

The following process shows how to run Pig in batch mode and assumes that you stored the Pig script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, see the Amazon S3 Getting Started Guide.

To run Pig in batch mode, create a job flow with a step that executes a Pig script stored on Amazon S3.

$ ./elastic-mapreduce --create \
--name "$USER's Pig JobFlow"  \
--pig-script \
--args s3://myawsbucket/myquery.q \
--args -p,INPUT=s3://myawsbucket/input,-p,OUTPUT=s3://myawsbucket/output

The --args option provides arguments to the Pig script. The first --args option specifies the location of the Pig script in Amazon S3. In the second --args option, each -p passes a value into the script: here, the location of the input data (INPUT) and the location in which to store results (OUTPUT). Within the Pig script, these parameters are available as ${variable}, so in this example Pig replaces ${INPUT} and ${OUTPUT} with the values passed in. Because the variables are substituted in a preprocessing step, they can occur anywhere in a Pig script. The return is similar to the following:

Created jobflow JobFlowID
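To picture the preprocessing step described above, the following sketch imitates the substitution in Python using string.Template, which happens to accept the same $NAME and ${NAME} placeholder forms. It is only an illustration under stated assumptions, not how Pig implements substitution: the script text, the relation name raw, and the function name substitute_parameters are hypothetical.

```python
from string import Template

# Hypothetical Pig Latin script text; the relation name and the use of
# PigStorage are illustrative only and do not come from the guide.
PIG_SCRIPT = """\
raw = LOAD '$INPUT' USING PigStorage() AS (line:chararray);
STORE raw INTO '${OUTPUT}';
"""


def substitute_parameters(script_text, params):
    """Replace $NAME / ${NAME} placeholders before the script is parsed.

    Raises KeyError if the script uses a parameter that was not supplied.
    """
    return Template(script_text).substitute(params)


# Values corresponding to the INPUT and OUTPUT arguments on the command line.
params = {
    "INPUT": "s3://myawsbucket/input",
    "OUTPUT": "s3://myawsbucket/output",
}

resolved = substitute_parameters(PIG_SCRIPT, params)
print(resolved)
```

Because the replacement is purely textual, a placeholder can appear inside a quoted path, as in this sketch, or in any other position in the script.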

You might need to add Pig as a new step in an existing job flow. Adding steps can help you test and develop Pig scripts. For example, if the script fails, you can add a new step to the job flow without having to wait for a new job flow to start.

To add a Pig script to an existing job flow in batch mode, specify the location in Amazon S3 of a Pig script and associate it with an existing job flow.

$ ./elastic-mapreduce --jobflow JobFlowID \
--pig-script \
--args s3://myawsbucket/myquery.q \
--args -p,INPUT=s3://myawsbucket/input,-p,OUTPUT=s3://myawsbucket/output

Call User Defined Functions from Pig

Pig provides the ability to call user-defined functions (UDFs) from within Pig scripts. You can do this to implement custom processing to use in your Pig scripts. The languages currently supported are Java, Python/Jython, and JavaScript (though JavaScript support is still experimental).

The following sections describe how to register your functions with Pig so you can call them either from the Pig shell or from within Pig scripts. For more information about using UDFs with Pig, go to http://pig.apache.org/docs/r0.9.2/udf.html.

Call JAR files from Pig

You can use custom JAR files with Pig by using the REGISTER command in your Pig script. The JAR file can be on the local file system or on a remote file system such as Amazon S3. When the Pig script runs, Amazon EMR downloads the JAR file automatically to the master node and then uploads the JAR file to the Hadoop distributed cache. In this way, the JAR file is automatically available as needed to all instances in the cluster.

To use JAR files with Pig

  1. Upload your custom JAR file into Amazon S3.

  2. Use the REGISTER command in your Pig script to specify the location in Amazon S3 of the custom JAR file.

    REGISTER s3://myawsbucket/path/to/my/uploaded.jar;

Call Python/Jython Scripts from Pig

You can register Python scripts with Pig and then call functions in those scripts from the Pig shell or in a Pig script. You do this by specifying the location of the script with the register keyword.

Because Pig is written in Java, it uses the Jython script engine to parse Python scripts. For more information about Jython, go to http://www.jython.org/.

To call a Python/Jython script from Pig

  1. Write a Python script and upload the script to a location in Amazon S3. This should be a bucket owned by the same account that creates the Pig job flow, or one that has permissions set so that the account that created the job flow can access it. In this example, the script is uploaded to s3://myawsbucket/pig/python.

  2. Start a Pig job flow. If you'll be accessing Pig from the Grunt shell, run an interactive job flow. If you're running Pig commands from a script, start a scripted Pig job flow. In this example, we'll start an interactive job flow. For more information about how to create a Pig job flow, see Creating a Job Flow Using Pig.

  3. Because we've launched an interactive job flow, we'll now SSH into the master node where we can run the Grunt shell. For more information about how to SSH into the master node, see SSH into the Master Node.

  4. Run the Grunt shell for Pig by typing pig at the command line.

    pig

  5. Register your Python script with Pig by using the register keyword at the Grunt command prompt, specifying the location of your script in Amazon S3, as shown in the following example.

    grunt> register 's3://myawsbucket/pig/python/myscript.py' using jython as myfunctions;

  6. You can now call functions in your script from within Pig by referencing them through the myfunctions namespace.
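As an illustration of what a registered script might contain, the following is a minimal sketch of myscript.py. The function names reverse_text and text_length and their schemas are hypothetical. The sketch assumes that Pig makes the outputSchema decorator available when it loads the script through Jython; the fallback definition is an addition for this example only, so that the file can also be imported and tested outside Pig.

```python
# myscript.py -- hypothetical contents for the script registered above.

# Assumption: Pig supplies the outputSchema decorator when it loads a script
# through Jython. The fallback below exists only so this file can also be
# imported and tested with a plain Python interpreter.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            # Record the schema string on the function for inspection.
            func.outputSchema = schema
            return func
        return decorator


@outputSchema("word:chararray")
def reverse_text(text):
    """Return the input string reversed, or None for a null field."""
    if text is None:
        return None
    return text[::-1]


@outputSchema("length:int")
def text_length(text):
    """Return the number of characters in the input, treating null as 0."""
    if text is None:
        return 0
    return len(text)
```

With a script like this registered as myfunctions, a Pig statement would refer to the functions as myfunctions.reverse_text and myfunctions.text_length.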

Additional Pig Functions

The Amazon EMR development team has created additional Pig functions that simplify string manipulation and make it easier to format date-time information. These are available at http://aws.amazon.com/code/2730.

How to Configure the Pig Installation

For information about how to configure Pig on Amazon EMR as well as the versions of Pig that you can run on Amazon EMR and their patches, see Pig Configuration.