Amazon Elastic MapReduce (Amazon EMR) enables you to run Pig scripts in two modes:
Interactive
Batch
Typically, you use interactive mode to troubleshoot your job flow and batch mode in production. After you revise the Pig Latin script using the interactive mode, you should upload it to Amazon S3 and use the batch mode to run job flows. For more information about using SSH, go to How to View Logs Using SSH.
In interactive mode, you SSH as the Hadoop user into the master node in the Hadoop cluster and run the Pig Latin script on it so that you can debug it. The interactivity of this mode enables you to revise the Pig Latin script more quickly than you could in batch mode.
In batch mode, you create a Pig script using Pig Latin and then load that script into Amazon S3. When you run a job flow using Pig, the first step is to download that script from Amazon S3 so that it is used in the MapReduce job flow. When you use the Amazon EMR console this download is done automatically for you. You use batch mode to run job flows in production.
To run Pig in interactive mode, use the --alive option with the --create command so that the job flow remains active until you terminate it.
$ ./elastic-mapreduce --create --alive --name "Testing Pig -- $USER" \
--num-instances 5 --instance-type instanceType \
--pig-interactive
The return is similar to the following:
Created jobflow JobFlowID
You are now running Pig in interactive mode and can execute Pig queries.
The following process shows how to run Pig in batch mode and assumes that you stored the Pig script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, see the Amazon S3 Getting Started Guide.
To run Pig in batch mode, create a job flow with a step that executes a Pig script stored on Amazon S3.
$ ./elastic-mapreduce --create \
--name "$USER's Pig JobFlow" \
--pig-script \
--args s3://myawsbucket/myquery.q \
--args -p,INPUT=s3://myawsbucket/input,-p,OUTPUT=s3://myawsbucket/output
The --args option provides arguments to the Pig script. The first --args option specifies the location of the Pig script in Amazon S3. In the second --args option, each -p passes a value into the script: here, the location of the input data (INPUT) and where to store results (OUTPUT). Within the Pig script these parameters are available as ${variable}, so in this example Pig replaces ${INPUT} and ${OUTPUT} with the values passed in. These variables are substituted as a preprocessing step, so they can occur anywhere in a Pig script.
The return is similar to the following:
Created jobflow JobFlowID
You might need to add Pig as a new step in an existing job flow. Adding steps can help you test and develop Pig scripts. For example, if the script fails, you can add a new step to the job flow without having to wait for a new job flow to start.
To add a Pig script to an existing job flow in batch mode, specify the location in Amazon S3 of a Pig script and associate it with an existing job flow.
$ ./elastic-mapreduce --jobflow JobFlowID \
--pig-script \
--args s3://myawsbucket/myquery.q \
--args -p,INPUT=s3://myawsbucket/input,-p,OUTPUT=s3://myawsbucket/output
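The substitution of ${INPUT} and ${OUTPUT} that both batch commands rely on can be pictured with a short Python sketch. This is only an analogy: it uses Python's string.Template, which happens to accept the same ${name} syntax, and is not how Pig implements its preprocessor. The script text and the relation name records are hypothetical stand-ins for the contents of myquery.q.

```python
from string import Template

# Hypothetical script text standing in for myquery.q; the relation name
# and statements are placeholders, not taken from a real job flow.
SCRIPT_TEXT = """\
records = LOAD '${INPUT}';
STORE records INTO '${OUTPUT}';
"""

# Values as they would be passed on the command line (INPUT=..., OUTPUT=...).
params = {
    "INPUT": "s3://myawsbucket/input",
    "OUTPUT": "s3://myawsbucket/output",
}


def preprocess(script_text, values):
    """Textually replace every ${NAME} placeholder before any parsing."""
    return Template(script_text).substitute(values)


expanded = preprocess(SCRIPT_TEXT, params)
print(expanded)
```

Because the replacement is purely textual, the same script can be reused against different buckets by changing only the values passed on the command line.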
Pig provides the ability to call user-defined functions (UDFs) from within Pig scripts. You can do this to implement custom processing to use in your Pig scripts. The languages currently supported are Java, Python/Jython, and JavaScript (though JavaScript support is still experimental).
The following sections describe how to register your functions with Pig so you can call them either from the Pig shell or from within Pig scripts. For more information about using UDFs with Pig, go to http://pig.apache.org/docs/r0.9.2/udf.html.
You can use custom JAR files with Pig using the REGISTER command in your Pig script. The JAR file can be on the local file system or on a remote file system such as Amazon S3. When the Pig script runs, Amazon EMR downloads the JAR file automatically to the master node and then uploads the JAR file to the Hadoop distributed cache. In this way, the JAR file is automatically used as necessary by all instances in the cluster.
You can register Python scripts with Pig and then call functions in those scripts from the Pig shell or in a Pig script. You
do this by specifying the location of the script with the register keyword.
Because Pig is written in Java, it uses the Jython script engine to parse Python scripts. For more information about Jython, go to http://www.jython.org/.
To call a Python/Jython script from Pig
Write a Python script and upload the script to a location in Amazon S3. This should be a bucket that is owned by the same account that creates the Pig job flow, or one whose permissions allow the account that creates the job flow to access it. In this example, the script is uploaded to s3://myawsbucket/pig/python.
Start a Pig job flow. If you'll be accessing Pig from the Grunt shell, run an interactive job flow. If you're running Pig commands from a script, start a scripted Pig job flow. In this example, we'll start an interactive job flow. For more information about how to create a Pig job flow, see Creating a Job Flow Using Pig.
Because we've launched an interactive job flow, we'll now SSH into the master node where we can run the Grunt shell. For more information about how to SSH into the master node, see SSH into the Master Node.
Run the Grunt shell for Pig by typing pig at the command line.
pig
Register your Python script with Pig using the register keyword at the Grunt command prompt, as shown in the following, where you
would specify the location of your script in Amazon S3.
grunt> register 's3://myawsbucket/pig/python/myscript.py' using jython as myfunctions;
You can now call functions in your script from within Pig by referencing them using myfunctions.
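As an illustration of what the registered script might contain, the following is a minimal sketch of a myscript.py. The function names to_upper and word_length are hypothetical. The outputSchema decorator is the one Pig's Jython support makes available to UDF scripts for declaring return types; the fallback definition is an assumption added only so that the file can also be run and tested with a plain Python interpreter outside Pig.

```python
# myscript.py -- hypothetical contents of the script registered above.
# When Pig runs this file through Jython, it supplies the outputSchema
# decorator. The fallback below is an assumption that only takes effect
# when the name is missing, for example under a plain Python interpreter.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema  # record the declared Pig schema
            return func
        return decorate


@outputSchema("word:chararray")
def to_upper(word):
    """Return the input string upper-cased, or None for a null input."""
    if word is None:
        return None
    return word.upper()


@outputSchema("length:int")
def word_length(word):
    """Return the number of characters in the input, or 0 for a null input."""
    if word is None:
        return 0
    return len(word)
```

With the script registered as myfunctions, a call from Grunt would take a form such as myfunctions.to_upper(name), where name is a chararray field in one of your relations.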
The Amazon EMR development team has created additional Pig functions that simplify string manipulation and make it easier to format date-time information. These are available at http://aws.amazon.com/code/2730.
For information about how to configure Pig on Amazon EMR as well as the versions of Pig that you can run on Amazon EMR and their patches, see Pig Configuration.