Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files, just like other native Hadoop applications.
Multitool is a Cascading application that provides a simple command line interface for managing large datasets. For example, you can filter records matching a Java regular expression from data stored in Amazon S3 and copy the results to the Hadoop file system.
You can run the Cascading Multitool application on Amazon Elastic MapReduce (Amazon EMR) using either the Amazon EMR command line interface or the Amazon EMR console. Amazon EMR supports all Multitool arguments.
The Multitool JAR file is at:
s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar
The Multitool source code, along with a number of other tools, is available for download from the project website at http://www.cascading.org/modules.html.
For additional samples and tips for using Multitool, go to "Cascading.Multitool - Tips on using the Multitool" and "Generate usage reports". For more information about Cascading, go to http://www.cascading.org.
To create a Cascading job flow using the CLI
Create a job flow referencing the Cascading Multitool JAR file and supply the appropriate Multitool arguments as follows:
$ ./elastic-mapreduce --create --jar s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar --args [args]
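As a rough illustration of how the [args] placeholder might be filled in, the following sketch assembles a command that passes the -input and -output Multitool arguments. The bucket name my-bucket and both paths are hypothetical placeholders, and the sketch assumes that --args accepts a comma-separated list of arguments. It only prints the command so that you can review it before running it yourself.

```shell
#!/bin/sh
# Location of the Multitool JAR, as given in this topic.
JAR="s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar"

# Hypothetical locations: replace my-bucket and the paths with your own.
INPUT="s3n://my-bucket/input/"
OUTPUT="s3n://my-bucket/output/"

# Assumption: --args takes the Multitool arguments as one comma-separated list.
ARGS="-input,$INPUT,-output,$OUTPUT"

# Build the command as a string and print it for review (dry run only).
CMD="./elastic-mapreduce --create --jar $JAR --args $ARGS"
echo "$CMD"
```

After checking the printed command, you can paste it into a shell in the directory where the Amazon EMR CLI is installed.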
To create a Cascading job flow using the Amazon EMR console
Start a new job flow:
Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.
Select a Region.

Click Create a New Job Flow.
The Create a New Job Flow page appears.
On the DEFINE JOB FLOW page, enter the following information:
Enter a name in the Job Flow Name field.
We recommend that you use a descriptive name. It does not need to be unique.
Select Run your own application.
Select Custom JAR from the menu and click Continue.

On the SPECIFY PARAMETERS page, specify the job flow parameters:
In the Jar Location field, enter the location of the Multitool JAR file, for example:
s3n://elasticmapreduce/samples/multitool/multitool-aws-03-31-09.jar
Specify any arguments for the job flow.
All Multitool arguments are supported, including those listed in the following table.
| Parameter | Description |
|---|---|
| -input | Location of input file. |
| -output | Location of output files. |
| -start | Use data created after the start time. |
| -end | Use data created by the end time. |
Click Continue.

On the CONFIGURE EC2 INSTANCES page, accept the default parameters and click Continue.
On the ADVANCED OPTIONS page, accept the default parameters and click Continue.
On the BOOTSTRAP ACTIONS page, select Proceed with no Bootstrap Actions, and then click Continue.
For more information about bootstrap actions, see Bootstrap Actions.

Click Continue to accept the defaults for any remaining wizard steps, and then click Create Job Flow to launch the Cascading job flow.