This section compares the job flow types supported in Amazon Elastic MapReduce (Amazon EMR). In most cases, you can use any job flow type to process large amounts of data with Amazon EMR. Which type is right for you depends on the structure of your data, your current knowledge of a scripting or programming language, and how much effort you want to spend writing MapReduce code. The following table provides a high-level comparison of the features and abilities of each job flow type. Brief descriptions of each job flow type follow the table.
| Cascading | Custom JAR | Hadoop streaming | Hive | Pig |
|---|---|---|---|---|
| Versions supported on Amazon Elastic MapReduce (Amazon EMR) | ||||
| n/a | n/a | AMI 1.0: Hadoop 0.18, 0.20; AMI 2.0: Hadoop 0.20.205 | AMI 1.0: Hive 0.4, 0.5, 0.7, 0.7.1; AMI 2.0: Hive 0.7.1 | AMI 1.0: Pig 0.3, 0.6; AMI 2.0: Pig 0.9.1 |
| Supported programming language | ||||
| Java | Java | Any scripting language such as Python or Ruby | SQL-like | Pig Latin |
| Supports schemas and types | ||||
| no | no | no | yes (explicit) | yes (implicit) |
| Can share schema and/or metastore | ||||
| no | no | no | yes | no |
| Supports partitions | ||||
| no | no | no | yes | no |
| Can be configured as a server to support dynamic connections | ||||
| no | no | no | Option (Thrift) | no |
| User Defined Functions (UDF) that allow configuration of parameters | ||||
| yes | yes | yes | yes (Java) | yes (Java) |
| Custom Serializer/ Deserializer option to supply custom serialization mechanism for all parts of Hadoop | ||||
| no | no | no | yes | yes |
| Allows direct read/write access to HDFS | ||||
| no | no | no | yes (implicit) | yes (explicit) |
| Built-in ability to manage data using JOIN, ORDER, and SORT | ||||
| yes | no | yes | yes | yes |
| Provides a shell interface | ||||
| no | no | no | yes | yes |
| Supports streaming of data | ||||
| yes | yes | yes | yes | yes |
| Amazon EMR console supports the job flow type | ||||
| yes (Custom JAR) | yes | yes | yes (Batch) | yes (Batch) |
| AWS provides JDBC drivers | ||||
| no | no | no | yes | no |
| Accessible with 3rd party interface | ||||
| yes | no | no | yes | no |
Cascading is a Java library that simplifies using the Hadoop MapReduce API. The API is based on pipes and filters, providing features like splitting and joining data streams.
The Custom JAR job flow type supports MapReduce programs written in Java, so you can leverage your existing knowledge of Java. This type gives you the most flexibility of any job flow type in designing your job flow, but you must know Java and the MapReduce API. Custom JAR is a low-level interface: you are responsible for converting your problem definition into specific Map and Reduce tasks and then implementing those tasks in your JAR.
Hadoop streaming is a built-in utility provided with Hadoop. Streaming supports any scripting language, such as Python or Ruby. Scripts in these languages are easy to read and debug, numerous libraries are available for them, and they are fast and simple to develop. You can script your data analysis process and reduce the amount of code you write by using existing libraries. Streaming is a low-level interface: you are responsible for converting your problem definition into specific Map and Reduce tasks and then implementing those tasks via scripts.
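As an illustration of what implementing Map and Reduce tasks via scripts can look like, the following is a minimal word-count sketch in Python. It relies on the usual Hadoop streaming conventions: each task reads records from standard input and writes tab-separated key/value records to standard output, and the reducer receives the mapper output sorted by key. The names `mapper`, `reducer`, and `run_local`, and the `map`/`reduce` command-line argument, are assumptions made for this example and are not part of Hadoop or Amazon EMR.

```python
#!/usr/bin/env python3
"""Word count written in the Hadoop streaming style (illustrative sketch)."""
import sys
from itertools import groupby


def mapper(lines):
    """Map task: emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce task: sum the counts for each word.

    Assumes the records arrive sorted by key, so all records for one
    word are adjacent.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_local(lines):
    """Simulate map -> sort -> reduce in memory for quick local testing."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical entry point: run the script with "map" or "reduce" so the
# same file can serve as either task, reading stdin and writing stdout.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

Before submitting a job flow, a script like this can be checked locally by calling `run_local` on a few sample lines, or by piping a sample file through the map step, the shell `sort` command, and the reduce step in turn.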
Hive is an open-source project that uses a SQL-like language. If you are familiar with SQL, the transition to using Hive is fairly easy. Hive allows customizations using Java JARs.
The following information resources are available for Hive:
Hive overview: http://wiki.apache.org/hadoop/Hive
Hive video tutorial: http://aws.amazon.com/articles/2862
Running Hive on Amazon Elastic MapReduce (Amazon EMR): http://aws.amazon.com/articles/2857
Additional features of Hive in Amazon Elastic MapReduce: http://aws.amazon.com/articles/2856
Operating a data warehouse with Hive, Amazon EMR and Amazon SimpleDB: http://aws.amazon.com/articles/2854
Contextual Advertising using Apache Hive and Amazon Elastic MapReduce (Amazon EMR) with High Performance Computing instances: http://aws.amazon.com/articles/2855
QL-Hive Language Manual: http://wiki.apache.org/hadoop/Hive/LanguageManual
This document explains the SQL-like language called Hive QL. Hive converts QL into MapReduce algorithms for job flows that you can then run using Amazon Elastic MapReduce.
Pig is an open-source project that uses its own language, called Pig Latin. If you have existing scripts written in Pig Latin, you can use them on Amazon EMR with little or no modification.
The following information resources are available for Pig and Pig Latin:
Pig tutorial: Apache Log Analysis using Pig
This tutorial shows you how to analyze Apache logs using Pig and Elastic MapReduce.
Pig video tutorial: Video that shows how to use a Pig script with the Amazon EMR console and SSH
This video tutorial shows you how to use Pig in batch and interactive modes with Elastic MapReduce.
Sample Pig script: Parsing Logs with Apache Pig and Elastic MapReduce
This document shows a sample Pig script.
PiggyBank functions: String Manipulation and DateTime Functions For Pig
This is a list of five functions that AWS added to the Pig library.
Pig Latin: http://pig.apache.org/docs/r0.7.0/piglatin_ref1.html
This document explains the data flow language called Pig Latin. Pig converts Pig Latin into MapReduce job flows that you can then run using Elastic MapReduce.
Pig video tutorial: Using a Pig Script with the Console Video Tutorial