Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Appendix: Compare Job Flow Types

This section compares the job flow types supported in Amazon Elastic MapReduce (Amazon EMR). In most cases, you can use any job flow type to process large amounts of data with Amazon EMR. Choosing the type that is right for you depends on the structure of your data, your current knowledge of a scripting or programming language, and how much effort you want to spend writing MapReduce code. The following table provides a high-level comparison of the features and abilities of each job flow type. Brief descriptions of each job flow type follow the table.

| Feature | Cascading | Custom JAR | Hadoop streaming | Hive | Pig |
|---|---|---|---|---|---|
| Versions supported on Amazon Elastic MapReduce (Amazon EMR) | n/a | n/a | AMI 1.0: Hadoop 0.18, 0.20; AMI 2.0: Hadoop 0.20.205 | AMI 1.0: Hive 0.4, 0.5, 0.7, 0.7.1; AMI 2.0: Hive 0.7.1 | AMI 1.0: Pig 0.3, 0.6; AMI 2.0: Pig 0.9.1 |
| Supported programming language | Java | Java | Any scripting language, such as Python or Ruby | SQL-like | Pig Latin |
| Supports schemas and types | no | no | no | yes (explicit) | yes (implicit) |
| Can share schema and/or metastore | no | no | no | yes | no |
| Supports partitions | no | no | no | yes | no |
| Can be configured as a server to support dynamic connections | no | no | no | Option (Thrift) | no |
| User Defined Functions (UDF) that allow configuration of parameters | yes | yes | yes | yes (Java) | yes (Java) |
| Custom serializer/deserializer option to supply a custom serialization mechanism for all parts of Hadoop | no | no | no | yes | yes |
| Allows direct read/write access to HDFS | no | no | no | yes (implicit) | yes (explicit) |
| Built-in ability to manage data using JOIN, ORDER, and SORT | yes | no | yes | yes | yes |
| Provides a shell interface | no | no | no | yes | yes |
| Supports streaming of data | yes | yes | yes | yes | yes |
| Amazon EMR console supports the job flow type | yes (Custom JAR) | yes | yes | yes (Batch) | yes (Batch) |
| AWS provides JDBC drivers | no | no | no | yes | no |
| Accessible with 3rd party interface | yes | no | no | yes | no |

Cascading

Cascading is a Java library that simplifies using the Hadoop MapReduce API. The API is based on pipes and filters, providing features like splitting and joining data streams.
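
The pipes-and-filters idea can be illustrated with a small conceptual sketch. This is not the Cascading API, which is a Java library; it is plain Python that only mimics the two operations named above, splitting a stream and joining two streams. The function names (`split_stream`, `join_streams`) and the sample records are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def split_stream(records, predicate):
    """Split one stream into two lists: records that match and records that do not."""
    matched, unmatched = [], []
    for record in records:
        (matched if predicate(record) else unmatched).append(record)
    return matched, unmatched


def join_streams(left, right, key):
    """Inner-join two streams of dicts on a shared field name."""
    # Index the right-hand stream by the join key.
    index = defaultdict(list)
    for record in right:
        index[record[key]].append(record)
    # Emit one merged record per matching (left, right) pair.
    for record in left:
        for match in index.get(record[key], []):
            yield {**match, **record}


# Hypothetical sample data, used only to exercise the sketch.
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Raj"},
]
orders = [
    {"customer_id": 1, "total": 250},
    {"customer_id": 2, "total": 40},
    {"customer_id": 1, "total": 120},
]

# Split: separate large orders from small ones.
large, small = split_stream(orders, lambda order: order["total"] >= 100)

# Join: attach customer details to each large order.
joined = list(join_streams(large, customers, "customer_id"))
```

In Cascading itself, operations of this kind are assembled in Java and then executed as MapReduce jobs; the sketch runs in memory and shows only the shape of the data flow.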

Custom JAR

The Custom JAR job flow type supports MapReduce programs written in Java, so you can apply your existing knowledge of Java. It gives you the most flexibility of any job flow type in designing your job flow, but you must know Java and the MapReduce API. Custom JAR is a low-level interface: you are responsible for converting your problem definition into specific Map and Reduce tasks and then implementing those tasks in your JAR.

Hadoop Streaming

Hadoop streaming is a built-in utility provided with Hadoop. Streaming supports any scripting language, such as Python or Ruby. Scripts in these languages are easy to read and debug, numerous libraries and data are available, and development is fast and simple. You can script your data analysis process and reuse existing libraries instead of writing that code yourself. Streaming is a low-level interface: you are responsible for converting your problem definition into specific Map and Reduce tasks and then implementing those tasks as scripts.
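
As an illustration of converting a problem into Map and Reduce tasks, the following minimal sketch counts words in the style of a streaming script. It assumes the usual streaming convention of line-oriented, tab-separated key/value records, with reducer input sorted by key. The function names, the `run_locally` helper, and the `map`/`reduce` argument convention in `main` are hypothetical choices for this example, not part of Hadoop or Amazon EMR.

```python
import itertools
import sys


def mapper(lines):
    """Map task: emit one "word<TAB>1" record for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce task: sum the counts for each word.

    Assumes records arrive sorted by key, so all records for one word
    are adjacent and groupby can collect them in a single pass.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in itertools.groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process, for testing only."""
    return list(reducer(sorted(mapper(lines))))


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: run the map task if the first argument is
    "map", otherwise run the reduce task, reading stdin and writing stdout."""
    task = mapper if argv[1:] == ["map"] else reducer
    for record in task(stdin):
        stdout.write(record + "\n")
```

Calling `run_locally(["the quick fox", "The lazy dog the"])` exercises both tasks without a cluster. To use the same file as a streaming script, it would need a final line that calls `main(sys.argv)`, and the script would then be supplied as both the mapper and the reducer with the matching argument.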

Hive

Hive is an open-source project that uses a SQL-like language. If you are familiar with SQL, the transition to using Hive is fairly easy. Hive allows customizations using Java JARs.

The following information resources are available for Hive:

Pig

Pig is an open-source project that uses its own scripting language, called Pig Latin. If you have existing scripts written in Pig Latin, you can use them on Amazon EMR with little or no modification.

The following information resources are available for Pig and Pig Latin: