Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Glossary

Amazon Machine Image

An Amazon Machine Image (AMI) is similar to the root drive of your computer. It contains the operating system and can also include software and layers of your application such as database servers, middleware, web servers, etc. AMIs are encrypted machine images stored in Amazon Elastic Block Store or Amazon Simple Storage Service.

authentication

The process of proving your identity to the system.

Access Key ID

A string that AWS distributes to uniquely identify each AWS user; it is an alphanumeric token associated with your Secret Access Key.

block

A data set. Amazon Elastic MapReduce (Amazon EMR) breaks large amounts of data into subsets, each of which is called a data block. Amazon EMR assigns an ID to each block and uses a hash table to keep track of block processing.
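
The idea of ID-tagged blocks tracked in a hash table can be illustrated with a minimal Python sketch. The function names, the sequential integer IDs, and the "pending"/"done" states are assumptions made for illustration; they do not describe how Amazon EMR implements block tracking internally.

```python
def split_into_blocks(records, block_size):
    """Split a list of records into consecutive blocks of at most block_size items."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def track_blocks(blocks):
    """Build a hash table (a Python dict) mapping each block ID to a processing state.

    The integer IDs and the state strings are hypothetical.
    """
    return {block_id: "pending" for block_id in range(len(blocks))}


records = list(range(10))
blocks = split_into_blocks(records, 4)
status = track_blocks(blocks)

# Mark the first block as processed; the lookup by ID is a constant-time dict access.
status[0] = "done"
```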

bootstrap action

Default or custom actions that you specify to run a script or an application on all nodes of a job flow before Hadoop starts.

bucket

A container for objects stored in Amazon S3. Every object is contained in a bucket. For example, if the object named photos/puppy.jpg is stored in the johnsmith bucket, then authorized users can access the object with the URL http://johnsmith.s3.amazonaws.com/photos/puppy.jpg.
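
As a small sketch of how such a URL is assembled, the following Python helper places the bucket name in front of the s3.amazonaws.com endpoint and appends the object name. The helper name object_url and its endpoint parameter are hypothetical, not part of any AWS library.

```python
from urllib.parse import quote


def object_url(bucket, key, endpoint="s3.amazonaws.com"):
    """Return a URL of the form http://<bucket>.<endpoint>/<key>.

    quote() percent-encodes characters that are unsafe in a URL path
    and leaves "/" untouched.
    """
    return f"http://{bucket}.{endpoint}/{quote(key)}"


url = object_url("johnsmith", "photos/puppy.jpg")
```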

Cascading

Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications.

core instance group

An instance group managing core nodes. Core instance groups must always contain at least one core node.

core node

A core node is an Amazon EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). It is managed by the master node, which schedules the Hadoop tasks that run on core and task nodes and monitors their status. While a job flow is running, you can increase, but not decrease, the number of core nodes. Because core nodes store data and cannot be removed from a job flow, Amazon EC2 instances assigned as core nodes are capacity that you need to allot for the entire job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons.

endpoint

A URL that identifies a host and port as the entry point for a web service. Every web service request contains an endpoint. Most AWS products provide regional endpoints to enable faster connectivity.

HMAC

HMAC (Hash-based Message Authentication Code) is a specific construction for calculating a message authentication code (MAC) involving a cryptographic hash function in combination with a secret key. You can use it to verify both the data integrity and the authenticity of a message at the same time. AWS calculates the HMAC using a standard, cryptographic hash algorithm, such as SHA-256.
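
A minimal sketch of computing and checking an HMAC with SHA-256, using Python's standard hmac and hashlib modules. The key and message below are made-up placeholders, and the sketch shows only the HMAC calculation itself; it is not the complete AWS request-signing procedure.

```python
import hashlib
import hmac


def sign(secret_key: bytes, message: bytes) -> str:
    """Return the hex-encoded HMAC-SHA256 of message under secret_key."""
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def verify(secret_key: bytes, message: bytes, signature: str) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    return hmac.compare_digest(sign(secret_key, message), signature)


secret = b"example-secret-key"        # placeholder, not a real credential
message = b"Action=DescribeJobFlows"  # placeholder message
signature = sign(secret, message)
```

Because the receiver recomputes the code with the same secret key, a changed message or a wrong key produces a different value, which is how a single check covers both integrity and authenticity.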

intermediate results

Processing output created by the map step in the MapReduce process.

job flow

A job flow specifies the complete processing of the data. It consists of one or more steps, which specify all of the functions to be performed on the data.

key

The unique identifier for an object in a bucket. Every object in a bucket has exactly one key. Because a bucket and key together uniquely identify each object, you can think of Amazon S3 as a basic data map between the bucket + key, and the object itself. You can uniquely address every object in Amazon S3 through the combination of the web service endpoint, bucket name, and key, for example: http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl, where doc is the name of the bucket, and 2006-03-01/AmazonS3.wsdl is the key.
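
Going the other way, an object URL of this form can be split back into its bucket name and key. The following is a sketch only; the function name split_object_url is hypothetical, and it assumes the bucket appears as the leading part of the host name in front of the endpoint.

```python
from urllib.parse import unquote, urlparse


def split_object_url(url, endpoint="s3.amazonaws.com"):
    """Return (bucket, key) for a URL of the form http://<bucket>.<endpoint>/<key>."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    suffix = "." + endpoint
    if not host.endswith(suffix):
        raise ValueError(f"host {host!r} does not end with {suffix!r}")
    bucket = host[:-len(suffix)]
    # Drop the single leading "/" that separates host and path, then decode.
    key = unquote(parsed.path[1:])
    return bucket, key


bucket, key = split_object_url("http://doc.s3.amazonaws.com/2006-03-01/AmazonS3.wsdl")
```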

mapper

An executable that splits the raw data into key/value pairs. The reducer uses the output of the mapper, called the intermediate results, as its input.

master instance group

The instance group managing the master node. There can be only one master instance group per job flow.

master node

A process running on an Amazon Machine Image that keeps track of the work its core and task nodes complete.

metadata

The metadata is a set of name-value pairs that describe the object. These include default metadata such as the date last modified and standard HTTP metadata such as Content-Type. The developer can also specify custom metadata at the time the object is stored.

node

After an Amazon Machine Image (AMI) is launched, the resulting running system is referred to as a node. All instances based on the same AMI start out identical and any information on them is lost when the node terminates or fails.

object

The fundamental entity stored in Amazon S3. Objects consist of object data and metadata. The data portion is opaque to Amazon S3.

reducer

An executable that uses the intermediate results from the mapper and processes them into the final output.
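
The relationship between mapper, intermediate results, and reducer can be sketched in plain Python, without Hadoop. The input format ("year,temperature" lines), the function names, and the choice of a maximum as the reduction are all assumptions made for illustration.

```python
from collections import defaultdict


def mapper(raw_line):
    """Split one raw "year,temperature" line into a (key, value) pair."""
    year, temperature = raw_line.split(",")
    return [(year, int(temperature))]


def group_by_key(pairs):
    """Collect all values that share a key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce all values for one key to a single result (here, the maximum)."""
    return key, max(values)


raw_data = ["2008,31", "2009,28", "2008,35"]

# Output of the map step: the intermediate results.
intermediate = [pair for line in raw_data for pair in mapper(line)]

# Output of the reduce step: the final output.
final = dict(reducer(k, v) for k, v in group_by_key(intermediate).items())
```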

Secret Access Key

A key that Amazon Web Services assigns to you when you sign up for an AWS Account. In request authentication, it is the private key in a public/private key pair. (Sometimes called simply a "secret key.")

service endpoint

See endpoint.

shutdown actions

A predefined bootstrap action that launches a script that executes a series of commands in parallel before terminating the job flow.

signature

Refers to a digital signature, which is a mathematical way to confirm the authenticity of a digital message. AWS uses signatures to authenticate the requests you send to our web services.

slave node

Represents any nonmaster node in a Hadoop cluster.

See core node and task node.

step

A single function applied to the data in a job flow. Together, all of the steps make up a job flow.

step type

The type of work done in a step. There are a limited number of step types, such as moving data from Amazon S3 to Amazon EC2 or moving data from Amazon EC2 to Amazon S3.

streaming

A utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java.
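
Streaming executables exchange data as lines of text: a mapper reads input lines and writes key and value separated by a tab, and a reducer reads those lines sorted by key. The Python sketch below follows that convention for a word count. The function names are hypothetical, and the functions take any iterable of lines so that they can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Emit one "word<TAB>1" line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for the sort between the map and reduce steps.
mapped = list(map_lines(["a b a"]))
reduced = list(reduce_lines(sorted(mapped)))
```

In an actual streaming script, each function would be fed sys.stdin and its results printed to standard output, one per line.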

task instance group

An instance group managing task nodes.

task node

A task node is an Amazon EC2 instance that runs Hadoop map and reduce tasks and does not store data. It is managed by the master node, which schedules the Hadoop tasks that run on core nodes and task nodes and monitors their status. While a job flow is running, you can increase and decrease the number of task nodes. Because task nodes do not store data and can be added and removed from a job flow, you can use them to manage the amount of Amazon EC2 instance capacity your job flow uses, increasing it to handle peak loads, and decreasing it later. Task nodes run only a TaskTracker Hadoop daemon.
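
The resizing rules in the core node and task node entries can be summarized as a small check. This is a sketch of the rules as stated in this glossary, not an Amazon EMR API; the function name can_resize and the group type strings are hypothetical.

```python
def can_resize(group_type, current_count, requested_count):
    """Return True if a running job flow could move to requested_count nodes.

    Core groups can only grow and must keep at least one node;
    task groups can grow or shrink.
    """
    if requested_count < 0:
        raise ValueError("requested_count must not be negative")
    if group_type == "core":
        return requested_count >= max(current_count, 1)
    if group_type == "task":
        return True
    raise ValueError(f"unknown group type: {group_type!r}")


peak = can_resize("task", 2, 10)        # add task nodes for a peak load
after_peak = can_resize("task", 10, 2)  # remove them again later
shrink_core = can_resize("core", 4, 3)  # core nodes cannot be removed
```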

tuning

Selecting the number and type of Amazon Machine Images to run a Hadoop job flow most efficiently.