A job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. Steps are executed in the order in which they are defined in the job flow.
You can track the progress of a job flow by checking its state. The following diagram shows the life cycle of a job flow and how each part of the job flow process maps to a particular job flow state.

A successful Amazon Elastic MapReduce (Amazon EMR) job flow follows this process: Amazon EMR first provisions a
Hadoop cluster. During this phase, the job flow state is STARTING. Next, any user-defined bootstrap actions are run.
During this phase, the job flow state is BOOTSTRAPPING.
After all bootstrap actions are completed, the job flow state is RUNNING. The job
flow sequentially runs all job flow steps during this phase. After all steps run, the job flow
state transitions to SHUTTING_DOWN and the job flow shuts down the cluster. All data stored on a cluster node is deleted. Information stored
elsewhere, such as in your Amazon S3 bucket, persists. Finally,
when all job flow activity is complete, the job flow state is marked as COMPLETED.
You can configure a job flow to go into a WAITING state once it completes
processing of all steps. A job flow in the WAITING state continues running,
waiting for you to add steps or manually terminate it. When you manually terminate a job flow, the Hadoop cluster shuts down
and the job flow state is SHUTTING_DOWN. When the job flow activity is complete, the
final job flow state is TERMINATED. Creating a WAITING job
flow is useful when troubleshooting. For more information on troubleshooting, see How to Debug Job Flows with
Steps.
Any failure during the job flow process terminates the job flow and shuts down all cluster nodes.
Any data stored on a cluster node is deleted. The job flow state is marked as
FAILED.
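The life cycle described above can be sketched as a small transition table. This is an illustrative reading of the text, not an Amazon EMR API: the names TRANSITIONS and is_valid_path are hypothetical, and two details are assumptions. The sketch lets FAILED be reached directly from any active state (the text does not say whether SHUTTING_DOWN is reported first), and it lets a WAITING job flow return to RUNNING when steps are added.

```python
# Job flow states named in the text. The transition table is an assumed
# model of the described life cycle, not an official Amazon EMR definition.
TRANSITIONS = {
    "STARTING": {"BOOTSTRAPPING", "FAILED"},
    "BOOTSTRAPPING": {"RUNNING", "FAILED"},
    # RUNNING ends in WAITING (if configured), SHUTTING_DOWN, or FAILED.
    "RUNNING": {"WAITING", "SHUTTING_DOWN", "FAILED"},
    # Assumption: adding steps to a WAITING job flow moves it back to RUNNING.
    "WAITING": {"RUNNING", "SHUTTING_DOWN"},
    # COMPLETED after a normal run, TERMINATED after a manual termination.
    "SHUTTING_DOWN": {"COMPLETED", "TERMINATED"},
    "COMPLETED": set(),
    "TERMINATED": set(),
    "FAILED": set(),
}


def is_valid_path(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(
        later in TRANSITIONS.get(earlier, set())
        for earlier, later in zip(states, states[1:])
    )


successful_run = ["STARTING", "BOOTSTRAPPING", "RUNNING", "SHUTTING_DOWN", "COMPLETED"]
manual_stop = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING", "SHUTTING_DOWN", "TERMINATED"]
```

Under these assumptions, both successful_run and manual_stop are accepted by is_valid_path, while a sequence that skips BOOTSTRAPPING is rejected.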
For a complete list of job flow states, see the JobFlowExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.
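Tracking progress by state usually amounts to polling until a final state appears. The following is a minimal sketch of that loop, assuming a caller-supplied get_state function that returns the current job flow state as a string. Both get_state and wait_for_job_flow are hypothetical names for illustration, and no Amazon EMR client call is shown.

```python
import time

# Final states named in the text: no further transitions follow these.
TERMINAL_STATES = {"COMPLETED", "TERMINATED", "FAILED"}


def wait_for_job_flow(get_state, poll_seconds=30.0, sleep=time.sleep):
    """Poll get_state() until a terminal state is seen.

    Returns the list of distinct consecutive states observed, in order.
    get_state is a hypothetical callable supplied by the caller.
    """
    seen = []
    while True:
        state = get_state()
        # Record the state only when it differs from the last one observed.
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)


# Usage with a canned sequence of states standing in for a real service call.
canned = iter(["STARTING", "STARTING", "BOOTSTRAPPING", "RUNNING",
               "RUNNING", "SHUTTING_DOWN", "COMPLETED"])
observed = wait_for_job_flow(lambda: next(canned), sleep=lambda seconds: None)
```

Because WAITING is not a final state, this loop keeps polling a WAITING job flow until it is terminated or fails; a caller that wants to stop at WAITING would need to treat it as a stopping condition as well.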
You can also track the progress of job flow steps by checking their state. The following diagram shows the processing of job flow steps and how each step maps to a particular state.

A job flow contains one or more steps. Steps are processed in the order in
which they are listed in the job flow. Steps run in the following sequence: initially, all steps have their
state set to PENDING. The first step runs and its state is set to
RUNNING. When the step completes, its state changes to
COMPLETED. The next step in the queue then runs, and its state is set to
RUNNING. After each step completes, its state is set to
COMPLETED and the next step in the queue runs. Steps
run until there are no more steps, and processing then returns to the job flow.
If a step fails, the step state is FAILED and all
remaining steps with a PENDING state are marked as CANCELLED. No
further steps are run, and processing returns to the job flow.
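The step sequence and the failure rule above can be simulated in a few lines. This is a sketch only: run_steps is a hypothetical helper, each step is modeled as a plain Python callable that raises an exception on failure, and step names are assumed to be unique.

```python
def run_steps(steps):
    """Simulate step processing.

    steps is an ordered list of (name, action) pairs, where action is a
    callable that raises an exception to signal failure.
    Returns a dict mapping each step name to its final state.
    """
    # All steps start in the PENDING state.
    states = {name: "PENDING" for name, _ in steps}
    for index, (name, action) in enumerate(steps):
        states[name] = "RUNNING"
        try:
            action()
        except Exception:
            states[name] = "FAILED"
            # Every step still PENDING is cancelled; no further steps run.
            for later_name, _ in steps[index + 1:]:
                states[later_name] = "CANCELLED"
            break
        states[name] = "COMPLETED"
    return states


def succeed():
    pass


def fail():
    raise RuntimeError("simulated step failure")


result = run_steps([("extract", succeed), ("transform", fail), ("load", succeed)])
```

In this run, result maps extract to COMPLETED, transform to FAILED, and load to CANCELLED, matching the rule that a failed step cancels all steps still pending.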
Data is normally communicated from one step to the next using files stored on the cluster's Hadoop Distributed File System (HDFS). Data stored on HDFS exists only as long as the cluster is running. When the cluster is shut down, all data is deleted. The final step in a job flow typically stores the processing results in an Amazon S3 bucket.
For a complete list of step states, see the StepExecutionStatusDetail data type in the Amazon Elastic MapReduce (Amazon EMR) API Reference.