Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Monitoring Job Flow Metrics

When you’re running a job flow, you often want to track its progress and health. Amazon Elastic MapReduce (Amazon EMR) records metrics that can help you monitor your job flow. It makes these metrics available in the Amazon EMR console and in the Amazon CloudWatch console, where you can track them with your other AWS metrics. In Amazon CloudWatch you can set alarms to warn you if a metric goes outside of parameters you specify.

Metrics are updated every five minutes. This interval is not configurable. Metrics are archived for two weeks; after that period, the data is discarded.

These metrics are automatically collected and pushed to Amazon CloudWatch for every Amazon EMR job flow. There is no charge for the Amazon EMR metrics reported in Amazon CloudWatch; they are provided as part of the Amazon EMR service.

Note

Viewing Amazon EMR metrics in Amazon CloudWatch is supported only for job flows launched with AMI 2.0.3 or later and running Hadoop 0.20.205 or later. For more information about selecting the AMI version for your job flow, see Specify the Amazon EMR AMI Version.

Video Tour of Amazon EMR Metrics

The following video walks you through the metrics that Amazon EMR provides in the Amazon EMR console.

How Do I Use Amazon EMR Metrics?

The metrics reported by Amazon EMR provide information that you can analyze in different ways. The table below shows some common uses for the metrics. These are suggestions to get you started, not a comprehensive list. For the complete list of metrics reported by Amazon EMR, see Metrics Reported by Amazon EMR in Amazon CloudWatch.

How do I? Relevant Metrics

Track the progress of my job flow

Look at the RunningMapTasks, RemainingMapTasks, RunningReduceTasks, and RemainingReduceTasks metrics.

Detect job flows that are idle

The IsIdle metric indicates whether a job flow is alive but not currently running tasks. You can set an alarm to fire when the job flow has been idle for a given period of time, such as thirty minutes.

Detect when a node runs out of storage

The HDFSUtilization metric is the percentage of disk space currently used. If this rises above an acceptable level for your application, such as 80% of capacity used, you may need to resize your job flow and add more core nodes.

Accessing Metrics

There are many ways to access the metrics that Amazon EMR pushes to Amazon CloudWatch. You can view them through either the Amazon EMR console or Amazon CloudWatch console, or you can retrieve them using the Amazon CloudWatch CLI or the Amazon CloudWatch API. The following procedures show you how to access the metrics using these various tools.

To view metrics in the Amazon EMR console

  1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/.

  2. To view metrics for a job flow, click on it to display the Job Flow Details pane.

    Metrics Alarm Video Tutorial

  3. Select the Monitoring tab to view information about that job flow. This loads the pane with reports about the progress and health of the job flow.

    Monitoring Tab

To view metrics in the Amazon CloudWatch console

  1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. Click the All Metrics link in the Navigation pane.

  3. Scroll down to the metric that you want to graph. An easy way to find the Amazon EMR metrics you want is to search on the job flow identifier of the job flow to monitor.

    Metrics Alarm Video Tutorial

  4. Click a metric to display the graph.

    Metrics Alarm Video Tutorial

To access metrics from the Amazon CloudWatch CLI

To access metrics from the Amazon CloudWatch API
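As a rough illustration of an API request, the following Python sketch assembles the parameters for a CloudWatch GetMetricStatistics call for one Amazon EMR metric. It assumes the metrics are published under the AWS/ElasticMapReduce namespace with a JobFlowId dimension, and it uses the third-party boto3 library to send the request; boto3 and the helper names build_metric_request and fetch_datapoints are assumptions made for this example and are not part of this guide.

```python
from datetime import datetime, timedelta, timezone

# Assumed CloudWatch namespace for Amazon EMR metrics.
EMR_NAMESPACE = "AWS/ElasticMapReduce"
# Amazon EMR reports metrics every five minutes, so use a 300-second period.
PERIOD_SECONDS = 300


def build_metric_request(job_flow_id, metric_name, end_time, hours=1):
    """Build GetMetricStatistics parameters for one job flow metric.

    Hypothetical helper: returns a plain dict and makes no network call.
    """
    return {
        "Namespace": EMR_NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "JobFlowId", "Value": job_flow_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": PERIOD_SECONDS,
        "Statistics": ["Average"],
    }


def fetch_datapoints(request):
    """Send the request with boto3 (requires AWS credentials and network access)."""
    import boto3  # third-party; imported lazily so the builder works without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**request)
    # CloudWatch does not guarantee ordering, so sort by timestamp.
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Example: request the last hour of HDFSUtilization for a placeholder job flow ID.
example_request = build_metric_request(
    "j-XXXXXXXXXXXXX", "HDFSUtilization", datetime.now(timezone.utc)
)
```

Keeping the request builder separate from the network call makes it possible to inspect or log the parameters before anything is sent to Amazon CloudWatch.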

Setting Alarms on Metrics

Amazon EMR pushes metrics to Amazon CloudWatch, which means you can use Amazon CloudWatch to set alarms on your Amazon EMR metrics. You can, for example, configure an alarm in Amazon CloudWatch to send you an email any time the HDFS utilization rises above 80%.

The following topics give you a high-level overview of how to set alarms using Amazon CloudWatch. For detailed instructions, go to Using Amazon CloudWatch in the Amazon CloudWatch Developer Guide.

View a Video Tutorial on Setting Alarms

The following video walks you through the process of setting an alarm on an Amazon EMR metric using the Amazon CloudWatch console.

Set alarms using the Amazon CloudWatch console

  1. Sign in to the AWS Management Console and open the Amazon CloudWatch console at https://console.aws.amazon.com/cloudwatch/.

  2. Click the Create Alarm button. This launches the Create Alarm Wizard.

    Create Alarm Wizard

  3. Scroll through the Amazon EMR metrics to locate the metric you want to place an alarm on. An easy way to display just the Amazon EMR metrics in this dialog box is to search on the job flow identifier of your job flow. Select the metric to create an alarm on and click Continue.

    Create Alarm Wizard

  4. Fill in the Name, Description, Threshold, and Time values for the metric, and click Continue.

    Create Alarm Wizard

  5. Choose Alarm as the alarm state. If you want Amazon CloudWatch to send you an email when the alarm state is reached, choose either a pre-existing Amazon SNS email subscription list or Create New Email Topic. If you select Create New Email Topic, you can set the name and email addresses for a new email subscription list. This list will be saved and appear in the drop-down box for future alarms. Click Continue.

    Note

    If you use Create New Email Topic to create a new Amazon SNS topic, the email addresses must be verified before they will receive notifications. Emails are only sent when the alarm enters an alarm state. If this alarm state change happens before the email addresses are verified, they will not receive a notification.

    Create Alarm Wizard

  6. At this point the Create Alarm Wizard gives you a chance to review the alarm you’re about to create. If you need to make any changes, you can use the Edit links on the right. Click Create Alarm.

    Create Alarm Wizard

Note

For more information about how to set alarms using the Amazon CloudWatch console, go to Create an Alarm that Sends Email in the Amazon CloudWatch Developer Guide.

To set an alarm using the Amazon CloudWatch CLI

To set an alarm using the Amazon CloudWatch API
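The sketch below shows one way the HDFS utilization alarm described earlier might be expressed as a CloudWatch PutMetricAlarm request. It is an illustration under stated assumptions: the namespace AWS/ElasticMapReduce, the alarm name pattern, the choice of the Average statistic over a single five-minute period, and the helper names build_hdfs_alarm and create_alarm are all choices made for this example, and the request is sent with the third-party boto3 library.

```python
def build_hdfs_alarm(job_flow_id, threshold_percent=80.0, topic_arn=None):
    """Build PutMetricAlarm parameters for an HDFSUtilization alarm.

    Hypothetical helper: returns a plain dict and makes no network call.
    """
    params = {
        # The alarm name pattern is an arbitrary choice for this example.
        "AlarmName": "emr-%s-hdfs-utilization" % job_flow_id,
        "AlarmDescription": "HDFS utilization above %.0f%%" % threshold_percent,
        "Namespace": "AWS/ElasticMapReduce",  # assumed namespace for EMR metrics
        "MetricName": "HDFSUtilization",
        "Dimensions": [{"Name": "JobFlowId", "Value": job_flow_id}],
        "Statistic": "Average",
        "Period": 300,  # matches the five-minute reporting interval
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if topic_arn is not None:
        # Amazon SNS topic to notify when the alarm enters the ALARM state.
        params["AlarmActions"] = [topic_arn]
    return params


def create_alarm(params):
    """Send the request with boto3 (requires AWS credentials and network access)."""
    import boto3  # third-party; imported lazily so the builder works without it

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

To receive email when the alarm fires, pass the ARN of an Amazon SNS topic whose email subscriptions have already been verified as topic_arn.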

Metrics Reported by Amazon EMR in Amazon CloudWatch

The following table lists all of the metrics that Amazon EMR reports in the Amazon EMR console and pushes to Amazon CloudWatch.

Amazon EMR Metrics

Amazon EMR sends data for several metrics to Amazon CloudWatch. All Amazon EMR job flows automatically send metrics in five-minute intervals. Metrics are archived for two weeks; after that period, the data is discarded.

Note

Amazon EMR pulls metrics from a job flow. If a job flow becomes unreachable, no metrics will be reported until the job flow becomes available again.

Metric Description

CoreNodesPending

The number of core nodes waiting to be assigned. Not all of the requested core nodes may be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor job flow health

Units: Count

CoreNodesRunning

The number of core nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor job flow health

Units: Count

HDFSBytesRead

The number of bytes read from HDFS.

Use Case: Analyze job flow performance, Monitor job flow progress

Units: Count

HDFSBytesWritten

The number of bytes written to HDFS.

Use Case: Analyze job flow performance, Monitor job flow progress

Units: Count

HDFSUtilization

The percentage of HDFS storage currently used.

Use Case: Analyze job flow performance

Units: Percent

IsIdle

Indicates that a job flow is no longer performing work, but is still alive and accruing charges. It is set to 1 if no tasks are running and no jobs are running, and set to 0 otherwise. This value is checked at five-minute intervals; a value of 1 indicates only that the job flow was idle when checked, not that it was idle for the entire five minutes. To avoid false positives, raise an alarm only when this value has been 1 for more than one consecutive five-minute check. For example, you might raise an alarm on this value if it has been 1 for thirty minutes or longer.

Use Case: Monitor job flow performance

Units: Count
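The consecutive-check rule described for IsIdle can be sketched in a few lines of Python. This is a minimal illustration rather than part of Amazon EMR or Amazon CloudWatch: the function name idle_long_enough is hypothetical, and it assumes you have already retrieved the IsIdle values in chronological order, one value per five-minute check.

```python
def idle_long_enough(is_idle_values, idle_minutes=30, check_interval_minutes=5):
    """Return True if the most recent checks cover idle_minutes of idleness.

    is_idle_values: IsIdle samples (1 or 0) in chronological order,
    one per five-minute check.
    """
    # Number of consecutive checks needed, rounded up (30 / 5 = 6).
    required = -(-idle_minutes // check_interval_minutes)
    if len(is_idle_values) < required:
        return False  # not enough history to decide
    return all(value == 1 for value in is_idle_values[-required:])
```

With the defaults, the function requires the six most recent checks to report 1. In a CloudWatch alarm, the comparable setting would be a 300-second period evaluated over six periods.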

JobsFailed

The number of jobs in the job flow that have failed.

Use Case: Monitor job flow health

Units: Count

JobsRunning

The number of jobs in the job flow that are currently running.

Use Case: Monitor job flow health

Units: Count

LiveDataNodes

The percentage of data nodes that are receiving work from Hadoop.

Use Case: Monitor job flow health

Units: Percent

LiveTaskTrackers

The percentage of task trackers that are functional.

Use Case: Monitor job flow health

Units: Percent

MapSlotsOpen

The unused map task capacity. This is calculated as the maximum number of map tasks for a given job flow, less the total number of map tasks currently running in that job flow.

Use Case: Analyze job flow performance

Units: Count

MissingBlocks

The number of blocks for which HDFS has no replicas. These might be corrupt blocks.

Use Case: Monitor job flow health

Units: Count

ReduceSlotsOpen

The unused reduce task capacity. This is calculated as the maximum reduce task capacity for a given job flow, less the number of reduce tasks currently running in that job flow.

Use Case: Analyze job flow performance

Units: Count

RemainingMapTasks

The number of remaining map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.

Use Case: Monitor job flow progress

Units: Count

RemainingMapTasksPerSlot

The ratio of the total map tasks remaining to the total map slots available in the cluster.

Use Case: Analyze job flow performance

Units: Ratio
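As a worked example of this ratio, the following Python sketch computes it from two counts and derives a rough number of map "waves" still to run. It assumes that "map slots available" means the total map slot capacity of the cluster; the function names and the wave estimate are illustrative interpretations, not values reported by Amazon EMR.

```python
import math


def remaining_map_tasks_per_slot(remaining_map_tasks, total_map_slots):
    """Ratio of map tasks still to run to the cluster's map slots."""
    if total_map_slots <= 0:
        raise ValueError("total_map_slots must be positive")
    return remaining_map_tasks / total_map_slots


def estimated_map_waves(ratio):
    """Rough number of rounds of map tasks left if every slot stays busy."""
    return math.ceil(ratio)
```

For instance, 120 remaining map tasks on a cluster with 40 map slots gives a ratio of 3.0, which suggests about three more rounds of map tasks at the current capacity.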

RemainingReduceTasks

The number of remaining reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.

Use Case: Monitor job flow progress

Units: Count

RunningMapTasks

The number of running map tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.

Use Case: Monitor job flow progress

Units: Count

RunningReduceTasks

The number of running reduce tasks for each job. If you have a scheduler installed and multiple jobs running, multiple graphs will be generated.

Use Case: Monitor job flow progress

Units: Count

S3BytesRead

The number of bytes read from Amazon S3.

Use Case: Analyze job flow performance, Monitor job flow progress

Units: Count

S3BytesWritten

The number of bytes written to Amazon S3.

Use Case: Analyze job flow performance, Monitor job flow progress

Units: Count

TaskNodesPending

The number of task nodes waiting to be assigned. Not all of the requested task nodes may be immediately available; this metric reports the pending requests. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor job flow health

Units: Count

TaskNodesRunning

The number of task nodes working. Data points for this metric are reported only when a corresponding instance group exists.

Use Case: Monitor job flow health

Units: Count

TotalLoad

The total number of concurrent data transfers.

Use Case: Monitor job flow health

Units: Count

Dimensions for Amazon EMR Metrics

Amazon EMR data can be filtered using any of the dimensions in the following table.

Dimension Description
JobFlowId The identifier for a job flow. You can find this value by clicking on the job flow in the Amazon EMR console. It takes the form j-XXXXXXXXXXXXX.
JobId The identifier of a job within a job flow. You can use this to filter the metrics returned from a job flow down to those that apply to a single job within the job flow. JobId takes the form job_XXXXXXXXXXXX_XXXX.