Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Troubleshooting Tips

Use the Amazon EMR console to access the log files at the different step and Hadoop job execution levels. You can use these logs to debug your applications.

Before you can use the debugging functionality in the console, you must enable debugging when you create a job flow. For more information, see How to Enable Logging and Debugging.

The following procedure shows you how to debug a job flow using the Amazon EMR console.

To debug a failed job flow

  1. In the Amazon EMR console, click the check box next to the failed job flow you want to debug and click Debug.

    Note

    By default, the list is sorted alphabetically by the Name column. To sort the results based on another column, click the column title once (for ascending) or twice (for descending order).

    The Steps pane displays the steps in the selected job flow.

    Each row provides links pointing to Hadoop logs generated as part of each step. If the links are labeled (log not uploaded yet), click Refresh List.

  2. Click one of the following links in the Log Files column in the row marked FAILED:

    • controller—Contains files generated by Amazon Elastic MapReduce (Amazon EMR) that arise from errors encountered while trying to run your step

      If your step fails while loading, you can find the stack trace in this log. Errors loading or accessing your application are often described here. Missing mapper file errors are often described here.

    • stderr—Contains your step's standard error messages

      Application loading errors are often described here. This log sometimes contains a stack trace.

    • stdout—Contains status generated by your mapper and reducer executables

      Application loading errors are often described here. This log sometimes contains application error messages.

    • syslog—Contains logs from non-Amazon software, such as Apache and Hadoop

      Streaming errors are often described here.

  3. If you can't resolve the problem by looking at these log files, click View All Tasks for All Jobs.

    This action skips over the Jobs pane, which does not provide links to log files.

    The Tasks pane displays the Hadoop tasks in the jobs.

    The time elapsed during a task is a good indication of trouble: the longer the elapsed time, the greater the likelihood of a problem.

    • To easily see the time elapsed in a task, click the Elapsed Time column title to sort the results by elapsed time.

  4. On the Tasks pane, click View Attempts for the task that failed.

    The Task Attempts pane displays the task attempts in the selected task.

  5. On the Task Attempts pane, click one or more of the links in the Log Files column for the task attempt that failed:

    • stderr—Contains task attempt error messages

    • stdout—Contains task attempt output logs

    • syslog—Contains logs generated by Hadoop.

Troubleshooting Job Flows

This section describes how you can troubleshoot your job flows using the log files produced by Hadoop and Amazon EMR.

How to Debug Job Flows with No Steps

Amazon EMR allows you to create a job flow containing no steps. The effect is to create a Hadoop cluster that then waits without processing anything. You can add steps later using AddJobFlowSteps. As soon as you issue that request, Amazon EMR continues the job flow and you can see whether the step completed successfully.

To develop and debug a job flow starting without steps

  1. In a RunJobFlow request, set KeepJobFlowAliveWhenNoSteps to true and ActionOnFailure to CANCEL_AND_WAIT.

    CANCEL_AND_WAIT stops job flow execution but does not terminate the Hadoop cluster. The default value, TERMINATE_JOB_FLOW, stops the job flow and terminates the cluster. CANCEL_AND_WAIT enables you to revise your JAR files or add steps and retry the job flow without incurring the expense of downloading the data from Amazon S3 to Amazon EC2 again.

  2. Send the RunJobFlow request.

  3. If you want to inspect the Hadoop system, use SSH to connect to the master node as the hadoop user.

    ssh -i [keyfile] hadoop@[EC2_master_node_DNS]

  4. In an AddJobFlowSteps request, set ActionOnFailure to CANCEL_AND_WAIT.

  5. Send the AddJobFlowSteps request.

  6. Inspect the log files using a tool like Amazon S3 Organizer to see if there were errors.

Using this procedure, you can work on a step to make sure it completes successfully before adding the next step. For more information about adding steps, go to Add Steps to a Job Flow.

When you are ready for production, set KeepJobFlowAliveWhenNoSteps to false and ActionOnFailure to TERMINATE_JOB_FLOW.

This value automatically terminates the Hadoop cluster after running the job flow.

Note

When you use the console to run a job flow, the value of ActionOnFailure is always CONTINUE.
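
The development and production settings described in this section can be summarized in a short sketch. This is only an illustration of the values named above: the helper name debug_settings is hypothetical, and the sketch does not show how the values are placed in an actual RunJobFlow request.

```python
def debug_settings(production=False):
    """Return the two settings discussed in this section.

    Hypothetical helper: it only pairs the values named in the text and
    does not build or send a RunJobFlow request.
    """
    if production:
        # Terminate the cluster automatically when the job flow ends or fails.
        return {
            "KeepJobFlowAliveWhenNoSteps": False,
            "ActionOnFailure": "TERMINATE_JOB_FLOW",
        }
    # Keep the cluster alive so that failed steps can be revised and retried.
    return {
        "KeepJobFlowAliveWhenNoSteps": True,
        "ActionOnFailure": "CANCEL_AND_WAIT",
    }
```

Calling debug_settings() returns the development values, and debug_settings(production=True) returns the production values.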

How to Debug Job Flows with Steps

You can use a similar approach to debug a job flow that already contains steps.

To develop and debug a job flow with steps

  1. In a RunJobFlow request, set ActionOnFailure to CANCEL_AND_WAIT.

    This value stops job flow execution but does not terminate the Hadoop cluster. The default value, TERMINATE_JOB_FLOW, stops the job flow and terminates the cluster. CANCEL_AND_WAIT enables you to revise your JAR files or add steps and retry the job flow without incurring the expense of downloading the data from Amazon S3 to Amazon EC2 again.

  2. Send the RunJobFlow request.

  3. Inspect the log files using a tool like Amazon S3 Organizer to see if there were errors.

  4. Change the step that caused the error and resubmit it using AddJobFlowSteps. In the request, set ActionOnFailure to CANCEL_AND_WAIT.
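
The corrected step in the last item of the procedure could be described with a small data structure before it is resubmitted. The following is a minimal sketch, not the service's client code: the helper build_step, the step name, and the bucket paths are hypothetical, and the field names (Name, HadoopJarStep, Jar, Args) are assumed rather than taken from this page.

```python
# ActionOnFailure values mentioned in this guide.
VALID_ACTIONS = ("TERMINATE_JOB_FLOW", "CANCEL_AND_WAIT", "CONTINUE")


def build_step(name, jar, args=(), action_on_failure="CANCEL_AND_WAIT"):
    """Describe one step for an AddJobFlowSteps request (hypothetical helper).

    Defaults to CANCEL_AND_WAIT so that a failure leaves the cluster running.
    """
    if action_on_failure not in VALID_ACTIONS:
        raise ValueError(f"unknown ActionOnFailure: {action_on_failure!r}")
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }


# Example values only; the bucket and JAR names are placeholders.
step = build_step(
    "retry-wordcount",
    "s3://example-bucket/wordcount.jar",
    ["s3://example-bucket/input", "s3://example-bucket/output"],
)
```

Validating the value before sending the request catches a mistyped ActionOnFailure locally instead of after the request is rejected.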

Troubleshooting Task Attempts

If your JAR file successfully started or you created a streaming job, the next place to look for failures is in the task attempts. The Map and Reduce functions you wrote execute in the context of a task. Tasks can execute multiple times as "task attempts" because of failures or speculative execution. Amazon EMR uploads task attempt logs into task-attempts/.

If one of the tasks failed, you can look at the task logs to determine what happened. These files are also available on the nodes under /mnt/var/log/hadoop/userlogs/. Because you would have to look through the log files on each node in the cluster, however, debugging this way is difficult.

Task-attempt log files are similar in format to the step log files.
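
Because a task can run several times, it can help to group the uploaded logs by task and count the attempts. The sketch below assumes that each log path contains a Hadoop attempt ID of the form attempt_<timestamp>_<job>_<m|r>_<task>_<attempt>; the exact layout under task-attempts/ is an assumption, and the function name group_attempts is hypothetical.

```python
import re
from collections import defaultdict

# Assumed attempt ID layout: attempt_<timestamp>_<job>_<m|r>_<task>_<attempt>
ATTEMPT_RE = re.compile(r"attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)")


def group_attempts(paths):
    """Map each task ID to the sorted attempt numbers found in the paths."""
    tasks = defaultdict(set)
    for path in paths:
        match = ATTEMPT_RE.search(path)
        if match is None:
            continue  # not a task-attempt log
        timestamp, job, kind, task, attempt = match.groups()
        task_id = f"task_{timestamp}_{job}_{kind}_{task}"
        tasks[task_id].add(int(attempt))
    return {task_id: sorted(numbers) for task_id, numbers in tasks.items()}
```

A task that maps to more than one attempt number was retried or speculatively executed, which makes it a reasonable place to start reading stderr and syslog.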

Checking Hadoop Failures

In rare cases, Hadoop itself might fail. To see if that is the case, you must look at the Hadoop daemon logs.

To view the daemon log files

  • Look under /mnt/var/log/hadoop/ on each node or under daemons/<instance id>/ on Amazon S3.

Note

Not all cluster nodes run all daemons.
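
When you check the daemon logs on a node, a small script can narrow down where to read. The following sketch walks a log directory and reports lines that contain ERROR or FATAL. It assumes the daemons log at those levels in plain text; the function name find_error_lines and the marker strings are assumptions, not part of Amazon EMR.

```python
import os


def find_error_lines(log_dir, markers=("ERROR", "FATAL")):
    """Yield (path, line_number, line) for each log line containing a marker."""
    for root, _dirs, files in os.walk(log_dir):
        for filename in sorted(files):
            path = os.path.join(root, filename)
            try:
                # errors="replace" keeps the scan going on non-text bytes.
                with open(path, errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if any(marker in line for marker in markers):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # skip files that cannot be opened
```

On a cluster node you would pass /mnt/var/log/hadoop/ as log_dir; for logs copied from Amazon S3, pass the local download directory instead.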

When developing your application, we recommend that you enable both types of debugging (step level and Hadoop job level) and run a small but representative subset of your data to make sure your application works. To enable step-level debugging, select Yes for Enable Debugging and enter an Amazon S3 bucket URI in the Amazon S3 Log Path field.

Troubleshooting Cluster Nodes and Instance Groups

When a node fails to come up, Amazon EMR stops attempting to contact the node and puts the associated instance group into a failed state. After some time, the failed node causes the instance group to change to an ARRESTED state.

A node could fail to come up if:

  • Hadoop or the cluster is somehow broken and does not accept a new node into the cluster

  • A bootstrap action fails on the new node

  • The node is not functioning correctly and fails to check in with Hadoop

If an instance group is in the ARRESTED state and the job flow is in a WAITING state, you can add a job flow step to reset the desired number of slave nodes. Adding the step resumes processing of the job flow and puts the instance group back into a RUNNING state.

For details on how to reset a job flow in an arrested state, refer to Arrested State.

Commonly Logged Errors and Details

The following sections describe common errors for each job flow type.

Custom JAR Common Errors

The following table describes common errors for custom JAR job flows.

Error | Where to Look
General | You can usually find the cause of a custom JAR error in the syslog file. Link to it from the Steps pane. If you can't determine the problem there, check the Hadoop task attempt error message, which you link to from the Task Attempts pane.
JAR throws exception before creating a job | If the main program of your custom JAR throws an exception while creating the Hadoop job, the best place to look is the syslog file. Link to it from the Steps pane.
JAR throws an error inside a map task | If your custom JAR and mapper throw an exception while processing input data, the best place to look is the syslog file. Link to it on the Task Attempts pane.

Hive and Pig Common Errors

The following table describes common errors for Hive or Pig job flows.

Error | Where to Look
General | You can usually find the cause of a Hive or Pig error in the syslog file, which you link to from the Steps pane. If you can't determine the problem there, check the Hadoop task attempt error message. Link to it on the Task Attempts pane.
Syntax or semantic error in the Hive script | If a step fails, look at the stdout file (which you link to from the Steps pane) of the step that ran the Hive script. If the error is not there, look in the syslog file. Link to it on the Task Attempts pane.
Job fails when running interactively | If you are running Hive interactively on the master node and the job flow failed, look at the syslog file. Link to it on the Task Attempts pane for the task in the interactive step that failed.

Streaming Common Errors

The following table describes common errors for streaming job flows.

Error | Where to Look
General | You can usually find the cause of a streaming error in a syslog file. Link to it on the Steps pane.
Data sent to the mapper in the wrong format | You can find the error message in the syslog file of a failed task attempt. Link to it on the Task Attempts pane.
Misconfigured time limit | Your mapper or reducer script does not produce output within the configured time limit (600 seconds, by default). Find the error in the syslog of the failed task attempt. You can change the time limit by passing an extra argument: -jobconf mapred.task.timeout=800000. This value is the number of milliseconds before Amazon EMR terminates a task if it does not read input, write output, or update its status string.
Exit with error | Your mapper or reducer script exits with an error. Find the error in the stderr file of the failed task attempt.
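
The time-limit entry above can be illustrated with a short sketch. The first function converts a limit in seconds into the extra streaming argument, whose value is in milliseconds. The second is a streaming mapper that periodically updates its status string so that a slow but healthy task is not terminated. The function names and the record format are hypothetical, and the reporter:status: line written to standard error is a Hadoop Streaming convention that is assumed to be supported by your Hadoop version.

```python
import sys


def timeout_args(seconds):
    """Build the extra streaming arguments for a task timeout given in seconds."""
    # mapred.task.timeout is expressed in milliseconds.
    return ["-jobconf", f"mapred.task.timeout={int(seconds * 1000)}"]


def run_mapper(lines, out=sys.stdout, err=sys.stderr, report_every=1000):
    """Emit one "key<TAB>1" record per input line and report status periodically."""
    for count, line in enumerate(lines, start=1):
        key = line.rstrip("\n")
        out.write(f"{key}\t1\n")
        if count % report_every == 0:
            # Assumed convention: Hadoop Streaming reads this as a status update.
            err.write(f"reporter:status:processed {count} records\n")
            err.flush()
```

In a real mapper script you would call run_mapper(sys.stdin). Raising the timeout hides a stalled script, so updating the status is usually the better fix when the work is progressing but slow.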