Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Things to Check When Your Amazon EMR Job Flow Fails

There are many reasons why a job flow might fail. The following lists the most common issues and how you can fix them.

Does your path to Amazon Simple Storage Service (Amazon S3) have at least three slashes?

When you specify an Amazon S3 bucket, you must include a terminating slash at the end of the URL. For example, instead of referencing a bucket as “s3n://myawsbucket”, use “s3n://myawsbucket/”; otherwise, Hadoop fails your job flow in most cases.
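The check described above can be scripted before a job flow is launched. The following is a minimal sketch, assuming a POSIX shell; ensure_trailing_slash is a hypothetical helper name and is not part of any Amazon EMR tool.

```shell
# Append a trailing slash when an S3 URI has fewer than three slashes
# (for example, a bare bucket reference such as s3n://myawsbucket).
# ensure_trailing_slash is a hypothetical helper used only for illustration.
ensure_trailing_slash() {
    uri=$1
    # Delete every character that is not a slash, then count what is left.
    slashes=$(printf '%s' "$uri" | tr -cd '/')
    if [ "${#slashes}" -lt 3 ]; then
        uri="$uri/"
    fi
    printf '%s\n' "$uri"
}

ensure_trailing_slash "s3n://myawsbucket"
ensure_trailing_slash "s3n://myawsbucket/input"
```

A URI that already contains three or more slashes is passed through unchanged.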

Are you trying to recursively traverse input directories?

Hadoop does not recursively search input directories for files. If you have a directory structure such as /corpus/01/01.txt, /corpus/01/02.txt, /corpus/02/01.txt, and so on, and you specify /corpus/ as the input parameter to your job flow, Hadoop will not find any input files, because the /corpus/ directory itself contains no files and Hadoop does not check the contents of its subdirectories. Similarly, Hadoop does not recursively check the subdirectories of Amazon S3 buckets.

The input files must be directly in the input directory or Amazon S3 bucket you specify, not in subdirectories.
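To illustrate the rule on a local copy of your data, the following sketch reports which directories contain files at their top level. It assumes a find command that supports -maxdepth (GNU and BSD find do); has_direct_files is a hypothetical helper, and the check runs against the local file system, not Amazon S3.

```shell
# has_direct_files is a hypothetical helper: it succeeds only if the given
# local directory contains at least one regular file at its top level.
has_direct_files() {
    # -maxdepth 1 keeps find from descending into subdirectories.
    [ -n "$(find "$1" -maxdepth 1 -type f | head -n 1)" ]
}

# Build a throwaway copy of the /corpus/ layout described above.
demo=$(mktemp -d)
mkdir -p "$demo/corpus/01" "$demo/corpus/02"
touch "$demo/corpus/01/01.txt" "$demo/corpus/01/02.txt" "$demo/corpus/02/01.txt"

for dir in "$demo/corpus" "$demo/corpus/01" "$demo/corpus/02"; do
    if has_direct_files "$dir"; then
        echo "usable as input: ${dir#"$demo"}"
    else
        echo "no files directly inside: ${dir#"$demo"}"
    fi
done

rm -rf "$demo"
```

In this layout only /corpus/01 and /corpus/02 qualify as input directories; /corpus itself does not.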

Does your output directory already exist?

If you specify an output path that already exists, Hadoop will fail the job flow in most cases. This means that if you run a job flow once and then run it again with exactly the same parameters, it will likely work the first time and fail on every later run, because the output path created by the first run already exists.
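One way to avoid reusing an output path is to generate a new one for each run. The following is a minimal sketch, assuming a UTC timestamp is unique enough for your workflow; unique_output_path is a hypothetical helper and the bucket name is a placeholder.

```shell
# unique_output_path is a hypothetical helper: it appends a UTC timestamp
# to a base path so that each run writes to a path that does not exist yet.
unique_output_path() {
    base=${1%/}   # drop one trailing slash, if present
    printf '%s/%s/\n' "$base" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# s3n://myawsbucket/output is a placeholder location.
unique_output_path "s3n://myawsbucket/output"
```

Two runs started within the same second would still collide, so add a further suffix if you launch job flows in quick succession.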

Are you trying to specify a resource using an HTTP URL?

Hadoop does not accept resource locations specified using the http:// prefix. For example, passing in http://mysite/myjar.jar as the JAR parameter will cause the job flow to fail. For more information about how to reference files in Amazon Elastic MapReduce (Amazon EMR), go to File System Configuration.
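A launch script can reject such locations before the job flow is created. The following is a minimal sketch; is_http_url is a hypothetical helper, and treating https:// the same way as http:// is an assumption that goes beyond what this page states.

```shell
# is_http_url is a hypothetical helper: it succeeds when the location
# starts with http://. Matching https:// as well is an assumption.
is_http_url() {
    case $1 in
        http://*|https://*) return 0 ;;
        *) return 1 ;;
    esac
}

jar="http://mysite/myjar.jar"
if is_http_url "$jar"; then
    echo "not usable as a JAR location: $jar"
fi
```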

Are you referencing an Amazon S3 bucket using an invalid name format?

If you attempt to use a bucket name such as “myawsbucket.1”, your job flow will fail because Amazon EMR requires that bucket names be valid RFC 2396 host names, and thus the name cannot end with a number. In addition, because of the requirements of Hadoop, Amazon S3 bucket names used with Amazon EMR must contain only lowercase letters, numbers, periods (.), and hyphens (-). For more information about how to format Amazon S3 bucket names, go to Bucket Restrictions and Limitations in the Amazon Simple Storage Service Developer Guide.
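The two rules stated above can be checked with a short function. This is a sketch of those two rules only, not a full RFC 2396 or Amazon S3 name validator; is_valid_emr_bucket_name is a hypothetical helper.

```shell
# is_valid_emr_bucket_name is a hypothetical helper that checks only the
# two rules stated in the text: allowed characters, and no trailing digit.
is_valid_emr_bucket_name() {
    case $1 in
        '')
            return 1 ;;   # empty name
        *[!abcdefghijklmnopqrstuvwxyz0123456789.-]*)
            return 1 ;;   # contains a character outside the allowed set
        *[0123456789])
            return 1 ;;   # ends with a number
    esac
    return 0
}

for name in myawsbucket myawsbucket.1 MyAwsBucket my_bucket; do
    if is_valid_emr_bucket_name "$name"; then
        echo "ok:       $name"
    else
        echo "rejected: $name"
    fi
done
```

The character set is spelled out rather than written as a range so that the result does not depend on the shell's locale.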

Are you passing in invalid streaming arguments?

Hadoop streaming supports only the following arguments. If you pass in arguments other than those listed below, the job flow will fail.

-blockAutoGenerateCacheFiles 
-cacheArchive 
-cacheFile 
-cmdenv 
-combiner 
-debug 
-input 
-inputformat
-inputreader 
-jobconf 
-mapper
-numReduceTasks
-output 
-outputformat 
-partitioner
-reducer
-verbose
			

In addition, Hadoop streaming only recognizes arguments passed in using Java syntax, that is, preceded by a single hyphen. If you pass in arguments preceded by a double hyphen, the job flow will fail.
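Both rules can be checked before a streaming job flow is submitted. The following is a minimal sketch that validates option names only (not their values) against the list above; check_streaming_arg and SUPPORTED_STREAMING_ARGS are hypothetical names.

```shell
# Option names copied from the list above.
SUPPORTED_STREAMING_ARGS="-blockAutoGenerateCacheFiles -cacheArchive -cacheFile
-cmdenv -combiner -debug -input -inputformat -inputreader -jobconf -mapper
-numReduceTasks -output -outputformat -partitioner -reducer -verbose"

# check_streaming_arg is a hypothetical helper: it succeeds only for an
# option name that is in the list and is written with a single hyphen.
check_streaming_arg() {
    case $1 in
        --*)
            echo "double hyphen is not accepted: $1" >&2
            return 1 ;;
    esac
    for known in $SUPPORTED_STREAMING_ARGS; do
        if [ "$1" = "$known" ]; then
            return 0
        fi
    done
    echo "unsupported streaming argument: $1" >&2
    return 1
}

for arg in -mapper --mapper -files; do
    if check_streaming_arg "$arg" 2>/dev/null; then
        echo "accepted: $arg"
    else
        echo "rejected: $arg"
    fi
done
```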

Are you passing the correct credentials into SSH?

If you are unable to SSH into the master node, most likely it is an issue with your security credentials.

First check that the .pem file containing your SSH key has the proper permissions. You can use chmod to change the permissions on your .pem file as is shown in the following example, where you would replace mykey.pem with the name of your own .pem file.

chmod og-rwx mykey.pem
			

The second possibility is that you are not using the key pair you specified when you created the job flow. This is an easy mistake to make if you have created multiple key pairs. Check the job flow details in the Amazon EMR console (or use --describe from the CLI) for the name of the key pair that was specified when the job flow was created.

Once you have verified that you are using the correct key pair and that permissions are set correctly on the .pem file, you can use the following command to SSH into the master node, where you would replace mykey.pem with the name of your .pem file and hadoop@ec2-01-001-001-1.compute-1.amazonaws.com with the public DNS name of the master node (available through --describe in the CLI or through the Amazon EMR console).

ssh -i mykey.pem hadoop@ec2-01-001-001-1.compute-1.amazonaws.com				
			

For step-by-step instructions on how to set up your credentials and SSH into the master node, go to How to View Logs Using SSH.

Are you using a custom jar when running DistCp?

You cannot run DistCp by specifying a JAR residing on the AMI. Instead, you should use the samples/distcp/distcp.jar file in the elasticmapreduce Amazon S3 bucket. The following example shows how to call the Amazon Elastic MapReduce (Amazon EMR) version of DistCp. Replace j-ABABABABABAB with the identifier of your job flow.

Note

DistCp is deprecated on Amazon EMR. We recommend that you use S3DistCp instead. For more information about S3DistCp, see Distributed Copy Using S3DistCp.

elastic-mapreduce --jobflow j-ABABABABABAB \
   --jar s3n://elasticmapreduce/samples/distcp/distcp.jar \
   --arg s3n://elasticmapreduce/samples/wordcount/input \
   --arg hdfs:///samples/wordcount/input					
			

If you are using IAM, do you have the proper Amazon EC2 policies set?

Because Amazon EMR uses EC2 instances as nodes, IAM users of Amazon EMR also need to have certain Amazon EC2 policies set in order for Amazon EMR to be able to manage those instances on the IAM user’s behalf. If you do not have the required permissions set, Amazon EMR returns the error: “User account is not authorized to call EC2.”

For a list of the Amazon EC2 policies your IAM account needs to have set in order to run EMR, go to Example Policies for Amazon EMR.

Do you have enough HDFS space for your job flow?

If you do not, Amazon EMR will return the following error: “Cannot replicate block, only managed to replicate to zero nodes.” This error occurs when you generate more data in your job flow than can be stored in HDFS. You will see this error only while the job flow is running, because when the job flow ends, the HDFS space it was using is released.

The amount of HDFS space available to a job flow depends on the number and type of EC2 instances that are used as core nodes. All of the disk space on each EC2 instance is available to be used by HDFS. For more information on the amount of local storage for each EC2 instance type, go to Instance Types and Families in the Amazon Elastic Compute Cloud User Guide.

The other factor that can affect the amount of HDFS space available is the replication factor, which is the number of copies of each data block that are stored in HDFS for redundancy. The replication factor increases with the number of nodes in the job flow: there are 3 copies of each data block for a job flow with 10 or more nodes, 2 copies of each block for a job flow with 4 to 9 nodes, and 1 copy (no redundancy) for job flows with 3 or fewer nodes. The total HDFS space available is divided by the replication factor. In some cases, such as increasing the number of nodes from 9 to 10, the increase in replication factor can actually cause the amount of available HDFS space to decrease.

For example, a job flow with ten core nodes of type m1.large would have about 2833 GB of space available to HDFS: (10 nodes × 850 GB per node) / replication factor of 3.
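The arithmetic above can be reproduced for other cluster sizes. The following is a minimal sketch using the replication thresholds stated earlier (3 copies for 10 or more nodes, 2 for 4 to 9, 1 for 3 or fewer) and integer division; the function names are hypothetical, and the per-node storage figure must be supplied for your instance type.

```shell
# replication_factor and hdfs_capacity_gb are hypothetical helpers that
# encode the thresholds described in the text.
replication_factor() {
    if [ "$1" -ge 10 ]; then
        echo 3
    elif [ "$1" -ge 4 ]; then
        echo 2
    else
        echo 1
    fi
}

# Usage: hdfs_capacity_gb NODES GB_PER_NODE
hdfs_capacity_gb() {
    nodes=$1
    gb_per_node=$2
    factor=$(replication_factor "$nodes")
    # Shell arithmetic is integer-only, so the result is rounded down.
    echo $(( nodes * gb_per_node / factor ))
}

hdfs_capacity_gb 10 850   # ten nodes at 850 GB each, replication factor 3
hdfs_capacity_gb 9 850    # nine nodes at 850 GB each, replication factor 2
```

With these inputs the sketch prints 2833 and then 3825, which shows how moving from 9 to 10 nodes can reduce the available HDFS space.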

If your job flow exceeds the amount of space available to HDFS, you can add additional core nodes to your cluster or use data compression to create more HDFS space. If your job flow is one that can be stopped and restarted, you may consider using core nodes of a larger Amazon EC2 instance type. You might also consider adjusting the replication factor. Be aware, though, that decreasing the replication factor reduces the redundancy of HDFS data and your job flow's ability to recover from lost or corrupted HDFS blocks.

Have you checked the log files for clues?

Log files can provide insight into why a job flow failed. There are two options for viewing the log files: directly on the EC2 instances while the job flow is running, or from an Amazon S3 bucket if you launch the job flow with log-file archiving enabled.

While the job flow is running, each EC2 instance (or node) gathers Hadoop and Amazon EMR logs in the /mnt/var/log directory. The Hadoop logs include the daemon logs and the task and attempt logs. Amazon EMR logs include step logs, bootstrap action logs, and more. Because these log files are stored on EC2 instances, which are terminated when the job flow ends, you can access them only while the job flow is running. This can be problematic if your job flow terminates suddenly and you need the log file information to analyze what went wrong.

To make sure that log-file information is available even after the job flow ends, you can instruct Amazon EMR to periodically copy the log files from /mnt/var/log to an Amazon S3 bucket. This incurs Amazon S3 storage fees, but can be a great help in figuring out why a job flow failed. It is a best practice to turn on Amazon S3 log-file archiving during Hadoop application development and testing, and to archive logs for a representative portion of your production job flows.

If you want Amazon EMR to archive logs to Amazon S3, you must turn this on when you launch the job flow. Log archiving is not enabled by default and cannot be added to a job flow that is already running.

Note that there is a delay between when the node writes a log file and when Amazon EMR propagates it to the Amazon S3 bucket. This means it may take up to five minutes after a job flow ends for Amazon EMR to push the final logs to the Amazon S3 bucket.

For information about how to archive job flow logs to Amazon S3, go to How to Enable Amazon S3 Log Archiving on a Job Flow. To learn more about how to view and interpret the log files, go to How to Use Log Files in Amazon EMR.

Have your job flows finished terminating?

Depending on the configuration of the job flow, it may take 5 to 20 minutes for the job flow to terminate and release allocated resources, such as Amazon EC2 instances. If you get an EC2 QUOTA EXCEEDED error when you attempt to launch a job flow, resources from a recently terminated job flow may not yet have been released. In this case, you can either request that your Amazon EC2 quota be increased, or wait twenty minutes and relaunch the job flow.
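The wait-and-relaunch approach can be scripted. The following is a minimal sketch of a generic retry loop; retry_with_wait is a hypothetical helper, and the launch command in the comment is a placeholder for whatever you use to start your job flow. The loop retries on any failure and does not inspect the error, so it cannot tell a quota error from other problems.

```shell
# retry_with_wait is a hypothetical helper.
# Usage: retry_with_wait ATTEMPTS WAIT_SECONDS COMMAND [ARGS...]
retry_with_wait() {
    attempts=$1
    wait_seconds=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only if another attempt is still to come.
        if [ "$n" -lt "$attempts" ]; then
            echo "attempt $n failed; waiting ${wait_seconds}s before retrying" >&2
            sleep "$wait_seconds"
        fi
        n=$((n + 1))
    done
    return 1
}

# Example (placeholder script name): try up to 3 times, 20 minutes apart.
# retry_with_wait 3 1200 ./launch-my-job-flow.sh

# Harmless demonstration that finishes immediately.
retry_with_wait 3 0 true && echo "launched"
```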