Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Optimizing Performance for Amazon EMR Operations in Amazon DynamoDB

Amazon Elastic MapReduce (Amazon EMR) operations on an Amazon DynamoDB table count as read operations, and are subject to the table's provisioned throughput settings. Amazon EMR implements its own logic to try to balance the load on your Amazon DynamoDB table to minimize the possibility of exceeding your provisioned throughput. At the end of each Hive query, Amazon EMR returns information about the job flow used to process the query, including how many times your provisioned throughput was exceeded. You can use this information, as well as Amazon CloudFront metrics about your Amazon DynamoDB throughput to better manage the load on your Amazon DynamoDB table in subsequent requests.

The following factors influence Hive query performance when working with Amazon DynamoDB tables.

Read Percent Setting

By default, Amazon EMR manages the request load against your Amazon DynamoDB table according to your current provisioned throughput. However, when Amazon EMR returns information about your job that includes a high number of provisioned throughput exceeded responses, you can adjust the default read rate using the dynamodb.throughput.read.percent parameter when you set up the Hive table. For more information about setting the read percent parameter, see Hive Options.

Write Percent Setting

By default, Amazon EMR manages the request load against your Amazon DynamoDB table according to your current provisioned throughput. However, when Amazon EMR returns information about your job that includes a high number of provisioned throughput exceeded responses, you can adjust the default write rate using the dynamodb.throughput.write.percent parameter when you set up the Hive table. For more information about setting the write percent parameter, see Hive Options.

Retry Duration Setting

By default, Amazon EMR will re-run a Hive query if it has not returned a result within two minutes, the default retry interval. You can adjust this interval by setting the dynamodb.retry.duration parameter when you run a Hive query. For more information about setting the write percent parameter, see Hive Options.

Number of Map Tasks

The mapper daemons that Hadoop launches to process your requests to export and query data stored in Amazon DynamoDB are capped at a maximum read rate of 1 MiB per second to limit the read capacity used. If you have additional provisioned throughput available on Amazon DynamoDB, you can improve the performance of Hive export and query operations by increasing the number of mapper daemons. To do this, you can either increase the number of EC2 instances in your job flow or increase the number of mapper daemons running on each EC2 instance.

You can increase the number of EC2 instances in a job flow by stopping the current job flow and re-launching it with a larger number of EC2 instances. You specify the number of EC2 instances in the Configure EC2 Instances dialog box if you're launching the job flow from the Amazon Elastic MapReduce console, or with the --num-instances option if you're launching the job flow from the CLI.

The number of map tasks run on an instance depends on the EC2 instance type. For a list of the supported EC2 instance types and the number of mappers each one provides, go to Task Configuration.

Another way to increase the number of mapper daemons is to change the mapred.tasktracker.map.tasks.maximum configuration parameter of Hadoop to a higher value. This has the advantage of giving you more mappers without increasing either the number or the size of EC2 instances, which saves you money. A disadvantage is that setting this value too high can cause the EC2 instances in your job flow to run out of memory. To set mapred.tasktracker.map.tasks.maximum, launch the job flow and specify the Configure Hadoop bootstrap action, passing in a value for mapred.tasktracker.map.tasks.maximum as one of the arguments of the bootstrap action. This is shown in the following example.

--bootstrap-action s3n://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args -s,mapred.tasktracker.map.tasks.maximum=10 
        

For more information about bootstrap actions, go to Using Custom Bootstrap Actions in the Amazon Elastic Map Reduce Developer Guide.

Parallel Data Requests

Multiple data requests, either from more than one user or more than one application to a single table may drain read provisioned throughput and slow performance.

Process Duration

Data consistency in Amazon DynamoDB depends on the order of read and write operations on each node. While a Hive query is in progress, another application might load new data into the Amazon DynamoDB table or modify or delete existing data. In this case, the results of the Hive query might not reflect changes made to the data while the query was running.

Avoid Exceeding Throughput

When running Hive queries against Amazon DynamoDB, take care not to exceed your provisioned throughput, because this will deplete capacity needed for your application's calls to DynamoDB::Get. To ensure that this is not occurring, you should regularly monitor the read volume and throttling on application calls to DynamoDB::Get by checking logs and monitoring metrics in Amazon CloudWatch.

Request Time

Scheduling Hive queries that access a Amazon DynamoDB table when there is lower demand on the Amazon DynamoDB table improves performance. For example, if most of your application's users live in San Francisco, you might choose to export daily data at 4 a.m. PST, when the majority of users are asleep, and not updating records in your Amazon DynamoDB database.

Time-Based Tables

If the data is organized as a series of time-based Amazon DynamoDB tables, such as one table per day, you can export the data when the table becomes no longer active. You can use this technique to back up data to Amazon S3 on an ongoing fashion.

Archived Data

If you plan to run many Hive queries against the data stored in Amazon DynamoDB and your application can tolerate archived data, you may want to export the data to HDFS or Amazon S3 and run the Hive queries against a copy of the data instead of Amazon DynamoDB. This will conserve your read operations and provisioned throughput.

Viewing Hadoop Logs

If you run into an error, you can investigate what went wrong by viewing the Hadoop logs and user interface. For more information on how to do this, go to How to Monitor Hadoop on a Master Node and How to Use the Hadoop User Interface in the Amazon Elastic MapReduce Developer Guide.