Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Step 4: Set Up a Hive Table to Run Hive Commands

Apache Hive is a data warehouse application you can use to query data contained in Amazon Elastic MapReduce (Amazon EMR) job flows using a SQL-like language. Because we launched the job flow as a Hive application, Amazon EMR will install Hive on the Amazon EC2 instances it launches to process the job flow. To learn more about Hive, go to http://hive.apache.org/.

If you've followed the previous instructions to set up a job flow and SSH into the master node, you are ready to use Hive interactively.

To run Hive commands interactively

  1. At the hadoop command prompt for the current master node, type hive.

    You should see a hive prompt: hive>

  2. Enter a Hive command that maps a table in the Hive application to the data in Amazon DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in Amazon DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.

    The following shows the syntax for mapping a Hive table to an Amazon DynamoDB table.

    CREATE EXTERNAL TABLE hive_tablename (hive_column1_name column1_datatype, hive_column2_name column2_datatype...)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ("dynamodb.table.name" = "dynamodb_tablename", 
    "dynamodb.column.mapping" = "hive_column1_name:dynamodb_attribute1_name,hive_column2_name:dynamodb_attribute2_name...");
                    

    When you create a table in Hive from Amazon DynamoDB, you must create it as an external table using the keyword EXTERNAL. The difference between external and internal tables is that the data in internal tables is deleted when an internal table is dropped. This is not the desired behavior when connected to Amazon DynamoDB, and thus only external tables are supported.

    For example, the following Hive command creates a table named "hivetable1" in Hive that references the Amazon DynamoDB table named "dynamodbtable1". The Amazon DynamoDB table "dynamodbtable1" has a hash-and-range primary key schema. The hash key element is "name" (string type), the range key element is "year" (numeric type), and each item has an attribute value for "holidays" (string set type).

    CREATE EXTERNAL TABLE hiveTableName (col1 string, col2 bigint, col3 array<string>)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
    TBLPROPERTIES ("dynamodb.table.name" = "dynamodbtable1", 
    "dynamodb.column.mapping" = "col1:name,col2:year,col3:holidays"); 
                    

    Line 1 uses the HiveQL CREATE EXTERNAL TABLE statement. For "hivetable1", you need to establish a column for each attribute name-value pair in the Amazon DynamoDB table, and provide the data type. These values are not case-sensitive, and you can give the columns any name (except reserved words).

    Line 2 uses the STORED BY statement. The value of STORED BY is the name of the class that handles the connection between Hive and Amazon DynamoDB. It should be set to 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'.

    Line 3 uses the TBLPROPERTIES statement to associate "hivetable1" with the correct table and schema in Amazon DynamoDB. Provide TBLPROPERTIES with values for the dynamodb.table.name parameter and dynamodb.column.mapping parameter. These values are case-sensitive.

    [Note]Note

    All Amazon DynamoDB attribute names for the table must have corresponding columns in the Hive table. Otherwise, the Hive table won't contain the name-value pair from Amazon DynamoDB. If you do not map the Amazon DynamoDB primary key attributes, Hive generates an error. If you do not map a non-primary key attribute, no error is generated, but you won't see the data in the Hive table. If the data types do not match, the value will be null.

Then you can start running Hive operations on "hivetable1". Queries run against hivetable1" are internally run against the Amazon DynamoDB table "dynamodbtable1" of your Amazon DynamoDB account, consuming read or write units with each execution.

Sample HiveQL statements to perform tasks such as exporting or importing data from Amazon DynamoDB and joining tables are listed in Hive Command Examples for Exporting, Importing, and Querying Data in Amazon DynamoDB.

You can also create a file that contains a series of commands, launch a job flow, and reference that file to perform the operations. For more information, see Interactive and Batch Modes in the Amazon Elastic MapReduce Developer Guide.

To cancel a Hive request

When you execute a Hive query, the initial response from the server includes the command to cancel the request. To cancel the request at any time in the process, use the Kill Command from the server response.

  1. Enter Ctrl+C to exit the command line client.

  2. At the shell prompt, enter the Kill Command from the initial server response to your request.

    Alternatively, you can run the following command from the command line of the master node to kill the Hadoop job, where job-id is the identifier of the Hadoop job and can be retrieved from the Hadoop user interface. For more information about the Hadoop user interface, go to How to Use the Hadoop User Interface in the Amazon Elastic MapReduce Developer Guide.

    hadoop job -kill job-id
                    

Data Types for Hive and Amazon DynamoDB

The following table shows the available Hive data types and how they map to the corresponding Amazon DynamoDB data types.

Hive typeAmazon DynamoDB type
string

string (S)

bigint or double

number (N)

array

number set (NS) or string set (SS)

The bigint type in Hive is the same as the Java long type, and the Hive double type is the same as the Java double type in terms of precision. This means that if you have numeric data stored in Amazon DynamoDB that has precision higher than is available in the Hive datatypes, using Hive to export, import, or reference the Amazon DynamoDB data could lead to a loss in precision or a failure of the Hive query.

Hive Options

You can set the following Hive options to manage the transfer of data out of Amazon DynamoDB. These options only persist for the current Hive session. If you close the Hive command prompt and reopen it later on the job flow, these settings will have returned to the default values.

Hive OptionsDescription
dynamodb.throughput.read.percent

Set the rate of read operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively.

The value of 0.5 is the default read rate, which means that Hive will attempt to consume half of the read provisioned throughout resources in the table. Increasing this value above 0.5 increases the read request rate. Decreasing it below 0.5 decreases the read request rate. This read rate is approximate. The actual read rate will depend on factors such as whether there is a uniform distribution of keys in Amazon DynamoDB.

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live read traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available.

dynamodb.throughput.write.percent

Set the rate of write operations to keep your Amazon DynamoDB provisioned throughput rate in the allocated range for your table. The value is between 0.1 and 1.5, inclusively.

The value of 0.5 is the default write rate, which means that Hive will attempt to consume half of the write provisioned throughout resources in the table. Increasing this value above 0.5 increases the write request rate. Decreasing it below 0.5 decreases the write request rate. This write rate is approximate. The actual write rate will depend on factors such as whether there is a uniform distribution of keys in Amazon DynamoDB

If you find your provisioned throughput is frequently exceeded by the Hive operation, or if live write traffic is being throttled too much, then reduce this value below 0.5. If you have enough capacity and want a faster Hive operation, set this value above 0.5. You can also oversubscribe by setting it up to 1.5 if you believe there are unused input/output operations available or this is the initial data upload to the table and there is no live traffic yet.

dynamodb.endpoint

Specify the endpoint in case you have tables in different regions. The default endpoint is us-east-1.

dynamodb.max.map.tasks

Specify the maximum number of map tasks when reading data from Amazon DynamoDB. This value must be equal to or greater than 1.

dynamodb.retry.duration

Specify the number of minutes to use as the timeout duration for retrying Hive commands. This value must be an integer equal to or greater than 0. The default timeout duration is two minutes.

These options are set using the SET command as shown in the following example.

SET dynamodb.throughput.read.percent=1.0; 

INSERT OVERWRITE TABLE s3_export SELECT * 
FROM hiveTableName;