| Did this page help you? Yes No Tell us about it... |
Topics
Amazon Elastic MapReduce (Amazon EMR) provides support for Apache Hive. Amazon EMR supports several versions of Hive, which you can install on any running job flow. Amazon EMR also allows you to run multiple versions concurrently, allowing you to control your Hive version upgrade. The following sections describe the Hive configurations using Amazon EMR.
You can choose to run Hive in several different configurations. You set the
--hadoop-version, --hive-versions, and --ami-version
parameters in the job creation call as shown in the following
table.
The default configuration for Amazon EMR is Hive 0.7.1.3 with Hadoop 0.20.205.
The Amazon EMR console does not support Hive versioning and always loads the latest version of Hive.
Versions of the Amazon EMR CLI released on 9 April 2012 and later load the latest version of Hive by default. To use a verison of Hive other than the
latest, specify the --hive-versions parameter when you create the job flow. Versions of the Amazon EMR CLI
released prior to 9 April 2012 load the default configuration of Hive.
Calls to the API will launch the default configuratian of Hive unless you specify
--hive-versions as an argument to the step that
loads Hive onto the job flow during the call to RunJobFlow.
| Hive Version | Hadoop Version | Configuration Parameters |
|---|---|---|
| 0.4 | 0.18 |
--hadoop-version 0.18
|
| 0.5 | 0.20 |
--hadoop-version 0.20 --hive-versions 0.5 --ami-version 1.0
|
| 0.5 and 0.7 | 0.20 |
--hadoop-version 0.20 --hive-versions 0.5,0.7 --ami-version 1.0
|
| 0.7 | 0.20 |
--hadoop-version 0.20 --hive-versions 0.7 --ami-version 1.0
|
| 0.7.1 | 0.20 |
--hadoop-version 0.20 --hive-versions 0.7.1 --ami-version 1.0
|
| 0.7.1 | 0.20.205 |
--hadoop-version 0.20 --hive-versions 0.7.1 --ami-version 2.0
|
| 0.7.1.1 | 0.20.205 |
--hadoop-version 0.20.205 --hive-versions 0.7.1.1 --ami-version 2.0
|
| 0.7.1.2 | 0.20.205 |
--hadoop-version 0.20.205 --hive-versions 0.7.1.2 --ami-version 2.0
|
| 0.7.1.3 | 0.20.205 |
--hadoop-version 0.20.205 --hive-versions 0.7.1.3 --ami-version 2.0
|
| 0.7.1.4 | 0.20.205 |
--hadoop-version 0.20.205 --hive-versions 0.7.1.4 --ami-version 2.0
|
Hive 0.7.1.1 introduced support for accessing Amazon DynamoDB, as detailed in Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR. It is a minor verison of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database and log location of 0.7.1 on the job flow.
Hive 0.7.1.2 modifies the way files are named in Amazon S3 for dynamic partitions. It prepends file names in Amazon S3 for dynamic partitions with a
unique identifier. Using Hive 0.7.1.2 you can run queries in parallel with set hive.exec.parallel=true.
It also fixes an issue with filter pushdown when accessing Amazon DynamoDB with spare data sets.
Hive 0.7.1.3 adds the dynamodb.retry.duration option, which you can use to
configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option,
which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see
Hive Options.
Hive 0.7.1.4 prevents the "SET" command in Hive from changing the current database of the current session.
To specify the Hive version when creating the job flow
Use the --hive-versions parameter. The following command-line example
creates an interactive Hive job flow running Hadoop 0.20 and Hive 0.7.1.
$ ./elastic-mapreduce --create --alive --name "Test Hive" \ --hadoop-version0.20\ --num-instances 5 --instance-type instanceType \ --hive-interactive \ --hive-versions0.7.1
![]() | Note |
|---|---|
The |
To specify the latest Hive version when creating the job flow
Use the --hive-versions parameter with the latest keyword.
The following command-line example
creates an interactive Hive job flow running the latest verison of Hive.
$ ./elastic-mapreduce --create --alive --name "Test Hive" \
--hadoop-version 0.20 \
--num-instances 5 --instance-type instanceType \
--hive-interactive \
--hive-versions latest
To specify the Hive version for a job flow that is interactive and uses a Hive script
If you have a job flow that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example illustrates setting both the interactive and the script version of Hive to use 0.7.1.
$ ./elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ \ --name "Testing m1.large AMI 1" \ --ami-version latest --hadoop-version 0.20 \ --instance-type m1.large --num-instances 5 \ --hive-interactive --hive-versions 0.7.1.2 \ --hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2
To load multiple versions of Hive for a given job flow
Use the --hive-versions parameter and separate the version numbers by comma. The following command-line example creates an interactive job flow running Hadoop 0.20 and multiple versions of Hive. With this configuration, you can use any of the installed versions of Hive on the job flow.
$ ./elastic-mapreduce --create --alive --name "Test Hive" \ --hadoop-version0.20\ --num-instances 5 --instance-type instanceType \ --hive-interactive \ --hive-versions0.5,0.7.1
To call a specific version of Hive
Add the version number to the call. For example, hive-0.5 or hive-0.7.1.
![]() | Note |
|---|---|
If you have multiple versions of Hive loaded on a job flow, calling |
![]() | Note |
|---|---|
When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location. |
You can use the --print-hive-version command to display the version of the Hive
currently in use for a given job flow. This is a useful command to call after you have
upgraded to a new version of Hive to confirm that the upgrade succeeded, or when you are
using multiple versions of Hive and need to confirm which version is currently running.
The syntax for this is as follows, where
JobFlowID is the identifier of the job flow to check the Hive version on.
elastic-mapreduce --jobflow JobFlowID --print-hive-versionYou can take advantage of Hive 0.7 bug fixes and performance improvements on your
existing Hive 0.5 job flows by upgrading your version of Hive. Hive 0.5 and Hive 0.7 do not
share schema. However, you can share data between them by creating an external table in
each version with the same LOCATION parameter.
To share data between Hive 0.5 and Hive 0.7
|
1 |
Start a Hive 0.7 cluster. |
|
2 |
Configure the clusters to allow communication: On the Hive 0.5 cluster, configure the insert overwrite directory to the location of the HDFS of the Hive 0.7 cluster. |
|
3 |
Export and reimport the data. |
![]() | Note |
|---|---|
Using this same procedure, you can share data between Hive 0.5 and Hive 0.7.1. |
This section describes the differences between Amazon EMR Hive installations and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.
The Apache Hive default input format is text. The Amazon EMR default input
format for Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. You
can specify the hive.base.inputformat option in Hive to select a
different file format, for example:
hive>set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
To switch back to the default Amazon EMR input format, you would enter the following:
hive>set hive.base.inputformat=default;
Apache Hive saves Hive log files to /tmp/{user.name}/ in a file named
hive.log. Amazon EMR saves Hive logs to
/mnt/var/log/apps/. In order to support concurrent versions of Hive, the
version of Hive you run determines the log file name, as shown in the following table.
| Hive Version | Log File Name |
|---|---|
| 0.4 | hive.log |
| 0.5 | hive_05.log |
| 0.7 | hive_07.log |
| 0.7.1 | hive_07_1.log |
![]() | Note |
|---|---|
Minor verisons of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1. |
Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on port 10000. In order to support concurrent versions of Hive, Amazon EMR operates Hive 0.5 on port 10000, Hive 0.7 on port 10001, and Hive 0.7.1 on port 10002. For more information about thrift services, go to http://wiki.apache.org/thrift/.
Amazon EMR enables you to run Hive scripts in two modes:
Interactive
Batch
Typically, you use interactive mode to troubleshoot your job flow and use batch mode in production.
In interactive mode, you ssh as the Hadoop user into the master
node in the Hadoop cluster and use the Hive Command Line Interface to develop and run your
Hive script. Interactive mode enables you to revise the Hive script more easily than batch
mode. After you successfully revise the Hive script in interactive mode, you can upload the
script to Amazon S3 and use batch mode to run production job flows.
In batch mode, you upload your Hive script to Amazon S3, and then execute it using a job flow. You can pass parameter values into your Hive script and reference resources in Amazon S3. Variables in Hive scripts use the dollar sign and curly braces, for example:
${VariableName}In the Amazon EMR CLI, use the -d parameter to pass
values into the Hive script as in the following example.
$ ./elastic-mapreduce --create \ --name "Hive Job Flow" \ --hive-script \ --args s3://myawsbucket/myquery.q \ --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=3://myawsbucket/output
Using batch mode, you can pass parameter values into a Hive script from the Specify Parameters page of the Create a New Job Flow wizard found in the Amazon EMR console. The values go into the Extra Args field. For example, you could enter:
-d VariableName=ValueThe Amazon EMR console and Amazon EMR command line interface (CLI) both interactive and batch modes.
You can run Hive in interactive mode from both the CLI and Amazon EMR console.
To start an interactive job flow from the command line, use the
--alive option with the --create
parameter so that the job flow remains active until you terminate it, for
example:
$ ./elastic-mapreduce --create --alive --name "Hive job flow" \ --num-instances5--instance-typeinstanceType\ --hive-interactive
The return output is similar to the following:
Created jobflow JobFlowIDAdd additional steps from the Amazon Elastic MapReduce (Amazon EMR) CLI or ssh directly to
the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide .
You start an interactive job flow from the Amazon EMR console in the Create a New Job Flow wizard.
To start an interactive job flow from the Amazon EMR console
Click Create New Job Flow and launch the Create a New Job Flow wizard.
Enter a Job Flow Name, and choose a Hive Program Job Type. Click Continue.
From the Specify Parameters page, select Start an Interactive Hive Session, enter the appropriate inputs for Script Location, Input Location, and Output Location, and then click Continue:
Choose the appropriate Amazon EC2 instance type, EC2 key pair, and debugging levels for your job flow on the Configure EC2 Instances page, and then click Continue.
Add any bootstrap actions on the Bootstrap Actions page, and then click Continue.
Review your job, and then click Continue.
The job flow begins. When the job flow is in the WAITING state, you can
add steps to your job flow from the Amazon EMR CLI or ssh directly
to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.
Adding steps can help you test and develop Hive scripts. For example, if the script fails, you can add a new step to the job flow without having to wait for a new job flow to start. The following procedure shows you how to use the command line to add Hive as a new step to an existing job flow.
To add Hive to an existing job flow
Enter the following command, replacing the location with an Amazon S3 bucket
containing a Hive script and the <JobFlowID> from
your job:
$ ./elastic-mapreduce --jobflowJobFlowID\ --hive-script \ --args s3://location/myquery.q\ --args -d,INPUT=s3://location/input,-d,OUTPUT=s3://location/output
The following procedure shows how to run Hive in batch mode from the command line. The procedure assumes that you stored the Hive script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, go to the Amazon S3 Getting Started Guide.
To create a job flow with a step that executes a Hive script
Enter the following command, substituting the replaceable parameters with the actual values from your job:
$ ./elastic-mapreduce --create \ --name "Hive job flow" \ --hive-script \ --argss3://myawsbucket/myquery.q\ --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output
The --args option provides arguments to the Hive-script. The
first --args option here specifies the location of the Hive
script in Amazon S3. In the second --args option, the
-d provides a way to pass values (INPUT,
OUTPUT) into the script. Within the Hive script, these parameters are
available as ${variable}. In this example, Hive replaces
${INPUT} and ${OUTPUT} with the values you passed in. These
variables are substituted during a preprocessing step, so the variables can occur
anywhere in the Hive script.
The return output is similar to the following:
Created jobflow JobFlowIDHive records metastore information in a MySQL database that is located, by default, on the master node. The metastore contains a description of the input data, including the partition names and data types, contained in the input files.
When a job flow terminates, all associated cluster nodes shut down. All data stored on a cluster node, including the Hive metastore, is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists.
If you have multiple job flows that share common data and update the metastore, you should locate the shared metastore on persistent storage.
To share the metastore between job flows, override the default location of the MySQL database to an external persistent storage location.
![]() | Note |
|---|---|
Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two job flows, you must ensure that you do not write to the same metastore table concurrently—unless you are writing to different partitions of the same metastore table. |
The following procedure shows you how to override the default configuration values for the Hive metastore location and start a job flow using the reconfigured metastore location.
To create a metastore located outside of the cluster
Create a MySQL database.
Relational Database Service (RDS) provides a cloud-based MySQL database. Instructions on how to create an Amazon RDS database are at http://aws.amazon.com/rds/.
Modify your security groups to allow JDBC connections between your MySQL database and the ElasticMapReduce-Master security group.
Instructions on how to modify your security groups for access are at http://aws.amazon.com/rds/faqs/#31.
Set the JDBC configuration values in hive-site.xml:
Create a hive-site.xml configuration file containing
the following information:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>username</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
<description>Password to use against metastore database</description>
</property>
</configuration>
<hostname> is the DNS address of the
Amazon EC2 instance running MySQL.
<username> and
<password> are the credentials for your MySQL
database.
The MySQL JDBC drivers are installed by Amazon EMR.
![]() | Note |
|---|---|
The value property should not contain any spaces or carriage returns. It should appear all on one line. |
Save your hive-site.xml file to a location on Amazon
S3, such as
s3://.myawsbucket/conf/hive-site.xml
Create a job flow and specify the Amazon S3 location of the new Hive configuration file, for example:
$ ./elastic-mapreduce --create --alive \ --name "Hive job flow" \ --hive-interactive \ --hive-site=s3://myawsbucket/conf/hive-site.xml
The --hive-site parameter installs the configuration
values in hive-site.xml in the specified location. The
--hive-site parameter overrides only the values defined in
hive-site.xml.
Connect to the master node of your job flow.
Instructions on how to connect to the master node are available in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.
Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:
CREATE EXTERNAL TABLE IF NOT EXISTS table_name
(
key int,
value int
)
LOCATION s3://myawsbucket/hdfs/Add your Hive script to the running job flow.
Your Hive job flow runs using the metastore located on Amazon S3. Launch all additional Hive job flows that share this metastore by specifying the metastore location.
The Hive JDBC driver provides a mechanism to move data from one database format to another. Installing a JDBC client requires you to download the JDBC driver and install the client software correctly. You can use the Hive JDBC driver to connect to a SQL client. An example of connecting to the SQuirrel SQL client follows.
To download JDBC drivers
Download Hive 0.5 JDBC drivers from http://aws.amazon.com/developertools/Elastic-MapReduce/0196055244487017 and save the files locally.
Download Hive 0.7 JDBC drivers from http://aws.amazon.com/developertools/Elastic-MapReduce/1818074809286277 and save the files locally.
Download Hive 0.7.1 JDBC drivers from http://aws.amazon.com/developertools/Elastic-MapReduce/8084613472207189 and save the files locally.
You need only download the drivers appropriate to the version(s) of Hive you want to access.
To install SQuirrel SQL client
Download SQuirrel SQL client from http://squirrel-sql.sourceforge.net/.
Open the self extracting JAR file, and follow the wizard instructions to install the software.
From the command line, create an SSH tunnel to the master node of your Hive job flow as follows:
| If you are installing... | Enter the following... |
|---|---|
| Hive 0.5 drivers |
ssh -o ServerAliveInterval=10 -L 10000:localhost:10000
hadoop@
|
| Hive 0.7 drivers |
ssh -o ServerAliveInterval=10 -L 10001:localhost:10001
hadoop@
|
| Hive 0.7.1 drivers |
ssh -o ServerAliveInterval=10 -L 10002:localhost:10002
hadoop@
|
The MasterNodeDNS is the DNS of the master node of the
Hadoop cluster and mysecretkey.pem is the name of your AWS
secret key file.
Add the JDBC driver to SQuirrel SQL:
Open SQuirrel SQL and click the Drivers tab.
Double-click JDBC ODBC Bridge to add attributes.
Type org.apache.hadoop.hive.jdbc.HiveDriver in the
Class Name field, and then click
Add.
Navigate to the location of your JDBC drivers.
Add the following JAR files:
| If you are installing... | Add the following... |
|---|---|
| Hive 0.5 drivers |
hadoop-0.20-core.jar hive/lib/hive-exec-0.5.0.jar hive/lib/hive-jdbc-0.5.0.jar hive/lib/hive-metastore-0.5.0.jar hive/lib/hive-service-0.5.0.jar hive/lib/libfb303.jar hive/lib/log4j-1.2.15.jar lib/commons-logging-1.0.4.jar |
| Hive 0.7 drivers |
hadoop-0.20-core.jar hive/lib/hive-exec-0.7.0.jar hive/lib/hive-jdbc-0.7.0.jar hive/lib/hive-metastore-0.7.0.jar hive/lib/hive-service-0.7.0.jar hive/lib/libfb303.jar lib/commons-logging-1.0.4.jar slf4j-api-1.5.6.jar slf4j-log4j12-1.5.6.jar |
| Hive 0.7.1 drivers |
hadoop-0.20-core.jar hive/lib/hive-exec-0.7.1.jar hive/lib/hive-jdbc-0.7.1.jar hive/lib/hive-metastore-0.7.1.jar hive/lib/hive-service-0.7.1.jar hive/lib/libfb303.jar lib/commons-logging-1.0.4.jar slf4j-api-1.6.1.jar slf4j-log4j12-1.6.1.jar |
Click OK.
Add a new alias:
Click the Alias tab, and then click + to add a new alias.
Enter the following information in the Add Alias dialog:
| Field | Description |
|---|---|
| Name | Enter the name of the alias. |
| Driver | Select the JDBC driver from the list. |
| User Name | Enter your local machine login. |
| Password | Enter your local machine password. |
Enter the URL information in the Add Alias dialog based on the version of Hive:
| If you are installing... | Enter the following... |
|---|---|
| Hive 0.5 drivers |
jdbc:hive://localhost:10000/default
|
| Hive 0.7 drivers |
jdbc:hive://localhost:10001/default
|
| Hive 0.7.1 drivers |
jdbc:hive://localhost:10002/default
|
Click OK.
The SQuirrel SQL client is ready to use.
For more information about using Hive and the JDBC interface, go to http://wiki.apache.org/hadoop/Hive/HiveClient and http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface.