Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Hive Configuration

Amazon Elastic MapReduce (Amazon EMR) provides support for Apache Hive. Amazon EMR supports several versions of Hive, which you can install on any running job flow. Amazon EMR also allows you to run multiple versions concurrently, allowing you to control your Hive version upgrade. The following sections describe the Hive configurations using Amazon EMR.

Supported Hive Versions

You can choose to run Hive in several different configurations. You set the --hadoop-version, --hive-versions, and --ami-version parameters in the job creation call as shown in the following table.

The default configuration for Amazon EMR is Hive 0.7.1.3 with Hadoop 0.20.205.

The Amazon EMR console does not support Hive versioning and always loads the latest version of Hive.

Versions of the Amazon EMR CLI released on 9 April 2012 and later load the latest version of Hive by default. To use a verison of Hive other than the latest, specify the --hive-versions parameter when you create the job flow. Versions of the Amazon EMR CLI released prior to 9 April 2012 load the default configuration of Hive.

Calls to the API will launch the default configuratian of Hive unless you specify --hive-versions as an argument to the step that loads Hive onto the job flow during the call to RunJobFlow.

Hive VersionHadoop VersionConfiguration Parameters
0.40.18 --hadoop-version 0.18
0.50.20 --hadoop-version 0.20 --hive-versions 0.5 --ami-version 1.0
0.5 and 0.70.20 --hadoop-version 0.20 --hive-versions 0.5,0.7 --ami-version 1.0
0.70.20 --hadoop-version 0.20 --hive-versions 0.7 --ami-version 1.0
0.7.10.20 --hadoop-version 0.20 --hive-versions 0.7.1 --ami-version 1.0
0.7.10.20.205 --hadoop-version 0.20 --hive-versions 0.7.1 --ami-version 2.0
0.7.1.10.20.205 --hadoop-version 0.20.205 --hive-versions 0.7.1.1 --ami-version 2.0
0.7.1.20.20.205 --hadoop-version 0.20.205 --hive-versions 0.7.1.2 --ami-version 2.0
0.7.1.30.20.205 --hadoop-version 0.20.205 --hive-versions 0.7.1.3 --ami-version 2.0
0.7.1.40.20.205 --hadoop-version 0.20.205 --hive-versions 0.7.1.4 --ami-version 2.0

Hive 0.7.1.1 introduced support for accessing Amazon DynamoDB, as detailed in Export, Import, Query, and Join Tables in Amazon DynamoDB Using Amazon EMR. It is a minor verison of 0.7.1 developed by the Amazon EMR team. When specified as the Hive version, Hive 0.7.1.1 overwrites the Hive 0.7.1 directory structure and configuration with its own values. Specifically, Hive 0.7.1.1 matches Apache Hive 0.7.1 and uses the Hive server port, database and log location of 0.7.1 on the job flow.

Hive 0.7.1.2 modifies the way files are named in Amazon S3 for dynamic partitions. It prepends file names in Amazon S3 for dynamic partitions with a unique identifier. Using Hive 0.7.1.2 you can run queries in parallel with set hive.exec.parallel=true. It also fixes an issue with filter pushdown when accessing Amazon DynamoDB with spare data sets.

Hive 0.7.1.3 adds the dynamodb.retry.duration option, which you can use to configure the timeout duration for retrying Hive queries against tables in Amazon DynamoDB. This version of Hive also supports the dynamodb.endpoint option, which you can use to specify the Amazon DynamoDB endpoint to use for a Hive table. For more information about these options, see Hive Options.

Hive 0.7.1.4 prevents the "SET" command in Hive from changing the current database of the current session.

To specify the Hive version when creating the job flow

  • Use the --hive-versions parameter. The following command-line example creates an interactive Hive job flow running Hadoop 0.20 and Hive 0.7.1.

$ ./elastic-mapreduce --create --alive --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 --instance-type instanceType \
  --hive-interactive \
  --hive-versions 0.7.1 
		
[Note]Note

The --hive-versions parameter must come after any reference to the parameters --hive-interactive, --hive-script, or --hive-site.

To specify the latest Hive version when creating the job flow

  • Use the --hive-versions parameter with the latest keyword. The following command-line example creates an interactive Hive job flow running the latest verison of Hive.

$ ./elastic-mapreduce --create --alive --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 --instance-type instanceType \
  --hive-interactive \
  --hive-versions latest 
		

To specify the Hive version for a job flow that is interactive and uses a Hive script

  • If you have a job flow that uses Hive both interactively and from a script, you must set the Hive version for each type of use. The following command-line example illustrates setting both the interactive and the script version of Hive to use 0.7.1.

$ ./elastic-mapreduce --create --debug --log-uri s3://myawsbucket/perftest/logs/ \
--name "Testing m1.large AMI 1" \
--ami-version latest --hadoop-version 0.20 \
--instance-type m1.large --num-instances 5 \
--hive-interactive  --hive-versions 0.7.1.2 \
--hive-script s3://myawsbucket/perftest/hive-script.hql --hive-versions 0.7.1.2 
		

To load multiple versions of Hive for a given job flow

  • Use the --hive-versions parameter and separate the version numbers by comma. The following command-line example creates an interactive job flow running Hadoop 0.20 and multiple versions of Hive. With this configuration, you can use any of the installed versions of Hive on the job flow.

$ ./elastic-mapreduce --create --alive --name "Test Hive" \
  --hadoop-version 0.20 \
  --num-instances 5 --instance-type instanceType \
  --hive-interactive \
  --hive-versions 0.5,0.7.1
		

To call a specific version of Hive

  • Add the version number to the call. For example, hive-0.5 or hive-0.7.1.

[Note]Note

If you have multiple versions of Hive loaded on a job flow, calling hive will access the default version of Hive or the version loaded last if there are multiple --hive-versions parameters specified in the job flow creation call. When the comma-separated syntax is used with --hive-versions to load multiple versions, hive will access the default version of Hive.

[Note]Note

When running multiple versions of Hive concurrently, all versions of Hive can read the same data. They cannot, however, share metadata. Use an external metastore if you want multiple versions of Hive to read and write to the same location.

Display the Hive Version

You can use the --print-hive-version command to display the version of the Hive currently in use for a given job flow. This is a useful command to call after you have upgraded to a new version of Hive to confirm that the upgrade succeeded, or when you are using multiple versions of Hive and need to confirm which version is currently running. The syntax for this is as follows, where JobFlowID is the identifier of the job flow to check the Hive version on.

elastic-mapreduce --jobflow JobFlowID --print-hive-version

Sharing Data Between Hive 0.5 and Hive 0.7

You can take advantage of Hive 0.7 bug fixes and performance improvements on your existing Hive 0.5 job flows by upgrading your version of Hive. Hive 0.5 and Hive 0.7 do not share schema. However, you can share data between them by creating an external table in each version with the same LOCATION parameter.

To share data between Hive 0.5 and Hive 0.7

1

Start a Hive 0.7 cluster.

2

Configure the clusters to allow communication:

On the Hive 0.5 cluster, configure the insert overwrite directory to the location of the HDFS of the Hive 0.7 cluster.

3

Export and reimport the data.


[Note]Note

Using this same procedure, you can share data between Hive 0.5 and Hive 0.7.1.

Differences from Apache Hive Defaults

This section describes the differences between Amazon EMR Hive installations and the default versions of Hive available at http://svn.apache.org/viewvc/hive/branches/.

Input Format

The Apache Hive default input format is text. The Amazon EMR default input format for Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. You can specify the hive.base.inputformat option in Hive to select a different file format, for example:

hive>set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;

To switch back to the default Amazon EMR input format, you would enter the following:

hive>set hive.base.inputformat=default;

Log files

Apache Hive saves Hive log files to /tmp/{user.name}/ in a file named hive.log. Amazon EMR saves Hive logs to /mnt/var/log/apps/. In order to support concurrent versions of Hive, the version of Hive you run determines the log file name, as shown in the following table.

Hive VersionLog File Name
0.4hive.log
0.5hive_05.log
0.7hive_07.log
0.7.1hive_07_1.log
[Note]Note

Minor verisons of Hive 0.7.1, such as Hive 0.7.1.3 and Hive 0.7.1.4, share the same log file location as Hive 0.7.1.

Thrift Service Ports

Thrift is an RPC framework that defines a compact binary serialization format used to persist data structures for later analysis. Normally, Hive configures the server to operate on port 10000. In order to support concurrent versions of Hive, Amazon EMR operates Hive 0.5 on port 10000, Hive 0.7 on port 10001, and Hive 0.7.1 on port 10002. For more information about thrift services, go to http://wiki.apache.org/thrift/.

Interactive and Batch Modes

Amazon EMR enables you to run Hive scripts in two modes:

  • Interactive

  • Batch

Typically, you use interactive mode to troubleshoot your job flow and use batch mode in production.

In interactive mode, you ssh as the Hadoop user into the master node in the Hadoop cluster and use the Hive Command Line Interface to develop and run your Hive script. Interactive mode enables you to revise the Hive script more easily than batch mode. After you successfully revise the Hive script in interactive mode, you can upload the script to Amazon S3 and use batch mode to run production job flows.

In batch mode, you upload your Hive script to Amazon S3, and then execute it using a job flow. You can pass parameter values into your Hive script and reference resources in Amazon S3. Variables in Hive scripts use the dollar sign and curly braces, for example:

${VariableName}

In the Amazon EMR CLI, use the -d parameter to pass values into the Hive script as in the following example.

$ ./elastic-mapreduce --create \
  --name "Hive Job Flow"  \
  --hive-script   \
  --args s3://myawsbucket/myquery.q \
  --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=3://myawsbucket/output

Using batch mode, you can pass parameter values into a Hive script from the Specify Parameters page of the Create a New Job Flow wizard found in the Amazon EMR console. The values go into the Extra Args field. For example, you could enter:

-d VariableName=Value

The Amazon EMR console and Amazon EMR command line interface (CLI) both interactive and batch modes.

Running Hive in Interactive Mode

You can run Hive in interactive mode from both the CLI and Amazon EMR console.

  • To start an interactive job flow from the command line, use the --alive option with the --create parameter so that the job flow remains active until you terminate it, for example:

    $ ./elastic-mapreduce --create --alive --name "Hive job flow" \
      --num-instances 5 --instance-type instanceType \
      --hive-interactive

The return output is similar to the following:

Created jobflow JobFlowID

Add additional steps from the Amazon Elastic MapReduce (Amazon EMR) CLI or ssh directly to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide .

You start an interactive job flow from the Amazon EMR console in the Create a New Job Flow wizard.

To start an interactive job flow from the Amazon EMR console

  1. Click Create New Job Flow and launch the Create a New Job Flow wizard.

  2. Enter a Job Flow Name, and choose a Hive Program Job Type. Click Continue.

  3. From the Specify Parameters page, select Start an Interactive Hive Session, enter the appropriate inputs for Script Location, Input Location, and Output Location, and then click Continue:

  4. Choose the appropriate Amazon EC2 instance type, EC2 key pair, and debugging levels for your job flow on the Configure EC2 Instances page, and then click Continue.

  5. Add any bootstrap actions on the Bootstrap Actions page, and then click Continue.

  6. Review your job, and then click Continue.

The job flow begins. When the job flow is in the WAITING state, you can add steps to your job flow from the Amazon EMR CLI or ssh directly to the master node following the instructions in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

Adding steps can help you test and develop Hive scripts. For example, if the script fails, you can add a new step to the job flow without having to wait for a new job flow to start. The following procedure shows you how to use the command line to add Hive as a new step to an existing job flow.

To add Hive to an existing job flow

  • Enter the following command, replacing the location with an Amazon S3 bucket containing a Hive script and the <JobFlowID> from your job:

    $ ./elastic-mapreduce --jobflow JobFlowID \
      --hive-script \
      --args s3://location/myquery.q \
      --args -d,INPUT=s3://location/input,-d,OUTPUT=s3://location/output

Running Hive in Batch Mode

The following procedure shows how to run Hive in batch mode from the command line. The procedure assumes that you stored the Hive script in a bucket on Amazon S3. For more information about uploading files into Amazon S3, go to the Amazon S3 Getting Started Guide.

To create a job flow with a step that executes a Hive script

  • Enter the following command, substituting the replaceable parameters with the actual values from your job:

    $ ./elastic-mapreduce --create \
      --name "Hive job flow"  \
      --hive-script   \
      --args s3://myawsbucket/myquery.q \
      --args -d,INPUT=s3://myawsbucket/input,-d,OUTPUT=s3://myawsbucket/output

The --args option provides arguments to the Hive-script. The first --args option here specifies the location of the Hive script in Amazon S3. In the second --args option, the -d provides a way to pass values (INPUT, OUTPUT) into the script. Within the Hive script, these parameters are available as ${variable}. In this example, Hive replaces ${INPUT} and ${OUTPUT} with the values you passed in. These variables are substituted during a preprocessing step, so the variables can occur anywhere in the Hive script.

The return output is similar to the following:

Created jobflow JobFlowID

Creating a Metastore Outside the Hadoop Cluster

Hive records metastore information in a MySQL database that is located, by default, on the master node. The metastore contains a description of the input data, including the partition names and data types, contained in the input files.

When a job flow terminates, all associated cluster nodes shut down. All data stored on a cluster node, including the Hive metastore, is deleted. Information stored elsewhere, such as in your Amazon S3 bucket, persists.

If you have multiple job flows that share common data and update the metastore, you should locate the shared metastore on persistent storage.

To share the metastore between job flows, override the default location of the MySQL database to an external persistent storage location.

[Note]Note

Hive neither supports nor prevents concurrent write access to metastore tables. If you share metastore information between two job flows, you must ensure that you do not write to the same metastore table concurrently—unless you are writing to different partitions of the same metastore table.

The following procedure shows you how to override the default configuration values for the Hive metastore location and start a job flow using the reconfigured metastore location.

To create a metastore located outside of the cluster

  1. Create a MySQL database.

    Relational Database Service (RDS) provides a cloud-based MySQL database. Instructions on how to create an Amazon RDS database are at http://aws.amazon.com/rds/.

  2. Modify your security groups to allow JDBC connections between your MySQL database and the ElasticMapReduce-Master security group.

    Instructions on how to modify your security groups for access are at http://aws.amazon.com/rds/faqs/#31.

  3. Set the JDBC configuration values in hive-site.xml:

    1. Create a hive-site.xml configuration file containing the following information:

      <configuration>
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://hostname:3306/hive?createDatabaseIfNotExist=true</value>
          <description>JDBC connect string for a JDBC metastore</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>com.mysql.jdbc.Driver</value>
          <description>Driver class name for a JDBC metastore</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>username</value>
          <description>Username to use against metastore database</description>
        </property>
        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>password</value>
          <description>Password to use against metastore database</description>
        </property>
      </configuration>

      <hostname> is the DNS address of the Amazon EC2 instance running MySQL. <username> and <password> are the credentials for your MySQL database.

      The MySQL JDBC drivers are installed by Amazon EMR.

      [Note]Note

      The value property should not contain any spaces or carriage returns. It should appear all on one line.

    2. Save your hive-site.xml file to a location on Amazon S3, such as s3://myawsbucket/conf/hive-site.xml.

  4. Create a job flow and specify the Amazon S3 location of the new Hive configuration file, for example:

    $ ./elastic-mapreduce --create --alive \
      --name "Hive job flow"    \
      --hive-interactive \
      --hive-site=s3://myawsbucket/conf/hive-site.xml

    The --hive-site parameter installs the configuration values in hive-site.xml in the specified location. The --hive-site parameter overrides only the values defined in hive-site.xml.

  5. Connect to the master node of your job flow.

    Instructions on how to connect to the master node are available in the Amazon Elastic MapReduce (Amazon EMR) Getting Started Guide.

  6. Create your Hive tables specifying the location on Amazon S3 by entering a command similar to the following:

    CREATE EXTERNAL TABLE IF NOT EXISTS table_name
    (
    key int,
    value int
    )
    LOCATION s3://myawsbucket/hdfs/
  7. Add your Hive script to the running job flow.

Your Hive job flow runs using the metastore located on Amazon S3. Launch all additional Hive job flows that share this metastore by specifying the metastore location.

Using the Hive JDBC Driver

The Hive JDBC driver provides a mechanism to move data from one database format to another. Installing a JDBC client requires you to download the JDBC driver and install the client software correctly. You can use the Hive JDBC driver to connect to a SQL client. An example of connecting to the SQuirrel SQL client follows.

To download JDBC drivers

You need only download the drivers appropriate to the version(s) of Hive you want to access.

To install SQuirrel SQL client

  1. Download SQuirrel SQL client from http://squirrel-sql.sourceforge.net/.

  2. Open the self extracting JAR file, and follow the wizard instructions to install the software.

  3. From the command line, create an SSH tunnel to the master node of your Hive job flow as follows:

    If you are installing...Enter the following...
    Hive 0.5 drivers ssh -o ServerAliveInterval=10 -L 10000:localhost:10000 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem
    Hive 0.7 drivers ssh -o ServerAliveInterval=10 -L 10001:localhost:10001 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem
    Hive 0.7.1 drivers ssh -o ServerAliveInterval=10 -L 10002:localhost:10002 hadoop@MasterNodeDNS -i $HOME/mysecretkey.pem

    The MasterNodeDNS is the DNS of the master node of the Hadoop cluster and mysecretkey.pem is the name of your AWS secret key file.

  4. Add the JDBC driver to SQuirrel SQL:

    1. Open SQuirrel SQL and click the Drivers tab.

    2. Double-click JDBC ODBC Bridge to add attributes.

    3. Type org.apache.hadoop.hive.jdbc.HiveDriver in the Class Name field, and then click Add.

    4. Navigate to the location of your JDBC drivers.

    5. Add the following JAR files:

      If you are installing...Add the following...
      Hive 0.5 drivers
      hadoop-0.20-core.jar
      hive/lib/hive-exec-0.5.0.jar
      hive/lib/hive-jdbc-0.5.0.jar
      hive/lib/hive-metastore-0.5.0.jar
      hive/lib/hive-service-0.5.0.jar
      hive/lib/libfb303.jar
      hive/lib/log4j-1.2.15.jar
      lib/commons-logging-1.0.4.jar             
      Hive 0.7 drivers
      hadoop-0.20-core.jar
      hive/lib/hive-exec-0.7.0.jar
      hive/lib/hive-jdbc-0.7.0.jar
      hive/lib/hive-metastore-0.7.0.jar
      hive/lib/hive-service-0.7.0.jar
      hive/lib/libfb303.jar
      lib/commons-logging-1.0.4.jar    
      slf4j-api-1.5.6.jar
      slf4j-log4j12-1.5.6.jar         
      Hive 0.7.1 drivers
      hadoop-0.20-core.jar
      hive/lib/hive-exec-0.7.1.jar
      hive/lib/hive-jdbc-0.7.1.jar
      hive/lib/hive-metastore-0.7.1.jar
      hive/lib/hive-service-0.7.1.jar
      hive/lib/libfb303.jar
      lib/commons-logging-1.0.4.jar    
      slf4j-api-1.6.1.jar
      slf4j-log4j12-1.6.1.jar         
    6. Click OK.

  5. Add a new alias:

    1. Click the Alias tab, and then click + to add a new alias.

    2. Enter the following information in the Add Alias dialog:

      FieldDescription
      NameEnter the name of the alias.
      DriverSelect the JDBC driver from the list.
      User NameEnter your local machine login.
      PasswordEnter your local machine password.
    3. Enter the URL information in the Add Alias dialog based on the version of Hive:

      If you are installing...Enter the following...
      Hive 0.5 drivers jdbc:hive://localhost:10000/default
      Hive 0.7 drivers jdbc:hive://localhost:10001/default
      Hive 0.7.1 drivers jdbc:hive://localhost:10002/default
    4. Click OK.

The SQuirrel SQL client is ready to use.

For more information about using Hive and the JDBC interface, go to http://wiki.apache.org/hadoop/Hive/HiveClient and http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface.