Amazon Elastic MapReduce
Getting Started Guide (API Version 2009-11-30)
Print this pageEmail this pageGo to the ForumsView the PDFShare this page on TwitterShare this page on FacebookBookmark this page on DeliciousSubmit this page to RedditSubmit this page to DiggDid this page help you?  Yes  No   Tell us about it...

Create a Job Flow Using Hive

Job Flow Using Hive

This sample Hive script combines advertisement impression and click log data to evaluate the success of targeted online advertising. The script combines the two sets of log data, places the information into a Hive cluster, and outputs the results to a specified directory. The following script processes all impressions that occurred between 2009-04-13 8:00 and 2009-04-13 9:00 and were referred by twitter.com from a Mozilla browser.

A detailed description of this business problem can be found in the tutorial, Contextual Advertising Using Hive and Amazon EMR http://developer.amazonwebservices.com/connect/entry!default.jspa?categoryID=269&externalID=2855.

Hive provides tools to summarize data, query, and analyze large data sets stored in Hadoop files. It provides a simple query language called Hive QL which is based on SQL. Hive allows traditional map/reduce programmers to plug in custom mappers and reducers for more sophisticated analysis.

Create a Hive Script

For your convenience, this sample script is stored on Amazon S3 at s3://elasticmapreduce/samples/hive-ads. You can also save this script to your own Amazon S3 location and change the Hive command appropriately.

Sample data for this job flow is available at s3://elasticmapreduce/samples/hive-ads/libs/twitter-impressions.q.

The commented script follows:

  • A custom SerDe is used to read the advertisement impressions data.

    ADD JAR ${SAMPLE}/libs/jsonserde.jar ;
  • An external table is created to instruct Hive on how to organize the advertisement impressions data.

    CREATE EXTERNAL TABLE impressions (
        requestBeginTime string, adId string, impressionId string, referrer string, 
        userAgent string, userCookie string, ip string
      )
      PARTITIONED BY (dt string)
      ROW FORMAT 
        serde 'com.amazon.elasticmapreduce.JsonSerde'
        with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, 
        referrer, userAgent, userCookie, ip' )
      LOCATION '${SAMPLE}/tables/impressions' ;
  • A single partition table is created and partitioned based on time.

    ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05');
  • Temporary tables are created in the job flow's local HDFS partition to store intermediate advertisement impressions and click data.

    CREATE TABLE tmp_impressions (
        requestBeginTime string, adId string, impressionId string, referrer string, 
        userAgent string, userCookie string, ip string
      )
      STORED AS SEQUENCEFILE ;
  • Data from the advertisement impressions table for a specified time period is inserted into the partitioned table.

    INSERT OVERWRITE TABLE tmp_impressions 
        SELECT 
          from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int)) requestBeginTime, 
          i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip
        FROM 
          impressions i
        WHERE 
          i.dt ≥ '{DAY}-${HOUR}-00' and i.dt < '{NEXT_DAY}-${NEXT_HOUR}-00'
    ;
  • Specific impression data is stored in an output table on Amazon S3.

    CREATE EXTERNAL TABLE output_impressions (
        requestBeginTime string, adId string, impressionId string, referrer string, 
        userAgent string, userCookie string, ip string
        )
        PARTITIONED BY (day string, hour string)
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
        STORED AS TEXTFILE
        LOCATION '${OUTPUT}/impressions'
      ;
  • The output table is populated with all advertisement impressions referred by twitter.com through a Mozilla browser during the specified time period.

    INSERT OVERWRITE TABLE output_impressions PARTITION (day='${DAY}', hour='${HOUR}')
      SELECT 
        i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip
      FROM 
        tmp_impressions i
      WHERE
        i.referrer = 'twitter.com' and i.userAgent like '%Mozilla%'
        ;    

Launch a Job Flow Using Hive

To run the job flow with Hive, create an Amazon Elastic MapReduce (Amazon EMR) job flow using the CLI, log in to the job flow's master node, and then launch the Hive script.

To create a job flow using Hive

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --create --alive --name "Hive Job Flow" --hive-interactive 
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --alive --name "Hive Job Flow" --hive-interactive 

The output will look similar to:

Created job flow JobFlowID	

This job flow takes a few minutes to transition from the STARTING to the WAITING state. You can monitor the job flow from the CLI as described in the Retrieve Information About a Specific Job Flow step or from the Amazon EMR console.

To list all active job flows using the CLI

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      $ ./elastic-mapreduce --list --active
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --list --active

The list of active job flows initially looks similar to the following:

JobFlowID      STARTING
     Hive Job Flow
   PENDING        Setup Hive 

When the job flow is ready to accept the Hive script, it looks similar to:

JobFlowID      WAITING        ec2-184-72-128-177.compute-1.amazonaws.com
     Hive Job Flow
   COMPLETED      Setup Hive

The DNS to the master node and the root login are required to connect to the master node. The DNS can be found in the output of an active job flow. In this sample, the DNS is ec2-184-72-128-177.compute-1.amazonaws.com. The root login or username is hadoop.

When the job flow is in the WAITING state, connect to the master node using SSH.

To connect to the master node

  1. Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

      & ./elastic-mapreduce --ssh --jobflow JobFlowID

      Use the job flow ID of the sample job flow.

    • Windows users:

      1. Start PuTTY. (For more information about how to install PuTTY and use it to connect to an EC2 instance, such as the master node, go to Appendix D: Connecting to a Linux/UNIX Instance from Windows using PuTTY in the Amazon Elastic Compute Cloud User Guide.)

      2. Select Session in the Category list. Enter hadoop@DNS in the Host Name field. The input looks similar to hadoop@ec2-184-72-128-177.compute-1.amazonaws.com.

      3. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling SSH authentication pane appears.

      4. Click Browse for Private key file for authentication, and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

      5. Click OK.

      6. Click Open to connect to your master node.

      7. A PuTTY Security Alert pops up. Click Yes.

    When you successfully connect to the master node, the output looks similar to the following:

    Using username "hadoop".
    Authenticating with public key "imported-openssh-key"
    Linux domU-12-31-39-01-5C-F8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
    --------------------------------------------------------------------------------
    
    Welcome to Amazon EMR running Hadoop and Debian/Lenny.
    
    Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
    /mnt/var/log/hadoop/steps for diagnosing step failures.
    
    The Hadoop UI can be accessed via the following commands:
    
      JobTracker    lynx http://localhost:9100/
      NameNode      lynx http://localhost:9101/
    
    --------------------------------------------------------------------------------
  2. Run the sample Hive script with the following command.

    hadoop@domU-12-31-39-07-D2-14:~$ hive \
        -d SAMPLE=s3://elasticmapreduce/samples/hive-ads \
        -d DAY=2009-04-13 -d HOUR=08 \
        -d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 \
        -d OUTPUT=[A path to a bucket and a folder you own on Amazon S3, such as, s3://myawsbucket/folder] \
        -f s3://elasticmapreduce/samples/hive-ads/libs/twitter-impressions.q

    The Hive script is added to the job flow. The output looks similar to the following:

    10/08/20 14:57:34 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. 
    Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to 
    override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
    Hive history file=/mnt/var/lib/hive/tmp/history/hive_job_log_hadoop_201008201457_1658787617.txt
    Testing s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
    converting to local s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
    Added /mnt/var/lib/hive/downloaded_resources/s3_elasticmapreduce_samples_hive-ads_libs_jsonserde.jar 
    to class path
    Found class for com.amazon.elasticmapreduce.JsonSerde
    OK
    Time taken: 11.531 seconds
    
    ...
    
    
    Starting Job = job_201008201445_0003, Tracking URL = http://domU-12-31-39-01-5C-F8.compute-1.internal:
    9100/jobdetails.jsp?jobid=job_201008201445_0003
    Kill Command = /home/hadoop/.versions/0.20/bin/../bin/hadoop job  -Dmapred.job.tracker=
    domU-12-31-39-01-5C-F8.compute-1.internal:9001 -kill job_201008201445_0003
    2010-08-20 14:59:07,714 Stage-2 map = 0%,  reduce = 0%
    2010-08-20 14:59:22,254 Stage-2 map = 100%,  reduce = 0%
    2010-08-20 14:59:31,450 Stage-2 map = 100%,  reduce = 33%
    2010-08-20 14:59:37,608 Stage-2 map = 100%,  reduce = 100%
    Ended Job = job_201008201445_0003
    Loading data to table output_impressions partition {day=2009-04-13, hour=08}
    30 Rows loaded to output_impressions
    OK
    Time taken: 64.647 seconds

    Your job flow step is completed.

To quit ssh or PuTTY

  • Type exit and press ENTER.

To terminate a job flow

  • Enter the following commands from the command-line prompt:

    • Linux and UNIX users:

       $ ./elastic-mapreduce --terminate JobFlowID
    • Windows users:

      C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --terminate JobFlowID

To view the results of your job flow

  1. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.

  2. Navigate to the Amazon S3 bucket and path you referenced in your Hive script as part of -d OUTPUT. The results for this sample will be located in a text file in the folder \impressions\day=2009-04-13\hour=08.

Your job flow results are stored in a text file.

You can find additional Amazon EMR log files in the Amazon S3 bucket you specified in your credentials.json file.

For information about the contents of these logs, see the Amazon Elastic MapReduce Developer Guide.

Now that you have completed a job flow using Hive, you can clean up your resources so you do not incur any unnecessary charges. Click the following button to learn how.

Restore Environment