| Did this page help you? Yes No Tell us about it... |

This sample Hive script combines advertisement impression and click log data to evaluate the success of targeted online advertising. The script combines the two sets of log data, places the information into a Hive cluster, and outputs the results to a specified directory. The following script processes all impressions that occurred between 2009-04-13 8:00 and 2009-04-13 9:00 and were referred by twitter.com from a Mozilla browser.
A detailed description of this business problem can be found in the tutorial, Contextual Advertising Using Hive and Amazon EMR http://developer.amazonwebservices.com/connect/entry!default.jspa?categoryID=269&externalID=2855.
Hive provides tools to summarize data, query, and analyze large data sets stored in Hadoop files. It provides a simple query language called Hive QL which is based on SQL. Hive allows traditional map/reduce programmers to plug in custom mappers and reducers for more sophisticated analysis.
For your convenience, this sample script is stored on Amazon S3 at
s3://elasticmapreduce/samples/hive-ads. You can also save this script
to your own Amazon S3 location and change the Hive command appropriately.
Sample data for this job flow is available at
s3://elasticmapreduce/samples/hive-ads/libs/twitter-impressions.q.
The commented script follows:
A custom SerDe is used to read the advertisement impressions data.
ADD JAR ${SAMPLE}/libs/jsonserde.jar ;An external table is created to instruct Hive on how to organize the advertisement impressions data.
CREATE EXTERNAL TABLE impressions (
requestBeginTime string, adId string, impressionId string, referrer string,
userAgent string, userCookie string, ip string
)
PARTITIONED BY (dt string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId,
referrer, userAgent, userCookie, ip' )
LOCATION '${SAMPLE}/tables/impressions' ;A single partition table is created and partitioned based on time.
ALTER TABLE impressions ADD PARTITION (dt='2009-04-13-08-05');
Temporary tables are created in the job flow's local HDFS partition to store intermediate advertisement impressions and click data.
CREATE TABLE tmp_impressions (
requestBeginTime string, adId string, impressionId string, referrer string,
userAgent string, userCookie string, ip string
)
STORED AS SEQUENCEFILE ;Data from the advertisement impressions table for a specified time period is inserted into the partitioned table.
INSERT OVERWRITE TABLE tmp_impressions
SELECT
from_unixtime(cast((cast(i.requestBeginTime as bigint) / 1000) as int)) requestBeginTime,
i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip
FROM
impressions i
WHERE
i.dt ≥ '{DAY}-${HOUR}-00' and i.dt < '{NEXT_DAY}-${NEXT_HOUR}-00'
;Specific impression data is stored in an output table on Amazon S3.
CREATE EXTERNAL TABLE output_impressions (
requestBeginTime string, adId string, impressionId string, referrer string,
userAgent string, userCookie string, ip string
)
PARTITIONED BY (day string, hour string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '${OUTPUT}/impressions'
;The output table is populated with all advertisement impressions referred by twitter.com through a Mozilla browser during the specified time period.
INSERT OVERWRITE TABLE output_impressions PARTITION (day='${DAY}', hour='${HOUR}')
SELECT
i.requestBeginTime, i.adId, i.impressionId, i.referrer, i.userAgent, i.userCookie, i.ip
FROM
tmp_impressions i
WHERE
i.referrer = 'twitter.com' and i.userAgent like '%Mozilla%'
; To run the job flow with Hive, create an Amazon Elastic MapReduce (Amazon EMR) job flow using the CLI, log in to the job flow's master node, and then launch the Hive script.
To create a job flow using Hive
Enter the following commands from the command-line prompt:
Linux and UNIX users:
$ ./elastic-mapreduce --create --alive --name "Hive Job Flow" --hive-interactive
Windows users:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --alive --name "Hive Job Flow" --hive-interactive
The output will look similar to:
Created job flow JobFlowID This job flow takes a few minutes to transition from the
STARTING to the WAITING state. You can
monitor the job flow from the CLI as described in the Retrieve Information About a Specific Job Flow step or from the Amazon EMR console.
To list all active job flows using the CLI
Enter the following commands from the command-line prompt:
Linux and UNIX users:
$ ./elastic-mapreduce --list --active
Windows users:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --list --active
The list of active job flows initially looks similar to the following:
JobFlowID STARTING
Hive Job Flow
PENDING Setup Hive When the job flow is ready to accept the Hive script, it looks similar to:
JobFlowID WAITING ec2-184-72-128-177.compute-1.amazonaws.com
Hive Job Flow
COMPLETED Setup HiveThe DNS to the master node and the root login are required to connect to the master
node. The DNS can be found in the output of an active job flow. In this sample, the DNS
is ec2-184-72-128-177.compute-1.amazonaws.com. The root login or username
is hadoop.
When the job flow is in the WAITING state, connect to the
master node using SSH.
To connect to the master node
Enter the following commands from the command-line prompt:
Linux and UNIX users:
& ./elastic-mapreduce --ssh --jobflow JobFlowID Use the job flow ID of the sample job flow.
Windows users:
Start PuTTY. (For more information about how to install PuTTY and use it to connect to an EC2 instance, such as the master node, go to Appendix D: Connecting to a Linux/UNIX Instance from Windows using PuTTY in the Amazon Elastic Compute Cloud User Guide.)
Select Session in the
Category list. Enter
hadoop@
in the Host Name field. The input looks
similar to
DNShadoop@ec2-184-72-128-177.compute-1.amazonaws.com.
In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling SSH authentication pane appears.
Click Browse for Private key
file for authentication, and select the private
key file you generated earlier. If you are following this guide,
the file name is mykeypair.ppk.
Click OK.
Click Open to connect to your master node.
A PuTTY Security Alert pops up. Click Yes.
When you successfully connect to the master node, the output looks similar to the following:
Using username "hadoop". Authenticating with public key "imported-openssh-key" Linux domU-12-31-39-01-5C-F8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 -------------------------------------------------------------------------------- Welcome to Amazon EMR running Hadoop and Debian/Lenny. Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check /mnt/var/log/hadoop/steps for diagnosing step failures. The Hadoop UI can be accessed via the following commands: JobTracker lynx http://localhost:9100/ NameNode lynx http://localhost:9101/ --------------------------------------------------------------------------------
Run the sample Hive script with the following command.
hadoop@domU-12-31-39-07-D2-14:~$ hive \
-d SAMPLE=s3://elasticmapreduce/samples/hive-ads \
-d DAY=2009-04-13 -d HOUR=08 \
-d NEXT_DAY=2009-04-13 -d NEXT_HOUR=09 \
-d OUTPUT=[A path to a bucket and a folder you own on Amazon S3, such as, s3://myawsbucket/folder] \
-f s3://elasticmapreduce/samples/hive-ads/libs/twitter-impressions.qThe Hive script is added to the job flow. The output looks similar to the following:
10/08/20 14:57:34 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath.
Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to
override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
Hive history file=/mnt/var/lib/hive/tmp/history/hive_job_log_hadoop_201008201457_1658787617.txt
Testing s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
converting to local s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar
Added /mnt/var/lib/hive/downloaded_resources/s3_elasticmapreduce_samples_hive-ads_libs_jsonserde.jar
to class path
Found class for com.amazon.elasticmapreduce.JsonSerde
OK
Time taken: 11.531 seconds
...
Starting Job = job_201008201445_0003, Tracking URL = http://domU-12-31-39-01-5C-F8.compute-1.internal:
9100/jobdetails.jsp?jobid=job_201008201445_0003
Kill Command = /home/hadoop/.versions/0.20/bin/../bin/hadoop job -Dmapred.job.tracker=
domU-12-31-39-01-5C-F8.compute-1.internal:9001 -kill job_201008201445_0003
2010-08-20 14:59:07,714 Stage-2 map = 0%, reduce = 0%
2010-08-20 14:59:22,254 Stage-2 map = 100%, reduce = 0%
2010-08-20 14:59:31,450 Stage-2 map = 100%, reduce = 33%
2010-08-20 14:59:37,608 Stage-2 map = 100%, reduce = 100%
Ended Job = job_201008201445_0003
Loading data to table output_impressions partition {day=2009-04-13, hour=08}
30 Rows loaded to output_impressions
OK
Time taken: 64.647 secondsYour job flow step is completed.
To quit ssh or PuTTY
Type exit and press ENTER.
To terminate a job flow
Enter the following commands from the command-line prompt:
Linux and UNIX users:
$ ./elastic-mapreduce --terminate JobFlowIDWindows users:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --terminate JobFlowIDTo view the results of your job flow
Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
Navigate to the Amazon S3 bucket and path you referenced in your Hive script
as part of -d OUTPUT. The results for this sample will be
located in a text file in the folder
\impressions\day=2009-04-13\hour=08.
Your job flow results are stored in a text file.
You can find additional Amazon EMR log files in the Amazon S3 bucket you
specified in your credentials.json file.
For information about the contents of these logs, see the Amazon Elastic MapReduce Developer Guide.
Now that you have completed a job flow using Hive, you can clean up your resources so you do not incur any unnecessary charges. Click the following button to learn how.
