You can use Amazon Elastic MapReduce (Amazon EMR) as a build environment to compile programs for use in your job flow. Programs that you use with Amazon EMR must be compiled on a system running the same version of Debian used by Amazon EMR. For a 32-bit version (m1.small and m1.medium), compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information on Amazon EC2 instance versions, go to Amazon EC2 Instances. Supported programming languages include C++, Cython, and C#.
The following table outlines the steps involved to build and test your application using Amazon EMR.
Process for Building a Module
| 1 | Create an interactive job flow. |
| 2 | Identify the job flow ID and Public DNS name of the master node. |
| 3 | SSH as the Hadoop user to the master node of your Hadoop cluster. |
| 4 | Copy source files to the master node. |
| 5 | Build binaries with any necessary optimizations. |
| 6 | Copy binaries from the master node to Amazon S3. |
| 7 | Close the SSH connection. |
| 8 | Terminate the job flow. |
The details for each of these steps are covered in the sections that follow.
To create an interactive job flow
Create an interactive job flow with a single-node Hadoop cluster using the desired instance type:
| If you are using... | Enter the following... |
|---|---|
| Linux or UNIX |
$ ./elastic-mapreduce --create --alive --name "Interactive Job Flow" \
    --num-instances=1 --master-instance-type=m1.large --hive-interactive |
| Microsoft Windows |
C:\> ruby elastic-mapreduce --create --alive --name "Interactive Job Flow" --num-instances=1 --master-instance-type=m1.large --hive-interactive
|
The output looks similar to:
Created jobflow JobFlowID
To identify the job flow ID and Public DNS name of the master node
Identify your job flow:
| If you are using... | Enter the following... |
|---|---|
| Linux or UNIX |
$ ./elastic-mapreduce --list |
| Microsoft Windows |
C:\> ruby elastic-mapreduce --list
|
The output looks similar to the following.
j-SLRI9SCLK7UC     STARTING     ec2-75-101-168-82.compute-1.amazonaws.com     Interactive Job Flow
   PENDING     Hive Job
The response includes the job flow ID and the Public DNS Name. You use this information to connect to the master node.
Typically you need to wait one or two minutes after launching the job flow before the Public DNS Name is assigned.
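The lookup described above can be scripted. The following is a minimal sketch, not part of the Amazon EMR CLI: it assumes the `--list` output is whitespace-separated with the columns in the order shown in the sample (job flow ID, state, Public DNS Name, job flow name). The helper name `parse_jobflow` is hypothetical, and the check for a name ending in `.amazonaws.com` is a heuristic for the period before the Public DNS Name is assigned.

```shell
#!/usr/bin/env bash
# Sketch: extract the job flow ID, state, and master Public DNS Name from
# `elastic-mapreduce --list` output.
# Assumption: columns appear in the order shown in the sample output.

# Sample line copied from the output above. In practice, pipe in the real
# output of `./elastic-mapreduce --list` instead.
sample='j-SLRI9SCLK7UC STARTING ec2-75-101-168-82.compute-1.amazonaws.com Interactive Job Flow'

# parse_jobflow is a hypothetical helper name, not a CLI command.
# It reads list output on stdin and reports the first job flow line it finds.
parse_jobflow() {
  awk '$1 ~ /^j-/ {
    # Heuristic: the third column is a DNS name only once one is assigned.
    dns = ($3 ~ /\.amazonaws\.com$/) ? $3 : "(not yet assigned)"
    print "id=" $1
    print "state=" $2
    print "dns=" dns
    exit
  }'
}

printf '%s\n' "$sample" | parse_jobflow
```

Lines that do not start with a job flow ID, such as the indented step line in the sample output, are skipped.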
To SSH as the Hadoop user to the master node
Use the credentials you created for your Amazon EC2 key pair to log in to the master node. Instructions for creating credentials are located at Create a Credentials File.
| If you are using... | Enter the following... |
|---|---|
| Linux or UNIX |
$ ./elastic-mapreduce --ssh --jobflow
|
| Microsoft Windows |
|
When you successfully connect to the master node, the output looks similar to the following:
Using username "hadoop".
Authenticating with public key "imported-openssh-key"
Linux domU-12-31-39-01-5C-F8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
--------------------------------------------------------------------------------
Welcome to Amazon EMR running Hadoop and Debian/Lenny.
Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop.
Check /mnt/var/log/hadoop/steps for diagnosing step failures.
The Hadoop UI can be accessed via the following commands:
  JobTracker    lynx http://localhost:9100/
  NameNode      lynx http://localhost:9101/
--------------------------------------------------------------------------------
To copy source files to the master node
Copy your source files to the master node:
Put your source files on Amazon S3. To learn how to create buckets and move files with Amazon S3, go to the Amazon Simple Storage Service Getting Started Guide.
Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:
$ mkdir SourceDestination
You now have a destination folder for your source files.
Copy your source files from Amazon S3 to the Hadoop cluster by entering a command similar to the following:
$ hadoop fs -get s3://myawsbucket/SourceFiles SourceDestination
Your source files are now located in your destination folder on the master node of your Hadoop cluster.
Build binaries with any necessary optimizations
How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can use Hadoop system specification commands to obtain cluster information to determine how to install your build environment.
To identify system specifications
Use the following commands to verify the architecture you are using to build your binaries:
To view the version of Debian, enter the following command:
master$ cat /etc/issue
The output looks similar to the following.
Debian GNU/Linux 5.0
To view the host name and processor architecture (32-bit or 64-bit), enter the following command:
master$ uname -a
The output looks similar to the following.
Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux
To view the processor speed, enter the following command:
master$ cat /proc/cpuinfo
The output looks similar to the following.
processor   : 0
vendor_id   : GenuineIntel
model name  : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
flags       : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
...
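To compare a build machine against the master node, the architecture check can be reduced to a single value. The following is a minimal sketch that assumes `uname -m` is available and reports strings such as `i686` or `x86_64`, as in the sample outputs above. The helper name `arch_bits` is hypothetical, and the list of architecture strings is not exhaustive.

```shell
#!/usr/bin/env bash
# Sketch: classify a machine architecture string (as printed by `uname -m`)
# as 32-bit or 64-bit.

# arch_bits is a hypothetical helper name. It covers only common x86 values
# and prints "unknown" for anything else.
arch_bits() {
  case "$1" in
    x86_64|amd64)        echo 64 ;;
    i386|i486|i586|i686) echo 32 ;;
    *)                   echo unknown ;;
  esac
}

machine="$(uname -m)"
echo "machine=$machine bits=$(arch_bits "$machine")"
```

Run it on both the build machine and the master node. If the `bits` values differ, build with the matching cross-compilation options or build on the master node itself.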
Once your binaries are built, you can copy the files to Amazon S3.
To copy binaries from the master node to Amazon S3
Copy the binaries to Amazon S3 by entering the following command:
$ hadoop fs -put BinaryFiles s3://myawsbucket/BinaryDestination
Your binaries are now stored in your Amazon S3 bucket.
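The copy, build, and upload commands can be collected in one script to run on the master node. The following is a rough sketch under several assumptions: the bucket and folder names are the placeholders used above, the `make` step is a hypothetical stand-in for your own build command, and the build is assumed to write its output to `BinaryFiles`. The `run` helper and the `DRY_RUN` variable are also hypothetical. By default the script only prints the commands, so you can review them before executing anything.

```shell
#!/usr/bin/env bash
# Sketch: fetch sources from Amazon S3, build, and upload the binaries.
# DRY_RUN defaults to 1 (print only). Set DRY_RUN=0 on the master node
# to execute the commands.
set -e
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical helper: it prints a command and executes it
# only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

build_and_upload() {
  run mkdir -p SourceDestination
  run hadoop fs -get s3://myawsbucket/SourceFiles SourceDestination
  # Assumption: a Makefile-based build that writes to BinaryFiles.
  # Replace this line with the command for your own build tools.
  run make -C SourceDestination
  run hadoop fs -put BinaryFiles s3://myawsbucket/BinaryDestination
}

build_and_upload
```

Because of `set -e`, a failed copy or build stops the script before the upload step when it runs with `DRY_RUN=0`.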
To close the SSH connection
Enter the following command from the Hadoop command-line prompt:
$ exit
You are no longer connected to your cluster via SSH.
To terminate the job flow
| If you are using... | Enter the following... |
|---|---|
| Linux or UNIX |
$ ./elastic-mapreduce --terminate
|
| Microsoft Windows |
C:\> ruby elastic-mapreduce --terminate
|
Your job flow is terminated.
| Important |
|---|
| Terminating a job flow deletes all files and executables saved to the cluster. Remember to save all required files before terminating a job flow. |