Amazon Elastic MapReduce
Developer Guide (API Version 2009-11-30)

Building Binaries Using Amazon EMR

You can use Amazon Elastic MapReduce (Amazon EMR) as a build environment to compile programs for use in your job flow. Programs that you use with Amazon EMR must be compiled on a system running the same version of Debian used by Amazon EMR. For a 32-bit version (m1.small and m1.medium), compile on a 32-bit machine or with 32-bit cross-compilation options turned on. For a 64-bit version, compile on a 64-bit machine or with 64-bit cross-compilation options turned on. For more information on Amazon EC2 instance versions, go to Amazon EC2 Instances. Supported programming languages include C++, Cython, and C#.

The following table outlines the steps involved in building and testing your application using Amazon EMR.

Process for Building a Module

1 Create an interactive job flow.
2 Identify the job flow ID and Public DNS name of the master node.
3 SSH as the Hadoop user to the master node of your Hadoop cluster.
4 Copy source files to the master node.
5 Build binaries with any necessary optimizations.
6 Copy binaries from the master node to Amazon S3.
7 Close the SSH connection.
8 Terminate the job flow.

The details for each of these steps are covered in the sections that follow.

To create an interactive job flow

  • Create an interactive job flow with a single node Hadoop cluster using the desired instance type:

    If you are using... Enter the following...
    Linux or UNIX
    $ ./elastic-mapreduce --create --alive --name "Interactive Job Flow" \
    --num-instances=1 --master-instance-type=m1.large --hive-interactive 
    Microsoft Windows
    C:\ruby elastic-mapreduce --create --alive --name "Interactive Job Flow" --num-instances=1 --master-instance-type=m1.large --hive-interactive

    The output looks similar to:

    Created jobflow JobFlowID	
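If you script the remaining steps, it can help to capture the job flow ID from the creation output. The following is a minimal sketch, assuming the output line has the form shown above ("Created jobflow" followed by the ID). The function name extract_jobflow_id is hypothetical and is not part of the Amazon EMR command line interface.

```shell
# Hypothetical helper: read the output of the --create command on stdin
# and print the job flow ID (the third word of the "Created jobflow" line).
extract_jobflow_id() {
  awk '/^Created jobflow/ { print $3; exit }'
}

# Example with sample text. In practice you would pipe the real output of
# ./elastic-mapreduce --create ... into the function.
echo "Created jobflow j-SLRI9SCLK7UC" | extract_jobflow_id
```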

To identify the job flow ID and Public DNS name of the master node

  • Identify your job flow:

    If you are using... Enter the following...
    Linux or UNIX
    $ ./elastic-mapreduce --list
    Microsoft Windows
    c:\ruby elastic-mapreduce --list

    The output looks similar to the following.

    j-SLRI9SCLK7UC          STARTING    ec2-75-101-168-82.compute-1.amazonaws.com
      Interactive Job Flow  PENDING     Hive Job

    The response includes the job flow ID and the Public DNS Name. You use this information to connect to the master node.

    Typically you need to wait one or two minutes after launching the job flow before the Public DNS Name is assigned.
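Because the Public DNS Name may not be assigned immediately, a script can check the listing before it tries to connect. The following is a rough sketch, assuming the job flow ID is the first column and the Public DNS Name is the third, as in the sample output above. The function name master_dns is hypothetical.

```shell
# Hypothetical helper: read --list output on stdin and print the Public DNS
# Name recorded for the given job flow ID, or nothing if none is listed yet.
# Assumes the ID is the first column and the DNS name is the third.
master_dns() {
  awk -v id="$1" '$1 == id && $3 ~ /\.amazonaws\.com$/ { print $3; exit }'
}

# Sample text copied from the output shown above.
sample='j-SLRI9SCLK7UC          STARTING    ec2-75-101-168-82.compute-1.amazonaws.com
  Interactive Job Flow  PENDING     Hive Job'

printf '%s\n' "$sample" | master_dns j-SLRI9SCLK7UC
```

To wait for the name, you could rerun ./elastic-mapreduce --list every few seconds and pass its output to master_dns until it prints a value.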

To SSH as the Hadoop user to the master node

  • Use the credentials you created for your Amazon EC2 key pair to log in to the master node:

    Instructions for creating credentials are located at Create a Credentials File.

    If you are using... Enter the following...
      Linux or UNIX
      $ ./elastic-mapreduce --ssh --jobflow JobFlowID
      Microsoft Windows
      1. Start PuTTY.

      2. Select Session in the Category list. Enter hadoop@DNS in the Host Name field. In this example, the input looks similar to hadoop@ec2-75-101-168-82.compute-1.amazonaws.com.

      3. In the Category list, expand Connection, expand SSH, and then select Auth. The Options controlling SSH authentication pane appears.

      4. Click Browse for Private key file for authentication, and select the private key file you generated earlier. If you are following this guide, the file name is mykeypair.ppk.

      5. Click OK.

      6. Click Open to connect to your master node.

      7. A PuTTY Security Alert pops up. Click Yes.

    When you successfully connect to the master node, the output looks similar to the following:

    Using username "hadoop".
    Authenticating with public key "imported-openssh-key"
    Linux domU-12-31-39-01-5C-F8 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686
    --------------------------------------------------------------------------------
    
    Welcome to Amazon EMR running Hadoop and Debian/Lenny.
    
    Hadoop is installed in /home/hadoop. Log files are in /mnt/var/log/hadoop. Check
    /mnt/var/log/hadoop/steps for diagnosing step failures.
    
    The Hadoop UI can be accessed via the following commands:
    
      JobTracker    lynx http://localhost:9100/
      NameNode      lynx http://localhost:9101/
    
    --------------------------------------------------------------------------------

To copy source files to the master node

  • Copy your source files to the master node:

    1. Put your source files on Amazon S3. To learn how to create buckets and move files with Amazon S3, go to the Amazon Simple Storage Service Getting Started Guide.

    2. Create a folder on your Hadoop cluster for your source files by entering a command similar to the following:

      $ mkdir SourceDestination

      You now have a destination folder for your source files.

    3. Copy your source files from Amazon S3 to the Hadoop cluster by entering a command similar to the following:

      $ hadoop fs -get s3://myawsbucket/SourceFiles SourceDestination

      Your source files are now located in your destination folder on the master node of your Hadoop cluster.

Build binaries with any necessary optimizations

How you build your binaries depends on many factors. Follow the instructions for your specific build tools to set up and configure your environment. You can run system specification commands on the master node to obtain cluster information that helps you determine how to install your build environment.

To identify system specifications

  • Use the following commands to verify the architecture you are using to build your binaries:

    1. To view the version of Debian, enter the following command:

      master$ cat /etc/issue

      The output looks similar to the following.

      Debian GNU/Linux 5.0

    2. To view the host name, kernel version, and processor architecture (32-bit or 64-bit), enter the following command:

      master$ uname -a

      The output looks similar to the following.

      Linux domU-12-31-39-17-29-39.compute-1.internal 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 GNU/Linux

    3. To view the processor speed, enter the following command:

      master$ cat /proc/cpuinfo

      The output looks similar to the following.

      processor : 0
      vendor_id : GenuineIntel
      model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz
      flags : fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr dca lahf_lm
      ... 
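The checks above can be folded into a small script that reports whether the build host is 32-bit or 64-bit. The following is a sketch only: the helper names arch_bits and cpu_model are hypothetical, and the list of machine names covers only common x86 values such as the i686 and x86_64 strings that appear in the sample output in this section.

```shell
# Hypothetical helper: map a machine name, as printed by `uname -m`,
# to a word size. Unrecognized names yield "unknown".
arch_bits() {
  case "$1" in
    x86_64|amd64) echo 64 ;;
    i386|i486|i586|i686) echo 32 ;;
    *) echo unknown ;;
  esac
}

# Hypothetical helper: print the first "model name" value from
# /proc/cpuinfo-style text read on stdin.
cpu_model() {
  awk -F': *' '/^model name/ { print $2; exit }'
}

# Report the word size of the machine running this script.
echo "Word size: $(arch_bits "$(uname -m)")"

# Example with a line copied from the sample output above. On the master
# node you would run: cpu_model < /proc/cpuinfo
echo "model name : Intel(R) Xeon(R) CPU E5430 @ 2.66GHz" | cpu_model
```

Comparing the reported word size with the instance type you plan to run on is a quick way to confirm that the binaries you build will match the target.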

Once your binaries are built, you can copy the files to Amazon S3.

To copy binaries from the master node to Amazon S3

  • Copy the binaries to Amazon S3 by entering the following command:

    $ hadoop fs -put BinaryFiles s3://myawsbucket/BinaryDestination

    Your binaries are now stored in your Amazon S3 bucket.
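Steps 4 through 6 can be chained so that a failed copy or build stops the sequence before anything is uploaded. The following is a sketch under stated assumptions: the run and build_on_master helpers are hypothetical, the build command (make in the example call) is a placeholder for whatever your build tools require, and the bucket and folder names are the example names used in this section. By default the script only prints the commands; set DRY_RUN=0 on the master node to execute them.

```shell
# Print each command; execute it only when DRY_RUN is set to 0.
# DRY_RUN defaults to 1 so the sketch is safe to run anywhere.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Hypothetical wrapper around steps 4 through 6. Arguments:
#   $1 S3 source location    $2 local source folder
#   $3 local binaries path   $4 S3 destination
#   remaining arguments: the build command to run (an assumption;
#   substitute the command your build tools require)
build_on_master() {
  src=$1 dir=$2 bin=$3 dest=$4
  shift 4
  run mkdir -p "$dir" &&
  run hadoop fs -get "$src" "$dir" &&
  run "$@" &&
  run hadoop fs -put "$bin" "$dest"
}

build_on_master s3://myawsbucket/SourceFiles SourceDestination \
  BinaryFiles s3://myawsbucket/BinaryDestination \
  make -C SourceDestination
```

Passing the build command as trailing arguments keeps the wrapper independent of any particular build tool.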

To close the SSH connection

  • Enter the following command from the Hadoop command-line prompt:

    $ exit

    You are no longer connected to your cluster via SSH.

To terminate the job flow

  • If you are using... Enter the following...
    Linux or UNIX
    $ ./elastic-mapreduce --terminate JobFlowID
    Microsoft Windows
    C:\ruby elastic-mapreduce --terminate JobFlowID

    Your job flow is terminated.

    Important

    Terminating a job flow deletes all files and executables saved to the cluster. Remember to save all required files before terminating a job flow.