The following sections describe the features available in Amazon Elastic MapReduce (Amazon EMR).
A bootstrap action lets you run a script on Amazon EMR cluster nodes before Hadoop starts. You store bootstrap action scripts in Amazon S3 and pass their locations to Amazon EMR when you create a new job flow. Amazon EMR then downloads each script from Amazon S3 and runs it on every node before the job flow executes.
By using bootstrap actions, you can install software on the node, modify the default Hadoop site configuration, or change the way Java parameters are used to run Hadoop daemons.
Both predefined and custom bootstrap actions are available. The predefined bootstrap actions include Configure Hadoop, Configure Daemons, and Run-if. You can write custom bootstrap actions in any language already installed on the job flow instance, such as Ruby, Python, Perl, or bash.
You can specify a bootstrap action from the command line interface, from the Amazon EMR console, or from the Amazon EMR API when starting a job flow. For more information, see Bootstrap Actions.
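As a minimal sketch of the API route, the snippet below builds one bootstrap action entry in the request shape that the AWS SDK for Python (boto3) uses for the `BootstrapActions` parameter of `run_job_flow`. The helper name `bootstrap_action`, the bucket, and the script path are hypothetical and chosen only for illustration. The check for an `s3://` prefix reflects this section's description of scripts stored in Amazon S3 and is not a documented service rule.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Build one bootstrap action entry (hypothetical helper).

    The returned dict follows the shape of an item in boto3's
    run_job_flow(BootstrapActions=[...]) parameter.
    """
    # Assumption for this sketch: scripts live in Amazon S3, as the text describes.
    if not script_s3_path.startswith("s3://"):
        raise ValueError("expected an s3:// location for the bootstrap script")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args),
        },
    }


# Hypothetical bucket and script name, for illustration only.
actions = [
    bootstrap_action(
        "Install extra packages",
        "s3://example-bucket/bootstrap/install-packages.sh",
        ["--verbose"],
    )
]
```

The resulting list could then be supplied as `BootstrapActions=actions` when starting a job flow. The snippet itself makes no network calls.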
Amazon EMR supports the Hadoop Distributed File System (HDFS). HDFS is fault-tolerant, scalable, and easily configurable. The default configuration is already optimized for most job flows, and generally needs to be changed only for very large clusters. You make configuration changes using bootstrap actions. For more information, see Hadoop Configuration.
Amazon EMR provides detailed logs you can use to debug both Hadoop and Amazon EMR. For more information on how to create logs, view logs, and use them to troubleshoot a job flow, see Troubleshooting.
Amazon Elastic MapReduce (Amazon EMR) supports Apache Hive. Hive is an integrated data warehouse infrastructure built on top of Hadoop. It provides tools that simplify data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. Hive provides a simple query language called Hive QL, which is based on SQL.
For more information on the supported versions of Hive, see Hive Configuration.
You can resize a running job flow by changing the number of nodes in the cluster. Core nodes contain the Hadoop Distributed File System (HDFS). After a job flow is running, you can increase the number of core nodes, but you cannot decrease it. Task nodes also run your Hadoop tasks, but do not contain HDFS. After a job flow is running, you can both increase and decrease the number of task nodes. For more information, see Resizing Running Job Flows.
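The resizing rules in this section can be restated as a small check. This is only an illustrative sketch: `validate_resize` is a hypothetical helper, not part of any Amazon EMR API, and it assumes that a task group may shrink to any non-negative count, which the text does not state explicitly.

```python
def validate_resize(group_type, current_count, requested_count):
    """Return requested_count if the change follows the rules described
    in this section, otherwise raise ValueError (hypothetical helper)."""
    if requested_count < 0:
        raise ValueError("node count cannot be negative")
    if group_type == "core":
        # Core nodes contain HDFS, so this section only allows growing them.
        if requested_count < current_count:
            raise ValueError("core nodes can be increased but not decreased")
    elif group_type == "task":
        # Task nodes do not contain HDFS and may grow or shrink.
        pass
    else:
        raise ValueError(f"unknown group type: {group_type!r}")
    return requested_count


new_core = validate_resize("core", 4, 6)   # growing core nodes is allowed
new_task = validate_resize("task", 4, 1)   # shrinking task nodes is allowed
```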
Amazon EMR provides an authentication mechanism to ensure that data stored in Amazon S3 is secured against unauthorized access. By default, only the AWS account owner can access the data uploaded to Amazon S3. Other users can access the data only if you explicitly grant them permission.
You can send data to Amazon S3 using the secure HTTPS protocol. Amazon EMR always uses a secure channel to send data between Amazon S3 and Amazon EC2. For added security, you can encrypt your data before uploading it to Amazon S3. For more information on AWS security, go to the AWS Security Center.
Amazon EMR supports job flows based on streaming, Hive, Pig, Custom JAR, and Cascading. Streaming enables you to write application logic in any language and to process large amounts of data using the Hadoop framework. Hive and Pig offer nonprogramming options with their SQL-like scripting languages. Custom JAR files enable you to write Java-based MapReduce functions. Cascading is an API with built-in MapReduce support that lets you create complex distributed processes. For more information, see Using Amazon EMR.
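To make the streaming model concrete, here is a minimal word-count sketch in Python. Hadoop streaming passes records to the mapper and reducer as text lines, with keys and values conventionally separated by a tab, and delivers the mapper output to the reducer sorted by key. In a real streaming job the mapper and reducer would be separate scripts reading standard input; this sketch simulates both stages in one process, and the function names and sample input are hypothetical.

```python
from itertools import groupby


def map_words(lines):
    """Mapper stage: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_counts(sorted_lines):
    """Reducer stage: sum the counts of consecutive lines sharing a key.

    Relies on the input being sorted by key, as the streaming
    framework guarantees between the map and reduce stages.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Simulate the shuffle/sort step locally with sorted().
mapped = sorted(map_words(["Hadoop runs jobs", "EMR runs Hadoop"]))
result = list(reduce_counts(mapped))
# result == ["emr\t1", "hadoop\t2", "jobs\t1", "runs\t2"]
```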
Amazon EMR supports job flows with multiple, sequential steps. You can combine individual steps to create more sophisticated job flows, and you can add steps incrementally while a job flow runs, which helps with debugging. For more information, see Add Steps to a Job Flow.
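As a rough illustration of sequential steps, the sketch below builds two step definitions in the request shape that boto3 uses for the `Steps` parameter of `run_job_flow` and `add_job_flow_steps`. The helper `jar_step`, the bucket, the JAR names, and the arguments are hypothetical. The commented-out call shows where the list would be sent, with a placeholder job flow ID.

```python
def jar_step(name, jar_s3_path, args=(), action_on_failure="CONTINUE"):
    """Build one step entry (hypothetical helper) in the shape of an item
    in boto3's add_job_flow_steps(Steps=[...]) parameter."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": jar_s3_path,
            "Args": list(args),
        },
    }


# Steps run in the order listed. Paths below are illustrative only.
steps = [
    jar_step(
        "Parse logs",
        "s3://example-bucket/jars/parse-logs.jar",
        ["s3://example-bucket/input/", "s3://example-bucket/parsed/"],
    ),
    jar_step(
        "Aggregate",
        "s3://example-bucket/jars/aggregate.jar",
        ["s3://example-bucket/parsed/", "s3://example-bucket/output/"],
        action_on_failure="CANCEL_AND_WAIT",
    ),
]

# To add these to a running job flow (requires boto3 and AWS credentials):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-EXAMPLE", Steps=steps)
```

Adding one step at a time to a running job flow in this way lets you inspect the output of each stage before submitting the next.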