Amazon EC2 vs. Amazon EMR [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I have implemented a task in Hive. Currently it is working fine on my single-node cluster.
Now I am planning to deploy it on AWS.
I don't know anything about AWS. If I deploy it, which should I choose: Amazon EC2 or Amazon EMR?
I want to improve the performance of my task. Which one is better and more reliable for me, and how should I approach them? I have also heard that we can register our existing VM setup as-is on AWS. Is that possible?
Please advise as soon as possible.
Many thanks.

EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured on them. If you are using your cluster for running Hadoop/Hive/Pig jobs, EMR is the way to go. An EMR instance costs a little extra compared to an EC2 instance: a quick check of Amazon's prices today shows that a small EC2 instance costs $0.08/hour, while a small EMR instance costs an extra $0.015/hour on top of that.
In my opinion, it's totally worth paying that extra money to save yourself the hassle of installing and setting up Hadoop (along with Hive and Pig), and of creating, maintaining, and using an AMI. Moreover, EMR's version of Hadoop and Hive has some patches that are not available (at least, not yet) in Apache Hadoop and Hive. If you use EC2, you will probably be using Apache Hadoop and Hive (or maybe the Cloudera distributions) and won't have access to those patches (like native support for S3, or commands like ALTER TABLE my_table RECOVER PARTITIONS).
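As a minimal sketch of how such a command could be issued programmatically (this is an illustration, not EMR's official API), the Java snippet below runs it over Hive's JDBC interface. The host name, port, user, and table name are placeholder assumptions, and it assumes a HiveServer2 endpoint with the standard Hive JDBC driver on the classpath; on the older HiveServer1 the driver class and URL scheme differ.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RecoverPartitions {
    public static void main(String[] args) throws Exception {
        // Assumes a HiveServer2 endpoint; host, port, user and table are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://emr-master-host:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement()) {
            // EMR-specific extension: scans the table's S3 location and
            // registers any partitions found there in the metastore.
            stmt.execute("ALTER TABLE my_table RECOVER PARTITIONS");
        }
    }
}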
References:
http://aws.amazon.com/ec2/pricing/
http://aws.amazon.com/elasticmapreduce/pricing/

I would suggest that you do NOT try to deploy your own Hadoop cluster, unless you have 2-3 months to spare and you have a Hadoop expert handy.
Elastic MapReduce will let you get started very quickly by providing a pre-configured Hadoop environment. Seeing as you only have a single job, it should be fine.

In general, historically, EMR has been pretty far behind the latest versions of Hadoop components, and some were missing entirely. That's the major reason for using another distribution. For example, HBase used to be missing from EMR, but now it is available. Today, Spark is absent from EMR. EMR will generally lag.
That said, if you're not using the latest and greatest features, go with EMR.

Related

Please explain the high-level architecture of a Hadoop cluster environment? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Since I am new to Hadoop administration, I am trying to understand how Hadoop cluster environments are set up in real production systems.
1) As of today, do most projects run on Hadoop v1 or Hadoop v2?
2) Do we have a single cluster or multiple clusters for a single project?
(I heard there are setups with multiple clusters where each cluster is dedicated to a specific role.)
3) Do Hadoop clusters usually run on a cloud like AWS or Rackspace, or do they run on the client's own network?
The answers to all of your questions depend completely on the client, the project, and a lot of other factors, but here are my two cents:
1) Most projects have switched to Hadoop v2.
2) It depends. There will obviously be one or two environments for dev, test, staging, etc. before production, but in production one project may have one environment, or one environment may serve multiple projects. (Yahoo has a 4,500-node Hadoop cluster.)
3) The number of nodes varies with the amount of data the company handles; there are companies running production on a 4-node cluster and companies running on a 4,000-node cluster.
4) Again, it depends on the type of data being stored and processed. Clients with sensitive information, such as banks, won't normally go for the cloud, as they feel the data is more secure in their own data centers. But some clients go entirely to the cloud because it saves them a lot of money (like the New York Times on AWS).

Hadoop installation on Amazon cloud

I am new to Hadoop and would like to go into the Hadoop administration line, so I studied the basics of Hadoop, installed it in pseudo-distributed mode successfully, and ran some basic examples. Now I need to improve further, so I want a way to learn Hadoop installation and configuration on a real cluster, and I decided to go for an Amazon micro instance. Can anyone please tell me how to install and configure Hadoop on the Amazon cloud?
Thanks in advance.
I have tried this personally, and you will not really be able to use Hadoop on a single micro instance due to its memory restrictions. IMHO you should at least try a medium instance to run Hadoop, or better yet use their Elastic MapReduce API, which runs a modified version of Hadoop. You can run a 3-node cluster for around $0.25 an hour. If you really want to learn big data, this is the way I went.
You should check out their documentation here
http://aws.amazon.com/documentation/elasticmapreduce/
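As a rough sketch of what launching such a small cluster through the EMR API can look like from Java, the snippet below uses the AWS SDK for Java's Elastic MapReduce client. The credentials, instance types, instance count, and log bucket are placeholder assumptions, and the exact builder methods can vary between SDK versions, so treat it as an outline rather than a recipe.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class LaunchSmallCluster {
    public static void main(String[] args) {
        // Placeholder credentials -- substitute your own AWS keys.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // One master plus two core nodes: the "3 node cluster" mentioned above.
        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("learning-cluster")
                .withLogUri("s3://my-log-bucket/emr-logs/")   // assumed bucket
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m1.small")
                        .withSlaveInstanceType("m1.small")
                        .withKeepJobFlowAliveWhenNoSteps(true));

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}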

Is there an Amazon community AMI for Hadoop/HBase?

I would like to test out Hadoop & HBase in Amazon EC2, but I am not sure how complicated it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like the Bioconductor AMI.
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, 1-to-1 NAT, where a node sees its own private IP address but connects to the outside world via a public IP. When you are manually building a cluster, this causes problems, even with software like Apache Whirr or BigTop that targets EC2 specifically.
An AMI alone is not likely to help you get a Hadoop or HBase cluster up and running - if you want to run a Hadoop/HBase cluster, you will likely have to spend time tweaking the networking settings etc.
To my knowledge there isn't, but you should be able to deploy on EC2 easily using Apache Whirr, which is a very good alternative.
Here is a good tutorial for doing this with Whirr; as the tutorial says, you should be able to do it in minutes!
The key is creating a recipe like this:
# Name used to tag the cluster's instances and configuration
whirr.cluster-name=hbase
# One master running ZooKeeper (zk), the namenode (nn), the jobtracker (jt)
# and the HBase master, plus five workers each running a datanode (dn),
# a tasktracker (tt) and an HBase regionserver
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
# AWS credentials are taken from environment variables
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
# Instance type, base AMI and region to launch in
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties
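When you are done, bin/whirr destroy-cluster --config hbase-ec2.properties tears the instances down again, so you only pay for the time the cluster is actually running.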

Read data from Amazon HBase

Can anyone tell me whether I can read data from Amazon HBase using org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool?
We are migrating to Amazon's EMR framework, with HBase running on top of it.
The present implementation is based on the pure Apache Hadoop and HBase distributions. I'm trying to verify that no code changes are needed when we migrate to Amazon's EMR.
Please share your thoughts.
While it should not be necessary, I would expect any problems and changes to be related to the nature of EC2 and its networking.
HBase relies on region servers being able to renew their leases in a timely manner. If the region servers are too busy, for example because of some massive operation running on them, they cannot do so and get kicked out of the cluster.
On Amazon, the performance of EC2 instances is much less predictable than in a dedicated cluster (unless you use cluster compute instances), so adjusting timeout parameters and/or the nature of your loads might be needed to get the cluster working properly.
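For reference, here is a minimal sketch of the kind of client code in question, using the old HTablePool API mentioned above. The ZooKeeper quorum address, table name, and column names are placeholders; in principle, only the quorum address in the Configuration should need to change when pointing at an EMR cluster, not the client code itself.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadFromHBase {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's ZooKeeper quorum -- for EMR this
        // would be the master node's address (placeholder below).
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "emr-master-host");

        HTablePool pool = new HTablePool(conf, 10);
        HTableInterface table = pool.getTable("my_table");   // placeholder table
        try {
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        } finally {
            // Returns the table to the pool; on very old HBase versions,
            // use pool.putTable(table) instead.
            table.close();
            pool.close();
        }
    }
}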

Recommendations for Hadoop on EC2?

When running Hadoop in EC2, I seem to have two options:
A: Manage the cluster myself, using the EC2-specific shell scripts that come with Hadoop.
B: Use Elastic MapReduce, and pay a little extra for the convenience.
I'm leaning towards B, but I'd appreciate some advice from people with more experience. Here are my questions:
Are there any tasks that can be done with one of these methods but not the other?
Are there other options besides these two that I'm overlooking?
If I choose B, how easy would it be to go back to A? That is, what's the danger of vendor lock-in?
Third option:
You can use Apache Whirr to set up a Hadoop cluster on EC2 (Rackspace is also supported).
I have been told by people close to the Amazon Elastic MapReduce (EMR) development team that there are at least two other advantages to using EMR: a) Amazon is actively applying bug fixes and performance enhancements to the Hadoop code base used on EMR, and b) Amazon employs a high performance network between EMR servers and S3 servers that may not be available between EC2 servers and S3 servers.
UPDATE: See @mat's comments, which refute the rumored advantages of using EMR.
Disclaimer: I'm the founder of Axemblr.com
There are also commercial alternatives you can use. Axemblr Tool for Cloudera CDH3 is a tool we are building that can deploy a cluster in just a few minutes with everything you need (including Cloudera Hue, Mahout & Pig).
We are also building an alternative to EMR that's fully compatible from an API perspective, targeted at private clouds.
If you are wondering why it makes sense to run CDH on EC2 rather than EMR, see:
http://www.quora.com/What-are-the-advantages-disadvantages-running-Clouderas-distribution-for-Hadoop-on-EC2-instances-rather-than-using-Amazons-Elastic-Map-Reduce-Service
