I am quite new to Amazon services and have started reading about EMR. I am more or less familiar with OpenStack. I just want someone to tell me, in short, what plays the role of OpenStack's Compute, Controller, and Cinder storage in the Amazon cloud.
For example, Cinder is the block storage for OpenStack; likewise, S3 is storage in the Amazon cloud.
What are the other two, compute and controller, in the Amazon cloud?
Also, can someone please put in simple words the relation between EMR and EC2, or are they entirely different?
Even in EMR we use EC2 instances, so why are people comparing Hadoop on EC2 vs Elastic MapReduce, like in the following link?
Hadoop on EC2 vs Elastic Map Reduce
Thanks a ton in advance :)
OpenStack is open-source software that you can set up in your own data center so that you can offer managed services the way Amazon does.
Amazon is its own independent service with its own proprietary implementation, and they basically sell the service.
So OpenStack has several components that map roughly one-to-one onto AWS services:
Controller (e.g. the Horizon dashboard) -> AWS Management Console
Cinder (block storage) -> EBS
Swift (object storage) -> S3
Nova (compute) -> EC2
EMR (Elastic MapReduce) is just another service from Amazon that allows you to run Hadoop jobs. EMR basically runs on top of EC2, so in essence, when you create an EMR cluster, it is using EC2 as its underlying service.
You can also run Hadoop independently from EMR on EC2 instances; the downside is that you have to manage the whole Hadoop installation and configuration yourself (Cloudera Manager is pretty helpful for this). The advantage is that it allows you to tweak the Hadoop stack as much as you want.
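To make the EMR-on-EC2 relationship concrete, here is a minimal sketch of creating an EMR cluster with boto3. The cluster name, region, release label, and instance types are placeholder assumptions, not values from the question:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# EMR provisions ordinary EC2 instances under the hood; you choose the
# instance types and count just as you would when launching EC2 directly.
response = emr.run_job_flow(
    Name="example-hadoop-cluster",          # placeholder name
    ReleaseLabel="emr-5.36.0",              # placeholder release
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])

If you then list your EC2 instances in the console, you will see the cluster's master and core nodes there, which is exactly why the "Hadoop on EC2 vs EMR" comparison comes up.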
Hope this helps.
I am planning to create a cluster with three nodes, and each node will be launched in a different Amazon EC2 region.
As per the DataStax documentation, I will use Ec2MultiRegionSnitch, and the replication strategy is NetworkTopologyStrategy. Below are the requirements I need to achieve:
Cluster Size: 3 (spanning across Amazon EC2 regions).
Replication Factor: 3
Read and Write Level: QUORUM.
Based on the above configuration, I can survive a single node loss (meaning an outage of any one Amazon region; correct me if I am wrong).
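To make my intent concrete, here is a rough sketch of the keyspace and consistency level I have in mind, using the Python driver. The contact point, table name, and datacenter names (us-east, us-west, eu-west) are placeholders; the real datacenter names are derived by Ec2MultiRegionSnitch from the regions:

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Placeholder contact point (public IP of any node).
cluster = Cluster(["203.0.113.10"])
session = cluster.connect()

# One replica per region gives a total replication factor of 3.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 1, 'us-west': 1, 'eu-west': 1
    }
""")

# QUORUM needs 2 of 3 replicas, so reads and writes should keep working
# when any single node (here, any single region) is down.
stmt = SimpleStatement(
    "SELECT * FROM demo.example_table",  # example_table is a placeholder
    consistency_level=ConsistencyLevel.QUORUM,
)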
In order to achieve the above configuration, I have two options.
Option-1: Using the DataStax-provided Amazon EC2 AMI image.
This option launches the instance with almost all components needed to run Cassandra, plus some monitoring tools (OpsCenter, etc.).
But it stores all data on the EC2 instance store, so data persists only during the life of the instance, and the storage size depends on the instance type.
Option-2: Using a customised installation.
In this option, I have to launch an Amazon EC2 Ubuntu AMI, install Java, and install the DataStax Community edition.
This option enables me to store all my data on EBS. Hence I can expand EBS whenever needed, and at the same time I can restore any node using an EBS snapshot.
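Here is roughly how I imagine scripting Option-2 with boto3; the AMI ID, instance type, key pair, and install commands are placeholders I would still need to confirm:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# The user-data script runs on first boot; Java and the DataStax
# Community package would be installed here.
user_data = """#!/bin/bash
apt-get update -y
apt-get install -y openjdk-8-jdk
# ... add the DataStax apt repository and install the Cassandra package ...
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder Ubuntu AMI
    InstanceType="m4.large",          # placeholder instance type
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",             # placeholder key pair
    UserData=user_data,
)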
My Question:
Which of the two options is suitable for my needs?
Note:
I have read the documentation provided by DataStax and am very new to Cassandra, so whatever input you provide will be very useful to me.
Thanks
It's not true that you get the DataStax AMI only with EC2 ephemeral storage. Starting from version 2.5, they claim you can choose EBS as well: Introducing the DataStax Auto-Clustering AMI 2.5. That's a relatively easy way of getting started, and the one I've personally chosen.
Should you choose EBS or EC2 ephemeral storage?
The answer is: it depends...
The past (~2012-2013):
EC2 instances with ephemeral storage were the better choice. Detailed performance benchmarks over the years indicated that EBS was getting better, but locally attached physical drives still performed better.
The past (~2014):
Ephemeral storage was still the better choice. DataStax wrote a nice post about pricing, network, and failure resilience: What is the story with AWS storage?
Present (~2016):
instaclustr claims:
By running Cassandra on Amazon EBS, you can run denser, cheaper Cassandra clusters with just as much availability as ephemeral storage instances.
Nice presentation here: AWS re:Invent 2015 | (BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second on 60 Nodes
All in all, I suggest doing a TCO analysis, and if there isn't a big difference in price, choose EBS because of the out-of-the-box ability to make a snapshot. What's more, chances are EBS will keep improving over time.
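As an illustration of that snapshot ability, here is a minimal boto3 sketch; the volume ID and description are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# EBS snapshots are incremental and stored in S3, which is what makes
# restoring or cloning an EBS-backed Cassandra node straightforward.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="cassandra-node-1 nightly backup",
)
print(snapshot["SnapshotId"])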
We are currently using Apache Hadoop (vanilla version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how master and slaves communicate, and the various configuration details. I already checked the AWS EMR documentation, but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating from Apache Hadoop to AWS EMR?
During EMR cluster creation, it will ask you to specify the master and nodes; the default settings will provision one master and two nodes for you. You can also specify which applications you want in the cluster (e.g. Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it will provision all the services. You can click on these services and access them via the web, or by SSHing into the master. For example, to access the Ganglia monitoring interface, go to the service within EMR and click it; a new window will open with the monitoring interface.
Installing these applications is very easy: all you have to do is specify all the services during cluster creation.
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of the software (e.g. Hadoop, Hive, Pig, Impala). Also consider using Amazon S3 for storage of data instead of HDFS, especially for source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, the Hadoop provided with EMR can be a few releases behind. You should check the EMR release notes for the detailed application versions provided with each release. EMR takes care of application provisioning, setup, and configuration. The default Hadoop (and other application) configuration will change based on the EC2 instance type. You can override the default settings using EMR's application configuration mechanism, as sketched after this answer.
Other than that, the Hadoop you have on premises and the Hadoop on EMR should be the same.
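As a sketch of that override mechanism, EMR accepts "configuration classifications" that map onto the familiar Hadoop config files. The values below are illustrative placeholders, not recommended settings:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Each classification corresponds to a Hadoop config file, e.g.
# "core-site" -> core-site.xml and "yarn-site" -> yarn-site.xml.
emr.run_job_flow(
    Name="migrated-hadoop-cluster",  # placeholder name
    ReleaseLabel="emr-5.36.0",       # placeholder release
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}],
    Configurations=[
        {
            "Classification": "yarn-site",
            "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
        },
        {
            "Classification": "core-site",
            "Properties": {"io.file.buffer.size": "65536"},
        },
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)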
I would like to set up a Hadoop cluster in AWS with a total capacity of approximately 100 TB. If I choose AWS instances as per http://aws.amazon.com/ec2/instance-types/, I do not get an ideal configuration for data nodes; I would like to use local disks (SSD/non-SSD) for worker nodes. For example, if I select the cc2.8xlarge instance for the data nodes, then for 100 TB I would have to set up about 30 cc2.8xlarge instances, which would be very costly. Could you please suggest how I should configure my cluster in AWS (EC2) with a minimum number of data nodes, or is there any standard configuration for Hadoop in AWS?
It sounds very much like you want to consider Elastic MapReduce, which is a core AWS service based on Hadoop.
http://aws.amazon.com/elasticmapreduce/
You can specify your configuration and the cluster will launch for you - much easier than trying to configure EC2 instances yourself.
If you want to run Hadoop yourself, then use EBS volumes. You can attach a bunch of volumes (around 10-20, as I recall) to each node, and each volume can be up to 1 TB.
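For example, provisioning and attaching one such volume with boto3 might look like this; all IDs are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a ~1 TB (1000 GiB) volume in the same availability zone as the
# data node, then attach it; repeat per volume (devices /dev/sdf, /dev/sdg, ...).
volume = ec2.create_volume(Size=1000, AvailabilityZone="us-east-1a")
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)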
If you don't want to do it yourself, then look into EMR like monkeymatrix said.
I would like to test out Hadoop & HBase in Amazon EC2, but I am not sure how complicated it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like the bioconductor AMI.
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, a 1-to-1 NAT where the node sees its own private IP address but connects to the outside world on a public IP. When you are manually building a cluster, this causes problems, even when using software like Apache Whirr or BigTop built specifically for EC2.
An AMI alone is not likely to get a Hadoop/HBase cluster up and running; you will likely have to spend time tweaking the networking settings and so on.
To my knowledge there isn't, but you should be able to easily deploy on EC2 using Apache Whirr, which is a very good alternative.
Here is a good tutorial for doing this with Whirr; as the tutorial says, you should be able to do this in minutes!
The key is creating a recipe like this:
whirr.cluster-name=hbase
# 1 node running ZooKeeper (zk), the NameNode (nn), the JobTracker (jt)
# and the HBase master, plus 5 nodes each running a DataNode (dn),
# a TaskTracker (tt) and an HBase region server
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
# Credentials are read from your environment
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties
I am working on an HDFS high-availability project.
I have configured Hadoop on one Amazon EC2 instance. It is a small instance (AMI: Ubuntu Server).
I want to form a cluster of EC2 instances, so I am thinking of replicating the same machine. Does anybody know how to duplicate this instance onto another EC2 instance? If yes, please share.
Thanks!
If your instance is EBS-backed, you can make a snapshot of it and then run as many instances as you want from it.
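With boto3, that might look like the following sketch; the instance ID, image name, and instance type are placeholders. create_image snapshots the instance's EBS volumes and registers an AMI in one step:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an AMI from the already-configured Hadoop instance.
image = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name="hadoop-node-template",
)
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

# Launch as many copies as you need from the new AMI.
ec2.run_instances(
    ImageId=image["ImageId"],
    InstanceType="m1.small",
    MinCount=3,
    MaxCount=3,
)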