Need help to set up a Hadoop cluster in AWS - hadoop

I would like to set up a Hadoop cluster in AWS with a total capacity of approximately 100 TB. If I choose AWS instance types from http://aws.amazon.com/ec2/instance-types/ , I do not get an ideal configuration for the data nodes; I would like to use local disks (SSD or non-SSD) for the worker nodes. For example, if I select the cc2.8xlarge instance type for the data nodes, then for 100 TB I would have to set up 30 cc2.8xlarge instances, which would be very costly. Could you please suggest how I should configure my cluster in AWS (EC2) with a minimum number of data nodes, or is there a standard configuration for Hadoop in AWS?

It sounds very much like you want to consider Elastic MapReduce, which is a core AWS service based on Hadoop.
http://aws.amazon.com/elasticmapreduce/
You can specify your configuration and the cluster will launch for you - much easier than trying to configure EC2 instances yourself.
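For illustration, a basic launch with the current AWS CLI looks roughly like the sketch below; the release label, instance type, count, and key name are placeholders rather than a sizing recommendation for the 100 TB case above.

    # Rough sketch: launch a small EMR cluster from the AWS CLI.
    # All names, sizes, and counts are placeholders.
    aws emr create-cluster \
        --name "hadoop-cluster" \
        --release-label emr-5.36.0 \
        --applications Name=Hadoop \
        --instance-type m5.2xlarge \
        --instance-count 10 \
        --ec2-attributes KeyName=my-keypair \
        --use-default-roles

Once running, the cluster can be monitored, resized, or terminated from the EMR console.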

If you want to run Hadoop yourself, then you can use EBS volumes. You can attach a number of volumes (around 10-20 per instance, as I recall) to each node, and each volume can be up to 1 TB.
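A rough sketch of that approach, assuming the AWS CLI and made-up volume, instance, and device names: attach and mount the volumes, then list the mount points as the DataNode data directories.

    # Sketch only: IDs, device names, and paths are placeholders.
    # Attach an EBS volume to the DataNode instance...
    aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
        --instance-id i-0123456789abcdef0 --device /dev/xvdf

    # ...format and mount it (repeat for each volume)...
    sudo mkfs -t ext4 /dev/xvdf
    sudo mkdir -p /mnt/hdfs1
    sudo mount /dev/xvdf /mnt/hdfs1

    # ...and point HDFS at the mount points in hdfs-site.xml:
    #   <property>
    #     <name>dfs.datanode.data.dir</name>
    #     <value>/mnt/hdfs1,/mnt/hdfs2,/mnt/hdfs3</value>
    #   </property>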
If you don't want to do it yourself, then look into EMR like monkeymatrix said.

Related

Kubernetes distributed filesystem

Well, my company is considering moving from Hadoop to Kubernetes. We can find Kubernetes solutions for tools such as Cassandra, Spark, etc. So the last problem for us is how to store a massive amount of files in Kubernetes, say 1 PB. FYI, we DO NOT want to use online storage services such as S3.
As far as I know, HDFS is rarely used in Kubernetes, and there are a few replacement products such as Torus and Quobyte. So my question is: any recommendation for a filesystem on Kubernetes? Or any better solution?
Many thanks.
You can use a Hadoop-compatible filesystem such as Ceph or MinIO, both of which offer S3-compatible REST APIs for reading and writing. In Kubernetes, Ceph can be deployed using the Rook project.
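As a rough illustration of how Hadoop talks to such a store, the S3A connector can be pointed at an in-cluster endpoint; the service address, bucket, and credentials below are placeholders for whatever your MinIO or Ceph RGW deployment exposes.

    # Sketch only: endpoint, credentials, and bucket are placeholders.
    # Requires the hadoop-aws module on the classpath.
    hadoop fs \
        -D fs.s3a.endpoint=http://minio.minio.svc.cluster.local:9000 \
        -D fs.s3a.access.key=MY_ACCESS_KEY \
        -D fs.s3a.secret.key=MY_SECRET_KEY \
        -D fs.s3a.path.style.access=true \
        -ls s3a://my-bucket/

The same fs.s3a.* properties can be set once in core-site.xml instead of being passed on every command.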
But overall, running HDFS in Kubernetes would require stateful services like the NameNode and the DataNodes, with proper affinity and network rules in place. The Hadoop Ozone project is a recognition that object storage is more common for microservice workloads than HDFS block storage, since reasonably trying to analyze PBs of data using distributed microservices wasn't feasible. (I'm only speculating.)
The alternative is to use the Docker support in Hadoop & YARN 3.x.
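Very roughly, and assuming the property names from the Hadoop 3.x Docker-on-YARN documentation (check them against your version), that means enabling the Docker runtime on the NodeManagers and then requesting an image per job:

    # Sketch only: enable the Docker runtime in yarn-site.xml on every NodeManager, e.g.
    #   yarn.nodemanager.container-executor.class =
    #       org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor
    #   yarn.nodemanager.runtime.linux.allowed-runtimes = default,docker
    #
    # A MapReduce job can then request a Docker image through its environment
    # (my-job.jar, MyMainClass, and my-image are placeholders; the driver must
    # use ToolRunner so the -D options are parsed):
    hadoop jar my-job.jar MyMainClass \
        -Dmapreduce.map.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=my-image:latest" \
        -Dmapreduce.reduce.env="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=my-image:latest"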

Running mahout using hadoop on Amazon's EMR/EC2

I want to migrate my current local Hadoop cluster to Amazon. In this Hadoop cluster I am using services like Mahout, HBase, and Hive. I now have two options in Amazon: either go for pure EC2 instances or for an Elastic MapReduce cluster. I would like some suggestions on which is the better option for moving a cluster with these kinds of requirements.
I always suggest that people go for EMR: it is managed and will be a bit more costly than using pure EC2, but the headache and time you would spend configuring the clusters and then managing them can be saved by running a managed service like EMR.
Mahout can easily be run as a custom JAR.
A Hive cluster can also be launched within minutes.
Similarly for HBase, Amazon has recently added support for creating an HBase cluster on EMR.
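For illustration, a cluster with Hive and HBase installed plus a custom JAR step for a Mahout job could be launched roughly as below; the bucket, JAR path, arguments, and instance sizes are placeholders.

    # Sketch only: names, sizes, and the JAR location are placeholders.
    aws emr create-cluster \
        --name "mahout-hive-hbase" \
        --release-label emr-5.36.0 \
        --applications Name=Hadoop Name=Hive Name=HBase \
        --instance-type m5.xlarge \
        --instance-count 5 \
        --ec2-attributes KeyName=my-keypair \
        --use-default-roles \
        --steps Type=CUSTOM_JAR,Name=MahoutJob,ActionOnFailure=CONTINUE,Jar=s3://my-bucket/mahout-job.jar,Args=[arg1,arg2]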
See other views here.

EMR, EC2, OpenStack, Please clarify

I am quite new to Amazon services and have started reading about EMR. I am more or less familiar with OpenStack. I just want someone to tell me briefly what plays the role of Compute, Controller, and Cinder storage in the Amazon cloud.
For example, Cinder is the storage for OpenStack, and likewise S3 is the storage in the Amazon cloud.
What are the other two - compute and controller - in the Amazon cloud?
Also, can someone please explain in simple words the relation between EMR and EC2, or are they entirely different?
Even in EMR we use EC2 instances, so why do people compare Hadoop on EC2 vs Elastic MapReduce, as in the following link:
Hadoop on EC2 vs Elastic Map Reduce
Thanks a ton in advance :)
OpenStack is open source software that can be set up in your own cloud so that you can have your own managed services, like Amazon's.
Amazon is its own independent service with its own proprietary implementation, and they basically sell the service.
So OpenStack has several components that have a roughly 1-1 mapping with AWS services:
Controller -> Amazon Console
Cinder -> EBS
Storage -> S3
Compute -> EC2
EMR (Elastic MapReduce) is just another Amazon service that allows you to run Hadoop jobs. EMR basically runs on top of EC2, so in essence, when you create an EMR cluster it uses EC2 as its underlying service.
You can also run Hadoop independently of EMR on EC2 instances; the downside is that you have to manage all of the Hadoop installation and configuration yourself (Cloudera Manager is pretty helpful for this). The advantage is that it lets you tweak the Hadoop stack as much as you want.
Hope this helps.

read data from amazon hbase

Can anyone tell me whether I can read data from Amazon HBase using org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool?
We are migrating to Amazon's EMR framework, with HBase running on top of it.
The present implementation is based on the pure Apache Hadoop and HBase distributions. I'm trying to verify that no code changes are needed even if we migrate to Amazon's EMR.
Please share your thoughts.
While it should not happen, I would expect problems and changes related to the nature of EC2 and its networking.
HBase relies on region servers being able to renew their leases in a timely manner. If the region servers are too busy - because of some massive operations running on them - they cannot do so and get kicked out of the cluster.
On Amazon, the performance of EC2 instances is much less predictable than in a dedicated cluster (unless you use cluster instances), so adjusting timeout parameters and/or the nature of your loads might be needed to get the cluster to work properly.
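As a sketch of what that tuning typically means, these are the kinds of timeouts people raise in hbase-site.xml when region servers sit on less predictable instances; the values are illustrative, not a recommendation.

    # Sketch only: merge these properties into the existing <configuration>
    # element of hbase-site.xml on every node (values are placeholders):
    #
    #   <property>
    #     <name>zookeeper.session.timeout</name>
    #     <value>120000</value>  <!-- ms; gives a busy region server longer to renew its lease -->
    #   </property>
    #   <property>
    #     <name>hbase.rpc.timeout</name>
    #     <value>120000</value>
    #   </property>
    #
    # ...then restart the region servers so the new values take effect:
    $HBASE_HOME/bin/hbase-daemon.sh restart regionserver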

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon cloud to run some Hadoop MapReduce jobs, but I'm struggling to successfully create a cluster. I have downloaded the ec2 files and have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run "hadoop-ec2 launch-cluster name n" in the terminal. The master node starts successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.
Also, some of my jobs will require altering Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier than creating your own cluster manually.
But once the job flow is finished, by default it shuts the cluster down, leaving you with your outputs on S3. If all you need is to do some crunching, this may be the way to go.
In case you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering the Hadoop configuration on the nodes it starts is possible using EMR Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
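As a concrete illustration of the override the FAQ describes: on current EMR release labels the usual way to pass mapred-site.xml settings is a configuration classification on the create-cluster call, rather than the older configure-hadoop bootstrap action; the property names and values below are placeholders.

    # Sketch only: sizes and property values are illustrative.
    aws emr create-cluster \
        --name "tuned-cluster" \
        --release-label emr-5.36.0 \
        --applications Name=Hadoop \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --configurations '[{"Classification":"mapred-site","Properties":{"mapreduce.map.memory.mb":"3072","mapreduce.map.java.opts":"-Xmx2458m"}}]'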
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run "hadoop-ec2 launch-cluster name n" in the terminal. The master node starts successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.
How exactly are you trying to start it? Exactly which AMIs are you using?
