I want to migrate my current local Hadoop cluster to Amazon. In this Hadoop cluster I am using services like Mahout, HBase, and Hive. I now have two options on Amazon: either go for pure EC2 instances or an Elastic MapReduce (EMR) cluster. I would like some suggestions on which is the better option for moving a cluster with these kinds of requirements.
I always suggest people go for EMR. It is managed and will be a bit more costly than pure EC2, but the headache and time you would spend configuring the clusters and then managing them are saved by running a managed service like EMR.
Mahout can easily be run as a custom JAR.
A Hive cluster can also be launched within minutes.
Similarly for HBase: Amazon has recently added the ability to create an HBase cluster on EMR.
See other views here.
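To make the custom-JAR route concrete, here is a minimal sketch of submitting a Mahout job as a step to an existing EMR cluster with boto3. The region, cluster ID, S3 paths, and the Mahout driver class and its arguments are hypothetical placeholders, not values from the question.

    # Minimal sketch: add a Mahout job as a custom JAR step to a running EMR cluster.
    # "j-XXXXXXXXXXXX", the S3 paths, and the driver class/args are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",  # your cluster ID
        Steps=[
            {
                "Name": "mahout-item-similarity",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "s3://my-bucket/jars/mahout-mr-job.jar",
                    "Args": [
                        "org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob",
                        "--input", "s3://my-bucket/input/ratings/",
                        "--output", "s3://my-bucket/output/similarities/",
                        "--similarityClassname", "SIMILARITY_COSINE",
                    ],
                },
            }
        ],
    )
    print(response["StepIds"])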
I've installed Hadoop via a Helm chart on my microk8s Kubernetes cluster.
I would like to know how to create a Dask cluster across the machines in this Hadoop cluster. I tried following the tutorials on the Dask websites, but I keep getting errors because they look for a local YARN/Hadoop installation. How do I point Dask at the Hadoop running on Kubernetes so I can create the cluster?
If you want to launch Dask on YARN, we recommend using https://yarn.dask.org.
However, if you are using Kubernetes already, you might consider https://kubernetes.dask.org, which is more commonly used today.
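As a rough illustration of the two routes, here is a minimal sketch; the packed environment name, worker counts, memory sizes, and image tag are placeholder assumptions, and the Kubernetes option assumes the Dask operator is installed in the cluster.

    # Minimal sketch of the two options above; resource sizes, the environment
    # archive, and the image tag are placeholders, not values from the question.
    from dask.distributed import Client

    # Option 1: Dask on YARN (https://yarn.dask.org), run from a node that has
    # the Hadoop/YARN client configuration available.
    from dask_yarn import YarnCluster

    yarn_cluster = YarnCluster(
        environment="environment.tar.gz",  # packed conda/venv containing dask
        worker_vcores=2,
        worker_memory="4GiB",
    )
    yarn_cluster.scale(4)
    client = Client(yarn_cluster)

    # Option 2: Dask on Kubernetes (https://kubernetes.dask.org), which skips
    # YARN entirely and schedules workers as pods on the microk8s cluster
    # (requires the Dask Kubernetes operator to be installed). Pick one option.
    from dask_kubernetes.operator import KubeCluster

    k8s_cluster = KubeCluster(
        name="dask-on-microk8s",
        image="ghcr.io/dask/dask:latest",
        n_workers=3,
    )
    client = Client(k8s_cluster)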
We are searching for a viable way to provision a Hadoop ecosystem cluster with OpenShift (based on Docker). We want to build up a cluster using the services of the Hadoop ecosystem, i.e. HDFS, YARN, Spark, Hive, HBase, ZooKeeper, etc.
My team has been using Hortonworks HDP on on-premise hardware but will now switch to an OpenShift-based infrastructure. Hortonworks Cloudbreak does not seem to be suitable for OpenShift-based infrastructures. I have found an article that describes the integration of YARN into OpenShift, but it seems there is no further information available.
What is the easiest way to provision a Hadoop ecosystem cluster on OpenShift? Manually adding all the services feels error-prone and hard to administer. I have stumbled upon Docker images for the separate services, but that is not comparable to the automated provisioning you get with a platform like Hortonworks HDP. Any guidance is appreciated.
If you install OpenStack within OpenShift, Sahara allows provisioning of OpenStack Hadoop clusters.
Alternatively, Cloudbreak is Hortonworks' tool for provisioning container-based cloud deployments.
Both provide Ambari, giving you the same interface for cluster administration as HDP.
FWIW, I personally don't see the point of putting Hadoop in containers. Your datanodes are locked to specific disks, and there is no improvement in running several smaller ResourceManagers on a single host. Plus, for YARN, you would be running containers within containers. And for the namenode, you must have a replicated fsimage and edit log, because the container could be placed on any host.
We are currently using Apache Hadoop (the vanilla version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how the master and slaves communicate, and the various configurations. I have already checked the AWS EMR documentation, but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating to AWS EMR from Apache Hadoop?
During EMR cluster creation, it will ask you to specify the master and the nodes; the default settings provision one master and two nodes for you. You can also specify which applications you want in the cluster (e.g. Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it will provision all of these services. You can click on the services and access them via the web, or SSH into the master. For example, to access a service's web UI (such as Hue or Ganglia), go to that service within EMR and click it; a new window will open with its monitoring interface.
Installing these applications is very easy; all you have to do is select them during cluster creation.
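For reference, here is a minimal sketch of the same thing via boto3. The release label, instance types, key name, roles, and log bucket are placeholder assumptions rather than recommended values.

    # Minimal sketch of creating an EMR cluster with a set of applications.
    # Release label, instance types, roles, key name, and log bucket are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="my-migration-test-cluster",
        ReleaseLabel="emr-6.10.0",        # pick a release whose Hadoop/Hive versions match yours
        LogUri="s3://my-bucket/emr-logs/",
        Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}, {"Name": "Hue"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,           # 1 master + 2 core nodes, the default layout described above
            "Ec2KeyName": "my-keypair",   # needed to SSH into the master
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        VisibleToAllUsers=True,
    )
    print(response["JobFlowId"])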
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The master and slave (core/task) nodes communicate exactly as they would in any Hadoop cluster. However, only one master is supported (with no backup master).
When migrating to EMR, check that you are using compatible versions of software (e.g. Hadoop, Hive, Pig, Impala, etc.). Also consider using Amazon S3 instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
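As a small illustration of that last point, you can stage source data in S3 and let jobs read it directly via s3:// URIs (through EMRFS); the local path, bucket, and key below are hypothetical placeholders.

    # Minimal sketch: stage source data in S3 so it outlives the EMR cluster.
    # The local path, bucket, and key names are hypothetical placeholders.
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")
    s3.upload_file("/data/exports/events.csv", "my-source-bucket", "raw/events.csv")

    # Jobs on the cluster can then read the data directly with an s3:// URI,
    # e.g. s3://my-source-bucket/raw/events.csv, instead of copying it into HDFS.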
Technically, the Hadoop provided with EMR can be a few releases behind. You should check the EMR release notes for the exact application versions shipped with each release. EMR takes care of application provisioning, setup, and configuration. Depending on the EC2 instance type, the default Hadoop (and other application) configuration will change; you can override the defaults using the application configuration settings.
Other than that, the Hadoop you have on premises and the one on EMR should be the same.
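For instance, here is a hedged sketch of what such a configuration override looks like; yarn-site and hive-site are real EMR configuration classifications, but the property values are placeholders, not tuning recommendations.

    # Minimal sketch of EMR configuration overrides; the property values are
    # placeholders, not tuning recommendations.
    import json

    configurations = [
        {
            "Classification": "yarn-site",
            "Properties": {"yarn.nodemanager.vmem-check-enabled": "false"},
        },
        {
            "Classification": "hive-site",
            "Properties": {"hive.exec.dynamic.partition.mode": "nonstrict"},
        },
    ]

    # This JSON can be pasted into the console's software settings box,
    # or passed as the Configurations parameter of boto3's run_job_flow call.
    print(json.dumps(configurations, indent=2))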
I would like to set up a Hadoop cluster in AWS with a total capacity of approximately 100 TB. If I choose AWS instances as per http://aws.amazon.com/ec2/instance-types/, I do not get an ideal configuration for the data nodes; I would like to use local disks (SSD/non-SSD) for the worker nodes. For example, if I select the cc2.8xlarge instance for the datanodes, then for 100 TB I would have to set up 30 cc2.8xlarge instances, which would be very costly. Could you please suggest how I should configure my cluster in AWS (EC2) with a minimum number of datanodes, or is there any standard configuration for Hadoop in AWS?
It sounds very much like you want to consider Elastic MapReduce, which is a core AWS service based on Hadoop.
http://aws.amazon.com/elasticmapreduce/
You can specify your configuration and the cluster will launch for you - much easier than trying to configure EC2 instances yourself.
If you want to run Hadoop yourself, then use EBS volumes. You can mount a number of volumes on each node (around 10-20 per instance, as I recall), and each volume can be up to 1 TB.
If you don't want to do it yourself, then look into EMR as monkeymatrix said.
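To make the sizing trade-off concrete, here is a small back-of-the-envelope sketch; the replication factor, per-volume size, and volumes per node are assumptions you would adjust for your own setup.

    # Back-of-the-envelope datanode sizing; all inputs are assumptions to adjust.
    import math

    usable_tb = 100          # target usable HDFS capacity from the question
    replication = 3          # default HDFS replication factor
    ebs_volume_tb = 1        # max size per EBS volume mentioned above
    volumes_per_node = 16    # within the ~10-20 volumes per instance noted above

    raw_tb_needed = usable_tb * replication
    tb_per_node = ebs_volume_tb * volumes_per_node
    datanodes = math.ceil(raw_tb_needed / tb_per_node)

    print(f"Raw capacity needed: {raw_tb_needed} TB")
    print(f"Capacity per datanode: {tb_per_node} TB")
    print(f"Minimum datanodes: {datanodes}")   # ~19 with these assumptions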
Loving MRToolkit -- great to get away from Java while writing Hadoop jobs. It has become apparent that the library was written to interface with an EC2 cluster, not with Amazon's Elastic MapReduce system. Does anybody have insights into running jobs defined with the toolkit on Elastic MapReduce? It isn't readily apparent from the web interface, and I'd love to avoid the headache of setting up a cluster by hand on EC2.
I've looked into uploading files under the 'streaming' option (as that's what MRToolkit uses), but Amazon expects separate files for the mapper and reducer -- typical MRToolkit style defines them in a single file as subclasses of the predefined Base(Map|Reduce) classes.
Thanks much for any thoughts.
Isaac
It's doable, but not through the web GUI.
1. Download and install the Ruby client.
2. Create your cluster: elastic-mapreduce --create --alive [params to size cluster]
3. Confirm your Elastic MapReduce master security group has port 22 open.
4. SSH into your master node.
5. Use git / scp to copy over your application code.
6. Run your app.
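The old Ruby client has since been superseded by the AWS CLI and SDKs. Under that assumption, here is a hedged sketch of submitting a Hadoop streaming job as a step with boto3; the cluster ID, S3 paths, and mapper/reducer scripts are hypothetical, and you would still need to adapt MRToolkit's single-file style into separate mapper and reducer commands, as noted in the question.

    # Hedged sketch: submit a Hadoop streaming step to an existing EMR cluster.
    # The cluster ID, S3 paths, and mapper/reducer scripts are hypothetical.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",
        Steps=[
            {
                "Name": "mrtoolkit-streaming-job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",   # wrapper available on EMR release 4.x and later
                    "Args": [
                        "hadoop-streaming",
                        "-files", "s3://my-bucket/code/mapper.rb,s3://my-bucket/code/reducer.rb",
                        "-mapper", "ruby mapper.rb",
                        "-reducer", "ruby reducer.rb",
                        "-input", "s3://my-bucket/input/",
                        "-output", "s3://my-bucket/output/",
                    ],
                },
            }
        ],
    )
    print(response["StepIds"])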