I have some queries regarding Cloudera Manager (Free Edition) on EC2. I am not sure whether this is the correct place to ask; if not, please let me know. Is there a place where I can post my questions regarding Cloudera Manager and Hadoop?
Currently I am creating a Hadoop cluster using Cloudera Manager. I have m3.xlarge EC2 instances, but the wizard does not have an option to select the m3.xlarge instance type. Secondly, I have RHEL as the OS, whereas the wizard only offers Ubuntu 12.04 and CentOS 6.3. Does that mean it does not support RHEL?
Cloudera Manager only supports Ubuntu and CentOS as of now. Also, please note that any instances you might have created beforehand will not be used by Cloudera Manager. It automatically creates new instances, which you can verify in the AWS management console. When you choose the number and type of instances (only the types supported by Cloudera Manager are available), it uses your access key and secret key to create them automatically.
I want to learn Hadoop, so I downloaded the Hortonworks sandbox to my local machine and opened it in VirtualBox. But due to lack of sufficient RAM, I am thinking about using a cloud VM instance instead. I used wget to download the Hortonworks sandbox onto the instance, but it is an OVA file. How can I open it? How can I start using the Hadoop environment on my instance? I want to get into the Ambari GUI through my cloud instance. Is there any way?
You cannot install the Hortonworks sandbox from an OVA onto a VM instance. The sandbox is a virtual machine image meant to be imported into desktop products such as VirtualBox. Your Google Compute Engine instance is itself a VM, so you cannot install a VM inside a Google VM.
Setting up Hadoop on a single VM instance is fairly easy, and there are numerous tutorials on the Internet. Google also offers Dataproc as a service, which is a very good managed setup of Hadoop, Spark, etc. However, setting up Hadoop manually with all of the applications that Hortonworks offers will take effort. This is why products like Dataproc exist: to remove that setup burden.
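For example, assuming you have the gcloud CLI set up (the cluster name and region here are placeholders, not anything from your question), a throwaway single-node Dataproc cluster is one command:

gcloud dataproc clusters create my-test-cluster --region us-central1 --single-node

You can then SSH to the master node (Dataproc names it <cluster>-m) and use the Hadoop environment there directly.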
I am new to the IBM Bluemix platform and exploring its BigInsights service. I can see pre-configured components such as Pig, Hive, HBase and others, but I want to know how I can install services like Drill or Hue which are not configured by default. Also, SSH to the cluster nodes only allows restricted access with no sudo rights, in case one needs to run yum commands. Does Bluemix allow root access? I cannot see a way to get it. Thanks in advance.
As far as I know, it is not possible.
But you can use http://www.softlayer.com/ to build your own IOP (IBM Open Platform) Cluster in the cloud.
If you are interested in IBM's value-adds and you just want to try them out:
https://www.youtube.com/watch?v=4p7LDeu_qQQ is a nice tutorial for setting up your own cluster via Docker.
This tutorial should still be valid for Hue:
https://developer.ibm.com/hadoop/2015/06/02/deploying-hue-on-ibm-biginsights/
Installing Drill doesn't look complicated:
https://drill.apache.org/docs/installing-drill-in-distributed-mode/
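As a rough sketch of what that page describes (the tarball version, cluster id and ZooKeeper addresses are placeholders), the distributed install boils down to untarring Drill on each node, pointing conf/drill-override.conf at the same ZooKeeper quorum everywhere:

drill.exec: {
  cluster-id: "my-drill-cluster",
  zk.connect: "zk1:2181,zk2:2181,zk3:2181"
}

and then starting a drillbit on each node:

bin/drillbit.sh start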
In conclusion: you need to move away from Bluemix if you want a more customised BigInsights. But there are options: SoftLayer, AWS, or just your local computer (if you have sufficient resources, since some components like HBase need a minimum number of nodes).
It's well known that it is not possible to create a cluster on a single machine just by changing ports. The workaround is to add virtual Ethernet devices to the machine and use those to configure the cluster.
I want to deploy a cluster of, let's say, 6 nodes on two EC2 instances. That means 3 nodes on each machine. Is it possible? If so, what should the seed node addresses be?
Is it a good idea for production?
You can use the DataStax AMI on AWS. DataStax Enterprise is a suitable solution for production.
I am not sure about running several nodes per machine, because each node needs its own config files and the AMI uses the defaults. I have no idea how to change that.
There are simple instructions here. When you configure the instance settings, you have to provide advanced settings for the cluster, like --clustername yourCluster --totalnodes 6 --version community etc. You can also install Cassandra manually by installing the latest versions of Java and Cassandra.
You can build the cluster by modifying /etc/cassandra/cassandra.yaml (on Ubuntu 12.04): fields like cluster_name, seeds, listen_address, rpc_address and initial_token. cluster_name has to be the same for the whole cluster. Seeds are the gossip contact points; the same seed IPs should be listed on every node. I am still confused about tokens, though.
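As a minimal sketch of those fields (the IPs are placeholders, and the seed_provider layout assumes a Cassandra 1.2-era cassandra.yaml), each node would carry something like:

cluster_name: 'MyCluster'                # must be identical on every node
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # gossip contact points, same list on every node
          - seeds: "10.0.0.1,10.0.0.2"
listen_address: 10.0.0.3                 # this node's own private IP
rpc_address: 10.0.0.3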
I am trying to create a small cluster for testing purposes on EC2 using Cloudera Manager 5.
These are the directions I am following, http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.7.1/Cloudera-Manager-Installation-Guide/cmig_install_on_EC2.html.
The installation gets to the step "Execute command SparkUploadJarServiceCommand on service spark" and then fails.
The error is "Upload Spark Jar failed on spark_master".
What is going wrong and how can I fix this?
Thanks for your help.
Adding the findings as an answer.
You have to open all the required ports for Cloudera Manager to install its components correctly.
For a complete guide of ports you need to open refer to:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_ports_cdh4.html
If you are running Cloudera Manager in EC2 you can create a security group that allows all traffic/ports between the Cloudera Manager and its nodes.
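As a hedged sketch with the AWS CLI (the group name is a placeholder, and this assumes a setup where --group-name works, such as a default VPC), a self-referencing rule that allows all traffic between cluster members looks roughly like:

aws ec2 create-security-group --group-name cm-cluster --description "Cloudera Manager cluster"
# protocol -1 means all protocols and ports; the source is the group itself
aws ec2 authorize-security-group-ingress --group-name cm-cluster --protocol -1 --source-group cm-cluster

Attach that group to the Cloudera Manager host and all cluster nodes, and the intra-cluster port list above stops being a concern.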
I would like to test out Hadoop & HBase on Amazon EC2, but I am not sure how complicated it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like the bioconductor AMI.
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, a 1-to-1 NAT where the node sees its own private IP address but connects to the outside world on a public IP. When you are manually building a cluster, this causes problems, even with software like Apache Whirr or Apache Bigtop that targets EC2 specifically.
An AMI alone is not likely to help you get a Hadoop or HBase cluster up and running - if you want to run a Hadoop/HBase cluster, you will likely have to spend time tweaking the networking settings etc.
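To illustrate the first point, here is a hedged sketch with the current AWS CLI (the cluster name, release label, key pair and instance type are all placeholders, not values from this question):

aws emr create-cluster --name "hbase-test" \
    --release-label emr-5.30.0 \
    --applications Name=Hadoop Name=HBase \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles --ec2-attributes KeyName=my-key

One command gives you a cluster with HBase already wired into Hadoop, with the EC2 networking quirks handled for you.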
To my knowledge there isn't, but you should be able to deploy on EC2 quite easily using Apache Whirr, which is a very good alternative.
Here is a good tutorial for doing this with Whirr; as the tutorial says, you should be able to get it done in minutes!
The key is creating a recipe like this:
whirr.cluster-name=hbase
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties
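When you are finished, the same properties file tears the cluster down again; this is standard Whirr usage rather than anything specific to this recipe:

bin/whirr destroy-cluster --config hbase-ec2.properties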