How to deploy a Cassandra cluster on two ec2 machines? - amazon-ec2

It's a known fact that it is not possible to create a cluster on a single machine just by changing ports. The workaround is to add virtual Ethernet devices to the machine and use these to configure the cluster.
I want to deploy a cluster of, let's say, 6 nodes on two EC2 instances, i.e. 3 nodes on each machine. Is that possible? If so, what should the seed node addresses be?
Is it a good idea for production?

You can use the DataStax AMI on AWS. DataStax Enterprise is a suitable solution for production.
I am not sure about your cluster layout, because each node needs its own config files by default, and I have no idea how to change that.
There are simple instructions here. When you configure the instance settings, you have to pass the advanced cluster settings, e.g. --clustername yourCluster --totalnodes 6 --version community. You can also install Cassandra manually by installing the latest versions of Java and Cassandra.
You can build the cluster by modifying /etc/cassandra/cassandra.yaml (Ubuntu 12.04): fields like cluster_name, seeds, listen_address, rpc_address and the token. cluster_name has to be the same for the whole cluster. A seed is a contact-point node whose IP you add to every node's configuration so it can discover the cluster. I am still confused about tokens.
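As a rough sketch of what that could look like on one node (a hedged example, not from the original answer: the IPs, the cluster name and the assumption that each of the 3 nodes per machine is bound to its own virtual interface address are placeholders):

cluster_name: 'yourCluster'
num_tokens: 256
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.10,10.0.1.10"
listen_address: 10.0.0.11
rpc_address: 10.0.0.11

Picking one seed address per EC2 machine (here 10.0.0.10 and 10.0.1.10) and listing both on every node is a common choice; each node sharing a machine also needs its own data_file_directories and commitlog_directory so the instances do not overwrite each other.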

Related

Arangodump between two different AWS ec2 clusters

I have created a graph database in ArangoDB on a 5-machine AWS cluster. I do not have enough space in the database AWS cluster to store the dump, so I would like to take a dump of the database onto an AWS instance in a different cluster. I have the key files to connect to the machines. How do I do this using arangodump? Thanks.
Do I get that correctly that you're using DC/OS clusters on AWS?
The problem with arangoimp is that it doesn't know how to authenticate with the DC/OS proxy, and thus can't reach the routes it would require to import into ArangoDB.
The problem is similar to Running Arango Shell on DC/OS cluster - you want to use sshuttle, as lalitlogical describes, to forward the ArangoDB server port (usually 8529) to your target environment.
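A minimal sketch of that, assuming plain SSH access to a host that can reach the coordinator and the default port 8529 (the host names, user and paths are placeholders, not from the original answer):

# forward the remote ArangoDB endpoint to localhost
ssh -N -L 8529:localhost:8529 ec2-user@coordinator-host &

# run the dump against the forwarded port, writing to a disk with enough space
arangodump --server.endpoint tcp://127.0.0.1:8529 --server.username root --output-directory /mnt/bigdisk/dump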

Multi-node hadoop cluster installation

Sorry if my question appears to be naïve. We are planning to use CDH 5.3.0 or 5.4.0. We want to implement a multi-node cluster.
The example multi-node installations that I have seen/read on different blogs/resources have master and slaves on different hosts.
However, we are constrained by the number of hosts. We have only 2 powerful hosts (32 cores, 400+ GB RAM), so if we decide to have the master on one and a slave on the other, we will end up with only one slave. My questions are:
Is it possible to have the master and a slave on the same host?
Can I have more than one slave node on a single host?
Also, does one need to pay to use Cloudera Manager, or is it open source like the rest of the components?
If you can point me in the direction of some resource which would help me understand above scenarios it would be helpful.
Thanks for your help.
Regards,
V
Old question, but since there is no correct answer yet:
Yes, it is possible to install master and worker services on a single host,
e.g. HDFS (NameNode and DataNode). You can even install a full Cloudera or Hortonworks stack with ALL services on a single host if it is powerful enough, but I would only recommend that for a POC or test cases.
If you use Cloudera or Hortonworks without virtualization, it is not possible to run multiple instances of the SAME worker service (e.g. the DataNode) on the same host. One host, one worker instance; anything else would not make sense.
Cloudera is a package of multiple open source projects (Hadoop, Spark, ...) and other closed source parts like Cloudera Manager and other enterprise features. But everything you need is free, even for commercial use, with the community licence.
Right now (2017): Cloudera Navigator is the only big feature that is not part of the community edition.
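As a rough illustration of colocation in plain Apache Hadoop 2.x (a hedged sketch; CDH and Cloudera Manager handle the equivalent through their own packaging, and the script paths are placeholders), the same host can simply start both the master and worker daemons:

sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager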
Yes, you can configure the NameNode and a DataNode on a single node.
You cannot have more than two DataNodes on a single machine.
Cloudera is an open-source Hadoop distribution.

vSphere Cluster creation requirements

I've been searching around but haven't found a clear answer on this.
We're using VMware ESXi with vSphere to manage a handful of VMs (about 15 right now).
However, these are all spread over three separate machines. I'm looking for a way to cluster these together so their resources can be pooled or dynamically allocated. I found vSphere DRS Cluster information, but I'm having a really hard time finding out what I need to get that set up.
Does it require a separate vCenter license to hook into vSphere? And at that point, how do I create a database to group all the server hosts together? Every tutorial I find already has 2+ host machines already grouped together in the vSphere client, and I'm not sure how to go about achieving that.
If you just want to create a failover cluster, then you need VMware HA. VMware DRS is the option for dynamic resource allocation. To manage these two options, you need a vCenter Server. vCenter Server Foundation can manage up to 3 hosts (which covers your case). For more information about vCenter, see this link.
For VMware HA and DRS to work, you must have shared storage (NFS, iSCSI, or Fibre Channel). To see how to create a VMware HA cluster using the vSphere Client (connected to vCenter Server), see this link.
VMware DRS is an option you can enable after you have created the VMware HA cluster. See this link.

Cloudera installation Doubts?

I am new to Cloudera. I installed Cloudera on my system successfully, and I have two doubts:
Consider a machine with some nodes already running Hadoop with some data. Can we install Cloudera so that it uses the existing Hadoop without making any changes or modifications to the data stored in the existing Hadoop?
I installed Cloudera on my machine, and I have another three machines to add to the cluster. Do I need to install Cloudera on those three machines before adding them to the cluster, or can I add a node to the cluster without installing Cloudera on that particular node?
Thanks in advance; can anyone please give some information about the above questions?
Answers to the questions:
1. If you want to migrate to CDH from an existing Apache distribution, you can follow this link.
Excerpt:
Overview
The migration process does require a moderate understanding of Linux system administration. You should make a plan before you start. You will be restarting some critical services such as the name node and job tracker, so some downtime is necessary. Given the value of the data on your cluster, you'll also want to be careful to take recent backups of any mission-critical data sets as well as the name node meta-data.
Backing up your data is most important if you're upgrading from a version of Hadoop based on an Apache Software Foundation release earlier than 0.20.
2. The CDH binaries need to be installed and configured on all the nodes to have a CDH-based cluster up and running.
From the Cloudera Manual
You can migrate the data from a CDH3 (or any Apache Hadoop) cluster to a CDH4 cluster by using a tool that copies out data in parallel, such as the DistCp tool offered in CDH4.
Other sources
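For illustration, a parallel copy with DistCp typically looks like the following (a hedged sketch; the NameNode host names and paths are placeholders):

hadoop distcp hdfs://old-namenode:8020/user/data hdfs://new-namenode:8020/user/data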
Regarding your second question,
Again from the manual page
Important:
Before proceeding, you need to decide:
As a general rule:
The NameNode and JobTracker run on the same "master" host unless the cluster is large (more than a few tens of nodes), and the master host (or hosts) should not run the Secondary NameNode (if used), DataNode or TaskTracker services. In a large cluster, it is especially important that the Secondary NameNode (if used) runs on a separate machine from the NameNode. Each node in the cluster except the master host(s) should run the DataNode and TaskTracker services.
Additionally, if you use Cloudera Manager, it will automatically do all the necessary setup, i.e. install the selected components on the nodes in the cluster.
Off-topic: I had a bad habit of not referring to the manual properly. Have a close look at it; it answers all our questions.
Answer to your second question:
You can add them directly; the installation only has a few prerequisites such as openssh-clients, firewall settings, and Java.
These machines (the existing node and the three new nodes) should accept the same username and password, or you should set up passwordless SSH to these hosts (a minimal sketch is shown below).
You should be connected to the internet while adding the nodes.
I hope it will help you :)
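A minimal sketch of the passwordless-SSH setup, assuming the same admin user exists on each host (the user and host names are placeholders):

# on the Cloudera Manager host, generate a key pair once
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# copy the public key to every node you want to add
ssh-copy-id admin@node1
ssh-copy-id admin@node2
ssh-copy-id admin@node3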

Is there an Amazon community AMI for Hadoop/HBase?

I would like to test out Hadoop & HBase on Amazon EC2, but I am not sure how complicated it is. Is there a stable community AMI that has Hadoop & HBase installed? I am thinking of something like the bioconductor AMI.
Thank you.
I highly recommend using Amazon's Elastic MapReduce service, especially if you already have an AWS/EC2 account. The reasons are:
EMR comes with a working Hadoop/HBase cluster "out of the box" - you don't need to tune anything to get Hadoop/HBase working. It Just Works(TM).
Amazon EC2's networking is quite different from what you are likely used to. It has, AFAIK, a 1-to-1 NAT where the node sees its own private IP address but connects to the outside world via a public IP. When you are manually building a cluster, this causes problems - even with software like Apache Whirr or BigTop built specifically for EC2.
An AMI alone is not likely to get a Hadoop or HBase cluster up and running for you - if you want to run your own Hadoop/HBase cluster, you will likely have to spend time tweaking the networking settings etc.
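For reference, a hedged sketch of launching such a cluster with the current AWS CLI (the release label, instance type and key pair name are placeholders, not part of the original answer):

aws emr create-cluster --name "hbase-test" --release-label emr-5.30.0 \
  --applications Name=Hadoop Name=HBase \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles --ec2-attributes KeyName=my-key-pair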
To my knowledge there isn't, but you should be able to deploy on EC2 easily using Apache Whirr, which is a very good alternative.
Here is a good tutorial for doing this with Whirr; as the tutorial says, you should be able to get it running in minutes!
The key is creating a recipe like this:
whirr.cluster-name=hbase
whirr.instance-templates=1 zk+nn+jt+hbase-master,5 dn+tt+hbase-regionserver
whirr.provider=ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
whirr.hardware-id=c1.xlarge
whirr.image-id=us-east-1/ami-da0cf8b3
whirr.location-id=us-east-1
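The instance-templates line above requests one node running ZooKeeper (zk), the NameNode (nn), the JobTracker (jt) and the HBase master, plus five nodes each running a DataNode (dn), TaskTracker (tt) and an HBase region server.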
You will then be able to launch your cluster with:
bin/whirr launch-cluster --config hbase-ec2.properties
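When you are done testing, the same properties file can be used to tear the cluster back down with Whirr's destroy command:

bin/whirr destroy-cluster --config hbase-ec2.properties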
