Can I create a Hadoop cluster with a single VM? - hadoop

I am experienced in Java and want to get my hands dirty with Hadoop. I have gone through the basics and am now preparing for the practical side.
I started with the tutorial at https://developer.yahoo.com/hadoop/tutorial/ to set up and run Hadoop on a virtual machine.
So, to create a cluster I need multiple virtual machines running in parallel, right? And I need to add the IP address of each one to hadoop-site.xml. Or can I do it with a single virtual machine?

No, you cannot create a cluster with a single VM; a cluster, by definition, is a group of machines.
If your host machine has a good enough configuration, you can run 'n' guest OSes on top of it. That is how you can create a Hadoop cluster in VMs (1 NN, 1 SNN, 1 DN).
Alternatively, you can install Hadoop in pseudo-distributed mode (all services running on one machine), which behaves like a testing machine.
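For reference, a pseudo-distributed setup only needs a couple of properties pointing everything at localhost. A minimal sketch for Hadoop 2.x (the port and property names below are the common defaults for that line; older releases use fs.default.name in hadoop-site.xml instead, so treat the exact names as assumptions for your version):

```xml
<!-- core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

```xml
<!-- hdfs-site.xml: only one DataNode, so keep a single replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```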

You can set up a multi-node cluster using virtualization software such as Oracle VirtualBox. Create 5 nodes (1 NN, 1 SNN, 3 DN). Assign each node its IP address and set up all the configuration files on all the nodes. There are two files, masters and slaves: on the NN node, put the IP address of the SNN in the masters file and the IP addresses of all three DNs in the slaves file. Also set up passwordless SSH connectivity between all the nodes using public keys.
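As a concrete sketch of those two files on the NN (all IP addresses here are made-up examples for a host-only VirtualBox network):

```
# HADOOP_HOME/conf/masters on the NameNode: the SecondaryNameNode
192.168.56.102

# HADOOP_HOME/conf/slaves on the NameNode: one DataNode per line
192.168.56.103
192.168.56.104
192.168.56.105
```

For the SSH part, generating a key pair on the NN with ssh-keygen and copying it to each node with ssh-copy-id user@192.168.56.103 (and so on for the other addresses) gives the start-up scripts the passwordless logins they need.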

Related

Should I use the same configuration on all hosts of a Hadoop cluster?

I want to create a multi-host cluster for Hadoop. I want to install Apache Ambari on a server managing multiple hosts, but I have one point of confusion regarding the hosts: do they all need the same hardware configuration (RAM, processor, hard disk)? One of my hosts has 64 GB of RAM and the other two have 4 GB. Can I proceed with this setup, or is there anything wrong with it?
Your question sounds like a perfect use case for host config groups in Ambari. Just create 2 host groups with different memory settings.
It is common practice to have a few more powerful nodes for the Ambari server and the master services (NameNode, HBase Master, databases) and hundreds of less powerful slave nodes.

Determine whether slave nodes in a Hadoop cluster have been assigned tasks

I'm new to Hadoop and MapReduce. I just deployed a Hadoop cluster with one master machine and 32 slave machines. However, when I run an example program, it seems to run very slowly. How can I determine whether a map/reduce task has really been assigned to a slave node for execution?
The example program is executed like this:
hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 32 100
Okay, there are lots of possibilities here. Hadoop's purpose is to distribute work across machines.
So if your code is written in such a way that every step depends on the previous one, there is no benefit from 32 slaves; instead you only pay the overhead of managing the connections.
Check hadoopMasterIp:50070 to see whether all the DataNodes (slaves) are running. This assumes you did not change dfs.http.address in your hdfs-site.xml.
The easiest way is to take a look at the YARN Web UI. By default it uses port 8088 on your master node (replace master in the URI with your own host name or IP address):
http://master:8088/cluster
There you can see the total resources of your cluster and a list of all applications. For every application you can find out how many mappers/reducers were used and where (on which machines) they were executed.
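If you prefer the command line, much of the same information is available through the stock Hadoop 2.x tools (a sketch; run on the master and assumes the Hadoop bin directories are on your PATH):

```
yarn node -list          # NodeManagers that have registered, with their state
yarn application -list   # submitted/running applications and their progress
hdfs dfsadmin -report    # DataNodes: live/dead, per-node capacity and usage
```

If `yarn node -list` shows fewer than 32 nodes, some slaves never joined the cluster, which would explain the slowness.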

Dynamic IPs for Hadoop cluster

I need to set up a multi-node Hadoop cluster. So far, I have done installations using static IP addresses for each of the cluster nodes. However, in my latest cluster, I need to work with DHCP-assigned nodes. So I am wondering how I should get the cluster working and surviving restarts, etc.
Is it mandatory to have static IP addresses for the cluster nodes, or can we get it working with dynamic IPs as well?
Any expert guidance would be appreciated.
For standalone and pseudo-distributed modes, you can get by with a dynamic IP address, since everything runs on a single node.
For fully distributed mode, the nodes are identified through the masters and slaves files located in 'HADOOP_HOME/conf'. The names in them are hosts described in '/etc/hosts'. So when the IP of any node changes, Hadoop can no longer identify the machines (and even if it resolved the new addresses, they have no Hadoop configured against them). Thus you cannot keep a fully distributed Hadoop setup working on changing addresses.
Configure fixed leases on your DHCP server (e.g. on the router) if you can; otherwise run a DHCP server for the nodes' network that hands out fixed addresses. And get going!
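If the nodes must stay on DHCP, the usual compromise is a static lease (reservation) per node, so each MAC address always receives the same IP and the host names in /etc/hosts stay valid across reboots. A sketch for ISC dhcpd; every name, MAC and address below is a hypothetical example:

```
# /etc/dhcp/dhcpd.conf: pin each Hadoop node to a fixed address
host hadoop-master {
  hardware ethernet 08:00:27:aa:bb:01;
  fixed-address 192.168.1.10;
}
host hadoop-slave1 {
  hardware ethernet 08:00:27:aa:bb:02;
  fixed-address 192.168.1.11;
}
```

```
# /etc/hosts on every node then stays correct
192.168.1.10  hadoop-master
192.168.1.11  hadoop-slave1
```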

How to deploy a Cassandra cluster on two ec2 machines?

It's a known fact that it is not possible to create a cluster on a single machine just by changing ports. The workaround is to add virtual Ethernet devices to the machine and use these to configure the cluster.
I want to deploy a cluster of, let's say, 6 nodes on two EC2 instances, that is, 3 nodes on each machine. Is that possible? If so, what should the seed node addresses be?
Is it a good idea for production?
You can use the DataStax AMI on AWS. DataStax Enterprise is a suitable solution for production.
I am not sure about your 3-nodes-per-instance layout, because each node needs its own config files by default, and I have no idea how to change that.
There are simple instructions here. When you configure the instance settings, you have to pass advanced settings for the cluster, like --clustername yourCluster --totalnodes 6 --version community, etc. You can also install Cassandra manually by installing the latest Java and Cassandra yourself.
You can build the cluster by modifying /etc/cassandra/cassandra.yaml (Ubuntu 12.04), in particular the cluster_name, seeds, listen_address, rpc_address/broadcast_rpc_address and token fields. cluster_name has to be the same for the whole cluster. A seed is a contact-point node whose IP you should add to every node's seed list. I am still confused about the tokens.
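For reference, the relevant cassandra.yaml fields look roughly like this. All addresses are hypothetical, and note that the shipped file spells the fields listen_address and rpc_address / broadcast_rpc_address:

```yaml
# /etc/cassandra/cassandra.yaml: identical cluster_name and seed list on every node
cluster_name: 'yourCluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.4"   # e.g. one seed per EC2 instance
listen_address: 10.0.0.1             # this node's own address
rpc_address: 0.0.0.0
broadcast_rpc_address: 10.0.0.1
# num_tokens: 256                    # with vnodes, initial_token is normally left unset
```

With vnodes enabled (num_tokens), each node picks its token ranges automatically, which is why manual token assignment is rarely needed on recent versions.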

hadoop install on round-robin DNS

I want to install Hadoop in a round-robin DNS environment. I have a bunch of machines at our school that share a common user environment and a common DNS name. The machines are identical: each one has its own IP address and host name, but they all answer to the shared name, and when I log in, my terminal shows which machine I am on.
The problem is that when I make a change on one machine, the change applies to all the other machines.
I followed the instructions in michael-noll's multi-node Hadoop tutorial, which requires configuring a master node.
But whatever I do to the master node also applies to the slave nodes. That is, I cannot differentiate the master and slave nodes.
So, can I install Hadoop in such an environment?
I'm not quite sure why you would want to install Hadoop behind a round-robin DNS, but no, you cannot do this with Hadoop. Every single node needs its own unique host name.
