Installing Hadoop over 5 hard drives on a desktop - hadoop

I have been working with installing Hadoop. I followed some instruction on a Udemy course, and I installed Hadoop on pseudo distributed mode, on my laptop. It was fairly straightforward.
After that, I started to wonder if I could set up Hadoop on a desktop computer. So went out and bought an empty case and put in a 64 bit, 8 core AMD processor, along with a 50GB SSD hard drive and 4 inexpensive 500GB hard drives. I installed Ubuntu 14.04 on the SSD drive, and put virtual machines on the other drives.
I'm envisioning using my SSD as the master and using my 4 hard drives as nodes. Again, everything is living in the same case.
Unfortunately, and I've been searching everywhere, and I can't find any tutorials, guides, books, etc, that describe setting up Hadoop in this manner. It seems like most everything I've found that details installation of Hadoop is either a simple pseudo distributed setup (which I've already done), or else the instructions jump straight to large scale commercial applications. I'm still learning the basics, clearly, but I'd like to play in this sort-of in between place.
Has anyone done this before, and/or come across any documentation / tutorials / etc that describe how to set Hadoop up in this way? Many thanks in advance for the help.

You can run hadoop in different VM's which are located in different drives in the same system.
But you need to allocate same configurations for all the master and slave nodes
Also ensure that all the VM's having different ip addresses.
You can get different IP addresses by connecting your master computer to the LAN or you need to disable some functionality in VM machines in order to get different IP addresses.

if you done the hadoop installation in pseduo mode means then follow the below steps this may help you.
MULTINODE :
Configure the hosts in the network using the following settings in the host file. This has to be done in all machine [in namenode too].
sudo vi /etc/hosts
add the following lines in the file:
yourip1 master
yourip2 slave01
yourip3 slave02
yourip4 slave03
yourip5 slave04
[Save and exit – type ESC then :wq ]
Change the hostname for the namenode and datanodes.
sudo vi /etc/hostname
For master machine [namenode ] – master
For other machines – slave01 and slave02 and slave03 and slave04 and slave 05
Restart the machines in order to get the settings related to the network applied.
sudo shutdown -r now
Copy the keys from the master node to all datanodes, so as this will help to access the machines without asking for permissions everytime.
#ssh-copy-id –i ~/.ssh/id_rsa.pub hduser#slave01
#ssh-copy-id –i ~/.ssh/id_rsa.pub hduser#slave02
#ssh-copy-id –i ~/.ssh/id_rsa.pub hduser#slave03
#ssh-copy-id –i ~/.ssh/id_rsa.pub hduser#slave04
Now we are about to configure the hadoop configuration settings, so navigate to the ‘conf’ folder.
cd ~/hadoop/etc
Edit the slaves file within the hadoop directory.
vi ~/hadoop/conf/slaves
And add the below :
master
slave01
slave02
slave03
slave04
Now update localhost to master in core-site.xml,hdfs-site.xml,mapred-site.xml and yarn-site.xml
Now copy the files from the hadoop/etc/hadoop folder from master to slave machines.
then format you namenode in all machines.
and start the hadoop services.
I given you the some clues for how to configure the hadoop multinode cluster.

Never tried, but if you type ifconfig then it gives you same ipaddress on all the vm machines in hard drives. So this may not be the better option to go..
You can try creating Hadoop Cluster on Amazon EC2 for free using this step by step guide HERE
Or Video guide HERE
Hope it helps!

Related

Cloudera installation dfs.datanode.max.locked.memory issue on LXC

I have created virtual box, ubuntu 14.04LTS environment on my mac machine.
In virtual box of ubuntu, I've created cluster of three lxc-containers. One for master and other two nodes for slaves.
On master, I have started installation of CDH5 using following link http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
I have also made necessary changes in the /etc/hosts including FQDN and hostnames. Also created passwordless user named as "ubuntu".
While setting up the CDH5, during installation I'm constantly facing following error on datanodes. Max locked memory size: dfs.datanode.max.locked.memory of 922746880 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.
Exception in secureMain: java.lang.RuntimeException: Cannot start datanode because the configured max locked memory size (dfs.datanode.max.locked.memory) of 922746880 bytes is more than the datanode's available RLIMIT_MEMLOCK ulimit of 65536 bytes.
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1050)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:411)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2297)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2184)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2231)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2407)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2431)
Krunal,
This solution will be probably be late for you but maybe it can help somebody else so here it is. Make sure your ulimit is set correctly. But in case its a config issue.
goto:
/run/cloudera-scm-agent/process/
find latest config dir,
in this case:
1016-hdfs-DATANODE
search for parameter in this dir:
grep -rnw . -e "dfs.datanode.max.locked.memory"
./hdfs-site.xml:163: <name>dfs.datanode.max.locked.memory</name>
and edit the value to the one he is expecting in your case(65536)
I solved by opening a seperate tab in Cloudera and set the value from there

Hadoop removing a mount point folder from Cloudera

I've searched and I've been reading on Cloudera Hadoop on removing mount point file systems but I cannot find a thing on removing them.
I have two SSD drives in 6 machines and when I initially installed Cloudera Hadoop it added all file systems and I only need two mount points to run a few teragen and terasorts.
I need to remove everything except for:
/dev/nvme0n1 and /dev/nvme1n1
In Cloudera Manager you can modify the list of drives used for HDFS data at:
Clusters > HDFS > Configuration > DataNode Default Group (or whatever you may have renamed this to) > DataNode Data Directory

hadoop 2 node cluster setup

Can anyone please tell how to create a hadoop 2 node cluster ?i have 2 Virual machines both have hadoop single node setup installed properly,please tell how to form 2 node cluster from that ,and mention proper network setup for that.Thanks in Advance.i am using hadoop -1.0.3 version.
Hi can you do one thing for network configuration of vms.
Make sure of the network connect of virtual machine setting of hostonly.
And add two hostname and ips in /etc/hosts file.
add two hostnames in $HADOOP_HOME/conf/slaves file.
Make sure of secondary namenode hostname in $HADOOP_HOME/conf/masters file.
Then make sure same configuration on two machine.
hadoop namenode -format
start-all.sh
Best of luck.

Differences : Single-node and Multi-node

I'm trying to install Hadoop in a virtual machine, I found a tutorial explaining how to do that in a multi-node cluster .
So my question is what's the difference between a single-node and a multi-node cluster ?
Thanks in advance :)
Single node cluster : By default, Hadoop is configured to run in a non-distributed or standalone mode, as a single Java process. There are no daemons running and everything runs in a single JVM instance. HDFS is not used.
Pseudo-distributed or multi-node cluster: The Hadoop daemons run on a local machine, thus simulating a cluster on a small scale. Different Hadoop daemons run in different JVM instances, but on a single machine. HDFS is used instead of local FS
And you can have your work environment set up as follow
Step 1 - Download VMware player latest version and install on your laptop/desktop. You can also install VMware tools, which will be very useful for your working with guest OS.
Step 2 - Once you Step 1 is completed then Download Cloudera Quick Start VM from
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
Step 3 - Open VMPlayer program and click on “Open a virtual Machine”. and go to directory “cloudera-quickstart-vm-4.4.0-1-vmware”. Select cloudera-quickstart-vm-4.4.0-1-vmware. This will create a virtual machine instance in VM Player.
Step 4 - Start the Cloudera VM by Clicking on power on to start the Cloudera Demo VM.
You are good to go
Good luck

Hadoop cluster configuration with Ubuntu Master and Windows slave

Hi I am new to Hadoop.
Hadoop Version (2.2.0)
Goals:
Setup Hadoop standalone - Ubuntu 12 (Completed)
Setup Hadoop standalone - Windows 7 (cygwin being used for only sshd) (Completed)
Setup cluster with Ubuntu Master and Windows 7 slave (This is mostly for learning purposes and setting up a env for development) (Stuck)
Setup in relationship with the questions below:
Master running on Ubuntu with hadoop 2.2.0
Slaves running on Windows 7 with a self compiled version from hadoop 2.2.0 source. I am using cygwin only for the sshd
password less login setup and i am able to login both ways using ssh
from outside hadoop. Since my Ubuntu and Windows machine have
different usernames I have set up a config file in the .ssh folder
which maps Hosts with users
Questions:
In a cluster does the username in the master need to be same as in the slave. The reason I am asking this is that post configuration of the cluster when I try to use start-dfs.sh the logs say that they are able to ssh into the slave nodes but were not able to find the location "/home/xxx/hadoop/bin/hadoop-daemon.sh" in the slave. The "xxx" is my master username and not the slaveone. Also since my slave in pure Windows version the install is under C:/hadoop/... Does the master look at the env variable $HADOOP_HOME to check where the install is in the slave? Is there any other env variables that I need to set?
My goal was to use the Windows hadoop build on slave since hadoop is officially supporting windows now. But is it better to run the Linux build under cygwin to accomplish this. The question comes since I am seeing that the start-dfs.sh is trying to execute hadoop-daemon.sh and not some *.cmd.
If this setup works out in future, a possible question that I have is whether Pig, Mahout etc will run in this kind of a setup as I have not seen a build of Pig, Mahout for Windows. Does these components need to be present only on the master node or do they need to be in the slave nodes too. I saw 2 ways of running mahout when experimenting with standalone mode first using the mahout script which I was able to use in linux and second using the yarn jar command where I passed in the mahout jar while using the windows version. In the case Mahout/ Pig (when using the provided sh script) will assume that the slaves already have the jars in place then the Ubuntu + Windows combo does not seem to work. Please advice.
As I mentioned this is more as an experiment rather than an implementation plan. Our final env will be completely on linux. Thank you for your suggestions.
You may have more success going with more standard ways of deploying hadoop. Try out using ubuntu vm's for master and slaves.
You can also try to do a pseudo-distributed deployment in which all of the processes run on a single VM and thus avoid the need to even consider multiple os's.
I have only worked with the same username. In general SSH allows to login with a different login name with the -l command. But this might get tricky. You have to list your slaves in the slaves file.
At least at the manual https://hadoop.apache.org/docs/r0.19.1/cluster_setup.html#Slaves I did not find anything to add usernames. it might be worth trying to add -l login_name to the slavenode in the slave conf file and see if it works.

Resources