Differences : Single-node and Multi-node - hadoop

I'm trying to install Hadoop in a virtual machine, I found a tutorial explaining how to do that in a multi-node cluster .
So my question is what's the difference between a single-node and a multi-node cluster ?
Thanks in advance :)

Single node cluster : By default, Hadoop is configured to run in a non-distributed or standalone mode, as a single Java process. There are no daemons running and everything runs in a single JVM instance. HDFS is not used.
Pseudo-distributed or multi-node cluster: The Hadoop daemons run on a local machine, thus simulating a cluster on a small scale. Different Hadoop daemons run in different JVM instances, but on a single machine. HDFS is used instead of local FS

And you can have your work environment set up as follow
Step 1 - Download VMware player latest version and install on your laptop/desktop. You can also install VMware tools, which will be very useful for your working with guest OS.
Step 2 - Once you Step 1 is completed then Download Cloudera Quick Start VM from
http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo
Step 3 - Open VMPlayer program and click on “Open a virtual Machine”. and go to directory “cloudera-quickstart-vm-4.4.0-1-vmware”. Select cloudera-quickstart-vm-4.4.0-1-vmware. This will create a virtual machine instance in VM Player.
Step 4 - Start the Cloudera VM by Clicking on power on to start the Cloudera Demo VM.
You are good to go
Good luck

Related

How could I install Hortonworks' HDP?

I am new on this and I want to know how could I install the solution provided by Hortonworks, HDP, (http://hortonworks.com/products/data-center/hdp/) following the next specification: I have 2 Virtual Machines and another local machine to work with, and I want to use the 2 VM as Master node and Worker node by the time I configure Apache SPARK.
But my question is: what do I have to do to install HDP correctly? I have to install te solution in my local machine and configure Apache SPARK to use those 2 Virtual Machines as Master node and Worker node? Or I must to install HDP in the 3 machines that I have?
I repeat that I am new on this, and it wold be very helpfull for me any answer or comment that you could give.
Thank you, so much!
if you are trying to deploy HDP on a cluster (MultiNode Enviroment)
use Apache Ambari to Install HDP and other services such as spark.
I have tried it on centos environment below are the steps and links for installation.
get the repo by using the command below
wget
http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.2.2.0/ambari.repo
Install Ambari by using the command below
yum install ambari-server
Start the server by using the command below
ambari-server start
Now you can setup your by going to web browser and typing
http://:8080.
Now you can select the desired HDP version. Add the other nodes and services and deploy your cluster.
Detailed steps for installation can be found here

Spark Standalone Mode: Worker not starting properly in cloudera

I am new to the spark, After installing the spark using parcels available in the cloudera manager.
I have configured the files as shown in the below link from cloudera enterprise:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
After this setup, I have started all the nodes in the spark by running /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-all.sh. But I couldn't run the worker nodes as I got the specified error below.
[root#localhost sbin]# sh start-all.sh
org.apache.spark.deploy.master.Master running as process 32405. Stop it first.
root#localhost.localdomain's password:
localhost.localdomain: starting org.apache.spark.deploy.worker.Worker, logging to /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain: failed to launch org.apache.spark.deploy.worker.Worker:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
localhost.localdomain: at gnu.java.lang.MainThread.run(libgcj.so.10)
localhost.localdomain: full log in /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain:starting org.apac
When I run jps command, I got:
23367 Jps
28053 QuorumPeerMain
28218 SecondaryNameNode
32405 Master
28148 DataNode
7852 Main
28159 NameNode
I couldn't run the worker node properly. Actually I thought to install a standalone spark where the master and worker work on a single machine. In slaves file of spark directory, I given the address as "localhost.localdomin" which is my host name. I am not aware of this settings file. Please any one cloud help me out with this installation process. Actually I couldn't run the worker nodes. But I can start the master node.
Thanks & Regards,
bips
Please notice error info below:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
I met the same error when I installed and started Spark master/workers on CentOS 6.2 x86_64 after making sure that libgcj.x86_64 and libgcj.i686 had been installed on my server, finally I solved it. Below is my solution, wish it can help you.
It seem as if your JAVA_HOME environment parameter didn't set correctly.
Maybe, your JAVA_HOME links to system embedded java, e.g. java version "1.5.0".
Spark needs java version >= 1.6.0. If you are using java 1.5.0 to start Spark, you will see this error info.
Try to export JAVA_HOME="your java home path", then start Spark again.

Hadoop cluster configuration with Ubuntu Master and Windows slave

Hi I am new to Hadoop.
Hadoop Version (2.2.0)
Goals:
Setup Hadoop standalone - Ubuntu 12 (Completed)
Setup Hadoop standalone - Windows 7 (cygwin being used for only sshd) (Completed)
Setup cluster with Ubuntu Master and Windows 7 slave (This is mostly for learning purposes and setting up a env for development) (Stuck)
Setup in relationship with the questions below:
Master running on Ubuntu with hadoop 2.2.0
Slaves running on Windows 7 with a self compiled version from hadoop 2.2.0 source. I am using cygwin only for the sshd
password less login setup and i am able to login both ways using ssh
from outside hadoop. Since my Ubuntu and Windows machine have
different usernames I have set up a config file in the .ssh folder
which maps Hosts with users
Questions:
In a cluster does the username in the master need to be same as in the slave. The reason I am asking this is that post configuration of the cluster when I try to use start-dfs.sh the logs say that they are able to ssh into the slave nodes but were not able to find the location "/home/xxx/hadoop/bin/hadoop-daemon.sh" in the slave. The "xxx" is my master username and not the slaveone. Also since my slave in pure Windows version the install is under C:/hadoop/... Does the master look at the env variable $HADOOP_HOME to check where the install is in the slave? Is there any other env variables that I need to set?
My goal was to use the Windows hadoop build on slave since hadoop is officially supporting windows now. But is it better to run the Linux build under cygwin to accomplish this. The question comes since I am seeing that the start-dfs.sh is trying to execute hadoop-daemon.sh and not some *.cmd.
If this setup works out in future, a possible question that I have is whether Pig, Mahout etc will run in this kind of a setup as I have not seen a build of Pig, Mahout for Windows. Does these components need to be present only on the master node or do they need to be in the slave nodes too. I saw 2 ways of running mahout when experimenting with standalone mode first using the mahout script which I was able to use in linux and second using the yarn jar command where I passed in the mahout jar while using the windows version. In the case Mahout/ Pig (when using the provided sh script) will assume that the slaves already have the jars in place then the Ubuntu + Windows combo does not seem to work. Please advice.
As I mentioned this is more as an experiment rather than an implementation plan. Our final env will be completely on linux. Thank you for your suggestions.
You may have more success going with more standard ways of deploying hadoop. Try out using ubuntu vm's for master and slaves.
You can also try to do a pseudo-distributed deployment in which all of the processes run on a single VM and thus avoid the need to even consider multiple os's.
I have only worked with the same username. In general SSH allows to login with a different login name with the -l command. But this might get tricky. You have to list your slaves in the slaves file.
At least at the manual https://hadoop.apache.org/docs/r0.19.1/cluster_setup.html#Slaves I did not find anything to add usernames. it might be worth trying to add -l login_name to the slavenode in the slave conf file and see if it works.

Hadoop 2 node cluster

Is it necessary to have 2 linux machine for configuring two node hadoop cluster(ie 1 master 1 slave).Can it be done on single machine
Yes, you can run in local mode or pseudo-distributed mode, you don't necessarily need an actual cluster of machines.
See: http://hadoop.apache.org/docs/stable/single_node_setup.html

Hadoop CDH3 ERROR. Could not start Hadoop datanode daemon

I'm deploying Hadoop CDH3 in pseudo-distributed mode on a VPS.
So i have installed CDH3, then i have executed
sudo apt-get install hadoop-0.20-conf-pseudo
but if i try to start all daemons with
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
it throws
ERROR. Could not start Hadoop datanode daemon
The same installation and starting commands works on my notebook.
I don't understand the cause. In fact the log file is empty. The available RAM is about 900MB, with 98G of available disk space.
Which can be the cause or how can i discover it? I'm excluding that the error is from the configuration files.
Consider using Cloudera Manager, it could save you some time (especially if you use multiple nodes). There is a nice video on Youtube which shows deployment process

Resources