How do I safely remove a Hadoop datanode for maintenance? - hadoop

I want to take a single machine out of a Hadoop cluster temporarily.
Most documentation says to take it out by adding it to the yarn and dfs exclude files. I don't want to add it to the dfs.exclude and yarn.exclude files and decommission it with hdfs dfsadmin -refreshNodes, though, because I want to take it out, make some changes to the machine, and bring it back online as soon as possible. I don't want to copy hundreds of gigabytes of data around just to avoid under-replicated blocks!
Instead, I'd like to be able to power off the machine quickly while making sure:
The cluster as a whole is still operational.
No data is lost by the journalnode or nodemanager processes.
No YARN jobs fail or go AWOL when the process dies.
My best guess at how to do this is by issuing:
./hadoop-daemon.sh --hosts hostname stop datanode
./hadoop-daemon.sh --hosts hostname stop journalnode
./yarn-daemon.sh --hosts hostname stop nodemanager
And then starting each of these processes individually again when the machine comes back online.
Is that safe? And is there a more efficient way to do this?
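Wrapped in a script, the sequence above might look like the sketch below. This is only an illustration: the daemon scripts act on the local machine when invoked without a host list, so it would be run on the node itself, and the $HADOOP_HOME/sbin paths are an assumption about the installation layout.

```shell
# Sketch only: run on the node being taken down for maintenance.
# $HADOOP_HOME/sbin is an assumed location; adjust to your layout.
stop_node_daemons() {
    "$HADOOP_HOME/sbin/hadoop-daemon.sh" stop datanode
    "$HADOOP_HOME/sbin/hadoop-daemon.sh" stop journalnode
    "$HADOOP_HOME/sbin/yarn-daemon.sh" stop nodemanager
}

# After maintenance, bring the same daemons back up.
start_node_daemons() {
    "$HADOOP_HOME/sbin/hadoop-daemon.sh" start datanode
    "$HADOOP_HOME/sbin/hadoop-daemon.sh" start journalnode
    "$HADOOP_HOME/sbin/yarn-daemon.sh" start nodemanager
}
```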

Related

Restarting NameNode in Hadoop Cluster without format

For some reason I had to shut down the master node in my cluster. When we start the cluster again, the namenode won't run unless we format it again. Is there any solution to start the namenode without formatting? I've tried everything:
start-all.sh, and starting the namenode/datanodes individually, but the namenode won't start until I format it again. How can I start the namenode without formatting?
Thanks in advance
Please post the log information.
In fact, you shouldn't need to format when you restart Hadoop, because the HDFS metadata is stored on disk; if you format the namenode, that metadata will be lost.
You can check whether a namenode process still exists after you stop the cluster, using the command ps -e | grep java. If one does, kill it and start the namenode again.
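The check described above can be sketched as a small shell snippet (the awk filter on the process name is my own illustration, not part of the original answer):

```shell
# List the PIDs of any java processes still running after stop-all.sh.
# If a namenode is among them, kill it before starting the namenode again.
leftover=$(ps -e -o pid= -o comm= | awk '$2 == "java" {print $1}')
if [ -n "$leftover" ]; then
    echo "java processes still running: $leftover"
    # kill $leftover   # only after confirming (e.g. with jps) they are Hadoop daemons
else
    echo "no leftover java processes"
fi
```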

Nodemanager in unhealthy state

I am working with a hadoop-2.6.0 single-node cluster on Windows. When I submit any MapReduce job, it always stays in the ACCEPTED state. It seems my nodemanager is in an unhealthy state. How do I make it healthy? Why is the nodemanager unhealthy, and when will it go back to the healthy state?
Found the solution here
It seems the cause of the problem was low disk space on the drive where Hadoop is installed. After I freed up some space, the nodemanager automatically returned to the healthy state. As far as I can tell, we can't change the state of Hadoop nodes manually with any command.
When the job is in the ACCEPTED state, it means it is waiting for a datanode to accept it and start processing.
The following should be checked:
Check for available slots.
If slots are available and it is taking time for the status to change to RUNNING, check the datanodes' health using either Cloudera Manager or the hadoop dfsadmin command.
If there are dead nodes, restarting them should solve the issue.
You could also try adding this setting to yarn-site.xml:
name=yarn.nodemanager.disk-health-checker.enable value=false
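Expressed as a property block in the standard Hadoop XML configuration format, that setting looks like this (note it only hides the symptom: the disk-space check is disabled rather than the disk being cleaned up):

```xml
<property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>false</value>
</property>
```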

After restarting HBase, ZooKeeper logs Quorum.Learner: Got zxid 0x100000001 expected 0x1

I am performing some tests using HBase and Hadoop. I set up a cluster with one master, two ZooKeeper peers and four region servers. Up until yesterday everything was working perfectly well; starting from today it simply doesn't start anymore.
When executing start-hbase, all the processes come up:
HMaster using ports 8020 and 60010
HQuorumPeer using ports 2181 and 3888
HRegionServer
However, when I take a look at the server logs, it seems the servers got stuck for some reason...
HRegionServer stops after printing a WARNING about a native library that I was supposed to be using
HQuorumPeer on node 1 prints a WARNING about getting zxid 0x10000000001 expected 0x1
HQuorumPeer on node 2 prints nothing at all
Does anyone have any idea about this?
Thanks.
Well, I am far, far from being an HBase/Hadoop expert; in fact, this is just the first time I am playing around with it. The problem I faced was probably related to an improper shutdown or a corrupted file left behind by the HBase/Hadoop pair.
So here is my tip if you find yourself in the same situation:
cleanup all hbase logs, in my case at $HBASE_INSTALL/logs/*
cleanup all zookeeper data, in my case at /var/zookeeper/*
cleanup all hadoop data, in my case at /var/hadoop/*
cleanup all hdfs logs, in my case at /var/hdfs/log/*
cleanup all hdfs namenode data, in my case at /var/hdfs/namenode/*
cleanup all hdfs datanode data, in my case at /var/hdfs/datanode/*
format your hdfs cluster by typing the command hdfs namenode -format
IMPORTANT: Don't do that if you have data; you will probably lose all of it. I could do it because I am just using this cluster for test purposes.
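The cleanup steps above can be sketched as a small helper. The paths in the comment are the ones from this answer and will differ per installation; treat this as illustrative only, and never run it on a cluster whose data you care about.

```shell
# clean_dirs: remove the contents of each directory given as an argument,
# keeping the directory itself. Example paths from this answer:
#   $HBASE_INSTALL/logs /var/zookeeper /var/hadoop
#   /var/hdfs/log /var/hdfs/namenode /var/hdfs/datanode
# DESTRUCTIVE: use on throwaway test clusters only.
clean_dirs() {
    for d in "$@"; do
        # ${d:?} aborts if the argument is empty, so we never expand to "rm -rf /*".
        rm -rf "${d:?}"/* "${d:?}"/.[!.]* 2>/dev/null || true
        echo "cleaned: $d"
    done
}

# Afterwards, re-create the HDFS metadata from scratch (wipes the filesystem):
#   hdfs namenode -format
```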
I will keep reading about HBase/Hadoop in order to understand it better; in any case, I can say it is far from being a "plug and play" tool when compared to Cassandra.
Hope this can help.
Regards

SecondaryNamenode and MapReduce jobs

maybe that's a silly question... but anyway...
How can I tell that the secondary namenode is actually doing something (I mean, that it works)? Do I have to configure it to do something?
Also, do jobs in MapReduce run in parallel by default? I mean, does whatever you program in MR always run in parallel?
I'm asking because I have to prove (for a project I'm doing) that jobs on Hadoop run in parallel.
Thank you in advance.
P.S.: Sorry for my bad English; I hope I was understandable.
Yon, when you configure Hadoop you put the hostname of some machine into the conf/masters file. This is where your SNN will run. You can go to the terminal of that machine and run jps, which shows all the Java processes currently running. You should be able to see SecondaryNameNode along with the other processes. Something like this:
apache#hadoop:~$ jps
21615 TaskTracker
21268 SecondaryNameNode
21014 DataNode
27656 HRegionServer
21362 JobTracker
19908 org.eclipse.equinox.launcher_1.3.0.v20120522-1813.jar
17643 Jps
27364 HMaster
28451 Main
27194 HQuorumPeer
29811 RunJar
20744 NameNode
To cross-check, you could change this to some other machine and see the effect. Alternatively, you could check it via the SNN port, which is 50090 by default. Does that make sense?
And when you run an MR job, you can open the MapReduce web UI by pointing your browser at jobtracker_machine:50030. There you can see a list of all the jobs you are running (or have run previously), along with the total number of mappers/reducers created for a particular job. You can click on a job and it will show you all the mappers and reducers currently running on your cluster, with the progress of each. All these mappers/reducers run in parallel on different machines. To verify that, you can click on each mapper/reducer and it will show you the machine where it is running, along with its % completion.
HTH

How to separate Hadoop MapReduce from HDFS?

I'm curious whether you can essentially separate the HDFS filesystem from the MapReduce framework. I know that the main point of Hadoop is to run the maps and reduces on the machines with the data in question, but I was wondering if you could just change the *.xml files to change which machines the jobtracker, namenode and datanodes run on.
Currently, my configuration is a two-VM setup: one (the master) with the NameNode, DataNode, JobTracker, TaskTracker (and the SecondaryNameNode), the other (the slave) with a DataNode and TaskTracker. Essentially, what I want is to have the master run the NameNode, DataNode(s) and JobTracker, and the slave run only a TaskTracker to perform the computations (and, later on, more slaves with only TaskTrackers on them; one on each). The bottleneck will be the data transfer between the two VMs for the map and reduce computations, but since the data at this stage is so small, I'm not primarily concerned with it. I would just like to know whether this configuration is possible, and how to do it. Any tips?
Thanks!
You don't specify these kinds of options in the configuration files.
What you have to do is take care of which daemons you start on each machine (you call them VMs, but I think you mean machines).
I suppose you usually start everything using the start-all.sh script, which you can find in the bin directory under the Hadoop installation dir.
If you take a look at this script, you will see that it calls a number of sub-scripts that start the datanodes, tasktrackers, namenode and jobtracker.
In order to achieve what you've described, I would do the following:
Modify the masters and slaves files like this:
The masters file should contain the name of machine1
The slaves file should contain the name of machine2
Run start-mapred.sh
Then modify the masters and slaves files like this:
The masters file should contain machine1
The slaves file should contain machine1
Run start-dfs.sh
I have to tell you that I've never tried such a configuration, so I'm not sure it's going to work, but you can give it a try. Anyway, the solution lies in this direction!
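As a sketch, the file edits in the steps above could look like this. machine1 and machine2 are this answer's placeholder hostnames, and the $HADOOP_HOME/conf layout is an assumption matching a classic Hadoop 1.x installation.

```shell
# write_topology CONF_DIR MASTER SLAVE
# Overwrites the masters and slaves files that start-mapred.sh / start-dfs.sh read.
write_topology() {
    conf=$1 master=$2 slave=$3
    echo "$master" > "$conf/masters"
    echo "$slave" > "$conf/slaves"
}

# MapReduce daemons: JobTracker on machine1, TaskTracker on machine2:
#   write_topology "$HADOOP_HOME/conf" machine1 machine2
#   "$HADOOP_HOME/bin/start-mapred.sh"
# HDFS daemons: NameNode and DataNode both on machine1:
#   write_topology "$HADOOP_HOME/conf" machine1 machine1
#   "$HADOOP_HOME/bin/start-dfs.sh"
```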
Essentially, what I want to change is have the master with NameNode DataNode(s), JobTracker, and have the slave with only the TaskTracker to perform the computations (and later on, have more slaves with only TaskTrackers on them; one on each).
First, I am not sure why you would separate the computation from the storage: the whole purpose of MR data locality is lost, though you might still be able to run the job successfully.
Use the dfs.hosts and dfs.hosts.exclude parameters to control which datanodes can connect to the namenode, and the mapreduce.jobtracker.hosts.filename and mapreduce.jobtracker.hosts.exclude.filename parameters to control which tasktrackers can connect to the jobtracker. One disadvantage of this approach is that datanodes and tasktrackers are still started on the excluded nodes, even though they aren't part of the Hadoop cluster.
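In hdfs-site.xml the host-filter parameters mentioned above look like this (the file paths are illustrative; each listed file contains one hostname per line):

```xml
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/datanodes.include</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/datanodes.exclude</value>
</property>
```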
Another approach is to modify the code to have a separate slave file for the tasktracker and the datanode. Currently, this is not supported in Hadoop and would require a code change.
