Hadoop Quirky Behavior

Is it possible that the datanode sometimes fails to start when running start-all.sh, but starts fine after a reboot of the machine? What could cause such quirky behavior?
Could other Java processes running in the same namespace corrupt the Hadoop processes?

Here are some of the things I observed.
hadoop namenode -format: this command needs to be executed before the first start-all.sh; otherwise the namenode doesn't start.
Also check that the application you are using on top of Hadoop matches the Hadoop version it expects. In my case I was using Hadoop 0.20.203; after I switched to 0.20.2, my problem was solved.
Check the logs; they often give valuable insight into what has gone wrong.
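As a rough sketch of that checklist (assuming a tarball install with the default log layout; adjust the paths to your setup):

# Format HDFS once, before the very first start.
# WARNING: this wipes existing HDFS metadata, so never rerun it on a live cluster.
bin/hadoop namenode -format

# Start all daemons, then confirm the DataNode JVM actually came up.
bin/start-all.sh
jps

# If the DataNode is missing, its log usually names the real cause
# (stale PID file, namespace-ID mismatch after a reformat, a port already in use, ...).
tail -n 50 logs/hadoop-$USER-datanode-$(hostname).log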

Related

How to isolate a Hadoop program to certain cores

We are trying to run a Hadoop 1.2.1 single-node cluster on a 4-core Intel server, and we want the Hadoop processes to use only 3 of the 4 cores. We tried taskset, but it didn't work. We tried cset as well, with no better results.
Our theory is that the init script which spawns the Hadoop processes (NameNode, DataNode, etc.) sets their affinity to all cores, so we couldn't manually change their affinity afterwards using taskset.
Is there any option in hadoop-env.sh that we can use?
I couldn't find any documentation online.
Any help would be appreciated.
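No accepted answer appeared in the source thread, but one workaround worth trying (a sketch only; whether task JVMs spawned later inherit the new affinity is exactly the open question above) is to re-pin the already-running daemons instead of setting affinity at launch:

# Re-pin every Hadoop 1.x daemon, threads included (-a), to cores 0-2.
for pid in $(jps | egrep 'NameNode|DataNode|JobTracker|TaskTracker' | awk '{print $1}'); do
    taskset -acp 0-2 "$pid"
done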

What is the exact difference between pseudo mode and standalone mode in Hadoop?

What is the exact difference between pseudo-distributed mode and standalone mode in Hadoop, and how can we tell which one we are running when working on our own laptop/desktop?
The differences are the ones described in the product documentation:
Standalone Operation: By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
Pseudo-Distributed Operation: Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Unless you want to debug Hadoop code, you should always run in pseudo-distributed mode.
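A quick way to check which mode a local install is in (assuming the classic Hadoop 1.x config key; in Hadoop 2 it is fs.defaultFS) is to look at conf/core-site.xml. In standalone mode the file contains an empty <configuration/> element, so fs.default.name falls back to the built-in default file:/// and everything runs against the local filesystem. In pseudo-distributed mode it points at a single-node HDFS, for example:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>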

Hadoop is sometimes too slow (stuck at 100%)

I set up a cluster of ten machines on which I installed CDH4 (YARN).
I run the NameNode, the ResourceManager, and the history server on the same node, and the client on another node.
On the rest of the machines I run a DataNode and a NodeManager.
I launched my application on a 100 GB file. It worked at first and was relatively quick, but now it gets really slow at the end of the map phase (going from about 90% to 100% takes 30 minutes).
I don't know if the problem comes from the way I coded the program or from the way I configured Cloudera CDH4.
The strange part is that it sometimes works and sometimes doesn't, even though I changed nothing in between.
I found out why it took so long at the end: I thought the command hadoop fs -expunge would empty the trash, but it doesn't (it only removes trash checkpoints older than the retention interval). So when Hadoop tried to write the results to HDFS, it was very slow because there was very little space left.
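For anyone hitting the same wall, here is a sketch of the checks that would have caught this sooner (assuming CDH4's Hadoop 2 shell commands):

# How full is HDFS really?
hdfs dfsadmin -report | head -n 20

# 'hadoop fs -expunge' only removes trash checkpoints older than
# fs.trash.interval; to reclaim the space immediately, delete the
# trash directory itself.
hadoop fs -rm -r -skipTrash /user/$USER/.Trash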

Dataflow difference in Hadoop standalone and pseudo-distributed mode?

Can someone please tell me the difference in dataflow between Hadoop standalone and pseudo-distributed mode? I am trying to run the matrix multiplication example by John Norstad. It runs fine in standalone mode but does not work properly in pseudo-distributed mode. I am unable to fix the problem, so please tell me the principal difference between the two modes that might help me fix it. Thanks
Regards,
WL
In standalone mode everything (namenode, datanode, tasktracker, jobtracker) runs in one JVM on one machine. In pseudo-distributed mode, each runs in its own JVM, but still on one machine. In terms of the client interface there shouldn't be any difference, but I wouldn't be surprised if the serialization requirements are stricter in pseudo-distributed mode.
My reasoning is that in pseudo-distributed mode, everything must be serialized to pass data between JVMs. In standalone mode that isn't strictly necessary (since everything is in one JVM, you have shared memory), but I don't remember whether the code is written to take advantage of that fact, since that's not a normal use case for Hadoop.
EDIT: Given that you are not seeing an error, it sounds like a problem in the way the MapReduce job is coded. Perhaps it relies on something like shared memory among the reducers? If so, that would work in standalone mode but not in pseudo-distributed mode (or truly distributed mode, for that matter).
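You can see the one-JVM-versus-many-JVMs difference directly with jps (illustrative output; your PIDs will differ). In standalone mode only the client JVM (RunJar) is running, while pseudo-distributed mode shows one JVM per daemon:

$ jps    # standalone
4488 RunJar
4521 Jps

$ jps    # pseudo-distributed
5012 NameNode
5130 DataNode
5244 JobTracker
5361 TaskTracker
5478 Jps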

How can I tell if a Hadoop namenode has already been formatted?

When configuring my hadoop namenode for the first time, I know I need to run
bin/hadoop namenode -format
but running this a second time, after loading data into HDFS, will wipe out everything and reformat. Is there an easy way to tell if a namenode has already been formatted?
Well, you can check for this file:
store1/name/current/VERSION
If it exists, the namenode has already been formatted.
PS: In a production system you format only once, ever. It's best done as part of the installation process, or manually during an emergency restoration.
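A simple guard based on that check (using the store1/name directory from the answer above; substitute your own dfs.name.dir):

# Only format if the namenode storage has never been formatted.
if [ ! -f store1/name/current/VERSION ]; then
    bin/hadoop namenode -format
else
    echo "Namenode already formatted; skipping format."
fi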
