We are running a Hadoop 1.2.1 single-node cluster on a 4-core Intel server and want to restrict the Hadoop processes to 3 of the 4 cores. We have tried taskset, but it didn't work, and cset was no better.
Our theory is that the init process which spawns the Hadoop daemons (NameNode, DataNode, etc.) gives them affinity to all cores, so we couldn't change their affinity manually with taskset afterwards.
Is there any option in hadoop-env.sh that we can use?
I couldn't find any documentation online.
Any help would be appreciated.
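As far as I know, hadoop-env.sh has no core-affinity setting, but since CPU affinity is inherited across fork/exec, one workaround is to launch each daemon itself under taskset so every JVM it forks inherits the 3-core mask. A minimal sketch for a single-node Hadoop 1.2.1 install (not an official Hadoop option, just plain taskset applied at daemon start-up):

    # stop whatever is currently running
    $HADOOP_HOME/bin/stop-all.sh

    # start each daemon pinned to cores 0-2; hadoop-daemon.sh launches the JVM
    # locally (no ssh hop), so it inherits the affinity mask, and the
    # TaskTracker's child task JVMs inherit it in turn
    for daemon in namenode datanode secondarynamenode jobtracker tasktracker; do
        taskset -c 0-2 $HADOOP_HOME/bin/hadoop-daemon.sh start $daemon
    done

    # verify the mask on a running daemon, e.g. the DataNode
    taskset -cp $(pgrep -f DataNode | head -n 1)

Note that start-all.sh starts some of the daemons through ssh even on localhost, which may explain why wrapping it in taskset did not stick.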
I installed Hadoop 2.7.2 as a single node, mostly by hand following an online tutorial, and then, also mostly by hand, installed Spark from spark-1.6.0-bin-hadoop2.6.tgz, a build that claims to work with Hadoop 2.6+.
I made no effort to configure Spark; I just started using it from the interactive Python shell, and it worked immediately, which was itself a little surprising, but anyway...
Then I decided to run an example to see whether it scales properly vertically. My box has 4 CPUs, so I chose an easy-to-parallelize job, i.e. the Pi computation: http://spark.apache.org/examples.html.
Surprisingly, this is the result (the box is an old 4-core machine):
and, from the PySpark console:
It seems to me that it scales perfectly across the 4 CPU cores I have.
Problem: I did not configure the number of cores, and I don't know why it used all of them; this is typically the kind of behavior that will not work in production, and it would be hard to explain what has to be configured. Is there some YARN/Spark feature that automatically scales to the cores available?
Another question: is this using YARN at all? How can I tell whether my Spark installation uses YARN as its cluster manager?
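Partial answer, assuming a default setup like the one described: if you do not pass a master, the PySpark shell runs with master local[*], i.e. plain local mode that uses every core it can see and does not involve YARN at all, which would explain the effortless scaling to 4 cores. A quick sketch of how to pin the core count or ask for YARN explicitly (standard spark-submit/pyspark flags, Spark 1.6 syntax):

    # run the shell on an explicit number of local cores instead of local[*]
    pyspark --master local[3]

    # actually go through YARN (requires HADOOP_CONF_DIR to point at your
    # Hadoop configuration so Spark can find the ResourceManager)
    pyspark --master yarn-client

    # the same flags work for batch jobs (your_script.py is a placeholder name)
    spark-submit --master local[3] your_script.py

Inside the shell, sc.master tells you which cluster manager is in use; with local[*] no YARN is involved. If a job really runs on YARN it also shows up in the ResourceManager web UI (port 8088 by default).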
What is the exact difference between pseudo-distributed mode and standalone mode in Hadoop?
How can we tell which one we are running when working on our own laptop/desktop?
The differences are the ones described in the product documentation:
Standalone Operation: By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
Pseudo-Distributed Operation: Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Unless you want to debug Hadoop code, you should always run in pseudo-distributed mode.
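As for telling which of the two you are actually running on your own machine, a quick check (just a sketch; the config directory is conf/ in Hadoop 1.x and etc/hadoop/ in 2.x):

    # pseudo-distributed mode means the daemons really run as separate JVMs,
    # so jps should list NameNode, DataNode, etc.; in standalone mode it won't
    jps

    # standalone is the out-of-the-box default: core-site.xml has no
    # fs.default.name (1.x) / fs.defaultFS (2.x) entry pointing at hdfs://localhost
    grep -A1 'fs.default' $HADOOP_HOME/conf/core-site.xml        # Hadoop 1.x
    grep -A1 'fs.default' $HADOOP_HOME/etc/hadoop/core-site.xml  # Hadoop 2.x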
I need to parallelize code written in Tcl/Tk on a cluster of 12 nodes; my account is registered for access to 4 of the nodes, and every node has 4 processors with 2 cores each. The OS is 64-bit Linux CentOS, and I have both the admin password and a normal user password.
So my question: can Tcl/Tk be installed on each node individually, in a way that does not change the version that works on the other nodes? And if it can be done, how do I link to or launch that particular version to compile and run the code?
Or does it have to be done for the whole cluster, with everything currently running stopped for all other users while the installation is updated on every node for everyone?
The code works with tcl/tk 8.4.
The existing installation is a non-threaded 8.4.7. I installed a new threaded version (8.4.14) into a directory on one of the nodes, but running tclsh still gives me the old version. Should this be fixed by changing $PATH?
When I issue echo $PATH on a node I cannot see any Tcl-specific directory; maybe it is installed in a system location, but I don't know whether that location is shared by the whole cluster or local to the node.
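One common approach, sketched below under the assumption that you can write to your own home directory (the paths are examples, not your actual layout): build the threaded Tcl into a private prefix, which leaves the system-wide 8.4.7 untouched for everyone else, and then put that prefix first in your $PATH so tclsh resolves to your build.

    # build a threaded Tcl 8.4.x into ~/tcl without touching the system install
    cd tcl8.4.14/unix                       # example source directory
    ./configure --prefix=$HOME/tcl --enable-threads
    make && make install

    # put the private build first on the search path; add this line to ~/.bashrc
    # on each node (or once, if /home is shared across the cluster)
    export PATH=$HOME/tcl/bin:$PATH

    # confirm which interpreter is now found and what it reports
    which tclsh8.4
    echo 'puts [info patchlevel]' | tclsh8.4

If echo $PATH shows no Tcl directory, the system interpreter is most likely being picked up from a standard location such as /usr/bin, which on a typical CentOS cluster is local to each node rather than shared.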
Can someone please tell me what the difference in dataflow is between Hadoop standalone and pseudo-distributed mode? In fact, I am trying to run the matrix multiplication example presented by John Norstad. It runs fine in standalone mode but does not work properly in pseudo-distributed mode. I am unable to fix the problem, so please explain the principal difference between the two modes; that may help with fixing the stated problem. Thanks
Regards,
WL
In standalone mode everything (NameNode, DataNode, TaskTracker, JobTracker) runs in one JVM on one machine. In pseudo-distributed mode, each of these runs in its own JVM, but still on one machine. In terms of the client interface there shouldn't be any difference, but I wouldn't be surprised if the serialization requirements are stricter in pseudo-distributed mode.
My reasoning for the above is that in pseudo-distributed mode, everything must be serialized to pass data between JVMs. In standalone mode, it isn't strictly necessary for everything to be serializable (since everything is in one JVM, you have shared memory), but I don't remember if the code is written to take advantage of that fact, since that's not a normal use case for Hadoop.
EDIT: Given that you are not seeing an error, it sounds like a problem in the way the MapReduce job is coded. Perhaps it relies on something like shared memory among the reducers? If so, that would work in standalone mode but not in pseudo-distributed mode (or in truly distributed mode, for that matter).
Is it possible that the DataNode sometimes fails to start when running start-all.sh, but starts fine after the computer is rebooted? What could cause such quirky behavior?
Could other Java processes running in the same namespace be interfering with the Hadoop processes?
Here are some of the things I observed.
hadoop namenode -format: this command needs to be executed before you run start-all.sh, otherwise the NameNode doesn't start.
Also check that the application you are using on top of Hadoop matches the Hadoop version it expects. In my case I was using Hadoop 0.20.203; after I switched to 0.20.2 my problem was solved.
Check the logs; they often give valuable insights into what has gone wrong.
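A rough checklist along those lines (a sketch; exact paths depend on the installation):

    # format HDFS once, before the very first start; re-formatting later gives the
    # NameNode a new namespaceID, and the resulting mismatch with the old DataNode
    # data directory is a classic reason for a DataNode refusing to start
    hadoop namenode -format

    # start the daemons and confirm they are all up
    start-all.sh
    jps        # should list NameNode, DataNode, JobTracker, TaskTracker, ...

    # confirm which Hadoop version you are actually running
    hadoop version

    # if the DataNode is missing from jps, its log usually says why
    tail -n 50 $HADOOP_HOME/logs/hadoop-*-datanode-*.log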