Dataflow difference between Hadoop standalone and pseudo-distributed mode?

Can someone please tell me the difference in dataflow between Hadoop standalone and pseudo-distributed mode? I am trying to run the matrix multiplication example presented by John Norstad. It runs fine in standalone mode but does not work properly in pseudo-distributed mode. I am unable to fix the problem, so please tell me the principal differences between the two modes that might help me fix it. Thanks
Regards,
WL

In standalone mode everything runs in a single JVM on one machine; there are no separate Hadoop daemons. In pseudo-distributed mode each daemon (namenode, datanode, jobtracker, tasktracker) runs in its own JVM, but still on one machine. In terms of the client interface there shouldn't be any difference, but I wouldn't be surprised if the serialization requirements are stricter in pseudo-distributed mode.
My reasoning for the above is that in pseudo-distributed mode, everything must be serialized to pass data between JVMs. In standalone mode, it isn't strictly necessary for everything to be serializable (since everything is in one JVM, you have shared memory), but I don't remember if the code is written to take advantage of that fact, since that's not a normal use case for Hadoop.
EDIT: Given that you are not seeing an error, I think it sounds like a problem in the way the MapReduce job is coded. Perhaps he relies on something like shared memory among the reducers? If so, that would work in standalone mode but not in pseudo-distributed mode (or truly distributed mode, for that matter).
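To make that concrete, here is a minimal sketch (my own illustration, not taken from Norstad's matrix-multiplication code) of the kind of pattern that behaves differently in the two modes: a reducer accumulating into a static field. In standalone mode all tasks run in one JVM, so the field really is shared; in pseudo-distributed (or fully distributed) mode each task gets its own JVM and therefore its own copy, and the output silently changes.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer, only for illustration.
    public class SharedStateReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        // JVM-wide state: shared by every reduce call in standalone mode (one JVM),
        // but private to each task once tasks run in separate JVMs.
        private static int runningTotal = 0;

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable v : values) {
                runningTotal += v.get();
            }
            // Emits a value that depends on cross-task state, so it is only
            // "correct" when all reduce tasks happen to share one JVM.
            context.write(key, new IntWritable(runningTotal));
        }
    }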

Related

What's the difference between spark-shell and submitted sbt programs?

Spark-shell can be used to interact with data in distributed storage, so what is the essential difference between coding in spark-shell and packaging an independent application with sbt and submitting it to the cluster? (One difference I found is that a job submitted via sbt can be seen in the cluster management interface, while shell jobs cannot.) After all, sbt is rather cumbersome and the shell is very convenient.
Thanks a lot!
Spark-shell gives you a bare console-like interface in which you can run your code as individual commands. This can be very useful if you're still experimenting with packages or debugging your code.
One difference I found is that a job submitted via sbt can be seen in the cluster management interface, while shell jobs cannot
Actually, spark-shell also shows up in the job UI as "Spark-Shell" itself, and you can monitor the jobs you are running through that.
Building Spark applications with sbt gives your development process some organization: incremental compilation helps in day-to-day development, and a lot of manual work is avoided. If there is a fixed set of things you always run, you can simply re-run the same packaged jar instead of going through the trouble of typing the whole thing out as commands. sbt does take some getting used to if you are new to Java-style development, but it can help you maintain applications in the long run.
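To make the comparison concrete, here is a minimal sketch of a packaged application (Spark's Java API is used here for brevity; an sbt project would usually hold the same logic in Scala, and the class name is made up). The two transformation lines are exactly what you would type into spark-shell against the ready-made sc; packaging them means you can re-run the job with spark-submit --class WordLengthApp <your-jar> instead of retyping it.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical minimal application, equivalent to a few lines typed in spark-shell.
    public class WordLengthApp {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("WordLengthApp");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Same logic you would run interactively in the shell.
            long longWords = sc.parallelize(Arrays.asList("spark", "shell", "sbt", "application"))
                               .filter(w -> w.length() > 3)
                               .count();

            System.out.println("Words longer than 3 characters: " + longWords);
            sc.stop();
        }
    }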

Spark seems to scale vertically with no configuration effort

I did a single-node install of Hadoop 2.7.2, almost manually, following a tutorial on the internet; then, again almost manually, I installed Spark from spark-1.6.0-bin-hadoop2.6.tgz, a version that claims to work with Hadoop 2.6+.
I made no effort to configure Spark and just started using it from interactive Python: it worked immediately, which is a little surprising too, but anyway...
Then I decided to run an example to see whether it scales properly vertically. My box has 4 CPU cores, so I decided to run an easy-to-parallelize job, i.e. the Pi computation: http://spark.apache.org/examples.html.
Surprisingly, the result is this (the box is an old 4-core machine):
and, from the PySpark console:
It seems to me that it scales perfectly across the 4 CPU cores I have.
Problem: I did not configure the number of cores, and I don't know why it worked; this is behavior that will typically not carry over to production, and it would be hard to explain what needs to be configured. Is there some YARN/Spark feature that automatically scales to the available cores?
Another question: is this using YARN at all? How can I tell whether my Spark install leverages YARN as the cluster manager?
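For reference, the job is essentially the Monte Carlo estimator from that page; the sketch below is my own reconstruction with Spark's Java API rather than the exact PySpark snippet I ran, and the local[*] master is written out explicitly even though I did not set one (as far as I know, a shell with no configured master defaults to local[*], which would explain the job spreading over every core).

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Rough reconstruction of the Pi example, not the exact code from the page.
    public class SparkPi {
        public static void main(String[] args) {
            // local[*] = use every core of the local machine (no YARN involved).
            SparkConf conf = new SparkConf().setAppName("SparkPi").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            int samples = 1_000_000;
            List<Integer> trials = new ArrayList<>(samples);
            for (int i = 0; i < samples; i++) trials.add(i);

            long inside = sc.parallelize(trials)
                            .filter(i -> {
                                double x = Math.random(), y = Math.random();
                                return x * x + y * y < 1;
                            })
                            .count();

            System.out.println("Pi is roughly " + 4.0 * inside / samples);
            sc.stop();
        }
    }

(In the PySpark console, printing sc.master shows which cluster manager is in use: a local[...] value means no YARN, while a YARN-managed shell reports a yarn master.)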

What is the exact difference between pseudo mode and stand alone mode in hadoop?

How can we tell which of the two we are running when working on our own laptop/desktop?
The differences are the ones described in the product documentation:
Standalone Operation: By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
Pseudo-Distributed Operation: Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Unless you want to debug Hadoop code, you should always run in pseudo-distributed mode.
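One quick way to see which of the two a given install is configured for, from code, is to check what the default filesystem resolves to: the stock (standalone) configuration leaves it at the local file:/// default, whereas the pseudo-distributed setup points it at an HDFS daemon on localhost. A small sketch (the class name is mine; run it with the Hadoop configuration directory on the classpath):

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical helper: report which mode the local install is configured for,
    // judging by the default filesystem loaded from core-site.xml.
    public class WhichMode {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hadoop 2.x names this fs.defaultFS; 1.x used fs.default.name.
            // Both fall back to the local filesystem when core-site.xml is untouched.
            String fs = conf.get("fs.defaultFS", conf.get("fs.default.name", "file:///"));
            System.out.println("Default filesystem: " + fs);
            // file:///             -> standalone defaults (no daemons, one local JVM per job)
            // hdfs://localhost:... -> pseudo-distributed (daemons running on this machine)
        }
    }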

MRUnit with Standalone Mode

Is it possible to run MRUnit in standalone mode? I want to combine the benefits of isolated mappers/reducers with a transparent and simple output check that still reads from local disk (I want to test a particular FileSystem implementation).
Yes, MRUnit runs in standalone mode, IIRC. Just make sure to set fs.default.name to 'local' in your configuration.
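For instance, a minimal test might look like the sketch below (MyMapper stands in for a hypothetical mapper that emits the first word of each line with a count of 1; it is not part of MRUnit).

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Before;
    import org.junit.Test;

    public class MyMapperTest {

        private MapDriver<LongWritable, Text, Text, IntWritable> driver;

        @Before
        public void setUp() {
            driver = MapDriver.newMapDriver(new MyMapper());
            // Point the job at the local filesystem, as suggested above; newer
            // Hadoop versions spell this fs.defaultFS with a file:/// value.
            driver.getConfiguration().set("fs.default.name", "local");
        }

        @Test
        public void emitsExpectedPair() throws IOException {
            driver.withInput(new LongWritable(0), new Text("some input line"))
                  .withOutput(new Text("some"), new IntWritable(1))
                  .runTest();
        }
    }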

Hadoop Quirky Behavior

Is it possible that the datanode sometimes doesn't start when running start-all.sh, but starts fine after you restart the computer? What could be the cause of such quirky behavior?
Could other Java processes running in the same namespace corrupt the Hadoop processes?
Here are some of the things I observed.
hadoop namenode -format: this command needs to be executed before you run start-all.sh, otherwise the namenode doesn't start.
Also check that the version of Hadoop you are running matches the version that the application you use on top of Hadoop expects. In my case I was using Hadoop 20.203; when I switched to 20.2 my problem was solved.
Check the logs; they often give valuable insight into what has gone wrong.
