MRUnit with Standalone Mode - hadoop

Is it possible to run MRUnit in standalone mode? To combine the benefits of isolated mappers/reducers with a transparent and simple output check that still reads from local disk (I want to test a particular FileSYstem implementation).

Yes, mrunit runs in standalone mode, iirc. Just make sure to set fs.default.name to 'local' in your configuration.

Related

JDK Flight Recorder on CI/CD Plattform

I am currently using JDK Flight Recorder with JDK 11 and came across some trouble in the CI/CD Plattform. Unfortunately, there is not too much documentation on the new Flight Recorder, but rather on the older one, which was still developed under Java.
When I try to start tests directly from the IDE, everything works fine and I get my recording files.
When I try to do the same thing, automatically, in the CI/CD Plattform, it causes time out and a lot of different indefinite failures, among them: trouble creating the file, the file is not even written, etc.
The JVM commands I used are the following (I put extra spaces for better readability):
-XX:+FlightRecorder
-XX:StartFlightRecording= name="UiTestServer", settings="profile", dumponexit=true, filename=""+System.getenv("CI_PROJECT_DIR") + "flightRecording/javaFlightRecorder.jfr"
The commands are the same that the IDE uses automatically, when starting the flight recording with right click on the specified test.
Does anybody know, whether the Flight Recorder has problems with such systems or specific services which might run parallely to it? I heard of some profiling tools, that are unable to perform on CI Plattforms.
If you need more detail, just ask me. Though, it might happen that I cannot tell anything related to the project.
Bit late as an answer, but JFR can definitely run in CI/CD environments. I have successfully attached JFR to our JMH microbenchmarks and published the results as artifacts in Atlassian Bamboo. Our Bamboo agents are running on AWS, so JFR itself should be good for most cloud environments.
JFR has been built to work in production systems, but if you want guarantees of low overhead (<1%), you should use the default settings, not profile.
'profile' is for a shorter period of time, i.e. 10 minutes, where it may be OK with additional overhead to gain more insight.
This is what I would recommend, for JDK 11 and later:
$ java -XX:StartFlightRecording:filename=/path
There is no need set dumponexit=true if a filename has been specified.
-XX:+FlightRecorder is only needed before JDK 8u40.
You can set a name if you like, but it's typically not needed. If you want to use jcmd and dump a recording, the name can be omitted.

what's the difference between spark-shell and submitted sbt programs

Spark-shell can be used to interact with the distributed storage of data, then what is the essential difference between coding in spark-shell and uploading packaged sbt independent applications to the cluster operation?(I found a difference is sbt submit the job can be seen in the cluster management interface, and the shell can not) After all, sbt is very troublesome, and the shell is very convenient.
Thanks a lot!
Spark-shell gives you a bare console-like interface in which you can run your codes like individual commands. This can be very useful if you're still experimenting with the packages or debugging your code.
I found a difference is sbt submit the job can be seen in the cluster management interface, and the shell can not
Actually, spark shell also comes up in the job UI as "Spark-Shell" itself and you can monitor the jobs you are running through that.
Building spark applications using SBT gives you some organization in your development process, iterative compilation which is helpful in day-to-day development, and a lot of manual work can be avoided by this. If you have a constant set of things that you always run, you can simply run the same package again instead of going through the trouble of running the entire thing like commands. SBT does take some time getting used to if you are new to java style of development, but it can help maintain applications on the long run.

What is the exact difference between pseudo mode and stand alone mode in hadoop?

What is the exact difference between pseudo mode and stand alone mode in hadoop?
How can we get to know that when working on our own laptop / desktop?
The differences are the one described in the product documentation:
Standalone Operation: By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
Pseudo-Distributed Operation: Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Unless you want to debug Hadoop code, you should always run in pseudo-distributed.

how to add GUI to a hadoop program?

I use hadoop to write a Mapreduce program which is able to deploy to ec2 and local cluster, I am fine to use the command line to run the program, but is there any way to add interface to the hadoop program, so that users just need to click and run the program instead of using command line? Thanks!
I'm not sure exactly what you want, but I think your asking if there is an UI for submitting map reduce jobs hadoop? if so, you should try Hue: http://cloudera.github.com/hue/

DataFlow difference in Hadoop Standalone and Pseudodistributed mode?

Can someone please tell me what is the difference in dataflow of Hadoop Standalone and Pseudodistributed mode. Infact I am trying to run an example of matrix multiplication presented by John Norstad. It runs fine in hadoop standalone mode but does not work properly in pseudodistributed mode. I am unable to fix the problem so please tell me the principle difference between hadoop standalone and pseudodistributed mode which can be helpful for fixing the stated problem.Thanks
Reagrds,
WL
In standalone mode everything (namenode, datanode, tasktracker, jobtracker) is running in one JVM on one machine. In pseudo-distributed mode, everything is running each in it's own JVM, but still on one machine. In terms of the client interface there shouldn't be any difference, but I wouldn't be surprised if the serialization requirements are more strict in pseudo-distributed mode.
My reasoning for the above is that in pseudo-distributed mode, everything must be serialized to pass data between JVMs. In standalone mode, it isn't strictly necessary for everything to be serializable (since everything is in one JVM, you have shared memory), but I don't remember if the code is written to take advantage of that fact, since that's not a normal use case for Hadoop.
EDIT: Given that you are not seeing an error, I think it sounds like a problem in the way the MapReduce job is coded. Perhaps he relies on something like shared memory among the reducers? If so, that would work in standalone mode but not in pseudo-distributed mode (or truly distributed mode, for that matter).

Resources