What's the difference between spark-shell and submitted sbt programs?

Spark-shell can be used to interact with data in distributed storage, so what is the essential difference between writing code in spark-shell and submitting a packaged sbt application to run on the cluster? (One difference I found is that a job submitted as an sbt application shows up in the cluster management interface, while the shell does not.) After all, sbt is quite troublesome, while the shell is very convenient.
Thanks a lot!

Spark-shell gives you a bare console-like interface in which you can run your code as individual commands. This can be very useful if you're still experimenting with packages or debugging your code.
I found a difference is sbt submit the job can be seen in the cluster management interface, and the shell can not
Actually, the Spark shell also shows up in the job UI (as "Spark shell"), and you can monitor the jobs you are running through that.
Building Spark applications with SBT gives you some organization in your development process and incremental compilation, which helps in day-to-day work, and it saves a lot of manual effort. If you have a fixed set of steps you always run, you can simply submit the same package again instead of typing the whole thing in as commands. SBT takes some getting used to if you are new to the Java style of development, but it helps maintain applications in the long run.
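As a rough illustration of the two workflows, here is a hedged sketch of the commands involved; the class name, jar path, and master URL are placeholders, not taken from the original question:
# Interactive: type code line by line into the REPL
$ spark-shell --master yarn
# Packaged: build once with sbt, then submit the resulting jar to the cluster
$ sbt package
$ spark-submit --master yarn --class com.example.MyApp target/scala-2.12/myapp_2.12-0.1.jar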

Related

JDK Flight Recorder on CI/CD Platform

I am currently using JDK Flight Recorder with JDK 11 and ran into some trouble on the CI/CD platform. Unfortunately, there is not much documentation on the new Flight Recorder, only on the older version that shipped as a commercial feature of earlier Oracle JDKs.
When I try to start tests directly from the IDE, everything works fine and I get my recording files.
When I try to do the same thing automatically on the CI/CD platform, it times out and fails in a number of indeterminate ways: trouble creating the file, the file not being written at all, etc.
The JVM options I used are the following (I added extra spaces for readability):
-XX:+FlightRecorder
-XX:StartFlightRecording= name="UiTestServer", settings="profile", dumponexit=true, filename=""+System.getenv("CI_PROJECT_DIR") + "flightRecording/javaFlightRecorder.jfr"
These are the same options the IDE adds automatically when starting a flight recording via right-click on the specified test.
Does anybody know whether Flight Recorder has problems with such systems, or with specific services that might run in parallel with it? I have heard of some profiling tools that cannot run on CI platforms.
If you need more detail, just ask; though I may not be able to share anything specific to the project.
A bit late as an answer, but JFR can definitely run in CI/CD environments. I have successfully attached JFR to our JMH microbenchmarks and published the results as artifacts in Atlassian Bamboo. Our Bamboo agents run on AWS, so JFR itself should be fine for most cloud environments.
JFR has been built to work in production systems, but if you want guarantees of low overhead (<1%), you should use the default settings, not 'profile'.
'profile' is meant for shorter periods, e.g. 10 minutes, where additional overhead may be acceptable in exchange for more insight.
This is what I would recommend, for JDK 11 and later:
$ java -XX:StartFlightRecording:filename=/path
There is no need to set dumponexit=true if a filename has been specified.
-XX:+FlightRecorder is only needed before JDK 8u40.
You can set a name if you like, but it's typically not needed; if you want to use jcmd to dump a recording, the name can be omitted.
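To tie this back to the CI setup in the question, here is a hedged sketch of how the recording could be started from a CI shell step; the directory layout and app.jar are assumptions (CI_PROJECT_DIR comes from the question), and the '=' delimiter is the equivalent older spelling of the same option:
# Make sure the output directory exists before the JVM tries to write to it
$ mkdir -p "$CI_PROJECT_DIR/flightRecording"
# Default settings keep overhead low; no dumponexit needed when filename is given
$ java "-XX:StartFlightRecording=filename=$CI_PROJECT_DIR/flightRecording/javaFlightRecorder.jfr" -jar app.jar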

Scheduling Cucumber test features to run repeatedly

I have Cucumber tests (feature files) in the RubyMine IDE, and lately I need to execute one of the features repeatedly at a scheduled time.
I haven't found a way to do so. Any ideas or thoughts on scheduling that feature file?
You can create a cron job that executes a rake task; a crontab sketch follows the links below.
The software utility cron is a time-based job scheduler in Unix-like computer operating systems. People who set up and maintain software environments use cron to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals.
These links might help
How to create a cron job using Bash
how to create a cron job to run a ruby script?
http://rake.rubyforge.org/
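For illustration, here is a hedged crontab entry (added with crontab -e) that runs a single feature every morning; the project path, feature file name, and log location are assumptions, and it calls cucumber directly rather than going through a rake task:
# m h dom mon dow  command  --  run the feature every day at 06:00
0 6 * * * cd /home/me/my_project && bundle exec cucumber features/daily_check.feature >> /home/me/cucumber_cron.log 2>&1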
I solved the problem by simply installing Jenkins on my machine from its official site, https://jenkins-ci.org/. I configured master and slave nodes on my own machine because I only needed to run one feature file (it contains the script I want to run on a daily basis).
However, it is recommended to configure the slave on a different machine if you have multiple jobs to run and they are resource-intensive.
There is a very good walkthrough of installing, configuring, and running jobs at http://yakiloo.com/setup-jenkins-and-windows/

DataFlow difference in Hadoop Standalone and Pseudodistributed mode?

Can someone please tell me the difference in dataflow between Hadoop standalone and pseudo-distributed mode? In fact, I am trying to run the matrix multiplication example presented by John Norstad. It runs fine in Hadoop standalone mode but does not work properly in pseudo-distributed mode. I am unable to fix the problem, so please explain the principal difference between the two modes, which may help in fixing it. Thanks
Regards,
WL
In standalone mode everything (namenode, datanode, tasktracker, jobtracker) runs in one JVM on one machine. In pseudo-distributed mode, each of these runs in its own JVM, but still on one machine. In terms of the client interface there shouldn't be any difference, but I wouldn't be surprised if the serialization requirements are stricter in pseudo-distributed mode.
My reasoning for the above is that in pseudo-distributed mode, everything must be serialized to pass data between JVMs. In standalone mode, it isn't strictly necessary for everything to be serializable (since everything is in one JVM, you have shared memory), but I don't remember if the code is written to take advantage of that fact, since that's not a normal use case for Hadoop.
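One quick way to see this difference for yourself (a sketch, assuming a Hadoop 1.x-era pseudo-distributed setup like the one described above) is to list the running JVMs with the JDK's jps tool; the process IDs below are illustrative:
# Each line is a separate JVM. In standalone mode none of these daemons
# are running, and the job executes inside a single local JVM instead.
$ jps
12001 NameNode
12102 DataNode
12203 SecondaryNameNode
12304 JobTracker
12405 TaskTracker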
EDIT: Given that you are not seeing an error, I think it sounds like a problem in the way the MapReduce job is coded. Perhaps he relies on something like shared memory among the reducers? If so, that would work in standalone mode but not in pseudo-distributed mode (or truly distributed mode, for that matter).

Jenkins/Hudson - Run script on all slaves

I have a requirement to run a script on all available slave machines. Primarily this is so they get relevant windows hotfixes and new 3rd party tools before building.
The script I have can be run multiple times without undesirable side effects and is quite lightweight, so I'm happy for this to be brute force if necessary.
Can anybody give suggestions as to how to ensure that a slave is 'up-to-date' before it works on a job?
I'm happy with solutions that are driven by a job on the master, or ones which can inject the task (automatically) before normal slave job processing.
My shop does this as part of the slave launch process. We have the slaves configured to launch via execution of a command on the master; this command runs a shell script that rsyncs the latest tool files to the slave and then launches the slave process. When there is a tool update, all we need to do is restart the slaves or the master.
However - we use Linux whereas it looks like you are on Windows, so I'm not sure what the equivalent solution would be for you.
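As a hedged sketch of that launch-command approach (host names, paths, and the agent jar location are assumptions, not from the answer above), the master's "launch agent via execution of command" setting could point at something like:
# Sync the tool directory to the slave, then start the agent over ssh;
# Jenkins uses the command's stdin/stdout as the agent channel
rsync -az /opt/buildtools/ builduser@slave01:/opt/buildtools/ && ssh builduser@slave01 java -jar /opt/jenkins/slave.jar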
To your title: either use the Parameter Plugin or use a matrix configuration and list your nodes in it.
To your question about ensuring a slave is reliable: we mark it with a 'testbox' label and try out a variety of jobs on it. You could also have a job that is deployed to all of them and have that job take the machine offline if it fails, I imagine.
Using Windows for slaves is very obnoxious for us too :(

Considerations for building SysV or Upstart compatible Bash scripts

I've just knocked out a quick script for keeping a slave web server in sync with a master using rsync. (https://github.com/simonjgreen/liveFolderSync/blob/master/liveFolderSync.sh)
I'd like to make this run on boot and be controllable via the usual /etc/init.d/... or service commands; however, this is an area I've always struggled with. I find both init.d scripts and Upstart scripts horrendously confusing, and can't find a guide anywhere for starting from scratch.
The only control I'd like to have over it is start/stop/restart. Obviously I will later move the config into a separate file in /etc, but that's already on the cards, so it is outside the scope of this question.
Any pointers, advice, and best practices would be helpful. I should add that I'm doing this on Ubuntu.
To get started with SysV init scripts, I suggest the following links (a minimal skeleton is sketched after them):
Linux: How to write a System V init script to start, stop, and restart my own application or service
Writing System V init scripts for Red Hat Linux
Ubuntu Bootup Howto
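For orientation, here is a hedged minimal skeleton of a SysV init script for the sync script above; the install path, PID file, and service name are assumptions, and it assumes the script runs in the foreground so that start-stop-daemon (available on Ubuntu) can background it and track the PID:
#!/bin/sh
### BEGIN INIT INFO
# Provides:          livefoldersync
# Required-Start:    $network $remote_fs
# Required-Stop:     $network $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Keeps the slave web root in sync with the master
### END INIT INFO

DAEMON=/usr/local/bin/liveFolderSync.sh
PIDFILE=/var/run/livefoldersync.pid

case "$1" in
  start)
    start-stop-daemon --start --background --make-pidfile --pidfile "$PIDFILE" --exec "$DAEMON"
    ;;
  stop)
    start-stop-daemon --stop --pidfile "$PIDFILE"
    ;;
  restart)
    "$0" stop
    "$0" start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac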
For instructions specific to Upstart, I would recommend starting with:
Getting Started
The Upstart Cookbook
At present, there are also 129 questions on AskUbuntu, several of which will point you in the right direction:
What Events are available for Upstart
Want to make an Upstart script, need help and advice
