Switching Spark versions and distributing jars to all nodes - YARN vs Standalone (Hadoop)

I have an environment set up with both Spark 2.0.1 and 2.2.0. Both run in standalone mode and each has a master and 3 slaves. They sit on the same servers and are configured in exactly the same way. I only ever want to run one at a time, and to do so I set the SPARK_HOME environment variable to the location of the Spark version I wish to start and run start-master.sh and start-slaves.sh in the bin folder of that particular version.
I have a jar file which I wish to use with all Spark programs, regardless of version. I'm aware I could just pass it in the spark-submit --jars parameter, but I don't want to have to account for any transfer time in the job execution, so I am currently placing the jar file in the jars folder of each of the master and slave nodes prior to startup. This is a regular task as the jar file gets updated quite often.
If I wish to switch Spark versions, I must run stop-slaves.sh and stop-master.sh in the bin folder of the version I wish to stop, then go through the above process again.
The key things I wish to achieve are that I can separate the transfer of jars from execution timings and that I can easily switch versions. I can do this with my current setup, but it's all done manually, and I'm looking at automating it. However, I don't want to spend time on this if there's already a solution that does what I need.
Is there a better way of doing this? I'm currently looking at YARN to see if it can offer anything.
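For reference, the manual routine described above boils down to something like the following (untested) sketch; the install paths, jar location, and node hostnames here are assumptions:

#!/usr/bin/env bash
# Stop the currently running Spark version, push the updated jar to every
# node's jars/ directory, then switch SPARK_HOME and start the other version.
# Paths and hostnames below are placeholders.
OLD_SPARK_HOME="${SPARK_HOME:-/opt/spark-2.0.1}"
NEW_SPARK_HOME="/opt/spark-2.2.0"
JAR_TO_DISTRIBUTE="/tmp/shared-lib.jar"
NODES=(master slave1 slave2 slave3)

"$OLD_SPARK_HOME/sbin/stop-slaves.sh"
"$OLD_SPARK_HOME/sbin/stop-master.sh"

# Distribute the jar outside of job execution so submit times are unaffected
for node in "${NODES[@]}"; do
  scp "$JAR_TO_DISTRIBUTE" "$node:$NEW_SPARK_HOME/jars/"
done

export SPARK_HOME="$NEW_SPARK_HOME"
"$SPARK_HOME/sbin/start-master.sh"
"$SPARK_HOME/sbin/start-slaves.sh"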

Related

Can I use Spark without Hadoop for development environment?

I'm very new to the concepts of Big Data and related areas; sorry if I've made a mistake or typo.
I would like to understand Apache Spark and use it only on my computer, in a development/test environment. As Hadoop includes HDFS (Hadoop Distributed File System) and other software that only matters for distributed systems, can I discard it? If so, where can I download a version of Spark that doesn't need Hadoop? Here I can find only Hadoop-dependent versions.
What do I need:
Run all features of Spark without problems, but on a single computer (my home computer).
Everything that I build on my computer with Spark should run on a future cluster without problems.
Is there any reason to use Hadoop or any other distributed file system for Spark if I will run it on my computer for testing purposes?
Note that "Can apache spark run without hadoop?" is a different question from mine, because I do want to run Spark in a development environment.
Yes, you can install Spark without Hadoop.
Go through the official Spark documentation: http://spark.apache.org/docs/latest/spark-standalone.html
Rough steps:
Download a precompiled Spark build, or download the Spark source and build it locally
Extract the TAR
Set the required environment variables
Run the start script
Spark (without Hadoop) - available on the Spark download page
URL: https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
If this URL does not work, try to get it from the Spark download page.
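As an illustration only, the rough steps above might look like this on a single machine; the mirror URL, version, and install path are assumptions, so take the actual link from the download page:

# Download and extract a prebuilt Spark, then point SPARK_HOME at it
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz -C /opt
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH="$SPARK_HOME/bin:$PATH"

# Single-machine use: either run everything locally...
spark-shell --master "local[*]"

# ...or start a one-node standalone cluster
"$SPARK_HOME/sbin/start-master.sh"
"$SPARK_HOME/sbin/start-slave.sh" "spark://$(hostname):7077"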
This is not a proper answer to the original question; sorry, that is my fault.
If someone wants to run the Spark "without Hadoop" distribution tar.gz, there is an environment variable to set. This spark-env.sh (placed in Spark's conf/ directory) worked for me:
#!/bin/sh
# Point Spark's "Hadoop free" build at the classpath of an existing Hadoop install
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
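For context, a quick sanity check (assuming SPARK_HOME points at the "without Hadoop" build and the hadoop command is on the PATH) might look like:

# Install the spark-env.sh shown above and verify Spark picks up Hadoop's classes
cp spark-env.sh "$SPARK_HOME/conf/spark-env.sh"
# If SPARK_DIST_CLASSPATH is applied, spark-shell starts without errors about
# missing org.apache.hadoop classes
"$SPARK_HOME/bin/spark-shell" --master "local[*]"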

Need to create a jar file everytime a code is modified in Map reduce

I am running a MapReduce program on a Hadoop single-node cluster.
There are frequent modifications to the MapReduce logic. I am using the Eclipse IDE.
Every time I modify the logic, I create a jar file to check the MapReduce job on Hadoop. Creating a jar file after every modification is a little tedious.
Is there any easier way to create the jar file every time a change is made? Please advise.
It's not clear whether the fact that you have to make a jar file is the issue, or whether the process of making the jar seems too difficult. First, you do have to make a jar file to submit a job to Hadoop. There is no way around this. Second, to make the process of creating the jar file easier, you can use a build tool like Maven to make this simpler. The Maven Assembly plugin will also package up an uber jar with all your dependencies if needed.
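As a sketch of that workflow (the artifact name and driver class below are made up), rebuilding and resubmitting after each change can be reduced to two commands:

# Rebuild the job jar; with the Assembly/Shade plugin configured in the pom,
# dependencies are bundled into a single "uber" jar
mvn -q clean package

# Submit the freshly built jar to the single-node cluster
hadoop jar target/my-mr-job-1.0-jar-with-dependencies.jar com.example.WordCountDriver /input/path /output/path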
I use Scalding to write MapReduce jobs (it's as concise as Pig and as flexible/performant as Java), and I use sbt to build. I have an rsync script that syncs my code to a location on the cluster, where I start up 'sbt console' - it's a REPL (shell) where you can import libraries.
The result is that I can run MapReduce jobs interactively in a shell just by calling my code from the project, writing code directly into the shell, or copy-pasting code into it. IME you cannot beat this kind of workflow!
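A minimal sketch of that kind of sync-and-shell setup (the gateway host and paths here are assumptions):

#!/usr/bin/env bash
# Push the local project to a machine on the cluster, then open an sbt REPL
# there so job code can be run and tweaked interactively
rsync -avz ./src ./project ./build.sbt gateway-host:~/mr-project/
ssh -t gateway-host 'cd ~/mr-project && sbt console'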
If you are using the Eclipse IDE, you can add the MapReduce plugin to Eclipse and create a location, providing the port numbers of HDFS and MapReduce. Then in Eclipse you can simply right-click and choose Run As > Run on Hadoop, which avoids creating a jar file.

Hudson/SVN job configuration

I am working on a project that is built in a modular way. When the branch is checked out, there are folders for each module. I would like to set up a Hudson job for each module to build each module individually, but I cannot figure out how to have one workspace for all the jobs and have each Hudson job only check for changes to its respective module within the common workspace, without triggering an update of the whole workspace. Is what I am trying to do even possible?
Each job needs its own workspace; everything else is asking for trouble. If you have a shared workspace, you may need to synchronize the jobs so that they don't interfere with each other. That can be trickier than you expect.

Jenkins slave and Maven Project

I have a problem understanding how Jenkins slaves work. I have a master Jenkins and a node defined as a slave. I have a Maven project that I want to run on the slave while having the report available on the Jenkins master.
Should I have Maven installed on the master, on the slave, or both? Do I keep the Maven project on the slave, on the master, or both? I think it should be both, but I cannot understand why.
When using slaves, you just have to ensure that each slave has the external tools installed (Maven in your case) and properly configured (available in the path, custom settings.xml if any, etc.).
To avoid having to bind a job to a unique slave, an obvious best practice is to have all your slaves configured alike (i.e. all Linux slaves with all needed tools, same for Windows slaves, etc.).
Once all your tools are available on each slave, Jenkins takes care of running the project on an available node (master or a slave). Monitoring, log tailing, build history and so on are transparently available to the user, whatever node is used.
You don't even need to care about plugins, they are automatically available to the slaves once installed on the master.
I don't think you need any Maven or Jenkins component on the slave. When you create the slave, it gives you the option to launch it via a JNLP (Java Network Launch Protocol) file.
Save this file, copy it to the slave machine, and launch it. The only prerequisite is to have Java installed on the slave machine.
On launching, it will establish a connection with the master. I use a Selenium grid like that. I am not aware of your exact use case, but this may help.
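For illustration, launching a JNLP slave from the slave machine typically boils down to something like this; the master URL, node name, and secret below are placeholders for your own setup:

# Fetch the agent jar from the master, then connect back over JNLP
curl -O http://jenkins-master:8080/jnlpJars/slave.jar
java -jar slave.jar -jnlpUrl http://jenkins-master:8080/computer/my-slave/slave-agent.jnlp -secret <secret-from-node-page>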

Jenkins/Hudson Java.IO error Unable to clean workspace - Windows server

I have a Jenkins/Hudson CI server, hosted on a dedicated server (kindly hosted by someone else). We have come across a problem which we cannot solve and need help from people who may know solutions:
When we try to run a build, we get a Build Failed and
java.io.IOException: Unable to delete C:\Program Files (x86)\Jenkins\jobs\JumpPorts-2\workspace
Jenkins was able to create the files, so surely it can delete them? It is running as a service, and it is cloning the source (Maven - Java) from GitHub. This is on a Windows server. I tested it on my VPS (CentOS 5) and it worked correctly; however, because it is a VPS, Java does not run well alongside my other services, so I am unable to host it there.
Full Error: http://pastebin.com/0tWVVdiH
Thanks in advance
Most likely you are using the Maven project type.
The Maven project type can parse the pom on disk before building and while accessing the GUI. As a result, when building on Windows, there is a chance that Windows' strict file locking can get in the way, marking a file as in use until absolutely every file handle is released.
One way to reduce this issue is to have the Windows builds run on a slave node rather than the master (note that the slave node can be the same physical machine; because the remoting channel is required to see the slave's filesystem, the file handles may not be as big an issue).
Another way to reduce this issue is to switch to a FreeStyle project with a Maven build step. Note that my personal preference is to avoid the Maven project type on pain of death. ;-)
