Implementing checkpointing when a Spark Streaming job is submitted directly to Spark seems straightforward. We are facing quite some complexity when we need to do the same while the streaming job is submitted through Spark Job Server. Any pointers/references would be quite helpful. Can anybody help?
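For reference, the direct-submission case follows the usual `StreamingContext.getOrCreate` pattern; a minimal sketch (checkpoint directory, source, and processing are placeholders, not our actual job):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStreamingApp {
  // Hypothetical checkpoint location -- in practice a reliable store such as HDFS.
  val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-streaming-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Hypothetical source and processing -- adapt to the real job.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```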
Related
I am trying to use a REST service to trigger Spark jobs using the Dataproc API client. However, each job inside the Dataproc cluster takes 10-15 s to initialize the Spark driver and submit the application. I am wondering if there is an effective way to eliminate the initialization time for Spark Java jobs triggered from a JAR file in a GCS bucket? Some solutions I am thinking of are:
Pooling a single JavaSparkContext instance that can be reused for every Spark job
Starting a single long-running job and doing all the Spark-based processing inside it
Is there a more effective way? How would I implement the approaches above in Google Dataproc?
Instead of writing this logic yourself, you may want to investigate the Spark Job Server (https://github.com/spark-jobserver/spark-jobserver), as it should allow you to reuse Spark contexts.
You can also write a driver program for Dataproc that accepts RPCs from your REST server and reuses the SparkContext itself, and then submit this driver via the Jobs API, but I would personally look at the job server first.
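As a rough illustration of that second option, a long-running driver along these lines could hold one SparkContext open and serve incoming requests. The RPC surface here is a bare TCP socket purely for illustration; a real service would use HTTP/gRPC, authentication, and proper error handling:

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.ServerSocket

import org.apache.spark.sql.SparkSession

object LongRunningDriver {
  def main(args: Array[String]): Unit = {
    // One SparkSession/SparkContext is created once and reused for every request,
    // so each subsequent job skips driver start-up and executor allocation.
    val spark = SparkSession.builder().appName("long-running-driver").getOrCreate()

    // Hypothetical protocol: each request sends one line containing an input path.
    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept()
      val in = new BufferedReader(new InputStreamReader(socket.getInputStream))
      val out = new PrintWriter(socket.getOutputStream, true)
      val path = in.readLine()
      if (path != null) {
        // Runs against the shared context instead of launching a new application.
        val count = spark.read.textFile(path).count()
        out.println(s"count=$count")
      }
      socket.close()
    }
  }
}
```

Submitted once via the Dataproc Jobs API, a driver like this keeps its executors alive, so subsequent requests avoid the 10-15 s start-up cost you are seeing.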
With regard to running machine learning jobs with Spark: which is the better choice, the YARN scheduler or the Spark standalone scheduler?
There is no difference when it comes to running the actual Spark job.
YARN/Mesos helps you schedule resources if you have different Spark applications running and/or other components running in your cluster (which support YARN/Mesos, of course).
The Spark standalone cluster manager does not arbitrate resources between applications for you. That is, if you start a Spark application and it takes all the resources, a second application will not find any resources left. That means you have to manage this yourself (e.g. by adapting the Spark config accordingly, as in the sketch below).
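For example, capping what one application may take from a standalone cluster could look like this (the values are purely illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Limit this application's share of the standalone cluster so that a second
// application still finds free cores and memory.
val conf = new SparkConf()
  .setAppName("capped-app")
  .set("spark.cores.max", "8")          // upper bound on total cores for this app
  .set("spark.executor.memory", "4g")   // memory per executor

val spark = SparkSession.builder().config(conf).getOrCreate()
```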
We have very complex pipelines which we need to compose and schedule. I see that the Hadoop ecosystem has Oozie for this. What are the choices for Spark-based jobs when I am running Spark on Mesos or standalone and don't have a Hadoop cluster?
Unlike with Hadoop, it is pretty easy to chain things with Spark, so writing a Spark Scala script might be enough. My first recommendation is trying that.
If you would like to keep it SQL-like, you can try Spark SQL.
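A minimal sketch of what such a chained Scala script could look like, including a Spark SQL step for the SQL-flavoured route (paths and column names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object PipelineScript {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("chained-pipeline").getOrCreate()
    import spark.implicits._

    // Stage 1: ingest and clean (hypothetical input and columns).
    val raw = spark.read.option("header", "true").csv("/data/input/events.csv")
    val cleaned = raw.na.drop().filter($"status" === "ok")

    // Stage 2: express the aggregation in SQL over a temporary view.
    cleaned.createOrReplaceTempView("events")
    val summary = spark.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")

    // Stage 3: persist the result; a later stage or job can pick it up from here.
    summary.write.mode("overwrite").parquet("/data/output/summary")

    spark.stop()
  }
}
```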
If you have a really complex flow, it is worth looking at Google Dataflow: https://github.com/GoogleCloudPlatform/DataflowJavaSDK.
Oozie can be used in the case of YARN.
Spark itself has no built-in scheduler, so you are free to choose any scheduler that works in cluster mode.
For Mesos I feel Chronos would be the right choice; see the Chronos project for more info.
A cluster which runs MapReduce 2 doesn't have a JobTracker; instead, its responsibilities are split into two separate components, the ResourceManager and the per-application ApplicationMaster. However, this is transparent to the user, who doesn't need to know whether the cluster is running MapReduce 1 or 2 when submitting a MapReduce job.
The thing I cannot quite understand is the notion of a YARN application. How is it different from a regular MapReduce application? What's the advantage of running a MapReduce job as a YARN application, etc.? Could someone shed some light on that for me?
MR1 has a JobTracker and TaskTrackers, which take care of the MapReduce application.
In MR2, Apache separated the management of the map/reduce process from the cluster's resource management by using YARN. YARN is a better resource manager than what we had in MR1. It also enables versatility: MR2 is built on top of YARN.
Apart from MapReduce, we can run applications like Spark, Storm, HBase, Tez etc. on top of YARN, which we could not do with MR1.
The following is the architecture for MR1 and MR2:
MR1: HDFS <---> MapReduce
MR2: HDFS <---> YARN <---> MapReduce
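To make that concrete: an application running on YARN carries no YARN-specific code; the resource manager is only chosen at submit time. A minimal Spark sketch:

```scala
import org.apache.spark.sql.SparkSession

// Nothing here is YARN-specific: the same jar can be submitted to a YARN (MR2-style)
// cluster or to a standalone master, because YARN only handles resource allocation,
// not the application logic.
object YarnHostedApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("yarn-hosted-app").getOrCreate()
    val df = spark.range(0, 1000000)
    println(s"sum = ${df.selectExpr("sum(id)").first().getLong(0)}")
    spark.stop()
  }
}
```

The same jar can then be launched with `spark-submit --master yarn --deploy-mode cluster ...` on a YARN cluster, or against a standalone master, without any code change.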
What can I possibly do with Hadoop and Nutch used as a search engine? I know that Nutch is used to build a web crawler, but I'm not seeing the full picture. Can I use MapReduce with Nutch and run some MapReduce jobs? Any ideas are welcome; a few links will be greatly appreciated. Thanks.
If you only want to run map/reduce jobs, you don't need Nutch, just Hadoop. Hadoop gives you a cluster file system and a scheduler for map/reduce jobs over that filesystem.
As Nutch is built on top of Hadoop, you can create your own map/reduce jobs on Nutch data, as long as you understand the data structures and what the crawler is doing.
However, if you only want to run some map/reduce jobs, just install Hadoop and off you go.
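To illustrate the "write your own map/reduce job" route, here is a minimal word-count sketch against the plain Hadoop MapReduce API (in Scala; the input/output paths come from the command line and are placeholders). Pointing it at actual Nutch segment data would additionally require Nutch's input formats and writables:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Mapper: emit (word, 1) for every token in the input text.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }
}

// Reducer: sum the counts for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val total = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(total))
  }
}

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(WordCountJob.getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    // Placeholder paths: args(0) = input directory, args(1) = output directory.
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```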