How are jobs assigned to executors in Spark Streaming?

Let's say I've got 2 or more executors in a Spark Streaming application.
I've set the batch time of 10 seconds, so a job is started every 10 seconds reading input from my HDFS.
If every job lasts more than 10 seconds, the new job that is started is assigned to a free executor, right?
Even if the previous one didn't finish?
I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper related to Spark Streaming.
If you know of any links where these things are explained, I would really appreciate seeing them.
Thank you.

Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, then the next batch's jobs will stay queued.
This can be changed with an experimental Spark property, "spark.streaming.concurrentJobs", which is set to 1 by default. It's not currently documented (maybe I should add it).
The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if batch processing time < batch interval, then the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
There is a little bit of material regarding the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.
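To make the queueing behavior described above concrete, here is a minimal sketch in plain Python (not Spark code). It assumes a single job slot, as with the default of 1 concurrent job, and that each batch becomes ready at the end of its interval. The function name and the 8s/12s processing times are made up for illustration.

```python
def scheduling_delays(batch_interval, processing_times):
    """Scheduling delay of each batch when only one job can run at a time.

    Batch i becomes ready at (i + 1) * batch_interval and has to wait
    until the previous batch's job has finished.
    """
    delays = []
    free_at = 0  # time at which the single job slot becomes free
    for i, proc in enumerate(processing_times):
        ready = (i + 1) * batch_interval
        start = max(ready, free_at)
        delays.append(start - ready)
        free_at = start + proc
    return delays

# Hypothetical workloads with a 10 second batch interval.
stable = scheduling_delays(10, [8] * 5)     # processing time < batch interval
unstable = scheduling_delays(10, [12] * 5)  # processing time > batch interval

print("8s per batch: ", stable)
print("12s per batch:", unstable)
```

With 8 seconds of processing per batch the delay stays at zero, while with 12 seconds every batch starts 2 seconds later than the one before it, so the queue keeps growing.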

Related

Is there any way to know which job will start next in qsub?

On our institute's (IISc Bangalore) supercomputer, we submit jobs using qsub. Jobs start running according to the following:
(1) The job's wall time (expected completion time)
(2) Its position in the respective queue (small, medium, large, etc.)
So it is very difficult to know which job will start once the currently running job finishes. But qsub probably has a list of its own, which it uses to start a new job immediately after another one finishes.
Is there any way to know which job will start next? Is there a command for this?
Thank you.
Unfortunately, there is no clear way to know which job will be run next in a supercomputing system. The job start depends not only on its wall time or position in the queue but also on many other factors based on site-level policy, scheduling strategies and priorities. There can be some internal job ranking (priorities) chosen by the institute based on factors like power management, load balancing, etc.
On the other hand, there is a lot of research on predicting the waiting time for job allocation. TeraGrid systems provide an estimated waiting time. Also, see link1, link2 (by SERC) for more information about predicting the waiting time.

Apache Spark - How to avoid failing slow tasks

Dear fellow Apache Spark enthusiasts
I recently kicked off a sideline project with the goal of turning a couple of ODROID XU4 computers into a stand-alone Spark Cluster.
After setting up the cluster I ran into a problem that seems to be specific to heterogeneous multiprocessors. Spark executor tasks run extremely slowly on the XU4 when using all 8 processors. The reason, as mentioned in a comment on my post below, is that Spark does not wait for the executors that have been kicked off on the slow processors.
http://forum.odroid.com/viewtopic.php?f=98&t=21369&sid=4276f7dc89a8d7825320e7f705011326&p=152415#p152415
One solution is to use fewer executor cores and to set the CPU affinity to not use the LITTLE processors. This is however a less than ideal solution.
Is there a way to ask Spark to wait a bit longer for feedback from slower executors? Obviously waiting too long will have a negative effect on performance. The positive effect of utilising all cores should however balance out the negative effect.
Thanks in advance for any help!
@Dikei's response highlights two potential causes, but it turns out the problem is not the one he suspects. I have the same setup as @TJVR, and it turns out the driver is missing heartbeats from executors. To address this, I added the following to spark-env.sh:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=600 -Dspark.akka.timeout=200 -Dspark.shuffle.consolidateFiles=true"
This changes the default timeouts for executor heartbeats. I also set spark.shuffle.consolidateFiles to true to improve performance on my ext4 filesystem. These changes to the defaults allowed me to increase the core usage above one without frequently losing executors.
Spark does not kill slow executors, but will mark an executor as dead in two cases:
If the driver doesn't receive a heartbeat signal for a while (default: 120s): the executor has to regularly (default: every 10s) send a heartbeat message to notify the driver that it is still alive. Network issues or a long GC pause can prevent these heartbeats from arriving.
The executor has crashed due to an exception in the code or a JVM runtime error, most likely due to a GC pause as well.
In my opinion, it's probable that GC overhead has killed your slow executor and the driver has to redo the task on a different executor. If this is the case, you can try splitting your data into smaller partitions, so that each executor has to process less data at a time.
Secondly, you should NOT set spark.speculation to 'true' without testing. It's 'false' by default for a reason; I've seen it do more harm than good in some cases.
Lastly, the following assumption might not hold true.
"The positive effect of utilising all cores should however balance out the negative effect."
Slow executors (stragglers) can cause the program to perform much worse, depending on the workload. It's entirely possible that avoiding the slow cores will provide the best result.
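As a rough illustration of why stragglers can hurt, here is a toy model in plain Python. It is not how Spark's scheduler is implemented; it only assumes that equal-sized tasks are handed to whichever core becomes free first. The per-task times (1 time unit on a fast core, 3 on a slow one) are invented for the example.

```python
import heapq

def makespan(num_tasks, core_task_times):
    """Finish time of the last task under greedy assignment.

    core_task_times[i] is how long core i needs for one task; each task
    goes to the core that becomes free earliest.
    """
    # Heap of (time at which the core becomes free, core index).
    free = [(0, i) for i in range(len(core_task_times))]
    heapq.heapify(free)
    finish = 0
    for _ in range(num_tasks):
        t, core = heapq.heappop(free)
        done = t + core_task_times[core]
        finish = max(finish, done)
        heapq.heappush(free, (done, core))
    return finish

fast, slow = 1, 3  # hypothetical per-task times on big and LITTLE cores

# Few tasks: the slow cores hold up the whole stage.
few_all_cores = makespan(8, [fast] * 4 + [slow] * 4)
few_fast_only = makespan(8, [fast] * 4)

# Many tasks: the extra cores pay off.
many_all_cores = makespan(80, [fast] * 4 + [slow] * 4)
many_fast_only = makespan(80, [fast] * 4)

print(few_all_cores, few_fast_only)
print(many_all_cores, many_fast_only)
```

In this model, 8 tasks finish sooner on the 4 fast cores alone, while 80 tasks finish sooner when all 8 cores are used, which is one way to read "depending on the workload".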

Recovery techniques for Spark Streaming scheduling delay

We have a Spark Streaming application that has basically zero scheduling delay for hours, but then suddenly it jumps up to multiple minutes and spirals out of control. This happens after a while even if we double the batch interval.
We are not sure what causes the delay to happen (theories include garbage collection). The cluster has generally low CPU utilization regardless of whether we use 3, 5 or 10 slaves.
We are really reluctant to further increase the batch interval, since the delay is zero for such long periods. Are there any techniques to improve recovery time from a sudden spike in scheduling delay? We've tried seeing if it will recover on its own, but it takes hours if it even recovers at all.
Open the batch links and identify which stages are delayed. Is there any external access to other DBs or applications that could be contributing to this delay?
Go into each job and look at the data/records processed by each executor. You can find problems here.
There may also be skew in the data partitions. If the application is reading data from Kafka and processing it, the data can be skewed across cores if the partitioning is not well defined. Tune these parameters: # of Kafka partitions, # of RDD partitions, # of executors, # of executor cores.
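One simple way to put a number on skew, once you have read the per-partition (or per-executor) record counts off the UI, is to compare the largest partition with the mean. The sketch below is plain Python; the function name and the record counts are made up for illustration.

```python
def skew_ratio(records_per_partition):
    """Largest partition size divided by the mean partition size.

    A value close to 1.0 means the data is spread evenly.
    """
    mean = sum(records_per_partition) / len(records_per_partition)
    return max(records_per_partition) / mean

# Hypothetical record counts per partition.
even = [1000, 1010, 990, 1000]
skewed = [3400, 200, 200, 200]

print(skew_ratio(even))
print(skew_ratio(skewed))
```

A ratio well above 1, as in the second list, means one task is doing most of the work while the other cores sit idle, which would be consistent with low overall CPU utilization.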

Actual processing time of hadoop job

My cluster is currently occupied by a job A that takes long time and has VERY_LOW priority.
I started another job B yesterday while A was already running and I think it should have run quite fast.
However, I saw in the job details that it took 47 minutes.
I don't think this is the actual processing time.
I'm trying to find out when the job really started.
Where can I look?
I can't seem to find anything that states exactly what you're after, but you could look up the job in the job tracker on port 50030 and look at the individual mapper and reducer details. There you can see how long each individual mapper and reducer took to complete its task, based on its start and end times.
If there weren't any mappers or reducers free when you started the second job, the second job wouldn't be able to make any progress until the first job released them. That might explain why it claimed to take so long, as the two jobs might not actually have been running simultaneously. Comparing the time the job was started with the time the first actual mapper started should indicate whether it was just waiting around for resources; if so, you can deduct the period between the job's and the mapper's start times from the overall 47 minutes.
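The deduction described above is just a subtraction of timestamps. Here is a small sketch in plain Python. The three timestamps are invented (chosen so that the total is 47 minutes), and the timestamp format is an assumption; use whatever the job tracker pages actually show.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed format, adjust to the job tracker's output

def split_job_time(job_submitted, first_task_started, job_finished):
    """Split a job's wall-clock time into (waiting, running) timedeltas."""
    submitted = datetime.strptime(job_submitted, FMT)
    started = datetime.strptime(first_task_started, FMT)
    finished = datetime.strptime(job_finished, FMT)
    return started - submitted, finished - started

# Hypothetical values copied by hand from the job and mapper detail pages.
waiting, running = split_job_time(
    "2014-05-01 10:00:00",  # job B submitted
    "2014-05-01 10:35:00",  # first mapper of job B started
    "2014-05-01 10:47:00",  # job B finished
)

print("waiting for slots:", waiting)
print("actually running: ", running)
```

In this made-up case, 35 of the 47 minutes were spent waiting for free slots and only 12 minutes were actual processing.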

Performance of Resque jobs

My Resque job basically takes a params hash and stores it in the DB. In the process it does several reads and writes.
These R/Ws take approx. 5ms in total on my local machine and a little bit more on Heroku (I guess it's because of the shared DB).
However, the rate at which the queue is processed is very low: about 2-3 jobs per second. What could be causing this?
Thank you.
A worker's cycle is: check for a new job, lock a job, do the job, mark it as completed, look for a new job.
You might find that the negotiation to get a new job, accessing Redis, etc. is causing a lot of overhead. If your task is only 5ms long, it can probably live inside the request-response cycle. Background jobs are great when running a task would extend the response time considerably; very small jobs generally aren't worth the effort involved.
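A quick back-of-the-envelope check using the numbers from the question shows how much of each job is overhead. This is plain Python arithmetic, not Resque code, and it assumes a single worker processing jobs one after another.

```python
def per_job_overhead_ms(jobs_per_second, work_ms):
    """Time spent per job on everything except the job's own work."""
    total_ms = 1000.0 / jobs_per_second  # wall-clock time per job
    return total_ms - work_ms

# 5 ms of reads/writes per job, observed throughput of 2-3 jobs per second.
for rate in (2, 3):
    overhead = per_job_overhead_ms(rate, 5)
    print(f"{rate} jobs/s: about {overhead:.0f} ms of overhead per job")
```

Under that assumption, several hundred milliseconds per job go to something other than the 5 ms of actual work, which is why the overhead of the worker cycle is the first thing to look at.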
