Java 8 parallelstream worker issue - spring-boot

I am running a weekly job with Java 8 and Spring Boot. It uses a custom ForkJoinPool with 8 threads and takes about 3 hours to complete. The logs show that throughput is consistent until the job is about 80% done, with 5 to 6 threads running fine. After roughly 80%, only one thread is still running and throughput drops drastically.
From my initial analysis, it looks as if the threads are somehow lost after 80%, but I am not sure.
Questions:
1) Any hints on what is going wrong?
2) What is the best way to debug and fix this, so that all threads keep running until the job completes?
I think the job should finish in less time than it does now, and I suspect the threads are the issue.
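One possible way to start debugging, as a minimal sketch only: count how many items each worker thread handles, to see whether the work is spread evenly or piles up on one thread. The class name PoolUsageProbe and the process method are hypothetical stand-ins for the real job, which the question does not show. Submitting the parallel stream from inside the custom pool is a widely used trick that relies on current JDK behaviour rather than on a documented guarantee.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class PoolUsageProbe {

    // Runs the items through a parallel stream submitted to a dedicated pool
    // and returns how many items each thread handled, keyed by thread name.
    static Map<String, Long> itemsPerThread(List<Integer> items, int parallelism) throws Exception {
        Map<String, LongAdder> counts = new ConcurrentHashMap<>();
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.submit(() ->
                items.parallelStream().forEach(item -> {
                    process(item); // hypothetical stand-in for the real per-item work
                    counts.computeIfAbsent(Thread.currentThread().getName(), k -> new LongAdder())
                          .increment();
                })
            ).get(); // wait for the whole stream to finish
        } finally {
            pool.shutdown();
        }
        return counts.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().sum()));
    }

    // Placeholder workload (assumption): sleeps briefly instead of doing real work.
    static void process(int item) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Integer> items = IntStream.range(0, 2000).boxed().collect(Collectors.toList());
        itemsPerThread(items, 8)
                .forEach((thread, n) -> System.out.println(thread + " handled " + n + " items"));
    }
}
```

If the counts are very uneven, or a few items take far longer than the rest, the tail of the job can end up on a single thread even though no thread was lost. Taking a thread dump with jstack while only one thread is busy would show whether the other workers are idle or blocked.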

Related

Spark Stage performance, found GC Time very high just for few tasks

I'm trying to tune a Spark application to reduce its overall execution time, but I'm seeing strange behaviour during one Stage.
Just 14 of the 120 tasks need around 20 minutes to finish, while the others complete in 4 or 5 minutes.
Looking at the Spark UI, the partitioning seems good; the only difference I see is the GC Time, which is very high for those 14 tasks.
I attach an image of the situation.
Do you have any idea how to fix this performance problem?
I had a similar problem and resolved it by using Parallel GC instead of G1GC. You can add the following options to the executors' extra Java options (spark.executor.extraJavaOptions) in the submit request:
-XX:+UseParallelGC -XX:+UseParallelOldGC

Spring batch multithreading: throttle-limit impact

I have a multi-threaded Step configured with a threadpool with a corePoolSize of 48 threads (it's a big machine) but I did not configure the throttle-limit.
I am wondering if I have been underutilizing the machine because of this.
The Spring Batch documentation says that throttle-limit is the max amount of concurrent tasks that can run at one time and the default is 4.
I can see in jconsole that in fact there are 48 threads created and they seem to be executing (I can also see that in my logs).
But, so, even though I can see the 48 threads created, does the throttle-limit of 4 mean that only 4 of those 48 threads are indeed executing work concurrently?
Thank you in advance.
Yes, your understanding is correct: only as many threads as the throttle-limit will be doing work concurrently.
In your case, since it is a thread pool, any four threads may be picked to do the work while the rest remain idle. Because the threads are rotated across those four tasks, it gives the impression that all of them are working concurrently.
corePoolSize simply indicates the number of threads to be started and maintained during the job run. It does not mean that all of them run concurrently; it means you avoid thread creation overhead during the job run.
You have not shared any code or job structure, so it is hard to point out anything more specific.
Hope it helps !!
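To illustrate the idea in the answer, here is a plain-Java analogy. It is not Spring Batch's actual code, and the class name ThrottleSketch and its parameters are made up for this example. A pool of 48 threads sits behind a semaphore with 4 permits, so every thread may get a turn, but no more than 4 tasks are ever in flight at once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThrottleSketch {

    // Submits `tasks` short tasks to a pool of `poolSize` threads while allowing
    // at most `throttleLimit` of them in flight. Returns the peak concurrency seen.
    static int peakConcurrency(int poolSize, int throttleLimit, int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Semaphore permits = new Semaphore(throttleLimit);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < tasks; i++) {
            permits.acquire(); // the submitting thread waits here once the limit is reached
            pool.execute(() -> {
                try {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max); // remember the highest value seen
                    Thread.sleep(5); // stand-in for real work (assumption)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                    permits.release(); // free the slot only after the task is done
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return peak.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("peak concurrency: " + peakConcurrency(48, 4, 200));
    }
}
```

In this sketch the reported peak never exceeds 4, even though the pool holds 48 threads. Raising the limit towards the pool size is what would let more of those threads work at the same time.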

Apache Spark on EC2 massive slowdown on iterations

I have driver program that runs a set of 5 experiments - basically the driver program just tells the program which dataset to use (of which there are 5 and they're very similar).
The first iteration takes 3.5 minutes, the second 6 minutes, the third 30 minutes and the fourth has been running for over 30 minutes.
After each run the SparkContext object is stopped, it is then re-started for the next run - I thought this method would prevent slow down, as when sc.stop is called I was under the impression that the instances were cleared of all their RDD data - this is at least how it works in local mode. The dataset is quite small and according to Spark UI only 20Mb of data on 2 nodes is used.
Does sc.stop not remove all data from a node? What would cause such a slow down?
Call sc.stop only after all iterations are complete. Whenever you stop the SparkContext and create a new one, it takes time to load the Spark configuration and jars and to free the driver port before the next job can run.
Also, setting --executor-memory can speed up the process, depending on how much memory each node has.
Stupidly, I had used T2 instances. Their burstable performance means they only work on full power for a small amount of time. Read the documentation thoroughly - lesson learnt!

How jobs are assigned to executors in Spark Streaming?

Let's say I've got 2 or more executors in a Spark Streaming application.
I've set the batch time of 10 seconds, so a job is started every 10 seconds reading input from my HDFS.
If every job lasts for more than 10 seconds, the new job that is started is assigned to a free executor, right?
Even if the previous one didn't finish?
I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper on Spark Streaming.
If you know some links where all of those things are explained, I would really appreciate to see them.
Thank you.
Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, the next batch's jobs will stay queued.
This can be changed with an experimental Spark property, "spark.streaming.concurrentJobs", which is set to 1 by default. It's not currently documented (maybe I should add it).
The reason it is set to 1 is that concurrent jobs can lead to weird sharing of resources, which can make it hard to tell whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if batch processing time < batch interval, then the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
There is a little bit of material on the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.
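The stability condition above (batch processing time < batch interval) can be checked with a little arithmetic. The following is only a sketch of that reasoning, not Spark code; the class name BatchQueueSketch and the method schedulingDelays are invented for this illustration. It assumes a new batch arrives every interval and that only one job runs at a time.

```java
import java.util.Arrays;

class BatchQueueSketch {

    // For each batch, returns how long it waits in the queue (in ms) before it can
    // start, given that batch i arrives at i * intervalMs and jobs run one at a time.
    static long[] schedulingDelays(long intervalMs, long[] processingMs) {
        long[] delays = new long[processingMs.length];
        long previousFinish = 0;
        for (int i = 0; i < processingMs.length; i++) {
            long arrival = i * intervalMs;
            long start = Math.max(arrival, previousFinish); // wait for the previous job
            delays[i] = start - arrival;
            previousFinish = start + processingMs[i];
        }
        return delays;
    }

    public static void main(String[] args) {
        // Processing takes 12 s per batch with a 10 s interval: the queue keeps growing.
        System.out.println(Arrays.toString(
                schedulingDelays(10_000, new long[] {12_000, 12_000, 12_000, 12_000})));
        // Processing takes 8 s per batch with a 10 s interval: no batch ever waits.
        System.out.println(Arrays.toString(
                schedulingDelays(10_000, new long[] {8_000, 8_000, 8_000, 8_000})));
    }
}
```

With 12-second batches the delays come out as [0, 2000, 4000, 6000], growing by 2 seconds per batch, while with 8-second batches they stay at [0, 0, 0, 0].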

Why do Pentaho jobs run through Kitchen take a lot of CPU resources?

Could you please give a short explanation of what happens when kitchen.bat calls a job?
I can only guess that it instantiates the job, and that this is probably why my taskmgr spikes whenever I call 5 jobs at the same time. After a couple of seconds, the spikes wind down.
Or maybe not? Are there other reasons why calling jobs through Kitchen uses a lot of resources?
Are there ways to save CPU resources while still taking advantage of parallelism (calling the jobs all at the same time)? Are there optimizations that can be done?
How exactly are you calling 5 jobs at the same time? In a shell script? In that case the spike is because you're starting 5 JVMs at the same time, and starting a JVM is relatively expensive. There should be no need to do this: you can run everything in one JVM and do the parallelisation inside the job.
Kitchen itself doesn't specifically use a lot of resources. If your transformation has a large number of steps, then getting that going can take some time, but not ages.
Is this really a problem? Why does it matter if your cpu spikes for a couple of seconds? The point of parallelism is generally to max out the CPU/box/resource!

Resources