I have a multi-threaded Step configured with a thread pool with a corePoolSize of 48 threads (it's a big machine), but I did not configure the throttle-limit.
I am wondering if I have been underutilizing the machine because of this.
The Spring Batch documentation says that throttle-limit is the maximum number of concurrent tasks that can run at one time, and that the default is 4.
I can see in jconsole that in fact there are 48 threads created and they seem to be executing (I can also see that in my logs).
But even though I can see the 48 threads created, does the throttle-limit of 4 mean that only 4 of those 48 threads are actually executing work concurrently?
Thank you in advance.
Yes, your understanding is correct, i.e. only as many threads as the throttle-limit will be doing work concurrently.
In your case, since it's a thread pool, any four threads could be chosen to do the work while the rest remain idle; but because the threads are rotated across those four concurrent tasks, it gives the impression that all of them are working concurrently.
corePoolSize simply indicates the number of threads to be created and kept alive during the job run; it doesn't mean all of them are running work concurrently. Its purpose is to avoid the overhead of creating threads while the job is running.
You haven't shared any code or job structure, so it's hard to be more specific.
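To raise the limit, you set it on the step alongside the task executor. Below is a minimal Java-config sketch of what that looks like for a chunk-oriented step; the bean names, chunk size, and String item type are placeholders, and if you use XML configuration the equivalent is the throttle-limit attribute on the tasklet element.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class MultiThreadedStepConfig {

    @Bean
    public TaskExecutor stepTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(48); // the pool size from the question
        executor.setMaxPoolSize(48);
        return executor;
    }

    @Bean
    public Step multiThreadedStep(StepBuilderFactory steps,
                                  ItemReader<String> reader,
                                  ItemWriter<String> writer) {
        return steps.get("multiThreadedStep")
                .<String, String>chunk(100)
                .reader(reader)
                .writer(writer)
                .taskExecutor(stepTaskExecutor())
                .throttleLimit(48) // without this, the default of 4 caps concurrent chunk processing
                .build();
    }
}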
Hope it helps !!
Related
My application operates on pairs of long vectors - say it adds them together to produce a vector result. Its rules state that it must completely finish with one pair before it can be given another. I would like to use multiple threads to speed things up. I am running Windows 10.
I created an OpenMP parallel for construct and divided the vector among all the threads of the team. All threads start, all threads run pretty fast, so the multithreading is effective.
But the speedup is slight, and the reason is that some of the time, one of the worker threads takes far longer than usual. I have instrumented the operation, and I see that the worker threads sometimes take a long time to start: the delay varies from 20 microseconds on average to dozens of milliseconds, depending on system load. The master thread does not show this delay.
That makes me think that the scheduler is taking some time to start the worker threads. The master thread is already running, so it doesn't have to wait to be started.
But here is the nub of the question: raising the priority of the process doesn't make any difference. I can raise it to high priority or even realtime priority, and I still see that startup of the worker threads is often delayed. It looks like the Windows scheduler is not fully preemptive, and sometimes lets a lower-priority thread run when a higher-priority one is eligible. Can anyone confirm this?
I have verified that the worker threads are created with the default OS priority, namely the base priority of the class of the master process. This should be higher than the priority of any running thread, I think. Or is it normal for there to be some thread with realtime priority that might be blocking my workers? I don't see one with Task Manager.
I guess one last possibility is that the task switch might take 20 usec. Is that plausible?
I have a 4-core system without hyperthreading.
I am running a weekly job using Java 8 and Spring Boot, with a custom ForkJoinPool. With 8 threads, the job takes 3 hours to complete. When I check the logs, I see that the throughput is consistent until the job is about 80% complete, with 5 to 6 threads running fine. But after the job passes roughly 80%, I see only one thread running and the throughput drops drastically.
My initial analysis suggests that the threads are somehow lost after 80%, but I am not sure.
Questions:
1) Any hints on what is going wrong?
2) What is the best way to debug this issue and fix it, so that all threads keep running until the job completes?
I think the job should complete in less time than it does now, and I suspect the threads might be the issue.
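Since no code or job structure was shared, here is a minimal, hypothetical sketch of the kind of custom-pool setup described, with a periodic snapshot of the pool's state. Logging getActiveThreadCount() and getQueuedTaskCount() over time (or taking a jstack thread dump when the slowdown starts) is a cheap way to tell whether the threads are really gone or whether the remaining work has simply narrowed to one long-running tail task; loadItems() and process() are placeholders.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WeeklyJob {

    public static void main(String[] args) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(8); // custom pool with 8 threads, as in the question

        // Log what the pool is doing every 30 seconds, so the "only one thread after 80%"
        // phase can be correlated with active threads vs. remaining queued work.
        ScheduledExecutorService monitor = Executors.newSingleThreadScheduledExecutor();
        monitor.scheduleAtFixedRate(
                () -> System.out.printf("active=%d running=%d queued=%d%n",
                        pool.getActiveThreadCount(),
                        pool.getRunningThreadCount(),
                        pool.getQueuedTaskCount()),
                0, 30, TimeUnit.SECONDS);

        List<Integer> items = loadItems();
        try {
            // Submitting the parallel stream to the custom pool keeps it off the common pool.
            pool.submit(() -> items.parallelStream().forEach(WeeklyJob::process)).get();
        } finally {
            monitor.shutdownNow();
            pool.shutdown();
        }
    }

    private static List<Integer> loadItems() { return Arrays.asList(1, 2, 3); } // placeholder data
    private static void process(int item) { /* per-item work goes here */ }
}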
I'm trying to understand the benefits of using Puma clustering. The GitHub readme says that the number of Puma workers should be set to the number of CPU cores, and the default number of threads for each is 0-16. The worker processes can run in parallel while the threads run concurrently. It was my understanding that the MRI GIL only allows one thread across all cores to run Ruby code at a time, so how does Puma enable things to run in parallel / provide benefits over running one worker process with double the number of threads? The site I'm working on is heavily IO-bound, handling several uploads and downloads at the same time; any config suggestions for this setup are also welcome.
The workers in clustered mode will actually spawn new child processes, each of which has its own GIL. Only one thread in a single process can be running Ruby code at one time, so having a process per CPU core works well because each core can only be doing one thing at a time. It also makes sense to run multiple threads per process, because when a thread is waiting for IO another thread can execute.
I am starting to use PPL to create tasks and dispatch them [possibly] to other threads, like this:
Concurrency::task_group tasks;
auto simpleTask = Concurrency::make_task(&simpleFunction);
tasks.run(simpleTask);
I experimented with a small application that creates a task every second. Each task performs heavy calculations for 5 seconds and then stops.
I wanted to know how many threads PPL creates on my machine and whether the load of the machine influences the number of threads or the tasks assigned to the threads. When I run one or more instances of my application on my 12-core machine, I notice this:
When running 1 application, it creates 6 threads. Total CPU usage is 50%.
When running 2 applications, both of them create 6 threads. Total CPU usage is 100% but my machine stays rather responsive.
When running 3 applications, all of them create 6 threads (already 18 threads in total). Total CPU usage is 100%.
When running 4 applications, I already have 24 threads in total.
I investigated the running applications with Process Explorer and with 4 applications I can clearly see that they all have 6 (sometimes even 12) threads that are all trying to consume as much CPU as possible.
PPL allows you to limit the number of threads by configuring the default scheduler, like this:
Concurrency::SchedulerPolicy policy(1, Concurrency::MaxConcurrency, 2);
Concurrency::Scheduler::SetDefaultSchedulerPolicy(policy);
With this you statically limit the number of threads (2 in this case). It can be handy if you know beforehand that on a server with 24 cores there are 10 simultaneous users (so you can limit every application to 2 threads), but if one of the 10 users is working late, he still only uses 2 threads, while the rest of the machine is idling.
My question: is there a way to configure PPL so that it dynamically decides how many threads to create (or keep alive or keep active) based on the load of the machine? Or does PPL already do this by default, and are my observations incorrect?
EDIT: I tried starting more instances of my test application, and although my machine remains quite responsive (I was wrong in the original question) I can't see the applications reducing their number of simultaneous actions.
The short answer to your question is "No." The default PPL scheduler and resource manager will only use process-local information to decide when to create/destroy threads. As stated in the Patterns and Practices article on MSDN:
The resource manager is a singleton that works across one process. It does not coordinate processor resources across multiple operating-system processes. If your application uses multiple, concurrent processes, you may need to reduce the level of concurrency in each process for optimum efficiency.
If you're willing to accept the complexity, you may be able to implement a custom scheduler/resource manager to take simple system-level performance readings (e.g. using the PDH functions) to achieve what you want.
I want to run 50 tasks. All of these tasks execute the same piece of code; the only difference is the data. Which will complete faster?
a. Queuing up 50 tasks in a queue
b. Queuing up 5 tasks each in 10 different queues
Is there any ideal number of tasks that can be queued up in 1 queue before using another queue ?
The rate at which tasks are executed depends on two factors: the number of instances your app is running on, and the execution rate of the queue the tasks are on.
The maximum task queue execution rate is now 100 per queue per second, so that's not likely to be a limiting factor, and there's no harm in adding them all to the same queue. In any case, sharding between queues for more execution rate is at best a hack. Queues are designed for functional separation, not as a performance measure.
The bursting rate of task queues is controlled by the bucket size. If there is a token in the queue's bucket the task should run immediately. So if you have:
queue:
- name: big_queue
  rate: 50/s
  bucket_size: 50
and you haven't queued any tasks in the last second, then all 50 tasks should start right away.
See http://code.google.com/appengine/docs/python/config/queue.html#Queue_Definitions for more information.
Splitting the tasks into different queues will not improve the response time unless the bucket hasn't had enough time to completely refill with tokens.
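For completeness, here is a rough sketch of enqueuing all 50 tasks onto that single queue. This assumes the Java runtime and a hypothetical /worker handler that reads the per-task data; if you're on the Python runtime the taskqueue API is the equivalent.

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class Enqueue50 {

    public static void enqueueAll() {
        Queue queue = QueueFactory.getQueue("big_queue"); // the queue defined above

        // All 50 tasks go onto the same queue; only the payload differs.
        for (int i = 0; i < 50; i++) {
            queue.add(TaskOptions.Builder
                    .withUrl("/worker")                      // hypothetical handler that does the work
                    .param("index", Integer.toString(i)));   // per-task data
        }
    }
}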
I'd add another factor into the mix: concurrency. If you have slow-running tasks (more than 30 seconds or so), then App Engine seems to struggle to scale up the correct number of instances to deal with the requests (it seems to max out at about 7-8 for me).
As of SDK 1.4.3, there are settings in your queue.xml and your appengine-web.xml you can use to tell App Engine that each instance can handle more than one task at a time:
<threadsafe>true</threadsafe> (in appengine-web.xml)
<max-concurrent-requests>10</max-concurrent-requests> (in queue.xml)
This solved all my problems with tasks executing too slowly (despite setting all other queue params to the maximum).
More Details (http://blog.crispyfriedsoftware.com)
Queue up 50 tasks and set your queue to process 10 at a time, or whatever you would like, if they can run independently of each other. I saw a similar problem, and I just run 10 tasks at a time to process the 3300 or so that I need to run. It takes 45 minutes or so to process all of them, but surprisingly the CPU time used is negligible.