Does the Windows scheduler sometimes fail to preempt a running thread immediately to let a higher-priority thread run?

My application operates on pairs of long vectors - say it adds them together to produce a vector result. Its rules state that it must completely finish with one pair before it can be given another. I would like to use multiple threads to speed things up. I am running Windows 10.
I created an OpenMP parallel for construct and divided the vector among all the threads of the team. All threads start, all threads run pretty fast, so the multithreading is effective.
But the speedup is slight, and the reason is that some of the time, one of the worker threads takes way longer than usual. I have instrumented the operation, and I see that sometimes the worker threads take a long time to start - delay varies from 20 microseconds on average to dozens of milliseconds depending on system load. The master thread does not show this delay.
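Not the asker's actual code, but a minimal sketch of the setup described above: an OpenMP parallel for that adds two vectors, with a per-thread timestamp recording how long after entering the parallel region each thread started. All names (kN, start_delay, etc.) are illustrative.

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const long long kN = 1 << 20;                       // vector length (assumed)
    std::vector<double> a(kN, 1.0), b(kN, 2.0), c(kN);
    std::vector<double> start_delay(omp_get_max_threads(), 0.0);

    double t0 = omp_get_wtime();                        // taken just before the parallel region
    #pragma omp parallel
    {
        // How long this thread took to reach the region, relative to t0.
        start_delay[omp_get_thread_num()] = omp_get_wtime() - t0;

        #pragma omp for
        for (long long i = 0; i < kN; ++i)
            c[i] = a[i] + b[i];
    }

    for (int t = 0; t < omp_get_max_threads(); ++t)
        std::printf("thread %d start delay: %.1f us\n", t, start_delay[t] * 1e6);
    return 0;
}

Instrumented this way, the master thread's delay is essentially zero, while the workers' delays reflect however long the scheduler took to get them running.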
That makes me think that the scheduler is taking some time to start the worker threads. The master thread is already running, so it doesn't have to wait to be started.
But here is the nub of the question: raising the priority of the process doesn't make any difference. I can raise it to high priority or even realtime priority, and I still see that startup of the worker threads is often delayed. It looks like the Windows scheduler is not fully preemptive, and sometimes lets a lower-priority thread run when a higher-priority one is eligible. Can anyone confirm this?
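For reference, a hedged sketch of how the priority can be raised from code (the question does not show how it was done; these are standard Win32 calls, and OpenMP worker threads take their base priority from the process priority class):

#include <windows.h>
#include <cstdio>

int main() {
    // Raise the priority class of the whole process. Worker threads created later
    // (e.g. by the OpenMP runtime) get their base priority from this class.
    if (!SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS))    // or REALTIME_PRIORITY_CLASS
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());

    // Optionally raise the calling (master) thread within the class as well.
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST))
        std::printf("SetThreadPriority failed: %lu\n", GetLastError());
    return 0;
}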
I have verified that the worker threads are created with the default OS priority, namely the base priority of the class of the master process. This should be higher than the priority of any running thread, I think. Or is it normal for there to be some thread with realtime priority that might be blocking my workers? I don't see one with Task Manager.
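One way to check this (again a sketch, not the asker's code) is to query each worker's priority from inside the parallel region:

#include <windows.h>
#include <omp.h>
#include <cstdio>

int main() {
    std::printf("process priority class: 0x%lx\n",
                GetPriorityClass(GetCurrentProcess()));

    #pragma omp parallel
    {
        // 0 (THREAD_PRIORITY_NORMAL) means the thread runs at the base priority
        // of its process priority class.
        #pragma omp critical
        std::printf("thread %d priority level: %d\n",
                    omp_get_thread_num(), GetThreadPriority(GetCurrentThread()));
    }
    return 0;
}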
I guess one last possibility is that the task switch might take 20 usec. Is that plausible?
I have a 4-core system without hyperthreading.

Related

Why does the Golang scheduler use two queues (global run queue and local run queue) to manage goroutines?

I was reading about how Golang internally manages newly created goroutines, and I learned that the runtime scheduler uses two queues to manage them.
Global run queue: all newly created goroutines are placed in this queue.
Local run queue: goroutines that are about to run are placed in a local run queue, and from there the scheduler assigns them to an OS thread.
So my question is: why does the scheduler use two queues to manage goroutines? Why can't it just use the global run queue and map goroutines to OS threads from there?
First, please note that this blog is an unofficial and old source, so the information in it shouldn't be taken as totally accurate with respect to the current version of Go (or any version, for that matter). You can still learn from it, but the Go scheduler has improved over time, which can make the information out of date. For example, the blog says "Go scheduler is not a preemptive scheduler but a cooperating scheduler". As of Go 1.14, this is no longer true, as preemption was added to the runtime. As for the other information, I won't vouch for its accuracy, but here's an explanation of what they say.
Reading the blog post:
There are two different run queues in the Go scheduler: the Global Run Queue (GRQ) and the Local Run Queue (LRQ). Each P is given a LRQ that manages the Goroutines assigned to be executed within the context of a P. These Goroutines take turns being context-switched on and off the M assigned to that P. The GRQ is for Goroutines that have not been assigned to a P yet. There is a process to move Goroutines from the GRQ to a LRQ that we will discuss later.
This means the GRQ is for Goroutines that haven't been assigned to run yet, while the LRQ is for Goroutines that have been assigned to a P to run or have already begun executing. Each Goroutine will start on the GRQ and join a LRQ later to begin executing.
Here is the process that the previous quote was referencing, where Goroutines are moved from the GRQ to LRQ:
In figure 10, P1 has no more Goroutines to execute. But there are Goroutines in a runnable state, both in the LRQ for P2 and in the GRQ. This is a moment where P1 needs to steal work. The rules for stealing work are as follows.
runtime.schedule() {
// only 1/61 of the time, check the global runnable queue for a G.
// if not found, check the local queue.
// if not found,
// try to steal from other Ps.
// if not, check the global runnable queue.
// if not found, poll network.
}
This means a P will prioritize running goroutines from its own LRQ, then from other Ps' LRQs, then from the GRQ, then from network polling. There is also a small chance to immediately run a Goroutine from the GRQ. Having multiple queues allows this priority system to be constructed.
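To make that ordering concrete, here is a rough, language-neutral sketch (written in C++ only for consistency with the other examples on this page; every name in it is made up, and it omits the 1/61 global-queue check and network polling):

#include <cstddef>
#include <deque>
#include <optional>
#include <vector>

struct Task { int id; };

struct P {
    std::deque<Task> local;                      // LRQ: tasks already assigned to this P
};

struct Scheduler {
    std::vector<P> ps;                           // one P per logical processor
    std::deque<Task> global;                     // GRQ: tasks not yet assigned to any P

    std::optional<Task> next(std::size_t self) {
        P& me = ps[self];
        if (!me.local.empty()) {                 // 1. own LRQ first
            Task t = me.local.front(); me.local.pop_front(); return t;
        }
        for (std::size_t i = 0; i < ps.size(); ++i)   // 2. steal from another P's LRQ
            if (i != self && !ps[i].local.empty()) {
                Task t = ps[i].local.back(); ps[i].local.pop_back(); return t;
            }
        if (!global.empty()) {                   // 3. fall back to the GRQ
            Task t = global.front(); global.pop_front(); return t;
        }
        return std::nullopt;                     // 4. nothing runnable (a real runtime would poll the network)
    }
};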
Why do we want priority in which goroutines get run? It may have various performance benefits. For example, it could make better use of the CPU cache. If you run a Goroutine that was already running recently, it's more likely that the data it's working with is still in the CPU cache, making it fast to access. When you start up a new Goroutine, it may use or create data that isn't in the cache yet. That data will then enter the cache and could evict the data being used by another Goroutine, which in turn causes that Goroutine to be slower when it resumes again. In the pathological case, this is called cache thrashing, and greatly reduces the effective speed of the processor.
Allowing the CPU cache to work effectively can be one of the most important factors in achieving high performance on modern processors, but it's not the only reason to have such a queue system. In general, the more logical processes that are running at the same time (such as Goroutines in a Go program), the more resource contention will occur. This is because the resources used by a process tend to be fairly stable over the runtime of the process. In other words, starting a new process tends to increase the overall resource load, continuing an already started process tends to maintain that load, and finishing a process tends to reduce it. Therefore, prioritizing already running processes over new processes tends to help keep the resource load in a manageable range.
It's analogous to the practical advice of "finish what you started". If you have a lot of tasks to accomplish, it's more effective to complete them one at a time, or to multitask just a handful of things if you can. If you just keep starting new tasks and never finish the previous ones, eventually you have so many things going on at the same time that you feel overwhelmed.

Puma clustering benefits for site which handles lots of uploads/downloads

I'm trying to understand the benefits of using Puma clustering. The GitHub readme says that the number of Puma workers should be set to the number of CPU cores, and the default number of threads for each is 0-16. The worker processes can run in parallel while the threads run concurrently. It was my understanding that the MRI GIL only allows one thread across all cores to run Ruby code at a time, so how does Puma enable things to run in parallel / provide benefits over running one worker process with double the number of threads? The site I'm working on is heavily IO-bound, handling several uploads and downloads at the same time; any config suggestions for this setup are also welcome.
The workers in clustered mode will actually spawn new child processes, each of which has its own "GIL". Only one thread in a single process can be running code at a time, so having a process per CPU core works well because each CPU can only be doing one thing at a time. It also makes sense to run multiple threads per process, because while one thread is waiting on IO another thread can execute.

Detecting CPU load on different machines

I am trying to create a background task scheduler for my process, which needs to schedule (compute-intensive) tasks in parallel while maintaining the responsiveness of the UI.
Currently, I am comparing CPU usage (percentage) against a threshold (~50%) to decide when the scheduler should start a new task, and it sort of works fine.
This program can run on a variety of hardware configurations (e.g. processor speed, number of cores), so the 50% limit can be too strict or too lenient for certain configurations.
Is there a good way to take different parameters of the CPU configuration (e.g. cores, speed) into account and dynamically come up with a threshold based on the hardware?
My suggestions:
Run as many threads as CPUs in the system.
Set the priority of each thread to idle (lowest).
In each thread's main loop, do the smallest sleep possible, i.e. usleep(1). A sketch of this approach is shown below.
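A minimal sketch of that idea, assuming Linux/POSIX (SCHED_IDLE, usleep); the function and variable names are illustrative. The idle-priority probe threads only get CPU time when nothing else wants it, so the rate at which their counter climbs is a rough, hardware-independent measure of spare capacity:

#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long long> idle_iterations{0};

void idle_probe() {
    // SCHED_IDLE (Linux-specific): this thread runs only when the CPU is otherwise idle.
    sched_param sp{};                                      // priority must be 0 for SCHED_IDLE
    pthread_setschedparam(pthread_self(), SCHED_IDLE, &sp);
    for (;;) {
        usleep(1);                                         // smallest practical sleep
        idle_iterations.fetch_add(1, std::memory_order_relaxed);
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();      // one probe per CPU
    if (n == 0) n = 1;
    std::vector<std::thread> probes;
    for (unsigned i = 0; i < n; ++i)
        probes.emplace_back(idle_probe);

    for (;;) {
        long long before = idle_iterations.load();
        sleep(1);
        long long per_sec = idle_iterations.load() - before;
        // Higher counts mean more idle CPU. The task scheduler could compare this
        // against a rate calibrated once at startup rather than a fixed 50% threshold.
        std::printf("idle probe iterations/sec: %lld\n", per_sec);
    }
}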

Spring batch multithreading: throttle-limit impact

I have a multi-threaded Step configured with a threadpool with a corePoolSize of 48 threads (it's a big machine) but I did not configure the throttle-limit.
I am wondering if I have been under-utilizing the machine because of this.
The Spring Batch documentation says that throttle-limit is the max amount of concurrent tasks that can run at one time and the default is 4.
I can see in jconsole that in fact there are 48 threads created and they seem to be executing (I can also see that in my logs).
But, so, even though I can see the 48 threads created, does the throttle-limit of 4 mean that only 4 of those 48 threads are indeed executing work concurrently?
Thank you in advance.
Yes, your understanding is correct, i.e. only a number of threads equal to the throttle limit will be doing work concurrently.
In your case, since it's a thread pool, any four threads could be chosen to do the work while the rest of the threads remain idle; but since the threads are rotated across those four tasks, it gives the impression that all the threads are doing work concurrently.
corePoolSize simply indicates the number of threads to be started and maintained during the job run, but that doesn't mean all of them are running concurrently; it means you are trying to avoid thread creation overhead etc. during the job run.
You have not shared any code or job structure, so it's hard to point out any more specifics.
Hope it helps !!

Scheduling Priorities, Windows

Based on MSDN, the Windows OS schedules threads based on base priority and uses dynamic priority as a boost:
The system treats all threads with the same priority as equal. The
system assigns time slices in a round-robin fashion to all threads
with the highest priority. If none of these threads are ready to run,
the system assigns time slices in a round-robin fashion to all threads
with the next highest priority. If a higher-priority thread becomes
available to run, the system ceases to execute the lower-priority
thread (without allowing it to finish using its time slice), and
assigns a full time slice to the higher-priority thread.
From the above quote
The system treats all threads with the same priority as equal
Does it mean that the system schedules threads based on dynamic priority, and that base priority is used just as a lower limit for dynamic priority changes?
Thank you
Based on MSDN, the Windows OS schedules threads based on base priority and uses dynamic priority as a boost
Well, you follow that with a nice text snippet that has no sign of a dynamic priority boost.
More information about that is in the documentation - for example http://msdn.microsoft.com/en-us/library/windows/desktop/ms684828(v=vs.85).aspx is a good start.
In simple words: the scheduler schedules threads based on their current (dynamic) priority, and a priority boost changes that priority, so boosted threads get scheduled differently.
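As a hedged illustration using documented Win32 calls: GetThreadPriority reports the thread's priority level relative to its process priority class (part of the base priority), while the temporary boost the scheduler applies is not visible through it, and SetThreadPriorityBoost can turn the dynamic boost off for a thread so that only the base priority is used.

#include <windows.h>
#include <cstdio>

int main() {
    HANDLE th = GetCurrentThread();

    // Base priority = process priority class + thread priority level.
    std::printf("priority class: 0x%lx, thread priority level: %d\n",
                GetPriorityClass(GetCurrentProcess()), GetThreadPriority(th));

    // Disable dynamic (boosted) priority for this thread; the scheduler will then
    // schedule it using only its base priority.
    SetThreadPriorityBoost(th, TRUE);   // TRUE = disable boosting
    return 0;
}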
