WIN32: Yielding execution to another (given) thread


I am looking for a way to yield the remainder of a thread's scheduled time slice to a different thread. There is a SwitchToThread function in the Windows API, but it doesn't let the caller specify the thread it wants to switch to. I browsed MSDN for quite some time and haven't found anything that would offer just that.
To an operating-system-internals layman like me, it seems that a yielding thread should be able to specify which thread it wants to pass execution to. Is it possible, or is it just my imagination?

The reason you can't yield processor time slices to a designated thread is that Windows has a preemptive scheduling kernel, which places the responsibility and authority for scheduling processor time in the hands of the kernel and only the kernel.
As such, threads have no control over when they run or whether they run, and even less over which thread is switched to after their time slice is up.
However, there are a few ways you can influence context switches:
By increasing the priority of a certain thread, you can make the scheduler run it more often, to the detriment of other threads (the reverse applies as well: you can lower the priority of other threads).
You can code your process to place threads in a kernel wait when they don't have work to do, in order to help the scheduler do its job. When you use proper wait constructs such as Critical Sections, Mutexes, Semaphores, and Timers, you effectively tell the kernel that a certain thread doesn't need to be scheduled until a certain condition is met.
Note: There is rarely a good reason to tamper with thread priorities, so USE WITH CAUTION.

You might use 'fibers' instead of 'threads': for example there's a Win32 API named SwitchToFiber which lets you specify the fiber to be scheduled.

Take a look at UMS (User-mode scheduling) threads in Windows 7
http://msdn.microsoft.com/en-us/library/dd627187(VS.85).aspx

The second thread can simply wait for the yielding thread to finish, either by calling WaitForSingleObject() on its handle or by periodically polling GetExitCodeThread(). The other answers are correct about altering the operating system's scheduling mechanisms: it is better to design the threads properly in the first place.

This is not possible. Only the kernel can decide what code runs next, though you can influence it by reducing the set of non-waiting threads it has to choose from, and by setting thread priorities with SetThreadPriority.

You can use regular synchronization primitives like events, semaphores, etc. to serialize your two threads. This does not in any way prevent the kernel from scheduling other threads in between, in parallel on another CPU core, or virtually simultaneously on the same core. This is due to the preemptive multitasking nature of modern general-purpose operating systems.
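As a rough illustration of serializing two threads with a synchronization primitive, here is a minimal sketch in portable C++. It uses std::condition_variable in place of the Win32 events the answer mentions, and the function name run_ping_pong and the turn variable are made up for this example. It only controls the order in which the two threads do their work; the kernel still decides when each one actually gets the CPU.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical helper: two threads take strict turns appending to a log.
// The condition variable stands in for a pair of Win32 event objects.
std::string run_ping_pong(int rounds) {
    assert(rounds >= 0);
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;          // 0: thread A may proceed, 1: thread B may proceed
    std::string log;

    auto worker = [&](int me, char tag) {
        for (int i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return turn == me; });  // block until it is our turn
            log.push_back(tag);
            turn = 1 - me;                              // hand the turn to the other thread
            cv.notify_all();
        }
    };

    std::thread a(worker, 0, 'A');
    std::thread b(worker, 1, 'B');
    a.join();
    b.join();
    return log;
}
```

Calling run_ping_pong(3) yields "ABABAB": each thread blocks in a wait (consuming no CPU time) until the other hands over the turn.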

If you want to do your own scheduling under Windows, you can use fibers, which essentially are threads that you have to schedule yourself. However, given that you describe yourself as a layman to the OS internals world, that would probably be a bad idea, as fibers are something of an advanced feature.

Can I ask why you want to use SwitchToThread?
If, for example, thread X is computing some value that thread Y needs to wait for, then I'd really suggest looking at the Parallel Patterns Library or the Asynchronous Agents Library in Visual Studio 2010. They let you do this either with message blocks (receive on an asynchronous value) or simply with tasks: wait for a set of tasks to complete and inline their execution while waiting.
// i.e. on an arbitrary thread (requires #include <ppl.h>)
Concurrency::task_group tasks;
tasks.run([] { /* some functor */ });
tasks.wait(); // blocks until the tasks in the group have finished
A call to tasks.wait() will wait for the group and can run tasks that have not started yet inline on the calling thread.

Related

Threadpool - CPU usage?

I am working on a Windows C++ application. We use the Boost library. I have an operation in my application that can be parallelized to run on multiple threads. The number of threads depends on the operation's parameters each time and can be large (say 50 or 70). I don't want to spawn as many threads as I can, since that risks making the application unresponsive to other operations (because all the processors could be occupied doing this). How can I make sure I don't create the situation I described? Would a thread pool help, and if so, how?

70 threads on modern hardware can be easily handled w/o any noticeable impact on system performance. Thread creation time, memory usage, scheduling and context switch overhead can be a problem but we don't know if it's a problem in your particular case.
If creating 70 threads is not an option, consider using OpenMP (supported by all major compilers) as it's a very simple and often very efficient solution:
#pragma omp parallel for
for (int i = 1; i < 100; ++i)
{
    do_task(i);
}
It uses a thread pool under the hood.
If OpenMP is not acceptable for some reason, you can go with an explicit thread pool. It can be a "home-made" thread pool (not recommended), the one from @sehe's answer, one provided by the OS (as @Hans Passant mentioned in his comment), or one from a third-party library (e.g. Intel Threading Building Blocks).
Yes, a thread pool can help with responsiveness, though a typical thread pool implementation by default creates as many threads as there are logical CPU cores. This means all your cores can be busy doing your work, and that is not necessarily a problem. Windows uses preemptive multithreading, so it can handle a number of threads much greater than the number of CPUs and still remain responsive.
A thread pool can help because it is not possible to execute more tasks simultaneously than you have logical CPU cores, so extra threads add overhead without adding parallelism. A thread pool can be more efficient because of better cache use and fewer context switches, or because the same threads can be reused to execute your operation multiple times. To know for sure, profile your application.
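To make the idea concrete, here is a minimal sketch of an explicit thread pool in standard C++. It is not the pool from any of the answers or libraries mentioned above; the names SimplePool and run_tasks are invented for this example, and a production pool would also need exception handling and a way to return results. A fixed number of workers drains one shared queue, so submitting 70 tasks never creates more than the chosen number of threads.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed set of workers draining one shared queue.
class SimplePool {
public:
    explicit SimplePool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }

    // Lets the workers finish every queued task, then joins them.
    ~SimplePool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void work() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                // Idle workers block here and use no CPU time.
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;   // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();                           // run the task outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Runs `count` trivial tasks on `workers` threads; returns how many completed.
int run_tasks(std::size_t workers, int count) {
    std::atomic<int> done{0};
    {
        SimplePool pool(workers);
        for (int i = 0; i < count; ++i)
            pool.submit([&done] { ++done; });
    }   // the destructor drains the queue and joins the workers
    return done.load();
}
```

To leave headroom for the rest of the application, one option is to size the pool slightly below std::thread::hardware_concurrency(); whether that actually improves responsiveness is something to measure rather than assume.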

Just create a thread pool, e.g. the one I posted in "boost thread throwing exception "thread_resource_error: resource temporarily unavailable"".
There are two more flavours in "c++ work queues with blocking" (one using Asio, one using just C++11).

You can use std::async with the default launch policy. However, this is not the same as a thread pool.
In OpenMP, you can set a fixed number of threads and then use OpenMP tasks. Unfortunately, there is no such option in C++11. With the default policy, the Standard lets the implementation choose whether the function is invoked asynchronously, as if in a new thread, or deferred and run synchronously in the thread that calls wait or get on the corresponding std::future object. However, when asynchronous invocation is selected, a new thread must still be created.

Could process running only on one processor have threads running on other processors?

Is it possible, in a multiprocessor environment (PC), that one Windows process is configured to run on only one processor (affinity mask = 1 or SetProcessAffinityMask(GetCurrentProcess(),1)), but its threads are spawned on other processors?
The question came from a discussion started in one company regarding the use of synchronization objects (Events, Mutexes, Semaphores) and WinAPI functions like WaitForSingleObject, and especially SignalObjectAndWait, for which MSDN states:
"Note that the "signal" and "wait" are not guaranteed to be performed
as an atomic operation. Threads executing on other processors can
observe the signaled state of the first object before the thread
calling SignalObjectAndWait begins its wait on the second object"
Does it mean that on a single processor it is guaranteed to be atomic?
P.S. Is there any difference in Windows context switching between having multiple processors and having a single processor with several real cores?
P.P.S. Please be patient with this question if I didn't use exact and concrete terms. This area is still not very well known to me.

No.
The set of processor cores a thread can run on is the intersection of the process affinity mask and the thread affinity mask.
To get the behavior you describe, one would set the thread affinity mask for the main thread, and not mess with the process mask.
For your followup questions: If it isn't atomic, it isn't atomic. There are additional guarantees for threads sharing a core, because preemption follows certain rules, but they are very complex, since relative priority and dynamic priority are important factors in thread scheduling. Because of the complexity, it is best to use proper synchronization.
Notably, race conditions between threads of equal priority certainly still exist on a single core (or single core restricted) system, but they are far less frequent and therefore far more difficult to find and debug.

Is it possible, in multiprocessor environment (PC) that one windows process is configured to run only on one processor (affinity mask = 1 or SetProcessAffinityMask(GetCurrentProcess(),1)), but its thread are spawned on other processors?
If CPU affinity is not set to a single core, could one process run on multiple cores?
What's the difference between processes and threads?
Can a thread have processes, or can a process have threads?
Can a process be seen from a thread's point of view, or vice versa?
What is the notion of atomic?
When could the number 1 be seen as a multidimensional unit?
Could we divide 1 by 0? When could we, and when couldn't we?
Does it mean that for single processor it's guaranteed to be atomic?
One cpu: do you remember: run and stay resident? Good old time!
Then Unix: multiprocessing, multithreading, etc. :)
Note:
You couldn't ask a question without knowing answer to that question.
Try to ask something you don't know, that's impossible! You're asking because you have an answer. Look inside your question. Answer is evident. :)

Instruct win32 threads to run on a single processor core

I have a test program which would be much simpler if it could rely on threads being scheduled in strict priority order on Windows. I'm seeing a low priority thread running alongside higher priority threads and wonder if this is happening because the different threads are being scheduled on different processor cores.
Is there a way to force all Win32 threads in a process to use a single processor core? SetThreadAffinityMask looks like it might be interesting but its docs aren't entirely clear and I'm not sure how to use it.

SetThreadAffinityMask function: Sets a processor affinity mask for the specified thread.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247%28v=vs.85%29.aspx
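SetThreadAffinityMask(GetCurrentThread(), ((DWORD_PTR)1 << CoreNumber));
This restricts the current thread to the core whose zero-based index is CoreNumber. The cast keeps the shift from overflowing a 32-bit int on 64-bit systems. To restrict every thread in the process at once, pass the same mask to SetProcessAffinityMask(GetCurrentProcess(), mask).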
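The mask arithmetic and the process-wide variant can be sketched as follows. The helper names single_core_mask and pin_process_to_core are invented for this example; only SetProcessAffinityMask and GetCurrentProcess are real Win32 calls, and they are compiled only on Windows. The sketch assumes a single processor group (at most 64 logical processors).

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Builds a one-bit affinity mask: bit i selects logical processor i.
std::uint64_t single_core_mask(unsigned core) {
    assert(core < 64);                  // one processor group holds at most 64
    return std::uint64_t{1} << core;    // 64-bit shift, so no int overflow
}

// Restricts every thread of the current process to one core.
// Returns false if the call fails or when built for a non-Windows target.
bool pin_process_to_core(unsigned core) {
#ifdef _WIN32
    return SetProcessAffinityMask(
               GetCurrentProcess(),
               static_cast<DWORD_PTR>(single_core_mask(core))) != 0;
#else
    (void)core;                         // nothing to do off Windows
    return false;
#endif
}
```

Calling pin_process_to_core(0) early in main would confine the test program to the first logical processor, which removes the "different cores" explanation for the low-priority thread running alongside the others, though it does not by itself guarantee strict priority order.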

Even if you force all threads onto one virtual processor, you will still often have low-priority threads running while high-priority threads wait for them (priority inversion). Once a thread is scheduled by the Windows scheduler, it runs until it is either preempted or goes to sleep (through Sleep or some other blocking system call). You will have to change the design of your application so that it no longer assumes that a low-priority thread never runs while a high-priority thread is ready to run.

Is it better to poll or wait?

I have seen a question on why "polling is bad". In terms of minimizing the amount of processor time used by one thread, would it be better to do a spin wait (i.e. poll for a required change in a while loop) or wait on a kernel object (e.g. a kernel event object in windows)?
For context, assume that the code would be required to run on any type of processor, single core, hyperthreaded, multicore, etc. Also assume that a thread that would poll or wait can't continue until the polling result is satisfactory if it polled instead of waiting. Finally, the time between when a thread starts waiting (or polling) and when the condition is satisfied can potentially vary from a very short time to a long time.
Since the OS is likely to more efficiently "poll" in the case of "waiting", I don't want to see the "waiting just means someone else does the polling" argument, that's old news, and is not necessarily 100% accurate.

Provided the OS has reasonable implementations of these types of concurrency primitives, it's definitely better to wait on a kernel object.
Among other reasons, this lets the OS know not to schedule the thread in question for additional timeslices until the object being waited-for is in the appropriate state. Otherwise, you have a thread which is constantly getting rescheduled, context-switched-to, and then running for a time.
You specifically asked about minimizing the processor time for a thread: in this example the thread blocking on a kernel object would use ZERO time; the polling thread would use all sorts of time.
Furthermore, the "someone else is polling" argument needn't be true. When a kernel object enters the appropriate state, the kernel can look to see at that instant which threads are waiting for that object...and then schedule one or more of them for execution. There's no need for the kernel (or anybody else) to poll anything in this case.

Waiting is the "nicer" way to behave. When you are waiting on a kernel object, your thread won't be granted any CPU time, because the scheduler knows there is no work ready. Your thread is only given CPU time once its wait condition is satisfied, which means you won't be hogging CPU resources needlessly.

I think a point that hasn't been raised yet is that if your OS has a lot of work to do, blocking yields your thread's time to another process. If all processes use the blocking primitives where they should (such as kernel waits, file/network I/O, etc.), you're giving the kernel more information to choose which threads should run. As such, it will do more work in the same amount of time. If your application could be doing something useful while waiting for that file to open or the packet to arrive, then yielding will even help your own app.

Waiting does involve more resources and means an additional context switch. Indeed, some synchronization primitives like CLR Monitors and Win32 critical sections use a two-phase locking protocol: some spin waiting is done before actually doing a true wait.
I imagine doing the two-phase thing would be very difficult, and would involve lots of testing and research. So, unless you have the time and resources, stick to the windows primitives...they already did the research for you.
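For readers curious what the two-phase idea looks like, here is a deliberately simplified sketch in standard C++. It is not how CLR Monitors or Win32 critical sections are implemented; the class name TwoPhaseFlag and the default spin limit of 1000 are arbitrary choices made for this example, and picking a good spin count is exactly the research the answer warns about.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical one-shot flag whose waiters spin briefly, then block.
class TwoPhaseFlag {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ready_.store(true, std::memory_order_release);
        }
        cv_.notify_all();
    }

    // Returns true if the spin phase alone observed the flag,
    // false if the caller had to fall back to a blocking wait.
    bool wait(int spin_limit = 1000) {
        assert(spin_limit >= 0);
        for (int i = 0; i < spin_limit; ++i) {        // phase 1: cheap polling
            if (ready_.load(std::memory_order_acquire)) return true;
        }
        std::unique_lock<std::mutex> lock(m_);        // phase 2: true wait
        cv_.wait(lock, [this] { return ready_.load(std::memory_order_acquire); });
        return false;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::atomic<bool> ready_{false};
};
```

If the flag is set before or during the short spin, the waiter avoids the cost of blocking and being rescheduled; otherwise it stops burning CPU and sleeps until set() notifies it.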

There are only a few places, usually within low-level OS code (interrupt handlers, device drivers), where spin-waiting makes sense or is required. General-purpose applications are always better off waiting on synchronization primitives like mutexes, condition variables, or semaphores.

I agree with Darksquid: if your OS has decent concurrency primitives, then you shouldn't need to poll. Polling usually comes into its own on real-time systems or restricted hardware that doesn't have an OS. There you need to poll, both because you might not have the option to wait() and because it gives you fine-grained control over exactly how long you want to wait in a particular state, as opposed to being at the mercy of the scheduler.

Waiting (blocking) is almost always the best choice ("best" in the sense of making efficient use of processing resources and minimizing the impact to other code running on the same system). The main exceptions are:
When the expected polling duration is small (similar in magnitude to the cost of the blocking syscall).
Mostly in embedded systems, when the CPU is dedicated to performing a specific task and there is no benefit to having the CPU idle (e.g. some software routers built in the late '90s used this approach.)
Polling is generally not used within OS kernels to implement blocking system calls - instead, events (interrupts, timers, actions on mutexes) result in a blocked process or thread being made runnable.

There are four basic approaches one might take:
1. Use some OS waiting primitive to wait until the event occurs.
2. Use some OS timer primitive to check at some defined rate whether the event has occurred yet.
3. Repeatedly check whether the event has occurred, but use an OS primitive to yield a time slice for an arbitrary and unknown duration any time it hasn't.
4. Repeatedly check whether the event has occurred, without yielding the CPU if it hasn't.
When #1 is practical, it is often the best approach unless delaying one's response to the event might be beneficial. For example, if one is expecting to receive a large amount of serial port data over the course of several seconds, and if processing data 100ms after it is sent will be just as good as processing it instantly, periodic polling using #2 or #3 might be better than setting up a "data received" event.
Approach #3 is rather crude, but may in many cases be a good one. It will often waste more CPU time and resources than approach #1 would, but it will in many cases be simpler to implement, and the resource waste will in many cases be small enough not to matter.
Approach #2 is often more complicated than #3, but has the advantage of being able to handle many resources with a single timer and no dedicated thread.
Approach #4 is sometimes necessary in embedded systems, but is generally very bad unless one is directly polling hardware and the thread won't have anything useful to do until the event in question occurs. In many circumstances, it won't be possible for the condition being waited upon to occur until the thread waiting for it yields the CPU. Yielding the CPU as in approach #3 will in fact allow the waiting thread to see the event sooner than hogging it would.
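The check-and-yield loop of the third approach can be sketched in a few lines of standard C++. The function names wait_by_yielding and demo_yield_wait are invented for this example, and std::this_thread::yield is used as the portable counterpart of a Win32 yield call such as SwitchToThread. Removing the yield call turns the same loop into the fourth approach.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Check-and-yield wait: poll the flag, giving up the rest of the time slice
// whenever it is not set yet. Returns the number of yields performed.
long wait_by_yielding(const std::atomic<bool>& flag) {
    long yields = 0;
    while (!flag.load(std::memory_order_acquire)) {
        std::this_thread::yield();   // let some other runnable thread go first
        ++yields;
    }
    return yields;
}

// Small demonstration: another thread sets the flag while we wait for it.
long demo_yield_wait() {
    std::atomic<bool> ready{false};
    std::thread setter([&ready] { ready.store(true, std::memory_order_release); });
    long yields = wait_by_yielding(ready);
    setter.join();
    assert(ready.load());            // the loop only exits once the flag is set
    return yields;
}
```

How many times the loop yields depends entirely on scheduling, which is the "arbitrary and unknown duration" the answer refers to.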

Win32 Thread scheduling

As I understand it, the Windows thread scheduler does not discriminate between threads belonging to two different processes, provided all of them have the same base priority. My question is: if I have two applications, one with only one thread and the other with, say, 50 threads, all with the same base priority, does it mean that the second process enjoys more CPU time than the first one?

Scheduling in Windows is at thread granularity. The basic idea behind this approach is that processes don't run; they only provide resources and a context in which their threads run. Coming back to your question: because scheduling decisions are made strictly on a thread basis, no consideration is given to which process a thread belongs to. In your example, if process A has 1 runnable thread and process B has 50 runnable threads, and all 51 threads are at the same priority, each thread would receive 1/51 of the CPU time, so process B would get 50/51 (about 98 percent) in total. Windows wouldn't give 50 percent of the CPU to process A and 50 percent to process B.
To understand the thread-scheduling algorithms, you must first understand the priority levels that Windows uses. You can refer here for quick reference.
Try reading Windows Internals for in depth understanding.

All of the above are accurate but if you're worried about the 50 thread process hogging all the CPU, there ARE techniques you can do to ensure that no single process overwhelms the CPU.
IMHO the best way to do this is to use job objects to manage the usage of a process. First call CreateJobObject, then SetInformationJobObject to limit the max CPU usage of the processes in the job object and AssignProcessToJobObject to assign the process with 50 threads to the job object. You can then let the OS ensure that the 50 thread process doesn't consume too much CPU time.

The unit of scheduling is a thread, not a process, so a process with 50 threads, all in a tight loop, will get much more of the cpu than a process with only a single thread, provided all are running at the same priority. This is normally not a concern since most threads in the system are not in a runnable state and will not be up for scheduling; they are waiting on I/O, waiting for input from the user, and so on.
Windows Internals is a great book for learning more about the Windows thread scheduler.

That depends on the behavior of the threads. In general, with a 50:1 difference in thread count, yes, the application with more threads is going to get a lot more time. However, Windows also uses dynamic thread prioritization, which can change this somewhat. Dynamic thread prioritization is described here:
https://web.archive.org/web/20130312225716/http://support.microsoft.com/kb/109228
Relevant excerpt:
The base priority of a thread is the base level from which these upward adjustments are made. The current priority of a thread is called its dynamic priority. Interactive threads that yield before their time slice is up will tend to be adjusted upward in priority from their base priority. Compute-bound threads that do not yield, consuming their entire time slice, will tend to have their priority decreased, but not below the base level. This arrangement is often called heuristic scheduling. It provides better interactive performance and tends to lessen the system impact of "CPU hog" threads.

There is a local 'advanced' setting that purportedly can be used to shade scheduling slightly in favor of the app with focus. With the 'services' setting, there is no preference. In previous versions of Windows, this setting used to be somewhat more granular than just 'applications with focus' (slight preference to the app with focus) and 'services' (all equal weighting).
As this can be set by the user on the target machine, it seems like asking for grief to depend on this setting...
