What if GOMAXPROCS is too large? - performance

We all know that runtime.GOMAXPROCS is set to CPU core number by default, what if this property has been set too large?
Will program have more context switches?
Will garbage collector be triggered more frequently?

GOMAXPROCS is set to the number of available logical CPUs by default for a reason: this gives best performance in most cases.
GOMAXPROCS only limits the number of "active" threads, if a thread's goroutine gets blocked (e.g. by a syscall), a new thread might be started. There is no direct correclation, see Number of threads used by Go runtime.
If GOMAXPROCS is greater than the number of available CPUs, then there will be more active threads than CPU cores, which means active threads have to be "multiplexed" to the available processing units, so yes, there will be more context switches if there are more active threads than cores, which is not necessarily the case.
Garbage collections are not directly related to the number of threads, so you shouldn't worry about that. Quoting from package runtime:
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
If you have more threads that don't allocate / release memory, that shouldn't effect how frequently collections are triggered.
There might be cases when setting GOMAXPROCS above the number of CPUs increases the performance of your app, but they are rare. Measure to find out if it helps in your case.

Related

Is it nescessary to limit the number of go routines in an entirely cpu-bound workload?

If yes, how does one determine that maximum? That is the most important part to me. I'd really like to have it be set manually. I considered using runtime.GOMAXPROCS(0), as i doubt that more parallelism will yield any additional benefits. The comment seems to suggest, that it is marked for deprecation at some point.
From what I gather, the only limiting factor when it comes to go routines is memory, as a sleeping go routine still requires memory for its stack.
It's not strictly necessary. The number of threads running these goroutines is by default equal to the number of CPU cores on the machine (configurable through GOMAXPROCS), so there will be no contention at the thread level.
However, you might get performance benefits from having fewer goroutines ready to run, because of memory caching effects. For example, on an 8-core machine, if you have 1000 active goroutines that all touch significant amounts of memory, by the time a goroutine gets to run again, the needed memory pages have probably already been evicted from your CPU caches. With fewer goroutines, the odds of a cache hit are better.
As always with performance questions: the only way to be sure is to measure it yourself with a representative workload.
In our testing, we determined that it is best to spawn a fixed number of worker routines and use those to perform all the work. The creation and destruction of goroutines is lightweight, but not entirely free of overhead. That overhead is usually insignificant if the goroutines spend any amount of time blocked.
goroutines are very lightweight so it depends entirely on the system you are running on. An average process should have no problems with less than a million concurrent routines in 4GB Ram. Whether this goes for your target platform is, of course, something we can't answer without knowing what that platform is.
see this article and this, they are usefull

In which cases GetSystemInfo/GetLogicalProcessorInformationEx returns different processor count within the same program run?

There are Windows API functions to obtain CPU and CPU Cache topology.
GetSystemInfo fills SYSTEM_INFO with dwActiveProcessorMask and dwNumberOfProcessors.
GetLogicalProcessorInformationEx can be used to have more precise information about each processor, like cache size, cache line size, cache associativity, etc. Some of this information can be obtained by _cpuidex as well.
I'm asking in which cases the obtained values for one call are not consistent with obtained values for another call, if both calls are made without program restart.
Specifically, can CPU count change:
Should user hibernate, install new processor, and wake the system? Or even this wouldn't work?
Or operating system / hardware can just dynamically decide to plug in another processor?
Or it can be achieved with virtual machines, but not on real hardware?
Can cache line size and cache size change:
For GetLogicalProcessorInformationEx, because processor is replaced at runtime
For _cpuid just because system contains processors with different cache properties, and subsequent calls are on a different processor
The practical reason for these questions is that I want to obtain this information only once, and cache it:
I'm going to use it for allocator tuning, so changing allocation strategy for already allocated memory is likely to cause memory corruption.
These function may be expensive, they may be kernel calls, which I want to avoid
If a thread is scheduled to run on a core that belongs to a different processor group with a different number of processors, GetSystemInfo and GetLogicalProcessorInformation would return a different number of processors, which is the number of processors in the group that core belongs to.
On server-grade systems that support hot-adding CPUs, that number of processors can change as follows:
Hot-adding a CPU such that there is one processor group with less than 64 processors and the thread is running on that group would change the number of processors reported by GetSystemInfo, GetLogicalProcessorInformation, and
GetLogicalProcessorInformationEx.
The maximum size of a processor group is 64 and Windows always creates groups in such way as to minimize the total number of groups. So the number of groups is the ceiling of the total number of logical cores divided by 64. Hot-adding a CPU may result in a creating a new group. This not only changes the total number of processors, but also the number of groups, which can only be obtained via GetLogicalProcessorInformationEx.
On a virtual machine, the number of processors can also increase dynamically without hot-plugging.
You can simulate hot-adding processors using the PNPCPU tool. This is useful for testing the correctness of a piece of a code that depends on the number of processors.
It's possible to receive a notification from the system when a new processor is added by writing a device driver and registering a synchronous or asynchronous driver notification. I don't think this is possible without a device driver.
I think, on current systems, cache properties returned by GetLogicalProcessorInformationEx can only change on thread migration to another core or CPU where one or more of these properties are different. For example, on an Intel Lakefield processor, the properties of the L2 cache depend which core the thread is running because different cores have different L2 caches of different properties.

What is the maximum number of threads that can be running in a Delphi application?

In a Delphi application, what is the maximum number of concurrent threads that can be running at one time ? Suppose that a single thread processing time is about 100 milliseconds.
The number of concurrent threads is limited by available resources. However, keep in mind that every thread uses a minimum amount of memory (usually 1MB by default, unless you specify differently), and the more threads you run, the more work the OS has to do to manage them, and the more time it takes just to keep switching between them so they have fair opportunity to run. A good rule of thumb is to not have more threads than there are CPUs available, since that will be the maximum number of threads that can physically run at any given moment. But you can certainly have more threads than CPUs, the OS will simply schedule them accordingly, which can degrade performance if you have too many running at a time. So you need to think about why you are using threads in the first place and plan accordingly to trade off between performance, memory usage, overhead, etc. Multi-threaded programming is not trivial, so do not treat it lightly.
This is memory dependent, there is no fixed limit to how many threads or other objects that you can create. At some point, if you allocate too much memory, you may get an "out of memory" exception, so you should think about how many threads you really need to invoke and go from there. Also keep in mind the more threads that you invoke, you should expect the processing time for all of the threads to decrease. So you may not get the performance that you're looking for if you have too many concurrent threads at once. I hope that this helps!

Is it advisable to use the Windows API 'SetThreadPriority' within a parallel_for_each loop

I would like to reduce the thread-priority of the threads servicing a parallel_for_each, because under heavy load conditions they consume too much processor time relative to other threads in my system.
Questions:
1) Do the servicing threads of a parallel_for_each inherit the thread-priority of the calling thread? In this case I could presumably call SetThreadPriority before and after the parallel_for_each, and everything should be fine.
2) Alternatively is it advisable to call SetThreadPriority within the parallel_for_each? This will clearly invoke the API multiple times for the same threads. Is there a large overhead of doing this?
2.b) Assuming that I do this, will it affect thread-priorities the next time that parallel_for_each is called - ie do I need to somehow reset the priority of each thread afterwards?
3) I'm wondering about thread-priorities in general. Would anyone like to comment: supposing that I had 2 threads contending for a single processor and one was set to "below-normal" while the other was "normal" priority. Roughly what percentage more processor time would the one thread get compared to the other?
All threads initially start at THREAD_PRIORITY_NORMAL. So you'd have to reduce the priority of each thread. Or reduce the priority of the owning process.
There is little overhead in calling SetThreadPriority. Once you have woken up a thread, the additional cost of calling SetThreadPriority is negligible. Once you set the thread's priority it will remain at that value until changed.
Suppose that you have one processor, and two threads ready to run. The scheduler will always choose to run the thread with the higher priority. This means that in your scenario, the below normal threads would never run. In reality, there's a lot more to scheduling than that. For example priority inversion. However, you can think of it like this. If all processors are busy with normal priority threads, then expect lower priority threads to be starved of CPU.

How to reserve a core for one thread on windows?

I am working on a very time sensitive application which polls a region of shared memory taking action when it detects a change has occurred. Changes are rare but I need to minimize the time from change to action. Given the infrequency of changes I think the CPU cache is getting cold. Is there a way to reserve a core for my polling thread so that it does not have to compete with other threads for either cache or CPU?
Thread affinity alone (SetThreadAffinityMask) will not be enough. It does not reserve a CPU core, but it does the opposite, it binds the thread to only the cores that you specify (that is not the same thing!).
By constraining the CPU affinity, you reduce the likelihood that your thread will run. If another thread with higher priority runs on the same core, your thread will not be scheduled until that other thread is done (this is how Windows schedules threads).
Without constraining affinity, your thread has a chance of being migrated to another core (taking the last time it was run as metric for that decision). Thread migration is undesirable if it happens often and soon after the thread has run (or while it is running) but it is a harmless, beneficial thing if a couple of dozen milliseconds have passed since it was last scheduled (caches will have been overwritten by then anyway).
You can "kind of" assure that your thread will run by giving it a higher priority class (no guarantee, but high likelihood). If you then use SetThreadAffinityMask as well, you have a reasonable chance that the cache is always warm on most common desktop CPUs (which luckily are normally VIPT and PIPT). For the TLB, you will probably be less lucky, but there's nothing you can do about it.
The problem with a high priority thread is that it will starve other threads because scheduling is implemented so it serves higher priority classes first, and as long as these are not satisfied, lower classes get zero. So, the solution in this case must be to block. Otherwise, you may impair the system in an unfavorable way.
Try this:
create a semaphore and share it with the other process
set priority to THREAD_PRIORITY_TIME_CRITICAL
block on the semaphore
in the other process, after writing data, call SignalObjectAndWait on the semaphore with a timeout of 1 (or even zero timeout)
if you want, you can experiment binding them both to the same core
This will create a thread that will be the first (or among the first) to get CPU time, but it is not running.
When the writer thread calls SignalObjectAndWait, it atomically signals and blocks (even if it waits for "zero time" that is enough to reschedule). The other thread will wake from the Semaphore and do its work. Thanks to its high priority, it will not be interrupted by other "normal" (that is, non-realtime) threads. It will keep hogging CPU time until done, and then block again on the semaphore. At this point, SignalObjectAndWait returns.
Using the Task Manager, you can set the "affinity" of processes.
You would have to set the affinity of your time-critical app to core 4, and the affinity of all the other processes to cores 1, 2, and 3. Assuming four cores of course.
You could call the SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and use it on your process to set it to run just on this core (or, even better, SetThreadAffinityMask just on the thread that does the time-critical task).
Given the infrequency of changes I think the CPU cache is getting cold.
That sounds very strange.
Let's assume your polling thread and the writing thread are on different cores.
The polling thread will be reading the shared memory address and so will be caching the data. That cache line is probably marked as exclusive. Then the write thread finally writes; first, it reads the cache line of memory in (so that line is now marked as shared on both cores) and then it writes. Writing causes the polling thread CPU's cache line to be marked as invalid. The polling thread then comes to read again; if it reads while the writing thread still has the data cached, it will read from the second cores cache, invalidating its cache line and taking ownership for itself. There's a lot of bus traffic overhead to do this.
Another issue is that the writing thread, if it doesn't write often, will almost certainly lose the TLB entry for the page with the shared memory address. Recalculating the physical address is a long, slow process. Since the polling thread polls often, possibly that page is always in that cores TLB; and in that sense, you might well do better, in latency terms, to have both threads on the same core. (Although if they're both compute intensive, they might interfere destructively and that cost could be much higher - I can't know, as I don't know what the threads are doing).
One thing you could do is use a hyperthread on the writing thread core; if you know early on you're going to write, get the hyperthread to read the shared memory address. This will load the TLB and cache while the writing thread is still busy computing, giving you parallelism.
The Win32 function SetThreadAffinityMask() is what you are looking for.

Resources