Slow thread creation on Windows

I have upgraded a number crunching application to a multi-threaded program using the C++11 facilities. It works well on Mac OS X but does not benefit from multithreading on Windows (Visual Studio 2013). I am using the following toy program:
#include <chrono>
#include <iostream>
#include <thread>

void t1(int& k) {
    k += 1;
}

void t2(int& k) {
    k += 1;
}

int main(int argc, const char* argv[])
{
    int a{ 0 };
    int b{ 0 };
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 10000; ++i) {
        std::thread thread1{ t1, std::ref(a) };
        std::thread thread2{ t2, std::ref(b) };
        thread1.join();
        thread2.join();
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time_stack = std::chrono::duration_cast<std::chrono::microseconds>(
        end_time - start_time).count();
    std::cout << "Time: " << time_stack / 10000.0 << " microseconds" << std::endl;
    std::cout << a << " " << b << std::endl;
    return 0;
}
I have discovered that it takes 34 microseconds to start a thread on Mac OS X and 340 microseconds to do the same on Windows. Am I doing something wrong on the Windows side? Is it a compiler issue?

Not a compiler problem (nor an operating system problem, strictly speaking).
It is a well-known fact that creating threads is an expensive operation. This is especially true under Windows (and it used to be true under Linux as well, before clone).
Also, creating and then joining a thread is necessarily slow and does not tell you much about the cost of creating a thread as such. Joining presumes that the thread has exited, which can only happen after it has been scheduled to run, so your measurements include delays introduced by scheduling. Given that, the times you measure are actually pretty good (they could easily be 20 times longer!).
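To make that distinction concrete, here is a rough sketch (assuming a C++11-conforming compiler) that times the std::thread constructor separately from the join, so the scheduling and exit delays show up in their own column:

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    using clock = std::chrono::high_resolution_clock;
    const int kIters = 1000;

    auto create_total = clock::duration::zero();
    auto join_total = clock::duration::zero();

    for (int i = 0; i < kIters; ++i) {
        auto t0 = clock::now();
        std::thread t([] {});   // cost of creating the thread
        auto t1 = clock::now();
        t.join();               // includes scheduling, running, and exiting
        auto t2 = clock::now();
        create_total += t1 - t0;
        join_total += t2 - t1;
    }

    using us = std::chrono::microseconds;
    std::cout << "avg create: "
              << std::chrono::duration_cast<us>(create_total).count() / double(kIters)
              << " us, avg join: "
              << std::chrono::duration_cast<us>(join_total).count() / double(kIters)
              << " us" << std::endl;
    return 0;
}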
However, whether spawning threads is slow does not matter much anyway.
Creating 20,000 threads, as your benchmark does, would be a serious error in a real program. While it is not strictly illegal or disallowed to create thousands (even millions) of threads, the "correct" way of using threads is to create roughly as many threads as there are CPU cores. One does not create very short-lived threads all the time, either.
You might have a few short-lived ones, and you might create a few extra threads (which e.g. block on I/O), but you will not want to create hundreds or thousands of these. Every additional thread (beyond the number of CPU cores) means more context switches, more scheduler work, more cache pressure, and 1 MB of address space plus 64 kB of physical memory gone per thread (due to the default stack reservation and the commit granularity).
Now, if you create, say, 10 threads at program start, it does not matter at all whether this takes 3 milliseconds altogether. The program takes several hundred milliseconds (at least) to start up anyway; nobody will notice a difference.
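As a minimal sketch of the "roughly as many threads as cores" approach (assuming the work can be split into independent chunks; do_chunk and run_number_crunching are made-up names), size a pool of long-lived workers from std::thread::hardware_concurrency() and reuse them instead of spawning short-lived threads:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> g_sum{0};

// Hypothetical work function: each worker processes its own slice of the data.
void do_chunk(unsigned worker_index)
{
    long local = 0;
    for (int i = 0; i < 1000000; ++i)
        local += worker_index + 1;
    g_sum += local;
}

void run_number_crunching()
{
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0)
        n = 4;  // hardware_concurrency() may return 0 when the value is unknown

    std::vector<std::thread> workers;
    workers.reserve(n);
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(do_chunk, i);  // one thread per core, created once
    for (auto& w : workers)
        w.join();
}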

Visual C++ uses the Concurrency Runtime (MS-specific) to implement std::thread features. When you directly call any Concurrency Runtime feature/function, it creates a default runtime object (not going into details here). The same happens when you call a std::thread function: it behaves as if a ConcRT function had been invoked.
Creating that default runtime (or, say, the scheduler) takes some time, and hence thread creation appears to be slow. Try creating a std::thread object first and letting it run, then execute the benchmarking code (the whole of the above code, for example).
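A minimal sketch of that warm-up idea (assuming the one-time cost really is the creation of the default scheduler):

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    // Warm-up: creating and joining one throwaway thread forces the runtime
    // to set up its default scheduler, so later threads do not pay that cost.
    std::thread warmup([] {});
    warmup.join();

    auto start = std::chrono::high_resolution_clock::now();
    std::thread t([] {});
    t.join();
    auto end = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " microseconds for one create+join after warm-up" << std::endl;
    return 0;
}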
EDIT:
Skim over it - http://www.codeproject.com/Articles/80825/Concurrency-Runtime-in-Visual-C
Do step-into debugging to see when the ConcRT library is invoked and what it is doing.

Related

Wanting to understand purpose and count of V8's WorkerThread(s)

I have a very simple program as follows that just creates an isolate, then sleeps:
#include <libplatform/libplatform.h>
#include <v8-platform.h>
#include <v8.h>

#include <stdio.h>
#include <unistd.h>

using v8::Isolate;

int main() {
    std::unique_ptr<v8::Platform> platform = v8::platform::NewDefaultPlatform();
    v8::V8::InitializePlatform(platform.get());
    v8::V8::Initialize();

    v8::Isolate::CreateParams create_params;
    create_params.array_buffer_allocator = v8::ArrayBuffer::Allocator::NewDefaultAllocator();
    Isolate* isolate = v8::Isolate::New(create_params);

    printf("Sleeping...\n");
    usleep(1000 * 1000 * 100);
    printf("Done\n");
    return 0;
}
When I run this program, I can check the number of threads the process has created with ps -T -p <process_id>. On my 8-core machine, V8 creates 7 extra threads, all named "V8 WorkerThread", and on my 16-core machine I get 8 instances of "V8 WorkerThread".
I am looking to understand what determines the number of extra threads V8 spawns and what the purpose of these threads is. Thanks in advance!
The number of worker threads, when not specified by the embedder (that's you!), is chosen based on the number of CPU cores. In the current implementation, the formula is: number_of_worker_threads = (number_of_CPU_cores - 1) up to a maximum of 8, though this may change without notice. You can also specify your own worker thread pool size as an argument to NewDefaultPlatform.
The worker threads are used for various tasks that can be run in the background, mostly for garbage collection and optimized compilation.
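For example, a minimal sketch of specifying the pool size yourself (the first parameter of NewDefaultPlatform is the thread pool size; 2 is just an arbitrary number here):

#include <libplatform/libplatform.h>
#include <v8.h>
#include <memory>

int main() {
    // Ask for exactly 2 worker threads instead of letting V8 derive the
    // count from the number of CPU cores.
    std::unique_ptr<v8::Platform> platform = v8::platform::NewDefaultPlatform(2);
    v8::V8::InitializePlatform(platform.get());
    v8::V8::Initialize();

    // ... create isolates and run scripts as in the question ...

    v8::V8::Dispose();
    return 0;
}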

OpenMP doesn't utilize all CPUs (dual socket, Windows and Microsoft Visual Studio)

I have a dual-socket system with 22 real cores (44 hyperthreads) per CPU. I can get OpenMP to completely utilize the first CPU (22 cores / 44 hyperthreads), but I cannot get it to utilize the second CPU.
I am using CPUID HWMonitor to check my core usage. The second CPU is always at or near 0% on all cores.
Using:
int nProcessors = omp_get_max_threads();
gets me nProcessors = 44, but I think it is only counting the 44 hyperthreads of one CPU instead of all 88 hyperthreads across both CPUs.
After looking around a lot, I'm not sure how to utilize the other CPU.
The CPUs themselves are fine, as I can run other parallel processing programs that utilize all of them.
I'm compiling in 64 bit, but I don't think that matters. I'm using Visual Studio 2017 Professional version 15.2 with OpenMP 2.0 (the only version VS supports), running on Windows 10 Pro, 64 bit, with two Intel Xeon E5-2699 v4 @ 2.2 GHz processors.
So, answering my own question, with thanks to @AlexG for providing some insight. Please see the comments section of the question.
This is a Microsoft Visual Studio and Windows problem.
First, read about Processor Groups for Windows.
Basically, if you have fewer than 64 logical cores, this is not a problem. Once you get past that, however, Windows splits the logical processors into more than one processor group (in my case, one group per socket, each with 44 hyperthreads). By default, every process is assigned to only one processor group, which is why I could initially use only the 44 threads of one socket. However, if you manually create threads and use SetThreadGroupAffinity to move a thread to a processor group different from the one your program was initially assigned, your program becomes a multi-processor-group process. This seems like a roundabout way to enable multiple processors, but yes, this is how to do it. A call to GetProcessGroupAffinity will show that the number of groups becomes greater than 1 once you start setting each thread's processor group.
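As a quick sanity check, here is a small sketch (assuming Windows 7 / Server 2008 R2 or later, where these APIs exist) that asks Windows how many processor groups it created and how many logical processors each one contains:

#include <windows.h>
#include <cstdio>

int main()
{
    WORD groups = GetActiveProcessorGroupCount();
    printf("Processor groups: %u\n", groups);
    for (WORD g = 0; g < groups; ++g)
        printf("  group %u: %lu logical processors\n", g, GetActiveProcessorCount(g));
    printf("Total: %lu logical processors\n",
           GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}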
I was able to create an OpenMP parallel block like so, and go through and assign processor groups:
...
#pragma omp parallel num_threads(88)
{
    HANDLE thread = GetCurrentThread();
    if (omp_get_thread_num() > 32)
    {
        // Reserved has to be zero'd out after each use if reusing the structure...
        GroupAffinity1.Reserved[0] = 0;
        GroupAffinity1.Reserved[1] = 0;
        GroupAffinity1.Reserved[2] = 0;
        GroupAffinity1.Group = 0;
        // Shift a 64-bit 1 so the mask is built as a KAFFINITY rather than a signed int
        GroupAffinity1.Mask = (KAFFINITY)1 << (omp_get_thread_num() % 32);
        if (SetThreadGroupAffinity(thread, &GroupAffinity1, &previousAffinity))
        {
            sprintf(buf, "Thread set to group 0: %d\n", omp_get_thread_num());
            OutputDebugString(buf);
        }
    }
    else
    {
        // Reserved has to be zero'd out after each use if reusing the structure...
        GroupAffinity2.Reserved[0] = 0;
        GroupAffinity2.Reserved[1] = 0;
        GroupAffinity2.Reserved[2] = 0;
        GroupAffinity2.Group = 1;
        GroupAffinity2.Mask = (KAFFINITY)1 << (omp_get_thread_num() % 32);
        if (SetThreadGroupAffinity(thread, &GroupAffinity2, &previousAffinity))
        {
            sprintf(buf, "Thread set to group 1: %d\n", omp_get_thread_num());
            OutputDebugString(buf);
        }
    }
}
So with the above code I was able to force 64 threads to run, 32 per socket. I could not get more than 64 threads even though I tried forcing omp_set_num_threads to 88. The reason seems to be that Visual Studio's implementation of OpenMP does not allow more than 64 OpenMP threads. Here's a link on that for more information.
Thanks all for helping glean some more tidbits that helped in the eventual answer!

Thread IDs in Windows greater than 0xFFFF

We have a big and old software project. In earlier days this software ran on an old OS, so it has an OS wrapper. Today it runs on Windows.
In the OS wrapper we have structs to manage threads. One member of this struct is the thread ID, but it is defined as a uint16_t. The thread IDs are generated with the Win API createThreadEx.
For some months now, at one of our customers, thread IDs have been appearing that are greater than
numeric_limits<uint16_t>::max()
We would run into big trouble if we changed this member to a uint32_t, and even if we made that fix, we would have to test it.
So my question is: how is it possible in Windows to get thread IDs greater than 0xffff? Under which circumstances does this happen?
Windows thread IDs are 32 bit unsigned integers, of type DWORD. There's no requirement for them to be less than 0xffff. Whatever thought process led you to that belief was flawed.
If you want to stress test your system to create a scenario where you have thread IDs that go above 0xffff then you simply need to create a large number of threads. To make this tenable, without running out of virtual address space, create threads with very small stacks. You can create the threads suspended too because you don't need the threads to do anything.
Of course, it might still be a little tricky to force the system to allocate that many threads. I found that my simple test application would not readily generate thread IDs above 0xffff when run as a 32 bit process, but would do so as a 64 bit process. You could certainly create a 64 bit process that would consume the low-numbered thread IDs and then allow your 32 bit process to go to work and so deal with lower numbered thread IDs.
Here's the program that I experimented with:
#include <Windows.h>
#include <iostream>

DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
    return 0;
}

int main()
{
    for (int i = 0; i < 10000; i++)
    {
        DWORD threadID;
        if (CreateThread(NULL, 64, ThreadProc, NULL, CREATE_SUSPENDED, &threadID) == NULL)
            return 1;
        std::cout << std::hex << threadID << std::endl;
    }
    return 0;
}
Re
” We run in big troubles, if we try to change this member to an uint32_t. And even if we fix it, we had to test the fix.
Your current software’s use of a 16-bit object to store a value that requires 32 bits is a bug. So you have to fix it, and test the fix. There are at least two practical fixes:
Changing the declaration of the id, and all uses of it.
To find all the places where the id is copied, it can really help to introduce a dedicated type that is not implicitly convertible to integer, e.g. a C++11 scoped enumeration type (see the sketch after this list).
Adding a layer of indirection.
Might be possible without changing the data, only changing the threading library implementation.
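As a sketch of the dedicated-type idea from the first fix (ThreadId, MakeThreadId and ThreadInfo are hypothetical names, not from the project):

#include <cstdint>

// A scoped enumeration is not implicitly convertible to or from integers,
// so every place that still copies the ID into a uint16_t stops compiling
// and can be found and fixed.
enum class ThreadId : std::uint32_t {};

inline ThreadId MakeThreadId(std::uint32_t raw) { return static_cast<ThreadId>(raw); }
inline std::uint32_t RawThreadId(ThreadId id) { return static_cast<std::uint32_t>(id); }

struct ThreadInfo {
    ThreadId id;   // was: uint16_t id;
};

int main() {
    ThreadInfo info{ MakeThreadId(0x12345u) };
    // std::uint16_t old = info.id;   // no longer compiles: narrowing copy found
    return RawThreadId(info.id) > 0xFFFF ? 0 : 1;
}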
A deeper fix might be to replace the current threading with C++11 standard library threading.
Anyway you're up for a bit of work, and/or some cost.

How to make pthread_cond_timedwait() robust against system clock manipulations?

Consider the following source code, which is fully POSIX compliant:
#include <stdio.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>       /* gmtime, strftime */
#include <unistd.h>     /* sleep */
#include <sys/time.h>

int main (int argc, char ** argv) {
    pthread_cond_t c;
    pthread_mutex_t m;
    char printTime[UCHAR_MAX];

    pthread_mutex_init(&m, NULL);
    pthread_cond_init(&c, NULL);

    for (;;) {
        struct tm * tm;
        struct timeval tv;
        struct timespec ts;

        gettimeofday(&tv, NULL);
        printf("sleep (%ld)\n", (long)tv.tv_sec);
        sleep(3);

        tm = gmtime(&tv.tv_sec);
        strftime(printTime, UCHAR_MAX, "%Y-%m-%d %H:%M:%S", tm);
        printf("%s (%ld)\n", printTime, (long)tv.tv_sec);

        ts.tv_sec = tv.tv_sec + 5;
        ts.tv_nsec = tv.tv_usec * 1000;

        pthread_mutex_lock(&m);
        pthread_cond_timedwait(&c, &m, &ts);
        pthread_mutex_unlock(&m);
    }
    return 0;
}
It prints the current system date every 5 seconds; however, it sleeps for 3 seconds between getting the current system time (gettimeofday) and the condition wait (pthread_cond_timedwait).
Right after it prints "sleep (...)", try setting the system clock two days into the past. What happens? Well, instead of waiting 2 more seconds on the condition variable as it usually does, pthread_cond_timedwait now waits for two days and 2 seconds.
How do I fix that?
How can I write POSIX-compliant code that does not break when the user manipulates the system clock?
Please keep in mind that the system clock might change even without user interaction (e.g. an NTP client might update the clock automatically once a day). Setting the clock into the future is not a problem: it only causes the sleep to wake up early, which is usually harmless and easy to detect and handle. But setting the clock into the past (e.g. because it was running ahead and NTP corrected it) can cause a big problem.
PS:
Neither pthread_condattr_setclock() nor CLOCK_MONOTONIC exists on my system. They are mandatory in the POSIX 2008 specification (part of "Base"), but as of today most systems still only follow the POSIX 2004 specification, in which these two were optional (Advanced Realtime Extension).
Interesting, I've not encountered that behaviour before but, then again, I'm not in the habit of mucking about with my system time that much :-)
Assuming you're doing that for a valid reason, one possible (though kludgy) solution is to have another thread whose sole purpose is to periodically kick the condition variable to wake up any threads so affected.
In other words, something like:
while (1) {
    sleep (10);
    pthread_cond_signal (&condVar);
}
Your code that's waiting for the condition variable to be kicked should be checking its predicate anyway (to take care of spurious wakeups) so this shouldn't have any real detrimental effect on the functionality.
It's a slight performance hit but once every ten seconds shouldn't be too much of a problem. It's only really meant to take care of the situations where (for whatever reason) your timed wait will be waiting a long time.
Another possibility is to re-engineer your application so that you don't need timed waits at all.
In situations where threads need to be woken for some reason, it's invariably by another thread which is perfectly capable of kicking a condition variable to wake one (or broadcasting to wake the lot of them).
This is very similar to the kicking thread I mentioned above but more as an integral part of your architecture than a bolt-on.
You can defend your code against this problem. One easy way is to have one thread whose sole purpose is to watch the system clock. You keep a global linked list of condition variables, and if the clock watcher thread sees a system clock jump, it broadcasts every condition variable on the list. Then, you simply wrap pthread_cond_init and pthread_cond_destroy with code that adds/removes the condition variable to/from the global linked list. Protect the linked list with a mutex.
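A minimal sketch of that clock-watcher idea (the names g_cond_list, cond_register and clock_watcher are made up, and a plain one-second polling interval is assumed):

#include <pthread.h>
#include <sys/time.h>
#include <unistd.h>
#include <vector>

static pthread_mutex_t g_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static std::vector<pthread_cond_t*> g_cond_list;   // every live condition variable

// Called from the wrapped pthread_cond_init.
void cond_register(pthread_cond_t* c) {
    pthread_mutex_lock(&g_list_mutex);
    g_cond_list.push_back(c);
    pthread_mutex_unlock(&g_list_mutex);
}

// Watcher thread: if the wall clock moved backwards (or jumped forward by
// much more than the polling interval), wake every waiter so it can
// re-check its predicate and recompute its absolute timeout.
void* clock_watcher(void*) {
    struct timeval last;
    gettimeofday(&last, NULL);
    for (;;) {
        sleep(1);
        struct timeval now;
        gettimeofday(&now, NULL);
        long delta = (long)(now.tv_sec - last.tv_sec);
        if (delta < 0 || delta > 5) {   // the clock jumped
            pthread_mutex_lock(&g_list_mutex);
            for (size_t i = 0; i < g_cond_list.size(); ++i)
                pthread_cond_broadcast(g_cond_list[i]);
            pthread_mutex_unlock(&g_list_mutex);
        }
        last = now;
    }
    return NULL;
}

The waiting code must still check its predicate and recompute its timeout after being woken, exactly as it should already do to cope with spurious wakeups.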

Critical Sections leaking memory on Vista/Win2008?

It seems that making heavy use of critical sections on Vista/Windows Server 2008 leads to the OS not fully reclaiming the memory.
We found this problem with a Delphi application and it is clearly because of using the CS API. (see this SO question)
Has anyone else seen it with applications developed with other languages (C++, ...)?
The sample code just initialized 10,000,000 critical sections, then deleted them. This works fine on XP/Win2003, but on Vista/Win2008 the peak memory is not released until the application has ended.
The more you use critical sections, the more memory your application retains for nothing.
Microsoft has indeed changed the way InitializeCriticalSection works on Vista, Windows Server 2008, and probably also Windows 7.
They added a "feature" that retains some memory used for debug information when you allocate a bunch of critical sections. The more you allocate, the more memory is retained. It might be asymptotic and eventually flatten out (I'm not fully sold on that one).
To avoid this "feature", you have to use the new API InitializeCriticalSectionEx and pass the flag CRITICAL_SECTION_NO_DEBUG_INFO.
The advantage of this is that it might be faster since, very often, only the spin count will be used without having to actually wait.
The disadvantages are that your old applications can become incompatible, you need to change your code, and it is now platform dependent (you have to check the OS version to determine which call to use). You also lose the ability to debug the critical sections if you need to.
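For reference, a minimal sketch of that call (the wrapper name and the spin count of 4000 are arbitrary; the API requires Vista or later):

#include <windows.h>

CRITICAL_SECTION cs;

bool InitWithoutDebugInfo()
{
    // Skips the allocation of the debug information block that is otherwise
    // retained when many critical sections are created.
    return InitializeCriticalSectionEx(&cs, 4000, CRITICAL_SECTION_NO_DEBUG_INFO) != 0;
}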
Test kit to freeze a Windows Server 2008:
- build this C++ example as CSTest.exe
#include "stdafx.h"
#include "windows.h"
#include <iostream>
using namespace std;
void TestCriticalSections()
{
const unsigned int CS_MAX = 5000000;
CRITICAL_SECTION* csArray = new CRITICAL_SECTION[CS_MAX];
for (unsigned int i = 0; i < CS_MAX; ++i)
InitializeCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
EnterCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
LeaveCriticalSection(&csArray[i]);
for (unsigned int i = 0; i < CS_MAX; ++i)
DeleteCriticalSection(&csArray[i]);
delete [] csArray;
}
int _tmain(int argc, _TCHAR* argv[])
{
TestCriticalSections();
cout << "just hanging around...";
cin.get();
return 0;
}
-...Run this batch file (needs the sleep.exe from server SDK)
@rem you may adapt the sleep delay depending on speed and # of CPUs
@rem sleep 2 on a duo-core 4GB. sleep 1 on a 4-CPU 8GB.
@for /L %%i in (1,1,300) do @echo %%i & @start /min CSTest.exe & @sleep 1
@echo still alive?
@pause
@taskkill /im cstest.* /f
-...and see a Win2008 server with 8 GB and a quad-core CPU freeze before the 300 launched instances are reached.
-...repeat on a Windows 2003 server and see it handle it like a charm.
Your test is most probably not representative of the problem. Critical sections are considered "lightweight mutexes" because a real kernel mutex is not created when you initialize the critical section. This means your 10M critical sections are just structs with a few simple members. However, when two threads access a CS at the same time, in order to synchronize them a mutex is indeed created - and that's a different story.
I assume that in your real app threads do collide, as opposed to your test app. Now, if you're really treating critical sections as lightweight mutexes and create a lot of them, your app might be allocating a large number of real kernel mutexes, which are much heavier than the lightweight critical section object. And since mutexes are kernel objects, creating an excessive number of them can really hurt the OS.
If this is indeed the case, you should reduce the use of critical sections where you expect a lot of collisions. This has nothing to do with the Windows version, so my guess might be wrong, but it's still something to consider. Try monitoring the OS handle count and see how your app is doing.
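A small sketch of that monitoring (GetProcessHandleCount has been available since Windows XP SP1; the helper name is made up):

#include <windows.h>
#include <iostream>

// Print the current process's handle count, to watch whether contended
// critical sections are silently creating kernel objects over time.
void PrintHandleCount()
{
    DWORD handles = 0;
    if (GetProcessHandleCount(GetCurrentProcess(), &handles))
        std::cout << "Handle count: " << handles << std::endl;
}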
You're seeing something else.
I just built & ran this test code. Every memory usage stat is constant - private bytes, working set, commit, and so on.
int _tmain(int argc, _TCHAR* argv[])
{
    while (true)
    {
        CRITICAL_SECTION* cs = new CRITICAL_SECTION[1000000];
        for (int i = 0; i < 1000000; i++) InitializeCriticalSection(&cs[i]);
        for (int i = 0; i < 1000000; i++) DeleteCriticalSection(&cs[i]);
        delete [] cs;
    }
    return 0;
}
