What is the maximum number of threads that can be generated in Windows 8.1, and what factors can limit the number of threads?
Like most limits in Windows, this one is set by available memory. A 32-bit process keels over somewhat shy of 2000 threads, when all available virtual memory is occupied by the thread stacks (1 MB each). A 64-bit process is limited by the size of the paging file, which is needed to commit the allocations. That allows many thousands of threads; the exact number depends on how fast the paging file can grow to meet the needs of the program. There is also a limit imposed by the kernel's paged memory pool: each thread has a kernel stack so that it can make kernel calls, typically 24 KB per thread.
These limits are far beyond the number of balls a programmer can keep in the air without dropping one on his foot. He'll be limping around for a long time, because threading bugs are exceedingly hard to troubleshoot.
Mark Russinovich explores the limits in this excellent blog post.
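The 32-bit figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the default 2 GB of user address space for a 32-bit Windows process and the 1 MB stack size mentioned above; the function name is made up for illustration. The result is an upper bound, since code, heaps and DLLs also occupy address space, which is why the real limit lands "somewhat shy of 2000".

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Upper bound on thread count if thread stacks were the only thing
// occupying the address space (hypothetical helper, illustration only).
constexpr std::uint64_t max_threads_by_stack(std::uint64_t address_space_bytes,
                                             std::uint64_t stack_bytes) {
    return address_space_bytes / stack_bytes;
}

// Assumption: 2 GB user address space, 1 MB reserved per thread stack.
constexpr std::uint64_t kUpperBound32Bit = max_threads_by_stack(2048 * kMiB, 1 * kMiB);
```

With these assumptions `kUpperBound32Bit` is 2048; a smaller stack reservation raises the bound proportionally.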
I believe you may be thinking about this somewhat wrong. Threads are limited by your CPU and by how many fit in RAM, not by the Windows OS. Also, the way you approach concurrent programming depends largely on how the programming language you're using addresses the issue. For instance, C++ using the standard library (std::thread) and C++ using MPI (process-based) are very different.
Each CPU has a limit on the number of hardware threads (physical plus virtual) it can run at once. Creating more threads than that oversubscribes the CPU and forces task switching. For instance, my PC has eight hardware threads (4 physical cores plus 4 virtual). If I create 10 threads I will get them, BUT at any moment at least 2 of them will not be running. This forces the machine to do extra task switching to service all 10 threads. Also, keep in mind that your program isn't the only one running on the PC.
To find the maximum number of threads that can run simultaneously, in C++ using the standard library:
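#include <iostream>
#include <thread>

int main()
{
    // Number of hardware threads; may be 0 if it cannot be determined.
    std::cout << std::thread::hardware_concurrency() << '\n';
    std::cin.get(); // keep the console window open
    return 0;
}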
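As a rough sketch of how that number might be used, the code below sizes a set of workers from `std::thread::hardware_concurrency()` and falls back to one worker when the call returns 0 (the standard allows that when the value is unknown). The names `worker_count` and `run_workers` are made up for this example, and the worker body is a placeholder.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Number of workers to start: the hardware hint, or 1 if the hint is unavailable.
unsigned worker_count() {
    const unsigned hint = std::thread::hardware_concurrency();
    return hint == 0 ? 1u : hint;
}

// Starts `count` threads, waits for all of them, and returns how many ran.
unsigned run_workers(unsigned count) {
    std::atomic<unsigned> finished{0};
    std::vector<std::thread> workers;
    workers.reserve(count);
    for (unsigned i = 0; i < count; ++i) {
        // Placeholder body: a real worker would process its share of the work here.
        workers.emplace_back([&finished] { finished.fetch_add(1); });
    }
    for (auto& w : workers) {
        w.join();
    }
    return finished.load();
}
```

Calling `run_workers(worker_count())` starts one thread per hardware thread, which is the point where oversubscription begins if anything else is running.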
That said, I recommend using a programming language such as C#, which abstracts the threading for you with automatic thread pools and tasks. This makes it easier to learn the concepts.
Also, processes and threads are different. For a great explanation of the difference I suggest the following link:
What is the difference between a process and a thread
Question from Microsoft intern:
Possible answer: it depends on the RAM size and the processor.
My computer has a quad-core i7 processor. I'm studying parallelization of scientific simulations. How does hyperthreading affect parallel performance? I know I should never use more than 4 worker processes to get decent performance. But should I disable hyperthreading as well? Does it have an impact on parallel performance?
In my experience, running electromagnetic modelling and inversion codes, the answer is yes, you should disable hyperthreading. But this is not the sort of question which is well answered by other people's anecdotes (not even mine, fascinating and true as they are).
You are the student, so this is definitely a topic worth your time in coming to your own conclusions. There are so many factors involved that my experience running my codes on my platforms is nearly worthless to you.
Under Linux, if you have 4 busy threads on an i7 it will place each one on a different core. Provided the other half of the core is idle, the performance should be the same. If you are running another program, it is debatable as to whether having hyperthreading to run the extra programs or context switching is better. (I suspect less context switching is better)
A common mistake is assuming that if you use 8 threads instead of 4 it will be twice as fast. It might be only slightly faster (in which case it might still be worth it) or slightly slower (in which case limit your program to 4 threads). I have found examples where using double the number of threads was slightly faster. IMHO, it's all a matter of testing to find the optimal number and then using that many.
The only time I can see that you would need to turn HT off is when you have no control over how your application behaves and using 4 threads is faster.
You state:
I know I should never use more than 4 worker processes to get decent performance.
This isn't necessarily true! Here is an example of what I have found running on an i7-3820 with HT enabled. All of my code that I was running was C++. Consider that I have 8 separate programs (albeit identical) that I need to run. I have tried the two following ways of running these codes:
1. Run only 4 separate threads at a time, simultaneously. When these 4 complete, run the next 4 threads (4 x 2 = 8 total).
2. Run all 8 as separate threads simultaneously (8 x 1 = 8 total).
As you can see these two scenarios achieve the same thing. However, what I have found is that the run times are:
1. 1 hour for each set of 4 threads, for a total of 2 hours to complete all 8.
2. 1.5 hours for the set of 8 threads.
What you find is that a single thread will finish faster for case #1, but that overall #2 gives better performance since ALL of your work is completed in less time. I found typical increases in performance to be ~25% with HT enabled.
As is evident, there are scenarios when running 8 threads is faster than 4.
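The two timings can be turned into numbers in more than one way, which is worth keeping in mind when comparing results. The sketch below (function names are made up for illustration) computes the fraction of wall-clock time saved and the corresponding throughput gain for the 2 hour and 1.5 hour runs quoted above.

```cpp
// Fraction of wall-clock time saved relative to the baseline run.
double time_saved_fraction(double baseline_hours, double new_hours) {
    return (baseline_hours - new_hours) / baseline_hours;
}

// Relative increase in jobs completed per hour for the same amount of work.
double throughput_gain(double baseline_hours, double new_hours) {
    return baseline_hours / new_hours - 1.0;
}
```

For 2 hours versus 1.5 hours, the time saved is 25% while the throughput gain is about 33%, so it helps to state which of the two a reported "performance increase" refers to.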
Hyper-Threading is Intel's implementation of simultaneous multithreading (SMT). In general, SMT is almost always beneficial (this is why it is usually enabled), unless your application is CPU-bound. If you know for sure that your application is CPU-bound, then disable SMT. Otherwise (your application is I/O-bound or is not able to completely saturate the cores), leave it enabled.
I've noticed something odd when running resource-intensive programs, such as games, under Windows. If you run the game in windowed mode and look at the memory usage, you can see that it is on the order of hundreds of megabytes for 2D games. But if you minimize that game, I've seen the memory usage go as low as a few megabytes, even less than ten.
What exactly is happening? Who's doing this, the games or the OS? Surely, the resources can't actually be unloaded from memory (that would be awful), so what's with the drop?
Windows trims the working set of a process when its main window is minimized. The working set isn't necessarily the best indicator of how much system resources a process is using.
I wrote a C program which reads a dataset from a file and then applies a data mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program as a multithreaded one with PThreads. I am new to parallel programming, and I have a question about the number of worker threads that has been puzzling me:
What is the best practice for choosing the number of worker threads in parallel programming, and how do you determine it? Do you try different numbers of threads, look at the results and then decide, or is there a procedure for finding the optimal number of threads? Of course, I'm looking at this question from the performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!
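One way to turn the "ready-to-run threads equal to cores" rule into a starting estimate is sketched below. It assumes you can guess the fraction of time a typical thread spends blocked; the formula `cores / (1 - blocked_fraction)` and the name `pool_size` are assumptions made for this example, not something stated in the answer, and the result is only a first guess to benchmark around.

```cpp
#include <cmath>

// Estimate a pool size so that, on average, about `cores` threads are runnable.
// blocked_fraction is the assumed share of time a thread spends waiting (0..1).
unsigned pool_size(unsigned cores, double blocked_fraction) {
    // Clamp the guess so the division stays well defined.
    if (blocked_fraction < 0.0) blocked_fraction = 0.0;
    if (blocked_fraction > 0.95) blocked_fraction = 0.95;
    return static_cast<unsigned>(std::ceil(cores / (1.0 - blocked_fraction)));
}
```

For 4 cores this gives 4 threads when nothing blocks, 8 when threads wait half the time, and 16 when they wait three quarters of the time.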
Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?
If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.
Not long ago, I was doing performance testing on a dual quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.
I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but on different ranges (no two threads access the same index), so the results might differ if they worked on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.
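The setup described in this answer, threads working on disjoint ranges of one shared array, might look roughly like the sketch below. The actual test code is not shown in the answer, so the summing workload and the name `parallel_sum` are stand-ins chosen for illustration.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` using `num_threads` threads, each reading its own index range.
long long parallel_sum(const std::vector<long long>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();
    for (unsigned t = 0; t < num_threads; ++t) {
        // Ranges [begin, end) are contiguous and never overlap.
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // accumulate locally, write the shared slot once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Because no index is shared, no locking is needed; each thread accumulates into a local variable and stores its result once, which keeps writes to neighbouring slots of `partial` to a minimum.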
I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads per core is defined by Hyperthreading, which is just 2 (when available). But Hyperthreading cuts your execution time by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes; this is good when you have more processes, not so good otherwise. More cores are definitely better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's not really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads per core. So if you have 8 cores, each of which supports 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads is good. If you have very heavy memory intensive work (read source 1, read source 2, fast computation, write) then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256 bits values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy if not actually faster. This will depend on your process & memory architecture, some advanced servers manage separate memory ranges for separate cores so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)
The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by measuring the processing times Tn and Tm for two arbitrary numbers of threads, n and m. For linear algorithms, the optimal number of threads will be N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) ).
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
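The formula for linear algorithms can be evaluated directly. The sketch below is a plain transcription of it; the function name is made up, and the check for an invalid ratio is an addition for safety. As an assumption on my part (the answer does not spell it out), the formula is consistent with a cost model T(k) = a/k + b*(k-1), whose minimum lies at k = sqrt(a/b).

```cpp
#include <cmath>
#include <stdexcept>

// N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) )
// n, m: two different thread counts; Tn, Tm: the run times measured with them.
double optimal_threads(double n, double Tn, double m, double Tm) {
    const double numerator = m * n * (Tm * (n - 1.0) - Tn * (m - 1.0));
    const double denominator = n * Tn - m * Tm;
    if (denominator == 0.0 || numerator / denominator <= 0.0) {
        // The two measurements do not fit the assumed model.
        throw std::domain_error("measurements do not yield a positive optimum");
    }
    return std::sqrt(numerator / denominator);
}
```

As a made-up example, measuring 51 s with 2 threads and 28 s with 4 threads gives an estimate of 10 threads, which matches a = 100, b = 1 in the assumed model.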
The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.
I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling then @Gonzalo's answer suffices. However, if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size, so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive, and it is preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's the L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example, assume there are 8 arbitrary units (AU) of work in the program, which will be executed on a 4-core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored overheads mentioned in other answers (e.g., context switches), but I hope you get the point that it might be beneficial to have more threads than available cores, depending on the data size you're dealing with.
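The arithmetic behind the two cases can be written out as a one-line model. This is only the simplified estimate used in the example (per-thread time multiplied by threads, divided by cores), with a made-up function name; it ignores scheduling and context-switch overhead just as the example does.

```cpp
// Simplified wall-clock estimate: threads are spread evenly over the cores.
double total_time(double seconds_per_thread, unsigned threads, unsigned cores) {
    return seconds_per_thread * threads / cores;
}
```

With the numbers above, `total_time(10.0, 4, 4)` is 10 seconds for case 1 and `total_time(2.0, 8, 4)` is 4 seconds for case 2.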
4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.
Benchmark.
I'd start ramping up the number of threads for an application, starting at 1 and going up to something like 100, run three to five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should find that the four-thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, i.e., the dataset you're loading into memory is huge, you're getting lots of cache misses, etc., such that 2 threads are optimal.
You can't know until you test.
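A minimal harness for that kind of measurement might look like the sketch below. The workload (summing squares over a range split between threads) is a placeholder for your own code, the function names are made up, and keeping the fastest of the trials rather than the average is a choice made here to reduce noise from other processes.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Runs the placeholder workload on `num_threads` threads; returns elapsed seconds.
double run_once(unsigned num_threads, std::uint64_t total) {
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partial, t, num_threads, total] {
            const std::uint64_t begin = total * t / num_threads;
            const std::uint64_t end = total * (t + 1) / num_threads;
            std::uint64_t sum = 0;
            for (std::uint64_t i = begin; i < end; ++i) sum += i * i;
            partial[t] = sum;  // store the result so the work is not discarded
        });
    }
    for (auto& w : workers) w.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// For each thread count 1..max_threads, keeps the fastest of `trials` runs.
std::vector<double> benchmark(unsigned max_threads, unsigned trials, std::uint64_t total) {
    std::vector<double> best;
    for (unsigned n = 1; n <= max_threads; ++n) {
        double fastest = run_once(n, total);
        for (unsigned k = 1; k < trials; ++k) {
            fastest = std::min(fastest, run_once(n, total));
        }
        best.push_back(fastest);
    }
    return best;
}
```

The returned vector holds one timing per thread count (index 0 is one thread), ready to be plotted as run time against number of threads.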
You can see how many processes and threads are currently running on your machine with htop or with the ps command, which lists the processes on your machine.
See the man page for the 'ps' command:
man ps
If you want to count the processes of all users, you can use one of these commands:
ps -aux| wc -l
ps -eLf | wc -l
To count the processes of a single user:
ps --User root | wc -l
Also, you can use "htop" [Reference]:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.
The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.
One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.
Speaking from a computation-bound and memory-bound point of view (scientific computing), 4000 threads will make the application run really slowly. Part of the problem is the very high overhead of context switching and, most likely, very poor memory locality.
But it also depends on your architecture. From what I've heard, Niagara processors are supposed to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However, I have no experience with those processors.
Hope this makes sense: check the CPU and memory utilization and set a threshold value. If the threshold is crossed, don't allow a new thread to be created; otherwise allow it.
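The threshold idea can be sketched as a simple admission check. How the utilization figures are obtained is platform-specific and not covered here; the `Utilization` struct, the function name and the example thresholds are all hypothetical, and the caller is assumed to supply the current readings as fractions between 0 and 1.

```cpp
// Hypothetical snapshot of system load, supplied by the caller (0.0 to 1.0).
struct Utilization {
    double cpu;
    double memory;
};

// Allows a new thread only while both readings are below their thresholds.
bool may_spawn_thread(const Utilization& current,
                      double cpu_threshold,
                      double memory_threshold) {
    return current.cpu < cpu_threshold && current.memory < memory_threshold;
}
```

For example, with both thresholds set to 0.8, a reading of 50% CPU and 40% memory permits a new thread, while 90% CPU does not.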