Why is parallel compilation performance with HT worse than without?

I've made several measurements of the compilation time of wine with HyperThreading enabled and disabled in the BIOS on my Core i7 930 @2.8GHz (quad-core) on Linux 2.6.39 x86_64. Each measurement went like this:
git clean -xdf
./configure --prefix=/usr
time make -j$N
where N is a number from 1 to 8.
Here are the results ("speed" is 60/real from time(1)):
Here the blue line corresponds to HT disabled and purple one to HT enabled. It appears that when HT is enabled, using 1-4 threads is slower than without HT. I guess this might be related to the kernel not distributing the processes to different cores and reusing second threads of already busy cores.
So, my question: how can I force the kernel to give 1 process per core scheduling higher priority than adding more processes to the same core's different thread? Or, if my reasoning is wrong, how can I have performance with HT not worse than without HT for 1-4 processes running in parallel?

Hyper-threading on Intel chips is implemented by duplicating some elements of a physical core, but without enough hardware to make an independent core (e.g. the two threads may share an instruction decoder, but I can't recall the specifics of Intel's implementation).
Imagine a physical core with HT as 1.5 physical cores that your OS sees as 2 real cores. This doesn't equate to 1.5x speed, though (it varies with the use case).
In your example, non-HT is faster up to 4 threads because none of the cores are sharing work with an HT sibling. You see a flatline above 4 threads because you only have 4 execution threads, and you pay a little extra overhead context switching between threads.
In the HT example you are a bit slower up to 4 threads, probably because some of those threads are being assigned to a real core and its HT sibling, so you lose performance as those two execution threads share physical resources. Above 4 threads you see the benefit of the extra execution threads, but also the beginning of diminishing returns.
You could probably match performance in both cases for up to 4 threads, but likely not with a compilation job: I think too many processes are being spawned for processor affinity to be set up. If you instead ran a real parallel job using OpenMP or MPI with X<=4 threads bound to specific real CPU cores, I think you'd see similar performance between HT-off and HT-on.
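For anyone who wants to try the "bind to real cores" idea, here is a minimal sketch, assuming Linux and its sysfs topology files. The function names are mine (hypothetical helpers, not from any library); only `os.sched_setaffinity` and the sysfs path are real. It picks one logical CPU from each pair of HT siblings and restricts the current process to that set.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0,4" or "0-3,8" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus


def one_cpu_per_core(siblings_by_cpu):
    """Pick the lowest-numbered logical CPU from each group of HT siblings.

    siblings_by_cpu maps a logical CPU number to its sibling list string.
    """
    chosen = {min(parse_cpu_list(s)) for s in siblings_by_cpu.values()}
    return sorted(chosen)


def read_sibling_lists():
    """Read HT sibling lists from Linux sysfs (this path layout is Linux-specific)."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            result[cpu] = f.read()
    return result


def pin_to_physical_cores():
    """Restrict this process to one logical CPU per physical core (Linux only)."""
    cpus = one_cpu_per_core(read_sibling_lists())
    if cpus:
        os.sched_setaffinity(0, cpus)  # 0 means "the calling process"
    return cpus
```

Child processes inherit the affinity mask, so calling `pin_to_physical_cores()` and then launching the build from the same script should keep every compiler process off the second hardware threads. Whether that actually recovers the HT-off numbers is something you would have to measure.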

Given a number of threads <= the number of real cores, using HT should be slower because (considered crudely) you are potentially cutting the speed of your cores in half.[1]
Keep in mind that generally more cores is NOT better than FASTER cores. In fact, the only reason so much work was put into developing multi-core systems is that it became increasingly difficult to make faster and faster ones. So if you cannot have a 20 GHz processor, then 8 x 3 GHz ones will have to do.
HT is, I believe, primarily intended as an advantage in contexts where each thread is not necessarily gobbling as much processor as it can; it's doing some particular task that's governed by interaction with a user, such as CAD stuff, video games, etc.; these are the kind of applications that benefit from multi-tasking. By contrast, server platforms -- wherein the primary applications tend to thread independent tasks that are not governed by a dependence on anything else, hence are optimally run as fast as possible -- do not benefit directly from multi-tasking; they benefit from speed. make is in the same category, although with a perhaps greater degree of interdependence between threads, which is why you see an advantage for HT from 4-8 threads.
[1] This is a simplification. HT doesn't simply double the number of cores and halve their speed, but whatever dynamic is used, the total number of processor cycles per second for the system is not improved. It's the same -- only more fragmented.

Related

Optimal number of parallel processes for computation with a CPU with 6 cores and 12 threads

On a computer with an Intel CPU marketed as "6 cores / 12 threads", I want to run as many processes as possible, each of them doing similar math computations (each process has a single thread) with different input data. There is no GPU involved, and no inter-process communication is needed.
What is the optimal number of parallel processes of the same executable doing math computations?
Should I run 6 processes (one per physical core)? Or 12 processes (one per thread / virtual core)?
If one process does, say, 1000 computations per second, I'm pretty sure that running 6 of them will run at ~1000/sec each (so a total of ~6000/sec).
But won't running 12 processes make them only 500 computations per second each?
TL;DR: should I run one process per "core" or one process per "thread" on a "6 cores/12 threads Intel CPU"?
It depends heavily on the actual computing code. Some applications can benefit from hyper-threading while some do not. High-performance applications rarely benefit from hyper-threading, so using 1 process per core is most likely the best configuration, assuming the code is compute bound and scales well.
Multiple hyper-threads of recent Intel processors (e.g. Skylake/Icelake) can share some execution ports. As a result, the overall execution can be faster if one process is not able to saturate the ports. In practice this is a bit more complex (modern processors are very complex), since compute-bound processes can be bound by other parts of the processor, like instruction decoding or trickier low-level units.
For example, the following C code should benefit from hyper-threading (assuming no fast-math optimizations are applied and the code is compiled with optimizations):
float sum = 0.f;
for (int i = 0; i < maxi; ++i)
    sum += array[i];
Indeed, the latency of a floating-point addition instruction is 3 to 4 cycles, while generally 2 of them can be executed per cycle (only 1 before Skylake). This means the code is bound by the latency of the chain of dependent additions. Hyper-threads can use the idle execution ports during this time, resulting in up to twice faster execution (other bottlenecks keep the execution from being that fast in practice). If the code is compiled with fast-math optimizations, then compilers can unroll the loop and make use of instruction-level parallelism (ILP), which raises the number of instructions executed per cycle (IPC). A low IPC often means that using hyper-threads may be beneficial, especially if the low IPC is caused by latency issues (e.g. instruction latency and cache misses). Unfortunately, this is not always true. For example, the following code should not be faster with hyper-threading:
for (int i = 0; i < maxi; ++i)
    out_array[i] += in_array[i];
This is because there is generally 1 store execution port on Intel processors, and it should already be saturated by 1 hyper-thread (otherwise the code should be memory-throughput bound, which is no better for hyper-threading). Thus using more hyper-threads should not improve the execution time. In fact, hyper-threading introduces a slight overhead that should cause a slightly slower execution.
The thing is, applications are generally much more complex than that, and one does not know how math functions are implemented. As a result, it is nearly impossible for a developer to know the best configuration without a basic benchmark, unless the computing kernel is simple.
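A basic benchmark of the kind mentioned above could look like the sketch below. It is only an illustration: `run_parallel` and `jobs_per_second` are names I made up, and the command you pass in stands for your own executable.

```python
import subprocess
import sys
import time


def run_parallel(n_procs, cmd):
    """Start n_procs copies of cmd at once; return wall-clock seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(n_procs)]
    for p in procs:
        p.wait()
    return time.perf_counter() - start


def jobs_per_second(n_procs, elapsed):
    """Aggregate throughput: completed jobs divided by wall-clock time."""
    return n_procs / elapsed


def compare(counts, cmd):
    """Return {process_count: jobs per second} for each count in counts."""
    return {n: jobs_per_second(n, run_parallel(n, cmd)) for n in counts}
```

Something like `compare((6, 12), ["./my_math_program", "input.dat"])` (a hypothetical command line) would then show directly whether 12 processes give more total throughput than 6 on your machine. Run it a few times, since a single measurement is noisy.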

Why the time needed for executing code on "Hackerrank" site varies significantly?

When you try some coding challenges on www.hackerrank.com and also track the time (e.g. in C++ using the "chrono" library), you will see that every run of exactly the same code takes a different amount of time. I estimate the variance at 10 to 30 percent. Why do execution times vary so significantly for identical code? Which factors contribute to it?
It could be the server system; but are there even physical reasons (stochastic processes in electronic components)?
Almost certainly just a busy system where your process competes for CPU time against other processes. (And competes for memory bandwidth even when it has a CPU to itself).
Also, it's likely that the server runs on a CPU with SMT (simultaneous multithreading) that uses each physical core as two logical cores, so the "shared execution resources" that processes compete for include stuff inside a single CPU core, not just L3 cache and memory bandwidth.
Intel calls their SMT tech "hyperthreading"; most servers these days run on Intel Xeon CPUs. AMD Zen also uses SMT, so either way, unless the server admins disabled it, when the OS schedules tasks onto both logical cores of a single physical core, they will slow each other down some. The amount of slowdown mostly depends on their average IPC (instructions per cycle): two threads that have a lot of branch mispredicts will typically not see much slowdown (so throughput nearly doubles), but two threads that could both nearly saturate the FP multiplier ALUs on their own will each run at nearly half speed (near zero gain in throughput).
OSes "know about" SMT / Hyperthreading and can detect which logical cores share a physical core. They try to avoid scheduling threads to the same physical core while there are some physical cores with both threads idle.
See also "Modern Microprocessors: A 90-Minute Guide!", which covers SMT, with the background to understand what execution resources are being shared.
"but are there even physical reasons (stochastic processes in electronic components)?"
No, CPUs are deterministic (except for the rdrand instruction which uses true analog electrical noise as a randomness source).
CPUs do use dynamic voltage and frequency scaling to save power (idle vs. max turbo, or somewhere in between if thermal limits don't allow max turbo.) https://en.wikipedia.org/wiki/Intel_Turbo_Boost / https://en.wikipedia.org/wiki/Dynamic_frequency_scaling
See also Idiomatic way of performance evaluation? for more microbenchmarking pitfalls / warm-up effects.
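To see the run-to-run spread for yourself, a small timing harness helps. The sketch below is in Python rather than C++ `chrono`, and `time_runs` is a hypothetical helper of mine, not something Hackerrank provides. It discards a couple of warm-up runs and reports min, median and max instead of a single number.

```python
import statistics
import time


def time_runs(func, repeats=10, warmup=2):
    """Call func repeatedly and return min/median/max wall-clock seconds."""
    for _ in range(warmup):
        func()  # warm caches and let the CPU clock ramp up; results discarded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

On a shared server the minimum is usually the best estimate of what the code costs on its own, while the gap between min and max shows how much the competing load and frequency scaling described above are adding.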

Hyper-threading and gaming (and other computing applications)?

I was wondering what the real-world performance effects are of hyperthreading (multiple logical cores for each physical core) in different situations. Intel advertises this as being effective for when threads of execution are waiting for I/O, however in memory intensive applications, it can be ineffective because when a switch occurs between logical cores, locality is lost in the processor cache. The second application's data is loaded into cache, forcing the first application's memory out of cache. Upon returning to the first application, its references are all cache misses and performance is lost. I know several super computer managers and they claim that they turn off hyperthreading because doing so is more efficient in their cases. Are there "normal" user cases where disabling hyperthreading is more efficient? Gaming can be pretty memory intensive--would it be better without hyperthreading?
First, it should be recognized that hyperthreading is an Intel marketing term labelling Switch-on-Event MultiThreading (on Itanium) and Simultaneous MultiThreading (on x86). SoEMT is primarily beneficial in hiding high latency events such as last level cache misses, is easier to implement, and is friendlier to VLIW-like scheduling. SoEMT is also a better fit for a small L1 (given a somewhat fast L2) than SMT since cache contention is moved more to L2 or L3 (thousands of accesses between thread switches) which can better handle contention given their greater capacity and higher associativity. SMT can be useful in hiding smaller latencies like branch resolution delay or L2 cache hits and provides instruction level parallelism, but introduces more intense contention for resources.
(There is also a difference between disabling hyperthreading and not using hyperthreading. Disabling hyperthreading might provide a small performance benefit in that some shareable resources will be used even by an inactive but enabled thread and some partitioned resources may still use a small amount of power, but the primary benefit would be in preventing the OS from making disruptive scheduling decisions.)
For "normal" code, the available thread-level parallelism may well be lower than the number of cores available. In that case, a modern OS typically will not use the hardware multithreading since it recognizes that a full core has more performance than a core shared by more than one thread. (Sharing a core can theoretically improve performance in special cases where using L1 to communicate between threads is unusually helpful. In addition, waking an inactive thread on an active core is much faster and requires less energy than waking up a core, so using multithreading might be helpful for energy efficiency in some special cases.)
HPC codes tend to be the worst case for SMT. HPC code is more likely to be friendly to static scheduling. This means that the latency hiding benefits of SMT tend to be minimized. (Similarly, HPC code tends to benefit less from out-of-order execution.) HPC code also tends to be constrained by memory bandwidth rather than memory latency. SMT can increase the bandwidth demand per unit of execution (by increasing cache misses) and reduce the actual achieved memory bandwidth by contention at the memory controller. (DRAM is not friendly to random access; such causes excessive refresh and row active cycles.) SMT may also cause the number of data streams that are active to exceed the hardware's support for prefetching. HPC code is also more likely to be blocked according to cache sizes assuming one thread per core; in such cases SMT will produce significant cache thrashing.
Disabling hyperthreading may also be friendlier to gang-scheduled operation, which is common in HPC. If only some of the cores are using multithreading, those cores might have higher performance per core yet would have lower performance per thread; that forces other cores to idly wait for the slowed threads to complete. (HPC systems may have dedicated OS cores and spare cores to avoid similar problems, where OS activity would slow down one core/thread and force hundreds of others to wait or where a failed core could cause, e.g., a 16-thread gang scheduled program to run 15 threads and then one thread, doubling execution time.)
(In theory, SMT could be used in HPC to reduce register pressure in some optimized loops since the effective latency of operations like FMADD in a dual threaded core may be viewed as roughly being halved. Since compilers generally use a fixed latency for scheduling [SMT is treated as a transparent feature], exploiting this feature is not generally practical even when it could be beneficial.)
Rather like out-of-order execution, SMT is most beneficial for irregular code. (OoO looks ahead in a single code stream for instruction level and memory level parallelism; SMT looks "sideways" across threads for such parallelism.) If branch mispredictions and cache misses are common, SMT can use existing thread-level parallelism to hide such latencies (the cost of a branch misprediction is largely in the latency of resolution).
The benefit from SMT varies by workload and by the specific hardware. A deeply pipelined in-order microarchitecture like the initial Intel Atom benefits more from SMT than a shallower pipelined OoO microarchitecture would (latencies, especially branch resolution latency, being generally higher with longer pipelines and OoO providing some parallelism that would otherwise be used by SMT's thread-level parallelism).
Enabled hyperthreading may also have the disadvantage of increasing the number of threads used by an application where performance scaling with increased thread count is sufficiently sublinear that the lower performance per thread with hyperthreading would result in a net loss of performance. E.g., if two-thread-per-core hyperthreading provided a 30% increase in per core performance and doubling thread count increased performance by 50%, then total performance would decrease by 2.5%.
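The arithmetic behind that 2.5% figure, spelled out as a tiny sketch (the function name is mine; the two inputs are the numbers from the example):

```python
def relative_performance(per_core_gain, scaling_when_doubled):
    """Total performance with HT on, relative to HT off (1.0 means no change)."""
    # With 2-way HT each core runs two threads, so each thread gets
    # (1 + per_core_gain) / 2 of the throughput of a full, unshared core.
    per_thread_speed = (1.0 + per_core_gain) / 2.0
    # The application doubles its thread count, which by itself would scale
    # performance by scaling_when_doubled, but every thread is now slower.
    return per_thread_speed * scaling_when_doubled
```

With `relative_performance(0.30, 1.50)` each thread runs at 0.65 of full speed and the doubled thread count contributes a factor of 1.5, giving about 0.975, i.e. a 2.5% net loss.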
The standard advice of "when in doubt, measure" obviously applies.
Obviously some people don't understand some things. I have looked into it; here is what I copied from a site:
Depending on when you last bought a computer, you may remember Hyper-Threading as a feature that Intel introduced and then discontinued. This could understandably leave a sour taste in your mouth – why would Intel discontinue it if it wasn’t trouble?
The truth isn’t so grim. Hyper-Threading was for a time made available on certain Intel Pentium 4 and Intel Xeon processors. It was discontinued not because the feature itself was bad, but rather because the processor that used it turned out to be a bit of a misstep for other reasons. The Pentium 4 architecture was a minor disaster for Intel because it was incapable of going the direction Intel hoped (Intel wanted to have Pentium 4 processors with clock speeds of up to 10 GHz). As a result, Intel jumped back to designing processors based on the Pentium Pro family tree.
Hyper-Threading was gone, but not forgotten. Intel eventually found the time and resources to integrate it into another new processor architecture - Nehalem. This is the architecture that is the basis for all current Intel Core i3, i5 and i7 processors.
Source: http://www.makeuseof.com/tag/hyperthreading-technology-explained/

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.
A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number of cores * number of hardware threads per core. The rest have to wait for the OS to schedule them, whether by preempting currently running tasks or by any other means.
In addition to the one or many cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ according to design. You'll need a memory controller to talk to the memory, and I/O controllers (display, PCIe, USB, etc.). In the past these elements were outside the CPU, in the complementary "chipset", but most modern designs have integrated them into the CPU.
In addition, the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power and manufacturing considerations. CPU design is mostly trending toward what's called a system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PCs, servers, and also tablets and smartphones). You can find more elaborate designs, usually in academia, where the computation is not done in basic "core-like" units.
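As a quick illustration of that X, here is a small sketch. The two helper names are hypothetical; `os.cpu_count()` is the real call, and it reports logical CPUs, i.e. cores times hardware threads per core.

```python
import os


def max_concurrent_tasks(cores, hw_threads_per_core):
    """X from the text: how many tasks the CPU can run at the same instant."""
    return cores * hw_threads_per_core


def waiting_tasks(runnable, cores, hw_threads_per_core):
    """How many runnable tasks must wait for the OS to schedule them."""
    return max(0, runnable - max_concurrent_tasks(cores, hw_threads_per_core))


if __name__ == "__main__":
    # Number of logical CPUs the OS sees on this machine (None if unknown).
    print("logical CPUs:", os.cpu_count())
```

For a quad-core CPU with 2 hardware threads per core, `max_concurrent_tasks(4, 2)` is 8, so with 20 runnable threads 12 of them are waiting at any given moment.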
An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Source:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple core units; each core is a processor in itself, capable of executing a program, but it is contained on the same chip as the others.
In the past, one CPU was distributed among quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU inside one chip (die). Since the 90's, manufacturers have started to fit more cores on the same die; that is the concept of multi-core.
These days it is possible to have hundreds of cores on the same CPU (chip or die), e.g. GPUs and Intel Xeon. Another technique developed in the 90's was simultaneous multi-threading: it was found that another thread could run on the same single-core CPU, since most of the resources, like ALUs and multiple registers, were already duplicated.
So basically a CPU can have multiple cores, each of them capable of running one or more threads at the same time. We may expect to have more cores in the future, but they will be more difficult to program efficiently.
CPU stands for central processing unit. Until around 2002 we had only single-core processors, i.e. a processor could perform only a single task or program at a time.
To run multiple programs at the same time we had to use multiple processors, which required a different (multi-processor) motherboard, and that was very expensive.
So Intel introduced the concept of hyper-threading, i.e. it makes a single CPU appear as two virtual CPUs, so we have two logical cores for our tasks. The CPU is still single, but it pretends (masquerades) to be a dual CPU and performs multiple tasks. Having real multiple cores is better than that, so manufacturers developed multi-core processors, i.e. multiple processors packed into a single big CPU. Those are multiple cores.
In the early days (before the 90s, say) processors weren't able to multitask that efficiently, because a single processor could handle just a single task at a time. So when we said that the antivirus, Microsoft Word, VLC, etc. were all running at the same time, that wasn't actually true. When I say a processor could handle a single process at a time, I mean it. It would process a single task, then pause that task, take another task, complete it if it was a short one or pause it again and add it to the queue, then take the next. But the 'pause' I mentioned was so short (on the order of milliseconds) that you didn't notice the task had been paused. E.g. while you listen to music in VLC there are other apps running alongside it, but as I told you, it is one program at a time, so VLC is actually being paused in between, for intervals too short for you to notice.
But this was about the old processors...
Nowadays processors (i.e. 3rd-gen PCs) are multi-core. The 'cores' can be compared to 1st- or 2nd-gen processors themselves, embedded onto a single chip, a single processor. So now we understand what cores are: they are mini processors that combine to become a processor. Each core can handle a single process at a time, or multiple threads, as scheduled by the OS. And they follow the same steps I mentioned above for the single processor.
E.g. an i7 with 8 cores is 8 mini processors in 1 i7, so it can run 8 tasks truly in parallel (which does not by itself make a single program 8x faster). And this is how multitasking can be done.
There could be hundreds of cores in a single processor.
I hope I explained this well.
I have read all the answers, but this link gave me a clearer explanation of the difference between a CPU (processor) and a core. So I'm leaving some notes from it here.
The main difference between a CPU and a core is that the CPU is an electronic circuit inside the computer that carries out instructions to perform arithmetic, logical, control and input/output operations, while the core is an execution unit inside the CPU that receives and executes instructions.
Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit) seated in one socket, circa 1970s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1970s releases of single-chip processors, one processor might have been spread across multiple chips. In the mid 2010s the system-on-a-chip chips made it slightly more sketchy to equate one processor to one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions.)

Optimal number of threads per core

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?
If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However, that is very likely not the case. Adding more threads usually helps, but after some point they cause performance degradation.
Not long ago, I was doing performance testing on a machine with 2 quad-core CPUs running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads, and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different numbers of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.
I agree with @Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but different ranges (two threads do not access the same index), so the results may differ if they've worked on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.
I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads per core is defined by Hyperthreading, which is just 2 (when available). But Hyperthreading can cut your per-thread execution speed in half, even when not using 2 threads! (I.e. 1 pipeline is shared between two processes; this is good when you have more processes, not so good otherwise. More cores are definitely better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's not really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads. So if you have 8 cores each of which support 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads are good. If you have very memory-intensive work (read source 1, read source 2, fast computation, write), then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SIMD instructions and read 256-bit values, that stops all threads in their tracks... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy, if not actually faster. This will depend on your process & memory architecture; some advanced servers manage separate memory ranges for separate cores, so separate threads will be faster assuming your data is properly laid out... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)
The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by making two measurements of processing times Tn and Tm for two arbitrary numbers of threads 'n' and 'm'. For linear algorithms, the optimal number of threads will be N = sqrt( m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm) ).
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
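Here is the formula for the linear case as a small sketch, so you can plug in two measurements. The function name and the error checks are mine. As far as I can tell (this is my reading, not something stated in the answer), the formula is what you get if the run time with k threads follows T(k) = A/k + B*(k-1), whose minimum is at k = sqrt(A/B).

```python
import math


def optimal_threads(n, t_n, m, t_m):
    """Estimate the optimal thread count from two timings.

    n, m: two different thread counts; t_n, t_m: run times measured with them.
    Implements N = sqrt(m*n*(Tm*(n-1) - Tn*(m-1)) / (n*Tn - m*Tm)).
    """
    numerator = m * n * (t_m * (n - 1) - t_n * (m - 1))
    denominator = n * t_n - m * t_m
    if denominator == 0:
        raise ValueError("n*Tn equals m*Tm; the formula is undefined")
    ratio = numerator / denominator
    if ratio < 0:
        raise ValueError("timings do not fit the assumed model (negative ratio)")
    return math.sqrt(ratio)
```

For example, timings of 51 s with 2 threads and 28 s with 4 threads (which fit A = 100, B = 1 in the model above) give `optimal_threads(2, 51.0, 4, 28.0)` equal to 10. Real measurements are noisy, so treat the result as a starting point for a benchmark and not as the final answer.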
The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.
I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling, then @Gonzalo's answer suffices. However, if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size, so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive, and it is preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's the L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example, assume there are 8 arbitrary units (AU) of work in the program, which will be executed on a 4-core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored the overheads mentioned in other answers (e.g., context switches), but I hope you get the point that it might be beneficial to have more threads than available cores, depending on the data size you're dealing with.
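The two cases use the same simple estimate, written out here as a sketch (the function name is mine, and it ignores scheduling overhead just as the cases above do):

```python
def total_time(per_thread_seconds, threads, cores):
    """Wall-clock estimate when `threads` equal-length threads share `cores` cores.

    Assumes threads >= cores and that the threads divide evenly among the cores.
    """
    return per_thread_seconds * threads / cores


case_1 = total_time(10, threads=4, cores=4)  # 2 AU per thread, many cache misses
case_2 = total_time(2, threads=8, cores=4)   # 1 AU per thread, data fits in cache
```

Case 1 comes out at 10 s and case 2 at 4 s, so in this made-up example the better cache behaviour of the smaller chunks more than pays for running two threads per core.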
4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.
Benchmark.
I'd start ramping up the number of threads for an application, starting at 1, and then go to something like 100, run three-five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should find that the four-thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, i.e., the dataset you're loading into memory is huge, you're getting lots of cache misses, etc., such that 2 threads are optimal.
You can't know until you test.
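A minimal sketch of such a sweep, assuming your work can be expressed as a function called once per item. `sweep`, `time_with_threads` and `fake_io_task` are hypothetical names of mine; the fake task just sleeps to imitate waiting on I/O.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def time_with_threads(task, n_items, n_threads):
    """Run task(i) for i in range(n_items) on n_threads workers; return seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(task, range(n_items)))
    return time.perf_counter() - start


def sweep(task, n_items, thread_counts, trials=3):
    """Return {thread_count: median seconds over `trials` runs}."""
    results = {}
    for n in thread_counts:
        samples = [time_with_threads(task, n_items, n) for _ in range(trials)]
        results[n] = statistics.median(samples)
    return results


def fake_io_task(_):
    """Stand-in workload (an assumption): pretends to wait on I/O for 10 ms."""
    time.sleep(0.01)
```

Calling `sweep(fake_io_task, 64, [1, 2, 4, 8, 16])` gives you the points for the graph. Keep in mind that in CPython, CPU-bound pure-Python tasks will not speed up with threads because of the global interpreter lock, so for compute-heavy work you would apply the same sweep to processes or to your native code.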
You can see how many threads and processes are running on your machine with the htop or ps commands.
See the man page for the 'ps' command:
man ps
If you want to count the processes of all users, you can use one of these commands (each count includes the header line; the second lists one line per thread):
ps aux | wc -l
ps -eLf | wc -l
To count the processes of a single user:
ps --User root | wc -l
Also, you can use "htop" [Reference]:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.
The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.
One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.
Speaking from a computation- and memory-bound point of view (scientific computing), 4000 threads will make the application run really slowly. Part of the problem is the very high overhead of context switching and, most likely, very poor memory locality.
But it also depends on your architecture. From what I have heard, Niagara processors are supposed to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However, I have no experience with those processors.
Hope this makes sense: check the CPU and memory utilization and set some threshold value. If the threshold is crossed, don't allow new threads to be created; otherwise allow them.
