Puppeteer-Cluster consistently using only half of my cores - parallel-processing

I'm running a pretty standard puppeteer cluster job, with the following settings:
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_PAGE, // maximize sharing data between jobs
maxConcurrency: 8,
monitor: true
}
However, my cpu monitor never shows more than 4 cores spiking.
Forcing maxConcurrency higher shows more workers running, but the cpu utilization is unchanged.
I've read dozens of StackOverflow threads but I can't find anything to addresses this phenomenon, and none of the other tricks to maximize performance (changing the concurrency model, passing custom chrome args, etc) have any effect on the overall cpu utilization.
FWIW, I'm on Mac OS Ventura running node v16.19.0, with an 8-core i7 from 2019. (One theory I have is that there's some OS/node thread affinity thing going on here...)

Aaaaand it turns out this is indeed a platform issue, but it's even lower-level than OS/node!
My octo-core i7 isn't octo-core at all: it uses SMT, to run two threads in parallel on a single core. So I'm actually dealing with four cores, and node's worker threads are maxing them out.
This is described in detail here: https://github.com/denoland/deno/issues/10592
I'm leaving this up, in case someone else runs into the same issue.

Related

Are 2 cpu machines better for single tasks or multiple tasks

I realize that there may not be a hard and fast rule but it seems 2 CPU machines will provide greater performance improvement when running multiple tasks as opposed to just running one task. Is this true in a Windows environment? Would a different OS make a difference?
Back in the old days CPU's were what we today would call "single core" and if you had a program use 100% CPU there was nothing left for anything else, including the taskbar you tried to get up with Ctrl+Alt+Del.
Two CPU systems (I had a dual Pentium III system at one time) fixed this as the other CPU was usually not 100% busy so it could handle the taskmanager even with the rogue program running at full speed.
Today this has moved inside the single CPU as multiple cores. So having more cores than rogue programs running at the same time is a good thing. For most users this is a dual core system but prices are falling and an eight core AMD CPU can be bought for under $100. I believe it is close to impossible to find a singe core CPU these days.

How to get concurrent function (pmap) to use all cores in Elixir?

I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with CPU usage similar to that shown below:
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come on my machine only cores 1, 3, 5, and 7 are lighting up? My guess would be that it has to do with my iex process being only a single OS-level process and OSX is managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by #Thiago Silveira about first line of iex's output. The part [smp:8:8] says how many operating system level processes is Erlang using. You can control this with flag --smp if you want to disable it:
iex --erl '-smp disable'
This will ensure that you have only one system process. You can achieve similar result by leaving symmetric multiprocessing enabled, but setting directly NumberOfShcedulers:NumberOfSchedulersOnline.
iex --erl '+S 1:1'
Each operating system process needs to have its own scheduler for Erlang processes, so you can easily see how many of them do you have currently:
:erlang.system_info(:schedulers_online)
To answer your question about performance. If your processors are not working at full capacity (100%) and non of them is doing nothing (0%) then it is probable that making the load more evenly distributed will not speed things up. Why?
The CPU usage is measured by probing the processor state at many points in time. This states are either "working" or "idle". 82% CPU usage means that you can perform couple of more tasks on this CPU without slowing other tasks.
Erlang schedulers try to be smart and not migrate Erlang processes between cores unless they have to because it requires copying. The migration occurs for example when one of schedulers is idle. It can then borrow a process from others scheduler run queue.
Next thing that may cause such a big discrepancy between odd and even cores is Hyper Threading. On my dual core processor htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical because of HT. It might be the case that you are utilizing your physical cores with 100%.
Another thing: pmap needs to calculate result in separate process, but at the end it sends it to the caller which may be a bottleneck. The more you send messages the less CPU utilization you can achieve. You can try for fun giving the processes a task that is really CPU intensive like calculating Ackerman function. You can even calculate how much of your job is the sequential part and how much is parallel using Amdahl's law and measuring execution times for different number of cores.
To sum up: the CPU utilization from screenshot looks really great! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process 2000 reductions before it swaps out the process for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it. Processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs a scheduling thread on each available core.
What this implies is that in order to get the most use of the processors you need to break up your tasks into large enough pieces of work that will exceed the standard "reduction" slice of work. In general, pmap style parallelism only gives significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work and that can skew usage when you use
a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer.

Memory-intense jobs scaling poorly on multi-core cloud instances (ec2, gce, rackspace)?

Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory intense jobs (2.5GB in my case)?
When I run jobs locally on my quad xeon chip, the difference between using 1 core and all 4 cores is about a 25% slowdown with all cores. This is to be expected from what I understand; a drop in clock rate as the cores get used up is part of the multi-core chip design.
But when I run the jobs on a multicore virtual instance, I am seeing a slowdown of like 2x - 4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances. And I have tested many difference instance types, mostly the fastest offered.
So has this behavior been seen by others with jobs about the same size in memory usage?
The jobs I am running are written in fortran. I did not write them, and I'm not really a fortran guy so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without the need to communicate with each other, ie., embarrasingly parallel. They each take about 2.5GB in memory.
So my best guess so far is that jobs that use up this much memory take a big hit by the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.
My workaround for now is to use GCE because they have single-core instance that actually runs the jobs as fast as my laptop's chip, and are priced almost proportionally by core.
You might be running into memory bandwidth constraints, depending on your data access pattern.
The linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:
Running one copy of the single-threaded program on your laptop takes X minutes to complete.
Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.
Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.
Running N copies of the single-threaded program on an N-core virtual cloud instances, each copy takes X * 2-4 minutes to complete.
If so, it sounds like you're either running into a kernel contention or contention for e.g. memory I/O. It would be interesting to see whether various fortran compiler options might help optimize memory access patterns; for example, enabling SSE2 load/store intrinsics or other optimizations. You might also compare results with gcc and intel's fortran compilers.

Optimal number of threads per core

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?
If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However that very likely not the case. Adding more threads usually helps, but after some point, they cause some performance degradation.
Not long ago, I was doing performance testing on a 2 quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different number of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.
I agree with #Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but different ranges (two threads do not access the same index), so the results may differ if they've worked on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.
I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads is defined by the Hyperthreading which is just 2 (when available). But Hyperthreading cuts your execution time by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes -- this is good when you have more processes, not so good otherwise. More cores are definitively better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's no really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads. So if you have 8 cores each of which support 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads is good. If you have very heavy memory intensive work (read source 1, read source 2, fast computation, write) then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256 bits values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy if not actually faster. This will depend on your process & memory architecture, some advanced servers manage separate memory ranges for separate cores so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)
The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by making two measurements of processing times Tn and Tm for two arbitrary number of threads ‘n’ and ‘m’. For linear algorithms, the optimal number of threads will be N = sqrt ( (mn(Tm*(n-1) – Tn*(m-1)))/(nTn-mTm) ) .
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.
I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling then #Gonzalo's answer suffices. However if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive and would be preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example assume there's 8 arbitrary unit (or AU) of work in the program which will be executed on a 4 core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored overheads mentioned in other answers (e.g., context switches) but hope you get the point that it might be beneficial to have more number of threads than the available number of cores, depending on the data size you're dealing with.
4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.
Benchmark.
I'd start ramping up the number of threads for an application, starting at 1, and then go to something like 100, run three-five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should that the four thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, ie, the dataset you're loading into memory is huge, you're getting lots of cache misses, etc, such that 2 threads are optimal.
You can't know until you test.
You will find how many threads you can run on your machine by running htop or ps command that returns number of process on your machine.
You can use man page about 'ps' command.
man ps
If you want to calculate number of all users process, you can use one of these commands:
ps -aux| wc -l
ps -eLf | wc -l
Calculating number of an user process:
ps --User root | wc -l
Also, you can use "htop" [Reference]:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.
The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.
One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.
speaking from computation and memory bound point of view (scientific computing) 4000 threads will make application run really slow. Part of the problem is a very high overhead of context switching and most likely very poor memory locality.
But it also depends on your architecture. From where I heard Niagara processors are suppose to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However I have no experience with those processors.
Hope this makes sense, Check the CPU and Memory utilization and put some threshold value. If the threshold value is crossed,don't allow to create new thread else allow...

Is it advantageous to use threads in windows?

Some of the fellows in the office think that when they've added threads to their code that windows will assign these threads to run on different processors of a multi-core or multi-processor machine. Then when this doesn't happen everything gets blamed on the existence of these threads colliding with one another on said multi-core or multi-processor machine.
Could someone debunk or confirm this notion?
When an application spawns multiple threads, it is indeed possible for them to get assigned to different processors. In fact, it is not uncommon for incorrect multi-threaded code to run ok on a single-processor machine but then display problems on a multi-processor machine. (This happens if the code is safe in the face of time slicing but broken in the face of true concurrency.)
You generally can only optimally have 1 thread per CPU, but unless your application has some explicit thread affinity to one processor then yes Windows will assign these threads to a free processor.
Windows will automatically execute multiple threads on different processors if the machine has multiple processors. If you are running on a single processor machine, the threads are time-sliced but when you move the process to a multiple processor machine, the process will automatically take advantage of the multiple processors.
Because the code is running simultaneously, the threads may be more likely to step on each others toes on a multi-core machine then on a single core machine since both threads could be writing to a shared location at the same time instead of this happening if the thread swap is timed just right.
Yes, Threads and multi-Threading has almost nothing to do with the number of cpus or cores in a machine...
EDIT ADDITION: To talk about "how many threads run on a cpu" is an oxymoron. Only one thread can ever run on a single CPU. Multi-Threading is about multiple threads in a PROCESS, not on a CPU. Before another thread can be run on any CPU, the thread currently on that CPU has to STOP running, and it's state must be preserved somewhere so that the OS can restart it when it get's it's next "turn".
Code runs in "Processes" which are logical abstractions that can run one or more sequences of code instructions, and manage computer resources independantly from other processes. Within a process, each separate sequence of code instructions is a "thread". Which cpu they run on is irrelevant. A single thread can run on a differnt cpu each time is is allocated a cpu to run on... and multiple threads, as they are each allocated cpu cycles, may, by coincidence, run on the same cpu (although obviously not simultaneously)
The OS (a component of the OS) is responsible for "running" threads. It keeps an in-memory list of all threads, and constantly "switches" (it's called a context switch) among them. It does this in a single CPU machine in almost exactly the same way as it does ion in a multiple-cpu machine. Even in a multiple Cpu machine, each time it "turns on" a thread, it might give it to a different CPU, or to the same cpu, as it did the last time.
There is no guarantee that threads of your process will be assigned to run on different CPUs.

Resources