May I have Project Loom Clarified? - parallel-processing

Brian Goetz got me excited about project Loom and, in order to fully appreciate it, I'll need some clarification on the status quo.
My understanding is as follows: currently, in order to have real parallelism, we need to have a thread per CPU/core.
1) Is there then any point in having n+1 threads on an n-core machine?
Project Loom will bring us virtually limitless threads/fibres by relying on the JVM to carry out a task on a virtual thread, inside the JVM.
2) Will that be truly parallel?
3) How, specifically, will that differ from the aforementioned scenario of "n+1 threads on an n-core machine"?
Thanks for your time.

Virtual threads allow for concurrency (IO bound), not parallelism (CPU bound). They represent causal simultaneity, but not resource usage simultaneity.
In fact, if two virtual threads are in an IO-bound* state (awaiting the response to a REST call, for example), then no OS thread is being used at all. By contrast, two ordinary platform threads (absent a reactive or CompletableFuture-based approach) would both be blocked and unavailable for other work until the calls complete.
*Except under certain conditions (e.g., use of synchronized vs. ReentrantLock, blocking that occurs in a native method, and possibly some other minor areas), where the virtual thread stays pinned to its carrier thread.
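To make the concurrency point concrete, here is a minimal sketch using the virtual-thread API that eventually shipped in Java 21 (Executors.newVirtualThreadPerTaskExecutor); the sleep is only a stand-in for a blocking call such as a REST request:

```java
import java.time.Duration;
import java.util.concurrent.Executors;

public class VirtualThreadDemo {
    public static void main(String[] args) {
        // Each submitted task gets its own virtual thread (Java 21+).
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    // While the virtual thread "blocks" here, its carrier
                    // (platform) thread is released to run other tasks.
                    try {
                        Thread.sleep(Duration.ofSeconds(1));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // close() waits for the submitted tasks to finish
    }
}
```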

is there then any point in having n+1 threads on an n-core machine?
For one, most modern n-core machines actually expose n*2 hardware threads, because each core provides two hardware threads (SMT/hyper-threading).
Sometimes it does make sense to spawn more OS threads than there are hardware threads. That's the case when some OS threads are asleep waiting for something. For instance, on Linux, until io_uring arrived a couple of years ago, there was no good way to do asynchronous I/O for files on local disks. Traditionally, disk-heavy applications spawned more threads than CPU cores and used blocking I/O.
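A hedged sketch of that traditional pattern, expressed in Java since the original gives no code (the file paths are placeholders): oversubscribe a pool with more threads than cores and let blocking reads put most of them to sleep.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingDiskIo {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Oversubscribe: more threads than cores, because at any moment
        // most of them will be asleep inside blocking read() calls.
        ExecutorService pool = Executors.newFixedThreadPool(cores * 4);

        List<Path> files = List.of(Path.of("a.dat"), Path.of("b.dat")); // placeholder paths
        for (Path file : files) {
            pool.submit(() -> Files.readAllBytes(file)); // blocking file I/O
        }
        pool.shutdown();
    }
}
```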
Will that be truly parallel?
Depends on the implementation: not just the language runtime, but also the I/O-related parts of the standard library. For instance, on Windows, when doing disk or network I/O in C# with async/await (an equivalent of Project Loom, released around 2012), these tasks are truly parallel; the OS kernel and drivers are indeed doing more work at the same time. AFAIK, on Linux async/await is only truly parallel for sockets, not for files; for asynchronous file I/O it uses a pool of OS threads under the hood.
How, specifically, will that differ from the aforementioned scenario "n+1 threads on an n-core machine "?
OS threads are more expensive for a few reasons. (1) They require a native stack, so each OS thread consumes memory. (2) Memory is slow and processors compensate with caches; switching between OS threads costs extra RAM bandwidth because the data cached for the old thread goes cold after a context switch. (3) OS schedulers have been improving for decades, but they are still not free; one reason is that saving and restoring thread state to and from memory takes time.
The higher-level cooperative multitasking implemented by C#'s async/await or Java's Loom incurs far less overhead when switching contexts than switching OS threads does. At least in theory, this should improve both throughput and latency for I/O-heavy applications.
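For reference, this is roughly how the two kinds of threads are created with the API Loom eventually delivered in Java 21 (Thread.ofPlatform / Thread.ofVirtual); the comments sketch the cost difference described above and are not a benchmark:

```java
public class ThreadKinds {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> System.out.println("running on " + Thread.currentThread());

        // A platform thread maps 1:1 to an OS thread and reserves a native stack.
        Thread platform = Thread.ofPlatform().start(task);

        // A virtual thread is scheduled by the JVM onto a small pool of carrier
        // threads; its stack lives on the Java heap and is mounted/unmounted cheaply.
        Thread virtual = Thread.ofVirtual().start(task);

        platform.join();
        virtual.join();
    }
}
```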

Related

The effects of heavy thread consumption on ARM (4-core A72) vs x86 (2-core i5)

I have a realtime Linux desktop application (written in C) that we are porting to ARM (quad-core Cortex-A72, an ARMv8-A design). Architecturally, it has a combination of high-priority explicit pthreads (6 of them) and a couple of GCD (libdispatch) worker queues (one concurrent and one serial).
My concerns come in two areas:
I have heard that ARM does not hyperthread the way that x86 can, and that therefore my 4 cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this?
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A couple of the pthreads are high-priority handlers for fairly rare events (they mostly sit blocked in a select() call); does this change the prospects much?
My bigger concern comes from the impact of GCD in this application. My understanding of the inner workings of GCD is that it is a dynamically scaled thread pool that interacts with the scheduler and will try to add more threads to suit the load. It sounds to me like this will have an almost exclusively negative impact on performance in my scenario (i.e., a system whose cores are fully consumed). Correct?
I'm not an expert on anything x86-architecture related (so hopefully someone more experienced can chime in) but here are a few high level responses to your questions.
I have heard that ARM does not hyperthread the way that x86 can [...]
Correct, hyperthreading is a proprietary Intel chip design feature. There is no analogous ARM silicon technology that I am aware of.
[...] and therefore my 4-cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this? [...]
This is not necessarily the case, although it could very well happen in many scenarios. It really depends on the nature of your per-thread computations: are you just doing lots of hefty computation, or are you doing a lot of blocking/waiting on I/O? Either way, this degradation will happen on both architectures; it is a general thread-scheduling problem. In the hyperthreaded Intel world, each "physical core" is seen by the OS as two "logical cores" which share the same resources but have their own pipelines and register sets. The Wikipedia article states:
Each logical processor can be individually halted, interrupted or directed to execute a specified thread, independently from the other logical processor sharing the same physical core.[7]
Unlike a traditional dual-processor configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread. The degree of benefit seen when using a hyper-threaded or multi core processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.[7]
So if a few of your threads are constantly blocking on I/O, this might be where you would see more improvement in a 6-thread application on a 4-physical-core system (for both ARM and Intel x86), since theoretically this is where hyperthreading would shine: a thread blocking on I/O or on the result of another thread can "sleep" while still allowing the other thread running on the same core to do work, without the full overhead of a thread switch (experts please chime in and tell me if I'm wrong here).
But 4-core ARM vs 2-core x86... assuming all else is equal (which obviously is not the case; in reality clock speeds, cache hierarchy, etc. all have a huge impact), I think it really depends on the nature of the threads. I would imagine a drop in performance could occur if you are just doing a ton of purely CPU-bound computation (i.e., the threads never need to wait on anything external to the CPU). But if you are doing a lot of blocking I/O in each thread, you might see significant speedups with up to perhaps 3 or 4 threads per logical core.
Another thing to keep in mind is the cache. When doing lots of CPU-bound computation, a thread switch has the potential to blow up the cache, resulting in much slower memory access initially; this will happen on both architectures, and it matters far less for threads that spend most of their time blocked on I/O. But if you are not doing a lot of blocking work, then the extra overhead of threading will just make things slower for the reasons above.
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A hardware context switch is a hardware context switch: you push all the registers to the stack and flip some bits to change execution state. So no, I don't believe either architecture is "faster" in that regard. However, for a single physical core, techniques like hyperthreading make a "context switch" in the operating-system sense (I think you mean switching between threads) much faster, since the instructions of both programs were already being executed in parallel on the same core.
I don't know anything about GCD so can't comment on that.
At the end of the day, I would say your best shot is to benchmark the application on both architectures and see where your bottlenecks are. Is it memory access? Then keeping the cache hot is a priority. I imagine that one thread per core would always be optimal for any scenario, if you can swing it.
Some good things to read on this matter:
https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
https://lwn.net/Articles/250967/
Optimal number of threads per core
Thread context switch Vs. process context switch

Running GO runtime.GOMAXPROCS(4) on a dual core cpu

Sorry if this sounds stupid.
What will happen if I run runtime.GOMAXPROCS(4) while runtime.NumCPU() == 2?
runtime.GOMAXPROCS controls how many operating-system-level threads may simultaneously run goroutines of your program (the runtime itself will create several more threads for its own needs, but this is beside the point).
Basically, that is all that will happen.
But presumably you actually intended to ask something like "how would that affect the performance of my program?", right?
If yes, the answer is "it depends".
I'm not sure whether you have had a chance to work with systems having only a single CPU with a single core (basically most consumer-grade IBM PC-compatible computers up to the generation of the Pentium® CPUs with the so-called "hyper-threading" technology), but those systems routinely ran hundreds to thousands of OS threads on a "single core" (the term did not really exist in the mainstream back then, but OK).
Another thing to consider is that your program does not run in isolation: there are other programs running on the same CPU, and the kernel itself has several in-kernel threads as well.
You may use a tool like top or htop to assess the number of threads your system is currently scheduling across all your cores.
By this time, you might be wondering why the Go runtime defaults to creating as many goroutine-running threads as there are CPU cores (logical CPUs, to be precise).
Presumably, this comes from a simple fact that in a typical server-side workload, your program will be sort of "the main one".
In other words, the contention of its threads with the threads of other processes and the kernel will be reasonably low.

Multicore thread processing order

I am having some real trouble finding this info online. I'm in uni on Monday so I could use the library then, but the sooner the better. When a system has multicore processors, does each processor take a thread from the first process in the ready queue, or does it take one from the first and one from the second? Also, just to check: threads will be sent to and fetched from the multiple cores concurrently by the OS, right? If anyone could point me in the right direction resource-wise, that would be great!
The key thing is to appreciate what the machine's architecture actually is.
A "core" is a CPU with cache with a connection to the system memory. Most machine architectures are Symmetric Multi-Processing, meaning that the system memory is equally accessible by all cores in the system.
Most operating systems run a scheduler on each core (Linux does). Each scheduler has a list of threads it is responsible for, and it will run them to the best of its ability on the core that it controls. The rules it uses to choose which thread to run will be round-robin, priority-based, etc.; i.e., all the normal scheduling rules. So far it is just like the scheduler you would find in a single-core computer, and to some extent each scheduler is independent of all the other schedulers.
However, this is an SMP environment, meaning that it really doesn't matter which core runs which thread. This is because all the cores can see all the memory, and the code and data for every thread in the entire system are stored in that single memory.
So the schedulers talk amongst themselves to help each other out. Schedulers with too many threads to run can pass a thread over to a scheduler whose core is under utilised. They are load balancing within the machine. "Pass a thread over" means copying the data structure that describes the thread (thread id, which data, which code).
So that's about it. As the only communication between cores is via memory it all relies on an effective mutual exclusion semaphore system being available, which is something the hardware has to allow for.
The Difficulty
So I've painted a very simple picture, but in practice the memory is not perfectly symmetrical. SMP these days is synthesised on top of HyperTransport and QPI.
Long gone are the days when cores really did have equal access to the system memory at the electronic level. At the very lowest layer of their architecture AMD are purely NUMA, and Intel nearly so.
Nowadays a core has to send a request to other cores over a high speed serial link (HyperTransport or QPI) asking them to send data that they've got in their attached memory. Intel and AMD have done a good job of making it look convincingly like SMP in the general case, but it's not perfect. Data in memory attached to a different core takes longer to get hold of. It's insanely complex - the cores are now nodes on a network - but it's what they've had to do to get improved performance.
So schedulers take that into account when choosing which core should run which thread. They will try to place a thread on a core that is closest to the memory holding the data that the thread has access to.
The Future, Again
If the world's software ecosystem could be weaned off SMP the hardware guys would be able to save a lot of space on the silicon, and we would have faster more efficient systems. This has been done before; Transputers were a good attempt at a strictly NUMA architecture.
NUMA and Communicating Sequential Processes would today make it far easier to write multi threaded software that scales very easily and runs more efficiently than today's SMP shared memory behemoths.
SMP was in effect a cheap and nasty way of bringing together multiple cores, and the cost in terms of software development difficulties and inefficient hardware has been very high.

Windows, multiple process vs multiple threads

We have to make our system highly scalable; it has been developed for the Windows platform using VC++. Say that initially we would like to process 100 requests (from MSMQ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, with the second approach? In Windows, is CPU time first allocated to a process and then split between that process's threads, or does the OS schedule on the basis of threads rather than processes? We notice that in the first case CPU utilization is 15-25% and we want to consume more CPU. Keep in mind that we want optimal performance, so the 100 requests are just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.
One more point; our product already supports clustering, but we want to utilize more CPU on the single node.
Any suggestions will be highly appreciated.
You can't process more requests simultaneously than you have CPU cores. "Fast", scalable solutions involve setting up thread pools where the number of active (not blocked on I/O) threads == the number of CPU cores. So creating 100 threads because you want to service 100 MSMQ requests is not good design.
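The question is about VC++ and IO Completion Ports (described next), but the "active threads == CPU cores" pool idea itself is language-agnostic; here is a hedged sketch of it in Java, where the queue stands in for MSMQ and handle() is a placeholder:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class RequestPool {

    // Stand-in for the MSMQ queue in the original question.
    static final BlockingQueue<String> requests = new LinkedBlockingQueue<>();

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // One worker per core: queued requests wait their turn instead of
        // 100 threads all fighting over the same few cores.
        ExecutorService workers = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores; i++) {
            workers.submit(RequestPool::workerLoop);
        }
        // In a real service the workers run for the lifetime of the process.
    }

    static void workerLoop() {
        try {
            while (true) {
                String request = requests.take(); // blocks until a request arrives
                handle(request);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static void handle(String request) {
        // placeholder for the CPU-bound processing of one request
    }
}
```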
Windows has a thread pooling mechanism called IO Completion Ports.
Using IO Completion Ports does push the design toward a single process, since in a multi-process design each process would have its own IO Completion Port thread pool that it manages independently, and hence you could end up with far more threads contending for the CPU cores.
The "core" idea of an IO Completion Port is that it is a kernel-mode queue: you can manually post events to the queue, or get asynchronous I/O completions posted to it automatically by associating file-like handles (files, sockets, pipes) with the port.
On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads, but it does NOT dequeue jobs if it detects that the number of currently "active" threads in the thread pool is >= the number of CPU cores.
Using IO Completion Ports can potentially increase the scalability of a service a lot; usually, however, the gain is a lot smaller than expected, because other factors quickly come into play once all the CPU cores are contending for the service's other resources.
If your service is developed in C++, you might find that serialized access to the heap is a big performance minus, although Windows 6.1 (Windows 7) seems to have implemented a low-contention heap, so this might be less of an issue.
To summarize: theoretically, your biggest performance gains would come from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you use not serializing access to critical resources, which can quickly lose you all of the theoretical performance gains.
If you do have library code serializing your nicely thread-pooled service (as in the case of C++ object creation and destruction being serialized because of heap contention), then you need to change your use of the library, switch to a low-contention version of the library, or just scale out to multiple processes.
The only way to know is to write test cases that stress the server in various ways and measure the results.
The standard approach on Windows is multiple threads. That's not to say it is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and of its threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.
If your system is already developed, which it appears to be, it is likely to be easier to implement a multiple-process solution, especially if there is a chance that more than one machine may be utilized later, since IPC between two processes on one machine can, in the general case, scale to multiple machines. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks. For example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and simply end up waiting on your database.
just my .02

Multiple threads and performance on a single CPU

Is there any performance benefit to using multiple threads on a computer with a single CPU that does not have hyperthreading?
In terms of speed of computation, no. In fact, things will slow down due to the overhead of managing the threads.
In terms of responsiveness, yes. You can for example have one thread wait on an IO operation and have another run a GUI at the same time.
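A tiny Java sketch of that idea (the sleep is just a stand-in for a slow blocking I/O call): even on a single core, the blocked thread costs almost no CPU time, so the other thread keeps doing useful work.

```java
public class SingleCpuOverlap {
    public static void main(String[] args) throws InterruptedException {
        Thread downloader = new Thread(() -> {
            try {
                Thread.sleep(2_000); // stand-in for slow blocking I/O (network/disk)
                System.out.println("I/O finished");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        downloader.start();

        // Meanwhile the main thread stays responsive / keeps computing.
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        System.out.println("useful work done while waiting: " + sum);

        downloader.join();
    }
}
```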
It depends on your application. If it spends all its time using the CPU, then multithreading will just slow things down - though you may be able to use it to be more responsive to the user and thus give the impression of better performance.
However, if your code is limited by other things, for example using the file system, the network, or any other resource, then multithreading can help, since it allows your application to behave asynchronously. So while one thread is waiting for a file to load from disk, another can be querying a remote webserver and another redrawing the GUI, while another is doing various calculations.
Working with multiple threads can also simplify your business logic, since you don't have to pay so much attention to how various independent tasks need to interleave. If the operating system's scheduling logic is better than yours, then you may indeed see improved performance.
You can consider using multithreading on a single CPU:
If you use network resources
If you do I/O-intensive operations
If you pull data from a database
If you rely on other resources with possible delays
If you want your app to stay responsive
When you should not use multithreading on a single CPU:
For CPU-intensive operations that sit at almost 100% CPU usage
If you are not sure how to use threads and synchronization
If your application cannot be divided into several parallel tasks
Yes, there is a benefit of using multiple threads (or processes) on a single CPU - if one thread is busy waiting for something, others can continue doing useful work.
However this can be offset by the overhead of task switching. You'll have to benchmark and/or profile your application on production grade hardware to find out.
Regardless of the number of CPUs available, if you require preemptive multitasking and/or applications with asynchronous components (i.e. pretty much anything that combines a responsive GUI with a non-trivial amount of computation or continuous I/O processing), multithreading performs much better than the alternative, which is to use multiple processes for each application.
This is because threads in the same process can exchange data much more efficiently than can multiple processes, because they share the same memory context.
See this Wikipedia article on computer multitasking for a fairly concise discussion of these issues.
Absolutely! If you do any kind of I/O, there is a great advantage to having a multithreaded system. While one thread waits for an I/O operation (which is relatively slow), another thread can do useful work.
