shared memory vs distributed memory and multithread vs multiprocess

I am teaching myself parallel programming. Is distributed memory always multiprocess, and is multithreading always shared memory? And can multiprocess programming be used both for distributed memory and for shared memory?
Thanks and regards!

Yes, yes, and "yes, in a sense"
In a distributed memory system, different CPU units have their own memory systems. Access from another CPU will most likely be slower, or come with a more limited coherency model, if it is possible at all. This is more typical of a message-passing multiprocessor.
Using multiple threads for parallel programming is more of a software paradigm than a hardware issue, but you are correct: use of the term "thread" essentially implies that a single shared memory space is in use. It may or may not involve multiple physical processors, and it may not even involve multiple kernel threads, in which case the threads will not execute in parallel.
I'm not completely clear on the meaning of the last question. Certainly, saying "distributed memory" or "shared memory" implies "distributed over processors" and "shared by processors", so I suppose the terms are only reasonably applied to multiprocessor (or potentially multiprocessor) systems. If we are talking about multiple processes in the software sense, they are pretty much a requirement for distributed memory systems, and essentially a requirement (though they might be called threads) for a shared memory system.
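To make the thread/process distinction concrete, here is a minimal POSIX sketch of mine (not from the question; compile with -pthread): a second thread sees the same global variable, while a fork()ed child only changes its own private copy unless memory is explicitly shared (e.g. via mmap with MAP_SHARED) or messages are passed.

```c
/* Minimal POSIX sketch: threads share one address space, while
 * fork()ed processes get separate copies unless memory is
 * explicitly shared (e.g. via mmap with MAP_SHARED). */
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;            /* ordinary global variable */

static void *bump(void *arg) {
    (void)arg;
    counter++;                     /* visible to the other thread */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, bump, NULL);
    pthread_join(t, NULL);
    printf("after thread:  counter = %d\n", counter);   /* prints 1 */

    pid_t pid = fork();
    if (pid == 0) {                /* child: private copy of counter */
        counter++;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("after process: counter = %d\n", counter);   /* still 1 */
    return 0;
}
```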
I should add that distributed-memory-but-cache-coherent systems do exist and are a type of shared memory multiprocessor design called NUMA. Only a few years ago these machines were the lunatic fringe of parallel computing, but now the Intel Core i7 processors have brought NUMA into the mainstream.

Related

The effects of heavy thread consumption on ARM (4-core A72) vs x86 (2-core i5)

I have a realtime Linux desktop application (written in C) that we are porting to ARM (a 4-core ARMv8 Cortex-A72 CPU). Architecturally, it is a combination of high-priority explicit pthreads (6 of them) and a couple of GCD (libdispatch) worker queues (one concurrent and another serial).
My concerns come in two areas:
I have heard that ARM does not hyperthread the way that x86 can, and therefore my 4 cores will already be context switching to keep up with my 6 pthreads (plus background processes). What kind of performance penalty should I expect from this?
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A couple of the pthreads are high-priority handlers for fairly rare events (i.e. they are sitting in a select() call); does this change the prospects much?
My bigger concern comes from the impact of GCD in this application. My understanding of the inner workings of GCD is that it is a dynamically scaled thread pool that interacts with the scheduler and will try to add more threads to suit the load. It sounds to me like this will have an almost exclusively negative impact on performance in my scenario (i.e. in a system whose cores are fully consumed). Correct?
I'm not an expert on anything x86-architecture related (so hopefully someone more experienced can chime in) but here are a few high level responses to your questions.
I have heard that ARM does not hyperthread the way that x86 can [...]
Correct: Hyper-Threading is a proprietary Intel chip design feature (Intel's implementation of simultaneous multithreading). There is no analogous ARM silicon technology that I am aware of.
[...] and therefore my 4-cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this? [...]
This is not necessarily the case, although it could very well happen in many scenarios. It really depends on the nature of your per-thread computations: are you doing lots of heavy computation, or a lot of blocking/waiting on I/O? Either way, this degradation will happen on both architectures; it is a general thread scheduling problem. In the hyperthreaded Intel world, each "physical core" is seen by the OS as two "logical cores" which share most execution resources but keep their own architectural register state. The Wikipedia article states:
Each logical processor can be individually halted, interrupted or directed to execute a specified thread, independently from the other logical processor sharing the same physical core.[7]
Unlike a traditional dual-processor configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread. The degree of benefit seen when using a hyper-threaded or multi core processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.[7]
So if a few of your threads are constantly blocking on I/O, then this might be where you would see more improvement in a 6-thread application on a 4-physical-core system (for both ARM and Intel x86), since theoretically this is where hyperthreading would shine: a thread blocking on I/O, or on the result of another thread, can "sleep" while still allowing the other thread on the same core to do work without the full overhead of a thread switch (experts, please chime in and tell me if I'm wrong here).
But 4-core ARM vs 2-core x86... assuming all else is equal (which obviously is not the case; in reality, clock speeds, cache hierarchy, etc. all have a huge impact), I think it really depends on the nature of the threads. I would imagine a drop in performance could occur if you are doing a ton of purely CPU-bound computation (i.e. the threads never need to wait on anything external to the CPU). But if you are doing a lot of blocking I/O in each thread, you might see significant speedups with up to perhaps 3 or 4 threads per logical core.
Another thing to keep in mind is the cache. When doing lots of CPU-bound computation, a thread switch can blow away the cache, resulting in much slower memory access initially. This will happen on both architectures; it is less of an issue for I/O-bound threads. If you are not doing a lot of blocking work, however, the extra threading overhead will just make things slower for the reasons above.
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A hardware context switch is a hardware context switch: you push all the registers to the stack and flip some bits to change execution state. So no, I don't believe either is "faster" in that regard. However, for a single physical core, techniques like hyperthreading make a "context switch" in the operating-system sense (I think you mean switching between threads) much faster, since the instructions of both programs are already being executed in parallel on the same core.
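If you want a rough feel for the cost on your own hardware, one common approach (and roughly what the first link below measures) is a pipe ping-pong between two threads. This sketch of mine is only indicative; run it pinned to one core (e.g. under taskset -c 0) so every hop really forces a switch, and build with -pthread.

```c
/* Rough ping-pong benchmark: two threads bounce one byte through a
 * pair of pipes, blocking on each other on every hop. Dividing the
 * elapsed time by the number of hops gives a crude upper bound on
 * the cost of a thread switch plus the pipe round trip. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define HOPS 100000

static int ping[2], pong[2];       /* ping: main->worker, pong: worker->main */

static void *worker(void *arg) {
    char b;
    (void)arg;
    for (int i = 0; i < HOPS; i++) {
        read(ping[0], &b, 1);      /* block until main writes */
        write(pong[1], &b, 1);     /* wake main up again */
    }
    return NULL;
}

int main(void) {
    char b = 'x';
    pthread_t t;
    struct timespec t0, t1;

    pipe(ping);
    pipe(pong);
    pthread_create(&t, NULL, worker, NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < HOPS; i++) {
        write(ping[1], &b, 1);
        read(pong[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    pthread_join(t, NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per round trip (two switches plus pipe overhead)\n",
           ns / HOPS);
    return 0;
}
```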
I don't know anything about GCD so can't comment on that.
At the end of the day, I would say your best shot is to benchmark the application on both architectures and see where your bottlenecks are. Is it memory access? Then keeping the cache hot is a priority. I imagine that one thread per core would always be optimal for any scenario, if you can swing it.
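If you do want to try the one-thread-per-core route on Linux, here is a hedged sketch using the GNU-specific pthread_attr_setaffinity_np(); the worker function and thread count are placeholders for your own workload.

```c
/* Linux-specific sketch: pin each worker thread to its own core so
 * the scheduler cannot migrate it and the per-core caches stay warm.
 * pthread_attr_setaffinity_np() is a GNU extension (_GNU_SOURCE). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NWORKERS 4

static void *work(void *arg) {
    long id = (long)arg;
    /* ... the real per-thread workload would go here ... */
    printf("worker %ld running on core %d\n", id, sched_getcpu());
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t tid[NWORKERS];

    for (long i = 0; i < NWORKERS; i++) {
        cpu_set_t set;
        pthread_attr_t attr;

        CPU_ZERO(&set);
        CPU_SET((int)(i % ncores), &set);      /* one core per worker */

        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, work, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```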
Some good things to read on this matter:
https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
https://lwn.net/Articles/250967/
Optimal number of threads per core
Thread context switch Vs. process context switch

Why does Windows switch processes between processors?

If a single-threaded process is busy and uses 100% of a single core, it seems like Windows switches this process between the cores, because in Task Manager's core overview all cores are equally used.
Why does Windows do that? Isn't this destroying L1/L2 caches?
There are advantages to pinning a process to one core, primarily caching, which you already mentioned.
There are also disadvantages -- you get unequal heating, which can create mechanical stresses that do not improve the expected lifetime of the silicon die.
To avoid this, OSes tend to keep all cores at equal utilization. When there's only one active thread, it will have to be moved and invalidate caches. As long as this is done infrequently (in CPU time), the impact of the extra cache misses during migration is negligible.
For example, the abstract of "Energy and thermal tradeoffs in hardware-based load balancing for clustered multi-core architectures implementing power gating" explicitly lists this as a design goal of scheduling algorithms (emphasis mine):
In this work, a load-balancing technique for these clustered multi-core architectures is presented that provides both a low overhead in energy and a smooth temperature distribution across the die, increasing the reliability of the processor by evenly stressing the cores.
Spreading the heat dissipation throughout the die is also essential for techniques such as Turbo Boost, where cores are clocked temporarily at a rate that is unsustainable long term. By moving load to a different core regularly, the average heat dissipation remains sustainable even though the instantaneous power is not.
Your process may be the only one doing a lot of work, but it is not the only thing running. There are lots of other processes that need to run occasionally. When your process gets evicted and eventually re-scheduled, the core on which it was running previously might not be available. It's better to run the waiting process immediately on a free core than to wait for the previous core to be available (and in any case its data will likely have been bumped from the caches by the other thread).
In addition, modern CPUs allow all the cores in a package to share high-level caches. See the "Smart Cache" feature in this Intel Core i5 spec sheet. You still lose the lower-level cache(s) on core switch, but those are small and will probably churn somewhat anyway if you're running more than just a small tight loop.
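If you want to see the caching effect for yourself, you can override the scheduler and pin the process to a single core. Here is a minimal sketch using the Win32 affinity API (the same thing Task Manager's "Set affinity" menu entry does); the mask value is just an example.

```c
/* Minimal sketch: restrict the current process to logical CPU 0 so
 * Windows stops migrating it between cores. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    DWORD_PTR mask = 0x1;   /* bit 0 = logical processor 0 */
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                GetLastError());
        return 1;
    }
    /* ... busy single-threaded work now stays on one core ... */
    return 0;
}
```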

Would a software cache improve performance on a NUMA machine?

Since a NUMA machine does not have a local cache, would a software cache implementation improve performance for tasks that require access to remote memory?
Some NUMA machines do have local cache. If you have a multi-socket Opteron or Xeon system, each socket is a NUMA domain with multiple levels of cache, some shared between cores and some not. At least for Intel chips since Nehalem, all of those caches can store remote memory references. This is good for performance in 2-8 sockets but also continues to be a benefit on larger systems built on longer range cache-coherent interconnects like NumaConnect or SGI NUMALink.
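On a cache-coherent NUMA system, the usual first step is not a software cache at all but simply keeping data on the node that touches it. As a rough illustration (assuming Linux with libnuma, linked with -lnuma; the buffer size is arbitrary):

```c
/* libnuma sketch: allocate a buffer on the NUMA node the calling
 * thread runs on, so most accesses stay local instead of crossing
 * the interconnect. */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int node = numa_node_of_cpu(sched_getcpu());   /* node this thread runs on */
    size_t size = 64UL << 20;                      /* 64 MiB */
    double *buf = numa_alloc_onnode(size, node);   /* pages placed on 'node' */
    if (!buf) return 1;

    for (size_t i = 0; i < size / sizeof(double); i++)
        buf[i] = 0.0;                              /* accesses stay node-local */

    printf("allocated %zu MiB on node %d (%d nodes total)\n",
           size >> 20, node, numa_max_node() + 1);
    numa_free(buf, size);
    return 0;
}
```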
With that said, if you're stuck on a non-coherent system, you'll need to narrow down a bunch of other parameters before a yes/no answer is possible. How expensive is each state transition in your software coherency protocol? How often are those transitions happening for a trace of an app you're concerned about? If transitions are cheap enough or lines stay resident long enough, then sure, it could help... but that depends on the implementation, the underlying architecture and the behavior of the app itself.
Here's a group experimenting with some related performance issues: http://www.lfbs.rwth-aachen.de/content/17.html. You might also find some interesting work done relating to the Cell BE architecture used in the Playstation 3, for example: http://researcher.ibm.com/files/us-alexe/paper-gonzalez-pact08.pdf.

Difference between memory allocation and paging in modern operating systems

I've been doing research on operating systems lately, particularly regarding memory management. However, I'm not sure what the difference is between memory management schemes like those found at http://en.wikipedia.org/wiki/Memory_management such as memory pools or the buddy system, and components of virtual memory, such as paging. Do they both accomplish the same thing or different things? How are they typically implemented in modern operating systems?
They are complementary. Memory management generally refers to how virtual address space is allocated to hold objects in a program. The goal is to reduce fragmentation.
Virtual memory is a system that allows processes to believe they have more memory than actually exists, allows processes to share parts of their memory without worrying about protecting the rest, and so on. The OS's job here is to decide which pages should be backed by physical memory and how to swap out the ones that aren't in use.
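A toy example may help show how the two layers fit together: the virtual memory system hands out whole pages (here via mmap), and a user-level memory manager, a crude bump-pointer pool in this sketch, carves objects out of them. This is only an illustration of mine, not how any particular OS allocator works.

```c
/* Sketch of the two layers: mmap() asks the virtual memory system for
 * whole pages; the tiny bump allocator below then carves fixed-size
 * objects out of that region (a crude "memory pool"). */
#include <stdio.h>
#include <sys/mman.h>

#define POOL_BYTES (1 << 20)       /* 1 MiB of page-backed memory */

static char  *pool;
static size_t used;

static void *pool_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;    /* keep 16-byte alignment */
    if (used + n > POOL_BYTES) return NULL;
    void *p = pool + used;
    used += n;
    return p;                      /* no per-object free: the whole pool is released at once */
}

int main(void) {
    /* Layer 1: virtual memory -- the kernel maps anonymous, zeroed pages. */
    pool = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    /* Layer 2: memory management -- carve objects out of those pages. */
    int    *a = pool_alloc(100 * sizeof(int));
    double *b = pool_alloc(50 * sizeof(double));
    printf("a=%p b=%p\n", (void *)a, (void *)b);

    munmap(pool, POOL_BYTES);      /* give the pages back to the kernel */
    return 0;
}
```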

What's the difference between parallel and multicore programming?

I think the topic says it all. What's the difference, if any, between parallel and multicore programming? Thanks.
Multi-core is a kind of parallel programming. In particular, it is a kind of MIMD setup where the processing units aren't distributed but share a common memory area, and can even share data like a MISD setup if need be. I believe it is even distinct from multi-processing, in that a multi-core setup can share some level of cache, and thus cooperate more efficiently than CPUs in separate sockets.
General parallel programming would also include SIMD systems (like your GPU) and distributed systems.
The difference isn't in the approach, just in the hardware the software runs on. Parallel programming is taking a problem and splitting the workload into smaller pieces that can be processed in parallel (divide-and-conquer type problems, etc.), or into functions that can run independently of each other. Place that software on multi-core hardware and the OS will schedule it across the different cores. This gives better performance because each thread you create to do concurrent work can now run on its own core instead of competing for cycles on a single processor/core.
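As a concrete (and deliberately simple) illustration of splitting the workload, here is a pthreads sketch that divides an array sum into one chunk per thread; the thread count and array size are arbitrary. Build with -pthread.

```c
/* Split-the-work sketch: the array is divided into one chunk per
 * thread; each thread sums its chunk and the partial results are
 * combined at the end. */
#include <pthread.h>
#include <stdio.h>

#define N        (1 << 20)
#define NTHREADS 4

static double data[N];

struct chunk { size_t lo, hi; double sum; };

static void *sum_chunk(void *arg) {
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)
        s += data[i];
    c->sum = s;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    double total = 0.0;

    for (size_t i = 0; i < N; i++)
        data[i] = 1.0;                         /* expected total: N */

    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].lo = (size_t)t * N / NTHREADS;
        chunks[t].hi = (size_t)(t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, sum_chunk, &chunks[t]);
    }
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].sum;                /* combine partial sums */
    }
    printf("total = %.0f (expected %d)\n", total, N);
    return 0;
}
```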
Multicore systems are a subset of parallel systems. Different systems will have different memory architectures, each with their own set of challenges. How does one system deal with cache coherency? Is NUMA involved, etc. etc.

Resources