How can I let threads in a kernel module communicate? I'm writing a kernel module and my architecture is going to use three threads that need to communicate. So far, my research has led me to believe the only way is using shared memory (declaring global variables) and locking mechanisms to synchronize reading/writing between the threads. There's rather scarce material on this out there.
Is there any other way I might take into consideration? What's the most common, the standard in the kernel code?
You don't say what operating system you're programming on. I'll assume Linux, which is the most common unix system.
There are several good books on Linux kernel programming. Linux Device Drivers is available online as well as on paper. Chapter 5 deals with concurrency; you can jump in directly to chapter 5 though it would be best to skim through at least chapters 1 and 3 first. Subsequent chapters have relevant sections as well (in particular wait queues are discussed in chapter 6).
The Linux kernel concurrency model is built on shared variables. There is a large range of synchronization methods: atomic integer variables, mutual exclusion locks (spinlocks for nonblocking critical sections, semaphores for blocking critical sections), reader-writer locks, condition variables, wait queues, …
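As a minimal sketch of the shared-variable approach (assuming a reasonably modern Linux kernel; the thread names, message, and one-second period are arbitrary and error handling is omitted), two kernel threads can communicate through a global variable protected by a mutex, with a wait queue so the consumer does not busy-wait:

    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/mutex.h>
    #include <linux/wait.h>
    #include <linux/delay.h>

    static DEFINE_MUTEX(data_lock);              /* protects shared_data and data_ready */
    static DECLARE_WAIT_QUEUE_HEAD(data_wq);     /* consumer sleeps here */
    static int shared_data;
    static bool data_ready;
    static struct task_struct *producer, *consumer;

    static int producer_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            mutex_lock(&data_lock);
            shared_data++;                       /* write under the lock */
            data_ready = true;
            mutex_unlock(&data_lock);
            wake_up_interruptible(&data_wq);     /* signal the consumer */
            msleep(1000);
        }
        return 0;
    }

    static int consumer_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            /* sleep until the producer signals new data (or we are asked to stop) */
            wait_event_interruptible(data_wq, data_ready || kthread_should_stop());
            mutex_lock(&data_lock);
            if (data_ready) {
                pr_info("consumer saw %d\n", shared_data);
                data_ready = false;
            }
            mutex_unlock(&data_lock);
        }
        return 0;
    }

    static int __init comm_init(void)
    {
        producer = kthread_run(producer_fn, NULL, "producer");
        consumer = kthread_run(consumer_fn, NULL, "consumer");
        return 0;
    }

    static void __exit comm_exit(void)
    {
        kthread_stop(producer);
        kthread_stop(consumer);
    }

    module_init(comm_init);
    module_exit(comm_exit);
    MODULE_LICENSE("GPL");

The same pattern works with any of the primitives above; a spinlock would replace the mutex if the critical section must not sleep.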
Related
I have a realtime Linux desktop application (written in C) that we are porting to ARM (a 4-core ARMv8 Cortex-A72 CPU). Architecturally, it has a combination of high-priority explicit pthreads (6 of them) and a couple of GCD (libdispatch) worker queues (one concurrent and another serial).
My concerns come in two areas:
I have heard that ARM does not hyperthread the way that x86 can, and therefore my 4 cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this?
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A couple of the pthreads are high-priority handlers for fairly rare events (i.e. they are sitting on a select statement); does this change the prospects much?
My bigger concern comes from the impact of GCD in this application. My understanding of the inner workings of GCD is that it is a dynamically scaled thread pool that interacts with the scheduler and will try to add more threads to suit the load. It sounds to me like this will have an almost exclusively negative impact on performance in my scenario (i.e. a system whose cores are fully consumed). Correct?
I'm not an expert on anything x86-architecture related (so hopefully someone more experienced can chime in) but here are a few high level responses to your questions.
I have heard that ARM does not hyperthread the way that x86 can [...]
Correct, hyperthreading is a proprietary Intel chip design feature. There is no analogous ARM silicon technology that I am aware of.
[...] and therefore my 4 cores will already be context switching to keep up with my 6 pthreads (and background processes). What kind of performance penalty should I expect from this? [...]
This is not necessarily the case, although it could very well happen in many scenarios. It really depends on the nature of your per-thread computations: are you just doing lots of hefty computations, or are you doing a lot of blocking/waiting on IO? Either way, this degradation will happen on both architectures and it is more of a general thread scheduling problem. In the hyperthreaded Intel world, each "physical core" is seen by the OS as two "logical cores" which share the same execution resources but have their own architectural state (register sets). The Wikipedia article states:
Each logical processor can be individually halted, interrupted or directed to execute a specified thread, independently from the other logical processor sharing the same physical core.[7]
Unlike a traditional dual-processor configuration that uses two separate physical processors, the logical processors in a hyper-threaded core share the execution resources. These resources include the execution engine, caches, and system bus interface; the sharing of resources allows two logical processors to work with each other more efficiently, and allows a logical processor to borrow resources from a stalled logical core (assuming both logical cores are associated with the same physical core). A processor stalls when it is waiting for data it has sent for so it can finish processing the present thread. The degree of benefit seen when using a hyper-threaded or multi core processor depends on the needs of the software, and how well it and the operating system are written to manage the processor efficiently.[7]
So if a few of your threads are constantly blocking on I/O, then this might be where you would see more improvement in a 6-thread application on a 4-physical-core system (for both ARM and Intel x86), since theoretically this is where hyperthreading would shine: a thread blocking on IO or on the result of another thread can "sleep" while still allowing the other thread running on the same core to do work without the full overhead of a thread switch (experts please chime in and tell me if I'm wrong here).
But 4-core ARM vs 2-core x86... assuming all else equal (which obviously is not the case; in reality, clock speeds, cache hierarchy, etc. all have a huge impact), I think it really depends on the nature of the threads. I would imagine this drop in performance could occur if you are just doing a ton of purely CPU-bound computations (i.e. the threads never need to wait on anything external to the CPU). But if you are doing a lot of blocking I/O in each thread, you might see significant speedups with up to probably 3 or 4 threads per logical core.
Another thing to keep in mind is the cache. When doing lots of CPU-bound computations, a thread switch can evict much of the cached working set, resulting in much slower memory access initially. This will happen on both architectures. It is less of a concern for I/O-bound work, though. If you are not doing a lot of blocking work, however, then the overhead of extra threads will just make things slower for the reasons above.
I have heard that I should expect these ARM context-switches to be less efficient than x86. Is that true?
A hardware context switch is a hardware context switch: you push all the registers to the stack and flip some bits to change execution state. So no, I don't believe either is "faster" in that regard. However, for a single physical core, a technique like hyperthreading makes a "context switch" in the operating-system sense (I think you mean switching between threads) much faster, since the instructions of both programs were already being executed in parallel on the same core.
I don't know anything about GCD so can't comment on that.
At the end of the day, I would say your best shot is to benchmark the application on both architectures and see where your bottlenecks are. Is it memory access? Then keeping the cache hot is a priority. I imagine that 1 thread per core would always be optimal for any scenario, if you can swing it.
Some good things to read on this matter:
https://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
https://lwn.net/Articles/250967/
Optimal number of threads per core
Thread context switch Vs. process context switch
I am trying to understand how synchronization works in the Linux kernel.
I read that semaphores can be used for mutual exclusion, but I cannot find an example of a situation where a semaphore is needed.
So why use a semaphore on a uni-processor system?
I am assuming that you are interested in locking generally, rather than semaphores opposed to mutexes (see also "Difference between Counting and Binary Semaphores"[1]). I won't give a detailed explanation of locking, just point out a couple of things.
It usually makes sense to assume that code you might write could be executed on a multi-processor system (uni-processor is increasingly rare these days). I assume because you explicitly mentioned uni-processor that you understand that case.
The Linux kernel can be built to be fully preemptive while running kernel code[2][3]. In that case threads can be interrupted and resumed at almost any point, including, for example, in the middle of writing I/O to a device. If a thread writing I/O is interrupted and another thread that accesses the same device is switched to, things will probably not work as intended (a minimal sketch of the locked version follows the references below).
[1] Difference between Counting and Binary Semaphores
[2] https://kernelnewbies.org/FAQ/Preemption
[3] https://rt.wiki.kernel.org/index.php/CONFIG_PREEMPT_RT_Patch
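As a minimal sketch of what that means in practice (write_block_to_device() is a made-up placeholder for real device I/O, and the semaphore must be initialised once with sema_init(&dev_sem, 1)), a kernel semaphore gives the needed mutual exclusion even on a single CPU:

    #include <linux/semaphore.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    static struct semaphore dev_sem;   /* initialised elsewhere: sema_init(&dev_sem, 1); */

    int safe_device_write(const char *buf, size_t len)
    {
        /* May sleep; returns -EINTR if the wait is interrupted by a signal */
        if (down_interruptible(&dev_sem))
            return -EINTR;

        /*
         * Critical section: even on a uni-processor, a preemptible kernel
         * could otherwise switch to another thread half-way through this
         * multi-step write and leave the device in an inconsistent state.
         */
        write_block_to_device(buf, len);   /* hypothetical helper */

        up(&dev_sem);
        return 0;
    }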
I have a question. I know the difference between a thread and a process in theory, but I still don't understand when we should use the former and when the latter. For example, we have a difficult task which needs to be parallelized. But in which way? Which is faster and more effective, and in what cases? Should we split our task into a few processes or into a few threads? Could you give a few examples? I know that my question may seem silly, but I'm new to the topic of parallel computing. I hope that you understand my question. Thank you in advance.
In general, there is only one main difference between processes and threads: All threads of a given process share the same virtual address space. Whereas each process has its own virtual address space.
When dealing with problems that require concurrent access to the same set of data, it is easier to use threads, because they can all directly access the same memory.
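For example, with POSIX threads both workers in this minimal, self-contained sketch update the very same counter variable; the only thing needed is a lock around the update, with no copying or message passing:

    #include <pthread.h>
    #include <stdio.h>

    static long counter;                                    /* shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);                      /* protect the shared data */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);                 /* always 2000000 */
        return 0;
    }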
Threads share memory. Processes do not.
This means that processes are somewhat more expensive to start up. It also means that threads can conveniently communicate through shared memory, and processes cannot.
However, from a coding perspective, it also means that threads are significantly more difficult to program correctly. It's very easy for threads to stomp on each other's memory in unintended ways. Processes are somewhat safer.
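For contrast, a fork()-based sketch shows the process case: the child works on a copy of the parent's data, so its change is invisible to the parent unless explicit IPC (pipes, sockets, shared memory segments, ...) is set up:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int value = 1;

        pid_t pid = fork();
        if (pid == 0) {                    /* child: works on its own copy */
            value = 42;
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        printf("parent still sees %d\n", value);   /* prints 1, not 42 */
        return 0;
    }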
Welcome to the world of concurrency!
There is no theoretical difference between threads and processes that is practical to generalize from. There are many, many different ways to implement threads, including ways that nearly mirror those of processes (e.g. Linux threads). Then there's lightweight threading, which involves the process managing the threading by itself; but there's more variation there, because you can then have either a co-operative or a semi-preemptive threading model.
For example, consider Haskell's threading model and Python's.
Haskell offers lightweight threads that introduce little runtime overhead; there are well-defined points at which threads may yield control, but this is largely hidden from the user, giving the appearance of pre-emptive multitasking. Shared state is held in specially typed variables that are treated specially by the language. Because of this, multi-threaded, even concurrent, programs can be written in a largely single-threaded way, then forked from the main process. So there, threads are an abstraction mechanism, and may even be beneficial in a single-(OS)-threaded process to model the program; moreover, this scales well to N threads, where N may be chosen dynamically. (And N Haskell threads are mapped dynamically to OS threads.)
Python allows threading, but with a huge bottleneck: the Global Interpreter Lock. Therefore, to gain serious performance benefits, one must use processes in practice. There is no feasible, performant threading model to speak of.
I came across a few articles talking about the differences between mutexes and critical sections.
One of the major differences I came across is that mutexes run in kernel mode whereas critical sections mainly run in user mode.
So if this is the case, then aren't applications which use mutexes harmful for the system in case the application crashes?
Thanks.
Use Win32 mutex handles when you need to have a lock or synchronization across threads in different processes.
Use Win32 CRITICAL_SECTIONs when you need to have a lock between threads within the same process. It's cheaper in terms of time and doesn't involve a kernel system call unless there is lock contention. Critical Section objects in Win32 can't span process boundaries anyway.
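A rough sketch of the intra-process case (plain Win32 C, error handling omitted); for the cross-process case you would instead create a named mutex with CreateMutex and use WaitForSingleObject/ReleaseMutex:

    #include <windows.h>
    #include <stdio.h>

    static CRITICAL_SECTION cs;          /* visible only inside this process */
    static long counter;

    static DWORD WINAPI worker(LPVOID arg)
    {
        for (int i = 0; i < 100000; i++) {
            EnterCriticalSection(&cs);   /* no kernel call unless contended */
            counter++;
            LeaveCriticalSection(&cs);
        }
        return 0;
    }

    int main(void)
    {
        HANDLE threads[2];

        InitializeCriticalSection(&cs);
        threads[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        threads[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        WaitForMultipleObjects(2, threads, TRUE, INFINITE);
        DeleteCriticalSection(&cs);
        printf("counter = %ld\n", counter);
        return 0;
    }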
"Harmful" is the wrong word to use. More like "Win32 mutexes are slightly more expensive that Win32 Critical Sections in terms of performance". A running app that uses mutexes instead of critical sections won't likely hurt system performance. It will just run minutely slower. But depending on how often your lock is acquired and released, the difference may not even be measurable.
I forget the perf metrics I did a long time ago. The bottom line is that EnterCriticalSection and LeaveCriticalSection APIs are on the order of 10-100x faster than the equivalent usage of WaitForSingleObject and ReleaseMutex. (on the order of 1 microsecond vs 1 millisecond).
I have a bit of research related question.
I have currently finished the implementation of a structured skeleton framework based on MPI (specifically using Open MPI 6.3). The framework is supposed to be used on a single machine.
Now I am comparing it with other previous skeleton implementations (such as scandium, fast-flow, ...).
One thing I have noticed is that the performance of my implementation is not as good as the other implementations.
I think this is because my implementation is based on MPI (thus two-sided communication that requires matching send and receive operations),
while the other implementations I am comparing with are based on shared memory. (... but I still have no good explanation for that, and it is part of my question)
There is a big difference in completion time between the two categories.
Today I was also introduced to the configuration of Open MPI for shared memory here => openmpi-sm
and here comes my question.
1st, what does it mean to configure MPI for shared memory? I mean, while MPI processes live in their own virtual memory, what does a flag like the one in the following command really do?
(I thought in MPI every communication is by explicitly passing a message, no memory is shared between processes).
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
2nd, why is the performance of MPI so much worse compared to the other skeleton implementations developed for shared memory? At least I am also running it on one single multi-core machine.
(I suppose it is because the other implementations use thread-based parallel programming, but I have no convincing explanation for that.)
any suggestion or further discussion is very welcome.
Please let me know if I have to further clarify my question.
thank you for your time!
Open MPI is very modular. It has its own component model called Modular Component Architecture (MCA). This is where the name of the --mca parameter comes from - it is used to provide runtime values to MCA parameters, exported by the different components in the MCA.
Whenever two processes in a given communicator want to talk to each other, MCA finds suitable components that are able to transmit messages from one process to the other. If both processes reside on the same node, Open MPI usually picks the shared memory BTL component, known as sm. If the processes reside on different nodes, Open MPI walks the available network interfaces and chooses the fastest one that can connect to the other node. It puts some preference on fast networks like InfiniBand (via the openib BTL component), but if your cluster doesn't have InfiniBand, TCP/IP is used as a fallback if the tcp BTL component is in the list of allowed BTLs.
By default you do not need to do anything special in order to enable shared memory communication. Just launch your program with mpiexec -np 16 ./a.out. What you have linked to is the shared memory part of the Open MPI FAQ which gives hints on what parameters of the sm BTL could be tweaked in order to get better performance. My experience with Open MPI shows that the default parameters are nearly optimal and work very well, even on exotic hardware like multilevel NUMA systems. Note that the default shared memory communication implementation copies the data twice - once from the send buffer to shared memory and once from shared memory to the receive buffer. A shortcut exists in the form of the KNEM kernel device, but you have to download it and compile it separately as it is not part of the standard Linux kernel. With KNEM support, Open MPI is able to perform "zero-copy" transfers between processes on the same node - the copy is done by the kernel device and it is a direct copy from the memory of the first process to the memory of the second process. This dramatically improves the transfer of large messages between processes that reside on the same node.
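Note that nothing in the MPI source itself changes when shared memory is used; a plain send/receive like this sketch is routed over the sm BTL transparently whenever both ranks run on the same node:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);   /* delivered via shared memory */
        }

        MPI_Finalize();
        return 0;
    }

Launched with, say, mpiexec -np 2 ./a.out, the --mca btl self,sm,tcp flag from the question only restricts which transports Open MPI may consider; it does not alter the message-passing programming model.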
Another option is to completely forget about MPI and use shared memory directly. You can use the POSIX memory management interface (see here) to create a shared memory block and have all processes operate on it directly. If data is stored in the shared memory, it could be beneficial as no copies would be made. But watch out for NUMA issues on modern multi-socket systems, where each socket has its own memory controller and accessing memory from remote sockets on the same board is slower. Process pinning/binding is also important - pass --bind-to-socket to mpiexec to have it pin each MPI process to a separate CPU core.
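A minimal sketch of that direct approach (POSIX shm_open/mmap; the segment name and size are made up, error handling is omitted, and older glibc requires linking with -lrt):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHM_NAME "/my_skeleton_shm"   /* hypothetical segment name */
    #define SHM_SIZE 4096

    int main(void)
    {
        /* The first process creates the segment; the others open the same name */
        int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, SHM_SIZE);

        char *mem = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        strcpy(mem, "visible to every process that maps this segment");
        printf("%s\n", mem);

        munmap(mem, SHM_SIZE);
        close(fd);
        shm_unlink(SHM_NAME);             /* remove the name when done */
        return 0;
    }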