Say I have a hyper-threaded processor, and the OS sees it as two virtual processors, vp1 and vp2. Now, in an LWP1 on vp1, I disable hardware interrupts. Doesn't that amount to saying that neither virtual processor will get any interrupts until they are re-enabled? And if that is true, shouldn't enabling interrupts again from another LWP2 on the other virtual processor, vp2, re-enable interrupts on vp1 as well? I am assuming that disabling interrupts from the kernel only disables them on the local processor.
Please explain how this works.
The two logical cores of a hyper-threaded processor have their own APIC IDs, so as far as interrupts are concerned they are separate CPUs. (On Knights Landing / Xeon Phi, there are four logical cores per physical core.)
This makes it possible to disable interrupts on any one logical core independently of the others, using cli / sti in kernel mode. Everything works exactly as on a non-SMT multi-core system, because that's how hyperthreading is designed to work.

Anything else would be inconvenient and weird, and would increase latency for the other logical core by sometimes disabling the interrupts it was waiting for, so it's pretty clear this is the sane way for Intel to have designed it.
Related
I have a textbook statement that says disabling interrupts is not recommended in a multi-processor system because it would take too much time. But I don't understand this; can anyone show me how a multi-processor system would go about disabling interrupts? Thanks
On x86 (and other architectures, AFAIK), enabling/disabling interrupts is done on a per-core basis. You can't globally disable interrupts on all cores.
Software can communicate between cores with inter-processor interrupts (IPIs) or atomic shared variables, but even so it would be massively expensive to arrange for all cores to sit in a spin-loop waiting for a notification from this core that they can re-enable interrupts. (Interrupts are disabled on the other cores, so you can't send them an IPI to let them know when you're done with your block of atomic operations.) You have to interrupt whatever all 7 other cores (e.g. on an 8-way SMP system) are doing, with many cycles of round-trip communication overhead.
It's basically ridiculous. It would be clearer to just say you can't globally disable interrupts across all cores, and that it wouldn't help anyway for anything other than interrupt handlers. It's theoretically possible, but it's not just "slow", it's impractical.
Disabling interrupts on one core doesn't make something atomic if other threads are running on other cores. Disabling interrupts works on uniprocessor machines because it makes a context-switch impossible. (Or it makes it impossible for the same interrupt handler to interrupt itself.)
But I think my confusion is that the difference between 1 core and 8 cores doesn't seem like a big number to me; why is disabling interrupts on all of them so time-consuming?
Anything other than uniprocessor is a fundamental qualitative difference, not quantitative. Even a dual-core system, like early multi-socket x86 and the first dual-core-in-one-socket x86 systems, completely changes your approach to atomicity. You need to actually take a lock or something instead of just disabling interrupts. (Early Linux, for example, had a "big kernel lock" that a lot of things depended on, before it had fine-grained locking for separate things that didn't conflict with each other.)
The fundamental difference is that on a UP system, only interrupts on the current CPU can cause things to happen asynchronously to what the current code is doing. (Or DMA from devices...)
On an SMP system, other cores can be doing their own thing simultaneously.
For multithreading, getting atomicity for a block of instructions by disabling interrupts on the current CPU is completely ineffective; threads could be running on other CPUs.
For atomicity of something in an interrupt handler, if this IRQ is set up to only ever interrupt this core, disabling interrupts on this core will work, because there's no threat of interference from other cores.
Some CPUs have bugs at the architecture level (such as these), and it's possible that some programs developed for these CPUs also have bugs that are compensated for by the CPU's own ones. If so, such a program wouldn't work on a 'perfect' emulator. Do PC emulators include these bugs? For example, Bochs is known to be pretty accurate; does it handle them 'properly', as a real CPU would?
Such emulators exist. The CPU design process requires extremely accurate emulators with a precise microarchitectural model. CPU designers need them for debugging and for estimating the theoretical performance of a future chip; their managers can also calm down investors before the chip is ready by showing some of the expected functionality. Such emulators are strictly confidential.

Also, the RTL freeze at CPU design closure gives birth to a lot of errata, or chains of errata. To simplify future chip bring-up, firmware developers may maintain special tools that emulate the functional behavior of the expected CPU, with all known errata implemented. But these are also proprietary.
But really, it is necessary to understand what is meant by the words "emulator" and "accurate" in this case.

Bochs, like QEMU, is a functional ISA model; its purpose is to provide a workable architectural profile for running binaries of the target ISA, and emulation speed is a primary goal. No microarchitecture is modeled: no pipeline, no cache model, no performance monitors, and so on.

To understand the kind of accuracy Bochs offers, look at its implementation of the CPUID performance-monitor and cache-topology leaves:

cpu/cpuid.cc

When you specify some CPU in Bochs, Skylake for example, Bochs knows nothing about it except the CPUID values that belong to that CPU; in other words, its feature sets: AVX2, FMA, XSAVE, etc.

Bochs also does not implement precise model/family CPUID values: look at the implementation of the CPUID version-information leaf (grep for the get_cpu_version_information function): it is a hardcoded value.

So there are no CPU errata in Bochs.
The state of the system at any given time consists of the processes in execution, right? So say at the moment there are 4 userspace programs running on the processors. Now, after each time slice, I assume control has to pass to the scheduler so that the appropriate process can be scheduled next. What initiates this transfer of control? It seems to me that there has to be some kind of special timer/register in hardware that keeps count of the time taken by the current process, since the process itself has no mechanism to keep track of how long it has been executing. Is my intuition right?
First of all, this answer concerns the x86 architecture only.
There are different kinds of schedulers: preemptive and non-preemptive (cooperative).
Preemptive schedulers preempt the execution of a process, that is, initiate a context switch using a TSS (Task State Segment), which then performs a jump to another process. The process is stopped and another one is started.
Cooperative schedulers do not stop processes. They rely on the processes themselves, which give up the CPU in favor of the scheduler (also called "yielding"), similar to user-level threads without kernel support.
Preemption can come about in two ways: as the result of some I/O-bound action, or through a regular timer interrupt while the CPU is busy computing.
Imagine you sent some instructions to the FPU. It takes some time until it's finished. Instead of sitting around idly, you could do something else while the FPU performs its calculations! So, as the result of an I/O operation, the scheduler switches to another process, possibly resuming with the preempted process right after the FPU is done.
However, regular preemption, as required by many scheduling algorithms, can only be implemented with some interruption mechanism firing at a certain frequency, independently of the process. A timer chip was deemed suitable, and with the IBM 5150 (a.k.a. the IBM PC) released in 1981, an x86 system was delivered incorporating, inter alia, an Intel 8088, an Intel 8255 PPI for the keyboard interface, the Intel 8259 PIC (Programmable Interrupt Controller), and the Intel 8253 PIT (Programmable Interval Timer).
The i8253 was connected, like a few other peripheral devices, to the i8259. About 18.2 times per second it raised IRQ 0 on the PIC, and after acknowledgment and so on, the CPU was interrupted and a handler was executed.

That very handler could contain scheduling code, which decides on the next process to execute.¹
Of course, most of us are living in the 21st century by now, and no one uses an IBM PC or one of its derivatives like the XT or AT anymore. The PIC has been replaced by the more sophisticated Intel 82093AA APIC to handle multiple processors/cores and for general improvement, but the PIT has remained the same, I think, maybe in the shape of some integrated version like the Intel AIP.
Cooperative schedulers do not need a regular interrupt and therefore no special hardware support (except maybe for hardware-supported multitasking). The process yields the CPU deliberately, and if it doesn't, you have a problem. That is the reason few OSes actually use cooperative schedulers: they pose a big security hole.
¹ Note, however, that OSes on the 8086/8088 (mostly DOS) didn't have a real scheduler. The x86 architecture only gained native hardware support for multitasking (via the TSS) with the 80286. I just wanted to stress that the IBM 5150 was the first x86 system with a timer chip (and, of course, the first PC altogether).
Systems running an OS with a preemptive scheduler (i.e. all those in common use) are, in my experience, all provided with a hardware timer interrupt that causes a driver to run and can change the set of running threads.
Such a timer interrupt is very useful for providing timeouts for system calls, sleep() functionality, and other time-related functions. It can also help share out the available CPU amongst ready threads when the system is overloaded, or when the threads running on it are CPU-intensive, so that the number of ready threads exceeds the number of cores available to run them.

It is quite possible to implement a preemptive scheduler without any hardware timer, allowing the set of running threads to be rescheduled upon software interrupts (system calls) from threads that are already running, and upon all the other interrupts raised on I/O completion by the peripheral drivers for disk, NIC, KB, mouse etc. I've never seen it done, though - the timer functionality is too useful :)
This is a two-fold question that arose from my trivial observation that I am running an SMP-enabled Linux on our ARM Cortex-A8 based SoC. The first part is about the performance (memory space/CPU time) difference between SMP and non-SMP Linux kernels on a uniprocessor system. Does any difference exist?

The second part is about the use of spinlocks. AFAIK spinlocks are no-ops on a uniprocessor. Since there is only one CPU and only one process will be running on it (at a time), there is no other process to busy-loop against. So for synchronization I just need to disable interrupts to protect my critical section. Is this understanding of mine correct?

Ignore the driver-portability factor for this discussion.
A large amount of synchronisation code in the kernel compiles away to almost nothing in uniprocessor kernels, which gives the behaviour you describe. The performance of an n-way system is definitely not n times that of a single CPU, and the shortfall gets worse as the number of CPUs grows.
You should continue to write your driver using the synchronisation mechanisms for SMP systems, safe in the knowledge that you'll get the correct single-processor behaviour when the kernel is configured for a uniprocessor.

Disabling interrupts globally is like taking a sledgehammer to a nut; often just disabling pre-emption on the current CPU is enough, which is what the spinlock does even on uniprocessor systems.
If you haven't already done so, take a look at Chapter 5 of Linux Device Drivers, 3rd Edition: there are a variety of spinlock options depending on the circumstances.
Since you have stated that you are running a Linux kernel compiled in SMP mode on a uniprocessor system, it's clear that you'll not get any benefit in terms of speed and memory.

The Linux kernel uses extensive locking for synchronization. In uniprocessor mode there may theoretically be no need for locking, but there are many cases where it is still necessary, so use locking where it is needed, though not as heavily as on SMP.
You should also know that spinlocks are implemented by a set of macros; some prevent concurrency with IRQ handlers while others do not. Spinlocks are suitable for protecting small pieces of code that are intended to run for a very short time.
As for your second question: you are trying to replace spinlocks with interrupt disabling in uniprocessor mode, but in non-preemptible UP (uniprocessor) kernels the spinlock macros already evaluate to empty macros (or, for some of them, to macros that just disable/enable interrupts). UP kernels with preemption enabled use spinlocks to disable preemption. For most purposes, preemption can be thought of as the SMP equivalent. So in UP kernels spinlocks cost you nothing, and I think it is better to keep using them.
There are basically four techniques for synchronization: 1) non-preemptability, 2) atomic operations, 3) interrupt disabling, and 4) locks.

If you do disable interrupts for synchronization, remember that interrupt disabling is used by kernel functions for implementing critical regions because of its simplicity, but this technique does not always prevent kernel control paths from interleaving. The critical section should also be short, because any communication between the CPU and I/O is blocked while a kernel control path runs inside it.

So if you need synchronization on a uniprocessor, use a semaphore.
How do GDB watchpoints work? Can similar functionality be implemented to trap byte-level accesses at defined locations?
On x86 there are CPU debug registers DR0-DR3 that hold the memory addresses to watch.
This explains how hardware breakpoints are implemented in Linux and also gives details of what processor specific features are used.
Another article on hardware breakpoints.
I believe gdb uses the MMU so that the memory pages containing watched address ranges are marked as protected; when an exception occurs for a write to a protected page, gdb handles the exception, checks whether the address of the write corresponds to a particular watchpoint, and then either resumes or drops to the gdb command prompt accordingly.
You can implement something similar for your own debugging code or test harness using mprotect, although you'll need to implement an exception handler if you want to do anything more sophisticated than just fail on a bad write.
The MMU, or an MPU on other processors (such as embedded parts), can be used to implement "hardware watchpoints"; however, some processors (e.g., many Arm implementations) have dedicated watchpoint hardware accessed via a debug port, which has some advantages over using an MMU or MPU.
If you use the MMU or MPU approach:
PRO - No special hardware is needed: application-class processors have an MMU built in to support the needs of Linux or Windows, and specialized realtime-class processors often have an MPU.

CON - There is software overhead in handling the exception. This is probably not a problem on an application-class processor (e.g., x86); however, for an embedded realtime application it could spell disaster.

CON - MMU or MPU faults may happen for other reasons, which means the handler will need to figure out exactly why it faulted by reading various status registers.

PRO - Using MMU memory-protection faults can often cover many separate variables, making it easy to watch many variables at once. However, this is not normally required in most debugging situations.
If you use dedicated debug watchpoint hardware such as supported by Arm:
PRO - There is no impact on software performance (helps if debugging subtle timing issues). The debug infrastructure is designed to be non-intrusive.
CON - There are a limited number of these hardware units on any particular piece of silicon. For Arm there may be 0, 2, or 4 of them, so you need to choose carefully what to watch. The units can cover a range of addresses, but there are limits; on some processors they may even be restricted to certain regions of memory.