Restart a CPU that ends up unresponsive during undervolting - linux-kernel

I'm working on a set of kernel changes that allow me to undervolt my CPU at runtime. One consequence of extreme undervolting that I often face is that the CPU becomes completely unresponsive.
I've tried using the functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
Is there any way to recover the CPU from this state? Does the kernel have any routines that can bring back a CPU from this unresponsive state?

First, to benefit from undervolting safely, it's important to reduce the voltage by small amounts each time (say 5-10 mV). After each reduction step, check one or more hardware error metrics (typically the CPU cache error rate). What generally happens is that the error rate increases gradually as the voltage is slowly decreased. At some point, however, an error will occur that cannot be corrected through ECC (or whatever hardware correction mechanism the processor uses). This is when execution becomes unreliable. Linux responds to such errors by panicking (the system will either automatically reboot or just hang). You may still have a chance to detect the error and choose to continue execution, but correctness is no longer guaranteed, even if you immediately raise the voltage back up. That would be a very, very dangerous thing to do, and it can get very nasty very quickly: an error might occur while you're handling another error (perhaps caused by the very code that is handling the first error), so the safest thing to do is to abort (see Peter's comment).
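As a rough illustration of that stepwise procedure, here is a hedged user-space sketch. The voltage-setting call is a hypothetical placeholder (undervolting interfaces are platform-specific), and the EDAC sysfs node it polls for corrected-error counts exists on many, but not all, systems:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical placeholder: hook up your platform-specific mechanism here. */
static void lower_voltage_5mv(void)
{
	/* e.g. an ioctl or sysfs write exposed by your kernel patch */
}

static long read_ce_count(void)
{
	long count = -1;
	FILE *f = fopen("/sys/devices/system/edac/mc/mc0/ce_count", "r");
	if (f) {
		fscanf(f, "%ld", &count);
		fclose(f);
	}
	return count;	/* corrected (ECC-fixed) errors seen so far */
}

int main(void)
{
	for (;;) {
		long before = read_ce_count();
		lower_voltage_5mv();	/* one small (5-10 mV) step down */
		sleep(10);		/* let the system run at the new voltage */
		long after = read_ce_count();
		if (after > before) {
			printf("corrected errors rising (%ld -> %ld), stopping\n",
			       before, after);
			break;	/* the next step risks uncorrectable errors */
		}
	}
	return 0;
}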
Modern processors offer mechanisms to profile and handle correctable and uncorrectable hardware errors. In particular, x86 offers the Machine Check Architecture (MCA). By default, in Linux, when an uncorrectable machine check occurs, the machine check exception handler is invoked, which may abort the system (although it will try to see if it can safely recover somehow). You cannot handle that in user mode without using additional tools.
Here are the different x86 MCE tolerance levels supported by Linux:
struct mca_config mca_cfg __read_mostly = {
	.bootlog = -1,
	/*
	 * Tolerant levels:
	 * 0: always panic on uncorrected errors, log corrected errors
	 * 1: panic or SIGBUS on uncorrected errors, log corrected errors
	 * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
	 * 3: never panic or SIGBUS, log all errors (for testing only)
	 */
	.tolerant = 1,
	.monarch_timeout = -1
};
Note that the default tolerant value is 1. But since you are modifying the kernel anyway, you can change the way Linux handles MCEs, either by changing the tolerant level or by changing the handling code itself. The machine_check_poll and do_machine_check functions are good starting points.
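On kernels of that era the tolerant level was also exposed through sysfs, so you could raise it at runtime without rebuilding. A minimal sketch, assuming that knob is present (verify the path against Documentation/x86/x86_64/machinecheck on your tree):

#include <stdio.h>

int set_mce_tolerant(int level)
{
	/* one "machinecheckN" directory exists per CPU on these kernels */
	FILE *f = fopen("/sys/devices/system/machinecheck/machinecheck0/tolerant", "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", level);	/* 0..3, per the table above */
	return fclose(f);
}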
User-mode tools that may enable you to profile and potentially respond to machine checks include mcelog and mcedaemon. MCA is discussed in Volume 3, Chapters 15 and 16 of the Intel manual. For ARM, you can also profile cache ECC errors as discussed here.
It is very important to understand that different cores of the same chip may behave differently when the voltage is reduced below the nominal value. This is due to process variation. So don't assume that a voltage reduction that works on one core will work on the other cores of the same chip, or across chips. You're going to have to test every core of every chip (in case you have multiple sockets).
I've tried using the functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
These functions are part of the Hotplug CPU infrastructure. Not really useful here.

The answer is CPU dependent. My answer is limited to x86_64 and s390:
Extreme undervolting is essentially unplugging the CPU; to be able to bring it back up, you have to make sure CONFIG_HOTPLUG_CPU=y is set in your kernel configuration.
Also, depending on the kernel version you are using, you may have different teardown or setup options readily available. If you are using 4.x, have a look at the cpuhp_* routines in <linux/cpuhotplug.h>; in particular, cpuhp_setup_state_multi may be the one you can use to set things up. If in doubt, look at cpuhp_setup_state_nocalls as well as __cpuhp_setup_state. Hopefully this helps :-)
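For reference, a minimal sketch of the 4.x-era hotplug-state API (the callback names and the "x86/undervolt:online" string are illustrative; check <linux/cpuhotplug.h> on your tree for the exact signatures):

#include <linux/module.h>
#include <linux/cpuhotplug.h>

static int my_cpu_online(unsigned int cpu)
{
	pr_info("cpu %u came online\n", cpu);
	return 0;	/* non-zero would roll the state machine back */
}

static int my_cpu_offline(unsigned int cpu)
{
	pr_info("cpu %u going offline\n", cpu);
	return 0;
}

static int __init my_init(void)
{
	/* CPUHP_AP_ONLINE_DYN asks the core for a dynamic state slot;
	 * the callbacks then run on every CPU as it goes up or down. */
	int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/undervolt:online",
				    my_cpu_online, my_cpu_offline);
	return ret < 0 ? ret : 0;
}
module_init(my_init);
MODULE_LICENSE("GPL");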

Related

What are the common causes of Prefetch Abort errors on ARM-based devices?

My C program runs on a bare-metal Raspberry Pi 3B+. It works fine except that I get random freezes that are reported as a Prefetch Abort by the CPU itself. The device may work fine for hours and then suddenly crash. It does nothing special before it crashes, so the failure is not predictable.
The Fault Status Register (FSR) is set to 0xD when this error happens, which indicates a permission error: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0087e/Cihhhged.html
Other registers: FAR is 0xE80000B6, LR is 0xFFFFFFFF, PC is 0xE80000B6, PSR is 0x200001F1.
My program uses FIQ and IRQ interrupts, and uses all four CPU cores.
I'm not asking for specific debugging help here, since it would be too complicated to dive into the details, but are you aware of common causes for Prefetch Aborts?
Given that your code is multi-threaded (multi-core, indeed) and the crash is not predictable, I'd say that the prefetch abort is almost certainly being caused by memory corruption due to a race.
It might help if you can find out where the abort is being generated. Bugs like this can be extremely hard to track down though; if the code always crashes in the same place then that could help, but even if it does and you can find out which address is suffering corruption, monitoring that address for rogue writes without affecting the timing of the program (and hence the manifestation of the bug) is essentially impossible.
It is quite likely that the root cause is a buffer overrun, especially given your comments above. Really you should know in advance how big your buffers will need to be, and then make them that size. If whatever algorithm you're using can't guarantee a limit on the amount of buffer it uses, you should add code that performs a runtime check on the buffer and responds appropriately (perhaps a nicely reported error so you know which buffer is overflowing). Using the heap is ok but declaring a large buffer as static is faster and leak-free, providing the function containing the buffer is non-reentrant.
If there are data access races in the mix too, note that you will need more than data barrier instructions to solve them. The data barrier instructions only address consistency problems related to pending memory transactions. They don't prevent register caching of shared data (you need the volatile keyword for that) or simultaneous read-modify-write races (you need mutual exclusion mechanisms for that, either as provided by whatever framework you're using or home-brewed using the LDREX and STREX instructions on ARMv7; see the sketch below).
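To make that last point concrete, here is a minimal hedged sketch of a home-brewed spinlock for bare-metal ARMv7, using GCC builtins that compile down to LDREX/STREX loops on that architecture (no framework assumed):

#include <stdint.h>

typedef volatile uint32_t spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
	/* atomically swap in 1; this loops via LDREX/STREX on ARMv7 and
	 * has acquire semantics, so the critical section can't float
	 * above the lock acquisition */
	while (__sync_lock_test_and_set(lock, 1))
		;	/* spin while another core holds the lock */
}

static inline void spin_unlock(spinlock_t *lock)
{
	__sync_lock_release(lock);	/* release semantics: stores 0 */
}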

QueryPerformanceCounter on multi-core processor under Windows 10 behaves erratically

Under Windows, my application makes use of QueryPerformanceCounter (and QueryPerformanceFrequency) to perform "high resolution" timestamping.
Since Windows 10 (and only tested on Intel i7 processors so far), we observe erratic behaviours in the values returned by QueryPerformanceCounter.
Sometimes, the value returned by the call will jump far ahead and then back to its previous value.
It feels as if the thread has moved from one core to another and was returned a different counter value for a lapse of time (no proof, just a gut feeling).
This has never been observed under XP or 7 (no data about Vista, 8 or 8.1).
A "simple" workaround has been to enable the UsePlatformClock boot opiton using BCDEdit (which makes everything behaves wihtout a hitch).
I know about the potentially superior GetSystemTimePreciseAsFileTime but as we still support 7 this is not exactly an option unless we write totatlly different code for different OSes, which we really don't want to do.
Has such behaviour been observed/explained under Windows 10 ?
I'd need to know much more about your code, but let me highlight a few things from MSDN:
When computing deltas, the values [from QueryPerformanceCounter] should be clamped to ensure that any bugs in the timing values do not cause crashes or unstable time-related computations.
And especially this:
Set that single thread to remain on a single processor by using the Windows API SetThreadAffinityMask ... While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for multiple processors, bugs in the BIOS or drivers may result in these routines returning different values as the thread moves from one processor to another. So, it's best to keep the thread on a single processor.
Your case might be hitting one of those bugs. In short:
You should always query the timestamp from a single thread (pinning its CPU affinity so it won't change) and read that value from any other thread (just an interlocked read, no fancy synchronization needed); see the sketch below.
Clamp the calculated delta (at least to be sure it's not negative)...
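A hedged sketch of that scheme (names such as g_timestamp are illustrative, not from your code):

#include <windows.h>

static volatile LONGLONG g_timestamp;	/* latest QPC value */

static DWORD WINAPI timestamp_thread(LPVOID unused)
{
	LARGE_INTEGER now;
	/* pin to CPU 0 so QPC is only ever read from one core */
	SetThreadAffinityMask(GetCurrentThread(), 1);
	for (;;) {
		QueryPerformanceCounter(&now);
		/* atomic 64-bit publish, safe for 32-bit readers too */
		InterlockedExchange64(&g_timestamp, now.QuadPart);
		Sleep(0);	/* yield; tune the update rate as needed */
	}
	return 0;
}

/* any thread: atomic snapshot of the latest timestamp */
static LONGLONG read_timestamp(void)
{
	return InterlockedCompareExchange64(&g_timestamp, 0, 0);
}

Start timestamp_thread once at startup with CreateThread; every other thread then just calls read_timestamp().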
Notes:
QueryPerformanceCounter() uses the TSC if possible (see MSDN). The algorithm used to synchronize the TSC (if it's available, and in your case it should be) changed substantially from Windows 7 to Windows 8; however, note that:
With the advent of multi-core/hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC cannot be relied upon to provide accurate results — unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. Therefore, a program can get reliable results only by limiting itself to run on one specific CPU.
Then, even if QPC is monotonic in theory, you must always call it from the same thread to be sure of this.
Another note: if the synchronization is done by software, you may read in Intel's documentation that:
...It may be difficult for software to do this in a way that ensures that all logical processors will have the same value for the TSC at a given point in time...
Edit: if your application is multithreaded and you can't (or don't want to) set CPU affinity (especially if you need precise timestamping at the cost of having de-synchronized values between threads), then you may use GetSystemTimePreciseAsFileTime() when running on Win8 or later and fall back to timeGetTime() on Win7 (after setting the granularity to 1 ms with timeBeginPeriod(1), and assuming 1 ms resolution is enough). A very interesting read: The Windows Timestamp Project.
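A hedged sketch of that runtime fallback (link against winmm for timeGetTime; error handling elided):

#include <windows.h>
#include <mmsystem.h>	/* timeGetTime, timeBeginPeriod; link winmm.lib */

typedef VOID (WINAPI *PreciseTimeFn)(LPFILETIME);

static PreciseTimeFn precise_time;	/* stays NULL on Windows 7 */

void init_clock(void)
{
	/* GetSystemTimePreciseAsFileTime exists only on Windows 8+ */
	precise_time = (PreciseTimeFn)GetProcAddress(
	    GetModuleHandleW(L"kernel32.dll"),
	    "GetSystemTimePreciseAsFileTime");
	if (!precise_time)
		timeBeginPeriod(1);	/* Win7 path: 1 ms timer granularity */
}

ULONGLONG now_100ns(void)
{
	if (precise_time) {
		FILETIME ft;
		precise_time(&ft);
		return ((ULONGLONG)ft.dwHighDateTime << 32) | ft.dwLowDateTime;
	}
	/* timeGetTime() returns milliseconds; scale to 100 ns units */
	return (ULONGLONG)timeGetTime() * 10000ULL;
}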
Edit 2: directly suggested by the OP! This, when applicable (it's a system-wide setting, not local to your application), might be an easy workaround: you can force QPC to use the HPET instead of the TSC with bcdedit (bcdedit /set useplatformclock true; see MSDN). Latency and resolution should be worse, but it's intrinsically safe from the issues described above.

Catching source of NMI on x86 Intel Centerton

I am dealing with a situation on NetBSD where an NMI has put my box into DDB.
I understand that an NMI could be due to some memory-related problem. I guess memory-mapped devices could also lead me into the same scenario. Please correct me on this.
My understanding is that I need to read the status of all these devices, probably over PCI.
I do not know the what and how of any of it.
On receiving an NMI, a trap is generated which puts NetBSD into the DDB debugger. It is difficult to learn much from DDB there. My plan is to return from the trap without doing anything, so that the error causes a kernel core dump. Also, before returning from the trap, I want to read the relevant registers/memory to dump the status of the devices involved. That is my plan of action; let me know if there is a better, more correct way to do this.
My aim is to understand from experts here and come up with a step-by-step plan to get to the source of NMI.
Intel describes platform-level error handling in a high-level document titled Platform-Level Error Handling Strategies for Intel Systems.
That document doesn't specifically cover the Centerton (64-bit Atom) that you mention (though it does give a good overview of how Intel thinks about hardware error reporting). However, since the Centerton is a system-on-a-chip device, we can find out much more about how it works from the device datasheets. In volume one of the Intel Atom Processor S1200 datasheet we find the following text:
Internal Non-Maskable Interrupts (NMIs) can be generated by PCI Express ports and internally from the internal IOCHK# signal from the Low Pin Count interface signal LPC_SERIRQ.
We also find that there are external power management error signal pins which can generate a NMI in Atom based systems.
Undoubtedly, errors from the memory hardware could also be responsible for generating an NMI.
Volume 2 of the S1200 datasheet gives more detail about the many system registers involved in handling error signals.
None of this says much about NetBSD, though. I don't think you can expect too much from NetBSD here: it doesn't have detailed enough knowledge of the many x86 systems it runs on to decode the specifics of hardware errors. It may be possible to access enough of the system registers through the NetBSD DDB in-kernel debugger, though I suspect doing so manually would be very tedious.
One avenue you might explore is whether the system BIOS is able to read and interpret the error registers, but unless your system also has a board management controller (unlikely for Atom systems, if I understand correctly), it's unlikely there's any record of system errors kept anywhere the BIOS can access.
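If you do end up reading device state yourself from the kernel, here is a rough, hypothetical sketch of scanning PCI status registers for a signaled SERR# (one common NMI source). The bit mask is the standard PCI status bit 14 as it appears in the command/status dword; verify the helper names against your NetBSD tree:

#include <dev/pci/pcireg.h>
#include <dev/pci/pcivar.h>

/* status half of the command/status dword, bit 14: Signaled System Error */
#define SERR_SIGNALED	0x40000000

static void
dump_serr_sources(pci_chipset_tag_t pc, int bus, int maxdevs)
{
	int dev;

	for (dev = 0; dev < maxdevs; dev++) {
		pcitag_t tag = pci_make_tag(pc, bus, dev, 0);
		pcireg_t csr = pci_conf_read(pc, tag, PCI_COMMAND_STATUS_REG);

		if (csr == 0xffffffff)
			continue;	/* no device at this slot */
		if (csr & SERR_SIGNALED)
			printf("pci%d dev %d: SERR# asserted (csr=0x%08x)\n",
			    bus, dev, csr);
	}
}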
An NMI (Non-Maskable Interrupt) is generally raised by a hardware watchdog to indicate that the CPU is hung, not by invalid memory accesses (at least on MIPS/PowerPC, the architectures I have some knowledge of). Invalid memory accesses have separate exceptions/interrupts to handle them.
One of the cases where a CPU hangs is a deadlock or some similar condition.
So taking a core dump and checking what each core was doing at the time of the NMI is one way to go forward.

Questions about supervisor mode

Reading about operating systems from multiple resources has left me confused about supervisor mode. For example, on Wikipedia:
In kernel mode, the CPU may perform any operation allowed by its architecture ...
In the other CPU modes, certain restrictions on CPU operations are enforced by the hardware. Typically, certain instructions are not permitted (especially those—including I/O operations—that could alter the global state of the machine), some memory areas cannot be accessed
Does it mean that instructions such as LOAD and STORE are prohibited? or does it mean something else?
I am asking because, on a pure RISC processor, the only instructions that access I/O and memory are LOAD and STORE. A simple program that evaluates some arithmetic expression would thus need supervisor mode just to read its operands.
I apologize if it's vague. If possible, can anyone explain it with an example?
I see this question was asked a few months back and should have been answered long ago.
I will try to set a few things straight before talking about the I/O part of your question.
A CPU running in "kernel mode" means that the OS has permitted the CPU to execute a few extra instructions. This is done by setting a flag at an appropriate moment. You can think of it as a digital switch that enables or disables specific operations embedded inside the processor.
In RISC machines, LOAD and STORE are generally register-to-memory operations. In fact, from the processor's perspective, traffic to and from main memory is not really considered an I/O operation. Data transfer between main memory and the processor happens largely automatically, by virtue of a pre-programmed page table (unless the required data is not found in main memory either, in which case a disk I/O is generally needed). Obviously, the OS programs this page table well in advance and does its bookkeeping in it.
An I/O operation generally involves external devices that are reachable through the interrupt controller. Whenever an I/O operation completes, the corresponding device raises an interrupt towards the processor, and this causes the OS to immediately change the processor's privilege level appropriately. The processor then works through the request raised by the interrupt. The interrupt handler is a program written by the OS developers, and it may contain certain privileged instructions. This raised privilege level is sometimes referred to as "kernel mode".

Seeking articles on shared memory locking issues

I'm reviewing some code and feel suspicious of the technique being used.
In a Linux environment, there are two processes that attach multiple shared memory segments. The first process periodically loads a new set of files to be shared, and writes the shared memory id (shmid) into a location in the "master" shared memory segment. The second process continually reads this "master" location and uses the shmid to attach the other shared segments.
On a multi-CPU host, it seems to me it might be implementation-dependent what happens if one process tries to read the memory while it's being written by the other. But perhaps hardware-level bus locking prevents mangled bits on the wire? It wouldn't matter if the reading process got a very-soon-to-be-changed value; it would only matter if the read were corrupted to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat material hasn't led me to anything definitive in this area.
I strongly suspect it's not safe or sane, and what I'd really like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire"; they will come out as either ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them, and there are NO guarantees on how fast they'll be written versus how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you arrange for it, and it can be harder than expected to make your compiler, library, OS, and CPU get it right. Synchronization primitives are written to make sure it happens correctly.
Locking will make it safe, and it's not that hard to do. So just do it.
@unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library, and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have the 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. Such unaligned access is acceptable on x86, and in that case the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs. Both are still in common use and don't necessarily work the way you think.
So you actually can have a situation where "some other cpu-core picks up a word sized value 1/2 written to". Write your code as if this kind of thing is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes so that a whole word is guaranteed to go out at once. Those methods fall under the category of synchronization, and creating synchronization primitives is the kind of thing best left to the library, compiler, OS, and hardware designers, especially if you are interested in portability (which you should be, even if you never port your code).
The problem is actually worse than some people here have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be enough: writes are atomic only with respect to a single core; other cores may not observe the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location a is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit processor), the writes are not atomic, so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits. This can complicate things immensely.
Use a memory barrier (for example, the release/acquire pairing sketched below) and you'll be OK.
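For the shmid-publishing pattern from the question, here is a hedged C11 sketch of what "use a memory barrier" amounts to (the struct and field names are illustrative):

#include <stdatomic.h>

struct master_seg {
	_Atomic int shmid_slot;		/* -1 means "no segment yet" */
};

/* writer: finish filling the new segment, then publish its id */
void publish(struct master_seg *m, int new_shmid)
{
	/* release: every earlier write to the new segment becomes
	 * visible to a reader that sees this shmid */
	atomic_store_explicit(&m->shmid_slot, new_shmid,
			      memory_order_release);
}

/* reader: acquire pairs with the writer's release */
int poll_shmid(struct master_seg *m)
{
	return atomic_load_explicit(&m->shmid_slot, memory_order_acquire);
}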
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-Pentium Pro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead, the cache coherency protocols make sure the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core 2 system doesn't mean your code is correct. If you want to check portability, try your code on other SMP architectures, like PPC (an older Mac Pro or a Cell blade), an Itanium, an IBM POWER, or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits, then probably yes), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or when the values being written cannot be written atomically. You will never get a corrupted (half-written) value. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also on ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (one that is 32-bit aligned; any compiler will try to align it like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use one.)
Also, it's not clear how you can safely free the shared memory region with this design.
With a recent kernel and libc, you can put a pthreads mutex into a shared memory region (this does need a recent version with NPTL; I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable means you don't have to worry about arcane memory-barrier issues; see the sketch below.
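A hedged sketch of that approach (the struct layout is illustrative; the mutex itself must live inside the shared segment):

#include <pthread.h>

struct master_seg {
	pthread_mutex_t lock;
	int shmid_slot;
};

/* run exactly once, by whichever process creates the segment */
int master_seg_init(struct master_seg *m)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* PTHREAD_PROCESS_SHARED makes the mutex work across processes */
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	return pthread_mutex_init(&m->lock, &attr);
}

void publish_shmid(struct master_seg *m, int new_shmid)
{
	pthread_mutex_lock(&m->lock);
	m->shmid_slot = new_shmid;
	pthread_mutex_unlock(&m->lock);
}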
I can't believe you're asking this. NO, it's not necessarily safe. At the very least, it will depend on whether the compiler produces code that atomically sets the shared memory location when you assign the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed. I am not an expert on Linux, but I would consider using, for instance, a FIFO queue in the master shared memory segment, so that the OS does the locking work for you. Consumers and producers usually need queues for synchronization anyway.
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page, particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a check-then-act failure: non-locked access to shared resources can, even for atomic resources, misrepresent whether an action can safely be performed. Essentially, in the window between checking whether the resource CAN be modified and actually modifying it, the resource gets externally modified, and therefore the conclusion drawn from the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus-bandwidth-constrained systems.
Here is an article that covers this issue in some depth. It discusses some alternative scheduling algorithms which reduce the overall demand for exclusive access over the bus, increasing total throughput in some cases by over 60% compared with a naive scheduler (when considering the cost of an explicit LOCK-prefix instruction or an implicit XCHG/CMPXCHG). The paper is not the most recent work and is light on real code (dang academics), but it's worth reading and considering for this problem.
More recent CPU instruction sets provide alternatives to a plain LOCK prefix.
jeffr of FreeBSD (author of many internal kernel components) discusses MONITOR and MWAIT, two instructions added with SSE3, which in a simple test case yielded an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm: we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states, so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB:
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
	dw 090f3H	; pause
	add ecx, -1
	jne L1
	ret
end
The Art of Assembly also covers synchronization without using the LOCK prefix or XCHG. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another), then while the writing CPU is writing, you may well find a reading CPU picking up part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
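For illustration, a compare-and-swap in one line using a GCC builtin (the slot layout and the -1 sentinel are assumptions, not from the question):

#include <stdint.h>

/* atomically: if (*slot == -1) { *slot = new_shmid; return 1; } else return 0; */
int publish_if_empty(volatile int32_t *slot, int32_t new_shmid)
{
	return __sync_bool_compare_and_swap(slot, -1, new_shmid);
}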
Sounds like you need a reader-writer lock: http://en.wikipedia.org/wiki/Readers-writer_lock
The answer is: it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel; this means the user has less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From the Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are internally synchronised with the memory bus.
