Catching source of NMI on x86 Intel Centerton - linux-kernel

I am dealing with a situation on NetBSD where an NMI has put my box into DDB.
I understand that an NMI could be due to some memory-related problem. I guess devices which are memory mapped could also lead me into the same scenario. Please correct me on this.
My understanding is that I need to read the status of all these devices, probably over PCI.
I do not know the what and how of any of it.
On receiving an NMI, a trap is generated which puts NetBSD into the DDB debugger. It is difficult to gain anything from DDB there. My plan is to return from the trap without doing anything, so that the error will cause a kernel core dump. Also, before returning from the trap, I want to read the required registers/memory to dump the status of the devices involved. This is my plan of action. Let me know if there is a better and more correct way to do this.
My aim is to learn from the experts here and come up with a step-by-step plan to get to the source of the NMI.

Intel describes platform-level error handling in a high-level document titled Platform-Level Error Handling Strategies for Intel Systems.
That document doesn't specifically cover the Centerton (64-bit Atom) that you mention, though it does give a good overview of how Intel approaches hardware error reporting. However, since the Centerton is a system-on-a-chip device, we can learn much more about how it works from the device datasheets. In volume one of the Intel Atom Processor S1200 datasheet we find the following text:
Internal Non-Maskable Interrupts (NMIs) can be generated by PCI Express ports and internally from the internal IOCHK# signal from the Low Pin Count interface signal LPC_SERIRQ.
We also find that there are external power-management error signal pins which can generate an NMI in Atom-based systems.
Undoubtedly, errors from the memory hardware could also be responsible for generating an NMI.
Volume 2 of the S1200 datasheet gives more detail about the many system registers involved in handling error signals.
None of this says much about NetBSD, though, and I don't think you can expect too much from NetBSD here. It doesn't have detailed enough knowledge of the many x86 systems it runs on to decode the specifics of hardware errors. It may be possible to access enough of the system registers through the NetBSD DDB in-kernel debugger, though I suspect this would be very tedious to do manually.
One avenue you might explore is whether the system BIOS is able to read and interpret the error registers; but unless your system also has a board management controller (unlikely for Atom systems, if I understand correctly), it's unlikely there's any record of system errors kept somewhere the BIOS can access.

NMI (non-maskable interrupt) is generally raised by a hardware watchdog to indicate that the CPU is hung, not by invalid memory accesses (at least on MIPS/PowerPC, where I have some knowledge). Invalid memory accesses have separate exceptions/interrupts to handle them.
One of the cases where a CPU hangs is a deadlock or some similar condition.
So taking a core dump and checking what each core was doing at the time of the NMI should be one way to go forward.

Related

What are the common causes of Prefetch Abort errors on ARM-based devices?

My C program is running on a bare-metal Raspberry Pi 3B+. It works fine except that I get random freezes, reported as a Prefetch Abort by the CPU itself. The device may work fine for hours, then suddenly crash. It does nothing special before it crashes, so the freezes are not predictable.
The fault status register (FSR) is set to 0xD when this error happens, which indicates a Permission Error: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0087e/Cihhhged.html
Other registers: FAR is 0xE80000B6, LR is 0xFFFFFFFF, PC is 0xE80000B6, PSR is 0x200001F1.
My program uses FIQ and IRQ interrupts, and uses all four CPU cores.
I'm not asking for specific debugging help here, since it would be too complicated to dive into the details, but are you aware of common causes for Prefetch Aborts?
Given that your code is multi-threaded (multi-core, indeed) and the crash is not predictable, I'd say that the prefetch abort is almost certainly being caused by memory corruption due to a race.
It might help if you can find out where the abort is being generated. Bugs like this can be extremely hard to track down though; if the code always crashes in the same place then that could help, but even if it does and you can find out which address is suffering corruption, monitoring that address for rogue writes without affecting the timing of the program (and hence the manifestation of the bug) is essentially impossible.
It is quite likely that the root cause is a buffer overrun, especially given your comments above. Really you should know in advance how big your buffers need to be, and make them that size. If the algorithm you're using can't guarantee a limit on the amount of buffer it uses, add code that performs a runtime check on the buffer and responds appropriately (perhaps a nicely reported error, so you know which buffer is overflowing). Using the heap is OK, but declaring a large buffer as static is faster and leak-free, provided the function containing the buffer is non-reentrant.
If there are data access races in the mix too, note that you will need more than data barrier instructions to solve these. The data barrier instructions only address consistency problems related to pending memory transactions. They don't prevent register caching of shared data (you need the volatile keyword for that) or simultaneous read-modify-write races (you need mutual exclusion mechanisms for that, either as provided by whatever framework you're using or home-brewed using the STREX and LDREX instructions on armv7).

Restart a CPU that ends up unresponsive during undervolting

I'm working on a set of kernel changes that allows me to undervolt my CPU at runtime. One consequence of extreme undervolting that I'm often facing is that the CPU becomes completely unresponsive.
I've tried using functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
Is there any way to recover the CPU from this state? Does the kernel have any routines that can bring back a CPU from this unresponsive state?
First, to benefit from undervolting successfully, it's important to reduce the voltage by small amounts each time (say 5-10 mV). After each step of reduction, you should check one or more hardware error metrics (typically the CPU cache error rate). Generally, the error rate increases gradually as the voltage is slowly decreased. At some point, however, an error will occur that cannot be corrected through ECC (or whatever hardware correction mechanism the processor uses); this is when execution becomes unreliable. Linux responds to such errors by panicking (the system will either reboot automatically or just hang). You may still have a chance to detect the error and choose to continue execution, but correctness is no longer guaranteed even if you immediately increase the voltage back, so that would be a very, very dangerous thing to do. It can get very nasty very quickly: an error might occur while you're handling another error (perhaps caused by the very code that is handling the first error), so the safest thing to do is to abort (see Peter's comment).
Modern processors offer mechanisms to profile and handle correctable and uncorrectable hardware errors. In particular, x86 offers the Machine Check Architecture (MCA). By default, in Linux, when an uncorrectable machine check occurs, the machine check exception handler is invoked, which may abort the system (although it will try to see if it can safely recover somehow). You cannot handle that in user mode without using additional tools.
Here are the different x86 MCE tolerance levels supported by Linux:
struct mca_config mca_cfg __read_mostly = {
        .bootlog = -1,
        /*
         * Tolerant levels:
         * 0: always panic on uncorrected errors, log corrected errors
         * 1: panic or SIGBUS on uncorrected errors, log corrected errors
         * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
         * 3: never panic or SIGBUS, log all errors (for testing only)
         */
        .tolerant = 1,
        .monarch_timeout = -1
};
Note that the default tolerant value is 1. But since you are modifying the kernel anyway, you can change the way Linux handles MCEs, either by changing the tolerant level or by changing the handling code itself. The machine_check_poll and do_machine_check functions are good places to start.
User-mode tools that may enable you to profile and potentially respond to machine checks include mcelog and mcedaemon. MCA is discussed in Volume 3, Chapters 15 and 16, of the Intel manual. For ARM, you can also profile cache ECC errors, as discussed here.
It is very important to understand that different cores of the same chip may behave differently when the voltage is reduced below the nominal value. This is due to process variation. So don't assume that a voltage reduction that works on one core will work across all cores of the same chip, or across chips. You're going to have to test every core of every chip (in case you have multiple sockets).
I've tried using functions cpu_up and cpu_down in the hope of asking
the kernel to restore the CPU, but to no avail.
These functions are part of the CPU hotplug infrastructure and are not really useful here.
The answer is CPU dependent. My answer is limited to x86_64 and s390:
Extreme undervolting is essentially unplugging the CPU. To be able to bring it back up, you have to make sure the kernel is configured with CONFIG_HOTPLUG_CPU=y.
Also, depending on the kernel version you are using, you may have different teardown or setup options readily available. If you are on 4.x, have a look at the cpuhp_* routines in <linux/cpuhotplug.h>; in particular, cpuhp_setup_state_multi may be the one you can use to set things up. If in doubt, look at cpuhp_setup_state_nocalls as well as __cpuhp_setup_state. Hopefully this helps :-)

Questions about supervisor mode

Reading about operating systems from multiple resources has left me confused about supervisor mode. For example, from Wikipedia:
In kernel mode, the CPU may perform any operation allowed by its architecture ...
In the other CPU modes, certain restrictions on CPU operations are enforced by the hardware. Typically, certain instructions are not permitted (especially those, including I/O operations, that could alter the global state of the machine), and some memory areas cannot be accessed.
Does this mean that instructions such as LOAD and STORE are prohibited, or does it mean something else?
I am asking because on a pure RISC processor, the only instructions that should access I/O/memory are LOAD and STORE. A simple program that evaluates an arithmetic expression would thus need supervisor mode just to read its operands.
I apologize if it's vague. If possible, can anyone explain it with an example?
I see this question was asked a few months back and should have been answered long ago.
I will try to set a few things straight before addressing the I/O part of your question.
A CPU running in "kernel mode" means that the OS has permitted the CPU to execute a few extra instructions. This is done by setting a flag at the appropriate moment. One can think of it as a digital switch that enables or disables specific operations embedded inside the processor.
In RISC machines, LOAD and STORE are generally register-related operations. In fact, from the processor's perspective, traffic to and from main memory is not really considered an I/O operation. Data transfer between main memory and the processor happens more or less automatically, by virtue of a pre-programmed page table (unless the required data is not found in main memory either, in which case a disk I/O is generally needed). Obviously, the OS programs this page table well in advance and does its bookkeeping in it.
An I/O operation generally involves external devices reachable through the interrupt controller. Whenever an I/O operation completes, the corresponding device raises an interrupt towards the processor, which causes the OS to immediately change the processor's privilege level appropriately. The processor in turn services the request raised by the interrupt. The interrupt handler is a program written by the OS developers, which may contain certain privileged instructions. This raised privilege level is sometimes referred to as "kernel mode".

Atomic operations in ARM strex and ldrex - can they work on I/O registers?

Suppose I'm modifying a few bits in a memory-mapped I/O register, and it's possible that another process or an ISR could be modifying other bits in the same register.
Can ldrex and strex be used to protect against this? I mean, they can in principle because you can ldrex, and then change the bit(s), and strex it back, and if the strex fails it means another operation may have changed the reg and you have to start again. But can the strex/ldrex mechanism be used on a non-cacheable area?
I have tried this on a Raspberry Pi, with an I/O register mapped into userspace, and the ldrex operation gives me a bus error. If I change the ldrex/strex to a simple ldr/str it works fine (but is no longer atomic...). The ldrex/strex routines do work fine on ordinary RAM. The pointer is 32-bit aligned.
So is this a limitation of the strex/ldrex mechanism, a problem with the BCM2708 implementation, or the way the kernel has set it up? (Or something else entirely; maybe I've mapped it wrong?)
Thanks for mentioning me...
You do not use ldrex/strex pairs on the resource itself, likewise for swp or test-and-set or whatever your instruction set supports (for ARM it is swp and, more recently, ldrex/strex). You use these instructions on RAM, on some RAM location agreed to by all the parties involved. The processes sharing the resource use that RAM location to fight over control of the resource; whoever wins gets to actually address the resource. You would never use swp or ldrex/strex on a peripheral itself; that makes no sense, and I could see the memory system not giving you an exclusive-okay response (EXOKAY), which is what you need to get out of an ldrex/strex retry loop.
You have two basic methods for sharing a resource (well, maybe more, but here are two). One is to use a shared memory location: each user of the shared resource fights to win control over that memory location, and when you win you then talk to the resource directly. When finished, you give up control of the shared memory location.
The other method is to have only one piece of software that is allowed to talk to the peripheral; nobody else ever talks to it directly. Anyone wishing to have something done on the peripheral asks that one owner to do it for them. It is like everyone being able to share the soft drink fountain, versus the fountain being behind the counter where only the soft drink employee may operate it. Then you need a scheme: either have folks stand in line, or have folks take a number and be called when their drink is to be filled. Along with the single owner talking to the peripheral, you need a scheme, a fifo for example, to essentially serialize the requests.
Both of these are on the honor system: you expect nobody to talk to the peripheral who is not supposed to, or who has not won the right to. If you are looking for a hardware solution to prevent folks from talking to it, use the MMU; but then you need to manage who won the lock and how the MMU gets unblocked (without falling back on the honor system) and re-blocked in a way that is itself race-free.
In situations where you have an interrupt handler and a foreground task sharing a resource, you have one or the other be the one that can touch the resource, and the other makes requests. For example, the resource might be interrupt-driven (a serial port, say): the interrupt handlers talk to the serial port hardware directly, and if the application/foreground task wants something done, it fills out a request (puts something in a fifo/buffer). The interrupt handler then checks whether there is anything in the request queue and, if so, operates on it.
Of course there is also the disable-interrupts/re-enable critical section approach, but that is scary if you want your interrupts to have some notion of timing/latency... Understand what you are doing, though, and it can be used to solve this app+ISR two-user problem.
ldrex/strex on non-cached memory space:
My extest example perhaps has more text on when you can and can't use ldrex/strex; unfortunately the ARM docs are not that good in this area. They tell you to stop using swp, which implies you should use ldrex/strex. But then switch to the hardware manual, which says you don't have to support exclusive operations on a uniprocessor system. That says two things: ldrex/strex are meant for multiprocessor systems, for sharing resources between processors; and ldrex/strex is not necessarily supported on uniprocessor systems.
Then it gets worse. ARM logic generally stops either at the edge of the processor core (the L1 cache is contained within this boundary; it is not on the AXI/AMBA bus), or, if you purchased/use the L2 cache, at the edge of that layer. Beyond that you get into chip-vendor-specific logic, and that is the logic the hardware manual is talking about when it says you don't NEED to support exclusive accesses on uniprocessor systems. So the problem is vendor-specific. And it gets worse still: ARM's L1 and L2 caches, so far as I have found, do support ldrex/strex, so if you have the caches on, ldrex/strex will work even on a system whose vendor logic does not support them. If you don't have the caches on, that is when you get into trouble on those systems (that is the extest thing I wrote).
The processors that have ldrex/strex are new enough to have a big bank of config registers accessed through coprocessor reads; buried in there is a "swp instruction supported" bit to determine if you have swp. Didn't the Cortex-M3 folks run into the situation of having no swp and no ldrex/strex?
The bug in the Linux kernel (there are many others as well, for other misunderstandings of ARM hardware and documentation) is that on a processor that supports ldrex/strex, the ldrex/strex solution is chosen without determining whether the system is multiprocessor, so you can get into an infinite ldrex/strex loop (I know of two instances of this). If you modify the Linux code so that it uses the swp solution (there is code there for either solution), then Linux will work. The reason only two people I know of have talked about this on the internet is that you have to turn off the caches to make it happen (so far as I know), and who would turn off both caches and try to run Linux? It actually takes a fair amount of work to successfully turn off the caches; modifications to Linux are required to get it to work without crashing.
No, I can't tell you the systems, and no, I do not now nor have I ever worked for ARM. This stuff is all in the ARM documentation if you know where to look and how to interpret it.
Generally, ldrex and strex need support from the memory system. You may wish to refer to some answers by dwelch, as well as his extest application. I believe you cannot do this for memory-mapped I/O; ldrex and strex are intended more for lock-free algorithms in normal memory.
Generally only one driver should be in charge of a bank of I/O registers. Software makes requests to that driver via semaphores, etc., which can be implemented with ldrex and strex in normal SDRAM. So you can interlock these I/O registers, but not in the direct sense.
Often, the I/O registers themselves support atomic access through write-one-to-clear, multiplexed access, and other schemes.
Write-one-to-clear is typically used with hardware events: code that handles an event writes only that event's bit. In this way, multiple routines can handle different bits in the same register.
Multiplexed access: often an interrupt enable/disable facility will have a register bitmap, but there may also be alternate registers to which you can write an interrupt number to enable or disable a particular interrupt. For instance, intmask may be two 32-bit registers: to enable int3, you could OR 1<<3 into intmask, or write just 3 to an intenable register. The intmask and intenable registers are hooked to the same bits in hardware.
So you can emulate an interlock with a driver, or the hardware itself may support atomic operations through normal register writes. These schemes served systems well for quite some time, before people even started to talk about lock-free and wait-free algorithms.
Like previous answers state, ldrex/strex are not intended for accessing the resource itself, but rather for implementing the synchronization primitives required to protect it.
However, I feel the need to expand a bit on the architectural bits:
ldrex/strex (pronounced load-exclusive/store-exclusive) are supported by all ARM architecture version 6 and later processors, minus the M0/M1 microcontrollers (ARMv6-M).
It is not architecturally guaranteed that load-exclusive/store-exclusive will work on memory types other than "Normal" - so any clever usage of them on peripherals would not be portable.
The SWP instruction isn't recommended against on a whim: its very nature is counterproductive in a multi-core system. It was deprecated in ARMv6, is "optional" to implement in certain ARMv7-A revisions, and on most ARMv7-A processors it must be explicitly enabled in the cp15 SCTLR. Linux by default does not enable it, and instead emulates the operation through the undef handler using ... load-exclusive and store-exclusive (what dwelch refers to above). So please don't recommend SWP as a valid alternative if you expect code to be portable across ARMv7-A platforms.
Synchronization with bus masters not in the inner-shareable domain (your cache-coherency island, as it were) requires additional external hardware - referred to as a global monitor - in order to track which masters have requested exclusive access to which regions.
The "not required on uniprocessor systems" bit sounds like ARM terminology getting in the way: a quad-core Cortex-A15 is considered one processor. So testing for "uniprocessor" in Linux would not make one iota of difference; the architecture and interconnect specifications remain the same regardless, and SWP is still optional and may not be present at all.
Cortex-M3 supports ldrex/strex, but its interconnect (AHB-Lite) does not support propagating them, so it cannot use them to synchronize with external masters. It does not support SWP, which was never introduced in the Thumb instruction set and which its interconnect would likewise be unable to propagate.
If the chip in question has a toggle register (one which is essentially XORed with the output latch when written to), there is a workaround:
load port latch
mask off unrelated bits
xor with desired output
write to toggle register
As long as two processes do not modify the same pins (as opposed to "the same port"), there is no race condition.
In the case of the BCM2708 you could choose an output pin whose neighbors are either unused or never changed, and write to GPFSELn in byte mode. This will, however, only ensure that you do not corrupt others; if others write in 32-bit mode and you interrupt them, they can still corrupt you. So it's something of a hack.
Hope this helps

Hardware watchpoints - how do they work?

How do GDB watchpoints work? Can similar functionality be implemented to monitor byte-level access at defined locations?
On x86 there are CPU debug registers DR0-DR3 that track memory addresses.
This explains how hardware breakpoints are implemented in Linux and also gives details of what processor specific features are used.
Another article on hardware breakpoints.
I believe gdb uses the MMU so that the memory pages containing watched address ranges are marked as protected; when an exception occurs for a write to a protected page, gdb handles the exception, checks whether the address of the write corresponds to a particular watchpoint, and then either resumes or drops to the gdb command prompt accordingly.
You can implement something similar for your own debugging code or test harness using mprotect, although you'll need to implement an exception handler if you want to do anything more sophisticated than just failing on a bad write.
The MMU, or an MPU on other (e.g., embedded) processors, can be used to implement "hardware watchpoints"; however, some processors (e.g., many ARM implementations) have dedicated watchpoint hardware accessed via a debug port, which has some advantages over the MMU/MPU approach.
If you use the MMU or MPU approach:
PRO - No special hardware is needed on application-class processors, since an MMU is built in to support the needs of Linux or Windows. Specialized realtime-class processors often have an MPU.
CON - There is software overhead in handling the exception. This is probably not a problem on an application-class processor (e.g., x86); for an embedded realtime application, however, it could spell disaster.
CON - MMU or MPU faults may happen for other reasons, which means the handler will need to figure out exactly why it faulted by reading various status registers.
PRO - MMU memory-protection faults can often cover many separate variables, so many variables can be watched easily; however, this is not normally required in most debugging situations.
If you use dedicated debug watchpoint hardware such as supported by Arm:
PRO - There is no impact on software performance (which helps when debugging subtle timing issues). The debug infrastructure is designed to be non-intrusive.
CON - There are a limited number of these hardware units on any particular piece of silicon; for ARM, there may be 0, 2, or 4 of them, so you need to choose carefully. The units can cover a range of addresses, but there are limits; on some processors they may even be restricted to certain regions of memory.
