How to determine cause of CPU halt - linux-kernel

I'm working on a research project that involves developing a new DVFS algorithm for Intel machines. To do this we're reusing some of the code from the intel_pstate driver to determine the P-state for each individual CPU, and using the wrmsr function to write that value to the IA32_PERF_CTL register.
The problem we're encountering is that when we run benchmarks, a CPU will randomly halt (not a kernel panic or a system hang).
Is there a way to set up a trace, or some other method, to record the value that was being written to the MSRs when the halt occurred?
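One way to capture this, as a minimal sketch assuming the writes go through the kernel's wrmsrl() helper: log the value to the ftrace ring buffer immediately before the write, so that if the write halts the CPU, the attempted value is already recorded and can be read from /sys/kernel/debug/tracing/trace on the surviving CPUs. (Recent kernels also have an msr:write_msr tracepoint that can capture MSR writes without code changes.) set_pstate() below is a hypothetical name for wherever the project performs the write.

    #include <linux/kernel.h>
    #include <linux/smp.h>
    #include <asm/msr.h>

    /* Hypothetical wrapper around the project's MSR write. */
    static void set_pstate(u64 ctl_value)
    {
            /* Log first, write second: if the write halts this CPU,
             * the attempted value is already in the trace buffer.
             * Assumes we are called with preemption disabled, as the
             * pstate-setting paths normally are. */
            trace_printk("cpu=%d IA32_PERF_CTL=0x%llx\n",
                         smp_processor_id(), ctl_value);
            wrmsrl(MSR_IA32_PERF_CTL, ctl_value);
    }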

Related

Do you need a realtime operating system in order to ensure your program is never taken off the CPU?

If I were to write a program and wanted a guarantee that, once running, it never gets kicked off the CPU until it terminates, would I need an RTOS, or is there a way to get such a guarantee on a regular Linux OS?
Example:
Let's say we are running a headless Linux machine and running a program as user or root (e.g. reading SPI data from a sensor, or listening for HTTP requests), and there is reason to believe there is almost no other interaction with the machine aside from the single standalone script running.
If I wanted to ensure that my process never gets taken off the CPU, even for a moment, so that I never miss valuable sensor data or incoming HTTP requests, does that warrant a real-time operating system?
Are the process priorities of programs run by the user / root a high enough priority to not get kicked off?
Is a real-time OS needed to guarantee our program never witnesses a moment when it is kicked off the CPU?
I know that real-time OSes are needed for guarantees on hard limits and hard deadlines for events. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
If this is in the wrong stack, let me know.
Do you need to act on sensor readings within a constant time frame? How complicated does that action need to be? If all you need is to never miss a reading and you're OK with buffering them, just add a microcontroller or an FPGA between your non-realtime device and the sensor.
Also, you can ensure some soft real-time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid using any syscalls in it: spin and poll instead, at 100% CPU utilisation, and then it's likely the kernel will never touch it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.
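A minimal userspace sketch of this pin-and-poll approach, assuming CPU 3 has been set aside for the task (e.g. via the isolcpus= boot parameter); the polled work itself is left as a placeholder comment:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                      /* pin to CPU 3 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* Lock all current and future pages in RAM to avoid paging. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* Real-time priority keeps ordinary tasks from preempting us. */
        struct sched_param sp = { .sched_priority = 99 };
        sched_setscheduler(0, SCHED_FIFO, &sp);

        for (;;) {
            /* Spin and poll here -- no syscalls in the hot loop,
             * e.g. read a memory-mapped SPI register. */
        }
    }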

Restart a CPU that ends up unresponsive during undervolting

I'm working on a set of kernel changes that allows me to undervolt my CPU at runtime. One consequence of extreme undervolting that I'm often facing is that the CPU becomes completely unresponsive.
I've tried using the functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
Is there any way to recover the CPU from this state? Does the kernel have any routines that can bring back a CPU from this unresponsive state?
First, to benefit from undervolting, it's important that you reduce the voltage by small amounts each time (such as 5-10 mV). After each step of reduction, you should check one or more hardware error metrics (typically the CPU cache error rate). Generally, the error rate should increase gradually as the voltage is slowly decreased. However, at some point, an error will occur that cannot be corrected through ECC (or whatever hardware correction mechanism the processor uses). This is when execution becomes unreliable. Linux responds to such errors by panicking (the system will either automatically reboot or just hang).
You may still have a chance to detect the error and choose to continue execution, but correctness is no longer guaranteed, even if you immediately raise the voltage back. That would be a very, very dangerous thing to do; it can get very nasty very quickly. An error might occur while you're handling another error (perhaps caused by the very code that is handling the first error), so the safest thing to do is to abort (see Peter's comment).
Modern processors offer mechanisms to profile and handle correctable and uncorrectable hardware errors. In particular, x86 offers the Machine Check Architecture (MCA). By default, in Linux, when an uncorrectable machine check occurs, the machine check exception handler is invoked, which may abort the system (although it will try to see if it can safely recover somehow). You cannot handle that in user mode without using additional tools.
Here are the different x86 MCE tolerance levels supported by Linux:
    struct mca_config mca_cfg __read_mostly = {
            .bootlog = -1,
            /*
             * Tolerant levels:
             * 0: always panic on uncorrected errors, log corrected errors
             * 1: panic or SIGBUS on uncorrected errors, log corrected errors
             * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
             * 3: never panic or SIGBUS, log all errors (for testing only)
             */
            .tolerant = 1,
            .monarch_timeout = -1
    };
Note that the default tolerant value is 1. But since you are modifying the kernel anyway, you can change the way Linux handles MCEs, either by changing the tolerant level or by changing the handling code itself. The machine_check_poll and do_machine_check functions are good places to start.
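For experimenting without rebuilding the kernel, older kernels also expose the tolerance level per CPU through sysfs. A hedged sketch, assuming the /sys/devices/system/machinecheck/machinecheck0/tolerant file is still present on your kernel (the knob was removed in newer releases):

    #include <stdio.h>

    int main(void)
    {
        /* Path exists on older kernels only; check your version. */
        FILE *f = fopen("/sys/devices/system/machinecheck/machinecheck0/tolerant", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs("2\n", f);  /* prefer SIGBUS/logging over panic */
        fclose(f);
        return 0;
    }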
User-mode tools that may enable you to profile and potentially respond to machine checks include mcelog and mcedaemon. MCA is discussed in Volume 3, Chapters 15 and 16, of the Intel manual. On ARM, you can also profile cache ECC errors, as discussed here.
It is very important to understand that different cores of the same chip may behave differently when the voltage is reduced below the nominal value. This is due to process variation. So don't assume that a voltage reduction that works on one core will work across cores of the same chip, or across chips. You're going to have to test every core of every chip (in case you have multiple sockets).
I've tried using functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
These functions are part of the Hotplug CPU infrastructure. Not really useful here.
The answer is CPU dependent. My answer is limited to x86_64 and s390:
Extreme undervolting is essentially unplugging the CPU. To be able to bring it back up, make sure CONFIG_HOTPLUG_CPU=y is configured.
Also, depending on the kernel version you are using, you may have different teardown or setup options readily available. If you are using 4.x, have a look at the cpuhp_* routines in <linux/cpuhotplug.h>; in particular, cpuhp_setup_state_multi may be the one you can use to set things up. If in doubt, look at cpuhp_setup_state_nocalls as well as __cpuhp_setup_state. Hopefully this helps :-)
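A hedged sketch of that 4.x-era API, registering dynamic hotplug callbacks so the kernel invokes our hooks whenever a CPU comes up or goes down; undervolt_online/undervolt_offline are hypothetical names for this project's per-CPU voltage handling:

    #include <linux/cpu.h>
    #include <linux/cpuhotplug.h>
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/printk.h>

    static int undervolt_online(unsigned int cpu)
    {
            pr_info("undervolt: cpu %u back online, restoring a safe voltage\n", cpu);
            return 0;
    }

    static int undervolt_offline(unsigned int cpu)
    {
            pr_info("undervolt: cpu %u going down\n", cpu);
            return 0;
    }

    static int __init undervolt_init(void)
    {
            /* CPUHP_AP_ONLINE_DYN asks the core for a dynamic state
             * slot; the return value (>= 0) identifies it for removal. */
            int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
                                        "undervolt:online",
                                        undervolt_online,
                                        undervolt_offline);
            return ret < 0 ? ret : 0;
    }

    module_init(undervolt_init);
    MODULE_LICENSE("GPL");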

Detecting AES-NI CPU instructions

Recent Intel and AMD CPUs support specific AES instructions that increase the performance of encryption and decryption.
Is it possible to detect when these instructions are called? For example, by writing a kernel module that monitors the instructions that are sent to the CPU? Or is the kernel still too high-level?
My understanding is that instructions like AESENC require no special privileges, so you won't be able to trap them with one of the usual fault handlers even in the kernel. You might be able to do it with a modified version of QEMU or VirtualBox, but that would be a bit of a pain.
I'm guessing you are trying to see if a software package supports AES-NI?
The currently accepted answer is in some sense technically correct. It does not, however, answer your question.
Theoretically, it would be possible to monitor when these instructions are used by any process. This would however create an infeasible amount of overhead: you would basically have to do a conditional branch every time any instruction is executed (!). It can be achieved, for example, with an in-target probe, but this is a very costly and non-scalable solution.
A more realistic but less precise solution is program counter sampling. You can check which instruction is being executed at a given moment by looking at a process's program counter, which can indeed be accessed from a kernel module. If you sample at a high enough resolution, you get a decent indicator of whether or not AES-NI instructions are being used.
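A hedged sketch of the kernel-module side of this, assuming an x86-64 kernel; sample_pc() is a hypothetical helper, and mapping the sampled address back to an AESENC site in the binary is left out:

    #include <linux/pid.h>
    #include <linux/printk.h>
    #include <linux/rcupdate.h>
    #include <linux/sched.h>
    #include <asm/processor.h>

    /* Sample the saved user-space instruction pointer of a task. */
    static void sample_pc(pid_t nr)
    {
            struct task_struct *task;

            rcu_read_lock();
            task = pid_task(find_vpid(nr), PIDTYPE_PID);
            if (task) {
                    /* task_pt_regs() gives the registers saved on
                     * kernel entry, i.e. the task's user-mode state;
                     * it is stale for a task running on another CPU. */
                    struct pt_regs *regs = task_pt_regs(task);
                    pr_info("pid %d: pc = 0x%lx\n", nr, regs->ip);
            }
            rcu_read_unlock();
    }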

cpu_idle_loop vs halt/wfe/sevl instructions

Whenever a CPU is idle, it executes cpu_idle_loop.
I am curious about the advantages of this loop compared to the halt instruction [x86] or the wfe/wfi instructions on ARM.
Are there any power consumption advantages?
wfe / wfi are just instructions that can put the core into a low-power mode, but they can't affect the clocks feeding the core, etc. If a core is still powered at that point, leakage will still be there, which matters a lot in battery-powered devices.
In a function like cpu_idle_loop, you have much more control over the core's power, since you know what affects what: you can flush caches and reduce the power they use, and you can even cut power to the core entirely, removing leakage or reducing it to the minimum possible. In a multicore system, the last core going idle can power the platform / board down into an even more power-preserving state.
wfe / wfi are good for keeping a core from wasting power while it waits, which also helps with heat dissipation. They are a must for implementing mutexes / semaphores. But an SoC consists of many elements these days, and the kernel can inform the hardware when most of it is not needed, rather than just idling a single core efficiently.
On top of the power advantage pointed out by other users, I would like to point out another, less noticed advantage of using WFI. Consider the case where our kernel runs as a virtual machine on top of a host operating system. The host OS will have marked WFI instructions as trapping. When a guest OS executes a WFI instruction, control is immediately transferred (trapped) to the host OS, which allows the host to efficiently schedule the other OSes in its ready queue. If the guest OS used a busy idle loop instead of WFI, the time slice allotted to the guest would have to expire before the host could schedule in another guest, which wastes CPU cycles.

How do I figure out whether my process is CPU bound, I/O bound, Memory bound or

I'm trying to reduce the time taken to compile my application, and one thing I'm investigating is what resources, if any, I can add to the build machine to speed things up. To this end, how do I figure out whether I should invest in more CPU, more RAM, a better hard disk, or whether the process is bound by some other resource? I already saw this (How to check if app is cpu-bound or memory-bound?) and am looking for more tips and pointers.
What I've tried so far:
Time the process on the build machine vs. on my local machine. I found that the build machine takes twice as long as my machine.
Run "Resource Monitor" and look at the CPU usage, Memory usage and Disk usage while the process is running - while doing this, I have trouble interpreting the numbers, mainly because I don't understand what each column means and how that translates to a Virtual Machine vs. a physical box and what it means with multi-CPU boxes.
Start > Run > perfmon.exe
Performance Monitor can graph many system metrics that you can use to deduce where the bottlenecks are, including CPU load, I/O operations, pagefile hits and so on.
Additionally, the Platform SDK now includes a tool called XPerf that can provide information more relevant to developers.
Random-pausing will tell you what your percentage split is between CPU and I/O time.
Basically, if you grab 10 random stackshots and 80% (for example) of the time is in I/O, then on 8 ± 1.3 of the samples the stack will reach into the system routine that reads or writes a buffer.
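(The ± 1.3 is just the binomial standard deviation: with n = 10 samples, each hitting I/O with probability p = 0.8, the expected count is np = 8 and the spread is sqrt(np(1 − p)) = sqrt(10 × 0.8 × 0.2) ≈ 1.26.)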
If you want higher precision, take more samples.
