What does the field irq indicate in the kvm_inj_virq event? - linux-kernel

I generated a trace on an Intel machine with trace-cmd (ftrace) while running virtual machines under the KVM/QEMU hypervisor. In the kvm_inj_virq event, the irq field is a number indicating the kind of interrupt that is being injected. I need to know the exact meaning of this number, i.e. whether it is a network, timer, disk, or some other kind of interrupt.
I have attached an image of the trace for reference.
[Image: Sample Trace]
I tried searching for the IRQ numbers and their corresponding meaning. I found the first 16 mappings here, but I need to know the rest of them.

Related

Cannot setup Cortex M7 ITM properly on STM32H7

I'm working on an STM32H753 (STM32H753I-EVAL2 board) using STM32CubeIDE and I'm trying to set up the ITM.
I've started by enabling SWV in the Debugger setting (of course I selected SWD) with Core Clock 400MHz (my CPU clock) and SWO clock 2MHz.
Then in my code I defined the following macro:
#define ITM_Port(n) (*((volatile unsigned long *)(0xE0000000 + 4*(n))))
and call this macro as follows, at the location of my code where I want to get a timestamp.
ITM_Port(20) = 0x10;
Finally, in the debug session, I enable ITM stimulus port number 20 and Timestamp, and launch the SWV Trace Log.
However I don't understand the output:
If I remove the calls to ITM_Port, the trace is empty...
I checked the registers ITM_TCR and ITM_TER and they look correct. Stimulus port 20 is indeed enabled in TER. In TCR, bits ITMENA, TSENA, SYNCENA and TXENA are set.
I looked at the assembly level (it looks correct) and I noticed that the store instruction that is supposed to write 0x10 into ITM_STIM20 has no effect: the register is not modified. Is there something to unlock or enable?
I also configured the GPIO PB3 with alternate function SWO.
Any ideas?
[...] I noticed that the store instruction that is supposed to write 0x10 into ITM_STIM20 has no effect, the register is not modified.
Please re-check the read/write semantics of this register in the Reference Manual, page 3222:
Write data is output on the trace bus as a software event packet. When reading, bit 0 is a FIFOREADY indicator:
0: Stimulus port buffer is full (or port is disabled)
1: Stimulus port can accept new write data
Therefore, I don't believe there is necessarily a mistake around this register: a read returns the FIFOREADY status, not the value that was last written.
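The read/write semantics quoted above can be sketched as a small helper. This is only an illustration, not code from the question: the name itm_try_send and the max_polls parameter are hypothetical, and the port pointer is assumed to be the address that the question's ITM_Port(n) macro computes.

```c
#include <stdbool.h>
#include <stdint.h>

/* When a stimulus port register is read, bit 0 is FIFOREADY. */
#define ITM_FIFOREADY 0x1u

/*
 * Hypothetical helper: try to send one 32-bit value through a stimulus port.
 * `port` points at the stimulus register (on the target, the address that
 * ITM_Port(n) resolves to). The port is polled at most `max_polls` times.
 * Returns true if the write was issued, false if every read reported
 * "buffer full or port disabled".
 */
static bool itm_try_send(volatile uint32_t *port, uint32_t value,
                         unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (*port & ITM_FIFOREADY) {   /* read: status only */
            *port = value;             /* write: emits a trace packet */
            return true;
        }
    }
    return false;
}
```

On real hardware, reading the register back after the write would again return only the status bit, which matches the observation that the register "is not modified".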
The trace log screenshot in the question shows plenty of ITM trace packets at other ports (24, 25, 26, 28, 29, 30, 31).
Please take care not to overdo the ITM trace packet creation:
The question refers to SWV Trace Log, so the trace packets of different ITM ports must all pass through the same SWO line.
That interface is not very fast (while the CPU of STM32H7 certainly is!), so software-triggered ITM packet creation can easily choke this bottleneck so that packets are discarded.
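A rough back-of-the-envelope estimate of this bottleneck, under stated assumptions: SWO runs in asynchronous (UART-style) mode with one start and one stop bit per byte, and each stimulus write produces one header byte plus its payload bytes. Timestamp and synchronization packets are ignored, so the real budget is smaller. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical estimate of the maximum number of ITM software packets per
 * second that fit through SWO.
 * Assumptions: UART-style framing (10 line bits per byte) and a packet of
 * 1 header byte + payload_bytes (1, 2 or 4). Timestamps are not counted.
 */
static uint32_t swo_max_packets_per_sec(uint32_t swo_baud_hz,
                                        uint32_t payload_bytes)
{
    const uint32_t line_bits_per_byte = 10;  /* start + 8 data + stop */
    uint32_t bytes_per_sec = swo_baud_hz / line_bits_per_byte;
    return bytes_per_sec / (1 + payload_bytes);
}
```

With the 2 MHz SWO clock from the question and 32-bit writes, this gives about 40,000 packets per second, i.e. roughly one packet per 10,000 cycles of a 400 MHz core. Triggers placed inside tight loops can exceed that rate easily.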
The question doesn't contain the surrounding code where trace packets are created, but my guess is that during the analysis, inserting additional ITM packet triggers at a finer level (inside loops or so) increased the port traffic until lost packets weren't even noticeable.
The easiest way out may be to remove parts of the ITM triggers (or, to activate only few ITM channels at a time, which will filter packets in ITM before they are transmitted through SWO) and measure different aspects at a time, repeating the measurement with different ITM channel selection.
The second easiest way is to spend a few k$ on a debug adapter that supports the synchronous trace port. This feature is only supported by high-end adapter variants such as J-Trace, Lauterbach, etc. - it is usually targeted at ETM tracing, but you can also use the parallel TPIU interface to output ITM data, probably at a higher rate.
This strategy isn't the most elegant in the described situation - please consider the other way first!

Dismiss or Handle Data Abort when AXI transaction replies an error

Background
I have a ZynqMP system with four Cortex-A53 cores (PS) alongside FPGA logic (PL). They exchange data over the AXI bus.
I've placed some Xilinx AXI Quad SPI blocks in my design. Linux, running on the PS, probes them successfully and starts daemons that periodically (at 333 Hz) ask the MCUs on the SPI buses to send their data chunks (up to around 500 bytes, split into 64-byte pieces).
They work nicely for a while (median 50 minutes), but then the readl_relaxed() in the SPI driver suddenly causes a Synchronous External Abort, which leads to a kernel panic. According to the ARM TRM it seems to be an AXI error reply, and it might be recoverable because it is "synchronous", which (in my understanding) means the registers are not corrupted.
After some searching I found the do_sea() function, which handles SEAs, and from its implementation there appears to be no way to recover from one.
I want the AXI error to be handled gracefully, e.g. discard the read, raise SIGBUS so that the process is killed, and so on.
Of course I'm also debugging the abort to find out why it occurs, but at present I have no clue.
Question
So my questions are:
1) Why are SEAs not recoverable in the Linux arm64 implementation?
2) If I can "handle" or "ignore" it, how should I modify the Linux kernel code? (I know it's stupid, but I'd like to know if there's a way.)
3) What can cause an error reply in the Quad SPI IP? The readl_relaxed() I mentioned above reads the Rx data FIFO.
1) I've never ventured down this path, but it looks to me like they are recoverable if inf->fn returns 0, which means that ghes_notify_sea() must return 0, i.e. one of the SEA error sources successfully reported an error.
2) I think you need a bit more info. I would start by changing
drivers/acpi/apei/ghes.c:732
from:
rc = ghes_read_estatus(ghes, 0);
to:
rc = ghes_read_estatus(ghes, 1);
which should get you a bit more information when the error happens.
Armed with that information, you need to find out if you have a malfunctioning handler, or a missing one. Either way, this is the place to address it.
3) You are dealing with an ACPI implementation. There are 155 kloc in the kernel plus an unknown quantity in the firmware and hardware. The kernel code doesn't appear to handle whichever condition you are running into. First you need to determine which of these suspects is involved and which interactions are failing before you can dig out the root cause.
Happy Digging!

Registering interrupt with irq from pci_irq_vector(9) function results in "No irq handler for this function"?

I am writing a device driver that services the interrupts from the device. The device has only one MSI interrupt vector, so I look up the irq with pci_irq_vector(dev, 0) and register a handler for it. This is shown in the following code snippet (equivalent to what I have minus error handling):
retval = pci_alloc_irq_vectors(dev, 1, 1, PCI_IRQ_MSI);
irq = pci_irq_vector(dev, 0);
retval = request_irq(irq, irq_fnc, 0, "name", dev);
This all completes successfully and without warning (at least with dmesg). Yet when the interrupt comes in, I get the error.
kernel:do_IRQ: 0.xxx No irq handler for this vector (irq -1)
The xxx appears to be an arbitrary number that changes every time the driver is loaded, but does not match the irq number. Instead, it matches the last two hex digits of the message data sent with the MSI interrupt as read from the MSI capability structure. Trying to request an irq of this number returns EINVAL which I think means that it's not associated with any PCI device. What does this number mean anyway?
Something that may be important to note, I am actually manually triggering this interrupt from the host side due to limitations with the device. I am reading the interrupt address and data from the capability structure then instructing the device to write the data to that address.
How would I go about further debugging this? Does anything from my description stand out as suspicious? Any help would be appreciated.
Does this particular irq show when you type cat /proc/interrupts? Maybe you can get the correct irq number from there, as well as other info like where it is attached and what driver is associated with this interrupt line!
So the problem ended up being in the order of things. To manually create the interrupt, I had read the config space for the interrupt address and data before allocating interrupts. While obvious in retrospect, allocating the irq vectors for the device writes the appropriate data to the config space. Hence, using the preexisting value in the message data field would point to an irq vector that does not exist.
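The observation in the question, that the number in the log matches the low byte of the message data, is consistent with the x86 MSI message data layout, in which bits 7:0 carry the interrupt vector and bits 10:8 the delivery mode. A minimal sketch of that decoding follows. The helper names are hypothetical, and the layout is assumed to be the x86 one.

```c
#include <stdint.h>

/*
 * Hypothetical helpers decoding the MSI Message Data register as laid out
 * on x86: bits 7:0 = vector, bits 10:8 = delivery mode.
 */
static inline uint8_t msi_data_vector(uint16_t data)
{
    return (uint8_t)(data & 0xFFu);
}

static inline uint8_t msi_data_delivery_mode(uint16_t data)
{
    return (uint8_t)((data >> 8) & 0x7u);
}
```

Decoding the value only after pci_alloc_irq_vectors() has returned ensures that the vector is the one the kernel actually programmed, rather than a stale value left in config space.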

How do I write to a __user memory from within the top half of an interrupt handler?

I am working on a proprietary device driver. The driver is implemented as a kernel module. This module is then coupled with a user-space process.
It is essential that each time the device generates an interrupt, the driver updates a set of counters directly in the address space of the user-space process from within the top half of the interrupt handler. The driver knows the PID and the task_struct of the user process and is also aware of the virtual address where the counters lie in the user-process context. However, I am having trouble figuring out how code running in interrupt context could take up the mm context of the user process and write to it. Let me sum up what I need to do:
Get the address of the physical page and offset corresponding to the virtual address of the counters in the context of the user-process.
Set up mappings in the page table and write to the physical page corresponding to the counter.
For this, I have tried the following:
Try to take up the mm context of the user-task, like below:
use_mm(tsk->mm);
/* write to counters. */
unuse_mm(tsk->mm);
This apparently causes the entire system to hang.
Wait for the interrupt to occur when our user-process was the current process. Then use copy_to_user().
I'm not much of an expert on kernel programming. If there's a good way to do this, please do advise and thank you in advance.
Your driver should be the one that maps kernel memory into the user-space process. For example, you may implement the .mmap callback in the struct file_operations for your device.
A kernel driver may write to a kernel address that it has mapped at any time (even in an interrupt handler). The user-space process will immediately see all modifications on its side of the mapping (using the address obtained from the mmap() system call).
Unix's architecture frowns on interrupt routines accessing user space because a process could (in theory) be swapped out when the interrupt occurs. If the process is running on another CPU, that could be a problem, too. I suggest that you write an ioctl to synchronize the counters, and then have the process call that ioctl every time it needs to access the counters.
Outside of an interrupt context, your driver will need to check the user memory is accessible (using access_ok), and pin the user memory using get_user_pages or get_user_pages_fast (after determining the page offset of the start of the region to be pinned, and the number of pages spanned by the region to be pinned, including page alignment at both ends). It will also need to map the list of pages to kernel address space using vmap. The return address from vmap, plus the offset of the start of the region within its page, will give you an address that your interrupt handler can access.
At some point, you will want to terminate access to the user memory, which will involve ensuring that your interrupt routine no longer accesses it, a call to vunmap (passing the pointer returned by vmap), and a sequence of calls to put_page for each of the pages pinned by get_user_pages or get_user_pages_fast.
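The page arithmetic mentioned above (offset of the region within its first page, and the number of pages it spans) can be sketched in plain C. This is an illustration only: the struct and function names are hypothetical, and a 4 KiB page size is assumed. In the kernel, PAGE_SIZE and related macros would be used instead.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Hypothetical description of the pages that must be pinned. */
struct pin_span {
    uintptr_t first_page;  /* page-aligned start address */
    size_t    offset;      /* offset of the region within the first page */
    size_t    nr_pages;    /* number of pages the region touches */
};

static struct pin_span pin_span_for(uintptr_t uaddr, size_t len)
{
    struct pin_span s;
    s.offset = (size_t)(uaddr & (SKETCH_PAGE_SIZE - 1));
    s.first_page = uaddr - s.offset;
    /* Round up so that a partially covered last page is counted. */
    s.nr_pages = (s.offset + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
    return s;
}
```

For example, a 16-byte region starting 8 bytes before a page boundary spans two pages, so both must be pinned and mapped. The address used by the interrupt handler is then the vmap() return value plus offset.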
I don't think what you are trying to do is possible. Consider this situation (assuming this is how your device works):
1) Some function allocates the user-space memory for the counters and supplies its address in PROCESS X.
2) A context switch occurs and PROCESS Y executes.
3) Your device interrupts.
4) The address of your counters is inaccessible.
You need to schedule an asynchronous kernel-mode event (bottom half) that will execute when PROCESS X is executing.

The difference between exclude_hv and exclude_host in perf

In kernel 3.11.0, struct perf_event_attr has three members named exclude_hv, exclude_host and exclude_guest.
I know the exclude_host field excludes events generated by the host when running KVM. But what is the meaning of exclude_hv? Is it used with Xen?
What hardware mechanism supports exclude_host? As far as I know, the performance monitoring event select registers have no bits that make an event counter exclude events generated by the host.
This is a bit old, but for those who, like me, come looking for the answer:
exclude_hv: do not count events that occur in the hypervisor.
The distinction between events that occur in user space, the kernel, the hypervisor, the host, etc. is made in software. The kernel and/or hypervisor will retire and replace the event count and configuration on each change of context.
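For reference, these flags are single-bit fields of struct perf_event_attr and are set before the attribute is handed to the perf_event_open system call. A minimal sketch of filling them in follows. The helper name is hypothetical, and the choice of a CPU-cycles hardware event is only an example.

```c
#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical helper: build an attribute for a CPU-cycles counter with
 * the exclude_hv and exclude_host bits chosen by the caller.
 */
static struct perf_event_attr make_cycles_attr(int excl_hv, int excl_host)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                        /* start stopped */
    attr.exclude_hv = excl_hv ? 1 : 0;        /* skip hypervisor events */
    attr.exclude_host = excl_host ? 1 : 0;    /* skip host events */
    return attr;
}
```

The resulting structure would then be passed to perf_event_open, which is typically invoked through syscall() because glibc provides no wrapper for it.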
Here is an excellent description of perf_events, which is the kernel module that handles the performance counters.
