Dismiss or handle a Data Abort when an AXI transaction returns an error - linux-kernel

Background
I have a ZynqMP system with four Cortex-A53 cores (PS) along with FPGA logic (PL). The PS and PL transfer data via the AXI bus.
I've placed some Xilinx AXI Quad SPI cores in my design. Linux, which runs on the PS, probes them successfully and starts daemons which periodically (at 333 Hz) ask the MCUs on the SPI buses to reply with their data chunks (up to around 500 bytes, split into 64-byte transfers).
They work nicely for a while (median 50 minutes), but then the readl_relaxed() in the SPI driver suddenly causes a Synchronous External Abort (SEA), which leads to a kernel panic. According to the ARM TRM it seems to be an AXI error reply, and it might be recoverable because it's "synchronous", which (in my understanding) means the registers are not corrupted.
After some searching I found the do_sea() function that handles SEAs, and also found that, according to the implementation, there is no chance to recover from one.
I want the AXI error to be handled gracefully: discard the read, return SIGBUS so the process gets killed, and so on.
Of course I'm debugging the abort and trying to find why it occurs, but at present I have no clue.
Question
So my questions are:
Why are SEAs not recoverable in the Linux arm64 implementation?
If it is possible to "handle" or "ignore" one, how do I modify the Linux kernel code? (I know it's stupid, but I'd like to know if there's a way.)
What can cause an error reply from the Quad SPI IP? The readl_relaxed() I mentioned above reads the Rx data FIFO.

1) I’ve never ventured down this path, but it looks to me like they are recoverable if inf->fn returns 0, which means that ghes_notify_sea() must return 0, i.e. one of the SEA error sources successfully reported the error.
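Condensed, that path looks roughly like this (a paraphrase of arch/arm64/mm/fault.c from kernels around v4.13-v4.17, not verbatim; the code changes between versions, so check your tree):

/* do_sea() is installed as inf->fn for the external-abort entries of
 * the fault_info table. If the GHES layer claims the error, do_sea()
 * returns 0 and the abort is treated as handled; otherwise the fault
 * falls through to arm64_notify_die(), which delivers SIGBUS to user
 * space or panics on a kernel-mode fault. */
static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
{
    int ret = 0;

    if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
        ret = ghes_notify_sea();  /* 0 iff an SEA error source claimed it */

    return ret;
}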
2) I think you need a bit more info. I would start by changing
drivers/acpi/apei/ghes.c:732
from:
rc = ghes_read_estatus(ghes, 0);
to:
rc = ghes_read_estatus(ghes, 1);
which should get you a bit more information when the error happens.
Armed with that information, you need to find out whether you have a malfunctioning handler or a missing one. Either way, this is the place to address it.
3) You are dealing with an ACPI implementation. There are 155 kloc of it in the kernel, plus an unknown quantity of code in the firmware and hardware. The kernel code doesn’t appear to handle whichever condition you are running into. First you need to determine which of these suspects is involved and which interactions are failing before you can dig out the root cause.
Happy Digging!

Related

What could cause EDAC errors without reporting them, or without actual ECC errors?

I've got a ZynqU+ board that I've built and am running embedded Linux on. Everything boots fine and initially runs fine. One problem, though: every time I boot the board, the ue_count in /sys/devices/system/edac/mc/mc0/ is incremented to 13 (ce_count is 0). There are no EDAC messages in dmesg or syslog mentioning an uncorrectable error, and when I investigate the Zynq DDR module's registers (https://www.xilinx.com/htmldocs/registers/ug1087/ug1087-zynq-ultrascale-registers.html, search for "DDRC Module"), the status registers containing the CE and UE counts are 0, along with all related registers.
Additionally, if I stress the system with a bunch of constant read/write operations to a temp folder, I will eventually (10-30 minutes) see EDAC errors printed to the console. This is often followed by a kernel panic, but if the system does not panic, investigating the locations above shows that my ce_count and ue_count have incremented, syslog now has EDAC error messages in it, and the Zynq DDRC module's registers contain values where they were 0 before (interestingly, not the CE and UE count registers; those remain 0. Perhaps EDAC clears them after reporting?).
I have tested this build across half a dozen different boards and they all show exactly the same behavior. Because of that, I have trouble believing these ECC errors are real, but I'm not really sure what other explanation there could be. Perhaps I misconfigured something in Linux?
The ue_count of 13 on boot really mystifies me, though: how can EDAC increment it without reporting any errors, and how can it increment it while the Zynq module it's registered to shows no signs of ECC activity?
Any advice on things to check, diagnostics to perform, experience with ECC errors, or anything really would be helpful, as I'm mostly at a loss on this problem.

What is the difference within the compiler between debugging and running the code? (STM32)

Somehow, when I run my code, it seems like one GPIO port isn't being initialized, whereas when I debug, it is.
I am initializing two sensors:
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
Sensor heater 1 is not getting any information, while sensor heater 2 is. Now if I swap the names of the heaters:
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
and run the code in the debugger, sensor heaters 1 and 2 both get information.
How can this happen? I was thinking about a timing problem, but since it works in the debugger, I don't really know what to do.
This assumes you are debugging and running the same binary. Debugging is mostly the same as running, except when you halt the processor (e.g. at breakpoints).
In that case...
some peripherals could continue to run or be halted together with the CPU; in some cases the behaviour can be configured (timers, watchdog, ...)
some interrupts can be lost.
some hardware buffers can overflow and data can be lost (if you don't use any flow control in your IO).
How do you run the code in debug mode? Do you have breakpoints somewhere?
You (OP) are right that it is most likely a timing problem, and probably related to the physical SPI transmission. Your line of code to send/receive something over SPI has already executed on the MCU, but physically the bits and bytes are still being transmitted on the line while the MCU is already calling the next SPI function, so one of the transmissions will fail. Try adding some delay after the SPI transmission code. If things work after that, it's the timing of the SPI peripheral, and you need to add a check that no SPI transmission is already in progress before you call a function to send/receive something.
You can do while(transmission) (pseudocode; replace with an actual check of whether an SPI transmission is in progress) to wait until the previous transmission ends before starting the next one.
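With the STM32 HAL, a minimal sketch of that check could look like this (assuming hspi1 is the HAL handle behind your &spi1; pick the header for your STM32 family):

#include "stm32f4xx_hal.h"       /* adjust to your family's HAL header */

extern SPI_HandleTypeDef hspi1;  /* assumed: the handle used for both sensors */

/* Block until the previous SPI transfer has fully completed, e.g. before
   deasserting one chip select and asserting the next. */
static void spi1_wait_until_ready(void)
{
    while (HAL_SPI_GetState(&hspi1) != HAL_SPI_STATE_READY) {
        /* spin: the peripheral is still busy shifting bits on the line */
    }
}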

PCIe UIO multi-DWORD access issues

I have an Intel FPGA PCIe endpoint. It shows up correctly in lspci, and all of the lspci -vv information looks correct (memory map, IRQ, and BAR0 size all look OK). I want to stream some data over BAR0 and read/write status registers inside my IP. My host machine has an Intel x86_64 CPU and runs a Debian OS.
What I'm currently doing is this:
1. open() /sys/class/uio/uio0/device/resource0, which returns a file descriptor.
2. mmap() 4 KB on that file descriptor with PROT_READ | PROT_WRITE protection and the MAP_SHARED flag, offset 0, which returns a pointer.
3. In a loop, set offsets relative to the pointer to random numbers, for ~1000 bytes, 4 bytes at a time. After each write through the pointer, call msync().
4. Read back through the pointer one DWORD at a time.
5. Read back through the pointer in bulk using memcpy().
The outcome of step 4 is that most of the data looks correct, but some is not (which is strange). The outcome of step 5 is that I get 0xffff_ffff for everything, which is even stranger.
If I try to replace step 3 with a single memset()/msync() sequence, the program hangs for a little while and then returns a bus error. After this, lspci states that BAR0 is disabled and I can no longer interact with it.
Any ideas what I'm doing wrong? It could be a HW issue, but the HW is really simple right now. I configured the FPGA to act purely as a slave device, with its read response being the registered write data coming across the BAR0 interface. The IP I am working with only has a read-response-valid line (no write-response-valid, somehow), which I have hard-coded to 1. It seems that burst sizes greater than 1 cause some kind of issue with the PCIe core, as seen in the memcpy/memset failures, but I don't see why that would be the case.
EDIT/UPDATE:
I was able to work around this. Apparently the MMIO interface only supports 32-bit accesses, so writing a loop that accesses the pointer 32 bits at a time is the solution here.
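A minimal sketch of that access pattern (assuming bar is the pointer returned by mmap() on resource0; the helper names are made up for illustration):

#include <stddef.h>
#include <stdint.h>

/* Copy to/from the UIO-mapped BAR strictly one aligned 32-bit word at a
   time; memcpy()/memset() may issue wider or unaligned accesses that the
   PCIe core rejects. */
static void bar_write32(volatile uint32_t *bar, const uint32_t *src, size_t ndwords)
{
    for (size_t i = 0; i < ndwords; i++)
        bar[i] = src[i];    /* one 32-bit MMIO write per iteration */
}

static void bar_read32(uint32_t *dst, const volatile uint32_t *bar, size_t ndwords)
{
    for (size_t i = 0; i < ndwords; i++)
        dst[i] = bar[i];    /* one 32-bit MMIO read per iteration */
}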

Debian UART Dropping Bytes

Situation:
I'm trying to manipulate/hack a Debian kernel to be able to use 9-bit UART via the parity hack that you can find referenced all over the net. For some reason Tx is working just fine, but on Rx we're dropping bytes. We're expecting 9 bytes back, including the CRC, but we get a variable amount back, between 4 and 7. It varies with the exact same code run a few times in a row, so it almost appears to be buffer related. Hooking it up to the logic analyzer, the response from the slave device is proper (the 9 expected bytes), but it is getting "mangled" somewhere low-level that I can't track down.
Attempted:
So I started dumping all the bytes (chars) that come in here:
static void serial_omap_rdi(struct uart_omap_port *up, unsigned int lsr)
This still results in only seeing 4-7 bytes of bad data. I'm assuming this is due to parity/framing/FIFO errors, but I just can't seem to get any deeper to see exactly where it's failing.
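One way to see why bytes are being discarded is to log the line status register alongside each character inside serial_omap_rdi(); a rough sketch (up and lsr are the function's existing arguments, serial_in() is the driver's register accessor, and the UART_LSR_* masks come from include/uapi/linux/serial_reg.h):

unsigned char ch = serial_in(up, UART_RX);
pr_info("omap rx: 0x%02x lsr=0x%02x%s%s%s\n", ch, lsr,
        (lsr & UART_LSR_PE) ? " PARITY"  : "",
        (lsr & UART_LSR_FE) ? " FRAMING" : "",
        (lsr & UART_LSR_OE) ? " OVERRUN" : "");

If bytes show up here flagged PARITY, note that the termios settings (INPCK, IGNPAR, PARMRK) decide whether the tty layer drops, marks, or passes them through, which matters a lot for the 9-bit parity hack.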
Questions:
Has anyone seen this type of discarded data on a UART line with a Linux kernel? If so, would you know any possible avenues to take in tracking it down?
Is there any place in the kernel I can look to actually dump the bit-by-bit data that is coming in on the UART Rx line?
Any help is greatly appreciated, as I'm at my wits' end understanding how/where exactly the kernel is discarding these bytes...
Platform:
BBB Debian Linux 3.8.13-bone53 armv7l GNU/Linux
16750 UART (I Believe)

Creating new task in FreeRTOS for USART reception

I am using the EVK1105 development board with AVR Studio 5 as the development IDE for my AVR project.
I am using FreeRTOS on it. I have 3 USART ports on this board. An external module is connected to my AVR32 board via USART in RS-232 mode. It sends continuous serial data to my board on USART0 at 19230 baud, 7 data bits, odd parity, 1 stop bit, normal channel mode. I created a new task for this purpose. After every 9 data bytes it sends '\n' and '\r', so in my task I keep collecting the 9 data bytes into a string buffer and then transmit the buffer on USART1. I am using a polling method to collect data from USART0, the receiving port. But I am facing problems receiving the data: I don't know if it's a timing issue, or whether the scheduler switches tasks while I'm collecting the data, but I don't get the required data.
The following are things I have already checked while troubleshooting:
1. Connected my external module to my PC's hyper-terminal, which gives me a perfect result.
2. Implemented the same thing (receive from USART0 and transmit whatever is received on USART1) without FreeRTOS. It works fine.
Please suggest some ideas about what may be wrong. I am using a queue to pass the string buffer from the USART0 Rx task to the USART1 Tx task. Could the problem be in how I handle the queue? How can I troubleshoot the queue?
I am using a delay of 50 ms in the infinite loop of my Rx task. Can that create a problem? If I don't use any delay, the OS crashes. Please suggest some good practices for creating a new task in FreeRTOS so that I won't run into timing issues.
For such a use case, I would not use a polling method with a 50 ms delay to retrieve data from the UART peripheral. You can easily lose received data, depending on the system load and the UART reception buffer size.
At least use an interrupt on UART data reception that copies every received byte into a local buffer that is then read by your Tx thread, as sketched below.
You can have an even better solution using a DMA channel to receive your data frame and be notified when 9 bytes have been received. I don't know whether your AVR device has a DMA peripheral or not.
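A minimal sketch of the interrupt-plus-queue approach, using only the generic FreeRTOS API (usart0_rx_isr() is a placeholder you would wire into the real USART0 interrupt handler of your AVR32 port, and the forwarding stub is left empty):

#include <stdint.h>
#include "FreeRTOS.h"
#include "queue.h"
#include "task.h"

static QueueHandle_t rx_queue;  /* created at init: xQueueCreate(128, sizeof(uint8_t)) */

/* Call this from the USART0 RXRDY interrupt with the byte just read from RHR. */
void usart0_rx_isr(uint8_t byte)
{
    BaseType_t woken = pdFALSE;
    xQueueSendFromISR(rx_queue, &byte, &woken);
    portYIELD_FROM_ISR(woken);  /* wake the Tx task at once if it was blocked */
}

/* Tx task: blocks on the queue instead of polling with a 50 ms delay. */
void tx_task(void *params)
{
    uint8_t byte;
    for (;;) {
        if (xQueueReceive(rx_queue, &byte, portMAX_DELAY) == pdTRUE) {
            /* append to the 9-byte frame buffer and, on '\n'/'\r',
               transmit the completed frame on USART1 */
        }
    }
}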
Are you still working on this? The statement of your problem is vague, but I have several suggestions/leading questions.
1) You may want some documents to see what the registers are.
Get the giant datasheet pdfs at
http://www.atmel.com/dyn/products/product_docs.asp?category_id=163&family_id=607&subfamily_id=2138&part_id=4117
2) In this and an earlier post you state that you have, in some cases, been able to receive data. You will need to find the USART HW initialization code in those example projects and get it into the FreeRTOS example project. In particular, calls to
gpio_enable_module() with {AVR32_USART0_RXD_0_0_PIN, AVR32_USART0_RXD_0_0_FUNCTION}
to connect the USART to the CPU,
and, I believe,
InitRs232()
Just doing this requires poking around a lot of code - there are a lot of dependencies.
3) What function are you calling to retrieve data from USART0? 19 kbaud is approximately 2000 bytes/sec, or 1 byte per 0.5 ms, so 50 ms polling is not nearly enough. I'd suggest that your Rx task poll continuously (never sleep explicitly) but at a lower priority than the Tx task.
4) Concentrate on debugging the Rx task at the call that retrieves data. Use the debugger to look at the hardware registers for the USART. In particular:
USART0 CR register: AVR32_USART_CR_RXEN_MASK should be set to enable Rx.
USART0 CSR register: AVR32_USART_CSR_RXRDY_MASK will indicate whether there is new data.
You can also check the overrun flag to see if you have missed some data.
When the read of the USART0 RHR occurs, it should return a byte that you sent.
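For instance, a bare polling read might look like this (register and mask names as in the AVR32 ASF headers; treat the exact identifiers as assumptions to check against your toolchain):

volatile avr32_usart_t *usart = &AVR32_USART0;
uint8_t byte;

while (!(usart->csr & AVR32_USART_CSR_RXRDY_MASK))
    ;                                       /* wait for RXRDY: a byte arrived */
if (usart->csr & AVR32_USART_CSR_OVRE_MASK)
    usart->cr = AVR32_USART_CR_RSTSTA_MASK; /* overrun: data was lost; clear status */
byte = usart->rhr & 0xff;                   /* reading RHR also clears RXRDY */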
If you are still working on it, I can look into this a bit more.
