What could cause EDAC errors without reporting them, or without actual ECC errors?

I've got a Zynq UltraScale+ board that I've built and am running embedded Linux on. Everything boots and initially runs fine. One problem, though: ue_count in /sys/devices/system/edac/mc/mc0/ is incremented to 13 (ce_count is 0) every time I boot the board. There are no EDAC messages in dmesg or syslog mentioning an uncorrectable error, and when I inspect the Zynq DDRC module's registers (https://www.xilinx.com/htmldocs/registers/ug1087/ug1087-zynq-ultrascale-registers.html, search for "DDRC Module"), the status registers containing the CE and UE counts are 0, along with all related registers.
Additionally, if I stress the system with constant read/write operations to a temp folder, I will eventually (10-30 minutes) see EDAC errors printed to the console. This is often followed by a kernel panic, but if the system does not panic, re-checking the locations above shows that ce_count and ue_count have incremented, syslog now contains EDAC error messages, and the Zynq DDRC module's registers contain values where they were 0 before (interestingly, not the CE and UE count register, which remains 0; perhaps EDAC clears it after reporting?).
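For reference, a minimal sketch of the kind of read/write stress loop described above (the file path and sizes are illustrative, not the original test):

/* Repeatedly write, sync, read back, and verify a scratch file until an
 * EDAC report (or a panic) shows up. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096], check[4096];

    memset(buf, 0xA5, sizeof(buf));

    for (;;) {
        int fd = open("/tmp/edac_stress.bin", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 1024; i++)          /* ~4 MiB per pass */
            if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { perror("write"); return 1; }
        fsync(fd);

        lseek(fd, 0, SEEK_SET);
        for (int i = 0; i < 1024; i++) {
            if (read(fd, check, sizeof(check)) != sizeof(check)) { perror("read"); return 1; }
            if (memcmp(buf, check, sizeof(check)) != 0)
                fprintf(stderr, "data mismatch in block %d\n", i);
        }
        close(fd);
    }
}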
I have tested this build across half a dozen different boards and they all show the exact same behavior. Because of that, I have trouble believing these ECC errors are real, but I'm not really sure what other explanation there could be. Perhaps I misconfigured something in Linux?
The ue_count of 13 on boot really mystifies me, though: how can EDAC increment it without reporting any errors, and how can it increment it while the Zynq DDRC module it's registered to shows no signs of ECC activity?
Any advice on things to check, diagnostics to perform, experience with ECC errors, or anything really would be helpful, as I'm mostly at a loss on this problem.

Related

What is the difference within the compiler between debugging and running the code? (STM32)

Somehow, when I am running my code, it seems like one GPIO port isn't being initialized, whereas when I am debugging, it is.
I am initializing two sensors:
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
    TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
    TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
Sensor heater 1 is not getting any information, while sensor heater 2 is. Now, if I swap the names of the heaters:
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
    TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
    TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
and run the code in the debugger, sensor heaters 1 and 2 both get information.
How can this happen? I was thinking about a timing problem, but since it works in the debugger, I don't really know what to do.
Provided that you are debugging and running the same binary, debugging is mostly the same as running, except when you halt the processor (e.g. at breakpoints).
In that case...
some peripherals may continue to run or be halted together with the CPU; in some cases the behaviour can be configured (timers, watchdog, ...)
some interrupts can be lost.
some hardware buffers can overflow and data can be lost (if you don't use any flow control in your I/O).
How do you run the code in debug mode? Do you have breakpoints somewhere?
You (OP) are right that it is most likely a timing problem, probably related to the physical SPI transmission. Your line of code to send/receive something over SPI has already executed on the MCU, but the bits and bytes are still physically being transmitted on the line while the MCU is already calling the next SPI function, so one of the transmissions will fail. Try adding some delay after the SPI transmission code. If things work after that, then it's the timing of the SPI peripheral, and you need to check that no SPI transmission is already in progress before you call a function to send/receive something.
You can do while(transmission_in_progress) (pseudocode; replace with the actual check for whether an SPI transmission is ongoing) to wait until the previous transmission ends before starting the next one.
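For example, a minimal sketch assuming the STM32 HAL with a handle named hspi1 and a buffer tx_buf (illustrative names; adapt the busy check to whichever SPI driver you actually use):

/* Wait for any previous transfer to finish before starting a new one. */
while (HAL_SPI_GetState(&hspi1) != HAL_SPI_STATE_READY)
{
    /* spin until the SPI peripheral is idle */
}
HAL_SPI_Transmit(&hspi1, tx_buf, sizeof(tx_buf), HAL_MAX_DELAY);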

Dismiss or Handle Data Abort when an AXI transaction replies with an error

Background
I have a ZynqMP system with four Cortex-A53 cores (PS) along with FPGA logic (PL). They transfer data via the AXI bus.
I've placed several Xilinx AXI Quad SPI cores in my design. Linux, which runs on the PS, successfully probes them and starts daemons which periodically (333 Hz) ask the MCUs on the SPI buses to reply with their data chunks (up to around 500 bytes, split into 64-byte transfers).
They work nicely for a while (median 50 minutes), but then a readl_relaxed() in the SPI driver suddenly causes a Synchronous External Abort (SEA), which leads to a kernel panic. According to the ARM TRM it appears to be an AXI error reply, and it might be recoverable because it is "synchronous", which (in my understanding) means the registers are not corrupted.
After some searching I found the do_sea() function that handles SEAs, and also found that, according to the implementation, there is no chance to recover from one.
I want the AXI error to be handled gracefully: discard the read, return SIGBUS so the process gets killed, and so on.
Of course I'm debugging the abort and trying to find out why it occurs, but at present I have no clue.
Question
So my questions are:
Why are SEAs not recoverable in the Linux arm64 implementation?
If I can "handle" or "ignore" it, how do I modify the Linux kernel code? (I know it's hacky, but I'd like to know if there's a way.)
What could reply with an error in the Quad SPI IP? The readl_relaxed() I mentioned above reads the Rx data FIFO.
1) I've never ventured down this path, but it looks to me like they are recoverable if inf->fn returns 0, which means that ghes_notify_sea() must return 0, i.e. one of the SEA error sources successfully reported an error (see the simplified sketch after point 3).
2) I think you need a bit more info. I would start by changing
drivers/acpi/apei/ghes.c:732
from:
rc = ghes_read_estatus(ghes, 0);
to:
rc = ghes_read_estatus(ghes, 1);
which should get you a bit more information when the error happens.
Armed with that information, you need to find out if you have a malfunctioning handler, or a missing one. Either way, this is the place to address it.
3) You are dealing with an ACPI implementation. There are 155 kloc in the kernel plus an unknown quantity in the firmware and hardware. The kernel code doesn't appear to handle whichever condition you are running into. First you need to determine which of these suspects is involved and what interactions are failing before you can dig out the root cause.
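As a very rough illustration of the condition in 1) — this is a paraphrase, not verbatim kernel source, and the exact function names and signatures vary between kernel versions:

/* Paraphrase of the arm64 SEA path: the abort is only treated as
 * handled when a GHES/APEI error source claims it (ghes_notify_sea()
 * returning 0); otherwise the fault escalates, which for a kernel-mode
 * access like readl_relaxed() ends in the panic described above. */
static int sea_recoverable(void)
{
    if (ghes_notify_sea() == 0)
        return 1;   /* an SEA error source reported the error: handled */

    return 0;       /* unclaimed: SIGBUS for user mode, die()/panic for kernel mode */
}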
Happy Digging!

Call to ExAllocatePoolWithTag never returns

I am having some issues with my virtualHBA driver on Windows Server 2016. I ran the HLK crashdump support test; it passed only 3 times out of 10. In the failing runs, the crash dump hangs at 0% while taking a complete dump, kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf = (struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned, pcmdqSignalSize, ((ULONG)'TA1'));
I searched the web about this; however, all of the pages I found focus on this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult or impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (e.g. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry-configurable slush. The only benefit is that the environment is stripped down, so you don't need to be in your all-singing, all-dancing configuration.
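As a rough illustration of that last step, a minimal bump allocator over the pre-reserved region might look like the sketch below (the region base and size are assumed to come from the port configuration fields mentioned above; the names here are illustrative, not verified WDK definitions):

typedef struct _DUMP_ALLOCATOR {
    PUCHAR Base;      /* start of the reserved crash dump region */
    SIZE_T Size;      /* total bytes reserved during HwFindAdapter */
    SIZE_T Offset;    /* next free byte */
} DUMP_ALLOCATOR;

/* Allocate from the reserved region; returns NULL when it is exhausted.
 * Allocations are kept 64-byte aligned to mimic NonPagedPoolCacheAligned. */
static PVOID DumpAlloc(DUMP_ALLOCATOR *Alloc, SIZE_T Bytes)
{
    SIZE_T aligned = (Bytes + 63) & ~((SIZE_T)63);
    PVOID p;

    if (Alloc->Offset + aligned > Alloc->Size)
        return NULL;

    p = Alloc->Base + Alloc->Offset;
    Alloc->Offset += aligned;
    return p;
}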

Debian UART Dropping Bytes

Situation:
I'm trying to hack a Debian kernel to support 9-bit UART via the parity trick that you can find references to all over the net. For some reason Tx works just fine, but on Rx we're dropping bytes. We're expecting 9 bytes back, including the CRC, but we get a variable amount between 4 and 7. It varies with the exact same code run a few times in a row, so it almost appears to be buffer related. Hooking it up to a logic analyzer, the response from the slave device is proper (the 9 expected bytes), but it is getting mangled somewhere low-level that I can't track down.
Attempted:
So I started dumping all the bytes (chars) that come in here:
static void serial_omap_rdi(struct uart_omap_port *up, unsigned int lsr)
This still results in seeing only 4-7 bytes of bad data. I'm assuming this is due to parity/framing/FIFO errors, but I just can't seem to get any deeper to see exactly where it's failing.
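One low-effort way to make such errors visible is to log the LSR error bits alongside each received character at the point where the driver already reads it; a rough sketch (standard serial_reg.h bit names; verify serial_in() and the surrounding code against your 3.8 tree):

/* Inside serial_omap_rdi(), where the driver already does
 * ch = serial_in(up, UART_RX); reuse that single read (reading UART_RX
 * twice would drop data) and print the byte together with any
 * overrun/parity/framing/break flags from the passed-in lsr. */
unsigned char ch = serial_in(up, UART_RX);

printk(KERN_DEBUG "omap-uart rx 0x%02x lsr=0x%02x%s%s%s%s\n",
       ch, lsr,
       (lsr & UART_LSR_OE) ? " OVERRUN" : "",
       (lsr & UART_LSR_PE) ? " PARITY"  : "",
       (lsr & UART_LSR_FE) ? " FRAMING" : "",
       (lsr & UART_LSR_BI) ? " BREAK"   : "");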
Questions:
Has anyone seen this kind of discarded data on the UART line with a Linux kernel? If so, would you know any possible avenues to take in tracking it down?
Is there any place in the kernel I can look to actually dump the bit-by-bit data coming in on the UART Rx line?
Any help is greatly appreciated, as I'm at my wits' end understanding how/where exactly the kernel is discarding these bytes...
Platform:
BBB Debian Linux 3.8.13-bone53 armv7l GNU/Linux
16750 UART (I believe)

TI MSP430 Interrupt Problems After UART Code Port

I am using the MSP430F2013 processor for an application; it doesn't have a hardware UART. I need a UART, so I used TI's sample code "msp430x20x3_ta_uart2400.c" to emulate one using the Timer module. This all worked fine (compiled with IAR Embedded Workbench); I tested it using PuTTY to transmit characters to a development board, with a loopback to echo them back to the terminal.
That was a de-risking exercise, and now I've come to port that code into my application's state machine. Having done this, I'm having issues surrounding the timer interrupts and low power sleep modes. Here's the snippet of my code around the entry into the low power (sleep) mode:
// Prepare the UART to receive one byte.
prepare_receiver();
// Enter low power mode 1.
__bis_SR_register(LPM1_bits + GIE);
// Check whether the full message has been received.
if (true == get_message_complete())
{
    process_event(e_euart_message_received, NULL);
}
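For context, here is a minimal sketch (not the poster's code; rx_complete() is a placeholder) of the Timer_A ISR pattern this snippet depends on in order to wake the CPU again:

/* Sketch of the emulated-UART receive ISR: once a full byte/message has
 * been assembled, clear the LPM bits in the saved SR so that execution
 * resumes at the if-statement after __bis_SR_register() above. */
#pragma vector = TIMERA0_VECTOR
__interrupt void timer_a0_isr(void)
{
    /* ... sample RXD and shift the bit into the receive register ... */

    if (rx_complete())                        /* placeholder: full byte/message done? */
    {
        __bic_SR_register_on_exit(LPM1_bits); /* wake the CPU when the ISR returns */
    }
}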
What I'm seeing in the debugger (C-SPY) is that sometimes it will execute the __bis_SR_register() line on first entry and then go straight to the if statement, i.e., ignoring the fact that I've asked it to go to sleep. On other occasions, when it does go to sleep as it should, the ISR triggers correctly and eventually brings me back to the if statement to continue program execution (as I'm expecting). However, if I try to step to the next statement, the application freezes on that first line, i.e., I can't advance.
I can't think of anything functionally different from TI's example that I'm doing, so I figure my problem must be something to do with how I've ported it. For example, my Timer ISR and the code I've posted here are in different compilation units - would this sort of decision have any bearing on things? I'm aware my question might be a little vague but unfortunately I can't post all of my code, so instead I'm looking for someone with MSP experience who might be able to suggest some things to look at or some potential pitfalls that I may have fallen into.
Debugging interrupts with C-SPY in low-power mode is going to be tricky. According to Section A.3, Debugging (C-SPY), of the IAR User's Guide:
5) C-SPY can debug applications that utilize interrupts and low power modes
But there are some "gotchas" that you should be aware of that may be causing your headaches.
In particular:
14) When C-SPY has control of the device, the CPU is ON (that is, it is not in low-power mode) regardless of the settings of the low-power
mode bits in the status register. Any low-power mode conditions are
restored prior to Step or Go. Consequently, do not measure the power
consumed by the device while C-SPY has control of the device. Instead,
run your application using Go with JTAG released
19) C-SPY utilizes the system clock to control the device during
debugging. Therefore, device counters, etc., that are clocked by the
Main System Clock (MCLK) are affected when C-SPY has control of the
device. Special precautions are taken to minimize the effect upon the
Watchdog Timer. The CPU core registers are preserved. All other clock
sources (SMCLK, ACLK) and peripherals continue to operate normally
during emulation. In other words, the Flash Emulation Tool is a
partially intrusive tool.
Devices that support clock control (Emulator
→ Advanced → Clock Control) can further minimize these
effects by selecting to stop the clock(s) during debugging
24) Peripheral bits that are cleared when read during normal program
execution (that is, interrupt flags) are cleared when read while being
debugged (that is, memory dump, peripheral registers).
When using certain MSP430 devices (such as MSP430F15x, MSP430F16x,
MSP430F43x, and MSP430F44x devices), bits do not behave this way
(that is, the bits are not cleared by C-SPY read operations).
26) While single stepping with active and enabled interrupts, it can
appear that only the interrupt service routine (ISR) is active (that
is, the non-ISR code never appears to execute, and the single step
operation always stops on the first line of the ISR). However, this
behavior is correct because the device always processes an active and
enabled interrupt before processing non-ISR (that is, mainline) code.
A workaround for this behavior is, while within the ISR, to disable
the GIE bit on the stack so that interrupts are disabled after exiting
the ISR. This permits the non-ISR code to be debugged (but without
interrupts). Interrupts can later be reenabled by setting GIE in the
status register in the Register window.
On devices with the clock control emulation feature, it may be possible
to suspend a clock between single steps and delay an interrupt request
(Emulator → Advanced → Clock Control).
One thing to try is commenting out all the low power code and seeing if your UART code works like that. Then go back and try re-enabling the low power mode.
The answer to this question lies in the debugging setup, and more specifically in what types of breakpoints are being used. I had quite a complex series of macros that ran on program upload, which set various hooks into memory for testing purposes. These hooks relied on software breakpoints being created, which would then call functions outside of the application. I have seen no problem using these breakpoints in normal use; however, their existence means that the debugging session doesn't run in real time (i.e., the device is under the control of the host PC). This, for a reason not yet completely known to me, caused problems when trying to debug interrupts and low power modes. (I suspect that if I were to look a bit deeper, I would see the need to use clock control whilst debugging, but I'll save that for another day.)
So, to solve this problem and allow me to debug my interrupt and low power mode heavy code, which I'd ported into my larger application state machine, I had to do the following:
Disable software breakpoints within IAR. They're not actually enabled by default, but if you've been doing clever things with macros like I had, you probably would've needed to enable them, since there just aren't enough hardware breakpoints available in most MSP430s (for instance, I only have two in the MSP430F2013, and C-SPY more often than not hogs one of those!). The obvious downside to this is that debugging becomes a bit more laborious, but at least it's reliable.
Remove links to .mac macro files. In other words, if you're using macros, don't. In my case, this meant that I had to hack some state machine logic in order to force myself down a certain route (something the macro had previously been doing for me). This clearly isn't ideal, but it will allow you to debug the interrupt/low power mode code. The macros can then be re-enabled afterwards.
So it turned out that there wasn't a problem with my port after all. I'm not particularly happy with this hacky solution, but at least it's a step forward. If I have the time, I'll investigate to see if I can work out a way of using software breakpoints and add to this answer.
