Debian UART Dropping Bytes - linux-kernel

Situation:
I'm trying to modify a Debian kernel so that I can use 9-bit UART via the parity hack that is referenced all over the net. Tx works fine, but on Rx we're dropping bytes. We expect 9 bytes back, including the CRC, but we get a variable number, between 4 and 7. The count varies when the exact same code is run a few times in a row, so it appears to be buffer related. On the logic analyzer the response from the slave device is correct (the 9 expected bytes), but it is getting "mangled" somewhere at a low level that I can't track down.
Attempted:
So I started dumping all the bytes (char) that come in here:
static void serial_omap_rdi(struct uart_omap_port *up, unsigned int lsr)
Which still results in only seeing 4-7 bytes of bad data. I'm assuming this is due to parity/framing/FIFO errors but I just can't seem to get any deeper to where I can see exactly where it's failing.
Questions:
Has anyone seen anything like this type of discarded data on the UART line with a linux kernel? If so would you know any possible avenues to take in tracking it down?
Is there anyplace in the kernel I can look to actually dump the bit-by-bit data that is coming into the UART Rx line?
Any help is greatly appreciated, as I'm at my wits' end trying to understand how and where exactly the kernel is discarding these bytes.
Platform:
BBB Debian Linux 3.8.13-bone53 armv7l GNU/Linux
16750 UART (I Believe)
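One avenue that can be tried from user space, sketched here under assumptions that are not part of the original post: if the port is opened through termios with INPCK and PARMRK set and IGNPAR and ISTRIP clear, the tty layer prefixes every byte that arrived with a parity or framing error with 0xFF 0x00, and doubles a literal 0xFF, instead of silently altering or dropping it (see termios(3)). Decoding that stream shows which bytes the kernel flagged. The struct and function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One received character plus a flag telling whether the tty layer
 * marked it as a parity or framing error. */
struct rx_char {
    uint8_t data;
    uint8_t marked;   /* 1 if it was preceded by the 0xFF 0x00 prefix */
};

/*
 * Decode a buffer read from a tty configured with INPCK | PARMRK
 * (IGNPAR and ISTRIP clear):
 *   0xFF 0x00 X -> X arrived with a parity or framing error
 *   0xFF 0xFF   -> a literal 0xFF data byte
 *   other byte  -> ordinary data
 * Note: a break condition can also show up as 0xFF 0x00 0x00, which
 * this sketch does not distinguish from a marked 0x00 byte.
 *
 * Returns the number of characters written to out (at most out_cap).
 * *consumed is set to the number of input bytes used; an incomplete
 * trailing escape sequence is left unconsumed for the next call.
 */
size_t parmrk_decode(const uint8_t *in, size_t in_len,
                     struct rx_char *out, size_t out_cap,
                     size_t *consumed)
{
    size_t i = 0, n = 0;

    while (i < in_len && n < out_cap) {
        if (in[i] != 0xFF) {
            out[n].data = in[i];
            out[n].marked = 0;
            i += 1;
        } else if (i + 1 >= in_len) {
            break;                      /* need the next byte */
        } else if (in[i + 1] == 0xFF) {
            out[n].data = 0xFF;         /* escaped literal 0xFF */
            out[n].marked = 0;
            i += 2;
        } else if (in[i + 1] == 0x00) {
            if (i + 2 >= in_len)
                break;                  /* need the data byte */
            out[n].data = in[i + 2];
            out[n].marked = 1;
            i += 3;
        } else {
            out[n].data = 0xFF;         /* unexpected: pass through */
            out[n].marked = 0;
            i += 1;
        }
        n++;
    }
    *consumed = i;
    return n;
}
```

With the parity hack, the marked flag effectively carries the ninth bit relative to the configured parity, so a count of decoded characters that still falls short of 9 would point below the tty layer (driver or FIFO) rather than at parity handling.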

Related

What could cause EDAC errors without reporting them, or without actual ECC errors?

I've got a ZynqU+ board that I've built and am running embedded Linux on. Everything boots fine and initially runs fine. One problem, though, is that the ue_count in /sys/devices/system/edac/mc/mc0/ is incremented to 13 (ce_count is 0) every time I boot the board. There are no EDAC messages in dmesg or syslog about an uncorrectable error, and when I inspect the Zynq's DDR module registers (https://www.xilinx.com/htmldocs/registers/ug1087/ug1087-zynq-ultrascale-registers.html search for "DDRC Module"), the status registers containing the CE & UE counts are 0, along with all related registers.
Additionally, if I stress the system with a lot of constant read/write operations to a temp folder, I will eventually (after 10-30 minutes) see EDAC errors printed to the console. This is often followed by a kernel panic. If the system does not panic and I check the locations above again, ce_count and ue_count have incremented, syslog now contains EDAC error messages, and the Zynq DDRC module's registers contain values where they were 0 before. (Interestingly, this does not apply to the CE & UE count register, which remains 0. Perhaps EDAC clears it after reporting?)
I have tested this build on half a dozen different boards and they all show exactly the same behavior. Because of that I have trouble believing these ECC errors are real, but I'm not sure what other explanation there could be. Perhaps I misconfigured something in Linux?
The ue_count of 13 on boot really mystifies me, though. How can EDAC increment it without reporting any errors, and how can it increment it while the Zynq module it is registered to shows no signs of ECC activity?
Any advice on things to check, diagnostics to perform, experience with ECC errors, or anything really would be helpful, as I'm mostly at a loss on this problem.

What is the difference within the compiler between debugging and running the code? (STM32)

Somehow, when I am running my code, it seems like one GPIO port isn't being initialized, whereas when I am debugging, it is.
I am initializing two sensors:
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
Sensor Heater 1 is not receiving any data, while Sensor Heater 2 is. Now if I swap the names of the heaters:
struct MAX31856_t max31856_temperature_sensor_heater_2 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(
TEMP_SENSOR_0_CS_GPIO_Port, TEMP_SENSOR_0_CS_Pin), &spi1));
struct MAX31856_t max31856_temperature_sensor_heater_1 = MAX31856_TPL( SPI_DEV_TPL( IO_PIN_TPL(TEMP_SENSOR_1_CS_GPIO_Port, TEMP_SENSOR_1_CS_Pin), &spi1));
and run the code in the debugger, both Sensor Heater 1 and 2 receive data.
How can this happen? I was thinking of a timing problem, but since it works in the debugger, I don't really know what to do.
This assumes you are debugging and running the same binary. Debugging is mostly the same as running, except when you halt the processor (e.g. at breakpoints).
In that case...
some peripherals may keep running or may be halted together with the CPU; in some cases this behaviour can be configured (timers, watchdog...).
some interrupts can be lost.
some hardware buffers can overflow and data can be lost (if you don't use any flow control in your I/O).
How do you run the code in debug mode? Do you have breakpoints somewhere?
You (OP) are right that it is most likely a timing problem, probably related to the physical SPI transmission. Your line of code to send/receive something over SPI has already executed in the MCU, but the bits and bytes are still physically being transmitted on the line while the MCU is already calling the next SPI function, so one of the transmissions fails. Try adding a delay after the SPI transmission code. If things work after that, then it is the timing of the SPI peripheral, and you need to check that no SPI transmission is already in progress before you call a function to send/receive something.
You can do while(transmission) (pseudocode; replace it with an actual check for whether an SPI transmission is in progress) to wait until the previous transmission ends before starting the next one.
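A minimal sketch of such a check, with a timeout so a stuck peripheral cannot hang the firmware. It assumes the busy flag is bit 7 of the SPI status register, as on many STM32 families (the BSY bit in SPI_SR); verify this against the reference manual for the actual part. The helper name and the poll limit are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 7 of the SPI status register is the busy flag
 * (BSY in SPI_SR on many STM32 parts). Check the reference manual. */
#define SPI_BUSY_MASK (1u << 7)

/*
 * Hypothetical helper: poll the status register until the busy bit
 * clears or max_polls reads have been made.
 * Returns true if the peripheral was seen idle, false on timeout.
 */
bool spi_wait_idle(const volatile uint32_t *status_reg, uint32_t max_polls)
{
    while (max_polls-- > 0) {
        if ((*status_reg & SPI_BUSY_MASK) == 0)
            return true;    /* previous transfer has finished */
    }
    return false;           /* still busy after max_polls reads */
}
```

On the target this would be called with the address of the peripheral's status register (for example &SPI1->SR with the CMSIS device headers) before each new transfer and before releasing chip select.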

PCIe UIO multi-DWORD access issues

I have an Intel FPGA PCIe endpoint. It shows up correctly in lspci and all of the lspci -vv information looks correct (memory map, IRQ, BAR0 size all look OK). I want to stream some data over BAR0 and read/write status registers inside of my IP. My host machine has an Intel x86_64 CPU and running a Debian OS.
What I'm currently doing is this:
1. open() call to /sys/class/uio/uio0/device/resource0 -> returns a file descriptor.
2. mmap 4 KB on that file descriptor with PROT_READ + PROT_WRITE protection, and MAP_SHARED flags. Offset is 0. --> returns a pointer.
3. In a loop, set offsets relative to the pointer to random numbers. Do this for ~1000 bytes, 4 bytes at a time. After each write to the pointer, call msync.
4. Read back the pointer one DWORD at a time.
5. Read back the pointer in bulk using memcpy.
The outcome of step #4 is that most of the data looks correct, but some does not (which is strange). The outcome of step #5 is that I get 0xffff_ffff for everything, which is even stranger.
If I try to replace step #3 with a single memset / msync sequence, the program hangs for a little while and then returns a bus error. After this, lspci states that BAR0 is disabled and I can no longer interact with it.
Any ideas what I'm doing wrong? It could be a HW issue, but the HW is really simple right now. I configured the FPGA to act purely as a slave device, with its read response being the registered write data coming across the BAR0 interface. The IP I am working with only has a read-response-valid line (no write response valid, somehow) which I have hard-coded to 1. Seems that burst sizes more than 1 cause some kinds of issues with the PCIe core as seen in the memcpy/memset issues, but I don't see why that would be the case.
EDIT/UPDATE:
I was able to work around this. Supposedly the MMIO region only supports 32-bit accesses, so writing a loop that accesses the pointer 32 bits at a time is the solution here.
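A minimal sketch of that workaround, assuming the pointer returned by mmap is cast to volatile uint32_t * and that the offsets used are 4-byte aligned. The volatile qualifier makes the compiler perform one 32-bit access per element, whereas memcpy and memset are free to use wider loads and stores. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Write n_words 32-bit words to the mapped BAR, one DWORD store
 * per loop iteration. */
void mmio_write32_buf(volatile uint32_t *bar, const uint32_t *src,
                      size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        bar[i] = src[i];
}

/* Read n_words 32-bit words from the mapped BAR, one DWORD load
 * per loop iteration. */
void mmio_read32_buf(const volatile uint32_t *bar, uint32_t *dst,
                     size_t n_words)
{
    for (size_t i = 0; i < n_words; i++)
        dst[i] = bar[i];
}
```

These replace the memcpy and memset calls on the mapped region; ordinary buffers on the host side can still be filled any way you like before or after the copy.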

Dismiss or Handle Data Abort when AXI transaction replies an error

Background
I have a ZynqMP system with four Cortex-A53 cores (PS) along with FPGA logic (PL). They transfer data over the AXI bus.
I've placed several Xilinx AXI Quad SPI cores in my design. Linux, running on the PS, probes them successfully and starts daemons that periodically (at 333 Hz) ask the MCUs on the SPI buses to reply with their data chunk (up to around 500 bytes, split into 64-byte pieces).
They work nicely for a while (median 50 minutes), but then readl_relaxed() in the SPI driver suddenly causes a Synchronous External Abort, which leads to a kernel panic. According to the ARM TRM it seems to be an AXI error reply, and it might be recoverable because it is "synchronous", which (in my understanding) means the registers are not corrupted.
After some searching I found the do_sea() function that handles SEAs, and according to the implementation there is no chance to recover from one.
I want the AXI error to be handled in some way, e.g. discard the read, return SIGBUS and let the process be killed.
Of course I'm also debugging the abort to find out why it occurs, but at present I have no clue.
Question
So my questions are:
1. Why are SEAs not recoverable in the Linux arm64 implementation?
2. If I can "handle" or "ignore" it, how do I modify the Linux kernel code? (I know it's stupid, but I'd like to know if there's a way.)
3. What can cause an error reply in the Quad SPI IP? The readl_relaxed I mentioned above reads the Rx data FIFO.
1) I’ve never ventured down this path, but it looks to me like they are recoverable if the inf->fn returns 0; which means that ghes_notify_sea() must return 0; thus one of the SEA error sources successfully reported an error.
2) I think you need a bit more info. I would start by changing
drivers/acpi/apei/ghes.c:732
from:
rc = ghes_read_estatus(ghes, 0);
to:
rc = ghes_read_estatus(ghes, 1);
which should get you a bit more information when the error happens.
Armed with that information, you need to find out if you have a malfunctioning handler, or a missing one. Either way, this is the place to address it.
3) You are dealing with an ACPI implementation. There are 155 kloc in the kernel plus unknown quantity in the firmware and hardware. The kernel code doesn’t appear to handle whichever condition you are running into. First you need to determine which of these suspects is involved and what interactions are failing before you can dig out the root cause.
Happy Digging!

UART works in one ATmega128 board but fails with same hex file in another

I have been working with the ATmega128 and similar parts for about 2 years and have used a UART library for serial transmission. I am pretty sure the library is correct because I have used it hundreds of times, but for the past few months I have not been able to get UART working on my ATmega128. I am sure that my hardware is correct and I am sure of my code. On top of that, the same hex file runs fine on two other ATmega boards but not on the others.
The PORTs give 5V output when all pins are coded as outputs.
Statements before any UART function execute, and after that it stops (does nothing: no UART, not even the statements after the UART call).
I tried copy-pasting the UART code completely into main.c, and then it worked.
Please help! I have no idea what is going on.
Well, after a lot of struggle, I finally found the problem and solved it. When I checked the fuse bits of the working microcontroller, I found that its Extended Fuse bits were different from those of the non-working one. I changed the Extended Fuse bits to 0xFF and the problem was solved.
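To see what a given extended fuse value means, a small decoder can help. This sketch assumes the ATmega128 extended fuse layout as I read it in the datasheet: bit 1 is M103C (ATmega103 compatibility mode) and bit 0 is WDTON, and a fuse bit that reads 0 counts as programmed. The value 0xFD in the comments is only an example; the post does not say what the non-working board had.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in the ATmega128 extended fuse byte.
 * Verify against the datasheet before relying on them. */
#define EFUSE_WDTON (1u << 0)   /* watchdog always on */
#define EFUSE_M103C (1u << 1)   /* ATmega103 compatibility mode */

/* AVR fuse bits are active low: a bit that reads 0 is "programmed"
 * (feature enabled), a bit that reads 1 is "unprogrammed". */
bool fuse_programmed(uint8_t fuse_byte, uint8_t mask)
{
    return (fuse_byte & mask) == 0;
}

/* Example readings:
 *   fuse_programmed(0xFD, EFUSE_M103C) is true  (compatibility mode on)
 *   fuse_programmed(0xFF, EFUSE_M103C) is false (compatibility mode off)
 */
```

Reading the extended fuse from both a working and a non-working board and running each value through this check shows directly which feature differs between them.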
