PCIe UIO multi-DWORD access issues

PCIe UIO multi-DWORD access issues - fpga

I have an Intel FPGA PCIe endpoint. It shows up correctly in lspci and all of the lspci -vv information looks correct (memory map, IRQ, BAR0 size all look OK). I want to stream some data over BAR0 and read/write status registers inside of my IP. My host machine has an Intel x86_64 CPU and running a Debian OS.
What I"m currently doing is this:
open() call to /sys/class/uio/uio0/device/resource0 -> returns a file descriptor.
mmap 4 KB on that file descriptor with PROT_READ + PROT_WRITE protection, and MAP_SHARED flags. Offset is 0. --> returns a pointer.
In a loop, set offsets relative to the pointer to random numbers. Do this for ~1000 bytes, 4 bytes at a time. After each write to the pointer, call msync.
Read back the pointer one DWORD at a time.
Read back the pointer in bulk using memcpy.
The outcome of step #4 is that most of the data looks correct, but some is not (which is strange). The outcome of number 5 is that I get 0xffff_ffff for everything, which is even stranger.
If I try to replace step #3 with a single memset / msync sequence, the program hangs for a little while and then returns a bus error. After this, lspci states that BAR0 is disabled and I can no longer interact with it.
Any ideas what I'm doing wrong? It could be a HW issue, but the HW is really simple right now. I configured the FPGA to act purely as a slave device, with its read response being the registered write data coming across the BAR0 interface. The IP I am working with only has a read-response-valid line (no write response valid, somehow) which I have hard-coded to 1. Seems that burst sizes more than 1 cause some kinds of issues with the PCIe core as seen in the memcpy/memset issues, but I don't see why that would be the case.
EDIT/UPDATE:
I was able to work around this. Supposedly the MMIO is only for 32b, so writing a loop that access the pointer 32b at a time is the solution here.

Related

Can I call dma_map_single() on DeviceB using an addresses returned from dma_alloc_coherent on DeviceA?

I am writing custom linux driver that needs to DMA memory around between multiple PCIE devices. I have the following situation:
I'm using dma_alloc_coherent to allocate memory for DeviceA
I then use DeviceA to fill the memory buffer.
Everything is fine so far but at this point I would like to DMA the
memory to DeviceB and I'm not sure the proper way of doing it.
For now I am calling dma_map_single for DeviceB using the
address returned from dma_alloc_coherent called on DeviceA. This
seems to work fine in x86_64 but it feels like I'm breaking the rules
because:
dma_map_single is supposed to be called with memory allocated from kmalloc ("and friends"). Is it problem being called with an address returned from another device's dma_alloc_coherent call?
If #1 is "ok", then I'm still not sure if it is necessary to call the dma_sync_* functions which are needed for dma_map_single memory. Since the memory was originally allocated from dma_alloc_coherent, it should be uncached memory so I believe the answer is "dma_sync_* calls are not necessary", but I am not sure.
I'm worried that I'm just getting lucky having this work and a future
kernel update will break me since it is unclear if I'm following the API rules correctly.
My code eventually will have to run on ARM and PPC too, so I need to make sure I'm doing things in a platform independent manner instead of getting by with some x86_64 architecture hack.
I'm using this as a reference:
https://www.kernel.org/doc/html/latest/core-api/dma-api.html

dma_alloc_coherent() acts similarly to __get_free_pages() but as size granularity rather page, so no issue I would guess here.
First call dma_mapping_error() after dma_map_single() for any platform specific issue. dma_sync_*() helpers are used by streaming DMA operation to keep device and CPU in sync. At minimum dma_sync_single_for_cpu() is required as device modified buffers access state need to be sync before CPU use it.

Using SetFilePointer to change the location to write in the sector doesn't work?

I'm using SetFilePointer to rewrite the second half of the MBR with something, its a user-mode application and i opened a handle to PhysicalDrive
At first i tried to set the size parameter in WriteFile to 256 but the writefile gave the INVALID_PARAMETER error, as it turns out based on some search on other questions here it seems like this is because we are forced to write in multiplicand of the sector size when the handle is PhysicalDrive for some reason
then i tried to set the filePointer to 256, and Write 512 bytes, both of them return no error, but for some unknown reason it writes from the beginning of the sector! as if the SetFilePointer didn't work even tho the return value of SetFilePointer is OK and it returns 256
So my questions is :
Why the write size have to be multiplicand of sector size when the handle is PhysicalDrive? which other device handles are like this?
Why is this happening and when I set the file pointer to 256, WriteFile still writes from the start?
isn't this really redundant, considering that even if I want to change 1 byte then I have to read the entire sector, change the one byte and then write it back, instead of just writing 1 byte, it seems like 10 times more overhead! isn't there a faster way to write a few bytes in a sector?

I think you are mixing the file system and the storage (block device). File system stays above storage device stack. If your code obtains a handle to a file system device, you can write byte by byte. But if you are accessing storage device stack, you can only write sector by sector (or block size).
Directly writing to block device is definitely slow as you discovered. However, in most cases, people just talk to file systems. Most file system drivers maintain cache and use algorithms for both read and write to improve performance.
Can't comment on file pointer based offset before seeing the actual code. But I guess it might be not sector aligned or it's not used at all.

Dismiss or Handle Data Abort when AXI transaction replies an error

Background
I have an ZynqMP system which has four Cortex-A53 cores (PS) along with FPGA logic (PL). They transfer data via AXI bus.
I've placed some Xilinx AXI Quad SPI in my design. Linux which runs on PS successfully probes them, and starts a daemons which periodically (333 Hz) ask MCUs on SPIs to reply their data chunk (~ up to around 500 bytes, split in every 64 bytes.)
They works nicely for a while (median 50 minutes) but suddenly the readl_relaxed() in SPI driver causes Synchronous External Abort which leads an Kernel Panic. It seems to be an AXI's error reply according to ARM TRM, and might be recoverable because it's "synchronous" which means the registers are not corrupted (in my understanding.)
After some search I found the do_sea() func that handles SEA and also found that there's no chance to recover from it according to the implementation.
I want the AXI error to be handled like: discard the read, return SIGBUS and lead the process to be killed, etc.
Of course I'm debugging the Abort and finding why it occurs but at present I have no clue.
Question
So my questions are:
Why SEAs are not recoverable in Linux arm64 implementation?
If I can "handle" or "ignore" it, how do I modify Linux kernel code (I know it's stupid but I'd like to know if there's a way.)
What can reply error in Quad SPI IP? The readl_relaxed I mentioned above reads Rx data FIFO.

1) I’ve never ventured down this path, but it looks to me like they are recoverable if the inf->fn returns 0; which means that ghes_notify_sea() must return 0; thus one of the SEA error sources successfully reported an error.
2) I think you need a bit more info. I would start by changing
drivers/acpi/apei/ghes.c:732
from:
rc = ghes_read_estatus(ghes, 0);
to:
rc = ghes_read_estatus(ghes, 1);
which should get you a bit more information when the error happens.
Armed with that information, you need to find out if you have a malfunctioning handler, or a missing one. Either way, this is the place to address it.
3) You are dealing with an ACPI implementation. There are 155 kloc in the kernel plus unknown quantity in the firmware and hardware. The kernel code doesn’t appear to handle whichever condition you are running into. First you need to determine which of these suspects is involved and what interactions are failing before you can dig out the root cause.
Happy Digging!

Intel Pin Tool: Get instruction from address

I'm using Intel's Pin Tool to do some binary instrumentation, and was wondering if there an API to get the instruction byte code at a given address.
Something like:
instruction = getInstructionatAddr(addr);
where addr is the desired address.
I know the function Instruction (used in many of the simple/manual examples) given by Pin gets the instruction, but I need to know the instructions at other addresses. I perused the web with no avail. Any help would be appreciated!
CHEERS

wondering if there an API to get the instruction byte code at a given
address
Yes, it's possible but in a somewhat contrived way: with PIN you are usually interested in what is executed (or manipulated through the executed instructions), so everything outside the code / data flow is not of any interest for PIN.
PIN is using (and thus ships with) Intel XED which is an instruction encoder / decoder.
In your PIN installation you should have and \extra folder with two sub-directories: xed-ia32 and xed-intel64 (choose the one that suits your architecture). The main include file for XED is xed-interface.h located in the \include folder of the aforementioned directories.
In your Pintool, given any address in the virtual space of your pintooled program, use the PIN_SafeCopy function to read the program memory (and thus bytes at the given address). The advantage of PIN_SafeCopy is that it fails graciously even if it can't read the memory, and can read "shadowed" parts of the memory.
Use XED to decode the instruction bytes for you.
For an example of how to decode an instruction with XED, see the first example program.
As the small example uses an hardcoded buffer (namely itext in the example program), replace this hardcoded buffer with the destination buffer you used in PIN_SafeCopy.
Obviously, you should make sure that the memory you are reading really contains code.
AFAIK, it is not possible to get an INS type (the usual type describing an instruction in PIN) from an arbitrary address as only addresses in the code flow will "generate" an INS type.
As a side note:
I know the function Instruction (used in many of the simple/manual
examples) given by Pin gets the instruction
The Instruction routine used in many PIN example is called an "Instrumentation routine": its name is not relevant in itself.

Pin_SafeCopy may help you. This API could copy memory content from the address space of target process to one specified buffer.

Physical Memory Allocation in Kernel

I am writting a Kernel Module that is going to trigger and external PCIe device to read a block of data from my internel memory. To do this I need to send the PCIe device a pointer to the physical memory address of the data that I would like to send. Ultimately this data is going to be written from Userspace to the kernel with the write() function (userspace) and copy_from_user() (kernel space). As I understand it, the address that my kernel module will see is still a virtual memory address. I need a way to get the physical address of it so that the PCIe device can find it.
1) Can I just use mmap() from userspace and place my data in a known location in DDR memory, instead of using copy_from_user()? I do not want to accidently overwrite another processes data in memory though.
2) My kernel module reserves PCIe data space at initialization using ioremap_nocache(), can I do the same from my kernel module or is it a bad idea to treat this memory as io memory? If I can, what would happen if the memory that I try to reserve is already in use? I do not want to hard code a static memory location and then find out that it is in use.
Thanks in advance for you help.

You don't choose a memory location and put your data there. Instead, you ask the kernel to tell you the location of your data in physical memory, and tell the board to read that location. Each page of memory (4KB) will be at a different physical location, so if you are sending more data than that, your device likely supports "scatter gather" DMA, so it can read a sequence of pages at different locations in memory.
The API is this: dma_map_page() to return a value of type dma_addr_t, which you can give to the board. Then dma_unmap_page() when the transfer is finished. If you're doing scatter-gather, you'll put that value instead in the list of descriptors that you feed to the board. Again if scatter-gather is supported, dma_map_sg() and friends will help with this mapping of a large buffer into a set of pages. It's still your responsibility to set up the page descriptors in the format expected by your device.
This is all very well written up in Linux Device Drivers (Chapter 15), which is required reading. http://lwn.net/images/pdf/LDD3/ch15.pdf. Some of the APIs have changed from when the book was written, but the concepts remain the same.
Finally, mmap(): Sure, you can allocate a kernel buffer, mmap() it out to user space and fill it there, then dma_map that buffer for transmission to the device. This is in fact probably the cleanest way to avoid copy_from_user().

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio