How to boot ddr memory of an FPGA? - fpga

I have a Nexys 4 DDR board with 128 MiB of on-board DDR2 memory, which I access through the Memory Interface Generator (MIG) IP in Vivado. The BRAM IP accepts a .coe file that initializes the BRAMs, but I cannot find an equivalent way to initialize the DDR memory with data. I have an Ibex processor that uses this memory as its main memory, and I don't know how to get my compiled code into the DDR2. Can anyone help? Is there a way to preload this memory with initial data as easily as with BRAMs?

To use an external memory such as DDR2 as your processor's main memory, you need a bootloader. A bootloader is a small program that can run from the BRAM inside the FPGA. When the board powers up, the bootloader reads the main program from external non-volatile memory (such as SPI flash), copies it into the external DDR2, and then jumps to it.
I am not familiar with the processor you are using, but Xilinx provides an SREC bootloader template in Vitis for its MicroBlaze processors. You can use that as a starting point for writing your own bootloader.
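The copy step of such a bootloader can be sketched in plain C. This is only an illustration under stated assumptions: the header layout, BOOT_MAGIC and boot_load() are hypothetical names that are not part of any Xilinx template, and the flash and DDR regions are modelled as ordinary byte arrays so the logic can be tried on a host machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical image header stored at the start of the flash region. */
typedef struct {
    uint32_t magic;     /* marks a valid image */
    uint32_t length;    /* payload size in bytes */
    uint32_t checksum;  /* simple byte sum of the payload */
} boot_header_t;

#define BOOT_MAGIC 0xB007C0DEu  /* arbitrary value chosen for this sketch */

static uint32_t byte_sum(const uint8_t *p, uint32_t n)
{
    uint32_t s = 0;
    while (n--)
        s += *p++;
    return s;
}

/* Copy the payload from flash into DDR and verify it.
 * Returns 0 on success, -1 if the image is missing, too large or corrupt. */
int boot_load(const uint8_t *flash, size_t flash_size,
              uint8_t *ddr, size_t ddr_size)
{
    boot_header_t h;

    assert(flash != NULL && ddr != NULL);
    if (flash_size < sizeof h)
        return -1;
    memcpy(&h, flash, sizeof h);
    if (h.magic != BOOT_MAGIC)
        return -1;
    if (h.length > flash_size - sizeof h || h.length > ddr_size)
        return -1;
    memcpy(ddr, flash + sizeof h, h.length);
    return byte_sum(ddr, h.length) == h.checksum ? 0 : -1;
}
```

On real hardware the two pointers would be the memory-mapped flash (or a buffer filled by an SPI driver) and the MIG address window, and a successful boot_load() would be followed by a jump to the program's entry point in DDR.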

Related

How to Write data from FPGA to DDR3 memory without PS Logic

I'm using a Zynq-7000 family FPGA. I want to write data from the FPGA fabric to a Micron DDR3 SDRAM without using the PS (PL only). I'm new to memory-based designs; could I get some help designing the logic in the PL, or any references?
Thanks in advance.
The biggest question is this: how is your "micron ddr3 sdram" physically connected to the FPGA? Is it pinned out for the PS-side? Or the PL? There are dedicated pins on the FPGA just for PS side memory. Now, if you absolutely must have PL logic interface with PS memory, then you can open an AXI port on the Zynq PS side to allow PL logic to get at the PS memory space. That's the only way to do it.
On the other hand, if the DDR is correctly pinned out to PL, then you can use the Xilinx Memory Interface Generator (MIG) IP core to build the PL-side logic to interface with it. See here.

DMA on FPGA Cannot Access Kernel Memory Allocated with GFP_KERNEL Flag

I would first like to give a brief description of the scenario that I am working on.
What I am trying to accomplish is to load image data from my user space application and transfer it over PCIe to a custom acceleration engine located inside an FPGA board.
The specifications of my host machine are:
Intel Xeon Processor with 16G ram.
64 Bit Debian Linux with kernel version 4.18.
The FPGA is a Virtex 7 KC705 development board.
The FPGA uses a PCIe controller (bridge) for the communication between the PCIe infrastructure and the AXI interface of the FPGA.
In addition, the FPGA is equipped with a DMA engine which is supposed to read data from kernel memory through the PCIe controller and forward it to the accelerator.
Since in future implementations I would like to make multiple kernel allocations up to 256M, I have configured my kernel to support CMA and DMA Contiguous Allocator.
According to dmesg I can verify that my system reserves at startup the CMA area.
Regarding the acceleration procedure:
The driver initially allocates 4M kernel memory by using the dma_alloc_coherent() with GFP_KERNEL flag. This allocation is inside the range of the CMA.
Then from my user space application I call mmap with the PROT_READ/PROT_WRITE and MAP_SHARED/MAP_LOCKED flags to map the previously allocated CMA memory and load the image data into it.
Once the image data is loaded I forward the dma_addr_t physical address of the CMA allocated memory and I start the DMA to transfer the data to the accelerator. When the acceleration is completed the DMA is supposed to write the processed data back to the same CMA kernel allocated memory.
On completion the user space application reads the processed data from the CMA memory and saves it to a .bmp file. When I check the "processed" image it is the same as the original one. I suppose that the processed data were never written to the CMA memory.
Is there some kind of memory protection that does not allow writing to the CMA memory when using GFP_KERNEL flag?
An interesting fact is that when I allocate kernel memory with dma_alloc_coherent but with either GFP_ATOMIC or GFP_DMA the processed data are written correctly to the kernel memory but unfortunately the allocated memory does not belong to the range of the CMA area.
What is wrong in my implementation?
Please let me know if you need more information!
In order to use mmap() I have adopted the debugfs file operations method.
Initially, I open a debugfs file as follows:
shared_image_data_file = open("/sys/kernel/debug/shared_image_data_mmap_value", O_RDWR);
The shared_image_data_mmap_value is my debugfs file which is created in my kernel driver and the shared_image_data_file is just an integer.
Then, I call mmap() from userspace as follows:
kernel_address = (unsigned int *)mmap(0, (4 * MBYTE), PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, shared_image_data_file, 0);
When I call the mmap() function in user space the mmap file operation of my debugfs file executes the following function in the kernel driver:
dma_mmap_coherent(&dev->dev, vma, shared_image_data_virtual_address, shared_image_data_physical_address, length);
The shared_image_data_virtual_address is a pointer of type uint64_t while the shared_image_data_physical_address is of type dma_addr_t, and they were created earlier when I used the following code to allocate memory in kernel space:
shared_image_data_virtual_address = dma_alloc_coherent(&dev->dev, 4 * MBYTE, &shared_image_data_physical_address, GFP_KERNEL);
The address that I pass to the DMA of the FPGA is the shared_image_data_physical_address.
I hope that the above are helpful.
Thank you!
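The user-space mapping step described above can be written with explicit error checks, which makes it easier to rule out a failed mmap() when debugging. This is a minimal sketch, not the poster's code: map_buffer() is a hypothetical helper, and MAP_LOCKED is left out so the sketch does not depend on the process's locked-memory limit.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of the file at `path` read/write and shared.
 * Returns the mapping, or NULL if open() or mmap() fails. */
void *map_buffer(const char *path, size_t len)
{
    void *p;
    int fd;

    assert(path != NULL && len > 0);
    fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

In the scenario above the path would be /sys/kernel/debug/shared_image_data_mmap_value and the length 4 * MBYTE; the region should be released with munmap() when it is no longer needed.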

How the instructions and data are organised in a MicroBlaze MCS?

I'm studying the MicroBlaze Micro Controller System (MCS) that I've implemented in my FPGA, and I want to understand how this MCU works. Consider this block diagram:
MicroBlaze MCS block diagram
We can see that the processor is connected to a BRAM module through two 32-bit buses. One is the ILMB (Instruction Local Memory Bus) and the other is the DLMB (Data Local Memory Bus), and each is connected to a different port of the BRAM module. So here is my question: in a Harvard architecture, aren't the program instructions and the data memory supposed to be separate? When we generate the system with the Xilinx IP Core Generator, is the memory size we enter the size for the program instructions, for the RAM, or for both?
Memory size?
I searched the defines in the xparameters.h header file for the memory addresses of the ILMB and the DLMB, and found that both have the same address range.
#define XPAR_DLMB_CNTLR_BASEADDR 0x00000000
#define XPAR_DLMB_CNTLR_HIGHADDR 0x00003FFF
#define XPAR_ILMB_CNTLR_BASEADDR 0x00000000
#define XPAR_ILMB_CNTLR_HIGHADDR 0x00003FFF
The fact that both instructions and data are mapped to the same addresses confuses me. Can someone tell me where I'm wrong?
In this case, the ILMB and DLMB buses share the same physical memory, and the memory size parameter covers both. By default, a MicroBlaze system is configured with a shared data and instruction memory space.
However, having two separate buses lets you configure the system so that data and instructions reside in entirely different address spaces (or physical devices). For example, the ILMB can be configured to address a ROM while data accesses go to a completely separate block of RAM.
MicroBlaze is a highly configurable CPU, and separate memory buses are one of its configuration options, though one that is needed only in rare cases. Most of the time both buses share the same BRAM address space.
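The size of that shared memory follows directly from the defines quoted in the question. A small sketch of the arithmetic, using those values; range_size() and lmb_ranges_shared() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the xparameters.h excerpt in the question. */
#define XPAR_DLMB_CNTLR_BASEADDR 0x00000000
#define XPAR_DLMB_CNTLR_HIGHADDR 0x00003FFF
#define XPAR_ILMB_CNTLR_BASEADDR 0x00000000
#define XPAR_ILMB_CNTLR_HIGHADDR 0x00003FFF

/* Size in bytes of an inclusive [base, high] address range. */
static uint32_t range_size(uint32_t base, uint32_t high)
{
    assert(high >= base);
    return high - base + 1u;
}

/* 1 if both LMB controllers decode exactly the same range, else 0. */
static int lmb_ranges_shared(void)
{
    return XPAR_ILMB_CNTLR_BASEADDR == XPAR_DLMB_CNTLR_BASEADDR &&
           XPAR_ILMB_CNTLR_HIGHADDR == XPAR_DLMB_CNTLR_HIGHADDR;
}
```

With these values range_size() gives 0x4000 bytes, i.e. 16 KiB, and that single 16 KiB block holds both the program and its data.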

How do I read large amounts of data from an AXI4 bus

I'm building something on a zybo board, so using a Zynq device.
I'd like to write into main memory from the CPU, and read from it with the FPGA in order to write the CPU results out to another device.
I'm pretty sure that I need to use the AXI bus to do this, but I can't work out the best approach to the problem. Do I:
1. Make a full AXI peripheral myself? Presumably a master which issues read requests to main memory, and then has them fulfilled. I'm finding it quite hard to find resources on how to actually make an AXI peripheral. Where would I start looking for straightforward explanations?
2. Use one of the Xilinx IP cores to handle the AXI bus for me? There are quite a few of them, and I'm not sure which is the best one to use.
Whatever it is, it needs to be fast, and it needs to be able to do large reads from the DDR memory on my board. That memory needs to also be writable by the CPU.
Thanks!
An easy option is to use the AXI-Stream FIFO component in your block diagram. Then you can code up an AXI-Stream slave to receive the data. So the ARM would write via AXI to the FIFO, and your component would stream data out of the FIFO. No need to do any AXI work.
Take a look at Xilinx's PG080 for details.
If you have access to the Vivado HLS tool, transferring data from main memory to FPGA memory (e.g., BRAM) in bursts would be one solution.
You just need to use memcpy in your code; the synthesis tool then automatically generates the AXI master IP, which is fast and reliable.
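A minimal sketch of what such HLS source might look like, written so that it also compiles with an ordinary C compiler. The function name, BLOCK_WORDS and the placeholder processing are assumptions for illustration; in the HLS tool the ddr argument would additionally need an interface directive that maps it to an AXI master port, which is omitted here.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_WORDS 256  /* hypothetical burst size in 32-bit words */

/* Candidate top-level function for HLS. `ddr` points into main memory and
 * `out` stands in for whatever consumes the data inside the fabric. */
void read_block(const int *ddr, int out[BLOCK_WORDS])
{
    int buf[BLOCK_WORDS];          /* expected to be implemented in BRAM */
    int i;

    assert(ddr != NULL && out != NULL);
    memcpy(buf, ddr, sizeof buf);  /* expected to be inferred as a burst read */
    for (i = 0; i < BLOCK_WORDS; i++)
        out[i] = buf[i] + 1;       /* placeholder for the real processing */
}
```

Copying into a local buffer first keeps the DDR access as one contiguous transfer, which is the pattern the tool can turn into a burst.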
Option 1: Create your own AXI master. You would probably need to create a AXI slave for configuration purposes as well.
I found this article quite helpful to get started with AXI:
http://silica.com/wps/wcm/connect/88aa13e1-4ba4-4ed9-8247-65ad45c59129/SILICA_Xilinx_Designing_a_custom_axi_slave_rev1.pdf?MOD=AJPERES&CVID=kW6xDPd
And of course, the full AXI reference specification is here:
http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI4_specification.pdf
Option 2: Use the Xilinx AXI DMA component to set up DMA transfers between DDR memory and AXI streams. You would need to interface your logic to the AXI-Stream ports of the Xilinx DMA component. AXI streams are typically easier to implement than a new high-performance AXI master.
This approach supports very high bandwidths, and can do both continuous streams and packet-based transfers. It also supports metadata for each packet.
The Xilinx AXI DMA component is here:
http://www.xilinx.com/products/intellectual-property/axi_dma.html
Xilinx also provides software drivers for this.
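For a sense of what the CPU side of Option 2 involves without the Xilinx driver, here is a sketch of starting a memory-to-stream (MM2S) transfer in the core's simple register mode. The register offsets and bit positions are given as I understand them from the AXI DMA product guide and should be checked against the version of the core you instantiate; mm2s_start() and mm2s_idle() are hypothetical helper names, and the register block is passed in as a pointer so the sequence can be exercised against a plain array.

```c
#include <assert.h>
#include <stdint.h>

/* MM2S channel registers (byte offsets), assumed from the product guide. */
#define MM2S_DMACR  0x00u  /* control */
#define MM2S_DMASR  0x04u  /* status */
#define MM2S_SA     0x18u  /* source address (lower 32 bits) */
#define MM2S_LENGTH 0x28u  /* transfer length in bytes */
#define DMACR_RS    (1u << 0)  /* run/stop */
#define DMASR_IDLE  (1u << 1)  /* channel idle */

static void reg_write(volatile uint32_t *base, uint32_t off, uint32_t v)
{
    base[off / 4u] = v;
}

static uint32_t reg_read(volatile uint32_t *base, uint32_t off)
{
    return base[off / 4u];
}

/* Start streaming `nbytes` from physical address `src_addr`. */
void mm2s_start(volatile uint32_t *base, uint32_t src_addr, uint32_t nbytes)
{
    assert(base != 0 && nbytes > 0);
    reg_write(base, MM2S_DMACR, reg_read(base, MM2S_DMACR) | DMACR_RS);
    reg_write(base, MM2S_SA, src_addr);
    reg_write(base, MM2S_LENGTH, nbytes);  /* writing the length starts the transfer */
}

/* Non-zero once the channel reports idle, i.e. the transfer has finished. */
int mm2s_idle(volatile uint32_t *base)
{
    return (reg_read(base, MM2S_DMASR) & DMASR_IDLE) != 0;
}
```

On the Zynq, base would be the mapped base address of the DMA core, and the CPU must flush the source buffer from its data cache before starting the transfer unless the DMA is attached through a coherent port.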

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
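The idea of a translation table from logically contiguous addresses to scattered DMA addresses can be illustrated with a flat per-page table. This is a generic sketch and not the table format Connectal actually uses; the 4 KiB page size and all names here are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XLAT_PAGE_SHIFT 12u                          /* assumed 4 KiB pages */
#define XLAT_PAGE_MASK  ((1u << XLAT_PAGE_SHIFT) - 1u)

typedef struct {
    const uint32_t *page_base;  /* DMA address of each logical page, in order */
    size_t num_pages;
} xlat_table_t;

/* Translate a logical offset into a DMA address.
 * Returns 0 and stores the result in *out, or -1 if out of range. */
int xlat_translate(const xlat_table_t *t, uint32_t logical, uint32_t *out)
{
    size_t page = logical >> XLAT_PAGE_SHIFT;

    assert(t != NULL && out != NULL);
    if (page >= t->num_pages)
        return -1;
    *out = t->page_base[page] + (logical & XLAT_PAGE_MASK);
    return 0;
}
```

A table like this grows with the number of pages, which is presumably why an allocator that returns larger contiguous chunks yields a more compact table.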
The Zynq has NEON instructions, and an assembly implementation of memcpy that uses NEON instructions on data aligned to cache-line boundaries (32 bytes) will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered that the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, though I still need to add code to invalidate/flush whenever I transfer ownership to or from the device.
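Whichever mapping is used, explicit invalidate/flush operations act on whole cache lines, so the range handed to them has to be widened to line boundaries. A small sketch of that arithmetic, assuming the 32-byte line size mentioned above; cache_range() is a hypothetical helper and performs no cache maintenance itself.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* bytes, as stated above for the Zynq-7000 */

static uintptr_t line_down(uintptr_t a)
{
    return a & ~(uintptr_t)(CACHE_LINE - 1u);
}

static uintptr_t line_up(uintptr_t a)
{
    return (a + CACHE_LINE - 1u) & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Widen [addr, addr + len) to whole cache lines. */
void cache_range(uintptr_t addr, uintptr_t len,
                 uintptr_t *start, uintptr_t *size)
{
    assert(start != 0 && size != 0);
    *start = line_down(addr);
    *size = line_up(addr + len) - *start;
}
```

If a DMA block does not start and end on a line boundary, the widened range also covers neighbouring data, which is the false-sharing hazard that aligning the ring-buffer blocks avoids.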
