What is the difference between IRQ and FIQ as per the Linux API wise? Are they use same api?
Is the difference only inside ARM core or is it do do with the kernel function calls also?
No, they use different APIs. The best place to look is in arch/arm/kernel/fiq.c of the kernel tree. It looks like there are a few drivers in the tree that use it that may be helpful as examples.
Related
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce-buffer is untenably slow due to horrible performance. IIUC, ARM caches are not DMA coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address and wants a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has neon instructions, and an assembly code implementation of memcpy using neon instructions, using aligned on cache boundary (32 bytes) will achieve 300 MB/s rates or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma_coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.
I am writing a device driver in Linux for which I need to implement DMA.
It is clear that DMA buffers can be allocated by a call to pci_alloc_consistent(). But how can we write commands to those buffers from user level?
Tasks include writing values to specific registers, how are these implemented using DMA commands?
I believe you can write with DMA through I/O operations that you may access through a GNU C library . You must use system calls such as ioperm or iopl and run as root to gain access to DMA registers. At least thats how one gains access to IO space which may be used for DMA access. Though I may not answer the question completely, hopefully this points you in a good direction.
I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic bit more I want to know if there is ANY way to (get the trace) see the actual dynamic reordering done for a given program?
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to actual reordering done inside the CPU (there is no publically known way to enable tracing). But there is some emulators of reordering and some of them can give you useful hints.
For modern Intel CPUs (core 2, nehalem, Sandy and Ivy) there is "Intel(R) Architecture Code Analyzer" (IACA) from Intel. It's homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool allows you to look how some linear fragment of code will be splitted into micro-operations and how they will be planned into execution Ports. This tool has some limitations and it is only inexact model of CPU u-op reordering and execution.
There are also some "external" tools for emulating x86/x86_84 CPU internals, I can recommend the PTLsim (or derived MARSSx86):
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models some "PTL" cpu, not real AMD or Intel CPU. The good news is that this PTL is Out-Of-Order, based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, in arbeit http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf is said that JavaHASE applet is capable of emulating different simple CPUs and even supports Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.
How do GDB watchpoints work? Can similar functionality be implemented to harness byte level access at defined locations?
On x86 there are CPU debug registers D0-D3 that track memory address.
This explains how hardware breakpoints are implemented in Linux and also gives details of what processor specific features are used.
Another article on hardware breakpoints.
I believe gdb uses the MMU so that the memory pages containing watched address ranges are marked as protected - then when an exception occurs for a write to a protected pages gdb handles the exception, checks to see whether the address of the write corresponds to a particular watchpoint, and then either resumes or drops to the gdb command prompt accordingly.
You can implement something similar for your own debugging code or test harness using mprotect, although you'll need to implement an exception handler if you want to do anything more sophisticated than just fail on a bad write.
Using the MMU or an MPU (on other processors such as embedded), can be used to implement "hardware watchpoints"; however, some processors (e.g., many Arm implementations) have dedicated watchpoint hardware accessed via a debug port. This has some advantages over using an MMU or MPU.
If you use the MMU or MPU approach:
PRO - There is no special hardware needed for application-class processors because an MMU is built-in to support the needs of Linux or Windows. In the case of specialized realtime-class processors, there is often an MPU.
CON - There will be software overhead handling the exception. This is probably not a problem for an Application class processor (e.g., x86); however, for embedded realtime-application, this could spell disaster.
CON- MMU or MPU faults may happen for other reasons, which means the handler will need to figure our exactly why it faulted by reading various status registers.
PRO - using MMU memory protection faults can often cover many separate variables to watchpoint many variables easily. However, this not normally required of most debugging situations.
If you use dedicated debug watchpoint hardware such as supported by Arm:
PRO - There is no impact on software performance (helps if debugging subtle timing issues). The debug infrastructure is designed to be non-intrusive.
CON - There are a limited number of these hardware units on any particular silicon. For Arm, there may be 0, 2 or 4 of these. So you need to be careful to choose. The units can cover a range of addresses, but there are limits. For some processors, they may even be limited to the region of memory.
I am having some trouble finding some suitable examples to solve my problem. I want to share 4K (4096) byte of data between user and kernel space. I found many ideas which says I have to allocate memory from kernel and mmap it in user-space. Can someone provide an example on how to do it in Linux 2.6.38. Is there any good document which explains it?
Thanks in advance.
Your proposed way is one way, but as userspace is not within your control (meaning any userspace program have a possibility of poking into the kernel), you are opening up the opportunities for malicious attack from userspace. This kernel-based memory-sharing-with-userspace is also described here:
http://www.scs.ch/~frey/linux/memorymap.html
Instead, how about allocating memory in userspace, and then from kernel use the API copy_from_user() and copy_to_user() to copy to/from userspace memory? If u want to share the memory among the different processes, then u can always use IPC related API to allocate and define the memory, eg shmget() etc. And in this case there are lots of sample codes within the kernel source itself.
eg.
fs/checksum.c: missing = __copy_from_user(dst, src, len);