Multi-Queue (MQ) in SCSI and iSCSI - linux-kernel

I have a question regarding multi-queue (MQ) in SCSI layer and consequently iSCSI. Apparently there is good technical and scientific literature explaining multi-queue (MQ) on block layer level. But there is rarely any good explanation that how this multi-queue (MQ) is triggered down to SCSI layer and then iSCSI. AFAIK, since Linux kernel 3.13 (2014), the linux block layer has multi-queue a.k.a mq-blk. After mq-blk in block layer, the SCSI IO submission path had to be updated. As a result, SCSI multi-queue a.k.a scsi-mq work has been functional since Linux kernel 3.17. Therefore, I have following questions:
Question 1: How actually multi-queuing is achieved in SCSI layer?
Question 2: Traditionally, the SCSI mid-level layer used to create queuecommand (). Now when there is multi-queuing implemented in SCSI, does multi-queuing actually means creating more than one queuecommand ()?
Question 3: Where exactly one can see multi-queue in SCSI code base?
Question 4: Once we have multi-queue in SCSI, how is this implemented in iSCSI level?
Please help me understand it.

Related

Mapping device memory into user process address space

While I was going through the LDD3 book, in chapter 15, (memory mapping and dma), the introduction of mmap call says:
mmap() system call allows mapping of device memory directly into user process address space.
The confusion is regarding the address space. Why would device memory be mapped to user space since the kernel only takes care of the device. Why would one need to map it in user space. If the device memory is mapped in user space, why kernel manages it then? And what if the device could be used erroneously if it lies in user address space?
Please correct me if I am wrong, I am just new to it.
Thanks
The same chapter you are referring to, has the answer to your question.
A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card’s memory. For a performance-critical application like this, direct access makes a large difference.
...
Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a high-performance application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work
done.
But you are correct that usually kernel drivers handle devices without revealing device memory to user space:
As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping is PAGE_SIZE grained.
In the end, it all depends on how you want your device to be used from user space:
which interfaces you want to provide from driver to user space
what are performance requirements
Usually you hide device memory from user, but sometimes it's needed to give user a direct access to device memory (when alternative is bad performance or ugly interface). Only you, as an engineer, can decide which way is the best, in each particular case.
There are few usages I can think of:
user mode drivers - in this case kernel driver is only posing as a stub: for mapping memory to user space, passing interrupts, etc. (this is common for proprietary drivers).
some user space application is filling or reading DMA buffers directly, to avoid copying them between user space and kernel space.
Regards,
Mateusz.

How do I read large amounts of data from an AXI4 bus

I'm building something on a zybo board, so using a Zynq device.
I'd like to write into main memory from the CPU, and read from it with the FPGA in order to write the CPU results out to another device.
I'm pretty sure that I need to use the AXI bus to do this, but I can't work out the best approach to the problem. Do I:
Make a full AXI peripheral myself? Presumably a master which issues read requests to main memory, and then has them fulfilled. I'm finding it quite hard to find resources on how to actually make an AXI peripheral, where would I start looking for straightforward explanations.
Use one of the Xilinx IP cores to handle the AXI bus for me, but there are quite a few of them, and I'm not sure of the best one to use.
Whatever it is, it needs to be fast, and it needs to be able to do large reads from the DDR memory on my board. That memory needs to also be writable by the CPU.
Thanks!
An easy option is to use the AXI-Stream FIFO component in your block diagram. Then you can code up an AXI-Stream slave to receive the data. So the ARM would write via AXI to the FIFO, and your component would stream data out of the FIFO. No need to do any AXI work.
Take a look at Xilinx's PG080 for details.
If you have access to the vivado-hls tool.
Then transferring data from the main memory to the FPGA memory (e.g., BRAM) under a burst scheme would be one solution.
Just you need to use memcpy in your code and then the synthesis tool automatically generates the master IP which is very fast and reliable.
Option 1: Create your own AXI master. You would probably need to create a AXI slave for configuration purposes as well.
I found this article quite helpful to get started with AXI:
http://silica.com/wps/wcm/connect/88aa13e1-4ba4-4ed9-8247-65ad45c59129/SILICA_Xilinx_Designing_a_custom_axi_slave_rev1.pdf?MOD=AJPERES&CVID=kW6xDPd
And of course, the full AXI reference specification is here:
http://www.gstitt.ece.ufl.edu/courses/fall15/eel4720_5721/labs/refs/AXI4_specification.pdf
Option 2: Use the Xilinx AXI DMA component to setup DMA transfers between DDR memory and AXI streams. You would need to interface your logic to the "AXI streams" of the Xilinx DMA component. AXI streams are typically easier to implement than creating a new high performance AXI master.
This approach supports very high bandwidths, and can do both continous streams and packet based transfers. It also supports metadata for each packet.
The Xilinx AXI DMA component is here:
http://www.xilinx.com/products/intellectual-property/axi_dma.html
Xilinx also provides software drivers for this.

How to interact between Nios and FPGA?

Example:
Let's assume there is a Nios running on a FPGA that sends randomly (or every second) a string to an attached display over the SPI interface. On the other hand there is the FPGA code that monitors a pushbutton. Every press on this button should send a string to the same attached display.
Question:
How works the interaction (or communication) between the FPGA and Nios in general or in such described case? How is it possible to 'inform' Nios that the pushbutton is pressed when this code is running under the FPGA code? Maybe there is a documentation about this topic to get an idea how it works...
Thanks in advance
Low speed or high speed?
For low speed, plug a GPIO core with enough I/O "pins" into the NIOS system and rebuild it. Wire your hardware to those pins, and use the GPIO driver code to access them. Done. Buttons count as low speed. SPI can too, though you'll probably find a much better SPI peripheral for NIOS, so I'd use that.
For high speed, you need to design a peripheral (IP core) that interfaces to whatever bus the NIOS system uses, and provides all the registers, memory, interrupt sources etc you need to interface to your VHDL hardware. There are plenty of example peripherals you can use as a starting point. Then you get to write the driver software to access that peripheral, again, starting from sample code.
This is a much more complex project, and while it's much faster than GPIO, you find "high speed" is relative; any embedded CPU is appallingly slow compared to custom hardware. We're not talking about factors of 2 here but orders of magnitude.
EDIT : Whichever approach you use, as described above, interacting with the hardware from the software side is best done through the driver software.
If you're in the situation where you have to write your own driver, then you declare variables to match each accessible register or memory block (represented by an array variable). Often the vendor tools can create a skeleton driver for you, from either the VHDL code or some other description. I don't know how Altera/Nios tools are set up but they surely have tutorials to teach you their approach.
If you have an Ada compiler you can declare these variables at package scope, to maintain proper abstraction and information hiding. But if you have to use C, with no packages, you are probably stuck with global variables.
You fix each variable at whatever physical address your hardware maps them to, and you must declare them "volatile" so that accesses to them are never optimised into registers.
If your hardware can interrupt the CPU, you have to write an interrupt handler function, with pragmas to tell the compiler which interrupt vector it should be connected to. You'll need to get the exact details from your own compiler documentation and examples of driver code for other peripherals.
I would start here:
https://www.altera.com/support/support-resources/design-examples/intellectual-property/embedded/nios-ii/exm-developing-hal-drivers.html
with sample code and a short "Guidelines" document
and use the NIOS software handbook for more depth.
To help find what you're looking for, apparently Altera use the terms "HAL" (Hardware Abstraction Layer) to describe the part of the driver that directly accesses the hardware, and "BSP" (Board Support Package) for the facilities that allow you to describe your hardware to the tools - and to your software team. Any tools to build a skeleton driver will be associated with the BSP : I see a section called "Creating a new BSP" in the software handbook.

Atomic operations in ARM strex and ldrex - can they work on I/O registers?

Suppose I'm modifying a few bits in a memory-mapped I/O register, and it's possible that another process or and ISR could be modifying other bits in the same register.
Can ldrex and strex be used to protect against this? I mean, they can in principle because you can ldrex, and then change the bit(s), and strex it back, and if the strex fails it means another operation may have changed the reg and you have to start again. But can the strex/ldrex mechanism be used on a non-cacheable area?
I have tried this on raspberry pi, with an I/O register mapped into userspace, and the ldrex operation gives me a bus error. If I change the ldrex/strex to a simple ldr/str it works fine (but is not atomic any more...) Also, the ldrex/strex routines work fine on ordinary RAM. Pointer is 32-bit aligned.
So is this a limitation of the strex/ldrex mechanism? or a problem with the BCM2708 implementation, or the way the kernel has set it up? (or somethinge else- maybe I've mapped it wrong)?
Thanks for mentioning me...
You do not use ldrex/strex pairs on the resource itself. Like swp or test and set or whatever your instruction set supports (for arm it is swp and more recently strex/ldrex). You use these instructions on ram, some ram location agreed to by all the parties involved. The processes sharing the resource use the ram location to fight over control of the resource, whoever wins, gets to then actually address the resource. You would never use swp or ldrex/strex on a peripheral itself, that makes no sense. and I could see the memory system not giving you an exclusive okay response (EXOKAY) which is what you need to get out of the ldrex/strex infinite loop.
You have two basic methods for sharing a resource (well maybe more, but here are two). One is you use this shared memory location and each user of the shared resource, fights to win control over the memory location. When you win you then talk to the resource directly. When finished give up control over the shared memory location.
The other method is you have only one piece of software allowed to talk to the peripheral, nobody else is allowed to ever talk to the peripheral. Anyone wishing to have something done on the peripheral asks the one resource to do it for them. It is like everyone being able to share the soft drink fountain, vs the soft drink fountain is behind the counter and only the soft drink fountain employee is allowed to use the soft drink fountain. Then you need a scheme either have folks stand in line or have folks take a number and be called to have their drink filled. Along with the single resource talking to the peripheral you have to come up with a scheme, fifo for example, to essentially make the requests serial in nature.
These are both on the honor system. You expect nobody else to talk to the peripheral who is not supposed to talk to the peripheral, or who has not won the right to talk to the peripheral. If you are looking for hardware solutions to prevent folks from talking to it, well, use the mmu but now you need to manage the who won the lock and how do they get the mmu unblocked (without using the honor system) and re-blocked in a way that
Situations where you might have an interrupt handler and a foreground task sharing a resource, you have one or the other be the one that can touch the resource, and the other asks for requests. for example the resource might be interrupt driven (a serial port for example) and you have the interrupt handlers talk to the serial port hardware directly, if the application/forground task wants to have something done it fills out a request (puts something in a fifo/buffer) the interrupt then looks to see if there is anything in the request queue, and if so operates on it.
Of course there is the, disable interrupts and re-enable critical sections, but those are scary if you want your interrupts to have some notion of timing/latency...Understand what you are doing and they can be used to solve this app+isr two user problem.
ldrex/strex on non-cached memory space:
My extest perhaps has more text on the when you can and cant use ldrex/strex, unfortunately the arm docs are not that good in this area. They tell you to stop using swp, which implies you should use strex/ldrex. But then switch to the hardware manual which says you dont have to support exclusive operations on a uniprocessor system. Which says two things, ldrex/strex are meant for multiprocessor systems and meant for sharing resources between processors on a multiprocessor system. Also this means that ldrex/strex is not necessarily supported on uniprocessor systems. Then it gets worse. ARM logic generally stops either at the edge of the processor core, the L1 cache is contained within this boundary it is not on the axi/amba bus. Or if you purchased/use the L2 cache then the ARM logic stops at the edge of that layer. Then you get into the chip vendor specific logic. That is the logic that you read the hardware manual for where it says you dont NEED to support exclusive accesses on uniprocessor systems. So the problem is vendor specific. And it gets worse, ARM's L1 and L2 cache so far as I have found do support ldrex/strex, so if you have the caches on then ldrex/strex will work on a system whose vendor code does not support them. If you dont have the cache on that is when you get into trouble on those systems (that is the extest thing I wrote).
The processors that have ldrex/strex are new enough to have a big bank of config registers accessed through copressor reads. buried in there is a "swp instruction supported" bit to determine if you have a swap. didnt the cortex-m3 folks run into the situation of no swap and no ldrex/strex?
The bug in the linux kernel (there are many others as well for other misunderstandings of arm hardware and documentation) is that on a processor that supports ldrex/strex the ldrex/strex solution is chosen without determining if it is multiprocessor, so you can (and I know of two instances) get into an infinite ldrex/strex loop. If you modify the linux code so that it uses the swp solution (there is code there for either solution) they linux will work. why only two people have talked about this on the internet that I know of, is because you have to turn off the caches to have it happen (so far as I know), and who would turn off both caches and try to run linux? It actually takes a fair amount of work to succesfully turn off the caches, modifications to linux are required to get it to work without crashing.
No, I cant tell you the systems, and no I do not now nor ever have worked for ARM. This stuff is all in the arm documentation if you know where to look and how to interpret it.
Generally, the ldrex and strex need support from the memory systems. You may wish to refer to some answers by dwelch as well as his extext application. I would believe that you can not do this for memory mapped I/O. ldrex and strex are intended more for Lock Free algorithms, in normal memory.
Generally only one driver should be in charge of a bank of I/O registers. Software will make requests to that driver via semaphores, etc which can be implement with ldrex and strex in normal SDRAM. So, you can inter-lock these I/O registers, but not in the direct sense.
Often, the I/O registers will support atomic access through write one to clear, multiplexed access and other schemes.
Write one to clear - typically use with hardware events. If code handles the event, then it writes only that bit. In this way, multiple routines can handle different bits in the same register.
Multiplexed access - often an interrupt enable/disable will have a register bitmap. However, there are also alternate register that you can write the interrupt number to which enable or disable a particular register. For instance, intmask maybe two 32 bit registers. To enable int3, you could mask 1<<3 to the intmask or write only 3 to an intenable register. They intmask and intenable are hooked to the same bits via hardware.
So, you can emulate an inter-lock with a driver or the hardware itself may support atomic operations through normal register writes. These schemes have served systems well for quiet some time before people even started to talk about lock free and wait free algorithms.
Like previous answers state, ldrex/strex are not intended for accessing the resource itself, but rather for implementing the synchronization primitives required to protect it.
However, I feel the need to expand a bit on the architectural bits:
ldrex/strex (pronounced load-exclusive/store-exclusive) are supported by all ARM architecture version 6 and later processors, minus the M0/M1 microcontrollers (ARMv6-M).
It is not architecturally guaranteed that load-exclusive/store-exclusive will work on memory types other than "Normal" - so any clever usage of them on peripherals would not be portable.
The SWP instruction isn't being recommended against simply because its very nature is counterproductive in a multi-core system - it was deprecated in ARMv6 and is "optional" to implement in certain ARMv7-A revisions, and most ARMv7-A processors already require it to be explicitly enabled in the cp15 SCTLR. Linux by default does not, and instead emulates the operation through the undef handler using ... load-exclusive and store-exclusive (what #dwelch refers to above). So please don't recommend SWP as a valid alternative if you are expecting code to be portable across ARMv7-A platforms.
Synchronization with bus masters not in the inner-shareable domain (your cache-coherency island, as it were) requires additional external hardware - referred to as a global monitor - in order to track which masters have requested exclusive access to which regions.
The "not required on uniprocessor systems" bit sounds like the ARM terminology getting in the way. A quad-core Cortex-A15 is considered one processor... So testing for "uniprocessor" in Linux would not make one iota of a difference - the architecture and the interconnect specifications remain the same regardless, and SWP is still optional and may not be present at all.
Cortex-M3 supports ldrex/strex, but its interconnect (AHB-lite) does not support propagating it, so it cannot use it to synchronize with external masters. It does not support SWP, never introduced in the Thumb instruction set, which its interconnect would also not be able to propagate.
If the chip in question has a toggle register (which is essentially XORed with the output latch when written to) there is a work around.
load port latch
mask off unrelated bits
xor with desired output
write to toggle register
as long as two processes do not modify the same pins (as opposed to "the same port") there is no race condition.
In the case of the bcm2708 you could choose an output pin whose neighbors are either unused or are never changed and write to GPFSELn in byte mode. This will however only ensure that you will not corrupt others. If others are writing in 32 bit mode and you interrupt them they will still corrupt you. So its kind of a hack.
Hope this helps

Questions on how network cards in Windows work

I am trying to figure out how network cards work in Windows, and how the data is being relayed.
I have two hypotheses.
1.
Data is received by the network card.
The card then puts the data in an internal buffer, possibly a double buffer or a ring buffer.
The card accumulates data until some amount has been reached, upon which it sends an interrupt.
Windows copies the data from the card to the RAM and notifies appropriate handlers.
2.
Data is received.
The card puts the data in the RAM using DMA. (Does DMA guarantee that data will not be lost, or does the card still need its own buffer?)
The card fires an interrupt upon putting enough data in the RAM.
Windows receives the interrupt and copies or exposes the data to appropriate handlers.
Are either of my hypotheses correct?
Is there any message from the card or Windows if buffers are full?
In my Windows systems properties for my ethernet controller I can see properties called "Receive buffers" and "Transmit buffers", both are set to 256.
What does this mean?
Are there any good literature on this subject? (I have Tanenbaum's Modern Operating Systems, but it is not specifically related to Windows.)
Your question subsumes (at least!) three very, very broad topics:
1) how does a Layer 2 (Data Link) hardware device work?
2) How does it relate to the operating system's network stack
... and ...
3) How does it relate to the operating system's kernel-level device driver?
The next link is actually 180 degrees opposite your original question (the API is relatively high level, your question pertains to the lowest software levels), but it wouldn't hurt to look at the .Net API for perspective "how things work":
http://msdn.microsoft.com/en-us/library/4as0wz7t.aspx
'Hope that helps ... at least a little bit...
PS:
Linux is a wealth of information about implementing a network stack: all of the kernel source and all of the device drivers are completely available, and very well documented.

Resources