shared lock-free queue between kernel/user address space - linux-kernel

I am trying to construct two shared queues (one command queue and one reply queue) between user and kernel space, so that the kernel can send messages to userspace and userspace can send replies back to the kernel after it finishes processing.
What I have done so far is allocate kernel memory pages (for the queues) and mmap them to user space; now both the user and kernel side can access those pages (that is, what is written in kernel space can be correctly read in user space, and vice versa).
The problem is that I don't know how to synchronize access between kernel and user space. Say I want to build a ring buffer for a multi-producer, single-consumer scheme: how do I make sure the ring buffer doesn't get corrupted by simultaneous writes?
I did some research this week and found some possible approaches, but I am quite new to kernel module development and not sure whether they will work. While digging into them, I would be glad to get any comments or suggestions:
Use a shared semaphore between user/kernel space (see: Shared semaphore between user and kernel spaces).
But many system calls like sem_timedwait() would be involved, and I am worried about how efficient that would be.
What I really prefer is a lock-free scheme, as described in https://lwn.net/Articles/400702/. Related files in the kernel tree are:
kernel/trace/ring_buffer_benchmark.c
kernel/trace/ring_buffer.c
Documentation/trace/ring-buffer-design.txt
How the lock-free design is achieved is documented here: https://lwn.net/Articles/340400/
However, I assume these are kernel implementations and cannot be used directly in user space (as with the example in ring_buffer_benchmark.c). Is there any way I can reuse this scheme in user space? I would also be glad to find more examples.
Also, that article (LWN 400702) mentions an alternative approach using the perf tools, which seems similar to what I am trying to do. If 2 doesn't work out, I will try this approach:
"The user-space perf tool therefore interacts with the kernel through reads and writes in a shared memory region without using system calls."
Sorry for the English grammar... Hope it makes sense.

To synchronize between kernel and user space you can use the circular buffer mechanism (documented in Documentation/circular-buffers.txt).
The key feature of such buffers is the two pointers (head and tail), which can be updated separately; this fits well with separate user and kernel code. The implementation of a circular buffer is also quite simple, so it is not difficult to reimplement in user space.
Note that for multiple producers in the kernel you need to synchronize them with a spinlock or similar.
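To make that concrete, here is a minimal single-producer/single-consumer sketch of the circular-buffers.txt idea, written as plain C so the same code can run on the userspace side of the shared mapping. The struct and names (shared_ring, RING_SIZE) are made up for illustration, and GCC's __atomic builtins stand in for the kernel's smp_load_acquire()/smp_store_release(); on the kernel side you would use the kernel primitives, and serialize multiple producers with a spinlock as noted above.

    /* Single-producer/single-consumer ring, intended to live in one shared,
     * mmap'ed page.  head is written only by the producer, tail only by the
     * consumer; both are free-running counters masked on use. */
    #include <stdint.h>

    #define RING_SIZE 1024u                        /* must be a power of two */
    #define RING_MASK (RING_SIZE - 1)

    struct shared_ring {
        uint32_t head;                             /* written by producer only */
        uint32_t tail;                             /* written by consumer only */
        uint8_t  buf[RING_SIZE];
    };

    /* Producer side: returns 0 on success, -1 if the ring is full. */
    static int ring_put(struct shared_ring *r, uint8_t byte)
    {
        uint32_t head = r->head;
        uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);

        if (head - tail == RING_SIZE)
            return -1;                             /* full */

        r->buf[head & RING_MASK] = byte;
        /* Publish the data before the new head becomes visible. */
        __atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
        return 0;
    }

    /* Consumer side: returns 0 on success, -1 if the ring is empty. */
    static int ring_get(struct shared_ring *r, uint8_t *byte)
    {
        uint32_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
        uint32_t tail = r->tail;

        if (head == tail)
            return -1;                             /* empty */

        *byte = r->buf[tail & RING_MASK];
        /* Free the slot only after the data has been read. */
        __atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
        return 0;
    }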

Related

Efficient kernel/userspace communication for Linux kernel module

I’ve got a Linux kernel module (a qdisc / network traffic scheduler, but this shouldn’t matter) which needs efficient communication between kernel and user space.
For the communication from kernel to user space, I found relayfs over debugfs very useful: it passes a high volume of debugging information out via a ring buffer, and if userspace misses a page it doesn't matter much.
For the communication back (specifically, changing the bandwidth limit between 10 and 1000 times a second) I've been relying on the normal netlink machinery, i.e. tc qdisc change …. This is not only overhead (mildly mitigated with tc -b - and piping the commands in via stdin), it also introduces unwanted latency because that command involves more locking than necessary.
I think I need an efficient way to write control parameters (currently just one 64-bit number) to kernel space, very often, and with as little locking as possible. The actual update of the value can probably be done with an atomic write (most likely followed by some kind of barrier so the other CPUs read the new value; I'll have to read up on that as well). I'm unsure whether I have to change the reading code as well.
relayfs is unidirectional, and buffered, so it wouldn’t be a good fit either.
From what I understand, procfs and sysfs have traditionally been the ways to do this, with sysfs being preferred for new code. From what I see from other users of sysfs, though, it involves an open/write/close cycle every time a quantity is written, and the value is usually converted to a decimal ASCII string instead of using the binary value in host endianness. This doesn't sound like an ideal fit either.
Additionally, there may be more than one instance of the qdisc in use at any given time, so I need a way to address the correct one. I could use module-global state to keep a list of currently active qdiscs, but that would require locking, so ideally I'd fetch the pointer to the control variables only on open, with locking; writing can then be lockless and atomic. On the other hand, one qdisc does not necessarily need more than one writer to its controlling communication channel.
Update: I'm considering adding just another debugfs file (so I can pass the pointers as the data argument to debugfs_create_file) and doing the atomic write plus barrier in the write file operation; then I think I only need an atomic read (with a paired barrier) in the consumer (which runs in softirq context). Is this sound?
Apparently, atomic64_read_acquire() and atomic64_set_release() are my friends for this, and the Linux kernel guarantees that just casting from/to u64 will not invoke UB.
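For what it's worth, a hypothetical sketch of that debugfs approach; every name here (struct my_qdisc_priv, rate_limit, the file name) is made up, and error handling is minimal. Userspace writes the raw 64-bit value with a single write() call, and the softirq-side code reads it locklessly with the paired acquire:

    #include <linux/debugfs.h>
    #include <linux/atomic.h>
    #include <linux/uaccess.h>
    #include <linux/fs.h>

    struct my_qdisc_priv {
        atomic64_t rate_limit;                  /* the one 64-bit control value */
        struct dentry *dbg_file;
    };

    static ssize_t my_qdisc_set_limit(struct file *file, const char __user *ubuf,
                                      size_t count, loff_t *ppos)
    {
        struct my_qdisc_priv *priv = file->private_data;
        u64 val;

        if (count != sizeof(val))
            return -EINVAL;
        if (copy_from_user(&val, ubuf, sizeof(val)))
            return -EFAULT;

        /* Release pairs with the acquire in the softirq-side reader. */
        atomic64_set_release(&priv->rate_limit, val);
        return count;
    }

    static const struct file_operations my_qdisc_limit_fops = {
        .owner = THIS_MODULE,
        .open  = simple_open,                   /* puts i_private into private_data */
        .write = my_qdisc_set_limit,
    };

    /* At qdisc init: the per-instance private data goes in as the debugfs
     * "data" argument, so each qdisc instance gets its own file. */
    static void my_qdisc_add_debugfs(struct my_qdisc_priv *priv,
                                     struct dentry *parent)
    {
        priv->dbg_file = debugfs_create_file("rate_limit", 0200, parent,
                                             priv, &my_qdisc_limit_fops);
    }

    /* In the enqueue/dequeue path (softirq context): lockless read. */
    static inline u64 my_qdisc_get_limit(struct my_qdisc_priv *priv)
    {
        return (u64)atomic64_read_acquire(&priv->rate_limit);
    }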

Ways to invoke Linux kernel memory allocation?

I am examining how the kernel memory allocators (SLAB and SLUB) work. To trigger them, I need to invoke kernel memory allocations from a user-land program.
The obvious way would be calling fork(), which creates new process instances, for which the kernel must maintain PCB structures that require a fair amount of memory.
Beyond that I'm out of ideas, and I would rather not limit my experiments to merely calling fork() and tracing it with SystemTap. Are there other convenient ways to do something similar, preferably ones that allocate kernel objects (other than proc_t) with a variety of characteristics, the most important being their sizes?
Thanks.
SLUB is just a more efficient way (in comparison with SLAB) of managing the cache objects; it is more or less the same thing. You can read here why SLUB was introduced, and this link talks about what exactly the slab allocator is. Now, on to tracing what exactly happens in the kernel and how to trace it:
The easier but less efficient way is to read the source code, but for that you need to know where to start in the source.
Another, more precise way is to write a driver that creates its own cache with kmem_cache_create() and allocates from it with kmem_cache_alloc(), and then call into it from your user program. Now you have a well-defined starting point: use kgdb and step through the entire sequence.
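A minimal sketch of such a driver, assuming a misc char device as the trigger; all names (slab_poker, poker_cache, the object size) are invented for illustration. Each write() from a user program allocates one object from a dedicated cache, which you can then watch in /proc/slabinfo, step through with kgdb, or probe with SystemTap:

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/miscdevice.h>
    #include <linux/fs.h>

    #define POKER_OBJ_SIZE 192          /* pick the object size you want to study */

    static struct kmem_cache *poker_cache;
    static void *last_obj;              /* keep one object live so the cache stays populated */

    static ssize_t poker_write(struct file *file, const char __user *buf,
                               size_t count, loff_t *ppos)
    {
        void *obj = kmem_cache_alloc(poker_cache, GFP_KERNEL);

        if (!obj)
            return -ENOMEM;
        if (last_obj)
            kmem_cache_free(poker_cache, last_obj);
        last_obj = obj;
        return count;                   /* pretend the whole write was consumed */
    }

    static const struct file_operations poker_fops = {
        .owner = THIS_MODULE,
        .write = poker_write,
    };

    static struct miscdevice poker_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "slab_poker",          /* shows up as /dev/slab_poker */
        .fops  = &poker_fops,
    };

    static int __init poker_init(void)
    {
        int ret;

        poker_cache = kmem_cache_create("poker_cache", POKER_OBJ_SIZE, 0,
                                        SLAB_HWCACHE_ALIGN, NULL);
        if (!poker_cache)
            return -ENOMEM;
        ret = misc_register(&poker_dev);
        if (ret)
            kmem_cache_destroy(poker_cache);
        return ret;
    }

    static void __exit poker_exit(void)
    {
        misc_deregister(&poker_dev);
        if (last_obj)
            kmem_cache_free(poker_cache, last_obj);
        kmem_cache_destroy(poker_cache);
    }

    module_init(poker_init);
    module_exit(poker_exit);
    MODULE_LICENSE("GPL");

Then from userspace, something as simple as echo x > /dev/slab_poker triggers an allocation.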

Atomic operations in ARM strex and ldrex - can they work on I/O registers?

Suppose I'm modifying a few bits in a memory-mapped I/O register, and it's possible that another process or an ISR could be modifying other bits in the same register.
Can ldrex and strex be used to protect against this? I mean, they can in principle, because you can ldrex, change the bit(s), and strex the value back, and if the strex fails it means another operation may have changed the register and you have to start again. But can the strex/ldrex mechanism be used on a non-cacheable area?
I have tried this on a Raspberry Pi, with an I/O register mapped into userspace, and the ldrex operation gives me a bus error. If I change the ldrex/strex to a simple ldr/str it works fine (but is no longer atomic...). The ldrex/strex routines also work fine on ordinary RAM. The pointer is 32-bit aligned.
So is this a limitation of the strex/ldrex mechanism, a problem with the BCM2708 implementation, or a problem with the way the kernel has set it up? (Or something else entirely; maybe I've mapped it wrong?)
Thanks for mentioning me...
You do not use ldrex/strex pairs on the resource itself, just as you would not use swp or test-and-set or whatever your instruction set supports (for ARM that is swp and, more recently, ldrex/strex). You use these instructions on RAM, on some RAM location agreed to by all the parties involved. The processes sharing the resource use that RAM location to fight over control of the resource; whoever wins gets to actually access the resource. You would never use swp or ldrex/strex on a peripheral itself; that makes no sense. I could also see the memory system not giving you an exclusive-okay response (EXOKAY), which is what you need to get out of the ldrex/strex infinite loop.
You have two basic methods for sharing a resource (well, maybe more, but here are two). One is that you use this shared memory location and each user of the shared resource fights to win control over it; when you win, you talk to the resource directly, and when finished you give up control over the shared memory location.
The other method is that only one piece of software is ever allowed to talk to the peripheral; anyone wishing to have something done on the peripheral asks that one owner to do it for them. It is like everyone being able to share the soft-drink fountain versus the fountain being behind the counter, where only the employee is allowed to use it. In the latter case you need a scheme: either have folks stand in line, or have them take a number and be called to have their drink filled. Along with the single owner talking to the peripheral, you have to come up with a scheme, a FIFO for example, to make the requests essentially serial in nature.
Both methods are on the honor system: you expect nobody to talk to the peripheral who is not supposed to, or who has not won the right to. If you are looking for a hardware solution to prevent others from talking to it, you can use the MMU, but now you have to manage who won the lock and how the MMU gets unblocked (without relying on the honor system) and re-blocked without opening up the same race again.
In situations where an interrupt handler and a foreground task share a resource, you make one or the other the only one that can touch the resource, and the other submits requests. For example, the resource might be interrupt driven (a serial port, say), so you have the interrupt handlers talk to the serial port hardware directly; if the application/foreground task wants something done, it fills out a request (puts something in a FIFO/buffer), and the interrupt handler checks whether there is anything in the request queue and, if so, operates on it.
Of course there is also the disable-interrupts-and-re-enable critical-section approach, but that is scary if you want your interrupts to have some notion of timing/latency... Understand what you are doing, though, and it can be used to solve this app-plus-ISR two-user problem.
ldrex/strex on non-cached memory space:
My extest perhaps has more text on when you can and can't use ldrex/strex; unfortunately the ARM docs are not very good in this area. They tell you to stop using swp, which implies you should use ldrex/strex. But then switch to the hardware manual, which says you don't have to support exclusive operations on a uniprocessor system. That says two things: ldrex/strex are meant for multiprocessor systems, for sharing resources between processors, and ldrex/strex is not necessarily supported on uniprocessor systems. Then it gets worse. ARM's logic generally stops either at the edge of the processor core (the L1 cache is contained within that boundary; it is not on the AXI/AMBA bus), or, if you purchased/use the L2 cache, at the edge of that layer. Beyond that you get into chip-vendor-specific logic, and that is the logic the hardware manual is talking about when it says you don't NEED to support exclusive accesses on uniprocessor systems. So the problem is vendor specific. And it gets worse: ARM's L1 and L2 caches, so far as I have found, do support ldrex/strex, so if you have the caches on, ldrex/strex will work even on a system whose vendor logic does not support them. If you don't have the caches on, that is when you get into trouble on those systems (that is the extest thing I wrote).
The processors that have ldrex/strex are new enough to have a big bank of configuration registers accessed through coprocessor reads; buried in there is a "swp instruction supported" bit to determine whether you have swp. Didn't the Cortex-M3 folks run into the situation of having no swp and no ldrex/strex?
The bug in the Linux kernel (there are many others as well, from other misunderstandings of ARM hardware and documentation) is that on a processor that supports ldrex/strex, the ldrex/strex solution is chosen without determining whether the system is actually multiprocessor, so you can (and I know of two instances) get into an infinite ldrex/strex loop. If you modify the Linux code so that it uses the swp solution (there is code there for either solution), then Linux will work. The reason only two people I know of have talked about this on the internet is that you have to turn off the caches to make it happen (so far as I know), and who would turn off both caches and try to run Linux? It actually takes a fair amount of work to successfully turn off the caches; modifications to Linux are required to get it to work without crashing.
No, I can't tell you which systems, and no, I do not now nor have I ever worked for ARM. This stuff is all in the ARM documentation if you know where to look and how to interpret it.
Generally, ldrex and strex need support from the memory system. You may wish to refer to some answers by dwelch, as well as his extest application. I believe you cannot do this for memory-mapped I/O; ldrex and strex are intended more for lock-free algorithms on normal memory.
Generally only one driver should be in charge of a bank of I/O registers. Other software makes requests to that driver via semaphores, etc., which can be implemented with ldrex and strex in normal SDRAM. So you can interlock these I/O registers, just not in the direct sense.
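To illustrate the "lock lives in normal RAM, only the winner touches the peripheral" pattern, here is an ARMv7-flavoured sketch of a spin lock built on ldrex/strex. The lock word, symbol names, register address and dmb placement are assumptions for illustration only; real code would normally use the kernel's spinlocks or the compiler's __atomic builtins rather than hand-written assembly:

    #include <stdint.h>

    /* A spin lock kept in ordinary (Normal, cacheable) RAM.  Only the winner
     * then performs plain ldr/str accesses to the peripheral register. */
    static void lock_acquire(volatile uint32_t *lock)
    {
        uint32_t tmp, failed;

        do {
            __asm__ volatile(
                "1: ldrex   %0, [%2]        \n"   /* read the lock word        */
                "   teq     %0, #0          \n"   /* already held?             */
                "   bne     1b              \n"   /* yes: spin on ldrex        */
                "   mov     %0, #1          \n"
                "   strex   %1, %0, [%2]    \n"   /* try to claim it           */
                : "=&r"(tmp), "=&r"(failed)
                : "r"(lock)
                : "cc", "memory");
        } while (failed);                         /* lost the exclusive: retry */
        __asm__ volatile("dmb" ::: "memory");     /* critical section begins   */
    }

    static void lock_release(volatile uint32_t *lock)
    {
        __asm__ volatile("dmb" ::: "memory");     /* drain the critical section */
        *lock = 0;                                /* a plain store frees it     */
    }

    /* Usage: serialize read-modify-write of a shared I/O register.
     * IOREG and ioreg_lock are hypothetical. */
    #define IOREG (*(volatile uint32_t *)0x20200000)
    static volatile uint32_t ioreg_lock;

    static void set_bits_in_ioreg(uint32_t bits)
    {
        lock_acquire(&ioreg_lock);
        IOREG |= bits;              /* ordinary ldr/str, exclusive by convention */
        lock_release(&ioreg_lock);
    }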
Often, the I/O registers themselves will support atomic access through write-one-to-clear, multiplexed access and other schemes.
Write-one-to-clear is typically used with hardware events: if code handles the event, it writes only that bit. In this way, multiple routines can handle different bits in the same register.
Multiplexed access: often an interrupt enable/disable facility will have a register bitmap, but there may also be an alternate register to which you write an interrupt number to enable or disable that particular interrupt. For instance, intmask may be one or two 32-bit bitmap registers; to enable int3, you could OR 1<<3 into intmask, or simply write the number 3 to an intenable register. The intmask and intenable registers are hooked to the same bits in hardware.
So you can emulate an interlock with a driver, or the hardware itself may support atomic operation through normal register writes. These schemes served systems well for quite some time before people even started to talk about lock-free and wait-free algorithms.
Like previous answers state, ldrex/strex are not intended for accessing the resource itself, but rather for implementing the synchronization primitives required to protect it.
However, I feel the need to expand a bit on the architectural bits:
ldrex/strex (pronounced load-exclusive/store-exclusive) are supported by all ARM architecture version 6 and later processors, minus the M0/M1 microcontrollers (ARMv6-M).
It is not architecturally guaranteed that load-exclusive/store-exclusive will work on memory types other than "Normal" - so any clever usage of them on peripherals would not be portable.
The SWP instruction isn't recommended against simply because its very nature is counterproductive in a multi-core system: it was also deprecated in ARMv6, is "optional" to implement in certain ARMv7-A revisions, and on most ARMv7-A processors it has to be explicitly enabled in the cp15 SCTLR. Linux by default does not enable it, and instead emulates the operation through the undef handler using ... load-exclusive and store-exclusive (what dwelch refers to above). So please don't recommend SWP as a valid alternative if you expect code to be portable across ARMv7-A platforms.
Synchronization with bus masters not in the inner-shareable domain (your cache-coherency island, as it were) requires additional external hardware - referred to as a global monitor - in order to track which masters have requested exclusive access to which regions.
The "not required on uniprocessor systems" bit sounds like the ARM terminology getting in the way. A quad-core Cortex-A15 is considered one processor... So testing for "uniprocessor" in Linux would not make one iota of a difference - the architecture and the interconnect specifications remain the same regardless, and SWP is still optional and may not be present at all.
Cortex-M3 supports ldrex/strex, but its interconnect (AHB-Lite) does not support propagating it, so it cannot be used to synchronize with external masters. It does not support SWP, which was never introduced in the Thumb instruction set and which its interconnect would likewise be unable to propagate.
If the chip in question has a toggle register (which is essentially XORed with the output latch when written to), there is a workaround:
load port latch
mask off unrelated bits
xor with desired output
write to toggle register
As long as two processes do not modify the same pins (as opposed to "the same port"), there is no race condition; a sketch follows.
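A minimal sketch of those four steps, assuming a hypothetical part with a readable output-latch register and a write-to-toggle register; the register addresses, names and pin mask are made up:

    #include <stdint.h>

    #define PORT_LAT (*(volatile uint32_t *)0x40000010)  /* current output latch (readable) */
    #define PORT_TGL (*(volatile uint32_t *)0x40000014)  /* writing 1s XORs the latch       */
    #define MY_PINS  0x0000000Cu                         /* the only pins this code owns    */

    static void set_my_pins(uint32_t desired)
    {
        uint32_t current = PORT_LAT & MY_PINS;           /* load port latch, mask off unrelated bits */
        PORT_TGL = current ^ (desired & MY_PINS);        /* xor with desired output, write to toggle */
    }

Because the value written to the toggle register can only have bits set inside MY_PINS, pins owned by other code are never disturbed.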
In the case of the BCM2708 you could choose an output pin whose neighbours are either unused or never changed, and write to GPFSELn in byte mode. This will, however, only ensure that you do not corrupt others; if others are writing in 32-bit mode and you interrupt them, they can still corrupt you. So it's kind of a hack.
Hope this helps

How to ensure that malloc and mmap cooperate, i.e., work on non-overlapping memory regions?

My main problem is that I need to enable multiple OS processes to communicate via a large shared memory heap that is mapped to identical address ranges in all processes. (To make sure that pointer values are actually meaningful.)
Now I run into the problem that part of the program/library uses standard malloc/free, and it seems to me that the underlying implementation does not respect the mappings I create with mmap.
Another possibility is that I am creating mappings in regions that malloc had already planned to use.
Unfortunately, I am not able to guarantee 100% identical malloc/free behavior in all processes before I establish the mmap mappings.
This leads me to pass the MAP_FIXED flag to mmap. The first process uses 0x0 as the base address to ensure that the mapping range is at least somewhat reasonable, but that does not seem to carry over to the other processes. (The binary is also linked with -Wl,-no_pie.)
I tried to figure out whether I could query the system to know which pages it plans to use for malloc by reading up on malloc_default_zone, but that API does not seem to offer what I need.
Is there any way to ensure that malloc is not using particular memory pages/address ranges?
(It needs to work on OS X. Linux tips that point me in the right direction are appreciated, too.)
I notice this in the mmap documentation:
If MAP_FIXED is specified, a successful mmap deletes any previous mapping in the allocated address range
However, malloc won't use MAP_FIXED, so as long as you get in before malloc you'd be okay: you can test whether a region is free by first trying to map it without MAP_FIXED, and if that succeeds at the same address (which it will if the address is free), you can then remap it with MAP_FIXED, knowing that you're not choosing a section of address space that malloc has already grabbed.
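A sketch of that probe-then-MAP_FIXED idea; the target address, size and the anonymous placeholder mapping are arbitrary choices for illustration, and a real shared heap would map a shm_open()ed file descriptor instead of MAP_ANON in the final step:

    #include <sys/mman.h>
    #include <stddef.h>

    #define TARGET_ADDR ((void *)0x700000000000UL)   /* arbitrary high address         */
    #define REGION_SIZE (1UL << 30)                  /* 1 GiB shared heap, for example */

    /* Returns the mapping at TARGET_ADDR, or NULL if something (malloc included)
     * already occupies part of that range. */
    static void *claim_shared_region(void)
    {
        /* Probe: without MAP_FIXED the address is only a hint, so nothing
         * that already exists can be clobbered. */
        void *probe = mmap(TARGET_ADDR, REGION_SIZE, PROT_READ | PROT_WRITE,
                           MAP_ANON | MAP_PRIVATE, -1, 0);
        if (probe == MAP_FAILED)
            return NULL;

        if (probe != TARGET_ADDR) {                  /* the range was not free */
            munmap(probe, REGION_SIZE);
            return NULL;
        }

        /* The range was free; overlay the probe in place with the real
         * shared mapping, which MAP_FIXED replaces atomically. */
        void *region = mmap(TARGET_ADDR, REGION_SIZE, PROT_READ | PROT_WRITE,
                            MAP_ANON | MAP_SHARED | MAP_FIXED, -1, 0);
        return region == MAP_FAILED ? NULL : region;
    }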
The only guaranteed way to get the same block of logical memory in two processes is to have one fork from the other.
However, if you're compiling with 64-bit pointers, then you can just pick an (unusual) region of memory, and hope for the best, since the chance of collision is tiny.
See also this question about valid address spaces.
The OpenBSD malloc() implementation uses mmap() for memory allocation. I suggest you look at how it works, then write your own custom implementation of malloc() and tell your program and the libraries it uses to use your implementation instead.
Here is OpenBSD malloc():
http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libc/stdlib/malloc.c?rev=1.140

How to use mmap to share user-space and kernel threads

I am having trouble finding suitable examples for my problem. I want to share 4K (4096) bytes of data between user and kernel space. I have found many suggestions saying I should allocate memory in the kernel and mmap it into user space. Can someone provide an example of how to do this on Linux 2.6.38? Is there a good document that explains it?
Thanks in advance.
Your proposed way is one way to do it, but since userspace is not within your control (meaning any userspace program has the possibility of poking into the kernel through such a mapping), you are opening up opportunities for malicious attacks from userspace. This kernel-based memory sharing with userspace is also described here:
http://www.scs.ch/~frey/linux/memorymap.html
Instead, how about allocating the memory in userspace, and then from the kernel using the copy_from_user() and copy_to_user() APIs to copy to/from that userspace memory? If you want to share the memory among different processes, you can always use the IPC-related APIs to allocate and define the memory, e.g. shmget() etc. In that case there are lots of sample uses within the kernel source itself, e.g.:
fs/checksum.c: missing = __copy_from_user(dst, src, len);
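Along those lines, a minimal sketch of the copy_to_user()/copy_from_user() route: a misc char device backed by one 4096-byte kernel buffer that userspace accesses with ordinary read() and write() calls. The device name and symbols are made up, and simple_read_from_buffer()/simple_write_to_buffer() are kernel helpers that wrap copy_to_user()/copy_from_user() for exactly this pattern:

    #include <linux/module.h>
    #include <linux/miscdevice.h>
    #include <linux/fs.h>
    #include <linux/slab.h>

    #define SHARED_SIZE 4096

    static char *shared_buf;

    static ssize_t shared_read(struct file *file, char __user *ubuf,
                               size_t count, loff_t *ppos)
    {
        /* copies out of the kernel buffer via copy_to_user() internally */
        return simple_read_from_buffer(ubuf, count, ppos, shared_buf, SHARED_SIZE);
    }

    static ssize_t shared_write(struct file *file, const char __user *ubuf,
                                size_t count, loff_t *ppos)
    {
        /* copies into the kernel buffer via copy_from_user() internally */
        return simple_write_to_buffer(shared_buf, SHARED_SIZE, ppos, ubuf, count);
    }

    static const struct file_operations shared_fops = {
        .owner = THIS_MODULE,
        .read  = shared_read,
        .write = shared_write,
    };

    static struct miscdevice shared_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "shared4k",            /* appears as /dev/shared4k */
        .fops  = &shared_fops,
    };

    static int __init shared_init(void)
    {
        int ret;

        shared_buf = kzalloc(SHARED_SIZE, GFP_KERNEL);
        if (!shared_buf)
            return -ENOMEM;
        ret = misc_register(&shared_dev);
        if (ret)
            kfree(shared_buf);
        return ret;
    }

    static void __exit shared_exit(void)
    {
        misc_deregister(&shared_dev);
        kfree(shared_buf);
    }

    module_init(shared_init);
    module_exit(shared_exit);
    MODULE_LICENSE("GPL");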
