I have two boards connected via a PCIe bus. One board is the root port and the other is the endpoint, and the endpoint exports a memory region to the root port.
Communication between the two boards is implemented with a software message queue; the queue metadata and buffers all live inside the exported memory region.
Both sides can access the memory region at the same time (the root port over the PCIe bus, the endpoint over its local bus), which can cause problems when both try to update the queue metadata.
At first I tried to place a spinlock_t in the exported memory region, but because each board is uniprocessor, spinlock_t compiles down to essentially nothing and provides no real locking.
Can anyone suggest a mechanism to protect the shared region, or recommend another approach for communicating between the two boards? Any recommendations are appreciated. Thanks a lot!
Thank you for your interest so far.
We finally implemented the shared-memory communication with a circular queue (the implementation can be referenced from this link). We reduced the problem to a single producer and single consumer, so the circular queue does not require a lock. The limitation of this approach is that we have to create a queue for each peer connection.
The PCIe spec also has sections describing Atomic Operations, but unfortunately our PCIe host controller does not support this feature, so we could not make use of it.
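The single-producer/single-consumer trick above can be sketched as follows. This is an illustrative Python rendering of the index arithmetic only (the class name and sizes are made up); a real cross-board version would be C, would place the buffer and both indices in the exported region, and would need memory barriers to order the data write before the index update.

```python
# SPSC circular queue sketch. The invariant that removes the need for a
# lock: only the producer ever writes `head`, only the consumer ever
# writes `tail`. Indices grow monotonically and are masked into the
# buffer, so full/empty are distinguished by head - tail.

class SpscQueue:
    def __init__(self, capacity=8):
        assert capacity & (capacity - 1) == 0   # power of two for masking
        self.buf = [None] * capacity
        self.mask = capacity - 1
        self.head = 0                           # written only by producer
        self.tail = 0                           # written only by consumer

    def push(self, item):
        if self.head - self.tail == len(self.buf):
            return False                        # full
        self.buf[self.head & self.mask] = item
        self.head += 1                          # publish AFTER the write
        return True

    def pop(self):
        if self.head == self.tail:
            return None                         # empty
        item = self.buf[self.tail & self.mask]
        self.tail += 1                          # release slot AFTER the read
        return item
```

The "queue per peer connection" limitation follows directly: each direction of each pairing needs its own producer and its own consumer, so a full-duplex link between two peers already takes two such queues.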
I am trying to construct two shared queues (a command queue and a reply queue) between user and kernel space, so that the kernel can send messages to user space and user space can send replies back once it finishes processing.
What I have done is allocate kernel memory pages (for the queues) and mmap them to user space; both sides can now access those pages (meaning what is written in kernel space can be correctly read in user space, and vice versa).
The problem is that I don't know how to synchronize access between kernel and user space. Say I build a ring buffer for a multi-producer, single-consumer scheme: how do I keep the ring-buffer accesses from being corrupted by simultaneous writes?
I did some research this week and found some possible approaches, but I am quite new to kernel module development and not sure whether they will work. While digging into them, I would be glad to get any comments or suggestions:
Use a shared semaphore between user/kernel space: Shared semaphore between user and kernel spaces.
However, this relies on many system calls such as sem_timedwait(), and I worry about how efficient it would be.
What I would really prefer is a lock-free scheme, as described in https://lwn.net/Articles/400702/. Related files in the kernel tree are:
kernel/trace/ring_buffer_benchmark.c
kernel/trace/ring_buffer.c
Documentation/trace/ring-buffer-design.txt
How the lock-free design is achieved is documented here: https://lwn.net/Articles/340400/
However, I assume these are kernel-side implementations that cannot be used directly from user space (as with the example in ring_buffer_benchmark.c). Is there any way to reuse this scheme in user space? I also hope to find more examples.
That article (LWN 400702) also mentions an alternative approach using the perf tool, which seems similar to what I am trying to do. If the second approach won't work, I will try this one:
The user-space perf tool therefore interacts with the kernel through reads and writes in a shared memory region without using system calls.
Sorry for the English grammar... hope it makes sense.
To synchronize between kernel and user space you may use the circular buffer mechanism (documented in Documentation/circular-buffers.txt).
The key feature of such buffers is the pair of indices (head and tail), which can be updated independently; this fits separate user and kernel code well. The implementation of a circular buffer is also quite simple, so it is not difficult to reproduce in user space.
Note that for multiple producers in the kernel you need to synchronize them with a spinlock or similar.
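The head/tail arithmetic from Documentation/circular-buffers.txt can be sketched like this. These are Python renderings of the kernel's CIRC_CNT/CIRC_SPACE macros for a power-of-two buffer; the real kernel code pairs this arithmetic with acquire/release memory barriers, which are omitted here.

```python
# Index arithmetic of the kernel-style circular buffer: head is advanced
# by the producer, tail by the consumer, both masked by (size - 1).

def circ_cnt(head, tail, size):
    """Number of items currently available to the consumer."""
    return (head - tail) & (size - 1)

def circ_space(head, tail, size):
    """Number of free slots available to the producer.

    One slot is deliberately kept empty so that a full buffer
    (space == 0) is distinguishable from an empty one (cnt == 0)."""
    return (tail - head - 1) & (size - 1)
```

Because only one side writes head and only the other writes tail, the producer can test circ_space() and the consumer circ_cnt() without taking a lock, which is exactly why this layout suits the user/kernel split.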
I was reading that in some network drivers it is possible via DMA to pass packets directly into user memory. In that case, how would it be possible for the kernel's TCP/IP stack to process the packets?
The short answer is that it doesn't. Data isn't going to be processed in more than one location at once, so if networking packets are passed directly to a user space program, then the kernel isn't going to do anything else with them; it has been bypassed. It will be up to the user space program to handle it.
An example of this was presented in a device-drivers class I took a while back: high-frequency stock trading. There is an article about one such implementation at Forbes.com. The idea is that traders want their information as fast as possible, so they use specially crafted packets that, when received by equally specialized hardware, are presented directly to the trader's program, bypassing the relatively high-latency TCP/IP stack in the kernel. Here's an excerpt from the linked article talking about two such special network cards:
Both of these cards provide kernel bypass drivers that allow you to send/receive data via TCP and UDP in userspace. Context switching is an expensive (high-latency) operation that is to be avoided, so you will want all critical processing to happen in user space (or kernel-space if so inclined).
This technique can be used for just about any application where the latency between user programs and the hardware needs to be minimized, but as your question implies, it means that the kernel's normal mechanisms for handling such transactions are going to be bypassed.
A networking chip can have filter registers that match on IP/UDP/TCP plus port and route matching packets through a special set of DMA descriptors. If the driver pre-allocates the DMA-able memory and mmaps it to user space, a particular stream of traffic can easily be routed to user space without any kernel code touching it.
I used to work on a video platform where network ingress was done by an FPGA. Once configured, it could route 10 Gbit/s of UDP packets into the system, automatically steering packets matching certain MPEG-PS PIDs to the CPU and filtering other video/audio packets to other parts of the system at wire speed, all on a very low-end FPGA.
I was asked in an interview why message queues live in kernel address space; the same is suggested in the following link.
http://stork.sourceforge.net/thesis/node49.html
It says: "Message queue can be best described as an internal linked list within the kernel's addressing space."
I answered that kernel logical addresses can't be swapped out, which makes a message queue more robust when we need to retrieve data from it after a process crash.
I am not sure this is the right answer.
The interviewer then asked why shared memory is not part of the kernel address space.
I couldn't really work out why that is.
Can anyone please address these two questions?
I would say message queues are maintained in kernel space for (a) historical reasons and (b) architectural reasons -- they are modeled as a kernel-managed resource: they are only created, modified, and deleted according to the defined API. That means, for example, once a process sends a message, it can't be modified in flight, it can only be received. Access controls are also imposed on objects in the queue. Managing and enforcing the details of the API would be difficult if it were maintained in user space memory.
That being said, apart from the security/assurance aspects, you probably could actually implement message queues with the same API using a shared memory area and have it be completely transparent to consuming applications.
For shared memory itself, the key is it's shared. That means in order to fulfill its function, it must be accessible in the virtual address spaces of process A and process B at the same time. If process A stores a byte at a given offset in the shared memory, process B should (ideally) see that modification near-instantaneously (though obviously there will always be a potential for cache delays and so forth in multi-processor systems). And user-space processes are never allowed to directly modify kernel virtual addresses so the shared mapping must be created in user virtual address space (though there's no reason the kernel could not also map the same region into kernel virtual address space).
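The "same bytes visible through two mappings" property can be illustrated with Python's `multiprocessing.shared_memory` (my choice of module for brevity, not anything from the answer above). Both handles live in one process here, but each `SharedMemory` object is its own mapping of the same region, so the behaviour with real processes A and B is the same.

```python
# Two independent mappings of one shared region: a store through one
# mapping is visible through the other, which is the defining property
# of shared memory discussed above.
from multiprocessing import shared_memory

a = shared_memory.SharedMemory(create=True, size=16)  # "process A" mapping
b = shared_memory.SharedMemory(name=a.name)           # "process B" mapping

a.buf[0] = 0x42              # A stores a byte at offset 0
assert b.buf[0] == 0x42      # B observes it through its own mapping

b.close()
a.close()
a.unlink()                   # remove the underlying object
```

Note the contrast with the message-queue case: here each process really does hold a writable view of the same pages, so there is nothing the kernel can enforce about what gets written there.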
Unlike shared memory, a message queue can be implemented in kernel space because each queue element is simply copied between user space and kernel space when data is transferred between two user processes. There is therefore no chance for a user to corrupt kernel memory through a stray pointer; this is the same reason Linux uses copy_to_user() and copy_from_user() to protect the kernel from careless user code.
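The copy-on-send semantics described above can be sketched in a few lines. The class below is purely illustrative (not any real kernel API): the "kernel" copies the message out of the sender's buffer at send time, analogous to copy_from_user(), so a message already in flight cannot be modified and no user pointer ever reaches kernel-owned storage.

```python
# Sketch of kernel-managed message-queue semantics: send() copies the
# bytes into "kernel" memory, recv() hands a copy back out. The sender
# keeps no reference into the queue's storage.
from collections import deque

class KernelMsgQueue:
    def __init__(self):
        self._q = deque()

    def send(self, user_buf):
        self._q.append(bytes(user_buf))   # copy at the user/kernel boundary

    def recv(self):
        return self._q.popleft()          # copied back to the receiver

buf = bytearray(b"hello")
q = KernelMsgQueue()
q.send(buf)
buf[0:5] = b"XXXXX"          # sender scribbles over its buffer afterwards
assert q.recv() == b"hello"  # the queued message is unaffected
```

This is exactly the "can't be modified in flight" guarantee from the earlier answer, and it is what a shared-memory queue gives up: there, both sides retain pointers into the live storage.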
I am implementing a kernel to which memory is transferred from the host side. The program has three kernels. Is it possible to share the same memory buffers with the kernels at different times?
Yes, multiple kernels can use the same memory objects, as long as there is no risk of the kernels executing at the same time. That is guaranteed in the usual case of a single command queue not created with out-of-order execution.
Yes, I do this with my ray tracer. I have three kernels: a preprocessor which changes geometry, a ray tracer, and a postprocessor which does image processing. I share memory buffers with all three of them, and I make sure each kernel finishes before I start the next one.
You can share memory without any problem. If the memory is read-only, you can even use that memory object as an input for two kernels running concurrently (i.e., different GPUs, same context).
However, if you want to overwrite the memory zones, be careful and use events to synchronize your kernels. I strongly recommend the events mechanism, since it enables parallel reads and writes to the memory zones from another queue.
Which of these two models would be more efficient (considering thrashing, processor-cache utilization, overall design, everything)?
1 IOCP and X threads (where X is the number of processors the machine has). My "server" would have a single IOCP (queue) for all requests and X threads to serve/handle them. I have read many articles discussing the efficiency of this design. With this model I would have one listener, also associated with the IOCP. Let's assume that I could figure out how to keep the packets/requests synchronized.
X IOCPs (where X is the number of processors the machine has), each with one thread. Each processor would have its own queue and one thread to serve/handle it. With this model I would have a separate listener (not using IOCP) that handles incoming connections and assigns each SOCKET to the proper IOCP (one of the X created). Let's assume that I could figure out the load balancing.
Using an overly simplified analogy for the two designs (a bank):
One line with several cashiers to handle the transactions: everyone waits in the same line, and each cashier takes the next available person in line.
Each cashier has their own line, and people are "placed" into one of those lines.
Between these two designs, which one is more efficient? In each model the overlapped I/O structures would be allocated with VirtualAlloc with MEM_COMMIT (as opposed to "new"), so the swap file should not be an issue (no paging). Based on how it has been described to me, memory allocated with VirtualAlloc with MEM_COMMIT is reserved and not paged out, which would allow the SOCKETs to write incoming data directly into my buffers without going through intermediate layers. So I don't think thrashing should be a factor, but I might be wrong.
Someone told me that #2 would be more efficient, but I have not heard of this model before. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
Your #2 option also prevents you from using AcceptEx() for overlapped accepts and this is more efficient than using a normal accept loop as you remove a thread (and the resulting context switching and potential contention) from the scene.
Your analogy breaks down; it's actually more a case of either having 1 queue with X bank tellers where you join the queue and know that you'll be seen in an efficient order as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts and the one next to you contains a whole bunch of people who only want to do some paying in. The single queue ensures that you get handled efficiently.
I think you're confused about MEM_COMMIT. It doesn't mean the memory isn't in the paging file and won't be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and thus reduce the number of pages locked for I/O (a page-sized buffer allocated on a page boundary takes only one page, rather than spanning two because the memory manager happened to use a block that doesn't start on a page boundary).
In general I think you're attempting to optimise way ahead of schedule. Get an efficient server working using IOCP the normal way first, and then profile it. I seriously doubt you'll even need to build your #2 version. Likewise, use new to allocate your buffers to start with, and switch to the added complexity of VirtualAlloc() only when you find that your server fails with ENOBUFS and you're sure that's caused by the I/O locked-page limit rather than a lack of non-paged pool (you do realise that VirtualAlloc() allocations must be made in 'allocation granularity'-sized chunks?).
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: The complex version you suggest could be useful on some NUMA architectures, where you use NIC teaming to have the switch split your traffic across multiple NICs, bind each NIC to a different physical processor, and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively let your network switch load-balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server that you can profile using the "normal" IOCP method first, and move to the more complex architecture only once you know that cross-NUMA-node issues are actually hurting your performance...
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple-queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it will be handled by a single thread even with the single-IOCP approach.
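The queueing-theory point can be seen in a tiny worked example using the bank analogy from the question. The job list and service times below are made up for illustration: four customers and two tellers, served first from one shared line and then with round-robin assignment to per-teller lines (a stand-in for the load balancing in option #2).

```python
# Deterministic comparison of one shared queue vs. per-server queues.
# Each job is (arrival_time, service_time); waits are time spent queued.

def shared_line_waits(jobs, servers=2):
    free = [0.0] * servers                  # when each teller becomes free
    waits = []
    for arrival, service in jobs:           # FIFO: next job takes the
        s = min(range(servers), key=lambda i: free[i])   # earliest teller
        start = max(arrival, free[s])
        waits.append(start - arrival)
        free[s] = start + service
    return waits

def per_server_waits(jobs, servers=2):
    free = [0.0] * servers
    waits = []
    for n, (arrival, service) in enumerate(jobs):
        s = n % servers                     # round-robin "load balancing"
        start = max(arrival, free[s])
        waits.append(start - arrival)
        free[s] = start + service
    return waits

jobs = [(0, 5), (0, 1), (0, 1), (0, 1)]
assert sum(shared_line_waits(jobs)) == 3    # waits: 0 + 0 + 1 + 2
assert sum(per_server_waits(jobs)) == 6     # waits: 0 + 0 + 5 + 1
```

The third customer illustrates the failure mode: round-robin parks them behind the slow 5-unit job while the other teller sits idle, which is exactly what a shared line prevents and what work-stealing would claw back.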