Is the implementation of the blkio subsystem in cgroups complete? - linux-kernel

I am trying to set upper I/O limits on processes. Read access is correctly bounded; however, write operations are not limited as I expect.
I am dealing with the pseudo-files read_bps_device and write_bps_device.
In http://www.mjmwired.net/kernel/Documentation/cgroups/blkio-controller.txt, last paragraph, I read the text below. Nevertheless, it is not clear to me whether it is talking about CFQ only or also about throttling:
What works?
Currently only sync IO queues are support. All the buffered writes are
still system wide and not per group. Hence we will not see service
differentiation between buffered writes between groups.
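For reference, here is a minimal sketch of how such a throttle limit is set, assuming the blkio hierarchy is mounted at /sys/fs/cgroup/blkio and a group named "mygroup" already exists; the device numbers 8:0 and the 1 MB/s value are placeholders only:

    /* Write "<major>:<minor> <bytes_per_second>" into the throttle pseudo-file.
     * Path, group name, and device numbers are assumptions for illustration. */
    #include <stdio.h>

    int main(void)
    {
        const char *path =
            "/sys/fs/cgroup/blkio/mygroup/blkio.throttle.write_bps_device";
        FILE *f = fopen(path, "w");
        if (!f)
            return 1;
        fprintf(f, "8:0 1048576\n");   /* limit device 8:0 to 1 MiB/s of writes */
        return fclose(f) == 0 ? 0 : 1;
    }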

Related

Efficient kernel/userspace communication for Linux kernel module

I’ve got a Linux kernel module (a qdisc / network traffic scheduler, but this shouldn’t matter) which needs efficient communication between kernel and user space.
For the communication from kernel to user space, I found relayfs over debugfs very useful: it passes a high volume of debugging information out via a ring buffer, and if userspace misses a page it doesn't matter much.
For the communication back (specifically, changing the bandwidth limit between 10 and 1000 times a second) I've been relying on the normal netlink machinery, i.e. tc qdisc change …. This not only adds overhead (mildly mitigated with tc -b - and piping the commands in via stdin), but also introduces unwanted latency, because that command involves more locking than necessary.
I think I need an efficient way to write control parameters (currently just one 64-bit number) to kernel space, very often, and with as little locking as possible. The actual update of the value can probably be done with an atomic write (most likely followed by some kind of barrier so the other CPUs read the new value; I'll have to read up on that as well). I'm unsure whether I have to change the reading code as well.
relayfs is unidirectional, and buffered, so it wouldn’t be a good fit either.
From what I understand, procfs and sysfs have traditionally been the ways to do this, with sysfs being preferred for new code. From what I see from other users of sysfs, though, it involves an open/write/close cycle every time a quantity is written, and the value is usually converted to a decimal ASCII string instead of using the binary value in host endianness. This doesn't sound like an ideal fit either.
Additionally, there may be more than one instance of the qdisc in use at any given time, so I need a way to address the correct one. I could use module-global state to keep a list of currently active qdiscs, but that would require locking, so ideally I'd fetch the pointer to the control variables only on open, with locking; writing can then be a lockless atomic operation. On the other hand, one qdisc does not necessarily need more than one writer to its control channel.
Update: I'm considering adding just another debugfs file (so I can pass the pointers as the data argument to debugfs_create_file) and doing the atomic write + barrier in the write file operation; then I think I only need an atomic read (with the paired barrier) in the consumer (which runs in softirq context). Is this sound?
Apparently, atomic64_read_acquire and atomic64_set_release are my friends for this, and the Linux kernel guarantees that simply casting from/to u64 will not invoke undefined behaviour.
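Here is a minimal sketch of that idea, assuming a hypothetical per-qdisc struct my_sched_data that holds the atomic64_t; the names and the omitted error handling are illustrative only:

    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/debugfs.h>
    #include <linux/atomic.h>
    #include <linux/uaccess.h>

    struct my_sched_data {
        atomic64_t rate_bps;        /* the control value read in softirq context */
        struct dentry *ctl_file;
    };

    /* Userspace writes the raw 64-bit value in host endianness; no ASCII parsing. */
    static ssize_t rate_write(struct file *file, const char __user *buf,
                              size_t count, loff_t *ppos)
    {
        struct my_sched_data *q = file->private_data; /* from the debugfs data arg */
        u64 val;

        if (count != sizeof(val))
            return -EINVAL;
        if (copy_from_user(&val, buf, sizeof(val)))
            return -EFAULT;

        /* The release pairs with the acquire read in the dequeue path. */
        atomic64_set_release(&q->rate_bps, val);
        return count;
    }

    static const struct file_operations rate_fops = {
        .owner = THIS_MODULE,
        .open  = simple_open,   /* copies inode->i_private into file->private_data */
        .write = rate_write,
    };

    /* In the qdisc init path:
     *   q->ctl_file = debugfs_create_file("rate_bps", 0200, parent, q, &rate_fops);
     * In the softirq-context reader (e.g. dequeue):
     *   u64 rate = atomic64_read_acquire(&q->rate_bps);
     */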

Questions about supervisor mode

Reading about operating systems from multiple resources has left me confused about supervisor mode. For example, on Wikipedia:
In kernel mode, the CPU may perform any operation allowed by its architecture […]
In the other CPU modes, certain restrictions on CPU operations are enforced by the hardware. Typically, certain instructions are not permitted (especially those—including I/O operations—that could alter the global state of the machine), some memory areas cannot be accessed
Does it mean that instructions such as LOAD and STORE are prohibited? Or does it mean something else?
I am asking this because, on a pure RISC processor, the only instructions that access I/O or memory are LOAD and STORE. A simple program that evaluates some arithmetic expression would thus seem to need supervisor mode just to read its operands.
I apologize if it's vague. If possible, can anyone explain it with an example?
I see this question was asked a few months back and should have been answered long ago.
I will try to set a few things straight before talking about the I/O part of your question.
A CPU running in "kernel mode" means that the OS has allowed the CPU to execute a few extra (privileged) instructions. This is done by setting a flag at the appropriate moment. One can think of it as a digital switch that enables or disables specific operations embedded inside the processor.
In RISC machines, LOAD and STORE move data between registers and memory. In fact, from the processor's perspective, traffic to and from main memory is not really considered an I/O operation. Data transfer between main memory and the processor happens largely automatically, by virtue of a pre-programmed page table (unless the required data is not found in main memory either, in which case disk I/O is generally needed). The OS programs this page table well in advance and does its bookkeeping in it.
An I/O operation generally means one involving external devices, which are reachable through the interrupt controller. Whenever an I/O operation completes, the corresponding device raises an interrupt towards the processor, and this causes the processor's privilege level to change appropriately. The processor then services the request by running the interrupt handler, a routine written by the OS developers which may contain privileged instructions. This raised privilege level is sometimes referred to as "kernel mode".
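As a hypothetical illustration of the distinction (x86-64 Linux and a GCC-style compiler assumed): ordinary loads and stores to a program's own memory are fine in user mode, whereas executing a privileged instruction is what the hardware traps:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Plain LOAD/STORE: allowed in user mode, no kernel involvement
         * as long as the page is resident. */
        uint64_t x = 40;
        uint64_t y = x + 2;          /* compiles to loads, an add, and a store */
        printf("%llu\n", (unsigned long long)y);

        /* A privileged operation: reading control register CR3 is only legal
         * in ring 0.  In user mode the CPU raises a general protection fault
         * and the process dies with SIGSEGV. */
        uint64_t cr3;
        __asm__ volatile("mov %%cr3, %0" : "=r"(cr3));   /* faults here */
        printf("cr3 = %llx\n", (unsigned long long)cr3); /* never reached */
        return 0;
    }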

Why do operating systems limit file descriptors?

I ask this question after trying my best to research the best way to implement a message queue server. Why do operating systems put limits on the number of open file descriptors a process and the global system can have?
My current server implementation uses zeromq, and opens a subscriber socket for each connected websocket client. Obviously that single process is only going to be able to handle clients to the limit of the fds.
When I research the topic I find lots of info on how to raise system limits to levels as high as 64k fds, but it never mentions how this affects system performance or why the defaults are 1k or lower in the first place.
My current approach is to dispatch messaging to all clients using a coroutine running in its own loop, with a map of all clients and their subscription channels. But I would love to hear a solid answer about file descriptor limits and how they affect applications that use them at a per-client level with persistent connections.
It may be because a file descriptor value is an index into a file descriptor table, so the number of possible file descriptors determines the size of the table. Average users would not want half of their RAM used up by a file descriptor table that can handle millions of file descriptors they will never need.
There are certain operations which slow down when you have lots of potential file descriptors. One example is the operation "close all file descriptors except stdin, stdout, and stderr" -- the only portable* way to do this is to attempt to close every possible file descriptor except those three, which can become a slow operation if you could potentially have millions of file descriptors open.
*: If you're willing to be non-portable, you can look in /proc/self/fd -- but that's beside the point.
This isn't a particularly good reason, but it is a reason. Another reason is simply to keep a buggy program (i.e., one that "leaks" file descriptors) from consuming too many system resources.
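For illustration, a minimal sketch of the "close everything above stderr" idiom described above; sysconf(_SC_OPEN_MAX) returns the per-process limit, so the loop's cost grows with that limit even when almost nothing is actually open:

    #include <unistd.h>

    static void close_inherited_fds(void)
    {
        long max_fd = sysconf(_SC_OPEN_MAX);
        if (max_fd < 0)
            max_fd = 1024;              /* fall back to a common default */

        for (long fd = 3; fd < max_fd; fd++)
            close((int)fd);             /* EBADF for unused slots is harmless */
    }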
For performance purposes, the open file table needs to be statically allocated, so its size needs to be fixed. File descriptors are just offsets into this table, so all the entries need to be contiguous. You can resize the table, but this requires halting all threads in the process and allocating a new block of memory for the file table, then copying all entries from the old table to the new one. It's not something you want to do dynamically, especially when the reason you're doing it is because the old table is full!
On Unix systems, the fork() and fork()/exec() process-creation idiom requires iterating over all potential process file descriptors and attempting to close each one, typically leaving only a few file descriptors such as stdin, stdout, and stderr untouched or redirected somewhere else.
Since this is the Unix API for launching a process, it has to be done any time a new process is created, including when executing each and every non-built-in command invoked within shell scripts.
Other factors to consider are that while some software may use sysconf(_SC_OPEN_MAX) to dynamically determine the number of files that may be opened by a process, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors, and as such can never handle more than that many open files regardless of any administratively defined higher limit.
Unix has a legacy asynchronous I/O mechanism based on file descriptor sets, which use bit offsets to represent the files to wait on and the files that are ready or in an exception condition. It doesn't scale well to thousands of files, as these descriptor sets need to be set up and cleared on each pass around the run loop. Newer non-standard APIs have appeared on the major Unix variants, including kqueue() on *BSD and epoll() on Linux, to address performance shortcomings when dealing with a large number of descriptors.
It is important to note that select()/poll() is still used by a LOT of software, as for a long time it has been the POSIX API for I/O multiplexing. The modern POSIX asynchronous I/O approach is the aio_* API, but it is likely not competitive with the kqueue() or epoll() APIs. I haven't used aio in anger, and it certainly wouldn't have the performance and semantics offered by the native approaches, which can aggregate multiple events for higher performance. kqueue() on *BSD has really good edge-triggered semantics for event notification, allowing it to replace select()/poll() without forcing large structural changes to your application. Linux epoll() follows the lead of *BSD kqueue() and improves upon it, which in turn followed the lead of Sun/Solaris event ports.
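For contrast, a minimal epoll sketch (Linux-specific, everything but the loop elided): the interest set is registered once with epoll_ctl(), and each epoll_wait() returns only the descriptors that are ready, so the per-iteration cost does not grow with the total number of watched descriptors the way re-building fd_sets for select() does:

    #include <sys/epoll.h>
    #include <unistd.h>

    int watch_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        if (epfd < 0)
            return -1;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev) < 0)
            return -1;

        struct epoll_event ready[64];
        for (;;) {
            int n = epoll_wait(epfd, ready, 64, -1);   /* blocks until events arrive */
            if (n < 0)
                break;                                 /* interrupted or failed */
            for (int i = 0; i < n; i++) {
                /* handle ready[i].data.fd: accept, read, etc. */
            }
        }
        close(epfd);
        return 0;
    }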
The upshot is that increasing the number of allowed open files across the system adds both time and space overhead for every process in the system, even if a process cannot make use of those descriptors with the API it is using. There are also aggregate system-wide limits on the number of open files allowed. This older but interesting tuning summary for 100k-200k simultaneous connections using nginx on FreeBSD provides some insight into the overheads of maintaining open connections, and another one covering a wider range of systems sees "only" 10K connections as the Mount Everest.
Probably the best reference for Unix systems programming is W. Richard Stevens' Advanced Programming in the UNIX Environment.

Efficient Overlapped I/O for a socket server

Which of these two models would be more efficient (considering thrashing, processor cache utilization, overall design, everything, etc.)?
1. One IOCP and X threads (where X is the number of processors the computer has). This would mean that my "server" would have only one IOCP (queue) for all requests and X threads to serve/handle them. I have read many articles discussing the efficiency of this design. With this model I would have one listener that would also be associated with the IOCP. Let's assume that I could figure out how to keep the packets/requests synchronized.
2. X IOCPs (where X is the number of processors the computer has), each with one thread. This would mean that each processor has its own queue and one thread to serve/handle it. With this model I would have a separate listener (not using an IOCP) that would handle incoming connections and assign each SOCKET to the proper IOCP (one of the X that were created). Let's assume that I could figure out the load balancing.
Using an overly simplified analogy for the two designs (a bank):
1. One line with several cashiers to handle the transactions. Each person is in the same line, and each cashier takes the next available person in line.
2. Each cashier has their own line, and the people are "placed" into one of those lines.
Between these two designs, which one is more efficient? In each model the overlapped I/O structures would use VirtualAlloc with MEM_COMMIT (as opposed to "new"), so the swap file should not be an issue (no paging). Based on how it has been described to me, with VirtualAlloc and MEM_COMMIT the memory is reserved and is not paged out. This would allow the SOCKETs to write the incoming data right into my buffers without going through intermediate layers. So I don't think thrashing should be a factor, but I might be wrong.
Someone was telling me that #2 would be more efficient but I have not heard of this model. Thanks in advance for your comments!
I assume that for #2 you plan to manually associate your sockets with an IOCP that you decide is 'best' based on some measure of 'goodness' at the time the socket is accepted? And that somehow this measure of 'goodness' will persist for the life of the socket?
With IOCP used the 'standard' way, i.e. your option number 1, the kernel works out how best to use the threads you have and allows more to run if any of them block. With your method, assuming you somehow work out how to distribute the work, you are going to end up with more threads running than with option 1.
Your #2 option also prevents you from using AcceptEx() for overlapped accepts, which is more efficient than using a normal accept loop because you remove a thread (and the resulting context switching and potential contention) from the scene.
Your analogy breaks down; it's actually more a case of having one queue with X bank tellers, where you join the queue and know that you'll be seen in an efficient order, as opposed to each teller having their own queue and you having to guess that the queue you join doesn't contain a whole bunch of people who want to open new accounts while the queue next to you contains only people who want to pay money in. The single queue ensures that you get handled efficiently.
I think you're confused about MEM_COMMIT. It doesn't mean that the memory isn't in the paging file and won't be paged. The usual reason for using VirtualAlloc for overlapped buffers is to ensure alignment on page boundaries and so reduce the number of pages that are locked for I/O (a page-sized buffer can be allocated on a page boundary and so only take one page, rather than happening to span two because the memory manager decided to use a block that doesn't start on a page boundary).
In general I think you're attempting to optimise something way ahead of schedule. Get an efficient server working using IOCP the normal way first and then profile it. I seriously doubt that you'll even need to worry about building your #2 version... Likewise, use new to allocate your buffers to start with and then switch to the added complexity of VirtualAlloc() when you find that your server fails due to ENOBUFS and you're sure that's caused by the I/O locked-page limit and not by a lack of non-paged pool (you do realise that you have to allocate in 'allocation granularity' sized chunks for VirtualAlloc()?).
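To illustrate the VirtualAlloc point, here is a small hypothetical helper that commits a page-aligned, page-sized receive buffer; as noted above, MEM_COMMIT does not pin the memory, the benefit is the alignment, so the buffer locks exactly one page during an overlapped operation:

    #include <windows.h>

    static void *alloc_page_aligned_buffer(SIZE_T *out_size)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);                     /* dwPageSize, dwAllocationGranularity */

        SIZE_T size = si.dwPageSize;            /* one page */
        void *p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (p && out_size)
            *out_size = size;
        return p;   /* release later with VirtualFree(p, 0, MEM_RELEASE) */
    }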
Anyway, I have a free IOCP server framework that's available here: http://www.serverframework.com/products---the-free-framework.html which might help you get started.
Edited: the complex version that you suggest could be useful on some NUMA architectures where you use NIC teaming to have the switch split your traffic across multiple NICs, bind each NIC to a different physical processor, and then bind your IOCP threads to the same processor. You then allocate memory from that NUMA node and effectively have your network switch load-balance your connections across your NUMA nodes. I'd still suggest that it's better, IMHO, to get a working server which you can profile using the "normal" method of IOCP first, and to move towards the more complex architecture only once you know that cross-NUMA-node issues are actually affecting your performance...
Queuing theory tells us that a single queue has better characteristics than multiple queues. You could possibly get around this with work-stealing.
The multiple-queues method should have better cache behavior. Whether it is significantly better depends on how many received packets are associated with a single transaction. If a request fits in a single incoming packet, then it will be handled by a single thread even with the single-IOCP approach.

Win32 Overlapped I/O - Completion routines or WaitForMultipleObjects?

I'm wondering which approach is faster, and why?
While writing a Win32 server I have read a lot about Completion Ports and Overlapped I/O, but I have not found anything that suggests which set of APIs yields the best results in a server.
Should I use completion routines, or should I use the WaitForMultipleObjects API and why ?
You suggest two methods of doing overlapped I/O and ignore the third (or I'm misunderstanding your question).
When you issue an overlapped operation, a WSARecv() for example, you can specify an OVERLAPPED structure which contains an event, and you can wait for that event to be signalled to indicate that the overlapped I/O has completed. This, I assume, is your WaitForMultipleObjects() approach and, as previously mentioned, it doesn't scale well as you're limited in the number of handles that you can pass to WaitForMultipleObjects().
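A minimal sketch of that event-per-OVERLAPPED approach, for a single socket (it assumes WSAStartup has been called and s is a connected overlapped socket; error handling is trimmed). The 64-handle ceiling of WaitForMultipleObjects() is what limits this model:

    #include <winsock2.h>
    #include <windows.h>

    void recv_with_event(SOCKET s, char *buf, ULONG len)
    {
        WSAOVERLAPPED ov = {0};
        ov.hEvent = WSACreateEvent();

        WSABUF wsabuf = { len, buf };
        DWORD bytes = 0, flags = 0;

        if (WSARecv(s, &wsabuf, 1, &bytes, &flags, &ov, NULL) == SOCKET_ERROR &&
            WSAGetLastError() != WSA_IO_PENDING) {
            WSACloseEvent(ov.hEvent);
            return;                                   /* real failure */
        }

        /* In a real server you collect many such events (up to 64) and wait on them: */
        HANDLE events[1] = { ov.hEvent };
        WaitForMultipleObjects(1, events, FALSE, INFINITE);

        WSAGetOverlappedResult(s, &ov, &bytes, FALSE, &flags);
        /* buf[0..bytes) now holds the received data */
        WSACloseEvent(ov.hEvent);
    }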
Alternatively you can pass a completion routine which is called when the completion occurs. This is known as 'alertable I/O' and requires that the thread that issued the WSARecv() call is in an 'alertable' state for the completion routine to be called. Threads can put themselves in an alertable state in several ways (calling SleepEx() or the various Ex versions of the Wait functions, etc.). The Richter book that I have open in front of me says "I have worked with alertable I/O quite a bit, and I'll be the first to tell you that alertable I/O is horrible and should be avoided". Enough said, IMHO.
There's a third way: before issuing the call you associate the handle on which you want to do overlapped I/O with a completion port. You then create a pool of threads which service this completion port by calling GetQueuedCompletionStatus() in a loop. You issue your WSARecv() with an OVERLAPPED structure WITHOUT an event in it, and when the I/O completes the completion pops out of GetQueuedCompletionStatus() on one of your I/O pool threads and can be handled there.
As previously mentioned, Vista/Server 2008 have cleaned up how IOCPs work a little and removed the problem whereby you had to make sure that the thread that issued the overlapped request continued to run until the request completed. Link to a reference to that can be found here. But this problem is easy to work around anyway; you simply marshal the WSARecv over to one of your I/O pool threads using the same IOCP that you use for completions...
Anyway, IMHO using IOCPs is the best way to do overlapped I/O. Yes, getting your head around the overlapped/async nature of the calls can take a little time at the start but it's well worth it as the system scales very well and offers a simple "fire and forget" method of dealing with overlapped operations.
If you need some sample code to get you going then I have several articles on writing IO completion port systems and a heap of free code that provides a real-world framework for high performance servers; see here.
As an aside: IMHO you really should read "Windows Via C/C++ (PRO-Developer)" by Jeffrey Richter and Christophe Nasarre, as it deals with all you need to know about overlapped I/O and most other advanced Windows platform techniques and APIs.
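A minimal sketch of the GetQueuedCompletionStatus() worker loop described above (socket setup elided; per_io_data is a hypothetical struct that embeds the OVERLAPPED so it can be recovered from the completion):

    #include <winsock2.h>
    #include <windows.h>

    struct per_io_data {
        OVERLAPPED ov;       /* recovered from the completion via CONTAINING_RECORD */
        WSABUF     wsabuf;
        char       buffer[4096];
    };

    static DWORD WINAPI worker_thread(LPVOID param)
    {
        HANDLE iocp = (HANDLE)param;

        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;             /* per-socket context given at association time */
            OVERLAPPED *ov = NULL;

            if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
                if (ov == NULL)
                    break;                 /* the port itself failed or was closed */
                continue;                  /* the I/O failed; clean up its per_io_data */
            }

            struct per_io_data *io = CONTAINING_RECORD(ov, struct per_io_data, ov);
            /* process io->buffer[0..bytes), then issue the next WSARecv() with
             * io->ov zeroed and no hEvent set */
            (void)io;
        }
        return 0;
    }

    /* Association at accept time (sketch):
     *   CreateIoCompletionPort((HANDLE)sock, iocp, (ULONG_PTR)per_socket_ctx, 0);
     */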
WaitForMultipleObjects is limited to 64 handles; in a highly concurrent application this could become a limitation.
Completion ports fit better with a model of having a pool of threads all of which are capable of handling any event, and you can queue your own (non-IO based) events into the port, whereas with waits you would need to code your own mechanism.
However completion ports, and the event-based programming model, are a more difficult concept to really work with.
I would not expect any significant performance difference, but in the end you can only make your own measurements to reflect your usage. Note that Vista/Server 2008 made a change to completion ports so that the originating thread is no longer needed to complete I/O operations; this may make a bigger difference (see this article by Mark Russinovich).
Table 6-3 in the book Network Programming for Microsoft Windows, 2nd Edition compares the scalability of overlapped I/O via completion ports vs. other techniques. Completion ports blow all the other I/O models out of the water when it comes to throughput, while using far fewer threads.
The difference between WaitForMultipleObjects() and I/O completion ports is that IOCP scales to thousands of objects, whereas WFMO() does not and should not be used for anything more than 64 objects (even though you could).
You can't really compare them for performance, because in the domain of < 64 objects, they will be essentially identical.
WFMO() however does a round-robin on its objects, so busy objects with low index numbers can starve objects with high index numbers (e.g. if object 0 is going off constantly, it will starve objects 1, 2, 3, etc.). This is obviously undesirable.
I wrote an IOCP library (for sockets) to solve the C10K problem and put it in the public domain. On a 512 MB W2K machine I was able to get 4,000 sockets concurrently transferring data. (You can get a lot more sockets if they're idle; a busy socket consumes more non-paged pool, and that's the ultimate limit on how many sockets you can have.)
http://www.45mercystreet.com/computing/libiocp/index.html
The API should give you exactly what you need.
Not sure, but I use WaitForMultipleObjects and/or WaitForSingleObject. It's very convenient.
Either routine works, and I don't think one is significantly faster than the other.
These two approaches exist to satisfy different programming models.
WaitForMultipleObjects is there to facilitate the async completion pattern (like the UNIX select() function), while completion ports lean more towards an event-driven model.
I personally think the WaitForMultipleObjects() approach results in cleaner code that is more thread safe.
