I have a family of PID controllers (around 100), each designed for a different operating point. In the implementation I interpolate between them, but the transfer is still not bumpless.
How do we implement a family of PIDs with bumpless transfer?
The PID tracking mode is a common solution for bumpless transfer.
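For illustration, here is a minimal C sketch of that idea; the structure and names are hypothetical (not from any particular library), and the gains are assumed to come from interpolating the controller family over the operating point. When the gains change, the integrator is re-initialized so the output is continuous at the switch instant:

```c
/* Bumpless gain-switching sketch for a PID controller.             */
typedef struct {
    double kp, ki, kd;   /* current (interpolated) gains            */
    double integ;        /* integrator state                        */
    double prev_e;       /* previous error, for the derivative term */
    double u;            /* last computed output                    */
} pid_ctrl;

/* Call whenever the interpolated gains change. The new integrator
 * state is chosen so that the output computed with the new gains
 * equals the output produced with the old ones:
 *   u = kp*e + integ + kd*de   =>   integ = u - kp*e - kd*de       */
static void pid_retune_bumpless(pid_ctrl *c, double kp, double ki,
                                double kd, double e, double de)
{
    c->integ = c->u - kp * e - kd * de;
    c->kp = kp;
    c->ki = ki;
    c->kd = kd;
}

/* One control step with error e and sample time dt. */
static double pid_step(pid_ctrl *c, double e, double dt)
{
    double de = (e - c->prev_e) / dt;
    c->integ += c->ki * e * dt;
    c->u = c->kp * e + c->integ + c->kd * de;
    c->prev_e = e;
    return c->u;
}
```

Tracking mode achieves the same effect continuously: the integrator of the inactive (or newly retuned) controller is driven so that its output tracks the signal currently being applied to the plant.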
From my reading, D-Bus should be roughly twice as slow as other messaging IPC mechanisms because of the extra hop through the daemon.
In the discussion of the SO question about which Linux IPC technique to use, some people mention performance issues. Do you see performance issues other than this factor of two? Do you see any issue that would prevent D-Bus from being used in an embedded system?
To my understanding, D-Bus is intended for small messages. If a large amount of data needs to be passed around, one solution is to put the data into shared memory or a file and then use D-Bus to send a notification. The other IPC mechanisms under consideration in the SO discussion are: Signals, Anonymous Pipes, Named Pipes or FIFOs, SysV Message Queues, POSIX Message Queues, SysV Shared Memory, POSIX Shared Memory, SysV Semaphores, POSIX Semaphores, FUTEX Locks, File-backed and anonymous shared memory using mmap, UNIX Domain Sockets, Netlink Sockets, Network Sockets, Inotify mechanisms, the FUSE subsystem, and the D-Bus subsystem.
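For the notification half of that shared-memory pattern, here is a rough libdbus sketch. The object path, interface, signal name, and the idea of passing the segment name as a string argument are all illustrative assumptions, not anything prescribed by D-Bus:

```c
#include <dbus/dbus.h>

/* Emit a signal telling peers that new data is available in the
 * shared-memory segment named shm_name (error handling omitted). */
static void notify_data_ready(DBusConnection *conn, const char *shm_name)
{
    DBusMessage *sig = dbus_message_new_signal("/com/example/Producer", /* object path (placeholder) */
                                               "com.example.Producer",  /* interface (placeholder)   */
                                               "DataReady");            /* signal name (placeholder) */
    dbus_message_append_args(sig,
                             DBUS_TYPE_STRING, &shm_name,
                             DBUS_TYPE_INVALID);
    dbus_connection_send(conn, sig, NULL);
    dbus_connection_flush(conn);
    dbus_message_unref(sig);
}
```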
I should mention another SO question which lists the requirements (though it is Apache-centered):
packet/message oriented
ability to handle both point-to-point and one-to-many communication
no hierarchy, there's no server and client
if one endpoint crashes, the others must be notified
good support from existing Linux distros
existence of a "bind" for Apache, for the purpose of creating dynamic pages -- this is too specific, though, and can be ignored in a general discussion of embedded D-Bus usage
Yet another SO question about performance mentions techniques for improving it. With all of this taken care of, I guess there should be fewer issues or drawbacks when D-Bus is used in an embedded system.
I don't think there is any real-and-big performance issue.
Did some profiling:
On an ARM926EJ-S 200 MHz processor, a method call and reply with two uint32 arguments takes anywhere between 0 and 15 ms, 6 ms on average.
With the second parameter changed to an array of 1000 bytes, packing and unpacking it with the iteration API takes about 18 ms.
With the same 1000-byte array, packing and unpacking it with the fixed-length API takes about 8 ms (see the sketch below).
As a comparison, using a SysV message queue to pass a message to another process and get a reply also takes about 10 ms, though without optimizing the code or repeating the test over a large number of samples.
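For reference, here is a rough sketch of the fixed-length packing path with libdbus mentioned above. It assumes msg is a DBusMessage that has already been created (e.g. with dbus_message_new_method_call()); the buffer and length are placeholders:

```c
#include <dbus/dbus.h>

/* Append a uint32 plus a byte array to an existing message using the
 * fixed-array API instead of appending one byte at a time. */
static dbus_bool_t append_args(DBusMessage *msg, dbus_uint32_t id,
                               const unsigned char *buf, int len)
{
    DBusMessageIter iter, array;

    dbus_message_iter_init_append(msg, &iter);

    if (!dbus_message_iter_append_basic(&iter, DBUS_TYPE_UINT32, &id))
        return FALSE;

    /* Open an array of bytes ("ay") and append the whole buffer at once. */
    if (!dbus_message_iter_open_container(&iter, DBUS_TYPE_ARRAY,
                                          DBUS_TYPE_BYTE_AS_STRING, &array))
        return FALSE;
    if (!dbus_message_iter_append_fixed_array(&array, DBUS_TYPE_BYTE,
                                              &buf, len))
        return FALSE;
    return dbus_message_iter_close_container(&iter, &array);
}
```

The receiving side can use dbus_message_iter_get_fixed_array() to unpack the array without iterating element by element.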
In summary, the profiling does not show a performance issue.
To support this conclusion, there is a performance-related page on the D-Bus site; the only overhead it calls out is the double context switch, because with D-Bus a message goes to the daemon first and then on to the destination.
Edit: if you send messages directly, bypassing the daemon, the performance roughly doubles.
Well, the GENIVI Alliance, targeting the automotive industry, implemented and supports CommonAPI, which works on top of D-Bus, as the IPC mechanism for car head units.
In the answer to this, it is mentioned:
People also hear that X uses the "network" and think this is going to be a performance bottleneck. "Network" here means local UNIX domain socket, which has negligible overhead on modern Linux. Things that would bottleneck on the network, there are X extensions to make fast (shared memory pixmaps, DRI, etc.). Threads in-process wouldn't necessarily be faster than the X socket, because the bottlenecks have more to do with the inherent problem of coordinating multiple threads or processes accessing the same hardware, than with the minimal overhead of local sockets.
I don't get it. I always thought that multiple threads communicating through shared variables should be faster than multiple processes communicating through a UNIX domain socket. So... am I wrong? Is coordinating multiple threads really such a time-consuming job? And does the order in which processes get scheduled not affect the performance of the UNIX domain socket at all?
Any idea? Please...
Sorry, I didn't make the question clear. What I wanted to ask about is IPC efficiency rather than the X Window/Wayland system.
I just want to know why a UNIX domain socket can be faster than shared memory. AFAIK, shared memory is the most primitive way to communicate between processes and threads, isn't it? So a UNIX domain socket should be built on top of a shared-memory mechanism (with proper locking). How can the student (the UNIX domain socket) outperform its teacher (shared memory)?
For performance, what matters is the slowest thing (the bottleneck). If some part of the program could be faster, but it isn't the bottleneck, modifying that part of the program will not help you.
This is why improving performance should always start with profiling. Every single bit of a program can always be made faster, but you need to make the bottleneck faster, not just some random thing.
With X, people will often latch on to the easy insight that something over a socket could always be slightly faster if in a single process. Which is true, but it doesn't necessarily matter to overall performance. More important is the overall design of the system... that's what something like Wayland is trying to fix.
I ask this question after trying my best to research the best way to implement a message queue server. Why do operating systems put limits on the number of open file descriptors a process, and the system as a whole, can have?
My current server implementation uses ZeroMQ and opens a subscriber socket for each connected WebSocket client. Obviously that single process can only handle clients up to the fd limit.
When I research the topic I find lots of information on how to raise the system limits to levels as high as 64k fds, but it never mentions how this affects system performance, or why the default is 1k or lower to start with.
My current approach is to dispatch messages to all clients using a coroutine in its own loop, with a map of all clients and their subscription channels. But I would love to hear a solid answer about file descriptor limits and how they affect applications that use one descriptor per client with persistent connections.
It may be because a file descriptor value is an index into a file descriptor table, so the number of possible file descriptors determines the size of the table. Average users would not want half of their RAM used up by a file descriptor table that can handle millions of file descriptors they will never need.
There are certain operations which slow down when you have lots of potential file descriptors. One example is the operation "close all file descriptors except stdin, stdout, and stderr" -- the only portable* way to do this is to attempt to close every possible file descriptor except those three, which can become a slow operation if you could potentially have millions of file descriptors open.
*: If you're willing to be non-portable, you can look in /proc/self/fd -- but that's beside the point.
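As a rough illustration of why a large limit makes that operation expensive, the portable idiom is simply a brute-force loop over every possible descriptor (a sketch only; getrlimit() is one of several ways to find the upper bound):

```c
#include <unistd.h>
#include <sys/resource.h>

/* Close every descriptor except stdin/stdout/stderr by brute force.
 * With a limit of millions of fds, this loop alone becomes costly. */
static void close_all_fds(void)
{
    struct rlimit rl;
    long max_fd = 1024;                 /* fallback guess */

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        max_fd = (long)rl.rlim_cur;

    for (long fd = 3; fd < max_fd; fd++)
        close((int)fd);                 /* EBADF for unused slots is ignored */
}
```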
This isn't a particularly good reason, but it is a reason. Another reason is simply to keep a buggy program (i.e., one that "leaks" file descriptors) from consuming too much of the system's resources.
For performance purposes, the open file table needs to be statically allocated, so its size needs to be fixed. File descriptors are just offsets into this table, so all the entries need to be contiguous. You can resize the table, but this requires halting all threads in the process and allocating a new block of memory for the file table, then copying all entries from the old table to the new one. It's not something you want to do dynamically, especially when the reason you're doing it is because the old table is full!
On UNIX systems, the fork()/exec() process-creation idiom requires iterating over all of a process's potential file descriptors and attempting to close each one, typically leaving only a few descriptors such as stdin, stdout, and stderr untouched or redirected somewhere else.
Since this is the UNIX API for launching a process, it has to be done every time a new process is created, including when executing each and every non-built-in command invoked within shell scripts.
Other factors to consider are that, while some software may use sysconf(_SC_OPEN_MAX) to determine at run time how many files a process may open, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors, and as such can never have more than that many files open regardless of any administratively defined higher limit.
UNIX has a legacy asynchronous I/O mechanism based on file descriptor sets, which use bit offsets to represent the files to wait on and the files that are ready or in an exception condition. It doesn't scale well to thousands of files, since these descriptor sets have to be set up and cleared on each pass around the run loop. Newer non-standard APIs have appeared on the major UNIX variants, including kqueue() on *BSD and epoll() on Linux, to address the performance shortcomings when dealing with a large number of descriptors.
It is important to note that select()/poll() is still used by a LOT of software, as it has long been the POSIX API for multiplexed I/O. The modern POSIX asynchronous I/O approach is the aio_* API, but it is likely not competitive with the kqueue() or epoll() APIs. I haven't used aio in anger, and it certainly wouldn't have the performance and semantics offered by the native approaches, which can aggregate multiple events for higher performance. kqueue() on *BSD has really good edge-triggered semantics for event notification, allowing it to replace select()/poll() without forcing large structural changes to your application. Linux's epoll() follows the lead of *BSD's kqueue(), which in turn followed the lead of Sun/Solaris event ports, and improves upon it.
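As a tiny illustration of the epoll() side of that comparison (Linux-specific; error handling omitted; listen_fd is assumed to be a socket that is already listening):

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait for readiness on many descriptors. The interest set is
 * registered once with epoll_ctl(), so there is no fd_set to rebuild
 * and rescan on every iteration as with select()/poll().            */
static void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event ready[64];
        int n = epoll_wait(epfd, ready, 64, -1);
        for (int i = 0; i < n; i++) {
            /* handle ready[i].data.fd here: accept(), read(), ... */
            (void)ready[i];
        }
    }
}
```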
The upshot is that increasing the number of allowed open files across the system adds both time and space overhead for every process in the system, even for processes that cannot make use of the extra descriptors with the APIs they are using. There are also aggregate system-wide limits on the number of open files allowed. This older but interesting tuning summary for 100k-200k simultaneous connections using nginx on FreeBSD provides some insight into the overhead of maintaining open connections, and another one covers a wider range of systems but treats "only" 10k connections as the Mount Everest.
Probably the best reference for UNIX systems programming is W. Richard Stevens' Advanced Programming in the UNIX Environment.
I noticed that all the _EPROCESS objects are linked to each other via the ActiveProcessList link. What is the purpose of this list? What does the OS use this list of active processes for?
In Windows NT, the schedulable unit is the thread. Processes serve as a container of threads, and also as an abstraction that defines what virtual memory map is active (and some other things).
All operating systems need to keep this information available. At different times, different components of the operating system may need to search for a process that matches a specific characteristic, or may need to enumerate all active processes.
So, how do we store this information? Why not a gigantic array in memory? Well, how big is that array going to be? Are we comfortable limiting the number of active processes to the size of this array? What happens if we can't grow the array? Are we prepared to reserve all that memory up front to keep track of the processes? In the low process use case, isn't that a lot of wasted memory?
So instead we keep them on a linked list.
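Schematically, the idea is an intrusive doubly-linked list in the style of NT's LIST_ENTRY; the sketch below is only a generic illustration, not the actual _EPROCESS layout:

```c
/* Generic intrusive doubly-linked list. The head is a sentinel whose
 * flink and blink both point back to itself when the list is empty. */
struct list_entry {
    struct list_entry *flink;   /* forward link  */
    struct list_entry *blink;   /* backward link */
};

struct process_record {                 /* hypothetical stand-in for _EPROCESS */
    unsigned long     pid;
    struct list_entry active_links;     /* chains all active processes */
    /* ... everything else describing the process ... */
};

/* Insert entry at the tail of the circular list headed by head: O(1),
 * with no fixed-size table to outgrow and no reallocation or copying. */
static void list_insert_tail(struct list_entry *head, struct list_entry *entry)
{
    entry->blink = head->blink;
    entry->flink = head;
    head->blink->flink = entry;
    head->blink = entry;
}
```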
There are some occasions in NT where we care about process context but not thread context. One of those is I/O completion. When an I/O operation is handled asynchronously by the operating system, the eventual completion of that I/O could be in a process context that is different from the requesting process context. So, we need some records and information about the originating process so that we can "attach" to this process. "Attaching" to the process swaps us into the appropriate context with the appropriate user-mode memory available. We don't care about thread context, we care about process context, so this works.
We are now assessing different IPC (or rather RPC) methods for our current project, which is in its very early stages. Performance is a big deal, and so we are making some measurements to aid our choice. Our processes that will be communicating will reside on the same machine.
A separate valid option is to avoid IPC altogether (by encapsulating the features of one of the processes in a .NET DLL and having the other one use it), but this is an option we would really like to avoid, as these two pieces of software are developed by two separate companies and we find it very important to maintain good "fences", which make good neighbors.
Our tests consisted of passing messages (which contain variously sized BLOBs) across process boundaries using each method. These are the figures we get (performance range correlates with message size range):
Web Service (SOAP over HTTP):
25-30 MB/s when binary data is encoded as Base64 (default)
70-100 MB/s when MTOM is utilized
.NET Remoting (BinaryFormatter over TCP): 100-115 MB/s
Control group - DLL method call + mem copy: 800-1000 MB/s
Now, we've been looking all over the place for average performance figures for these (and other) IPC methods, including the performance of raw TCP loopback sockets, but couldn't find any. Do these figures look sane? Why are these local IPC methods at least 10 times slower than copying memory? I couldn't get better results even when I used raw sockets - is the overhead of TCP really that big?
Shared memory is the fastest.
A producer process can put its output into memory shared between processes and notify other processes that the shared data has been updated. On Linux you naturally put a mutex and a condition variable in that same shared memory so that other processes can wait for updates on the condition variable.
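A rough sketch of that setup on Linux, using POSIX shared memory and process-shared pthread objects. The segment name and payload size are placeholders, error handling is omitted, and you'd link with -lpthread (and -lrt on older glibc):

```c
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/ipc_demo"            /* placeholder segment name */
#define PAYLOAD  (3 * 1024 * 1024)      /* placeholder payload size */

struct channel {
    pthread_mutex_t lock;
    pthread_cond_t  updated;
    size_t          len;
    unsigned char   data[PAYLOAD];
};

/* Producer side: create and map the segment, then initialize the
 * synchronization objects as process-shared. */
static struct channel *channel_create(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct channel));
    struct channel *ch = mmap(NULL, sizeof(struct channel),
                              PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&ch->lock, &ma);

    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&ch->updated, &ca);
    return ch;
}

/* Publish one update: copy the payload in and wake any waiters. */
static void channel_publish(struct channel *ch, const void *buf, size_t len)
{
    pthread_mutex_lock(&ch->lock);
    memcpy(ch->data, buf, len);
    ch->len = len;
    pthread_cond_broadcast(&ch->updated);
    pthread_mutex_unlock(&ch->lock);
}
```

Consumers would shm_open() and mmap() the same name and block in pthread_cond_wait() on ch->updated until the producer signals.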
Memory-mapped files plus synchronization objects are the right way to go (almost the same as shared memory, but with more control). Sockets are way too slow for local communication. In particular, it sometimes happens that network drivers are slower over localhost than over the network.
Several parts of our system have been redesigned so that we don't have to pass 30 MB messages around, but only 3 MB ones. This allowed us to choose .NET Remoting with BinaryFormatter over named pipes (IpcChannel), which gives satisfactory results.
Our contingency plan (in case we ever do need to pass 30MB messages around) is to pass protobuf-serialized messages over named pipes manually. We have determined that this also provides satisfactory results.