What makes VxWorks so deterministic and fast? - performance

I worked on VxWorks 5.5 long time back and it was the best experience working on world's best real time OS. Since then I never got a chance to work on it again. But, a question keeps popping to me, what makes is so fast and deterministic?
I have not been able to find many references for this question via Google.
So, I just tried thinking what makes a regular OS non-deterministic:
Memory allocation/de-allocation:- Wiki says RTOS use fixed size blocks, so that these blocks can be directly indexed, but this will cause internal fragmentation and I am sure this is something not at all desirable on mission critical systems where the memory is already limited.
Paging/segmentation:- Its kind of linked to Point 1
Interrupt Handling:- Not sure how VxWorks implements it, as this is something VxWorks handles very well
Context switching:- I believe in VxWorks 5.5 all the processes used to execute in kernel address space, so context switching used to involve just saving register values and nothing about PCB(process control block), but still I am not 100% sure
Process scheduling algorithms:- If Windows implements preemptive scheduling (priority/round robin) then will process scheduling be as fast as in VxWorks? I dont think so. So, how does VxWorks handle scheduling?
Please correct my understanding wherever required.

I believe the following would account for lots of the difference:
No Paging/Swapping
A deterministic RTOS simply can't swap memory pages to disk. This would kill the determinism, since at any moment you could have to swap memory in or out.
vxWorks requires that your application fit entirely in RAM
No Processes
In vxWorks 5.5, there are tasks, but no process like Windows or Linux. The tasks are more akin to threads and switching context is a relatively inexpensive operation. In Linux/Windows, switching process is quite expensive.
Note that in vxWorks 6.x, a process model was introduced, which increases some overhead, but mainly related to transitioning from User mode to Supervisor mode. The task switching time is not necessarily directly affected by the new model.
Fixed Priority
In vxWorks, the task priorities are set by the developer and are system wide. The highest priority task at any given time will be the one running. You can thus design your system to ensure that the tasks with the tightest deadline always executes before others.
In Linux/Windows, generally speaking, while you have some control over the priority of processes, the scheduler will eventually let lower priority processes run even if higher priority process are still active.

Related

Is a context switch less expensive between processes with the same executable

Question
Are there any notable differences between context switching between processes running the same executable (for example, two separate instances of cat) vs processes running different executables?
Background
I already know that having the same executable means that it can be cached in the same place in memory and in any of the CPU caches that might be available, so I know that when you switch from one process to another, if they're both executing the same executable, your odds of having a cache miss are smaller (possibly zero, if the executable is small enough or they're executing in roughly the same "spot", and the kernel doesn't do anything in the meantime that could cause the relevant memory to be evicted from the cache). This of course applies "all the way down", to memory still being in RAM vs. having been paged out to swap/disk.
I'm curious if there are other considerations that I'm missing? Anything to do with virtual memory mappings, perhaps, or if there are any kernels out there which are able to somehow get more optimal performance out of context switches between two processes running the same executable binary?
Motivation
I've been thinking about the Unix philosophy of small programs that do one thing well, and how taken to its logical conclusion, it leads to lots of small executables being forked and executed many times. (For example, 30-something runsv processes getting started up nearly simultaneously on Void Linux boot - note that runsv is only a good example during startup, because they mostly spend their time blocked waiting for events once they start their child service, so besides early boot, there isn't much context-switching between them happening. But we could easily image numerous cat or /bin/sh instances running at once or whatever.)
The context switching overhead is the same. That is usually done with a single (time consuming) instruction.
There are some more advanced operating systems (i.e. not eunuchs) that support installed shared programs. They have reduced overhead when more than one process accesses them. E.g., only one copy of read only data loaded into physical memory.

Why do you use the keyword delete?

I understand that delete returns memory to the heap that was allocated of the heap, but what is the point? Computers have plenty of memory don't they? And all of the memory is returned as soon as you "X" out of the program.
Example:
Consider a server that allocates an object Packet for each packet it receives (this is bad design for the sake of the example).
A server, by nature, is intended to never shut down. If you never delete the thousands of Packet your server handles per second, your system is going to swamp and crash in a few minutes.
Another example:
Consider a video game that allocates particles for the special effect, everytime a new explosion is created (and never deletes them). In a game like Starcraft (or other recent ones), after a few minutes of hilarity and destruction (and hundres of thousands of particles), lag will be so huge that your game will turn into a PowerPoint slideshow, effectively making your player unhappy.
Not all programs exit quickly.
Some applications may run for hours, days or longer. Daemons may be designed to run without cease. Programs can easily consume more memory over their lifetime than available on the machine.
In addition, not all programs run in isolation. Most need to share resources with other applications.
There are a lot of reasons why you should manage your memory usage, as well as any other computer resources you use:
What might start off as a lightweight program could soon become more complex, depending on your design areas of memory consumption may grow exponentially.
Remember you are sharing memory resources with other programs. Being a good neighbour allows other processes to use the memory you free up, and helps to keep the entire system stable.
You don't know how long your program might run for. Some people hibernate their session (or never shut their computer down) and might keep your program running for years.
There are many other reasons, I suggest researching on memory allocation for more details on the do's and don'ts.
I see your point, what computers have lots of memory but you are wrong. As an engineer you have to create programs, what uses computer resources properly.
Imagine, you made program which runs all the time then computer is on. It sometimes creates some objects/variables with "new". After some time you don't need them anymore and you don't delete them. Such a situation occurs time to time and you just make some RAM out of stock. After a while user have to terminate your program and launch it again. It is not so bad but it not so comfortable, what is more, your program may be loading for a while. Because of these user feels bad of your silly decision.
Another thing. Then you use "new" to create object you call constructor and "delete" calls destructor. Lets say you need to open so file and destructor closes it and makes it accessible for other processes in this case you would steel not only memory but also files from other processes.
If you don't want to use "delete" you can use shared pointers (it has garbage collector).
It can be found in STL, std::shared_ptr, it has one disatvantage, WIN XP SP 2 and older do not support this. So if you want to create something for public you should use boost it also has boost::shared_ptr. To use boost you need to download it from here and configure your development environment to use it.

How expensive is a context switch? Is it better to implement a manual task switch than to rely on OS threads?

Imagine I have two (three, four, whatever) tasks that have to run in parallel. Now, the easy way to do this would be to create separate threads and forget about it. But on a plain old single-core CPU that would mean a lot of context switching - and we all know that context switching is big, bad, slow, and generally simply Evil. It should be avoided, right?
On that note, if I'm writing the software from ground up anyway, I could go the extra mile and implement my own task-switching. Split each task in parts, save the state inbetween, and then switch among them within a single thread. Or, if I detect that there are multiple CPU cores, I could just give each task to a separate thread and all would be well.
The second solution does have the advantage of adapting to the number of available CPU cores, but will the manual task-switch really be faster than the one in the OS core? Especially if I'm trying to make the whole thing generic with a TaskManager and an ITask, etc?
Clarification: I'm a Windows developer so I'm primarily interested in the answer for this OS, but it would be most interesting to find out about other OSes as well. When you write your answer, please state for which OS it is.
More clarification: OK, so this isn't in the context of a particular application. It's really a general question, the result on my musings about scalability. If I want my application to scale and effectively utilize future CPUs (and even different CPUs of today) I must make it multithreaded. But how many threads? If I make a constant number of threads, then the program will perform suboptimally on all CPUs which do not have the same number of cores.
Ideally the number of threads would be determined at runtime, but few are the tasks that can truly be split into arbitrary number of parts at runtime. Many tasks however can be split in a pretty large constant number of threads at design time. So, for instance, if my program could spawn 32 threads, it would already utilize all cores of up to 32-core CPUs, which is pretty far in the future yet (I think). But on a simple single-core or dual-core CPU it would mean a LOT of context switching, which would slow things down.
Thus my idea about manual task switching. This way one could make 32 "virtual" threads which would be mapped to as many real threads as is optimal, and the "context switching" would be done manually. The question just is - would the overhead of my manual "context switching" be less than that of OS context switching?
Naturally, all this applies to processes which are CPU-bound, like games. For your run-of-the-mill CRUD application this has little value. Such an application is best made with one thread (at most two).
I don't see how a manual task switch could be faster since the OS kernel is still switching other processes, including yours in out of the running state too. Seems like a premature optimization and a potentially huge waste of effort.
If the system isn't doing anything else, chances are you won't have a huge number of context switches anyway. The thread will use its timeslice, the kernel scheduler will see that nothing else needs to run and switch right back to your thread. Also the OS will make a best effort to keep from moving threads between CPUs so you benefit there with caching.
If you are really CPU bound, detect the number of CPUs and start that many threads. You should see nearly 100% CPU utilization. If not, you aren't completely CPU bound and maybe the answer is to start N + X threads. For very IO bound processes, you would be starting a (large) multiple of the CPU count (i.e. high traffic webservers run 1000+ threads).
Finally, for reference, both Windows and Linux schedulers wake up every millisecond to check if another process needs to run. So, even on an idle system you will see 1000+ context switches per second. On heavily loaded systems, I have seen over 10,000 per second per CPU without any significant issues.
The only advantage of manual switch that I can see is that you have better control of where and when the switch happens. The ideal place is of course after a unit of work has been completed so that you can trash it all together. This saves you a cache miss.
I advise not to spend your effort on this.
Single-core Windows machines are going to become extinct in the next few years, so I generally write new code with the assumption that multi-core is the common case. I'd say go with OS thread management, which will automatically take care of whatever concurrency the hardware provides, now and in the future.
I don't know what your application does, but unless you have multiple compute-bound tasks, I doubt that context switches are a significant bottleneck in most applications. If your tasks block on I/O, then you are not going to get much benefit from trying to out-do the OS.

What is the 'realtime' process priority setting for?

From what I've read in the past, you're encouraged not to change the priority of your Windows applications programmatically, and if you do, you should never change them to 'Realtime'.
What does the 'Realtime' process priority setting do, compared to 'High', and 'Above Normal'?
A realtime priority thread can never be pre-empted by timer interrupts and runs at a higher priority than any other thread in the system. As such a CPU bound realtime priority thread can totally ruin a machine.
Creating realtime priority threads requires a privilege (SeIncreaseBasePriorityPrivilege) so it can only be done by administrative users.
For Vista and beyond, one option for applications that do require that they run at realtime priorities is to use the Multimedia Class Scheduler Service (MMCSS) and let it manage your threads priority. The MMCSS will prevent your application from using too much CPU time so you don't have to worry about tanking the machine.
Simply, the "Real Time" priority class is higher than "High" priority class. I don't think there's much more to it than that. Oh yeah - you have to have the SeIncreaseBasePriorityPrivilege to put a thread into the Real Time class.
Windows will sometimes boost the priority of a thread for various reasons, but it won't boost the priority of a thread into another priority class. It also won't boost the priority of threads in the real-time priority class. So a High priority thread won't get any automatic temporary boost into the Real Time priority class.
Russinovich's "Inside Windows" chapter on how Windows handles priorities is a great resource for learning how this works:
http://web.archive.org/web/20140909124652/http://download.microsoft.com/download/5/b/3/5b38800c-ba6e-4023-9078-6e9ce2383e65/C06X1116607.pdf
Note that there's absolutely no problem with a thread having a Real-time priority on a normal Windows system - they aren't necessarily for special processes running on dedicatd machines. I imagine that multimedia drivers and/or processes might need threads with a real-time priority. However, such a thread should not require much CPU - it should be blocking most of the time in order for normal system events to get processing.
It would be the highest available priority setting, and would usually only be used on box that was dedicated to running that specific program. It's actually high enough that it could cause starvation of the keyboard and mouse threads to the extent that they become unresponsive.
So basicly, if you have to ask, don't use it :)
Real-time is the highest priority class available to a process. Therefore, it is different from 'High' in that it's one step greater, and 'Above Normal' in that it's two steps greater.
Similarly, real-time is also a thread priority level.
The process priority class raises or lowers all effective thread priorities in the process and is therefore considered the 'base priority'.
So, a process has a:
Base process priority class.
Individual thread priorities, offsets of the base priority class.
Since real-time is supposed to be reserved for applications that absolutely must pre-empt other running processes, there is a special security privilege to protect against haphazard use of it. This is defined by the security policy.
In NT6+ (Vista+), use of the Vista Multimedia Class Scheduler is the proper way to achieve real-time operations in what is not a real-time OS. It works, for the most part, though is not perfect since the OS isn't designed for real-time operations.
Microsoft considers this priority very dangerous, rightly so. No application should use it except in very specialized circumstances, and even then try to limit its use to temporary needs.
Once Windows learns a program uses higher than normal priority it seems like it limits the priority on the process.
Setting the priority from IDLE to REALTIME does NOT change the CPU usage.
I found on My multi-processor AMD CPU that if I drop one of the CPUs ot like the LAST one the CPU usage will MAX OUT and the last CPU remains idle. The processor speed increases to 75% on my Quad AMD.
Use Task Manager->select process->Right Click on the process->Select->Set Affinity
Click all but the last processor. The CPU usage will increase to the MAX on the remaining processors and Frame counts if processing video will increase.
Like all other answers before real time gives that program the utmost priority class. Nothing is processed until that program has been processed.
On my pentium 4 machine I set minecraft to real time a lot since it increases the game performance a lot, and the system seems completely stable. so realtime isn't as bad as it seems, just if you have a multi-core set a program's affinity to a specific core or cores (just not all of them, just to let everything else be able to run in case the real time set programs gets hung up) and set the priority to real time.
It basically is higher/greater in everything else. A keyboard is less of a priority than the real time process. This means the process will be taken into account faster then keyboard and if it can't handle that, then your keyboard is slowed.
Realtime process priority is used mainly to make a specific process run faster at the expense of literally everything else. I personally have my Discord bot's process priority set to realtime, since the bot is lightweight, and I need to keep it responding quickly, but it would be a bad idea to randomly change process priorities unless you know what you're doing, for example, if you were to set Google Chrome's priority to low, it wouldn't cause any problems, but if you were to set registry's priority to low, it would likely cause more than enough problems. Just don't set a heavy program to realtime unless the device it's being run on is completely dedicated to that program.

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver a better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz, I want red warnings to look up in code, I know the application doesn't perform as well as it should, I need to know where to start looking, what should worry me and where should I put my efforts. I know it's kind of a general question but I can't post the entire program and if I could choose one section of code then I wouldn't be needing to ask in the first place.
I'm using Delphi 7, although the application will be ported / remake in .NET (c#) for the next year so I'd rather hear comments that are applicable as a general practice, and if they must be specific to either one of those languages
One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: If you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to synchronize whenever the other CPU writes to the variable.
One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads then there are cores, typically means that the program is not performing optimally.
So a program which spawns loads of threads usually is not designed in the best fashion. A good example of this practice are the classic Socket examples where every incoming connection got it's own thread to handle of the connection. It is a very non scalable way to do things. The more threads there are, the more time the OS will have to use for context switching between threads.
You should first be familiar with Amdahl's law.
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, or a file that both use, a network both use or a processor that both use. You cannot avoid these dependencies on shared resources but if possible, try to avoid sharing resources.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded aplication, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then having 100% CPU for several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilitization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A read/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplifies server software development when it comes to load balancing and distributed queue processing, but developing your own load sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (Resume Time, Sleep Time + Duration). From there you can check for bad performing threads that don't execute much or are waiting too long for something to happen, you might want to make the event they are waiting for to occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. No sense to optimize your multiprocessing if the threads themselves are performing bad.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't equal performance, always.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.

Resources