Do Operating Systems slow down program execution? - performance

This question is about operating systems in general. Is there any necessary mechanism in the implementation of operating systems that impacts the flow of instructions my program sends to the CPU?
For example, if my program was set to maximum priority in the OS, would it perform exactly the same when run without an OS?

Is there any necessary mechanism in the implementation of operating systems that impacts the flow of instructions my program sends to the CPU?
Not strictly necessary mechanisms (depending on how you define "OS"); but typically there are IRQs, exceptions, and task switches.
IRQs are used by devices to ask the OS (their device driver) for attention, interrupting the flow of instructions your program sends to the CPU. The alternative is polling, which wastes a huge amount of CPU time checking whether the device needs attention when it probably doesn't. Because applications need to use devices (file IO, keyboard, video, etc.) and wasting CPU time is bad, IRQs significantly improve the performance of applications.
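To make the polling-versus-IRQ trade-off concrete, here is a minimal user-space analogy in C++ (my own illustration, not from the answer: the "device" is just another thread, and the condition variable stands in for an interrupt). Busy-polling burns a whole core while waiting; blocking until notified uses essentially no CPU until the event arrives.

    // User-space analogy of polling vs. interrupt-driven waiting.
    // A real IRQ is delivered by hardware to the kernel; here another
    // thread setting a flag plays the role of the device.
    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::atomic<bool> data_ready{false};
    std::mutex m;
    std::condition_variable cv;

    int main() {
        // Stand-in for a device that finishes its work after 100 ms.
        std::thread device([] {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            data_ready = true;
            cv.notify_one();                 // analogous to raising an IRQ
        });

        // Option A: polling -- spins and wastes CPU the whole time.
        // while (!data_ready) { /* burn cycles re-checking the flag */ }

        // Option B: "interrupt-driven" -- sleep until notified, using ~no CPU.
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return data_ready.load(); });

        std::cout << "device is ready\n";
        device.join();
    }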
Exceptions (like IRQs) also interrupt the normal flow of instructions. They occur when the normal flow of instructions can't continue, either because your program crashed or because your program needs something. The most common cause of exceptions is virtual memory (e.g. using swap space to let the application have more memory than actually exists so that the application can work at all), where the exception tells the OS that your program tried to access memory that has to be fetched from disk first. In general this also improves performance, for multiple reasons: "can't execute because there's not enough RAM" can be considered "zero performance", and various tricks reduce RAM consumption and increase the amount of RAM that can be used for things like caching files, which improves file IO speed.
Task switches are the basis of multi-tasking (e.g. being able to run more than one application at a time). If there are more tasks that want CPU time than there are CPUs, the OS (scheduler) may (depending on task priorities and scheduler design) switch between them so that all the tasks get some CPU time. However, most applications spend most of their time waiting for something to do (e.g. waiting for the user to press a key) and don't need CPU time while waiting; and if the OS is only running one task then the scheduler does nothing (no task switches, because there's no other task to switch to). In other words, if the OS supports multi-tasking but you're only running one task, it makes no difference.
Note that in some cases, IRQs and/or tasks are also used to "opportunistically" do work in the background (when hardware has nothing better to do) to improve performance (e.g. pre-fetch, pre-process and/or pre-calculate data before it's needed so that the resulting data is available instantly when it is needed).
For example, if my program was set to maximum priority in the OS, would it perform exactly the same when run without an OS?
It's best to think of it as many layers - hardware and devices (CPU, etc.) at the bottom, the kernel and device drivers on top of that, and applications on top of that. If you remove any of the layers nothing works (e.g. how can an application read and write files when there's no file system and no disk device drivers?).
If you shift all of the functionality that an OS provides into the application (e.g. a statically linked library that can make an application boot on bare metal); then if the functionality is the same the performance will be the same.
You can only improve performance by reducing functionality. For example, if you get rid of security you'll improve performance (temporarily, until your application becomes part of an attacker's botnet and performance becomes significantly worse due to all the bitcoin mining it's doing). In a similar way, you can get rid of flexibility (reboot the computer when you plug in a different USB flash stick), or fault tolerance (trash all of your data without any warning when the storage devices start failing because software assumed hardware is permanently perfect).

Related

Why don't processes have the ability to run in kernel mode?

OSes use kernel mode (privileged mode) and user mode. This seems very reasonable for security: a process can't execute just any instruction it wants; only the operating system can execute privileged ones.
On the other hand, all the context switching takes a long time. Switching between user mode and kernel mode, and trapping into the operating system, is expensive.
So why doesn't the operating system give processes the ability to run in kernel mode to improve their performance (this could be a very big improvement)?
Do real-time systems work the same way?
Thanks.
There are safety and stability reasons that disallow user-space processes from accessing kernel-space functions directly.
The kernel guarantees that no user-space process (unless it is executed with root privileges) can break the operating system. This is a vital property of modern OSes. It is also important that developing user-space apps is much simpler than developing kernel modules.
When an application needs more performance than is available in user space, it is possible to move its code (or part of it) into kernel space. E.g., network protocols and filesystems are implemented as kernel drivers mostly for performance reasons.
Real-time applications are more demanding with respect to stability. They also use system calls.
I think there is no sense in doing this.
1.) If you want something to run in kernel context, use the kernel module API; what is the problem with that?
2.) Why do you think it would multiply process speed? A switch between kernel and user space is just an additional register-state save/restore. It would run faster, but I don't think the user would even notice it.
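To get a feel for how large (or small) that user/kernel transition cost actually is, here is a rough, Linux-only benchmark sketch of my own (the iteration count is arbitrary, and syscall(SYS_getpid) is used because glibc may cache a plain getpid()): it times a trivial system call against a trivial ordinary function call.

    // Rough benchmark: trivial system call vs. trivial function call (Linux only).
    #include <chrono>
    #include <iostream>
    #include <sys/syscall.h>
    #include <unistd.h>

    // An ordinary user-space function for comparison; never crosses into the kernel.
    __attribute__((noinline)) long plain_function(long x) { return x + 1; }

    int main() {
        const long iterations = 1'000'000;
        volatile long sink = 0;   // volatile keeps the optimizer from removing the loops

        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iterations; ++i)
            sink = sink + syscall(SYS_getpid);   // forces a real user->kernel->user transition
        auto t1 = std::chrono::steady_clock::now();
        for (long i = 0; i < iterations; ++i)
            sink = sink + plain_function(i);     // stays entirely in user space
        auto t2 = std::chrono::steady_clock::now();

        using us = std::chrono::duration<double, std::micro>;
        std::cout << "syscall loop:  " << us(t1 - t0).count() / iterations << " us/call\n"
                  << "function loop: " << us(t2 - t1).count() / iterations << " us/call\n";
        return 0;
    }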

Temporarily suspend the PC operating system

How does one programmatically cause the OS to switch off, go away and stop doing anything at all so that a program may have complete control of a PC system?
I'm interested in doing this from both an MS Windows and Linux environments. Any languages or APIs considered.
I want the OS to stop preempting my program, stop its virtual memory management, stop its device drivers and interrupt service routines from running and basically just go away. Then, when my program has had its evil way with the bare metal, I want the OS to come back again without a reboot.
Is this even possible?
With Linux, you could use kexec jump to transfer control completely to another kernel (i.e., your program). Of course, with great power comes great responsibility - it is entirely up to you to service interrupts and avoid corrupting the old kernel's memory. You'll end up having to write your own OS kernel to do this. Also, the transfer of control takes quite some time, as the kernel has to de-initialize all hardware, then reinitialize it when it's time to resume. Since kexec jump was originally designed for hibernation support, this isn't a problem in its original context, but depending on what you're doing, it might be a problem.
You may want to consider instead working within the framework given to you by the OS - just write a normal driver for whatever you're doing.
Finally, one more option would be using the Linux Real-Time patchset. This lets you assign static priorities to everything, even interrupt handlers; by running a process with a higher priority than anything else, you can suspend nearly everything - the system will still service a small stub for interrupts, as well as certain interrupts that can't be deferred, like timer interrupts, but for the most part the heavy work will be deferred until you relinquish control of the CPU.
Note that the RT patchset won't stop virtual memory and the like - mlockall will prevent page faults on valid pages though, if that's enough for you.
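For completeness, here is a minimal sketch of the user-space side of that approach on Linux (with or without the RT patchset; the priority value 80 and the decision to lock all memory are my own example choices): give the process a SCHED_FIFO real-time priority and mlockall its pages. Under the RT patchset, interrupt handlers themselves typically run as schedulable threads, so a user task at a higher priority preempts most of them, which is the "suspend nearly everything" effect described above.

    // Sketch: run the calling process at real-time priority and lock its memory.
    // Requires root, or CAP_SYS_NICE plus suitable RLIMIT_RTPRIO / RLIMIT_MEMLOCK.
    #include <cstdio>
    #include <sched.h>
    #include <sys/mman.h>

    int main() {
        sched_param param{};
        param.sched_priority = 80;   // arbitrary high RT priority (valid range 1..99)

        // SCHED_FIFO: run until we block or yield; ordinary tasks cannot preempt us.
        if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
            std::perror("sched_setscheduler");
            return 1;
        }

        // Lock current and future pages into RAM to avoid page faults at runtime.
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            std::perror("mlockall");
            return 1;
        }

        // ... time-critical work goes here ...
        std::puts("running with SCHED_FIFO priority and locked memory");
        return 0;
    }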
Also, keep in mind that whatever you do, the system BIOS can still cause SMM traps, which cannot be disabled, except by motherboard-model-specific methods.
There are lots of really ugly ways to do this. You could modify the running kernel by writing some trampoline code to /dev/kmem that passes control to your application. But I wouldn't recommend attempting something like that!
Basically, you would need to have your application act as its own operating system. If you want to read data from a file, you would have to figure out where the data lives on disk, and generate your own SCSI requests to talk to the disk drive. You would have to implement your own interrupt handler to get notified when the data is ready. Likewise you would have to handle page faults, memory allocation, etc. Most users feel that this isn't worth the effort...
Why do you want to do this?
Is there something that your application needs to do that the OS won't let it do? Are you concerned with the OS impact on performance? Something else?
If you don't mind shelling out some cash, you could use IntervalZero's RTX to do this for a Windows system. It's a hard realtime subsystem that gets installed on a Windows box as sort of a hack into the HAL and takes over the machine, letting Windows have whatever CPU cycles are left over.
It has its own scheduler and device drivers, but if you run your program at the top RTX priority, don't install any RTX device drivers (or disable interrupts for the duration), then nothing will interrupt it.
It also supports a small amount of interaction with programs on the Windows side.
We use it as a nice way to get a hard realtime box that runs Windows.
coLinux loads CoLinuxDriver into the NT kernel or a colinux.ko into the Linux kernel. It does exactly what you asked – it "unschedules" the host OS, and runs its own code, with its own memory management, interrupts, etc. Then, when it's done, it "reschedules" the host OS, allowing it to continue from where it left off. coLinux uses this to run a modified Linux kernel parallel to the host OS.
Unlike more common virtualization techniques, there are no barriers between coLinux and the bare metal hardware at all. However, hardware and the host OS tend to get confused if the coLinux guest touches anything without restoring it before returning to the host OS.
Not really. Operating Systems are a foundation, and your program runs on top of them. The OS handles memory access, disk writing operations, communications, etc. when your application makes requests, and asking the OS to move out of the way would mean that your program would have to do the OS's job instead.
Not as such, no.
What you want is basically an application that becomes an OS; a severely stripped down Linux kernel coupled with some highly customized and minimized tools might be the way to go for this.
If you were devious and wanted to avoid a lot of the operating system's housekeeping, you could probably hook yourself into a driver routine. Thinking out loud here, verging on hacking. Google how to write rootkits.
Yeah dude, you can totally do that, you can also write a program to tell my bank to give you all my money and send you a hot Russian.

How to control the CPU usage of an app on OS X?

I'm running an application right now which seems to be running at full throttle, but even though the fan seems to be spinning at its max and Activity Monitor reports that the application is using 100% of the processor, I suspect that at most it is using 100% of only one of the two cores on my machine.
How can I tell OS X to allow an application to use 100%, or as much as the OS can allow, of the processing power of my computer? I have tried some terminal commands like "nice" and "renice" to set the priority of this process, but I still can't get it to run at full throttle.
I would also like to know how to do the opposite: set a limit on the processor usage of an app, for example setting app X to run at 20%.
Is this possible to do without modifying the code of the app?
The answer to this depends upon whether your application is multi-threaded or not. If this is a single-threaded application (which it is, unless you have specifically made it multi-threaded), then the process will run on one core of your multi-core hardware. There is nothing you can do about this; it's a function of the underlying operating system.
If your program is multi-threaded then it is possible to have different threads executing on separate cores. This will increase the overall usage of the process and allow figures greater than 100%.
You cannot, however, force the machine to use 'all' of the processing power available, but you can influence it with nice.
In order to reduce the amount of processor used, you can use nice to lower the priority of the process. If you are root, you can also use nice to increase the priority of your process.
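If you want to do from code what nice and renice do from the terminal, a minimal sketch using the standard setpriority call looks like this (works on OS X and other Unix-likes; the value 10 is just an example, and negative values, which raise priority, still require root):

    // Sketch: read and lower this process's scheduling priority (same effect as nice/renice).
    #include <cerrno>
    #include <cstdio>
    #include <sys/resource.h>

    int main() {
        errno = 0;
        int current = getpriority(PRIO_PROCESS, 0);   // 0 = the calling process
        if (errno != 0) { std::perror("getpriority"); return 1; }
        std::printf("current nice value: %d\n", current);

        // Positive values lower priority; negative values (raising it) need root.
        if (setpriority(PRIO_PROCESS, 0, 10) != 0) {
            std::perror("setpriority");
            return 1;
        }
        std::printf("new nice value: %d\n", getpriority(PRIO_PROCESS, 0));
        return 0;
    }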

Is there any way of throttling CPU/Memory of a process?

Problem: I have a developers machine (read: fast, lots of memory), but the user has a users machine (read: slow, not very much memory).
I can simulate a slow network using Fiddler (http://www.fiddler2.com/fiddler2/)
I can look at how CPU is used over time for a process using Process Explorer (http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx).
Is there any way I can restrict the amount of CPU a process can have, or the amount of memory a process can have in order to simulate a users machine more effectively? (In order to isolate performance problems for instance)
I suppose I could use a VM, but I'm looking for something a bit lighter.
I'm using Windows XP, but a solution for any Windows machine would be welcome. Thanks.
The Platform SDK used to come with stress tools for doing just this back in the good old days (STRESS.EXE and CPUSTRESS.EXE in the SDK), but they might still be there (check your Platform SDK and/or Visual Studio installation for these two files -- unfortunately I have neither the PSDK nor VS installed on the machine I'm typing from).
Other tools:
memory: performance & reliability (e.g. handling failed memory allocation): can use EatMem
CPU: performance & reliability (e.g. race conditions): can use CPU Burn, Prime95, etc
handles (GDI, User): reliability (e.g. handling failed GDI resource allocation): ??? may have to write your own, but running out of GDI handles (buggy GTK apps would usually eat them all away until all other apps on the system would start falling dead like flies) is a real test for any Windows app
disk: performance & reliability (e.g. handling disk full): DiskFiller, etc.
AppVerifier has a low-resource simulation feature.
You could also try setting the priority of your process to be very low.
You can run MemAlloc to chew up RAM, possibly a few copies at once.
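If you can't track down one of those tools, a memory-eater is trivial to write yourself. A minimal sketch (the 512 MiB figure is just an example) that allocates a block, touches every page so the memory is actually committed, and then sits on it:

    // Sketch: hold a chunk of RAM to simulate a low-memory machine.
    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t megabytes = 512;             // example size; adjust as needed
        std::vector<char> hog(megabytes * 1024 * 1024);

        // Touch one byte per 4 KiB page so the memory is really committed,
        // not just reserved as untouched address space.
        for (std::size_t i = 0; i < hog.size(); i += 4096)
            hog[i] = 1;

        std::printf("holding %zu MiB; press Ctrl+C to release\n", megabytes);
        while (true)
            std::this_thread::sleep_for(std::chrono::hours(1));
    }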
I found a related question:
Set Windows process (or user) memory limit
The accepted answer for that question has a link to the Windows API's SetProcessWorkingSetSize, so it's not exactly a tool, but rather an API that can limit the amount of memory that a process can use.
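For reference, a minimal sketch of calling that API on the current process (the 16/32 MiB bounds are arbitrary example values). Note that this caps the working set, i.e. the physical RAM the process keeps resident; by default it is a soft limit, and memory above it is pushed out to the pagefile rather than refused:

    // Sketch: cap the current process's working set (physical RAM residency).
    #include <cstdio>
    #include <windows.h>

    int main() {
        const SIZE_T minWorkingSet = 16 * 1024 * 1024;   // 16 MiB (example value)
        const SIZE_T maxWorkingSet = 32 * 1024 * 1024;   // 32 MiB (example value)

        if (!SetProcessWorkingSetSize(GetCurrentProcess(), minWorkingSet, maxWorkingSet)) {
            std::printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
            return 1;
        }
        std::printf("working set limited to %llu-%llu bytes\n",
                    (unsigned long long)minWorkingSet, (unsigned long long)maxWorkingSet);
        return 0;
    }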
In terms of changing the amount of CPU resources a process can use, if you don't mind the granularity of per-core limiting of resources, Task Manager can change the processor affinity of a process.
In Task Manager, right-click a process and select "Set Affinity...", then select the processor cores that the process can be assigned to.
If the development machine has many cores but the user machine only has one, then, rather than allowing the process to run on all the available cores, set the process' processor affinity to only one core.
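The same restriction can be applied programmatically with SetProcessAffinityMask, which changes the same per-process mask that Task Manager's "Set Affinity..." dialog edits. A minimal sketch pinning the current process to core 0:

    // Sketch: restrict the current process to CPU core 0 only.
    #include <cstdio>
    #include <windows.h>

    int main() {
        const DWORD_PTR coreZeroOnly = 1;   // bit 0 set => only the first logical CPU

        if (!SetProcessAffinityMask(GetCurrentProcess(), coreZeroOnly)) {
            std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }
        std::printf("process pinned to core 0\n");
        // ... run the workload under test here ...
        return 0;
    }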
It has nothing to do with SetProcessWorkingSetSize.
Just use internal Win32 kernel APIs to restrict CPU usage.

How to obtain good concurrent read performance from disk

I'd like to ask a question then follow it up with my own answer, but also see what answers other people have.
We have two large files which we'd like to read from two separate threads concurrently. One thread will sequentially read fileA while the other thread will sequentially read fileB. There is no locking or communication between the threads, both are sequentially reading as fast as they can, and both are immediately discarding the data they read.
Our experience with this setup on Windows is very poor. The combined throughput of the two threads is in the order of 2-3 MiB/sec. The drive seems to be spending most of its time seeking backwards and forwards between the two files, presumably reading very little after each seek.
If we disable one of the threads and temporarily look at the performance of a single thread then we get much better bandwidth (~45 MiB/sec for this machine). So clearly the bad two-thread performance is an artefact of the OS disk scheduler.
Is there anything we can do to improve the concurrent thread read performance? Perhaps by using different APIs or by tweaking the OS disk scheduler parameters in some way.
Some details:
The files are in the order of 2 GiB each on a machine with 2GiB of RAM. For the purpose of this question we consider them not to be cached and perfectly defragmented. We have used defrag tools and rebooted to ensure this is the case.
We are using no special APIs to read these files. The behaviour is repeatable across various bog-standard APIs such as Win32's CreateFile, C's fopen, C++'s std::ifstream, Java's FileInputStream, etc.
Each thread spins in a loop making calls to the read function. We have varied the number of bytes requested from the API each iteration from values between 1KiB up to 128MiB. Varying this has had no effect, so clearly the amount the OS is physically reading after each disk seek is not dictated by this number. This is exactly what should be expected.
The dramatic difference between one-thread and two-thread performance is repeatable across Windows 2000, Windows XP (32-bit and 64-bit), Windows Server 2003, and also with and without hardware RAID5.
The problem seems to be in Windows' I/O scheduling policy. According to what I found here, there are many ways for an OS to schedule disk requests. While Linux and others can choose between different policies, before Vista, Windows was locked into a single policy: a FIFO queue, where all requests were split into 64 KB blocks. I believe that this policy is the cause of the problem you are experiencing: the scheduler mixes requests from the two threads, causing continuous seeking between different areas of the disk.
Now, the good news is that, according to here and here, Vista introduced a smarter disk scheduler, where you can set the priority of your requests and also allocate a minimum bandwidth for your process.
The bad news is that I found no way to change the disk policy or buffer sizes in previous versions of Windows. Also, even if raising the disk I/O priority of your process will boost its performance relative to other processes, you still have the problem of your threads competing against each other.
What I can suggest is to modify your software by introducing a self-made disk access policy.
For example, you could use a policy like this in your thread B (similar for Thread A):
If thread A is reading from disk, then wait for thread A to stop reading, or wait for X ms
Read for X ms (or Y MB)
Stop reading and check the status of thread A again
You could use semaphores for status checking or you could use perfmon counters to get the status of the actual disk queue.
The values of X and/or Y could also be auto-tuned by checking the actual transfer rates and slowly modifying them, thus maximizing the throughput when the application runs on different machines and/or OSes. You may find that cache, memory, or RAID level affects them one way or the other, but with auto-tuning you will always get the best performance in every scenario.
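Here is a minimal sketch of that idea under my own simplifying assumptions (a plain mutex instead of semaphores or perfmon counters, a fixed 100 ms slice instead of auto-tuned X/Y values, and hypothetical file names): each thread takes the lock, reads contiguously for one slice, then releases the lock so the other thread gets its turn at the disk.

    // Sketch: two sequential readers taking turns at the disk in timed slices.
    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex disk_turn;   // whoever holds this "owns" the disk for one slice

    void read_in_slices(const char* path) {
        std::ifstream file(path, std::ios::binary);
        std::vector<char> buffer(1 << 20);                    // 1 MiB per read call
        const auto slice = std::chrono::milliseconds(100);    // fixed slice; could be auto-tuned

        while (file) {
            std::lock_guard<std::mutex> lock(disk_turn);      // wait for our turn
            auto deadline = std::chrono::steady_clock::now() + slice;
            // Read contiguously for one slice, then give the other thread its turn.
            while (file && std::chrono::steady_clock::now() < deadline)
                file.read(buffer.data(), buffer.size());
        }
        std::printf("%s done\n", path);
    }

    int main() {
        std::thread a(read_in_slices, "fileA.bin");   // hypothetical file names
        std::thread b(read_in_slices, "fileB.bin");
        a.join();
        b.join();
    }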
I'd like to add some further notes in my response. All other non-Microsoft operating systems we have tested do not suffer from this problem. Linux, FreeBSD, and Mac OS X (this last one on different hardware) all degrade much more gracefully in terms of aggregate bandwidth when moving from one thread to two. Linux, for example, degraded from ~45 MiB/sec to ~42 MiB/sec. These other operating systems must be reading larger chunks of the file between each seek, and therefore not spending nearly all their time waiting on the disk to seek.
Our solution for Windows is to pass the FILE_FLAG_NO_BUFFERING flag to CreateFile and use large (~16 MiB) reads in each call to ReadFile (a sketch follows the list below). This is suboptimal for several reasons:
Files don't get cached when read like this, so there are none of the advantages that caching normally gives.
The constraints when working with this flag are much more complicated than normal reading (alignment of read buffers to page boundaries, etc).
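A minimal sketch of that unbuffered approach (hypothetical file name, and skipping the sector-size bookkeeping a production version would need; VirtualAlloc is used here simply because its page-aligned allocations satisfy the alignment requirement):

    // Sketch: unbuffered sequential read with large aligned buffers (Win32).
    #include <cstdio>
    #include <windows.h>

    int main() {
        const DWORD chunkSize = 16 * 1024 * 1024;   // ~16 MiB per ReadFile call

        // FILE_FLAG_NO_BUFFERING bypasses the system cache; buffer, offset and
        // read size must then be multiples of the volume's sector size.
        HANDLE file = CreateFileA("fileA.bin",      // hypothetical file name
                                  GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING,
                                  FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                                  nullptr);
        if (file == INVALID_HANDLE_VALUE) {
            std::printf("CreateFile failed: %lu\n", GetLastError());
            return 1;
        }

        // VirtualAlloc returns page-aligned memory, which satisfies sector alignment.
        void* buffer = VirtualAlloc(nullptr, chunkSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (!buffer) { CloseHandle(file); return 1; }

        DWORD bytesRead = 0;
        unsigned long long total = 0;
        while (ReadFile(file, buffer, chunkSize, &bytesRead, nullptr) && bytesRead > 0)
            total += bytesRead;   // data is discarded, as in the original test

        std::printf("read %llu bytes\n", total);
        VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(file);
        return 0;
    }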
(As a final remark: does this explain why swapping under Windows is so hellish? I.e., Windows is incapable of doing I/O to multiple files concurrently with any efficiency, so while swapping, all other I/O operations are forced to be disproportionately slow.)
Edit to add some further details for Will Dean:
Of course across these different hardware configurations the raw figures did change (sometimes substantially). The problem however is the consistent degradation in performance that only Windows suffers when moving from one thread to two. Here is a summary of the machines tested:
Several Dell workstations (Intel Xeon) of various ages running Windows 2000, Windows XP (32-bit), and Windows XP (64-bit) with single drive.
A Dell 1U server (Intel Xeon) running Windows Server 2003 (64-bit) with RAID 1+0.
An HP workstation (AMD Opteron) with Windows XP (64-bit), and Windows Server 2003, and hardware RAID 5.
My home unbranded PC (AMD Athlon64) running Windows XP (32-bit), FreeBSD (64-bit), and Linux (64-bit) with single drive.
My home MacBook (Intel Core1) running Mac OS X, single SATA drive.
My home Koolu PC running Linux. Vastly underpowered compared to the other systems but I demonstrated that even this machine can outperform a Windows server with RAID5 when doing multi-threaded disk reads.
CPU usage on all of these systems was very low during the tests and anti-virus was disabled.
I forgot to mention before but we also tried the normal Win32 CreateFile API with the FILE_FLAG_SEQUENTIAL_SCAN flag set. This flag didn't fix the problem.
It does seem a little strange that you see no difference across quite a wide range of Windows versions, and nothing between a single drive and hardware RAID 5.
It's only 'gut feel', but that does make me doubtful that this is really a simple seeking problem. Other than the OS X and RAID 5 machines, was all this tried on the same machine? Have you tried another machine? Is your CPU usage basically zero during this test?
What's the shortest app you can write which demonstrates this problem? - I would be interested to try it here.
I would create some kind of in-memory, thread-safe lock. Each thread could wait on the lock until it was free. When the lock becomes free, take the lock and read the file for a defined length of time or a defined amount of data, then release the lock for any other waiting threads.
Do you use I/O completion ports under Windows? Windows via C++ has an in-depth chapter on this subject and, as luck would have it, it is also available on MSDN.
Paul - saw the update. Very interesting.
It would be interesting to try it on Vista or Win2008, as people seem to be reporting some considerable I/O improvements on these in some circumstances.
My only suggestion about a different API would be to try memory-mapping the files - have you tried that? Unfortunately, at 2 GB per file, you're not going to be able to map multiple whole files on a 32-bit machine, which means this isn't quite as trivial as it might be.
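For what it's worth, a minimal sketch of that memory-mapping suggestion on Windows (hypothetical file name; it maps one modest fixed-size view at a time, which is roughly what the 32-bit address-space limit forces you into with 2 GB files):

    // Sketch: read a file sequentially through a sliding memory-mapped view (Win32).
    #include <cstdio>
    #include <windows.h>

    int main() {
        HANDLE file = CreateFileA("fileA.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        LARGE_INTEGER fileSize{};
        GetFileSizeEx(file, &fileSize);

        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (!mapping) { CloseHandle(file); return 1; }

        const ULONGLONG viewSize = 64ULL * 1024 * 1024;   // 64 MiB view at a time (example)
        unsigned long long checksum = 0;

        for (ULONGLONG offset = 0; offset < (ULONGLONG)fileSize.QuadPart; offset += viewSize) {
            ULONGLONG remaining = (ULONGLONG)fileSize.QuadPart - offset;
            SIZE_T thisView = (SIZE_T)(remaining < viewSize ? remaining : viewSize);
            // Map one window of the file; the offset must be a multiple of the
            // system allocation granularity (64 KiB), which our view size is.
            const unsigned char* view = (const unsigned char*)MapViewOfFile(
                mapping, FILE_MAP_READ,
                (DWORD)(offset >> 32), (DWORD)(offset & 0xFFFFFFFF), thisView);
            if (!view) break;
            for (SIZE_T i = 0; i < thisView; ++i)   // touch every byte to force it in
                checksum += view[i];
            UnmapViewOfFile(view);
        }

        std::printf("checksum %llu\n", checksum);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }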
