qemu vs qemu-kvm: some performance measurements

I ran the following benchmark under both qemu and qemu-kvm, with this configuration:
CPU: AMD 4400 dual-core processor with SVM enabled, 2 GB RAM
Host OS: openSUSE 11.3 with the latest patches, running KDE 4
Guest OS: FreeDOS
Emulated memory: 256 MB
Network: none
Language: Turbo C 2.0
Benchmark program: count from 0000000 to 9999999 and display the counter on the screen
by writing directly to screen memory (i.e. 0xB800:xxxx).
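The inner loop looks roughly like this (a simplified sketch; the exact digit layout and the attribute byte are assumptions on my part):

/* Sketch of the benchmark: count 0000000..9999999 and write the digits
   straight into colour text-mode video memory at 0xB800:0000.
   Turbo C 2.0, real mode, far pointer. */
#include <dos.h>

int main(void)
{
    /* each screen cell is a 16-bit word: low byte = character, high byte = attribute */
    unsigned int far *video = (unsigned int far *) MK_FP(0xB800, 0);
    long n, v;
    int i;

    for (n = 0L; n <= 9999999L; n++) {
        v = n;
        for (i = 6; i >= 0; i--) {   /* least significant digit in the rightmost cell */
            video[i] = 0x0700 | (unsigned int)('0' + (int)(v % 10));
            v /= 10;
        }
    }
    return 0;
}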
It takes only 6 seconds when running under qemu,
but 89 seconds when running under qemu-kvm.
I ran the benchmarks one at a time, not in parallel.
I have scratched my head all night and still have no idea why this happens. Could somebody give me some hints?

KVM uses QEMU as its device emulator: every device operation is emulated by the user-space QEMU process. When you write to 0xB8000 you are operating the emulated graphics display, which means the guest CPU does a `vmexit` from guest mode back into the KVM kernel module, which in turn sends a device-emulation request to the user-space QEMU backend.
In contrast, QEMU without KVM does all the work in a single process (apart from ordinary system calls), so there are far fewer CPU context switches. Moreover, your benchmark code is a simple loop whose code blocks need to be translated only once; that costs almost nothing compared with the vmexit and kernel/user communication on every iteration in the KVM case.
This is the most probable cause.

Your benchmark is I/O-intensive, and the emulated I/O devices are actually the same for qemu and qemu-kvm (in QEMU's source code they live under hw/).
That explains why qemu-kvm is not necessarily faster than qemu here, but it does not explain the slowdown itself. I have the following explanation, which I think is correct to a large extent.
"qemu-kvm uses the kvm module in the Linux kernel, which runs the guest in x86 guest mode and traps on every privileged instruction. qemu, by contrast, uses the very efficient TCG, which translates each instruction the first time it sees it. I think the high cost of those traps is what shows up in your benchmark." This is not true for all I/O patterns, though. An Apache benchmark would run better on qemu-kvm, because the library does the buffering and uses the fewest possible privileged instructions to do the I/O (see the sketch below).

The reason is that too many VMEXITs take place.

Related

How to check IRQ latency in Linux (X86_64) for performance tuning?

Is there a way to check the interrupt processing latency in the Linux kernel?
Or is there a way to check why CPU usage is only 40% in a specific configuration of Linux 4.19.138?
Background:
I have run into a problem: I have an x86 server running either a third-party Linux 4.19.138 kernel (whose configuration file is about 6000 lines long) or Ubuntu 20.04 x86_64 (whose configuration file is about 9500 lines long).
When running a netperf test against this server, I found that with the third-party 4.19.138 kernel the netperf I/O latency is worse than with Ubuntu 20.04. CPU usage is below 40% with the third-party kernel, while it is about 100% with Ubuntu 20.04.
Both use the same kernel command line and the same runtime performance profile.
It seems that interrupts or the netserver process on the server are being throttled under 4.19.138.
I then rebuilt the Ubuntu 20.04 kernel using the short (6000-line) configuration file and got similarly bad results.
So I concluded that the kernel configuration makes the difference.
Before comparing the two configurations line by line (6000 vs 9500 lines), to narrow things down: is there a way to check why CPU usage is only 40% with that 4.19.138 configuration? Or is there a way to check interrupt-processing latency in the Linux kernel?
I finally found the reason: net.core.busy_read and net.core.busy_poll were both set to 0.
That means socket busy polling was disabled, which hurts the netperf latency (the per-socket equivalent is sketched after this answer).
But the question then becomes: the lower CPU usage is a sign that something is different between the two kernels, so what tools or methods should we use to figure out what causes the CPU-usage difference?
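For reference, the same busy-polling behaviour can be requested per socket through the SO_BUSY_POLL option, which mirrors the net.core.busy_poll sysctl; a minimal sketch (the 50-microsecond budget is just an illustrative value):

/* Enable busy polling on one socket via SO_BUSY_POLL (Linux >= 3.11).
   Setting it may require CAP_NET_ADMIN on older kernels. */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int busy_usec = 50;   /* busy-poll budget in microseconds (illustrative) */

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usec, sizeof(busy_usec)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");
    else
        printf("busy polling enabled on this socket (%d us)\n", busy_usec);

    close(fd);
    return 0;
}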

Low-end hardware simulation for performance profiling

I need to optimize the app I'm working on, and I can't get reliable profiling data on my development machine. The app should run on low-end ARM hardware under QNX, but for logistical reasons I don't have access to the final hardware for profiling.
I've tried profiling on my development machine, but as you can imagine everything is so fast that I can't pinpoint the slow parts. I've created a Linux virtual machine with reduced memory and CPU core count, but it is still far too fast compared to the final hardware.
Is it possible to reduce the CPU clock speed/RAM speed/disk speed in a virtual machine to simulate low-performance hardware, or is there any other way to get relevant profiling data on my development machine?
Considering the app processes several gigabytes of data, I assume disk access is a major bottleneck, so limiting disk speed might help.
I can use any tool or approach (open source or commercially available) that runs on Windows/Linux/macOS, on a real or virtual machine.
This URL describes how to limit disk bandwidth on VirtualBox images. You could run a Linux VM on VirtualBox, use this method to limit disk access speed, turn off disk caching using the suggestions from this answer, and profile your application. Alternatively, you can download QNX SDP, which comes with a prebuilt x86_64 virtual machine image that can be run under VMware/VirtualBox/qemu.
My previous experience with QNX on armv7 and x86_64 suggests that the devb-sdmmc driver can be a bottleneck when reading a lot of big files from flash storage. devb-sdmmc and io-blk often need fine-tuning; choosing appropriate cache, block and read-ahead sizes and other parameters helps improve disk access performance.

Detecting HyperThreading without CPUID?

I'm working on a number-crunching application and I'm trying to squeeze all possible performance out of it. I'm designing it to work on both Windows and *nix, including multi-CPU machines.
The way it is currently set up, it asks the OS how many cores there are, sets affinity to each core in turn for a function that runs the CPUID instruction in assembly (yes, it gets run multiple times on the same CPU; no big deal, it's just initialization code), and checks for HyperThreading in the CPUID feature flags. From the CPUID responses it calculates how many threads it should run; of course, if a core supports HyperThreading it will spawn two threads on that core.
However, I ran into an edge case with my own machine. I run an HP laptop with a Core 2 Duo. A while back I replaced the factory processor with a better Core 2 Duo that supports HyperThreading, but the BIOS does not support it, since the factory processor didn't. So even though the CPU reports that it has HyperThreading, it cannot actually use it.
I'm aware that on Windows you can detect HyperThreading by simply counting the logical cores (as each physical HyperThreading-enabled core is split into two logical cores). However, I'm not sure whether such a thing is available on *nix (particularly Linux, my test bed).
If HyperThreading is enabled on a dual-core processor, will the Linux function sysconf(_SC_NPROCESSORS_CONF) report four processors or just two? (A small test program is sketched below.)
If I can get a reliable count on both systems then I can simply skip the CPUID-based HyperThreading check (after all, it may be disabled or unavailable in the BIOS) and use what the OS reports, but because of my edge case I'm unable to determine this.
P.S.: In the Windows section of the code I am parsing the return of GetLogicalProcessorInformation().
Bonus points: does anybody know how to mod a BIOS so I can actually HyperThread my CPU ;)? The motherboard is an HP 578129-001 with the AMD M96 chipset (yuck).
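For the sysconf part of the question, a small sketch like the following prints the logical-CPU counts Linux reports (the output format is mine); with HyperThreading active a dual-core normally shows 4, and 2 when it is disabled in the BIOS:

/* Print the logical-CPU counts exposed by sysconf() on Linux. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long configured = sysconf(_SC_NPROCESSORS_CONF);  /* logical CPUs the kernel knows about */
    long online     = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs currently online */

    printf("configured: %ld, online: %ld\n", configured, online);
    return 0;
}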

How can I override the CUDA kernel execution time limit on Windows with a secondary GPU?

NVIDIA's website explains the time-out problem:
Q: What is the maximum kernel execution time? On Windows, individual
GPU program launches have a maximum run time of around 5 seconds.
Exceeding this time limit usually will cause a launch failure reported
through the CUDA driver or the CUDA runtime, but in some cases can
hang the entire machine, requiring a hard reset. This is caused by
the Windows "watchdog" timer that causes programs using the primary
graphics adapter to time out if they run longer than the maximum
allowed time.
For this reason it is recommended that CUDA is run on a GPU that is
NOT attached to a display and does not have the Windows desktop
extended onto it. In this case, the system must contain at least one
NVIDIA GPU that serves as the primary graphics adapter.
Source: https://developer.nvidia.com/cuda-faq
So it seems that NVIDIA believes, or at least strongly implies, that having multiple (NVIDIA) GPUs, with the proper configuration, can prevent this from happening?
But how? So far I have tried lots of things: the GK110 GPU is (1) plugged into the secondary PCIe x16 slot, (2) not connected to any monitor, and (3) set as a dedicated PhysX card in the driver control panel (as recommended by some other people), but the annoying time-out is still there.
If your GK110 is a Tesla K20c GPU, then you should switch the device from WDDM mode to TCC mode. This can be done with the nvidia-smi.exe tool that is installed with the driver. Use the Windows search function to find the file (nvidia-smi.exe), then use the command-line help (`nvidia-smi --help`) to discover the commands needed to switch a GPU from WDDM to TCC mode.
Once you have done this, the Windows watchdog mechanism will no longer pay attention to your GK110 device.
If, on the other hand, it is a GeForce GPU, there is no way to switch it to TCC mode. Your only option is to modify the registry settings, which is somewhat difficult. Your mileage may vary, as the exact structure of the registry keys varies by OS.
If a GPU is in WDDM mode, it is subject to the watchdog timer.
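One way to confirm whether the watchdog still applies to a given device is to query its kernelExecTimeoutEnabled property through the CUDA runtime; a minimal sketch (the file name and build line are assumptions):

/* List CUDA devices and report whether the runtime says a kernel execution
   timeout (display watchdog) applies to each of them.
   Build, for example: nvcc check_timeout.cu -o check_timeout */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA devices found\n");
        return 1;
    }

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d (%s): kernelExecTimeoutEnabled = %d\n",
               i, prop.name, prop.kernelExecTimeoutEnabled);
    }
    return 0;
}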

Emulating a processor's (limited) resources, including clock speed

I would like a software environment in which I can test the speed of my software on hardware with specific resources. For example, how fast does this program run on an 800 MHz x86 with 24 MB of RAM, when my host hardware is a 3 GHz quad-core amd64 with 12 GB of RAM? Emulators such as qemu make a great point of running "almost as fast" as the underlying hardware; I would like to make them run slower. Is there a way to do that?
I have never tried it, but perhaps you could achieve what you want to some extent by combining an emulator like QEMU or VirtualBox on Linux with something like this:
http://cpulimit.sourceforge.net/
If you can limit the CPU time available to the emulator you might be able to simulate the results of execution on a slower computer. Keep in mind, though, that this would only affect the execution speed (or so I hope, anyway).
The CPU instruction set and other system features would remain unchanged. This means that emulating a specific processor accurately would be difficult if not impossible.
In addition, something like cpulimit, which works by sending SIGSTOP and SIGCONT to repeatedly stop and restart the emulator process, might cause side effects such as timing inconsistencies, video display artifacts, and so on.
In your emulator, keep a virtual "clock" and increment it appropriately as you execute each instruction. From there you can simply report how long the run took in virtual time, or you can have your emulator sleep now and again to keep its execution speed roughly where it would be on the target.
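A rough sketch of such a virtual-clock throttle (the 800 MHz target matches the example in the question; all names are made up):

/* Account a cycle count per batch of emulated instructions and sleep whenever
   emulation runs ahead of the target clock. C99, POSIX clock_gettime(). */
#include <stdint.h>
#include <time.h>

#define TARGET_HZ 800000000ULL          /* emulated CPU frequency: 800 MHz */

static uint64_t virtual_cycles;         /* incremented as instructions are emulated */
static uint64_t start_ns;               /* wall-clock time when emulation started */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

void throttle_init(void)
{
    start_ns = now_ns();
}

/* Call from the instruction loop, e.g. every few thousand emulated cycles. */
void throttle(uint64_t cycles_just_executed)
{
    virtual_cycles += cycles_just_executed;

    /* How much wall time should have passed at TARGET_HZ (split to avoid overflow). */
    uint64_t target_ns = (virtual_cycles / TARGET_HZ) * 1000000000ULL
                       + (virtual_cycles % TARGET_HZ) * 1000000000ULL / TARGET_HZ;
    uint64_t real_ns   = now_ns() - start_ns;

    if (target_ns > real_ns) {          /* running too fast: sleep off the difference */
        uint64_t delta = target_ns - real_ns;
        struct timespec pause = {
            .tv_sec  = (time_t)(delta / 1000000000ULL),
            .tv_nsec = (long)(delta % 1000000000ULL)
        };
        nanosleep(&pause, NULL);
    }
}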

Resources