I would like a software environment in which I can test the speed of my software on hardware with specific resources. For example, how fast does this program run on an 800MHz x86 with 24 MB of RAM, when my host hardware is a 3GHz quad-core amd64 with 12GB of RAM? Emulators such as QEMU make a great point of running "almost as fast" as the underlying hardware; I would like to make it run slower. Is there a way to do that?
I have never tried it, but perhaps you could achieve what you want to some extent by combining an emulator like QEMU or VirtualBox on Linux with something like this:
http://cpulimit.sourceforge.net/
If you can limit the CPU time available to the emulator you might be able to simulate the results of execution on a slower computer. Keep in mind, though, that this would only affect the execution speed (or so I hope, anyway).
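For example (untested, and assuming the emulator shows up as an ordinary process), something like `cpulimit -p <qemu-pid> -l 25`, or `cpulimit -e qemu-system-i386 -l 25` to target it by executable name, should cap the emulator at roughly a quarter of one host core.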
The CPU instruction set and other system features would remain unchanged. This means that emulating a specific processor accurately would be difficult if not impossible.
In addition, using something like cpulimit, which works by sending SIGSTOP and SIGCONT to repeatedly stop and restart the emulator process, might cause side effects such as timing inconsistencies, video display artifacts, etc.
In your emulator, keep a virtual "clock" and increment it appropriately as you execute each instruction. From there you can simply report how long execution took in virtual time, or you can have your emulator sleep now and again to keep execution speed roughly where it would be on the target.
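A minimal sketch of that idea, assuming an interpreter-style main loop (`execute_next_instruction()` is a hypothetical stand-in for the emulator's real dispatcher):

```c
/*
 * Minimal sketch of the "virtual clock" idea for an interpreter-style emulator.
 * execute_next_instruction() is a hypothetical stand-in for the real dispatcher;
 * it returns how many guest cycles the instruction cost.
 */
#include <stdint.h>
#include <time.h>

#define TARGET_HZ   800000000ULL        /* pretend the guest CPU runs at 800 MHz      */
#define SYNC_CYCLES (TARGET_HZ / 100)   /* re-sync with the wall clock every 10 ms of guest time */

static uint64_t execute_next_instruction(void) { return 1; }  /* stub */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

void run_throttled(void)
{
    uint64_t virtual_cycles = 0;
    uint64_t next_sync      = SYNC_CYCLES;
    uint64_t start_ns       = now_ns();

    for (;;) {
        virtual_cycles += execute_next_instruction();

        if (virtual_cycles >= next_sync) {
            /* Where should we be on the wall clock if we really ran at TARGET_HZ? */
            double target_s  = (double)virtual_cycles / (double)TARGET_HZ;
            double elapsed_s = (double)(now_ns() - start_ns) / 1e9;

            if (target_s > elapsed_s) {  /* running too fast: sleep off the difference */
                double          d  = target_s - elapsed_s;
                struct timespec ts = { (time_t)d, (long)((d - (double)(time_t)d) * 1e9) };
                nanosleep(&ts, NULL);
            }
            next_sync += SYNC_CYCLES;
        }
    }
}
```

Re-syncing every few milliseconds of guest time, rather than after every instruction, keeps the sleep overhead negligible while still tracking the target speed closely.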
Nvidia's website explains the time-out problem:
Q: What is the maximum kernel execution time? On Windows, individual GPU program launches have a maximum run time of around 5 seconds. Exceeding this time limit usually will cause a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases can hang the entire machine, requiring a hard reset. This is caused by the Windows "watchdog" timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.

For this reason it is recommended that CUDA is run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.
Source: https://developer.nvidia.com/cuda-faq
So it seems that Nvidia believes, or at least strongly implies, that having multiple (Nvidia) GPUs with the proper configuration can prevent this from happening?
But how? So far I have tried lots of ways, but the annoying time-out still occurs on a GK110 GPU that (1) is plugged into the secondary PCIe 16x slot; (2) is not connected to any monitor; and (3) is set as a dedicated PhysX card in the driver control panel (as recommended by some other people), yet the time-out is still there.
If your GK110 is a Tesla K20c GPU, then you should switch the device from WDDM mode to TCC mode. This can be done with the nvidia-smi.exe tool that gets installed with the driver. Use the Windows search function to find this file (nvidia-smi.exe), then use the command-line help (`nvidia-smi --help`) to discover the commands necessary to switch a GPU from WDDM to TCC mode.
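If memory serves, the relevant switch is the driver-model option, something like `nvidia-smi -i 0 -dm 1` (0 = WDDM, 1 = TCC, with `-i` selecting the GPU index), followed by a reboot; confirm the exact syntax with `nvidia-smi --help`, since it can vary between driver versions.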
Once you have done this, the Windows watchdog mechanism will no longer pay attention to your GK110 device.
If on the other hand it is a GeForce GPU, there is no way to switch it to TCC mode. Your only option is to modify the registry settings, which is somewhat difficult. Your mileage may vary, as the exact structure of the reg keys varies by OS.
If a GPU is in WDDM mode, it is subject to the watchdog timer.
I've seen various RTOSes that take this approach: they boot Windows on one or more CPUs and then run real-time programs on the rest of the CPUs. Any idea how this might be accomplished? Can I let the computer boot off two CPUs and then stop execution on the rest of the CPUs? What documentation should I start looking at? I have enough experience with the Linux kernel that I might be able to figure out how to do it under Linux, so if there's anything that maps well onto Linux that you could describe this in terms of, that would be fantastic.
You can easily boot Windows on fewer CPUs than are available. Run msconfig.exe, go to the Boot tab, click the Advanced options... button, check the Number of processors box and set the desired number (this is for Windows 7; the exact location for Vista and XP might differ slightly).
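(If you prefer the command line, I believe the same setting can be made with something like `bcdedit /set {current} numproc 2` on Vista and later, but treat that as a hint to verify rather than a guaranteed recipe.)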
But that's just a solution to a very small part of the problem.
You will need to implement a special kernel-mode driver to start those other CPUs (Windows won't let you do that sort of thing from non-kernel-mode code). And you will need to implement a thread scheduler for those CPUs and a bunch of other low-level things... You might also want to steal some physical memory (RAM) from Windows and implement a memory manager, and those two alone can be very involved.
What to read? The Intel/AMD CPU documentation (specifically the APIC part), the x86 Multiprocessor specification from Intel, books on Windows drivers, Windows Internals books, MSDN, etc.
You can't turn off Windows on one CPU and expect to run your program as usual, because syscalls are serviced on the same CPU as the thread that issues them. The syscall path relies on kernel-mode-accessible per-thread data, and hence any thread (user-mode or kernel-mode) can only run once Windows has performed its per-core initialization of that CPU.
It seems likely that you're writing a super-double-mega-awesome app that really-definitely needs to run, like, super-fast and you want everyone else to get off the core, 'cos then, like, you'll be the totally fastest-est, but you're not really appreciating that if Windows isn't on your core, then you can't use ANY part of Windows on that core either.
If you really do want to do this, you'll have to run as a boot-driver. The boot-driver will be able to reserve one of the cores from being initialized during boot, preventing Windows from "seeing" that core. You can then manually construct your own thread of execution to run on that core, but you'll need to handle paging, memory allocation, scheduling, NUMA, NMI exceptions, page-faulting, and ACPI events yourself. You won't be able to call Windows from that core without bluescreening Windows. You'll be on your own.
What you probably want to do is lock your thread to a single processor (via SetThreadAffinityMask) and then raise the priority of your thread to the maximum value. When you do so, Windows is still running on your core to service things like page faults and hardware interrupts, but no lower-priority user-mode thread will run on that core (they'll all move to other cores unless they are also locked to your processor).
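A minimal sketch of that approach (not production code; the core index and priority values are just examples):

```c
/*
 * Sketch of the SetThreadAffinityMask + priority approach described above.
 */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Pin the current thread to logical processor 1 (bit N = processor N). */
    DWORD_PTR mask = (DWORD_PTR)1 << 1;
    if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    /* Raise the priority. REALTIME_PRIORITY_CLASS needs elevated rights and can
       starve the system; HIGH_PRIORITY_CLASS is usually a safer starting point. */
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    /* ... time-critical work goes here. Windows still services page faults and
       hardware interrupts on this core, but lower-priority user-mode threads
       will be scheduled elsewhere. ... */

    return 0;
}
```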
I could not understand the question properly, but if you are asking about scheduling processes onto specific cores, then Linux can accomplish this using CPU affinity (sched_setaffinity). Follow this page:
http://www.kernel.org/doc/man-pages/online/pages/man2/sched_setaffinity.2.html
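For completeness, a minimal sketch of what the man page describes (pinning the calling process to CPU 0):

```c
/* Pin the calling process to CPU 0 using sched_setaffinity(2). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                      /* allow CPU 0 only */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... everything from here on runs on CPU 0 ... */
    return 0;
}
```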
Is there some tool which allows one to control the MS-Windows (XP SP3 32-bit in my case) scheduler, such that a target application (which I'd like to test) operates as if it were running on a slower CPU? Say my physical host is a 2.4GHz dual-core, but I'd like the application to run as if it were on an 800MHz/1.0GHz CPU.
I am aware of some such programs which allowed old DOS games to run slower, but AFAIK they take the approach of consuming CPU cycles to starve the application. I do not want such a thing, and would also like to have higher-precision control over the clock.
I don't believe you'll find software that directly emulates the different CPUs, but something like ProcessLasso would let you control a program's CPU usage, thus simulating, in a way, a slower clock speed.
I also found this blog entry with many other ways to throttle your CPU: Windows CPU throttling techniques
Additionally, if you have access to VMware you could set up a resource pool with a limited CPU reservation.
I conducted the following benchmark in qemu and qemu-kvm, with the following configuration:
CPU: AMD 4400 dual-core processor with SVM enabled, 2 GB RAM
Host OS: openSUSE 11.3 with the latest patches, running KDE4
Guest OS: FreeDOS
Emulated memory: 256 MB
Network: none
Language: Turbo C 2.0
Benchmark program: count from 0000000 to 9999999, displaying the counter on the screen by directly accessing screen memory (i.e. 0xB800:xxxx)
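(The original source isn't shown; a rough reconstruction of that kind of loop, assuming Turbo C far pointers, looks like this:)

```c
/* Rough reconstruction of the benchmark described above: count 0000000..9999999
 * and write the digits directly into text-mode video memory at 0xB800:0000. */
#include <dos.h>

int main(void)
{
    unsigned char far *screen = (unsigned char far *)MK_FP(0xB800, 0x0000);
    long i, v;
    int d;

    for (i = 0L; i <= 9999999L; i++) {
        v = i;
        /* seven digits, character bytes only (attribute bytes left untouched) */
        for (d = 6; d >= 0; d--) {
            screen[d * 2] = (unsigned char)('0' + (int)(v % 10));
            v /= 10;
        }
    }
    return 0;
}
```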
It only takes 6 sec when running in qemu.
But it takes 89 sec when running in qemu-kvm.
I ran the benchmark one by one, not in parallel.
I scratched my head the whole night, but I still have no idea why this happens. Would somebody give me some hints?
KVM uses QEMU as its device emulator; any device operation is simulated by the user-space QEMU program. When you write to 0xB8000, the graphics display is updated, which involves the guest doing a CPU `vmexit` from guest mode and returning to the KVM module, which in turn sends a device-emulation request to the user-space QEMU backend.
In contrast, QEMU without KVM does all the work in a single process, with nothing beyond the usual system calls, so there are far fewer CPU context switches. Meanwhile, your benchmark code is a simple loop that requires code-block translation only once. That costs nothing compared to the vmexit and kernel/user-space communication on every iteration in the KVM case.
This should be the most probable cause.
Your benchmark is an I/O-intensive benchmark, and all the I/O devices are actually the same for qemu and qemu-kvm; in qemu's source code they can be found under hw/*.
This explains why qemu-kvm is not necessarily faster than qemu here. However, I have no definitive answer for the slowdown; I have the following explanation, which I think is correct to a large extent.
"The qemu-kvm module uses the kvm kernel module in the Linux kernel. This runs the guest in x86 guest mode, which causes a trap on every privileged instruction. By contrast, qemu uses the very efficient TCG, which translates the instructions it sees only the first time. I think the high cost of those traps is what is showing up in your benchmark." This isn't true for all I/O devices, though: an Apache benchmark would run better on qemu-kvm, because the library does the buffering and uses the smallest number of privileged instructions to do the I/O.
The reason is that too many VMEXITs take place.
I have a web application, and my users are complaining about performance. I have been able to narrow it down to JavaScript issues in IE6, which I need to resolve. I have found the excellent dynaTrace AJAX tool, but my problem is that I don't have any issues on my dev machine.
The problem is that my users' computers are ancient, so timings which are barely noticeable on my machine are perhaps 3-5 times longer on theirs, and suddenly the problem is a lot larger. Is it possible somehow to degrade the performance of my dev machine, or preferably of a VM running on my dev machine, to the specs of my customers' computers?
I don't know of any virtualization solutions that can do this, but I do know that the computer/CPU emulator Bochs allows you to specify a limit on the number of emulated instructions per second, which you can use to simulate slower CPUs.
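If I remember the configuration format correctly, that limit is a single line in your bochsrc, something like `cpu: count=1, ips=10000000` for roughly ten million emulated instructions per second; check the Bochs documentation for the exact syntax and for a value that matches your target machine.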
I am not sure if you can cap the CPU, but in VirtualBox or Parallels you can limit the memory usage. I assume if you only give it about 128MB then it will be very slow. You can also limit the network throughput with a lot of tools. I guess the only thing I am not sure about is the CPU. That's tricky. Curious to know what you find. :)
You could get a copy of VMWare Workstation and choke the CPU of your VM.
With most virtual PC software you can limit the amount of RAM, but you are not able to set the CPU to a slower speed, as the software does not emulate a CPU; it uses the host CPU directly.
You could go with some emulation software like Bochs that will let you set up an x86 processor environment.
You may try Fossil Toys:
* PC Speed: a PC CPU speed monitor / benchmark, with logging facility.
* Memory Load Test: test application/operating system behaviour under low-memory conditions.
* CPU Load Test: test application/operating system behaviour under high CPU load conditions.
Although it doesn't simulate a specific CPU clock speed.