How can I limit the processing power given to a specific program? - limit

I develop on a laptop with a dual-core AMD 1.8 GHz processor, but people frequently run my programs on much weaker systems (a 300 MHz ARM, for example).
I would like to simulate such weak environments on my laptop so I can observe how my program runs. It is an interactive application.
I looked at QEMU and I know how to set up an environment, but it's a bit painful and I didn't see the exact incantation of switches I would need to make QEMU simulate a weaker CPU.
I have VirtualBox, but it doesn't seem to let me virtualize less than one full host CPU.
I know about http://cpulimit.sourceforge.net/, which uses SIGSTOP and SIGCONT to try to limit the CPU time given to a process, but I am worried this is not really an accurate portrayal of a weaker CPU.
Any ideas?

If your CPU is 1800 MHz and your target is 300 MHz, and your code is like this:
while(1) { /*...*/ }
you can rewrite it like:
long last = gettimestamp();
while (1)
{
    long curr = gettimestamp();
    if (curr - last > 167)           // after ~1/6 of each second of real work...
    {
        long target = curr + 833;    // ...waste the other 5/6 of it
        while (gettimestamp() < target)
            ;                        // busy-wait; the useful work is stalled here
        last = target;
    }
    // your original code
}
where gettimestamp() returns milliseconds from your OS's high-frequency timer.
You can choose to work with smaller values for a smoother experience, say 83 ms out of every 100 ms, or 8 ms out of every 10 ms, and so on. The lower you go, though, the more timer-precision loss will throw off the ratio.
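For concreteness, here is a minimal self-contained sketch of the same idea on Linux, assuming clock_gettime(CLOCK_MONOTONIC) as the high-frequency timer and a hypothetical do_work() standing in for one iteration of the original loop; the 100 ms window is one of the smaller values mentioned above:

#include <time.h>

/* Milliseconds from a monotonic clock -- the gettimestamp() used above. */
static long gettimestamp(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Hypothetical stand-in for one iteration of the real work. */
static void do_work(void)
{
    static volatile long n;
    n++;
}

int main(void)
{
    const long window_ms = 100;            /* throttle window           */
    const long work_ms   = window_ms / 6;  /* 300 MHz / 1800 MHz = 1/6  */
    long last = gettimestamp();

    while (1) {
        long curr = gettimestamp();
        if (curr - last > work_ms) {
            long target = last + window_ms;  /* spin out the rest of the window */
            while (gettimestamp() < target)
                ;                            /* busy-wait */
            last = gettimestamp();
        }
        do_work();
    }
}

Note that the busy-wait keeps the core at 100%; that is intentional, since the goal is not to free the CPU but to deny the useful work most of the wall-clock time it would otherwise get.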
Edit: Or how about this? Create a second process that starts the first and attaches itself to it as a debugger, then periodically pauses and resumes it according to the algorithm above.
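Here is a rough sketch of that idea on Linux, using SIGSTOP/SIGCONT (the same signals cpulimit uses) rather than a real debugger attach; the 10 ms / 50 ms duty cycle is just one way to express the 1:5 ratio:

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
        return 1;
    }

    pid_t child = fork();
    if (child == 0) {
        execvp(argv[1], &argv[1]);   /* run the program to be throttled */
        perror("execvp");
        _exit(127);
    }

    /* 1:5 duty cycle, approximating a 6x slower CPU. */
    const unsigned int run_us  = 10000;   /* let it run for 10 ms...   */
    const unsigned int stop_us = 50000;   /* ...then stop it for 50 ms */

    while (waitpid(child, NULL, WNOHANG) == 0) {
        usleep(run_us);
        kill(child, SIGSTOP);
        usleep(stop_us);
        kill(child, SIGCONT);
    }
    return 0;
}

As the question notes, this throttles average CPU time rather than truly simulating a slower clock, so memory and I/O still run at full speed.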

You may want to look at an emulator that is built for this. For example, there is this tech note on the Microsoft device emulators: http://www.nsbasic.com/ce/info/technotes/TN23.htm.
Without knowing more about the languages and platforms you are using, it is hard to be more specific, but I would trust the emulator programs to do a good job of providing the test environment.

I picked up a Pentium II MMX 266 MHz laptop somewhere and installed a minimal Debian on it. That was a perfect solution until it died a few weeks ago. It is a Panasonic model with a non-standard IDE connector (neither 40-pin nor 44-pin), so I was unable to replace its HDD with a CF card (a CF-to-IDE adapter costs next to nothing). The price of such a machine is around USD 50 / EUR 40.
(I was using it to simulate a slow ARM-based machine for our home automation system, which is planned to be able to run even on the smallest and slowest Linux systems. In the meantime, we've chosen a small and slow computer for the home automation box: a GuruPlug. It has a roughly 1.2 GHz CPU.)
(I'm not familiar with QEMU, but the manual says that you can use KVM (kernel virtualization) to run programs at native speed; I assume that if that is an extra feature it can be turned off, so, strange but true, QEMU can emulate x86 on x86.)

Related

A multi-threaded software (PFC3D, used for simulation) not using all the available cores

I'm using multi-threaded software (PFC3D, developed by Itasca Consulting) to run some simulations. After moving to a powerful computer, a dual Intel Xeon Gold 5120T at 2.2 GHz (2 processors, 28 physical cores, 56 logical cores) running Windows 10, in order to speed up the calculations, the software only seems to use a limited number of cores. Normally the software detects 56 cores and automatically uses the maximum number of cores.
I'm quite sure the problem is in the system and not in my software, because I run the same code on an Intel Core i9-9880H (16 logical cores) and it uses all the cores, with even better efficiency than the Xeon Gold.
The software is only using 22 to 30; 28 cores / 56 logical processors are displayed on Task Manager's CPU page. I have Windows 10 Pro.
I would very much appreciate your help.
Thank you
Youssef
It's hard to say because I do not have the code and you provide so little information.
You seem to have no I/O, since you said that you use 100% of the CPU on the i9. That should simplify things a little, but...
There could be many reasons.
My feeling is that you have thread synchronisation (such as a critical section) around one or more shared resources. Those resources seem to be only lightly contended: each thread needs them only briefly, which lets 16 threads access them with few (or very few) collisions, so threads mostly do not have to wait because the shared resource is usually available, not locked. But adding more threads significantly increases the number of collisions (the shared resource being locked by another thread), so threads end up waiting for it. It really sounds like something like that, but it is only a guess.
A quick experiment that could potentially improve performance (because I have the feeling the shared resource only needs very brief access) is to use a SpinLock instead of a regular critical section. But that is purely a guess based on very little information, and SpinLock is available in C# but perhaps not in your language.
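To illustrate the idea only (this is not the poster's code; I am assuming a C/pthreads equivalent of the C# SpinLock), here is a minimal sketch of swapping a mutex for a spinlock around a very short critical section:

#include <pthread.h>

static long shared_counter = 0;

/* Regular lock: a waiting thread may be put to sleep by the OS. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void bump_with_mutex(void)
{
    pthread_mutex_lock(&mtx);
    shared_counter++;                /* very short critical section */
    pthread_mutex_unlock(&mtx);
}

/* Spinlock: a waiting thread busy-waits, which can win when the lock is
   held only for a handful of instructions and contention is modest. */
static pthread_spinlock_t spin;

static void bump_with_spinlock(void)
{
    pthread_spin_lock(&spin);
    shared_counter++;
    pthread_spin_unlock(&spin);
}

int main(void)
{
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
    bump_with_mutex();
    bump_with_spinlock();
    pthread_spin_destroy(&spin);
    return 0;
}

Whether this helps depends entirely on how long the lock is held and how many threads contend for it, which is exactly the unknown here.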
About the number of CPUs used, it could be normal to use only half of them, depending on how the program is written. Sometimes it is better not to use the hyper-threaded (logical) cores, and perhaps your program decides that itself. There could also be a bug, either in the program itself, in C#, or in the BIOS, that tells the application there are only 28 CPUs instead of 56 (usually related to hyper-threading). It is still a guess.
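As a quick sanity check of what the system actually reports to a process, here is a small Win32 sketch (GetActiveProcessorCount counts across processor groups; whether the application or its runtime looks beyond its own group is an assumption I cannot verify):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    /* Logical processors visible in the calling process's processor group. */
    printf("GetSystemInfo: %lu logical processors in this group\n",
           (unsigned long)si.dwNumberOfProcessors);

    /* Logical processors across all processor groups (Windows 7 and later). */
    printf("GetActiveProcessorCount: %lu logical processors total\n",
           (unsigned long)GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}

If the two numbers differ, the process is confined to a single processor group, which on some multi-socket configurations would explain an application seeing only part of the machine.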
There could be some additional information that might help you in this Stack Overflow question.
Good luck.

How to prevent Windows GPU "Timeout Detection and Recovery"?

If I run a long-running kernel on a GPU device, after 2 seconds (by default) the Windows TDR (Timeout Detection and Recovery) mechanism will kill the running kernel. I understand why it exists, but what if you can't predict how long the kernel will run, because you need to do a lot of computation and you don't know the capacity/speed of the GPU of the actual user running your program?
What are the best practices for solving this problem?
I found 3 ways to prevent it from happening, but none of them seems like a good solution to me:
You need to make sure that your kernels are not too time-consuming:
The kernel is time-consuming, and though I could do some kind of fragmentation and run not 1 million of them at once but 2*500k or 4*250k, I still can't predict whether that will fit into the default 2 seconds on the actual user's GPU. (I had the idea of halving the number until the kernel no longer fails with a CL_INVALID_COMMAND_QUEUE error, and then just calling it multiple times with the smaller amount, but to be honest that sounds really hacky and has some other drawbacks.)
You can turn off the watchdog timer (or increase the delay): Timeout Detection and Recovery of GPUs:
It's done with a registry edit, and you need to restart Windows for it to take effect. You can't do that on a user's machine.
You can run the kernel on a GPU that is not hooked up to a display:
How can you make sure the GPU is not hooked up to a display on a user's machine? Even on my laptop the primary GPU is the Intel HD 4000 and the NVIDIA GPU is not driving the display (I think), but TDR still kills my kernels.
You listed all of the solutions I know of. Since solution 2 leaves the machine in an unusable state while your kernel runs (not a good practice) it should be avoided. Since adding another GPU (solution 3) is not practical for you, your best bet is to focus on solution 1. I don't know why you are trying to maximize the work size to run as long as possible to avoid TDR. You should instead target around 10 ms or less (if you run many kernels that take longer the GUI is very sluggish). So instead of 4*250000, think more like 400*2500. You may need to put in some clFinish calls between each one (or batch of 10, or whatever). Keeping the execution time small (10 ms) and not overfilling the queue will allow the GPU to do other things in between kernels and you won't get TDR resets nor make the machine unusable and yet the GPU will be quite busy.
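A hedged sketch of that batching pattern using the OpenCL host API in C; the queue and kernel are assumed to already be set up, the kernel is assumed to index its work purely by global ID (so a global work offset is enough to split it), and the chunk size of 2500 mirrors the 400*2500 suggestion:

#include <CL/cl.h>

/* Enqueue `total` work-items in small chunks so each launch stays far below
   the ~2 s TDR limit (aim for roughly 10 ms per launch). */
static cl_int run_in_chunks(cl_command_queue queue, cl_kernel kernel,
                            size_t total, size_t chunk)
{
    cl_int err = CL_SUCCESS;

    for (size_t offset = 0; offset < total; offset += chunk) {
        size_t global = (total - offset < chunk) ? (total - offset) : chunk;

        err = clEnqueueNDRangeKernel(queue, kernel,
                                     1,        /* work_dim           */
                                     &offset,  /* global_work_offset */
                                     &global,  /* global_work_size   */
                                     NULL,     /* local_work_size    */
                                     0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;

        /* Drain the queue every 10 launches so it never piles up and the
           GPU can service the display in between. */
        if ((offset / chunk) % 10 == 9)
            if ((err = clFinish(queue)) != CL_SUCCESS)
                return err;
    }
    return clFinish(queue);
}

/* Usage sketch: err = run_in_chunks(queue, kernel, 1000000, 2500); */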

How to reduce time taken for large calculations in MATLAB

When using the desktop PCs in my university (which have 4 GB of RAM), calculations in MATLAB are fairly speedy, but on my laptop (which also has 4 GB of RAM), the exact same calculations take ages. My laptop is much more modern, so I assume it also has a similar clock speed to the desktops.
For example, I have written a program that calculates the solid angle subtended by 50 disks at 500 points. On the desktop PCs this calculation takes about 15 seconds; on my laptop it takes about 5 minutes.
Is there a way to reduce the time taken to perform these calculations? For example, can I allocate more RAM to MATLAB, or can I boot my PC in a way that optimises it for running MATLAB? I'm thinking that if the processor on my laptop is also doing calculations to run other programs, this will slow down the MATLAB calculations. I've closed all other applications, but I know there's probably a lot going on that I can't see. Can I boot my laptop in a way that has fewer of these things running in the background?
I can't modify the code to make it more efficient.
Thanks!
You might run some of my benchmarks, which, along with example results, can be found via:
http://www.roylongbottom.org.uk/
At any particular point in time, the CPU core design is essentially the same across Pentiums, Celerons, Core 2s, Xeons and others; the main differences are L2/L3 cache sizes and external memory bus speeds. So you can compare most results with similar-vintage 2 GHz CPUs. Things to try, besides simple number-crunching tests:
1 - Try a memory test, such as my BusSpeed, to show that the caches are being used and the RAM is not dead slow.
2 - Assuming Windows, check in Task Manager that the offending program is the one using most of the CPU time, and that with the program not running, CPU utilisation is around zero.
3 - Check that the CPU temperature is not too high, for example with SpeedFan (free download).
4 - If the disk light is flashing, too much RAM might be in use, with some of it being swapped in and out. Task Manager's Performance tab would show this. Increasing RAM demands can be checked with some of my reliability tests.
There are many things that go into computing power besides RAM. You mention processor speed, but there is also number of cores, GPU capability and more. Programs like MATLAB are designed to take advantage of features like parallelism.
Summary: You can't compare only RAM between two machines and expect to know how they will perform with respect to one another.
Side note: 4 GB is not very much RAM for a modern laptop.
Firstly, you should run a CPU performance benchmark on both computers.
Modern operating systems usually apply the most aggressive power-management schemes when running on a laptop. This usually means turning off one or more cores, or clocking them down to a very low frequency. For example, on battery, a quad-core CPU that normally runs at 2.0 GHz could be throttled down to 700 MHz on one core while the other three are essentially put to sleep. (Remark: the numbers are not taken from a real example.)
The OS manages the CPU frequency dynamically, tweaking it on the order of seconds. You will need a monitoring tool that actually queries the CPU frequency every second (without doing busy work itself) to know whether this is the case.
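On Windows, tools such as CPU-Z show the live core clock; if the laptop happens to run Linux, the same information can be polled straight from sysfs. A minimal Linux sketch, assuming the usual cpufreq path for cpu0:

#include <stdio.h>
#include <unistd.h>

/* Poll the kernel's view of cpu0's current clock once a second.
   scaling_cur_freq is reported in kHz by the cpufreq subsystem. */
int main(void)
{
    const char *path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";

    for (int i = 0; i < 30; i++) {        /* watch it for 30 seconds */
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        long khz = 0;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu0: %ld MHz\n", khz / 1000);
        fclose(f);
        sleep(1);
    }
    return 0;
}

If the reported frequency climbs as soon as the calculation starts and the laptop is plugged in, power management is probably not the culprit.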
Plugging in the laptop will make the OS use a less aggressive power management scheme.
(If this is found to be unrelated to MATLAB, please "flag" this post and ask moderator to move this question to the SuperUser site.)

Windows 7 QueryPerformanceFrequency returns 2.4 MHz-ish?

I'm running some timing code on various OSes. I notice the following patterns in the results from QueryPerformanceCounter:
Standard Windows XP uses the processor frequency, which means it's using RDTSC under the hood.
Vista uses the HPET, 14,318,180 Hz
Any version of Windows with /usepmtimer uses the ACPI clock, 3,579,545 Hz
Windows 7 uses a clock of undetermined origin, returning varying numbers around 2.4 to 2.6 MHz.
Does anyone know what clock Windows 7 is using by default? Why is it even slower than the ACPI clock? Is there a way to force Windows 7 to use the HPET instead?
Windows 7 will pick different QPC sources at boot based on what processor/hardware is available - I believe there are also changes regarding this in SP1.
The change from Vista was most likely made for app-compat reasons: on multicore CPUs, the RDTSC values read on different cores are not guaranteed to be in sync, so apps being scheduled across multiple CPUs would sometimes see QPC go backwards and would freak out.
OK, this is only a partial answer as I'm still nailing it down, but this 2.x MHz frequency is equal to the nominal TSC speed divided by 1024.
Try the math with your QPF result and your own CPU speed and it should work out.
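A quick way to check that on your own machine, sketched in C for MSVC (__rdtsc comes from intrin.h); using Sleep(1000) to estimate the nominal TSC rate is only a rough calibration:

#include <windows.h>
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER qpf;
    QueryPerformanceFrequency(&qpf);

    /* Roughly estimate the TSC rate by counting ticks over one second. */
    unsigned __int64 t0 = __rdtsc();
    Sleep(1000);
    unsigned __int64 tsc_hz = __rdtsc() - t0;

    printf("QPF           : %lld Hz\n", qpf.QuadPart);
    printf("TSC (approx.) : %llu Hz\n", tsc_hz);
    printf("TSC / 1024    : %llu Hz\n", tsc_hz / 1024);
    return 0;
}

If the first and last printed values land within a percent or so of each other, the QPC source on that machine really is the TSC divided by 1024.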
I initially thought it was a division of the HPET rate, but that does not seem to be the case.
Now the question is: the LAPIC timer runs at the system bus rate, but so does the TSC (before the multiplier coefficient is applied), so we don't know which counter is used before the final division (it could be TSC/1024 or BUS/something else); what we do know is that it's using the main motherboard crystal (the one driving the bus).
What doesn't sound right is that some MSDN articles seem to imply the LAPIC timer is barely used (except for hypervisors/virtual machines), but given that the HPET failed to deliver on its promises due to many implementation problems, and that most new platforms feature an invariant TSC, they seem to be changing direction again.
I haven't found any formal statement from Microsoft about the new source used in Windows 7, though... and we can't completely rule out the HPET, since even if it's not used in timer mode its counter can still be read (e.g. by QueryPerformanceCounter) - but why divide its rate and thus lower its resolution?

Linux: Timing during recording/playing sound

I have a more general question regarding timing on a standard Linux OS when playing sound and receiving data over a serial port.
At the moment, I'm reading a PCM signal arriving over a USB-to-serial bridge (PL2303); it is recorded, encoded and sent by an FPGA.
Now I need to create "peaks" at a known position in the recorded sound stream, and I plan to play a sound file from the same machine that is recording, at a known moment. The peak has to begin and end inside windows of at most 50 ms; its length could be ~200 ms...
Now, my question is: how precise can I expect the timing to be? I know that several components add unknown lag and jitter:
The USB-to-serial bridge collects ~20 bytes from the serial side before sending them to the USB side (at 230400 baud this amounts to ~1 ms).
If I run "`sleep 1; mpg123 $MP3FILE` &" directly before my recording software, the Linux kernel will schedule the two differently (maybe this adds a few tens of milliseconds, depending on system load?).
The sound card/driver will maybe add some more unknown lag...
Would tricks like "nice" or "sched_setscheduler" add value in my case? (A rough sketch of what I mean follows below.)
I could build an additional thread inside my recording software that plays the sound. The timing might then be more precise, but it is a lot more work...
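What I have in mind for the sched_setscheduler route is roughly this minimal sketch (it needs root or CAP_SYS_NICE, and the priority of 50 is an arbitrary pick):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };  /* 1..99 for SCHED_FIFO */

    /* Put the calling process under the real-time FIFO scheduler so it is
       not preempted by ordinary time-sharing tasks. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... recording / playback work would go here ... */
    return 0;
}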
Thanks a lot.
I will try it anyway, but I'm looking for some background thoughts to help me understand and solve my upcoming problems better.
I am not 100% sure, but I would imagine that your kernel needs to be rebuilt to allow the scheduler to reduce the latency of task switching. In the 2.6.x kernel series there is an option to make the kernel smoother by making it preemptible:
Go to Processor Type and features
Pre-emption Model
Select Preemptible kernel (low latency desktop)
This should streamline the timing and make the sound appear smoother as a result of less jitter.
Try that and recompile the kernel. There are, of course, plenty of kernel patches that reduce the timeslice for each task switch to make it even smoother; your mileage may vary depending on:
Processor speed - what processor is used?
Memory - how much RAM?
Disk input/output - the faster, the merrier
Those three factors combined have an influence on the scheduler and the multi-tasking behaviour. The lower the latency, the more fine-grained the scheduling is.
Incidentally, there is a specialised Linux distribution catered to capturing sound in real time; I cannot remember its name, but the kernel in that distribution was heavily patched to make sound capture very smooth.
It's me again... After one restless night, I solved my strange timing problems. My first edit was not completely correct, since what I posted was not 100% reproducible. After running some more tests, I can present the following plot showing timing accuracy:
Results from analysis http://mega2000.de/~mzenzes/pics4web/2010-05-28_13-37-08_timingexperiment.png
I tried two different ubuntu-kernels: 2.6.32-21-generic and 2.6.32-10-rt
I tried to achieve RT-scheduling: sudo chrt --fifo 99 ./experimenter.sh
And I tried to change powersaving-options: echo {performance,conservative} | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
This resulted in 8 different tests, with 50 runs each. Here are the numbers:
                                     mean(peakPos)   std(peakPos)
rt-kernel-fifo99-ondemand                 0.97          0.0212
rt-kernel-fifo99-performance              0.99          0.0040
rt-kernel-ondemand                        0.91          0.1423
rt-kernel-performance                     0.96          0.0078
standard-kernel-fifo99-ondemand           0.68          0.0177
standard-kernel-fifo99-performance        0.72          0.0142
standard-kernel-ondemand                  0.69          0.0749
standard-kernel-performance               0.69          0.0147
