windowed OpenGL first frame delay after idle

windowed OpenGL first frame delay after idle - winapi

I have windowed WinApi/OpenGL app. Scene is drawn rarely (compared to games) in WM_PAINT, mostly triggered by user input - MW_MOUSEMOVE/clicks etc.
I noticed, that when there is no scene moving by user mouse (application "idle") and then some mouse action by user starts, the first frame is drawn with unpleasant delay - like 300 ms. Following frames are fast again.
I implemented 100 ms timer, which only does InvalidateRect, which is later followed by WM_PAINT/draw scene. This "fixed" the problem. But I don't like this solution.
I'd like know why is this happening and also some tips how to tackle it.
Does OpenGL render context save resources, when not used? Or could this be caused by some system behaviour, like processor underclocking/energy saving etc? (Although I noticed that processor runs underclocked even when app under "load")

This sounds like Windows virtual memory system at work. The sum of all the memory use of all active programs is usually greater than the amount of physical memory installed on your system. So windows swaps out idle processes to disc, according to whatever rules it follows, such as the relative priority of each process and the amount of time it is idle.
You are preventing the swap out (and delay) by artificially making the program active every 100ms.
If a swapped out process is reactivated, it takes a little time to retrieve the memory content from disc and restart the process.
Its unlikely that OpenGL is responsible for this delay.
You can improve the situation by starting your program with a higher priority.
https://superuser.com/questions/699651/start-process-in-high-priority
You can also use the virtuallock function to prevent Windows from swapping out part of the memory, but not advisable unless you REALLY know what you are doing!
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366895(v=vs.85).aspx
EDIT: You can improve things for sure by adding more memory and for sure 4GB sounds low for a modern PC, especially if you Chrome with multiple tabs open.
If you want to be scientific before spending any hard earned cash :-), then open Performance Manager and look at Cache Faults/Sec. This will show the swap activity on your machine. (I have 16GB on my PC so this number is very low mostly). To make sure you learn, I would check Cache Faults/Sec before and after the memory upgrade - so you can quantify the difference!
Finally, there is nothing wrong with the solution you found already - to kick start the graphic app every 100ms or so.....

Problem was in NVidia driver global 3d setting -"Power management mode".
Options "Optimal Power" and "Adaptive" save power and cause the problem.
Only "Prefer Maximum Performance" does the right thing.

Related

How to prevent Windows GPU "Timeout Detection and Recovery"?

If I run a long-running kernel on a GPU device, after 2 seconds (by default) the windows TDR (Timeout Detection and Recovery) will kill the running kernels. I understand it, but what if you can't predict how long the kernel will run, because you need to do lots of computations and neither you know the capacity/speed of the underlying GPU for the actual user, who runs your program?
What are the best practices for solving this problem?
I found 3 ways to prevent it to happen, but none of those seems a good solution for me:
You need to make sure that your kernels are not too time-consuming:
The kernel is time consuming and though I could do some kind of fragmentation and not run 1 million of them but 2*500k or 4*250k, but I still can't predict if it will fit into the default 2 seconds on the actual user's GPU. (I had the idea to half the number until your kernel won't drop a CL_INVALID_COMMAND_QUEUE error, and then you just call it multiple times with the smaller amount, but to be honest it sounds really hackie and have some other drawbacks.)
You can turn-off the watchdog timer (or increase the delay): Timeout Detection and Recovery of GPUs:
It's done by registry edit, and you need to restart Windows to make it effective. You can't do it on a user's machine.
You can run the kernel on a GPU that is not hooked up to a display:
How can you make sure the GPU is not hooked up to a display on a users machine? Even in my laptop my primary GPU is the Intel HD4000 and the NVidia GPU is not in use for display (I think so), but TDR still kills my kernels.

You listed all of the solutions I know of. Since solution 2 leaves the machine in an unusable state while your kernel runs (not a good practice) it should be avoided. Since adding another GPU (solution 3) is not practical for you, your best bet is to focus on solution 1. I don't know why you are trying to maximize the work size to run as long as possible to avoid TDR. You should instead target around 10 ms or less (if you run many kernels that take longer the GUI is very sluggish). So instead of 4*250000, think more like 400*2500. You may need to put in some clFinish calls between each one (or batch of 10, or whatever). Keeping the execution time small (10 ms) and not overfilling the queue will allow the GPU to do other things in between kernels and you won't get TDR resets nor make the machine unusable and yet the GPU will be quite busy.

high load low cpu low iowait, why?

Look at the those peaks in the first graph, which factor can cause this?
cpu 24X6

There's a lot of stuff going on in any general purpose computer. When I performance profiled apps in a former life, I saw this all the time and factored it out.
It's caused by a whole host of sources: Processor dealing with interrupts, some disk maintenance routine, file system clean up, completely useless background apps that have been installed unknown to you as automatically launched services, etc.
Your plot of idle time is a little disconcerting. It is awfully low. What apps do you have running taking up all that processing? Also, if your memory is low, say because you have 20 or 30 browser tabs/windows open, your CPU load will go through the roof due to all that page and context swapping.

Drastic performance inprovement in .NET CF after app gets moved out of the foreground. Why?

I have a large (500K lines) .NET CF (C#) program, running on CE6/.NET CF 3.5 (v.3.5.10181.0). This is running on a FreeScale i.Mx31 (ARM) # 400MHz. It has 128MB RAM, with ~80MB available to applications. My app is the only significant one running (this is a dedicated, embedded system). Managed memory in use (as reported by GC.Collect) is about 18MB.
To give a better idea of the app size, here's some stats culled from .NET CF Remote Performance Monitor after staring up the application:
GC:
Garbage Collections 131
Bytes Collected by GC 97,919,260
Managed Bytes in use after GC 17,774,992
Total Bytes in use after GC 24,117,424
GC Compactions 41
JIT:
Native Bytes Jitted: 10,274,820
Loader:
Classes Loaded 7,393
Methods Loaded 27,691
Recently, I have been trying to track down a performance problem. I found that my benchmark after running the app in two different startup configurations would run in approximately 2 seconds (slow case) vs. 1 second (fast case). In the slow case, the time for the benchmark could change randomly from EXE run to EXE run from 1.1 to 2 seconds, but for any given EXE run, would not change for the life of the application. In other words, you could re-run the benchmark and the time for the test stays the same until you restart the EXE, at which point a new time is established and consistent.
I could not explain the 1.1 to 2x slowdown via any conventional mechanism, or by narrowing the slowdown to any particular part of the benchmark code. It appeared that the overall process was just running slower, almost like a thread was spinning and taking away some of "my" CPU.
Then, I randomly discovered that just by switching away from my app (the GUI loses the foreground) to another app, my performance issue disappears. It stays gone even after returning my app to the foreground. I now have a tentative workaround where my app after startup launches an auxiliary app with a 1x1 size window that kills itself after 5ms. Thus the aux app takes the foreground, then relinquishes it.
The question is, why does this speed up my application?
I know that code gets pitched when a .NET CF app loses the foreground. I also notice that when performing a "GC Heap" capture with .NET CF Remote Performance Monitor, a Code Pitch is logged -- and this also triggers the performance improvement in my app. So I suspect somehow that code pitching is related or even responsible for fixing performance. But I'm at a loss as to figure out how to determine if that is really the case, or even to explain why pitching code could help in this way. Does pitching out lots of code somehow significantly help locality of reference of code pages (that are re-JITted, presumably near each other in memory) enough to help this much? (My benchmark spans probably 3 dozen routines and hundreds of lines of code.)
Most importantly, what can I do in my app to reliably avoid this slower condition. Any pointers to relevant .NET CF / JIT / Code pitching information would be greatly appreciated.

Your app going to the background auto-triggers a GC.Collect, which collects, may compact the GC Heap and may pitch code. Have you checked to see if a manual GC.Collect without going to the background gives the same behavior? It might not be pitching that's giving the perf gain, it might be collection or compaction. If a significant number of dead roots are swept up, walking the root tree may be getting faster. Can't say I've specifically seen this issue, so this is all conjecture.

Send your app a wm_hibernate before your performance loop. Will clean up things

We have a similar issue with our .NET CF application.
Over time, our application progressively slows down, eventually to a halt with what I anticipate is due to high CPU load, or as #wil-s says, as if thread is spinning consuming CPU. The only assumption / conclusion I've made to so far is either we have a rogue thread in our code, or there's an under the cover issue in .NET CF, maybe with the JITter.
Closing the application and re-launching returns our application to normal expected performance.
I am yet to implement code change to test issuing WM_HIBERNATE or launch a dummy app which quits itself (as above) to force a code pitch, but am fairly sure this will resolve our issue based on the above comments. (so many thanks for that)
However, I'm subsequently interested to know whether a root cause was ever found to this specific issue?
Incidentally and seemingly off topic (but bear with me), we're using a Freescale i.MX28 processor and are experiencing unpredictable FlashDisk corruption. Seeing 2K blocks of 0xFFs (erased blocks) in random files located on NAND Flash.
I'm mentioning this as I now believe the CPU and FlashDisk corruption issues are linked, due to this article as well as this one:
https://electronics.stackexchange.com/questions/26720/flash-memory-corruption-due-to-electricals
In the article, #jwygralak67 comments:
I recently worked through a flash corruption issue, on a WinCE system,
as part of a development team. We would sporadically find 2K blocks of
flash that were erased. (All bytes 0xFF) For about 6 months we tested
everything from ESD, to dirty power to EMI and RFI interference, we
bought brand new devices and tracked usage to make sure we weren't
exceeding the erase cycle limit and buring out blocks, we went through
our (application level) software with a fine toothed comb.
In the end it turned out to be an obscure bug in the very low level
flash driver code, which only occurred under periods of heavy CPU
load. The driver came from a 3rd party. We informed them of the issue
we found, but I don't know if they ever released a patch.
Unfortunately, we're yet to make contact with him.
With all of this in mind, potentially if we work around the high CPU load, maybe the corruption will no longer be present. Another conjecture case!
On that assumption however, this doesn't give a firm root cause for either situation, which I'm desperately seeking!
Any knowledge or insight, however small, would be very gratefully received.
#ctacke - we've spoken before regarding OpenNETCF via email, so I'm pleased to see your name!

Windows Stalls When My Program Uses Swapfile

I am running a user mode program on normal priority. My program is searching an NP problem, and as a result, uses up a lot of memory which eventually ends up in the swap file.
Then my mouse freezes up, and it takes forever for task manager to open up and let me end the process.
What I want to know is how I can stop my Windows operating system from completely locking up from this even though only 1 out of my 2 cores are being used.
Edit:
Thanks for the replies.
I know that making it use less memory will help, but it just doesn't make sense to me that the whole OS should lock up.

The obvious answer is "use less memory". When your app uses up all the
available memory, the OS has to page the task manager (etc.) out to make room for your app. When you switch programs, the OS has to page the other programs back in (as they are needed).
Disk reads are slower than memory reads, so everything appears to be
going slower.
If you want to avoid this, have your app manage its own memory, or
use a better algorithm than brute force. (There are genetic
algorithms, simulated annealing, etc.)

The problem is that when another program (e.g. explorer.exe) is going to execute, all of its code and memory has been swapped out. To make room for the other program Windows has to first write data that your program is using to disk, then load up the other program's memory. Every new page of code that is executed in the other program requires disk access, causing it to run slowly.
I don't know the access pattern of your program, but I'm guessing it touches all of its memory pages a lot in a random fashion, which makes the problem worse because as soon as Windows evicts a memory page from your program, suddenly you need it again and Windows has to find some other page to give the same treatment.
To give other processes more RAM to live in, you can use SetProcessWorkingSetSize to reduce the maximum amount of RAM that your program may use. Of course this will make your program run more slowly because it has to do more swapping.
Another alternative you could try is to add more drives to the system, and distribute the swap files over those. You may have a dual-core CPU, but you have only a single drive. Distributing the swap file over multiple drives allows Windows to balance work across them (although I don't have first-hand experience of how well it does this).

I don't think there's a programming answer to this question, aside from "restructure your app to use less memory." The swapfile problem is most likely due to the bottleneck in accessing the disk, especially if you're using an IDE HDD or a highly fragmented swapfile.

It's a bit extreme, but you could always minimise your swap file so you don't have all the disk thashing, and your program isn't allowed to allocate much virtual memory. Under Control panel / Advanced / Advanced tab / Perfromance / Virtual memory, set the page file to custom size and enter a value of 2mb (smallest allowed on XP). When an allocation fails, you should get an exception and be able exit gracefully. It doesn't quite fix your problem, just speeds it up ;)
Another thing worth considering would be if you are ona 32bit platform, port to a 64bit system and get a box with much more addressable RAM.

Power Efficient Software Coding

In a typical handheld/portable embedded system device Battery life is a major concern in design of H/W, S/W and the features the device can support. From the Software programming perspective, one is aware of MIPS, Memory(Data and Program) optimized code.
I am aware of the H/W Deep sleep mode, Standby mode that are used to clock the hardware at lower Cycles or turn of the clock entirel to some unused circutis to save power, but i am looking for some ideas from that point of view:
Wherein my code is running and it needs to keep executing, given this how can I write the code "power" efficiently so as to consume minimum watts?
Are there any special programming constructs, data structures, control structures which i should look at to achieve minimum power consumption for a given functionality.
Are there any s/w high level design considerations which one should keep in mind at time of code structure design, or during low level design to make the code as power efficient(Least power consuming) as possible?

Like 1800 INFORMATION said, avoid polling; subscribe to events and wait for them to happen
Update window content only when necessary - let the system decide when to redraw it
When updating window content, ensure your code recreates as little of the invalid region as possible
With quick code the CPU goes back to deep sleep mode faster and there's a better chance that such code stays in L1 cache
Operate on small data at one time so data stays in caches as well
Ensure that your application doesn't do any unnecessary action when in background
Make your software not only power efficient, but also power aware - update graphics less often when on battery, disable animations, less hard drive thrashing
And read some other guidelines. ;)
Recently a series of posts called "Optimizing Software Applications for Power", started appearing on Intel Software Blogs. May be of some use for x86 developers.

Zeroith, use a fully static machine that can stop when idle. You can't beat zero Hz.
First up, switch to a tickless operating system scheduler. Waking up every millisecend or so wastes power. If you can't, consider slowing the scheduler interrupt instead.
Secondly, ensure your idle thread is a power save, wait for next interrupt instruction.
You can do this in the sort of under-regulated "userland" most small devices have.
Thirdly, if you have to poll or perform user confidence activities like updating the UI,
sleep, do it, and get back to sleep.
Don't trust GUI frameworks that you haven't checked for "sleep and spin" kind of code.
Especially the event timer you may be tempted to use for #2.
Block a thread on read instead of polling with select()/epoll()/ WaitForMultipleObjects().
Puts stress on the thread scheuler ( and your brain) but the devices generally do okay.
This ends up changing your high-level design a bit; it gets tidier!.
A main loop that polls all the things you Might do ends up slow and wasteful on CPU, but does guarantee performance. ( Guaranteed to be slow)
Cache results, lazily create things. Users expect the device to be slow so don't disappoint them. Less running is better. Run as little as you can get away with.
Separate threads can be killed off when you stop needing them.
Try to get more memory than you need, then you can insert into more than one hashtable and save ever searching. This is a direct tradeoff if the memory is DRAM.
Look at a realtime-ier system than you think you might need. It saves time (sic) later.
They cope better with threading too.

Do not poll. Use events and other OS primitives to wait for notifiable occurrences. Polling ensures that the CPU will stay active and use more battery life.

From my work using smart phones, the best way I have found of preserving battery life is to ensure that everything you do not need for your program to function at that specific point is disabled.
For example, only switch Bluetooth on when you need it, similarly the phone capabilities, turn the screen brightness down when it isn't needed, turn the volume down, etc.
The power used by these functions will generally far outweigh the power used by your code.

To avoid polling is a good suggestion.
A microprocessor's power consumption is roughly proportional to its clock frequency, and to the square of its supply voltage. If you have the possibility to adjust these from software, that could save some power. Also, turning off the parts of the processor that you don't need (e.g. floating-point unit) may help, but this very much depends on your platform. In any case, you need a way to measure the actual power consumption of your processor, so that you can find out what works and what not. Just like speed optimizations, power optimizations need to be carefully profiled.

Consider using the network interfaces the least you can. You might want to gather information and send it out in bursts instead of constantly send it.

Look at what your compiler generates, particularly for hot areas of code.

If you have low priority intermittent operations, don't use specific timers to wake up to deal with them, but deal with when processing other events.
Use logic to avoid stupid scenarios where your app might go to sleep for 10 ms and then have to wake up again for the next event. For the kind of platform mentioned it shouldn't matter if both events are processed at the same time.
Having your own timer & callback mechanism might be appropriate for this kind of decision making. The trade off is in code complexity and maintenance vs. likely power savings.

Simply put, do as little as possible.

Well, to the extent that your code can execute entirely in the processor cache, you'll have less bus activity and save power. To the extent that your program is small enough to fit code+data entirely in the cache, you get that benefit "for free". OTOH, if your program is too big, and you can divide your programs into modules that are more or less independent of the other, you might get some power saving by dividing it into separate programs. (I suppose it's also possible to make a toolchain that spreas out related bundles of code and data into cache-sized chunks...)
I suppose that, theoretically, you can save some amount of unnecessary work by reducing the number of pointer dereferencing, and by refactoring your jumps so that the most likely jumps are taken first -- but that's not realistic to do as a programmer.
Transmeta had the idea of letting the machine do some instruction optimization on-the-fly to save power... But that didn't seem to help enough... And look where that got them.

Set unused memory or flash to 0xFF not 0x00. This is certainly true for flash and eeprom, not sure about s or d ram. For the proms there is an inversion so a 0 is stored as a 1 and takes more energy, a 1 is stored as a zero and takes less. This is why you read 0xFFs after erasing a block.

Rather timely this, article on Hackaday today about measuring power consumption of various commands:
Hackaday: the-effect-of-code-on-power-consumption
Aside from that:
- Interrupts are your friends
- Polling / wait() aren't your friends
- Do as little as possible
- make your code as small/efficient as possible
- Turn off as many modules, pins, peripherals as possible in the micro
- Run as slowly as possible
- If the micro has settings for pin drive strengh, slew rate, etc. check them & configure them, the defaults are often full power / max speed.
- returning to the article above, go back and measure the power & see if you can drop it by altering things.

also something that is not trivial to do is reduce precision of the mathematical operations, go for the smallest dataset available and if available by your development environment pack data and aggregate operations.
knuth books could give you all the variant of specific algorithms you need to save memory or cpu, or going with reduced precision minimizing the rounding errors
also, spent some time checking for all the embedded device api - for example most symbian phones could do audio encoding via a specialized hardware

Do your work as quickly as possible, and then go to some idle state waiting for interrupts (or events) to happen. Try to make the code run out of cache with as little external memory traffic as possible.

On Linux, install powertop to see how often which piece of software wakes up the CPU. And follow the various tips that the powertop site links to, some of which are probably applicable to non-Linux, too.
http://www.lesswatts.org/projects/powertop/

Choose efficient algorithms that are quick and have small basic blocks and minimal memory accesses.
Understand the cache size and functional units of your processor.
Don't access memory. Don't use objects or garbage collection or any other high level constructs if they expands your working code or data set outside the available cache. If you know the cache size and associativity, lay out the entire working data set you will need in low power mode and fit it all into the dcache (forget some of the "proper" coding practices that scatter the data around in separate objects or data structures if that causes cache trashing). Same with all the subroutines. Put your working code set all in one module if necessary to stripe it all in the icache. If the processor has multiple levels of cache, try to fit in the lowest level of instruction or data cache possible. Don't use floating point unit or any other instructions that may power up any other optional functional units unless you can make a good case that use of these instructions significantly shortens the time that the CPU is out of sleep mode.
etc.

Don't poll, sleep
Avoid using power hungry areas of the chip when possible. For example multipliers are power hungry, if you can shift and add you can save some Joules (as long as you don't do so much shifting and adding that actually the multiplier is a win!)
If you are really serious,l get a power-aware debugger, which can correlate power usage with your source code. Like this

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio