CPU usage is too high with cmatrix on OS X (macOS)

I installed cmatrix with brew on my Mac (OS X 10.11.1).
CPU usage goes to 190% when I run cmatrix with the terminal in full screen.

It doesn't seem that you've actually asked a question, but to answer the implicit question “why is the CPU usage so high?” the first step would be for you to define “too high”.
cmatrix drives a very busy animation as fast as it can, which necessarily consumes CPU. If you profile Terminal you'll find that it spends about 100% of one CPU processing the output from cmatrix, and approximately another 100% rendering the display. Since cmatrix is designed to make the terminal work, by repainting the entire display every 1/30th of a second, it is unsurprising that it keeps Terminal busy most of the time. Terminal is actually showing its mettle by splitting the work across two CPUs so that it can run at a higher frame rate.

Related

X11 MIT-SHM performance of XShmPutImage

I just added the option to use MIT-SHM to a big application. The reason was that the Xorg process was consuming 15% CPU (more than my own process).
This is for just two windows, each 960×540, at 30 fps (using X11 locally, on Linux).
It works, in that I can see the rendering is tear-free (for which I previously had to use X11-DBE).
But the Xorg process still uses 15% CPU.
What I expected was that using XShmPutImage would let the Xorg process get by with near-0% CPU usage.
When I disable the call to XShmPutImage, Xorg uses near 0%. This suggests it is not some other interaction with X11 (such as event handling) that is causing the 15% CPU.
So XShmPutImage is causing a lot of work in Xorg. Can this be reduced? Can MIT-SHM be combined with X11-DBE, and would that help?
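For scale, the raw pixel traffic those two windows generate can be estimated with a back-of-envelope calculation (sketched in Python; the 4-bytes-per-pixel figure is an assumption, since the question doesn't state the visual's depth):

```python
# Back-of-envelope pixel bandwidth for the XShmPutImage workload.
# Assumes 32-bit (4-byte) pixels -- an assumption, not stated in the question.
WIDTH, HEIGHT = 960, 540
BYTES_PER_PIXEL = 4
FPS = 30
WINDOWS = 2

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL   # one window, one frame
bytes_per_second = bytes_per_frame * FPS * WINDOWS   # both windows, sustained

print(f"{bytes_per_frame / 1e6:.2f} MB per frame")
print(f"{bytes_per_second / 1e6:.1f} MB/s total")
```

That works out to roughly 124 MB/s, which is modest for shared-memory copies; it suggests the 15% is more likely spent in compositing or format conversion inside Xorg than in the copy itself.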

How to prevent Windows GPU "Timeout Detection and Recovery"?

If I run a long-running kernel on a GPU device, after 2 seconds (by default) Windows TDR (Timeout Detection and Recovery) will kill the running kernels. I understand why, but what if you can't predict how long the kernel will run, because you need to do lots of computation and you don't know the capacity/speed of the GPU belonging to the actual user who runs your program?
What are the best practices for solving this problem?
I found 3 ways to prevent this from happening, but none of them seems like a good solution to me:
You need to make sure that your kernels are not too time-consuming:
The kernel is time-consuming, and though I could split the work, running not 1 million work-items but 2×500k or 4×250k, I still can't predict whether it will fit into the default 2 seconds on the actual user's GPU. (I had the idea of halving the count until the kernel no longer triggers a CL_INVALID_COMMAND_QUEUE error, and then just calling it multiple times with the smaller amount, but to be honest that sounds really hacky and has some other drawbacks.)
You can turn off the watchdog timer (or increase the delay): Timeout Detection and Recovery of GPUs:
It's done by a registry edit, and you need to restart Windows for it to take effect. You can't do that on a user's machine.
You can run the kernel on a GPU that is not hooked up to a display:
How can you make sure the GPU is not hooked up to a display on a user's machine? Even on my laptop the primary GPU is the Intel HD4000 and the NVidia GPU is not driving the display (I think), but TDR still kills my kernels.
You listed all of the solutions I know of. Since solution 2 leaves the machine in an unusable state while your kernel runs (not a good practice), it should be avoided. Since adding another GPU (solution 3) is not practical for you, your best bet is to focus on solution 1. I don't know why you are trying to maximize the work size to run as long as possible without triggering TDR. You should instead target around 10 ms or less (if you run many kernels that take longer, the GUI becomes very sluggish). So instead of 4×250,000, think more like 400×2,500. You may need to put in some clFinish calls between each one (or each batch of 10, or whatever). Keeping the execution time small (10 ms) and not overfilling the queue will let the GPU do other things in between kernels; you won't get TDR resets, you won't make the machine unusable, and yet the GPU will be quite busy.
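The batching arithmetic behind "400×2,500 instead of 4×250,000" can be sketched in a few lines (pure Python, no OpenCL calls; the item counts and the flush interval are illustrative, not taken from any real device):

```python
import math

def plan_batches(total_items, items_per_launch, flush_every=10):
    """Split a big job into many short kernel launches, flushing periodically.

    Returns the per-launch sizes and the launch indices after which a
    clFinish-style flush would be issued (illustrative scheduling only).
    """
    n_launches = math.ceil(total_items / items_per_launch)
    sizes = [items_per_launch] * (n_launches - 1)
    sizes.append(total_items - items_per_launch * (n_launches - 1))  # remainder
    flush_points = list(range(flush_every - 1, n_launches, flush_every))
    return sizes, flush_points

sizes, flushes = plan_batches(1_000_000, 2_500)
print(len(sizes))  # 400 short launches instead of 4 long ones
print(sum(sizes))  # all 1,000,000 work-items still covered
```

Each launch then stays well under the watchdog threshold, and the flush points keep the queue from filling faster than the GPU can drain it.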

If a CPU is always executing instructions, how do we measure its work?

Let us say we have a fictitious single-core CPU with a program counter and a basic instruction set such as Load, Store, Compare, Branch, Add, Mul, plus some ROM and RAM. Upon switching on, it executes a program from ROM.
Would it be fair to say that the work the CPU does depends on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors switching than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question is what it means for the program to "stop". Does it usually just branch into an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, the idea that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell the difference, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's utilization, i.e. how busy it is.
So in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smarter OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because the clock signal is gated off from the idled circuitry. Those cycles are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and begin executing instructions before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don’t want to include instructions whose execution never actually completes, for example because the processor guessed wrong and had to flush them after a branch mispredict. So a better metric is counting only those instructions that actually complete. Instructions that complete are termed “retired”.
So we should count only those instructions that complete (i.e. retire), and only those cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI, or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE counts cycles used to execute actual instructions (vs. those “wasted” in an idle state). INST_RETIRED.ANY counts instructions that complete (vs. those that don’t, due to something like a branch mispredict).
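As a worked example of the metric, here is the division spelled out (the counter readings below are made up for illustration, not from a real profiling run):

```python
def cpi(unhalted_core_cycles, retired_instructions):
    """Cycles-per-instruction: CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY."""
    return unhalted_core_cycles / retired_instructions

# Hypothetical counter readings from a profiling run (illustrative only):
print(cpi(unhalted_core_cycles=1_200_000, retired_instructions=800_000))  # 1.5
```

Lower is better: a CPI of 1.5 means it took 1.5 unhalted cycles on average to retire each instruction, while a superscalar core retiring multiple instructions per cycle can push CPI below 1.0.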
Trying to get a more specific metric, such as counting only the instructions that contribute to the solution of a matrix multiply while excluding those that don’t directly contribute (such as control instructions), is very subjective and difficult to gather statistics on. (There are some you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, the average number of vector elements active per SIMD instruction (such as SSE or AVX) executed. Such instructions are more likely to contribute directly to a mathematical solution, as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your friendly local Intel developer site, software.intel.com. In particular, check out how to use VTune effectively. I’m not suggesting you need to buy VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program’s performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
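That bookkeeping can be sketched directly: record each run interval per process, then compute load as the fraction of time not spent in the idle task (the process names and timestamps below are arbitrary illustrative units, not from a real scheduler):

```python
from collections import defaultdict

def cpu_load(run_intervals):
    """run_intervals: list of (process_name, start, end) for one CPU.

    Returns (time accumulated per process, busy fraction),
    treating the "idle" task specially, as described above.
    """
    time_per_proc = defaultdict(float)
    for name, start, end in run_intervals:
        time_per_proc[name] += end - start
    total = sum(time_per_proc.values())
    busy = total - time_per_proc.get("idle", 0.0)
    return dict(time_per_proc), busy / total

intervals = [
    ("editor", 0, 3), ("idle", 3, 5), ("compiler", 5, 9), ("idle", 9, 10),
]
times, load = cpu_load(intervals)
print(times)  # {'editor': 3.0, 'idle': 3.0, 'compiler': 4.0}
print(load)   # 0.7 -> the CPU spent 70% of its time on useful work
```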
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to issue a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If however a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply never scheduling it again).

Possible to run OpenCL program at low priority (be "nice")?

I have an OpenCL Windows program that does heavy number crunching and happily consumes 100% of the GPU. I'd like to be able to run it in the background while using the computer normally, but right now it causes considerable desktop lag and makes any 3d application unusable.
Is there a way to set a priority in OpenCL so that it will yield GPU power to other processes and only use spare cycles?
Unfortunately most GPUs do not support running several tasks at a time, so there is no way to assign priority. This means that when your OpenCL kernel is running, it is the only task being executed by the GPU, and that will remain the case until the kernel completes.
If you want the computer to be usable while the kernel runs (normal desktop activity, browsing, videos, games), each kernel launch has to be very quick. So if you can reduce the time taken by each job enqueued with clEnqueueNDRangeKernel, you might get what you're looking for. This can be achieved either by making the NDRange smaller (though it needs to stay big enough to be efficient on the GPU; something like 5120 work-items is the minimum I've found on a Radeon HD 5870) or by reducing the amount of work each kernel does.
If you can get the execution time of each enqueued job down to about 1/60 of a second, there's a good chance the computer will be usable. I've been able to run OpenCL programs where each enqueue takes about 1/120 of a second while gaming, without noticing anything.
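If you can measure your kernel's throughput, the per-enqueue size for a target frame budget follows from simple arithmetic, which can be sketched like this (the throughput figure is hypothetical; the 5120-item floor is the Radeon HD 5870 number mentioned above):

```python
def chunk_size(items_per_second, budget_seconds=1 / 60, minimum=5120):
    """Largest NDRange per enqueue expected to finish within the budget,
    clamped to a device-efficiency floor (5120 per the Radeon HD 5870 note).
    """
    return max(minimum, int(items_per_second * budget_seconds))

# Hypothetical measurement: the kernel processes 2,000,000 work-items/second.
print(chunk_size(2_000_000))  # 33333 work-items per enqueue (~1/60 s each)
```

A slower device would fall back to the 5120 floor, at which point the only remaining lever is reducing the work each kernel does.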

Qt 4.6.x under MacOS/X: widget update performance mystery

I'm working on a Qt-based MacOS/X audio metering application, which contains audio-metering widgets (potentially a lot of them), each of which is supposed to be updated every 50ms (i.e. at 20Hz).
The program works, but when lots of meters are being updated at once, it uses up lots of CPU time and can bog down (spinny-color-wheel, oh no!).
The strange thing is this: Originally this app would just call update() on the meter widget whenever the meter value changed, and therefore the entire meter-widget would be redrawn every 50ms. However, I thought I'd be clever and compute just the area of the meter that actually needs to be redrawn, and only redraw that portion of the widget (e.g. update(x,y,w,h), where y and h are computed based on the old and new values of the meter). However, when I implemented that, it actually made CPU usage four times higher(!)... even though the app was drawing 50% fewer pixels per second.
Can anyone explain why this optimization actually turns out to be a pessimization? I've posted a trivial example application that demonstrates the effect, here:
http://www.lcscanada.com/jaf/meter_test.zip
When I compile (qmake;make) the above app and run it like this:
$ ./meter.app/Contents/MacOS/meter 72
Meter: Using numMeters=72 (partial updates ENABLED)
... top shows the process using ~50% CPU.
When I disable the clever-partial-updates logic, by running it like this:
$ ./meter.app/Contents/MacOS/meter 72 disable_partial_updates
Meter: Using numMeters=72 (partial updates DISABLED)
... top shows the process using only ~12% CPU. Huh? Shouldn't this case take more CPU, not less?
I tried profiling the app using Shark, but the results didn't mean much to me. FWIW, I'm running Snow Leopard on an 8-core Xeon Mac Pro.
GPU drawing is a lot faster than having the CPU calculate which part to redraw (at least this holds for OpenGL; the OpenGL SuperBible states that OpenGL is built to redraw everything rather than draw deltas, since computing deltas is potentially a lot more work). Even if you use software rendering, the libraries are highly optimized to do their job properly and fast. So simply redrawing everything is state of the art.
FWIW, top on my Linux box shows ~10-11% without partial updates and ~12% with partial updates. I had to request 400 meters to get that much CPU usage, though.
Perhaps it's just that the overhead of Qt setting up a paint region actually dwarfs your paint time? After all, your painting is really simple: just two rectangular fills.
