How many layers are between my program and the hardware? - performance

I somehow have the feeling that modern systems, including runtime libraries, this exception handler and that built-in debugger, build up more and more layers between my (C++) programs and the CPU/rest of the hardware.
I'm thinking of something like this:
1 + 2 >> OS top layer >> Runtime library/helper/error handler >> a hell lot of DLL modules >> OS kernel layer >> Do you really want to run 1 + 2?-Windows popup (don't take this serious) >> OS kernel layer >> Hardware abstraction >> Hardware >> Go through at least 100 miles of circuits >> Eventually arrive at the CPU >> ADD 1, 2 >> Go all the way back to my program
Nearly all the technical details there are simply wrong and in some random order, but you get my point, right?
How much longer/shorter is this chain when I run a C++ program that calculates 1 + 2 at runtime on Windows?
How about when I do this in an interpreter? (Python|Ruby|PHP)
Is this chain really as dramatic in reality? Does Windows really try "not to stand in the way", i.e. is there a direct connection between my binary and the hardware?

"1 + 2" in C++ gets directly translated in an add assembly instruction that is executed directly on the CPU. All the "layers" you refer to really only come into play when you start calling library functions. For example a simple printf("Hello World\n"); would go through a number of layers (using Windows as an example, different OSes would be different):
CRT - the C runtime implements things like %d replacements and creates a single string, then it calls WriteFile in kernel32
kernel32.dll implements WriteFile, notices that the handle is a console and directs the call to the console system
the string is sent to the conhost.exe process (on Windows 7, csrss.exe on earlier versions) which actually hosts the console window
conhost.exe adds the string to an internal buffer that represents the contents of the console window and invalidates the console window
The Window Manager notices that the console window is now invalid and sends it a WM_PAINT message
In response to the WM_PAINT, the console window (inside conhost.exe still) makes a series of DrawString calls inside GDI32.dll (or maybe GDI+?)
The DrawString method loops through each character in the string and:
Looks up the glyph definition in the font file to get the outline of the glyph
Checks its cache for a rendered version of that glyph at the current font size
If the glyph isn't in the cache, it rasterizes the outline at the current font size and caches the result for later
Copies the pixels from the rasterized glyph to the graphics buffer you specified, pixel-by-pixel
Once all the DrawString calls are complete, the final image for the window is sent to the DWM where it's loaded into the graphics memory of your graphics card, and replaces the old window
When the next frame is drawn, the graphics card now uses the new image to render the console window and your new string is there
Now there's a lot of layers that I've simplified (e.g. the way the graphics card renders stuff is a whole 'nother layer of abstractions). I may have made some errors (I don't know the internals of how Windows is implemented, obviously) but it should give you an idea hopefully.
The important point, though, is that each step along the way adds some kind of value to the system.
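To make the first point concrete, here is a minimal sketch (my own example, nothing from the question) that you can feed to a compiler and inspect; with g++, the -S flag writes out the generated assembly. The exact output depends on compiler and flags, but the addition itself never touches any of the layers listed above.
// add.cpp - inspect the assembly with: g++ -O2 -S add.cpp (produces add.s)
// With literal constants the compiler usually folds 1 + 2 to 3 at compile time;
// with runtime values it emits a single add (or lea) instruction. No OS involved.
int add(int a, int b) {
    return a + b;
}

int main() {
    return add(1, 2);
}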

As codeka said, there's a lot that goes on when you call a library function, but what you need to remember is that printing a string or displaying a JPEG or whatever is a very complicated task. Even more so when the method used must work for everyone in every situation, covering hundreds of edge cases.
What this really means is that when you're writing serious number-crunching, chess-playing, weather-predicting code, don't call library functions in your hot loops. Instead use only cheap operations which can and will be executed by the CPU directly. Additionally, planning where your expensive calls happen can make a huge difference (print everything at the end, not each time through the loop), as the sketch below illustrates.
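A rough sketch of that last suggestion (a made-up example, not code from the question): keep the loop free of library calls and pay for the output once, at the end.
#include <cstdio>

int main() {
    long long sum = 0;
    for (int i = 0; i < 1000000; ++i) {
        sum += i;                  // cheap work the CPU executes directly
        // printf("%lld\n", sum);  // would drag every iteration through the library/OS layers
    }
    printf("%lld\n", sum);         // one trip through those layers, at the very end
    return 0;
}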

It doesn't matter how many levels of abstraction there are, as long as the hard work is done in the most efficient way.
In a general sense you only suffer from "emulating" your lowest level, e.g. you suffer when emulating a 68K CPU on an x86 CPU to run some poorly implemented app, but even then it won't perform worse than the original hardware would; otherwise you would not emulate it in the first place. E.g. today most user-interface logic is implemented using high-level dynamic scripting languages, because it's more productive, while the hardcore stuff is handled by optimized low-level code.
When it comes to performance, it's always the hard work that hits the wall first; the stuff in between rarely suffers from performance issues. E.g. a key handler that processes 2-3 key presses a second can waste a fortune in badly written code without affecting the end-user experience, while the motion estimator in an MPEG encoder can fail utterly just by being implemented in software instead of dedicated hardware.

Related

Popup window in Turbo Pascal 7

In Turbo Pascal 7 for DOS you can use the Crt unit to define a window. If you define a second window on top of the first one, like a popup, I don’t see a way to get rid of the second one except for redrawing the first one on top again.
Is there a window close technique I’m overlooking?
I’m considering keeping an array of screens in memory to make it work, but the TP IDE does popups like I want to do, so maybe it’s easy and I’m just looking in the wrong place?
I don't think there's a window-closing technique you're missing, if you mean one provided by the CRT unit.
The library Borland used for the TP7 IDE was called Turbo Vision (see https://en.wikipedia.org/wiki/Turbo_Vision), and it was eventually released to the public domain. Well before that, though, a number of 3rd-party screen handling/windowing libraries had become available, and these were much more powerful than what could be achieved with the CRT unit. Probably the best known was TurboPower Software's Object Professional (aka OPro).
Afaik, these libraries (and, fairly obviously, Turbo Vision) were all based on an in-memory representation of a framed window which could be rapidly copied in and out of the PC's video memory and, as in Windows with a capital W, they were treated as a stack with a z-order. So the process of closing/erasing the top-level window was one of getting the window(s) that it had been covering to redraw itself/themselves (the idea is sketched in the code after this answer). Otoh, CRT had basically evolved from very primitive origins similar to, if not based on, the old DEC VT100 display protocol and wasn't really up to the job of supporting independent, stackable window objects.
Although you may still be able to track down the PD release of Turbo Vision, it never really caught on as a library for developers. In an ideal world, a better place to start would be with OPro. It was apparently on SourceForge for a while, but seems to have been taken down sometime since about 2007, and these days, even if you could get hold of a copy, there is a bit of a question mark over licensing. However ...
There was also a very popular freeware library available for TP by the name of the "Technojock's toolkit", which had a large functionality overlap (including screen handling) with OPro, and it is still available on github - see https://github.com/lallousx86/TurboPascal/tree/master/TotLib/TOTSRC11. Unlike OPro, I never used TechnoJocks myself, but devotees swore by it. Take a look.
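For what it's worth, the save/restore trick those libraries use is language-agnostic, so here is a rough sketch of the idea in C++ rather than Pascal (every name here is invented for illustration; a real TP7 implementation would work against the 80x25 text-mode video memory instead of an in-process array):
#include <cstring>

// A text-mode screen cell: character plus colour attribute.
struct Cell { char ch; unsigned char attr; };

const int COLS = 80, ROWS = 25;
Cell screen[ROWS][COLS];          // stand-in for the real video memory

// Everything the popup covered, so it can be put back when the popup "closes".
struct SavedRegion {
    int x, y, w, h;
    Cell cells[ROWS][COLS];       // oversized but simple
};

void saveRegion(SavedRegion& s, int x, int y, int w, int h) {
    s.x = x; s.y = y; s.w = w; s.h = h;
    for (int r = 0; r < h; ++r)
        std::memcpy(s.cells[r], &screen[y + r][x], w * sizeof(Cell));
}

void restoreRegion(const SavedRegion& s) {
    for (int r = 0; r < s.h; ++r)
        std::memcpy(&screen[s.y + r][s.x], s.cells[r], s.w * sizeof(Cell));
}

int main() {
    SavedRegion under;
    saveRegion(under, 10, 5, 40, 10);   // before drawing a 40x10 popup at (10,5)
    // ... draw the popup into the covered cells and interact with it ...
    restoreRegion(under);               // "close": the covered content reappears
}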

If both Mac OS and Windows use the x86 instruction set, why do we have to recompile for each platform?

If both Mac OS and Windows, running on Intel processors, use the x86 instruction set, why can't a program written using only C++11 (no OS-specific libraries, frameworks or APIs) run on both without having to recompile for that platform?
Ultimately the program gets compiled to machine code, so if the instruction set is the same, what's the difference? What's really going on?
EDIT: I'm really just talking about a simple "Hello world" program compiled with something like gcc. Not Apps!
EDIT: For example:
#include <iostream>
using namespace std;

int main()
{
    cout << "Hello World!";
    return 0;
}
EDIT: An even simpler program:
int main() {
    int j = 2;
    j = j + 3;
}
Because a "program" nowadays consists of more than just a blob of binary code, and the platforms' executable file formats are not cross-compatible (PE/COFF vs. ELF vs. Mach-O). It's kind of silly when you think about it, yes, but that's the reality. It wouldn't have to be this way if you could start history over again.
Edit:
You may also want to see my longer answer on SoftwareEngineering.StackExchange (and others').
Even "Hello, world" needs to generate output. That will either be OS calls, BIOS calls at a somewhat lower level, or, as was common in DOS days for performance reasons, direct output to video via I/O calls or memory mapped video. Any of those methods will be highly specific to the operating system and other design issues. In the example you've listed, iostream hides those details and will be different for each target system.
One reason is provided by @Mehrdad in their answer: even if the assembly code is the same on all platforms, the way it's "wrapped" into an executable file may differ. Back in the day, there were COM files in MS-DOS. You could load such a file into memory and then just start executing it from the very beginning.
Eventually we got read-only memory pages, .bss, non-executable read-write memory pages (non-executable for safety reasons), embedded resources (like icons on Windows), and other stuff which the OS should know about before running the code, in order to properly configure the isolated environment for the newly created process. Of course, there are also shared libraries (which have to be loaded by the OS), and any program which does anything meaningful has to output some result via an OS call, i.e. it has to know how to perform system calls.
So it turns out that in modern multi-process OSes, executable files have to contain a lot of meta-information in addition to the code. That's why we have executable file formats, and they are different on different platforms mainly for historical reasons. Think of it as PNG vs. JPEG: both are compressed raster image formats, but they're incompatible and use different compression algorithms and different storage layouts.
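To make the "same CPU, different container" point tangible, here is a small illustration of my own (the magic values are the well-known ones: ELF files begin with 0x7F 'E' 'L' 'F', PE files with the DOS "MZ" stub, 64-bit Mach-O files with 0xFEEDFACF) that peeks at the first bytes of a file to guess which platform's executable format it uses:
#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { std::printf("usage: %s <file>\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    unsigned char b[4] = {0, 0, 0, 0};
    std::fread(b, 1, 4, f);
    std::fclose(f);

    // Assemble the first four bytes as a little-endian 32-bit value for the Mach-O check.
    std::uint32_t m = b[0] | (b[1] << 8) | (b[2] << 16) | (std::uint32_t(b[3]) << 24);

    if (b[0] == 0x7F && b[1] == 'E' && b[2] == 'L' && b[3] == 'F')
        std::printf("ELF (Linux and friends)\n");
    else if (b[0] == 'M' && b[1] == 'Z')
        std::printf("PE/COFF (Windows; starts with the DOS 'MZ' stub)\n");
    else if (m == 0xFEEDFACFu || m == 0xCFFAEDFEu)
        std::printf("Mach-O 64-bit (macOS)\n");
    else
        std::printf("something else\n");
    return 0;
}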
no OS-specific libraries, frameworks or APIs
That's not true. As we live in a multi-process OS, no process has any kind of direct access to the hardware, be it the network card or the display. In general, it can only access the CPU and memory (and memory only in a very limited way).
E.g. when you run your program in a terminal, its output has to get to the terminal emulator, so it can be displayed in a window which you can drag across the screen, all transparently to your "Hello World". So the OS gets involved anyway.
Even your "hello world" application has to:
Load the dynamic C++ runtime, which will initialize the cout object before your main starts. Who else would initialize cout and call destructors after main ends?
When you try to print something, your C++ runtime will eventually have to make a call to the OS. Nowadays that's typically abstracted away in the C standard library (libc), which has to be loaded dynamically even before the C++ runtime.
That C standard library invokes some x86 instructions which make the system call that "prints" the string on the screen. Note that different OSes and different CPUs (even within the x86 family) have different mechanisms and conventions for system calls. Some use interrupts, some use specially designed sysenter/syscall instructions (hello from Intel and AMD), some pass arguments in known memory locations, some pass them via registers. Again, that's why this code is abstracted away by the OS's standard library - it typically provides a simple C interface that performs the necessary assembly-level magic (see the sketch at the end of this answer).
All in all, answering your question: because your program has to interact with the OS, and different OSes use completely different mechanisms for that.
If your program has no side effects (like your second example), it is still stored in the platform's "general" executable format. And, as those "general" formats differ between platforms, you still have to recompile. It's just not worth inventing a common, compatible format only for simple programs with no side effects, as such programs are useless.
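As a hedged illustration of the system-call point above: on Linux you can see the OS-specific part quite directly by going through libc's syscall() wrapper yourself. This sketch is Linux-only by construction; the same program on Windows would have to go through WriteFile and a completely different kernel interface, which is exactly why the binary can't simply be reused.
#include <sys/syscall.h>   // SYS_write (Linux-specific syscall number)
#include <unistd.h>        // syscall()

int main() {
    const char msg[] = "Hello World!\n";
    // fd 1 = stdout; the kernel, not the C++ runtime, does the actual "printing".
    syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    return 0;
}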

WebGL - When to call gl.Flush?

I just noticed today that this method, flush(), is available.
Not able to find detailed documentation on it.
What exactly does this do?
Is this required?
gl.flush in WebGL does have its uses, but they're driver and browser specific.
For example, because Chrome's GPU architecture is multi-process, you can do this:
var loadShader = function(gl, shaderSource, shaderType) {
  var shader = gl.createShader(shaderType);
  gl.shaderSource(shader, shaderSource);
  gl.compileShader(shader);
  return shader;
};
var vs = loadShader(gl, someVertexShaderSource, gl.VERTEX_SHADER);
var fs = loadShader(gl, someFragmentShaderSource, gl.FRAGMENT_SHADER);
var p = gl.createProgram();
gl.attachShader(p, vs);
gl.attachShader(p, fs);
gl.linkProgram(p);
At this point all of the commands might be sitting in the command queue with nothing executing them yet. So, issue a flush:
gl.flush();
Now, because we know that compiling and linking programs is slow (depending on how large and complex they are), we can wait a while before trying to use them and do other stuff:
setTimeout(continueLater, 1000); // continue 1 second later
Now do other things like set up the page or UI or something. One second later continueLater will get called, and by then our shaders have likely finished compiling and linking:
function continueLater() {
  // check results, get locations, etc.
  if (!gl.getShaderParameter(vs, gl.COMPILE_STATUS) ||
      !gl.getShaderParameter(fs, gl.COMPILE_STATUS) ||
      !gl.getProgramParameter(p, gl.LINK_STATUS)) {
    alert("shaders didn't compile or program didn't link");
    ...etc...
  }
  var someLoc = gl.getUniformLocation(p, "u_someUniform");
  ...etc...
}
I believe Google Maps uses this technique as they have to compile many very complex shaders and they'd like the page to stay responsive. If they called gl.compileShader or gl.linkProgram and immediately called one of the query functions like gl.getShaderParameter or gl.getProgramParameter or gl.getUniformLocation the program would freeze while the shader is first validated and then sent to the driver to be compiled. By not doing the querying immediately but waiting a moment they can avoid that pause in the UX.
Unfortunately this only works for Chrome AFAIK because other browsers are not multi-process and I believe all drivers compile/link synchronously.
There may be other reasons to call gl.flush, but again it's very driver/OS/browser specific. As an example, let's say you were going to draw 1000 objects and doing that took 5000 WebGL calls. It would likely require more than that, but just to have a number let's pick 5000: 4 calls to gl.uniformXXX and 1 call to gl.drawXXX per object.
It's possible all 5000 of those calls fit in the browser's (Chrome's) or driver's command buffer. Without a flush they won't start executing until the browser issues a gl.flush for you (which it does so it can composite your results on the screen). That means the GPU might be sitting idle while you issue 1000, then 2000, then 3000, etc. commands, since they're just sitting in a buffer. gl.flush tells the system "Hey, those commands I added, please make sure to start executing them". So you might decide to call gl.flush after each 1000 commands.
The problem, though, is that gl.flush is not free; otherwise you'd call it after every command to make sure it executes as soon as possible. On top of that, each driver/browser works in different ways. On some drivers, calling gl.flush every few hundred or thousand WebGL calls might be a win; on others it might be a waste of time.
Sorry, that was probably too much info :p
Assuming it's semantically equivalent to the classic GL glFlush then no, it will almost never be required. OpenGL is an asynchronous API — you queue up work to be done and it is done when it can be. glFlush is still asynchronous but forces any accumulated buffers to be emptied as quickly as they can be, however long that may take; it basically says to the driver "if you were planning to hold anything back for any reason, please don't".
It's usually done only for a platform-specific reason related to the coupling of OpenGL and the other display mechanisms on that system. For example, one platform might need all GL work to be ordered not to queue before the container that it will be drawn into can be moved to the screen. Another might allow piping of resources from a background thread into the main OpenGL context but not guarantee that they're necessarily available for subsequent calls until you've flushed (e.g. if multithreading ends up creating two separate queues where there might otherwise be one then flush might insert a synchronisation barrier across both).
Any platform with a double buffer or with back buffering as per the WebGL model will automatically ensure that buffers proceed in a timely manner. Queueing is to aid performance but should have no negative observable consequences. So you don't have to do anything manually.
If you decline to flush where you semantically perhaps should but don't strictly need to, and your graphics are predicated on real-time display anyway, then you're probably going to suffer at worst a fraction of a second of latency.

Is it possible to create a virtual IOHIDDevice from userspace?

I have an HID device that is somewhat unfortunately designed (the Griffin Powermate), in that as you turn it, the input value for the "Rotation Axis" HID element doesn't change unless the speed of rotation changes dramatically or the direction changes. It sends many HID reports (angular resolution appears to be about 4deg, in that I get ~90 reports per revolution - not great, but whatever...), but they all report the same value (generally -1 or 1 for CCW and CW respectively -- if you turn faster, it will report -2 and 2, and so on, but you have to turn much faster). As a result of this unfortunate behavior, I'm finding this thing largely useless.
It occurred to me that I might be able to write a background userspace app that seized the physical device and presented another, virtual device with some minor additions so as to cause an input value change for every report (like a wrap-around accumulator, which the HID spec has support for -- God only knows why Griffin didn't do this themselves.)
But I'm not seeing how one would go about creating the kernel side object for the virtual device from userspace, and I'm starting to think it might not be possible. I saw this question, and its indications are not good, but it's low on details.
Alternately, if there's a way for me to spoof reports on the existing device, I suppose that would do it as well, since I could set it back to zero immediately after it reports -1 or 1.
Any ideas?
First of all, you can simulate input events via Quartz Event Services but this might not suffice for your purposes, as that's mainly designed for simulating keyboard and mouse events.
Second, the HID driver family of the IOKit framework contains a user client on the (global) IOHIDResource service, called IOHIDResourceDeviceUserClient. It appears that this can spawn IOHIDUserDevice instances on command from user space. In particular, the userspace IOKitLib contains an IOHIDUserDeviceCreate function which is apparently supposed to be able to do this. The HID family source code even comes with a little demo of this which creates a virtual keyboard of sorts. Unfortunately, although I can get this to build, it fails on the IOHIDUserDeviceCreate call. (I can see in IORegistryExplorer that the IOHIDResourceDeviceUserClient instance is never created.) I've not investigated this further due to lack of time, but it seems worth pursuing if you need its functionality.

Speeding up text output on Windows, for a console

We have an application that has one or more text console windows that all essentially represent serial ports (text input and output, character by character). These windows have turned into a major performance problem in the way they are currently coded... we manage to spend a very significant chunk of time in them.
The current code is structured by having each window live its own little life, with the main application thread driving it via "SendMessage()" calls. This message passing seems to be the cause of incredible overhead; basically, taking a detour through the OS feels like the wrong thing to do.
Note that we do draw text lines as a whole where appropriate, so that easy optimization is already done.
I am not an expert in Windows coding, so I need to ask the community if there is some architecture other than sending messages like this to drive the display of text in a window? It seems pretty heavyweight.
Note that this is in C++ or plain C, as the main application is a portable C/C++/some other languages program that also runs on Linux and Solaris.
We did some more investigation; it seems that half of the overhead is preparing and sending each message using SendMessage, and the other half is the actual screen drawing. The SendMessage is done between functions in the same file...
So I guess all the advice given below is correct:
Look for how much things are redrawn
Draw things directly
Chunk drawing operations in time, so as not to send every character to the screen individually, aiming for a 10 to 20 Hz update rate for the serial console.
Can you accept ALL answers?
I agree with Will Dean that the drawing in a console window or a text box is a performance bottleneck by itself. You first need to be sure that this isn't your problem. You say that you draw each line as a whole, but even this could be a problem, if the data throughput is too high.
I recommend that you don't use SendMessage to pass data from the main application to the text window. Instead, use some other means of communication. Are these in the same process? If not, you could use shared memory. Even a file on disk could do in some circumstances. Have the main application write to this file and the text console read from it. You could still send a SendMessage notification to the text console to tell it to update the view, but do not send the message whenever a new line arrives; define a minimum interval between two subsequent updates, along the lines of the sketch below.
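Here is a rough sketch of that coalescing idea (all names are invented for illustration; postConsoleUpdate() stands in for whatever notification you end up using, be it a single SendMessage, an observer callback, or a shared-memory flag):
#include <chrono>
#include <cstdio>
#include <string>

class ConsoleBuffer {
public:
    explicit ConsoleBuffer(std::chrono::milliseconds interval) : interval_(interval) {}

    // Called for every character coming from the "serial port".
    void append(char c) {
        pending_ += c;
        auto now = std::chrono::steady_clock::now();
        if (now - lastFlush_ >= interval_)
            flush(now);
    }

    // Make sure whatever is left gets displayed (e.g. when the stream goes idle).
    void flushNow() { flush(std::chrono::steady_clock::now()); }

private:
    void flush(std::chrono::steady_clock::time_point now) {
        if (pending_.empty()) return;
        postConsoleUpdate(pending_);   // one notification for the whole chunk
        pending_.clear();
        lastFlush_ = now;
    }

    // Stand-in for the real notification/redraw path.
    void postConsoleUpdate(const std::string& chunk) {
        std::fwrite(chunk.data(), 1, chunk.size(), stdout);
    }

    std::string pending_;
    std::chrono::milliseconds interval_;
    std::chrono::steady_clock::time_point lastFlush_ = std::chrono::steady_clock::now();
};

int main() {
    ConsoleBuffer buf(std::chrono::milliseconds(100));   // roughly 10 Hz updates
    for (char c : std::string("simulated serial traffic\n"))
        buf.append(c);
    buf.flushNow();
}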
You should try profiling properly, but in lieu of that I would stop worrying about the SendMessage, which is almost certainly not your problem, and think about the redrawing of the window itself.
You describe these as 'text console windows', but then say you have multiple of them - are they actually Windows Consoles? Or are they something your application is drawing?
If the latter, then I would be looking at measuring my paint code, and whether I'm invalidating too much of a window on each update.
Are the output windows part of the same application? It almost sounds like they aren't...
If they are, you should look into the Observer design pattern to get away from SendMessage(). I've used it for the same type of use case, and it worked beautifully for me.
If you can't make a change like that, perhaps you could buffer your output for something like 100ms so that you don't have so many out-going messages per second, but it should also update at a comfortable rate.
Are the output windows part of the same application? It almost sounds like they aren't...
Yes they are, all in the same process.
I did not write this code... but it seems like SendMessage is a bit heavy for this all-in-one-application case.
You describe these as 'text console windows', but then say you have multiple of them - are they actually Windows Consoles? Or are they something your application is drawing?
Our app is drawing them; they are not regular Windows consoles.
Note that we also need to get data back when a user types into the console, as we quite often have interactive serial sessions. Think of it as very similar to what you would see in a serial terminal program -- but using an external application is obviously even more expensive than what we have now.
If you can't make a change like that, perhaps you could buffer your output for something like 100ms so that you don't have so many out-going messages per second, but it should also update at a comfortable rate.
Good point. Right now, every single character output causes a message to be sent.
And when we scroll the window up when a newline comes in, we redraw it line by line.
Note that we also have a scrollback buffer of arbitrary size, but scrolling back is an interactive case with much lower performance requirements.
