in OS X, what is the BASE graphics drawing layer? - macos

I am beginning GUI development in OSX, and I am wondering, what is the VERY BASE layer in the system for which to draw graphics? It seems as if there are so many upper level abstractions (AppKit, OpenGL, CG, etc), which are nice and timesaving, but for me unusable until I understand the base layer (unless it's binary or assembly, in which case I throw in the towel).

I am beginning GUI development in OSX, and I am wondering, what is the VERY BASE layer in the system for which to draw graphics?
Believe it or not, ever since Mac OS X Tiger the whole graphics stack is based on OpenGL. Below OpenGL there is only the GPU driver and then the bare metal.
It seems as if there are so many upper level abstractions (AppKit, OpenGL, CG, etc), which are nice and timesaving, but for me unusable until I understand the base layer (unless it's binary or assembly, in which case I throw in the towel).
Why are they unusable for you? What do you expect to gain from the added knowledge? The lower the level you're using, the more intimately you must understand how it works to make efficient use of it. OpenGL itself is already fairly low level. The OpenGL implementation hides some gory details from you, such as on-demand swapping of texture data between fast and regular memory, and the GLSL compiler is also rather high level. On the other hand, to use OpenGL efficiently you should deliver data in the format the GPU natively works with, shaders can be cached in their binary form, and buffer objects provide you with an API for DMA transfers.
If you were really interested in the lowest layer, then you'd have to look at the GPU design, i.e. the metal. AMD did actually publish full programming documentation on some of their GPUs (Google for OpenGPU).

You could do a lot worse than have a look at the Quartz 2D Programming Guide. It's the layer you will be using most often and understanding this will form the basis for any further investigation you do.

Related

Is there a way to determine GPU warp/wavefront/SIMD width on Android?

My question is similar to the question "OpenCL - How to I query for a device's SIMD width?", but I'm wondering whether there's any way to do this outside of OpenCL, CUDA, or anything else that's not really available on Android, which I'm targeting. I am writing an OpenGL ES 3.1 application which makes use of compute shaders, and for certain GPGPU algorithms, such as efficient parallel reduction as described by Nvidia (in the Reduction #5 section), there are optimizations you can make if you are aware of the "warp" (a.k.a. wavefront, a.k.a. SIMD width) size of the GPU the code will be running on. I'm also not sure if it's consistent enough on Android GPUs in order to just make a hard-coded assumption without querying anything, or if there's some table of GPU info I can reference, etc.
I tried Googling if there is any way to do this in OpenGL or even in general on Android, but I could not find anything. Is this possible? If not, is there a "recommended" workaround, like just assuming some minimum possible warp size in cases where that still may produce a small speed-up?
For OpenGL ES, if the implementation supports the KHR_shader_subgroup extension, you can call glGetIntegerv with GL_SUBGROUP_SIZE_KHR (listed as SUBGROUP_SIZE_KHR in the extension text) to get the subgroup size.
https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_shader_subgroup.txt
For the sake of completeness, in Vulkan 1.1 you can query the subgroup size from the device properties via VkPhysicalDeviceSubgroupProperties.subgroupSize.
https://www.khronos.org/blog/vulkan-subgroup-tutorial

Is it possible to use GPU for raytracing without CUDA/OpenCL etc?

I'm working on Windows Phone 7, which does not support features like CUDA or OpenCL. I'm new to the GPU side of things. Is there anything on the GPU that I can use to help speed up raytracing? Like triangle intersection tests? Or selecting the correct colour from a texture?
CUDA and the like are really just higher level languages for programming shaders, so any platform that supports programmable shaders gives you some capability to run general purpose calculations on the GPU.
Unfortunately, it looks like Windows Phone 7 does not support custom programmable shaders, so GPU acceleration for a ray tracer is not really possible at this time. Even if it were, it is very difficult to use a GPU effectively for raytracing because of several very anti-GPU characteristics:
- Poor memory coherency (each ray can easily interact with completely different geometry)
- High branching factor (shaders work best with code that consistently follows a single path)
- Large working set (a lot of geometry has to be accessible in memory at any one time to compute the outcome of even a single ray)
If your goal is to write a raytracer, it would probably be far easier to do it completely on the CPU, and only then consider optimizations that are more esoteric.
Raytracing is still a bit slow, even on an average modern desktop PC. You can speed it up by shooting just primary rays, but then rasterisation methods will actually be better and faster.
Are you certain you want to do ray-tracing on a phone, which has even less compute power than a PC? They are not designed to do that kind of work.
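To make the "do it on the CPU first" suggestion concrete, here is a minimal sketch of the triangle intersection test the question mentions, using the well-known Möller-Trumbore algorithm. It is not taken from any answer above; Vec3, intersectTriangle and the epsilon value are illustrative names and choices, not part of any phone SDK.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Moller-Trumbore ray/triangle test.
// On a hit in front of the origin, writes the distance along the ray
// (in units of |dir|) to tOut and returns true; otherwise returns false.
bool intersectTriangle(Vec3 origin, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, double& tOut) {
    assert(dot(dir, dir) > 0.0);  // precondition: direction is not the zero vector
    const double eps = 1e-9;      // assumed tolerance, tune for your scene scale

    Vec3 e1 = sub(v1, v0);
    Vec3 e2 = sub(v2, v0);
    Vec3 p = cross(dir, e2);
    double det = dot(e1, p);
    if (std::fabs(det) < eps) return false;  // ray is parallel to the triangle plane

    double inv = 1.0 / det;
    Vec3 s = sub(origin, v0);
    double u = dot(s, p) * inv;              // first barycentric coordinate
    if (u < 0.0 || u > 1.0) return false;

    Vec3 q = cross(s, e1);
    double v = dot(dir, q) * inv;            // second barycentric coordinate
    if (v < 0.0 || u + v > 1.0) return false;

    double t = dot(e2, q) * inv;
    if (t < eps) return false;               // triangle is behind the ray origin
    tOut = t;
    return true;
}
```

The early returns in this function are exactly the kind of branching described above: neighbouring rays can leave at different points, which is cheap on a CPU but costly for shader-style execution.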

Why not use GDI to repeatedly fill a window with RGB data from an array?

This is a follow-up to this question. I'm currently writing a simple game and am looking for the fastest way to (repeatedly) display an array of RGB data in a Win32 window, without flickering or other artifacts.
Several different approaches were recommended in the answers to the previous question, but there was no consensus on which would be the fastest. So, I threw together a test program. The code simply displays a framebuffer on the screen repeatedly, as fast as possible.
These are the results I obtained, for 32-bit data running in a 32-bit video mode - they may surprise some people:
- Direct3D (1): 500 fps
- Direct3D (2): 650 fps
- DirectDraw (3): 1100 fps
- DirectDraw (4): 800 fps
- GDI (SetDIBitsToDevice): 2000 fps
Given these figures:
- Why are many people adamant that GDI is simply too slow for this operation?
- Is there any reason to prefer DirectDraw or Direct3D over SetDIBitsToDevice?
Here is a brief summary of the calls made by each of the Direct* codepaths. If anyone knows a more efficient way to use DirectDraw/Direct3D, please comment.
1. CreateTexture(D3DUSAGE_DYNAMIC, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); DrawPrimitive()
2. CreateTexture(0, D3DPOOL_SYSTEMMEM); CreateTexture(0, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); UpdateTexture(); DrawPrimitive()
3. CreateSurface(); SetSurfaceDesc(lpSurface = &frameBuffer[0]);
memcpy(); primarySurface->Blt();
4. CreateSurface();
Lock(); memcpy(); Unlock(); primarySurface->Blt();
There are a couple of things to keep in mind here. First of all, a lot of "common knowledge" is based on some facts that no longer really apply.
In the days of AGP, when the CPU talked directly to the GPU, it always used the base PCI protocol, which happened at the "1x" rate (always and inevitably). AGP 2x/4x/8x only applied when the GPU was talking to the memory controller directly. In other words, depending on when you looked, it was up to 8 times as fast to have the GPU load a texture from memory as it was for the CPU to send the same data directly to the GPU. Of course, the CPU also had a great deal more bandwidth to memory than the PCI bus supported.
When things switched to PCI-E, however, that changed completely. While there can be differences in bandwidth depending on path, there's no general rule that memory->GPU will be faster than CPU->GPU. The one generalization that's (mostly) safe is that if you have a dedicated graphics card, then the GPU will almost always have more bandwidth to the memory on the graphics card than it does to main memory on the motherboard.
In your case, that doesn't matter much though -- you're talking about moving data from CPU space to GPU space regardless. The main speed difference with using DirectX (or OpenGL) happens when you keep all (or most) of the computation on the GPU, and avoid using the CPU (or main memory) at all. They don't (now that AGP is history) provide any substantial improvement in memory->display bandwidth.
Jerry Coffin makes some good points. The thing to bear in mind is what the DI stands for in SetDIBitsToDevice. It stands for Device Independent, which means you were ALWAYS at the mercy of drivers. Some drivers used to be complete rubbish and it affected the performance massively. DirectDraw suffered from similar issues as well ... but you also had access to the hardware blitters, so it was generally more useful. IHVs also tended to put more time into writing proper drivers for DirectDraw because of its gaming association. Who wants to be at the bottom of the performance pile when the hardware is quite capable of doing better?
These days many graphics cards can accept the bit data directly so no conversion happens. If it does need to be swizzled this is also INCREDIBLY quick in this day and age.
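As a rough illustration of why a swizzle is cheap, here is a sketch that assumes "swizzle" means a simple channel reorder, such as swapping red and blue in 32-bit pixels. The function name and the pixel layout are assumptions for the example, not something taken from the test program.

```cpp
#include <cstddef>
#include <cstdint>

// Swaps the red and blue channels of 32-bit pixels in place.
// Assumed layout: 0xAARRGGBB in, 0xAABBGGRR out (or the reverse).
void swapRedBlue(std::uint32_t* pixels, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t p = pixels[i];
        pixels[i] = (p & 0xFF00FF00u)           // keep alpha and green
                  | ((p & 0x00FF0000u) >> 16)   // move bits 16-23 down to 0-7
                  | ((p & 0x000000FFu) << 16);  // move bits 0-7 up to 16-23
    }
}
```

It is one pass over the buffer with a few bit operations per pixel, so for a frame-sized array it tends to be limited by memory bandwidth rather than by arithmetic.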
The reason your Direct3D performance is so terrible, by comparison, is that Direct3D, by nature of the fact it is meant to be used totally internally to the GPU, uses odd and complex formats to improve cache performance and so forth.
Couple that with the fact that you aren't testing like for like (with DDraw and D3D): you create a texture/surface, lock it, copy, unlock and then draw over the back buffer (via various methods). For best performance you'd be better off locking the backbuffer directly using a DISCARD lock, then memcpy'ing straight into the returned buffer before unlocking. This will bring your performance much closer to that of SetDIBitsToDevice. I would still expect D3D to be slower than DDraw, however, for the reasons outlined above.
The reason you will hear people pounce on GDI is that it used to just be old Windows API calls. The newer versions of it (that were called GDI+ when I last looked at them) are actually just an API placed on top of DirectX calls. So using GDI may seem fairly simple programming-wise at times, but adding a layer between things always slows things down. As mentioned in the response from Jerry Coffin, your examples are about moving the data, and that is the slow part. I am a bit surprised that DirectX is that much slower, though, but I cannot be of much more help without digging through the DirectX documentation (which has been pretty awesome for quite some time really. You might want to check out www.codesampler.com. I have always found good starting places from him and actually, while I may be insane for saying this, I would swear the improvements to the DirectX SDK in docs and examples were done based on this guy's work!)
As for the DirectDraw vs Direct3D (and not the GDI calls) discussion, I would say go with Direct3D. I believe DirectDraw has been deprecated since 8.0 or so, and 9.0 has been around for quite a long while. At the end of the day all of DirectX is 3D; it just varies in the level of helpful 2D APIs that are around, but you may find you can do some very interesting things in a 2D environment when you are actually using 3D space. (I had a pretty neat randomly generated lightning weapon for a space invaders clone at one time :))
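One detail worth keeping in mind when memcpy'ing into a locked buffer: the lock usually reports a pitch (bytes per row) that can be larger than width times bytes per pixel, so a single big memcpy is only safe when the pitches match. The helper below is a hedged sketch in plain C++ with no Direct3D headers; copyRows and its parameters are made-up names, and dst stands in for whatever pointer the lock call returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies `height` rows of `rowBytes` bytes each from src to dst.
// dstPitch and srcPitch are the distances in bytes between the starts of
// consecutive rows; they may be larger than rowBytes because of padding.
void copyRows(std::uint8_t* dst, std::size_t dstPitch,
              const std::uint8_t* src, std::size_t srcPitch,
              std::size_t rowBytes, std::size_t height) {
    assert(dstPitch >= rowBytes && srcPitch >= rowBytes);

    if (dstPitch == rowBytes && srcPitch == rowBytes) {
        // Both buffers are tightly packed: one memcpy covers the whole image.
        std::memcpy(dst, src, rowBytes * height);
        return;
    }
    // Otherwise copy row by row and leave the padding bytes untouched.
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * dstPitch, src + y * srcPitch, rowBytes);
    }
}
```

For a 32-bit framebuffer, rowBytes would be width * 4 and dstPitch would come from the structure filled in by the lock call.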
Anywho, hope this helped!
PS: It should be noted that DirectX is not always the fastest. For keyboard input (unless this has changed in 10 or 11) it has pretty much always been recommended to use the Windows events, as DirectInput was actually just a wrapper for that system! XInput however is -awesome-!!

Porting DirectX to OpenGL ES (iPhone)

I have been asked to investigate porting 10 year old DirectX (v7-9) games to OpenGL ES, initially for the iPhone.
I have never undertaken a game port like this before (and will be hiring someone to do it) but I'd like to understand the process.
Are there any resources/books/blogs that will help me in understanding the process?
Are there any projects like Mono that can accomplish this?
TBH a porting job like this is involved but fairly easy.
First you start by replacing all the DirectX calls with "stubs" (i.e. empty functions). You do this until you can get the software to compile. Once it has compiled, you start implementing all the stub functions. There will be a number of gotchas along the way but it's worth doing.
If you need to port to and support phones before the iPhone 3GS you have a more complex task, as that hardware only supports GLES 1, which is fixed-function only. You will have to "emulate" any shaders the games use somehow. On mobile platforms I have written, in the past, assembler code that performs "vertex shading" directly on the vertex data. Pixel shading is often more complicated, but you can usually provide enough information through the "vertex shading" to get this going. Some graphical features you may just have to drop.
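To give a feel for the stub stage, here is a small sketch. StubDevice is a hypothetical stand-in for whatever subset of the Direct3D device interface the game actually calls; the method names mirror real Direct3D 9 device methods, but the signatures are simplified for illustration. Counting the calls is an extra of this sketch, not something the approach requires.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the subset of the Direct3D device interface a
// game uses. Every method is an empty stub that only records that it was
// called, so the game compiles and runs before any OpenGL ES code exists.
class StubDevice {
public:
    long BeginScene() { return hit("BeginScene"); }
    long EndScene() { return hit("EndScene"); }
    long Clear(unsigned long color) {
        (void)color;  // ignored until the real port is written
        return hit("Clear");
    }
    long DrawPrimitive(int type, unsigned start, unsigned count) {
        (void)type; (void)start; (void)count;
        return hit("DrawPrimitive");
    }

    // How often each stub was hit, useful for deciding what to port first.
    const std::map<std::string, int>& calls() const { return calls_; }

private:
    long hit(const std::string& name) {
        ++calls_[name];
        return 0;  // 0 is the value of S_OK, so callers see "success"
    }
    std::map<std::string, int> calls_;
};
```

Once the game runs against something like this, each stub body can be replaced one at a time with the matching OpenGL ES calls, starting with the ones that are hit most often.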
Later versions of the iPhone use GLES 2 so you have access to GLSL ... ATI have written, and Aras P of Unity3D fame has extended, software that will port HLSL code to GLSL.
Once you have done all this you get on to the optimisation stage. You will probably find that your first pass isn't very efficient. This is perfectly normal. At this point you can look at the code from a higher level and see how you can move code around and do things differently to get best performance.
In summary: Your first step will be to get the code to compile without DirectX. Your next step will be the actual porting of DirectX calls to OpenGL ES calls. Finally you will want to refactor the remaining code for best performance.
(P.S: I'd be happy to do the porting work for you. Contact me through my linkedin page in my profile ;)).
Not a complete answer, but in the hope of helping a little...
I'm not aware of anything targeting OpenGL ES specifically, but Cedega, Cider and VirtualBox, amongst others, provide translation of DirectX calls to OpenGL calls, and OpenGL ES is, broadly speaking, OpenGL with a lot of very rarely used bits and some slower and redundant parts removed. So it would probably be worth at least investigating those products; at least VirtualBox is open source.
The SGX part in the iPhone 3GS onwards has a fully programmable pipeline, making it equivalent to a DirectX 10 part, so the hardware is there. The older MBX is fixed pipeline with the dot3 extension but no cube maps and only two texture units. It also has the matrix palette extension, so you can do good animation and pretty good lighting if multiple passes are acceptable.

Why is GUI code so computationally expensive?

All you Stackoverflowers,
I was wondering why GUI code is responsible for sucking away many, many cpu cycles. In principle, the graphical rendering is far less complex than Doom (although most corporate GUIs will introduce lots of window dressing). The event handling layer is also seemingly a heavy cost, however, it seems that a well-written implementation should switch between contexts efficiently on modern processors with a lot of memory/cache.
If anybody has run a profiler on their big GUI application, or a common API itself, I'm interested in where the bottlenecks lie.
Possible explanations (that I imagine) may be:
- High levels of abstraction between hardware and application interface
- Lots of levels of indirection to the correct code to execute
- Low priority (compared to other processes)
- Misbehaving applications flooding the API with calls
- Excessive object orientation?
- Completely poor design choices in the API (not just issues, but design philosophy)
Some GUI frameworks are much better than others, so I'd like to hear varied perspectives. For example, the Unix/X11 system is much different than Windows and even than WinForms.
Edit: Now a community wiki - go for it. I have one more thing to add -- I'm an algorithms guy in school and would be interested if there are inefficient algorithms in GUI code and which they are. Then again, it's probably just the implementation overhead.
I've no idea generally, but I'd like to add another item to your list - font rendering and calculations. Finding vector glyphs in a font and converting them to bitmap representations with anti-aliasing is no small task. And often it needs to be done twice - first to calculate the width/height of the text for positioning, and then actually drawing the text at the right coordinates.
Also, most drawing code today relies on clipping mechanisms to update just a part of the GUI. So, if just one part needs to be redrawn, the code actually redraws the whole window behind the scenes, and then takes just the needed part to actually update.
Added:
In the comments I found this:
I'm also very interested in this. It can't be that the gui is rendered using only the cpu because if you don't have proper drivers for your gfx-card, desktop graphics render incredibly slow. If you have gfx-drivers however desktop-gfx go kinda fast but never as fast as a directx/opengl app.
Here's the deal as I understand it: every graphic card out there today supports a generic interface for drawing. I'm not sure if it's called "VESA", "SVGA", or if those are just old names from the past. Anyway, this interface involves doing everything through interrupts. For every pixel there is an interrupt call. Or something like that. The proper VGA driver however is able to take advantage of DMA and other enhancements that make the whole process WAY less CPU-intensive.
Added 2: Ah, and for OpenGL/DirectX - that's another feature of today's graphics cards. They are optimized for 3D operations in exclusive mode. That's why the speed. The normal GUI just utilizes basic 2D drawing procedures. So it gets to send the contents of the whole screen every time it wants an update. 3D applications however send a bunch of textures and triangle definitions to the VRAM (video-RAM) and then just reuse them for drawing. They just say something like "take the triangle set #38 with the texture set #25 and draw them". All these things are cached in the VRAM so this is again way faster.
I'm not sure, but I would suspect that the modern 3D-accelerated GUIs (Vista Aero, compiz on Linux, etc.) also might take advantage of this. They could send common bitmaps to the VGA up front and then just reuse them directly from the VRAM. Any application-drawn surfaces however would still need to be sent directly every time for updates.
Added 3: More ideas. :) The modern GUIs for Windows, Linux, etc. are widget-oriented (that's control-oriented for Windows speakers). The problem with this is that each widget has its own drawing code and associated drawing surface (more or less). When the window needs to get redrawn, it calls the drawing code for all its child widgets, which in turn call the drawing code for their child widgets, etc. Every widget redraws its whole surface, even though some of it is obscured by other widgets. With the above-mentioned clipping techniques some of this drawn information is immediately discarded to reduce flickering and other artifacts. But still it's lots of manual drawing code that includes bitmap blitting, stretching, skewing, drawing lines, text, flood-filling, etc. And all this gets translated to a series of putpixel calls that get filtered through clipping filters/masks and other stuff. Ah, yes, and alpha blending has also become popular today for nice effects, which means even more work. So... yes, you could say this is because of lots of abstraction and indirection. But... could you really do it any better? I don't think so. Only 3D techniques might help, because they take advantage of the GPU for alpha calculations and clipping.
Let's begin by saying that writing libraries is much harder than writing stand-alone code. The requirement that your abstraction be reusable in as many contexts as possible, including contexts which you haven't thought of yet, makes the task challenging even for experienced programmers.
Amongst libraries, writing a GUI toolkit library is a famously difficult problem. This is because the programs which use GUI libraries range over a very wide variety of domains with very different needs. Mr Why and Martin DeMollo discussed the requirements placed on GUI libraries a little while ago.
Writing GUI widgets themselves is difficult because computer users are very sensitive to minute details of the behavior of the interface. Non-native widgets never feel right, do they? In order to get a non-native widget right -- in order to get any widget right, in fact -- you need to spend an inordinate amount of time tweaking the details of the behavior.
So, GUIs are slow because of the inefficiencies introduced by the abstraction mechanisms used to create highly reusable components, added to the shortness of time available to optimize the code once so much time has been spent just getting the behavior right.
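The recursive redraw described above can be sketched roughly like this. Widget, Rect and redraw are made-up names, not any toolkit's API, and the sketch assumes that every child lies inside its parent's bounds, so a whole subtree can be skipped when the parent misses the dirty rectangle.

```cpp
#include <vector>

// Axis-aligned rectangle in window coordinates; right/bottom are exclusive.
struct Rect { int left, top, right, bottom; };

bool intersects(const Rect& a, const Rect& b) {
    return a.left < b.right && b.left < a.right &&
           a.top < b.bottom && b.top < a.bottom;
}

struct Widget {
    Rect bounds;
    std::vector<Widget> children;
};

// Walks the widget tree and returns how many widgets would run their
// drawing code for the given dirty rectangle.
int redraw(const Widget& w, const Rect& dirty) {
    if (!intersects(w.bounds, dirty)) {
        return 0;  // nothing under this widget touches the dirty area
    }
    int drawn = 1;  // a real toolkit would call the widget's paint code here
    for (const Widget& child : w.children) {
        drawn += redraw(child, dirty);
    }
    return drawn;
}
```

Note that a widget that merely touches the dirty rectangle still runs its whole paint routine here; the clipping then throws away whatever falls outside, which is the wasted work the paragraph above complains about.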
Uhm, that's quite a lot.
The simplest and probably most obvious answer is that the programmers behind these GUI apps are really bad programmers. You can go a long way in writing code which does the most bizarre things and it will be faster, but few people seem to care how to do this, or they deem it to be an expensive, unprofitable waste of time.
To set things straight, off-loading computations to the GPU won't necessarily fix any problems. The GPU is just like the CPU except that it's less general purpose and more of a data-parallel processor. It can do graphics computations exceptionally well. Whatever graphics API/OS and driver combination you have doesn't really matter that much... well OK, with Vista as an example, they changed the desktop composition engine. This engine is far better at compositing only that which has changed, and since the number one bottleneck for GUI apps is redrawing, this is a neat optimization strategy. It's the idea of virtualizing your computational needs and only updating the smallest change every time.
Win32 sends WM_PAINT messages to windows when they need to be redrawn; this can be a result of windows occluding each other. However, it's up to the window itself to figure out what has actually changed. More often than not, nothing changed, or the change that was made was trivial enough that it could have just been performed on top of whatever topmost surface you had.
This kind of graphics handling doesn't necessarily exist today. I would say that people have refrained from writing really efficient and virtualizing rendering solutions because the benefit/cost ratio is rather low (bad).
Something Windows Presentation Foundation (WPF) does, which I think is far superior to most other GUI APIs, is that it splits layout updates and rendering updates into two separate passes. And while WPF is managed code, the rendering engine is not. What happens with rendering is that the managed WPF rendering engine builds a command queue (this is what DirectX and OpenGL do), which is then handed off to the native rendering engine. What's a bit more elegant here is that WPF will then try to retain any computation which didn't change the visual state. A trick, if you will, where you avoid costly rendering calls for things that don't have to be rendered (virtualizing).
In contrast to WM_PAINT, which tells a Win32 window to repaint itself, a WPF app would check what parts of that window require repainting and only repaint the smallest change.
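The "only repaint the smallest change" idea can be approximated with a dirty-rectangle tracker. The class below is a hypothetical sketch, not WPF's or Win32's actual mechanism: it merges every invalidated rectangle into one bounding box, which a paint pass could then use as its clip.

```cpp
#include <algorithm>

// Axis-aligned rectangle; right/bottom are exclusive.
struct Rect { int left, top, right, bottom; };

// Accumulates invalidated areas into a single bounding rectangle.
class DirtyRegion {
public:
    void invalidate(const Rect& r) {
        if (r.left >= r.right || r.top >= r.bottom) return;  // ignore empty rects
        if (empty_) {
            bounds_ = r;
            empty_ = false;
            return;
        }
        bounds_.left   = std::min(bounds_.left, r.left);
        bounds_.top    = std::min(bounds_.top, r.top);
        bounds_.right  = std::max(bounds_.right, r.right);
        bounds_.bottom = std::max(bounds_.bottom, r.bottom);
    }

    bool empty() const { return empty_; }
    Rect bounds() const { return bounds_; }

    // Number of pixels a repaint of the bounding box would touch.
    long area() const {
        if (empty_) return 0;
        return static_cast<long>(bounds_.right - bounds_.left) *
               static_cast<long>(bounds_.bottom - bounds_.top);
    }

    // Call after painting to start collecting changes for the next frame.
    void clear() { empty_ = true; }

private:
    Rect bounds_{0, 0, 0, 0};
    bool empty_ = true;
};
```

A single bounding box is the crudest choice: two small changes in opposite corners make it cover the whole window, so a more careful implementation would keep a list of rectangles instead.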
Now WPF is not supreme, it's a solid effort from Microsoft but it's not the holy grail yet... the code which runs the pipeline could still be improved and the memory footprint of any managed app is still more than I would want. But I hope this is the kind of answer you are looking for.
WPF is able to do some things asynchronously rather decently, which is a huge deal if you want to make a really responsive low-latency/low-CPU UI. Asynchronous operations are more than off-loading work onto a different thread.
To summarize, a slow and expensive GUI means too much repainting, and the kind of repainting which is very expensive, i.e. of the entire surface area.
It does to some degree depend on the language. You might have noticed that Java and RealBasic applications are a fair bit slower than their C-based (C++, C#, Objective-C) counterparts.
However GUI applications are much more complex than command line apps. The Terminal window needs only to draw a simple window that doesn't support buttons.
There are also multiple loops for extra inputs and features.
I think that you can find some interesting thoughts on this topic in "Window System Design: If I had it to do over again in 2002" by James Gosling (the Java guy, also known for his work on pre-X11 windowing systems). Available online here[pdf].
The article focuses on the positive side (how to make it fast), not on the negative side (what's making it slow), but it is still a good read on the topic.
