Is there a way to determine GPU warp/wavefront/SIMD width on Android? - opengl-es

My question is similar to the question "OpenCL - How do I query for a device's SIMD width?", but I'm wondering whether there's any way to do this outside of OpenCL, CUDA, or anything else that isn't really available on Android, which I'm targeting. I am writing an OpenGL ES 3.1 application that uses compute shaders, and for certain GPGPU algorithms, such as the efficient parallel reduction described by Nvidia (in the Reduction #5 section), there are optimizations you can make if you know the "warp" (a.k.a. wavefront, a.k.a. SIMD width) size of the GPU the code will run on. I'm also not sure whether the warp size is consistent enough across Android GPUs to just hard-code an assumption without querying anything, or whether there's some table of GPU info I can reference, etc.
I tried searching for a way to do this in OpenGL, or in general on Android, but could not find anything. Is this possible? If not, is there a recommended workaround, such as assuming some minimum possible warp size in cases where that may still produce a small speed-up?

For OpenGL ES, if the implementation supports the KHR_shader_subgroup extension, you can call glGetIntegerv(GL_SUBGROUP_SIZE_KHR, ...) to get the subgroup size.
https://www.khronos.org/registry/OpenGL/extensions/KHR/KHR_shader_subgroup.txt
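A minimal sketch of that query (assuming an ES 3.1 context is current and your headers define the KHR_shader_subgroup enums; the fallback path is up to you):
GLint subgroupSize = 0;
GLint numExtensions = 0;
int hasSubgroups = 0;
glGetIntegerv(GL_NUM_EXTENSIONS, &numExtensions);
for (GLint i = 0; i < numExtensions; ++i) {
    const char *ext = (const char *)glGetStringi(GL_EXTENSIONS, i);
    if (ext && strcmp(ext, "GL_KHR_shader_subgroup") == 0) {   // strcmp requires <string.h>
        hasSubgroups = 1;
        break;
    }
}
if (hasSubgroups) {
    glGetIntegerv(GL_SUBGROUP_SIZE_KHR, &subgroupSize);   // if your headers lack the enum, take its value from the extension spec
} else {
    subgroupSize = 0;   // unknown: fall back to a reduction that doesn't assume a warp size
}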
For the sake of completeness: with Vulkan 1.1 you can query the subgroup size from the device properties, via VkPhysicalDeviceSubgroupProperties::subgroupSize.
https://www.khronos.org/blog/vulkan-subgroup-tutorial
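A short sketch of the Vulkan 1.1 query (assuming you already have a VkPhysicalDevice handle from device selection):
VkPhysicalDeviceSubgroupProperties subgroupProps = {0};
subgroupProps.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

VkPhysicalDeviceProperties2 props2 = {0};
props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
props2.pNext = &subgroupProps;

vkGetPhysicalDeviceProperties2(physicalDevice, &props2);   // physicalDevice obtained elsewhere
uint32_t subgroupSize = subgroupProps.subgroupSize;        // e.g. 16, 32 or 64 depending on the GPU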

Related

Measuring WebGL efficiency

I'm working on a Three.js project that requires some heavy-duty work in a fragment shader, so I am looking for a way to use lower quality if the device can't handle the work.
By pure accident I recently included a 'uint' uniform in my shader code and found it just would not run on older devices. So the availability of WebGL2 became an obvious and useful switch.
The problem is that WebGL2 support is a browser decision, and some older devices with newer software will still run it, just very badly.
Is there a quick test to determine WebGL efficiency so I can fall back to lower quality if needed?
Measuring FPS is not an option, since even on a modern device it can take a few seconds for the frame rate to stabilize on a new page.
This is not a general solution.
But in my particular situation I am using a very expensive SDF that is needed in both the high- and low-quality versions of the graphics. It is generated once, stored in an FBO, and then reused multiple times as a texture.
Even on a desktop with an RTX 3060 Ti it takes more than 20 ms to generate the texture; on the old S4 it takes 320+ ms.
These timings are not ideal metrics, but with a bit of tuning they should provide a way of estimating the GPU's capability and give a good indication of when to fall back to simpler graphics.
There will always be a cut-off in what we support, but being able to get the best from older devices is not a bad thing.
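A rough sketch of the idea in native-GL terms (the answer itself is about WebGL/three.js; sdfFbo, renderExpensiveSdfPass(), nowSeconds() and the 100 ms threshold are all placeholders):
glBindFramebuffer(GL_FRAMEBUFFER, sdfFbo);   // the FBO the SDF is generated into
double t0 = nowSeconds();                    // your own CPU timer
renderExpensiveSdfPass();                    // the one-off, expensive generation pass
glFinish();                                  // force the GPU to finish before stopping the clock
double elapsedMs = (nowSeconds() - t0) * 1000.0;
int useLowQuality = elapsedMs > 100.0;       // threshold needs tuning for your own content
In WebGL the equivalent of glFinish() is gl.finish(); the EXT_disjoint_timer_query extensions give more precise GPU timings where available.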

Display list vs. VAO performance

I recently implemented functionality in my rendering engine to make it able to compile models into either display lists or VAOs based on a runtime setting, so that I can compare the two to each other.
I'd generally prefer to use VAOs, since I can make multiple VAOs sharing actual vertex data buffers (and also since they aren't deprecated), but I find them to actually perform worse than display lists on my nVidia (GTX 560) hardware. (I want to keep supporting display lists anyway to support older hardware/drivers, however, so there's no real loss in keeping the code for handling them.)
The difference is not huge, but it is certainly measurable. As an example, at a point in the engine state where I can consistently measure my drawing loop using VAOs to take, on a rather consistent average, about 10.0 ms to complete a cycle, I can switch to display lists and observe that cycle time decrease to about 9.1 ms on a similarly consistent average. Consistent, here, means that a cycle normally deviates less than ±0.2 ms, far less than the difference.
The only thing that changes between these settings is the drawing code for a normal mesh. It changes from the VAO code, whose OpenGL calls look simply like this...
glBindVertexArray(id);
glDrawElements(GL_TRIANGLES, num, GL_UNSIGNED_SHORT, NULL); // Using an index array in the VAO
... to the display-list code which looks as follows:
glCallList(id);
Both code paths apply other state as well for the various models, of course, but that happens in exactly the same manner, so those should be the only differences. I've made sure not to unbind the VAO unnecessarily between draw calls, as that, too, turned out to perform measurably worse.
Is this behavior to be expected? I had expected VAOs to perform better than, or at least on par with, display lists, since they are more modern and not deprecated. On the other hand, I've read that nVidia's implementation has particularly well-optimized display lists, so I'm thinking perhaps their VAO implementation might still be lagging behind. Has anyone else got findings that match (or contradict) mine?
Otherwise, could I be doing something wrong? Are there any known circumstances that make VAOs perform worse than they should, on nVidia hardware or in general?
For reference, I've tried the same differences on an Intel HD Graphics (Ironlake) as well, and there it turned out that using VAOs performed just as well as simply rendering directly from memory, while display lists were much worse than either. I wish I had AMD hardware to try on, but I don't.
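For reference, the shared-buffer VAO setup described above would typically look something like this (a sketch assuming an interleaved Vertex struct and shared sharedVbo/sharedIbo buffer objects, not the actual code from the engine):
GLuint vao;
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ARRAY_BUFFER, sharedVbo);           // vertex buffer shared by several VAOs
glEnableVertexAttribArray(0);                       // position
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void *)offsetof(Vertex, position));   // offsetof requires <stddef.h>
glEnableVertexAttribArray(1);                       // normal
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void *)offsetof(Vertex, normal));
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, sharedIbo);   // the element-array binding is stored in the VAO
glBindVertexArray(0);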

in OS X, what is the BASE graphics drawing layer?

I am beginning GUI development on OS X, and I am wondering: what is the VERY BASE layer in the system with which to draw graphics? It seems as if there are so many upper-level abstractions (AppKit, OpenGL, CG, etc.), which are nice and time-saving, but for me unusable until I understand the base layer (unless it's binary or assembly, in which case I throw in the towel).
I am beginning GUI development on OS X, and I am wondering: what is the VERY BASE layer in the system with which to draw graphics?
Believe it or not, ever since Mac OS X Tiger the whole graphics stack has been based on OpenGL. Below OpenGL there is only the GPU driver and then the bare metal.
It seems as if there are so many upper-level abstractions (AppKit, OpenGL, CG, etc.), which are nice and time-saving, but for me unusable until I understand the base layer (unless it's binary or assembly, in which case I throw in the towel).
Why are they unusable for you? What do you expect to gain from the added knowledge? The lower the level you're using, the more intimate you must be with how it works to make efficient use of it. OpenGL itself is already fairly low level. The OpenGL implementation hides some gory details from you, like on-demand swapping of texture data between fast and regular memory, and the GLSL compiler is also rather high level. But on the other hand, to use OpenGL efficiently you should deliver data in the formats the GPU natively works with, shaders can be cached in their binary form, and buffer objects provide you with an API for DMA transfers.
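For example, the buffer-object path mentioned above boils down to something like this (a rough sketch; vertexData and dataSize are placeholders, glMapBufferRange needs a GL 3.0-class context, and memcpy requires <string.h>):
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, dataSize, NULL, GL_STREAM_DRAW);   // let the driver allocate the storage
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, dataSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, vertexData, dataSize);   // write directly into driver-managed memory; the GPU transfer is handled for you
glUnmapBuffer(GL_ARRAY_BUFFER);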
If you were really interested in the lowest layer, you'd have to look at the GPU design itself, i.e. the metal. AMD has actually published full programming documentation for some of their GPUs (Google for OpenGPU).
You could do a lot worse than have a look at the Quartz 2D Programming Guide. It's the layer you will be using most often and understanding this will form the basis for any further investigation you do.

Is it possible to use GPU for raytracing without CUDA/OpenCL etc?

I'm working on Windows Phone 7, which does not support features like CUDA or OpenCL. I'm new to the GPU side of things: is there anything on the GPU that I can use to help speed up raytracing, like triangle intersection tests, or selecting the correct colour from a texture?
CUDA and the like are really just higher-level languages for programming shaders, so any platform that supports programmable shaders gives you some capability to run general-purpose calculations on the GPU.
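To make that concrete: before CUDA, the usual trick was to misuse the rendering pipeline by packing inputs into textures, drawing a full-screen quad with a fragment shader that does the math, and reading the result back from an FBO. A sketch of what such a "kernel" might look like (hypothetical uniforms and logic, GLSL ES 1.0 style):
// Fragment shader used as a compute kernel: one output pixel = one result.
const char *kernelFs =
    "precision highp float;\n"
    "uniform sampler2D rayOrigins;\n"       // one ray per pixel, packed into textures
    "uniform sampler2D rayDirections;\n"
    "varying vec2 uv;\n"
    "void main() {\n"
    "    vec3 o = texture2D(rayOrigins, uv).xyz;\n"
    "    vec3 d = texture2D(rayDirections, uv).xyz;\n"
    "    // ... per-ray work, e.g. a ray/triangle intersection test, goes here ...\n"
    "    gl_FragColor = vec4(o + d, 1.0);\n"   // placeholder result
    "}\n";
// Host side: compile this, bind an FBO with a texture attachment, draw a full-screen quad,
// then read the results back with glReadPixels (or feed them into the next pass as a texture).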
Unfortunately, it looks like Windows Phone 7 does not support custom programmable shaders, so GPU acceleration for a ray tracer is not really possible at this time. Even if it were, it is very difficult to use a GPU effectively for raytracing because of several very anti-GPU characteristics:
Poor memory coherency (each ray can easily interact with completely different geometry)
High branching factor (shaders work best with code that consistently follows a single path)
Large working set (a lot of geometry has to be accessible in memory at any one time to compute the outcome of even a single ray)
If your goal is to write a raytracer, it would probably be far easier to do completely on the CPU, and only then consider optimizations that are more esoteric.
Raytracing is still a bit slow, even on a modern average desktop PC. You can speed it up by shooting just primary rays, but at that point rasterisation methods will actually be better and faster.
Are you certain you want to do raytracing on a phone, which has even less compute power than a PC? Phones are not designed for that kind of work.

Porting DirectX to OpenGL ES (iPhone)

I have been asked to investigate porting ten-year-old DirectX (v7-9) games to OpenGL ES, initially for the iPhone.
I have never undertaken a game port like this before (and will be hiring someone to do it) but I'd like to understand the process.
Are there any resources/books/blogs that will help me in understanding the process?
Are there any projects like Mono that can accomplish this?
TBH, a porting job like this is involved but fairly straightforward.
First you start by replacing all the DirectX calls with "stubs" (i.e. empty functions). You do this until you can get the software to compile. Once it compiles, you start implementing all the stub functions. There will be a number of gotchas along the way, but it's worth doing.
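For example, a stub for one of the draw calls might go through these two stages (a hypothetical wrapper, not taken from a real port; the Direct3D 9 signature is real, the GL mapping is simplified):
// Hypothetical wrapper around IDirect3DDevice9::DrawPrimitive used during the port.
HRESULT WrappedDrawPrimitive(D3DPRIMITIVETYPE type, UINT startVertex, UINT primCount)
{
    // Stage 1: the body was simply "return D3D_OK;" so that everything compiles.
    // Stage 2: fill it in with the OpenGL ES equivalent.
    if (type == D3DPT_TRIANGLELIST)
        glDrawArrays(GL_TRIANGLES, (GLint)startVertex, (GLsizei)(primCount * 3));
    // ... other primitive types, plus the surrounding vertex/texture state, go here ...
    return D3D_OK;
}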
If you need to port to and support phones older than the iPhone 3GS, you have a more complex task, as that hardware only supports OpenGL ES 1, which is fixed-function only. You will have to "emulate" shaders somehow. On mobile platforms I have, in the past, written assembler code that performs "vertex shading" directly on the vertex data. Pixel shading is often more complicated, but you can usually provide enough information through the "vertex shading" to get this going. Some graphical features you may simply have to drop.
Later iPhones use OpenGL ES 2, so you have access to GLSL. ATI wrote, and Aras P of Unity3D fame has extended, software that will translate HLSL code to GLSL (HLSL2GLSL).
Once you have done all this you get on to the optimisation stage. You will probably find that your first pass isn't very efficient. This is perfectly normal. At this point you can look at the code from a higher level and see how you can move code around and do things differently to get best performance.
In summary: Your first step will be to get the code to compile without DirectX. Your next step will be the actual porting of DirectX calls to OpenGL ES calls. Finally you will want to refactor the remaining code for best performance.
(P.S: I'd be happy to do the porting work for you. Contact me through my linkedin page in my profile ;)).
Not a complete answer, but in the hope of helping a little...
I'm not aware of anything targeting OpenGL ES specifically, but Cedega, Cider and VirtualBox (amongst others) provide translation of DirectX calls to OpenGL calls, and OpenGL ES is, broadly speaking, OpenGL with a lot of very rarely used bits and some slower, redundant parts removed. So it would probably be worth at least investigating those products; at least VirtualBox is open source.
The SGX part in the iPhone 3GS onwards has a fully programmable pipeline, making it roughly equivalent to a DirectX 10 part, so the hardware is there. The older MBX is fixed-pipeline with the dot3 extension but no cube maps and only two texture units. It also has the matrix palette extension, so you can do good animation and pretty good lighting if multiple passes are acceptable.
