Is it possible to use GPU for raytracing without CUDA/OpenCL etc? - raytracing

I'm working on Windows Phone 7, which does not support features like CUDA or OpenCL. I'm new to the GPU side of things. Is there anything on the GPU that I can use to help speed up raytracing, such as triangle intersection tests or selecting the correct colour from a texture?

CUDA and the like are really just higher-level languages for programming shaders, so any platform that supports programmable shaders gives you some capability to run general-purpose calculations on the GPU.
Unfortunately, it looks like Windows Phone 7 does not support custom programmable shaders, so GPU acceleration for a ray tracer is not really possible at this time. Even if it were, it is very difficult to use a GPU effectively for raytracing because of several very anti-GPU characteristics:
- Poor memory coherency (each ray can easily interact with completely different geometry)
- High branching factor (shaders work best with code that consistently follows a single path)
- Large working set (a lot of geometry has to be accessible in memory at any one time to compute the outcome of even a single ray)
If your goal is to write a raytracer, it would probably be far easier to do completely on the CPU, and only then consider optimizations that are more esoteric.
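For a CPU-only start, the triangle intersection test the question mentions could look roughly like the sketch below. It uses the Möller-Trumbore method, which is my choice for illustration and not something the answer prescribes; the `Vec3` type, the helper names and the epsilon value are all assumptions.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static double dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Moller-Trumbore ray/triangle test (illustrative sketch).
// On a hit, writes the distance along the ray to tOut and returns true.
bool intersectTriangle(const Vec3& origin, const Vec3& dir,
                       const Vec3& v0, const Vec3& v1, const Vec3& v2, double& tOut) {
    const double kEps = 1e-9;  // assumed tolerance, tune for your scene scale
    const Vec3 e1 = sub(v1, v0);
    const Vec3 e2 = sub(v2, v0);
    const Vec3 p = cross(dir, e2);
    const double det = dot(e1, p);
    if (std::fabs(det) < kEps) return false;  // ray is parallel to the triangle plane
    const double inv = 1.0 / det;
    const Vec3 s = sub(origin, v0);
    const double u = dot(s, p) * inv;         // first barycentric coordinate
    if (u < 0.0 || u > 1.0) return false;
    const Vec3 q = cross(s, e1);
    const double v = dot(dir, q) * inv;       // second barycentric coordinate
    if (v < 0.0 || u + v > 1.0) return false;
    const double t = dot(e2, q) * inv;
    if (t <= kEps) return false;              // hit lies behind the ray origin
    tOut = t;
    return true;
}
```

Note the early-out branches: they are cheap on a CPU but are exactly the kind of divergent control flow the list above describes as awkward for shaders.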

Raytracing is still a bit slow, even on an average modern desktop PC. You can speed it up by shooting just primary rays, but then rasterisation methods will actually be better and faster.
Are you certain you want to do ray-tracing on a phone, which has even less compute power than a PC? Phones are not designed to do that kind of work.

Related

Measuring WebGL efficiency

I’m working on a ThreeJs project that requires some heavy-duty work done within a fragment shader, so I am looking for a way to use lower quality if the device can’t handle the work.
By pure accident I recently included a ‘uint’ uniform in my shader code and found it just would not run on older devices. So the availability of WebGL2 became an obvious and good switch.
The problem is that WebGL2 is a browser choice, and some older devices with later software will still run it, even if very badly.
Is there a quick test to determine WebGL efficiency so I can fall back to lower quality if needed?
Measuring FPS is not an option, since even on a modern device it can take a few seconds for it to stabilize for a new page.
This is not a general solution.
But in my particular situation I am using a very expensive SDF that is needed in both the Hi and Lo versions of the graphics. It is generated once and stored in an FBO, then used again multiple times as a texture.
Even on a desktop using an RTX 3060 Ti it takes more than 20 ms to generate the texture; on the old S4 it takes 320+ ms.
They're not ideal metrics, but with a bit of tuning they should provide a way of guesstimating the GPU's capability and give a good indication of when to fall back to simpler graphics.
There will always be a cut off of what we support but being able to get the best from older devices is not a bad thing.
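The decision itself is language-independent, so here is a minimal sketch of it in C++ (a ThreeJs project would write the same logic in JavaScript). The 100 ms cut-off is a made-up value sitting between the two timings quoted above, and `chooseQuality` is a hypothetical name; both would need tuning against real devices.

```cpp
#include <cassert>

enum class Quality { Low, High };

// Assumed cut-off: somewhere between the ~20 ms desktop and ~320 ms S4
// timings quoted above. Not a measured or recommended value.
constexpr double kHighQualityBudgetMs = 100.0;

// Picks a quality tier from the time the one-off SDF generation took.
Quality chooseQuality(double sdfGenerationMs) {
    assert(sdfGenerationMs >= 0.0);
    return sdfGenerationMs <= kHighQualityBudgetMs ? Quality::High : Quality::Low;
}
```

Since the SDF is generated once up front, the measurement is available before the first expensive frame is drawn, which avoids waiting for an FPS reading to settle.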

In OS X, what is the BASE graphics drawing layer?

I am beginning GUI development in OSX, and I am wondering, what is the VERY BASE layer in the system for which to draw graphics? It seems as if there are so many upper level abstractions (AppKit, OpenGL, CG, etc), which are nice and timesaving, but for me unusable until I understand the base layer (unless it's binary or assembly, in which case I throw in the towel).
I am beginning GUI development in OSX, and I am wondering, what is the VERY BASE layer in the system for which to draw graphics?
Believe it or not, ever since MacOS X Tiger the whole graphics stack is based on OpenGL. Below OpenGL is only the GPU driver and then the bare metal.
It seems as if there are so many upper level abstractions (AppKit, OpenGL, CG, etc), which are nice and timesaving, but for me unusable until I understand the base layer (unless it's binary or assembly, in which case I throw in the towel).
Why are they unusable for you? What do you expect to gain from the added knowledge? The lower the level you're using, the more intimately you must know how it works to make efficient use of it. OpenGL itself is already fairly low level. The OpenGL implementation hides some gory details from you, such as on-demand swapping of texture data between fast and regular memory, and the GLSL compiler is also rather high level. On the other hand, to use OpenGL efficiently you should deliver data in the format the GPU natively works with; shaders can be cached in their binary form, and buffer objects provide you with an API for DMA transfers.
If you were really interested in the lowest layer, then you'd have to look at the GPU design, i.e. the metal. AMD did actually publish full programming documentation on some of their GPUs (Google for OpenGPU).
You could do a lot worse than have a look at the Quartz 2D Programming Guide. It's the layer you will be using most often and understanding this will form the basis for any further investigation you do.

Detect whether a Quartz Composition in a QCView will be rendered through software or hardware

I have a feeling there are combinations of Cocoa Quartz Compositions and GPUs which can't be handled by the GPU and which fall back on the software renderer, even if Core Image is "accelerated" normally. How would I detect such a situation?
Or more generally, how do I detect that a machine is too underpowered to handle a certain composition of a certain size, without actually playing the composition and measuring the FPS?
(Measuring the FPS through playing the composition in a hidden window is unlikely to work, since the QCView might detect that situation and optimise away the whole operation, or parts thereof. And even if it didn't do that today it might start doing that with the next update from Apple - it'd be an unreliable solution.)
Update: to be thorough I did write some code to test render the composition at full resolution in an ordered out but properly sized window, trying to force the render to happen with [self startRendering];[self snapshotImage];[self stopRendering];. This took an amount of time which looked reasonable at first, until it turned out the slow machine was faster at running this test than the fast one. ;) In reality the slow machine renders the composition at a measly 2.24 FPS vs 27 FPS on the fast machine.
I'm guessing you're asking so that you can make a simpler fallback animation for weaker systems?
One option may be to check the user's hardware string as is mentioned here:
GPU Chipset Detection.
glGetString can return GL_VENDOR, GL_RENDERER, GL_VERSION, or GL_EXTENSIONS. You could theoretically use GL_VENDOR to identify Intel GMAs as too slow, or compare GL_RENDERER to a list of known poor-performing GPUs. If you're writing code for 10.6+ only, you only have to compare to GPUs used in Intel Macs, so the list shouldn't be too long.
This might not be quite the elegant solution you're looking for, but it should do the trick. I would also provide the user with an override to choose the higher or lower quality graphics if they wish.
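As a rough sketch of the comparison step: the function below takes the string you would get back from glGetString(GL_RENDERER) and checks it against a short deny-list. The list entries and the name `isLikelySlowRenderer` are assumptions for illustration only; you would have to build the real list from the hardware you actually test on.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical deny-list of lower-case substrings. Build the real one from
// renderer strings observed on machines you have tested.
static const std::array<const char*, 2> kSlowRendererHints = {"gma", "software"};

static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Returns true if the renderer string contains any deny-listed substring.
bool isLikelySlowRenderer(const std::string& renderer) {
    const std::string lowered = toLower(renderer);
    for (const char* hint : kSlowRendererHints) {
        if (lowered.find(hint) != std::string::npos) return true;
    }
    return false;
}
```

Keeping the check as a pure string function makes it easy to combine with the user override suggested above: the override simply wins over whatever this returns.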

Why not use GDI to repeatedly fill a window with RGB data from an array?

This is a follow-up to this question. I'm currently writing a simple game and am looking for the fastest way to (repeatedly) display an array of RGB data in a Win32 window, without flickering or other artifacts.
Several different approaches were recommended in the answers to the previous question, but there was no consensus on which would be the fastest. So, I threw together a test program. The code simply displays a framebuffer on the screen repeatedly, as fast as possible.
These are the results I obtained, for 32-bit data running in a 32-bit video mode - they may surprise some people:
- Direct3D (1): 500 fps
- Direct3D (2): 650 fps
- DirectDraw (3): 1100 fps
- DirectDraw (4): 800 fps
- GDI (SetDIBitsToDevice): 2000 fps
Given these figures:
Why are many people adamant that GDI is simply too slow for this operation?
Is there any reason to prefer DirectDraw or Direct3D over SetDIBitsToDevice?
Here is a brief summary of the calls made by each of the Direct* codepaths. If anyone knows a more efficient way to use DirectDraw/Direct3D, please comment.
1. CreateTexture(D3DUSAGE_DYNAMIC, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); DrawPrimitive()
2. CreateTexture(0, D3DPOOL_SYSTEMMEM); CreateTexture(0, D3DPOOL_DEFAULT);
LockRect(); memcpy(); UnlockRect(); UpdateTexture(); DrawPrimitive()
3. CreateSurface(); SetSurfaceDesc(lpSurface = &frameBuffer[0]);
memcpy(); primarySurface->Blt();
4. CreateSurface();
Lock(); memcpy(); Unlock(); primarySurface->Blt();
There are a couple of things to keep in mind here. First of all, a lot of "common knowledge" is based on some facts that no longer really apply.
In the days of AGP, when the CPU talked directly to the GPU, it always used the base PCI protocol, which happened at the "1x" rate (always and inevitably). AGP 2x/4x/8x only applied when the GPU was talking to the memory controller directly. In other words, depending on when you looked, it was up to 8 times as fast to have the GPU load a texture from memory as it was for the CPU to send the same data directly to the GPU. Of course, the CPU also had a great deal more bandwidth to memory than the PCI bus supported.
When things switched to PCI-E, however, that changed completely. While there can be differences in bandwidth depending on path, there's no general rule that memory->GPU will be faster than CPU->GPU. The one generalization that's (mostly) safe is that if you have a dedicated graphics card, then the GPU will almost always have more bandwidth to the memory on the graphics card than it does to main memory on the motherboard.
In your case, that doesn't matter much though -- you're talking about moving data from CPU space to GPU space regardless. The main speed difference with using DirectX (or OpenGL) happens when you keep all (or most) of the computation on the GPU, and avoid using the CPU (or main memory) at all. They don't (now that AGP is history) provide any substantial improvement in memory->display bandwidth.
Jerry Coffin makes some good points. The thing to bear in mind is what the DI stands for in SetDIBitsToDevice. It stands for Device Independent, which means you were ALWAYS at the mercy of drivers. Some drivers used to be complete rubbish and it affected the performance massively. DirectDraw suffered from similar issues as well ... but you also had access to the hardware blitters so it was generally more useful. IHVs also tended to put more time into writing proper drivers for DirectDraw because of its gaming association. Who wants to be the bottom of the performance pile when the hardware is quite capable of doing better?
These days many graphics cards can accept the bit data directly so no conversion happens. If it does need to be swizzled this is also INCREDIBLY quick in this day and age.
The reason your Direct3D performance is so terrible, by comparison, is that Direct3D, by nature of the fact it is meant to be used totally internally to the GPU, uses odd and complex formats to improve cache performance and so forth.
Couple that with the fact that you aren't testing like for like (with DDraw and D3D) by creating a texture/surface, locking it, copying, unlocking and then drawing over the back buffer (via various methods). To get best performance you'd be best off directly locking the backbuffer using a DISCARD lock then memcpy'ing directly into the returned buffer before unlocking. This will bring your performance much closer to the SetDIBitsToDevice. I still would expect D3D to be slower than DDraw, however, for the reasons outlined above.
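One detail worth sketching for the lock-and-memcpy step: a locked surface usually reports a row pitch (for example the Pitch member of D3DLOCKED_RECT), and that pitch can be larger than width * 4 bytes, so a single big memcpy is only safe when the two match. The helper below is a plain C++ illustration of that copy with no Direct3D calls in it; `copyFrameToSurface` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies a tightly packed 32-bit framebuffer into a destination buffer whose
// rows start dstPitchBytes apart (as reported by a surface lock).
void copyFrameToSurface(const std::uint32_t* src, int width, int height,
                        std::uint8_t* dst, std::size_t dstPitchBytes) {
    const std::size_t rowBytes = static_cast<std::size_t>(width) * sizeof(std::uint32_t);
    assert(dstPitchBytes >= rowBytes);
    if (dstPitchBytes == rowBytes) {
        // No padding between rows: one copy covers the whole frame.
        std::memcpy(dst, src, rowBytes * static_cast<std::size_t>(height));
        return;
    }
    // Padded rows: copy line by line and leave the padding untouched.
    for (int y = 0; y < height; ++y) {
        std::memcpy(dst + static_cast<std::size_t>(y) * dstPitchBytes,
                    src + static_cast<std::size_t>(y) * width, rowBytes);
    }
}
```

In a real test program, dst and dstPitchBytes would come from the lock call and the function would run between the lock and the unlock.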
The reason you will hear people trounce GDI is that it used to be just old Windows API calls. The newer versions of it (which were called GDI+ when I last looked at them) are actually just an API placed on top of DirectX calls. So using GDI may seem fairly simple programming-wise at times, but adding a layer between things always slows things down. As mentioned in the response from Jerry Coffin, your examples are about moving the data, and that is the slow part. I am a bit surprised that DirectX is that much slower, though, but I cannot be much more help without digging through the DirectX documentation (which has been pretty awesome for quite some time, really). You might want to check out www.codesampler.com. I have always found good starting places there and, while I may be insane for saying this, I would swear the improvements to the DirectX SDK docs and examples were based on this guy's work!
As for the DirectDraw vs Direct3D (and not the GDI calls) discussion, I would say go to Direct3D. I believe DirectDraw has been deprecated since 8.0 or so, and 9.0 has been around for quite a long while. At the end of the day all of DirectX is 3D; it just varies in the level of helpful 2D APIs that are around, and you may find you can do some very interesting things in a 2D environment when you are actually using 3D space. (I had a pretty neat randomly generated lightning weapon for a space invaders clone at one time :))
Anywho, hope this helped!
PS: It should be noted that DirectX is not always the fastest. For keyboard input (unless this has changed in 10 or 11) it has pretty much always been recommended to use the Windows events, as DirectInput was actually just a wrapper for that system! XInput, however, is -awesome-!!

How are 3D games so efficient? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
There is something I have never understood. How can a great big PC game like GTA IV use 50% of my CPU and run at 60fps while a DX demo of a rotating Teapot @ 60fps uses a whopping 30%?
Patience, technical skill and endurance.
First point is that a DX Demo is primarily a teaching aid so it's done for clarity not speed of execution.
It's a pretty big subject to condense but games development is primarily about understanding your data and your execution paths to an almost pathological degree.
Your code is designed around two things - your data and your target hardware.
- The fastest code is the code that never gets executed: sort your data into batches and only do expensive operations on the data you need to.
- How you store your data is key. Aim for contiguous access, which allows you to batch process at high speed.
- Parallelise everything you possibly can.
- Modern CPUs are fast, modern RAM is very slow. Cache misses are deadly.
- Push as much to the GPU as you can. It has fast local memory so it can blaze through the data, but you need to help it out by organising your data correctly.
- Avoid doing lots of render state switches (again, batch similar vertex data together), as this causes the GPU to stall.
- Swizzle your textures and ensure they are powers of two; this improves texture cache performance on the GPU.
- Use levels of detail as much as you can: low/medium/high versions of 3D models, switched based on distance from the camera/player. There is no point rendering a high-res version if it's only 5 pixels on screen.
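To make the batching point concrete, here is a minimal sketch that sorts a frame's draw calls by render state and counts how many state switches the GPU would see before and after. The `DrawCall` struct and its integer state handle are invented for the example; a real engine would use a richer sort key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct DrawCall {
    int stateId;  // hypothetical handle for a shader/texture/blend combination
    int meshId;   // hypothetical handle for the geometry to draw
};

// Counts how often the render state changes when calls are issued in order.
int countStateSwitches(const std::vector<DrawCall>& calls) {
    int switches = 0;
    for (std::size_t i = 0; i < calls.size(); ++i) {
        if (i == 0 || calls[i].stateId != calls[i - 1].stateId) ++switches;
    }
    return switches;
}

// Groups calls that share a state; stable_sort keeps the submission order
// within each group.
void batchByState(std::vector<DrawCall>& calls) {
    std::stable_sort(calls.begin(), calls.end(),
                     [](const DrawCall& a, const DrawCall& b) { return a.stateId < b.stateId; });
}
```

With four calls alternating between two states, the unsorted order needs four state changes and the batched order needs two.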
In general, it's because:
- The games are being optimal about what they need to render, and
- They take special advantage of your hardware.
For instance, one easy optimization you can make involves not actually trying to draw things that can't be seen. Consider a complex scene like a cityscape from Grand Theft Auto IV. The renderer isn't actually rendering all of the buildings and structures. Instead, it's rendering only what the camera can see. If you could fly around to the back of those same buildings, facing the original camera, you would see a half-built hollowed-out shell structure. Every point that the camera cannot see is not rendered -- since you can't see it, there's no need to try to show it to you.
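One common way to skip things the camera cannot see is to test each object's bounding sphere against the six planes of the view frustum. The sketch below shows only that test; the `Plane` and `Sphere` structs are assumptions, plane normals are assumed to be unit length and to point into the visible volume, and this is not a claim about how GTA IV itself does it.

```cpp
#include <array>
#include <cassert>

// A point (x, y, z) is on the inner side of the plane when
// nx*x + ny*y + nz*z + d >= 0. Normals are assumed to be unit length.
struct Plane { double nx, ny, nz, d; };
struct Sphere { double x, y, z, radius; };

// Returns false only when the sphere lies completely outside some plane.
bool isVisible(const Sphere& s, const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum) {
        const double dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) return false;  // entirely behind this plane: cull
    }
    return true;  // inside or straddling the volume: draw it
}
```

The test is conservative: an object that straddles a plane is still drawn, so nothing visible is ever dropped.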
Furthermore, optimized instructions and special techniques exist when you're developing against a particular set of hardware, to enable even better speedups.
The other part of your question is why a demo uses so much CPU:
... while a DX demo of a rotating Teapot @ 60fps uses a whopping 30%?
It's common for demos of graphics APIs (like dxdemo) to fall back to what's called a software renderer when your hardware doesn't support all of the features needed to show a pretty example. These features might include things like shadows, reflection, ray-tracing, physics, et cetera.
This mimics the function of a completely full-featured hardware device which is unlikely to exist, in order to show off all the features of the API. But since the hardware doesn't actually exist, it runs on your CPU instead. That's much more inefficient than delegating to a graphics card -- hence your high CPU usage.
3D games are great at tricking your eyes. For example, there is a technique called screen space ambient occlusion (SSAO) which will give a more realistic feel by shadowing those parts of a scene that are close to surface discontinuities. If you look at the corners of your wall, you will see they appear slightly darker than the centers in most cases.
The very same effect can be achieved using radiosity, which is based on rather accurate simulation. Radiosity will also take into account more effects of bouncing lights, etc. but it is computationally expensive - it's a ray tracing technique.
This is just one example. There are hundreds of algorithms for real time computer graphics and they are essentially based on good approximations and typically make a lot of assumptions. For example, spatial sorting must be chosen very carefully depending on the speed, typical position of the camera as well as the amount of changes to the scene geometry.
These 'optimizations' are huge - you can implement an algorithm efficiently and make it run 10 times faster, but choosing a smart algorithm that produces a similar result ("cheating") can make you go from O(N^4) to O(log(N)).
Optimizing the actual implementation is what makes games even more efficient, but that is only a linear optimization.
Eeeeek!
I know that this question is old, but it's astonishing that no one has mentioned VSync!!!???
You compared the CPU usage of the game at 60fps to CPU usage of the teapot demo at 60fps.
Isn't it apparent, that both run (more or less) at exactly 60fps? That leads to the answer...
Both apps run with vsync enabled! This means (dumbed-down) that the rendering framerate is locked to the "vertical blank interval" of your monitor. The graphics hardware (and/or driver) will only render at max. 60fps. 60fps = 60Hz (Hz=per second) refresh rate. So you probably use a rather old, flickering CRT or a common LCD display. On a CRT running at 100Hz you will probably see framerates of up to 100Hz. VSync also applies in a similar way to LCD displays (they usually have a refresh rate of 60Hz).
So, the teapot demo may actually run much more efficiently! If it uses 30% of CPU time (compared to 50% CPU time for GTA IV), then it probably uses less CPU time each frame, and just waits longer for the next vertical blank interval. To compare both apps, you should disable vsync and measure again (you will measure much higher fps for both apps).
Sometimes it's OK to disable vsync (most games have an option in their settings). Sometimes you will see "tearing artefacts" when vsync is disabled.
You can find details of it and why it is used at wikipedia: http://en.wikipedia.org/wiki/Vsync
Whilst many answers here provide excellent indications of how, I will instead answer the simpler question of why:
- GTA4 took $400 million in its first week.
- Crytek wrote an extremely impressive graphics demo to allow nVidia to 'show off' at a trade show. The resulting impressions got them the leg up to create what would become Far Cry.
- Valve's 2005 revenue and operating profit have been stated as 70 and 55 million USD respectively.
Perhaps the best example (certainly one of the best known) is id Software. They realised very early, in the days of Commander Keen (well before 3D), that coming up with a clever way to achieve something [1] that was graphically superior to the competition, even if it relied on modern hardware (in this case an EGA graphics card!), would make your game stand out. This was true, but they further realised that, rather than having to come up with new games and content themselves, they could licence the technology, thus getting income from others whilst being able to develop the next generation of engine and so leapfrog the competition again.
The abilities of these programmers (coupled with business savvy) are what made them rich.
That said it is not necessarily money that motivates such people. It is likely just as much the desire to achieve, to accomplish. The money they earned in the early days simply means that they now have time to devote to what they enjoy. And whilst many have outside interests almost all still program and try to work out ways to do better than the last iteration.
Put simply the person who wrote the teapot demo likely had one or more of the following issues:
- less time
- fewer resources
- less reward incentive
- less internal and external competition
- lesser goals
- less talent
The last may sound harsh [2], but clearly there are some who are better than others; bell curves sometimes have extreme ends, and the people at them tend to be attracted to the corresponding extreme ends of what is done with that skill.
The lesser goals one is actually likely to be the main reason. The target of the teapot demo was just that, a demo. But not a demo of the programmer's skill [3]. It would be a demo of one small facet of a (big) OS, in this case DX rendering.
To those viewing the demo it wouldn't matter if it used way more CPU than required, so long as it looked good enough. There would be no incentive to eliminate waste when there would be no beneficiary. In comparison, a game would love to have spare cycles for better AI, better sound, more polygons, more effects.
[1] In that case, smooth scrolling on PC hardware.
[2] Likely more than me, so we're clear about that.
[3] Strictly speaking it would have been a demo to his/her manager too, but again the drive here would be time and/or visual quality.
Because of a few reasons:
- 3D game engines are highly optimized
- most of the work is done by your graphics adapter
- 50%? Hm, let me guess: you have a dual core and only one core is used ;-)
EDIT: To give a few numbers
2.8 GHz Athlon-64 with NV-6800 GPU. The results are:
CPU: 72.78 Mflops
GPU: 2440.32 Mflops
Sometimes a scene may have more going on than it appears. For example, a rotating teapot with thousands of vertices, environment mapping, bump mapping, and other complex pixel shaders all being rendered simultaneously amounts to a whole lot of processing. A lot of times these teapot demos are simply meant to show off some sort of special effect. They also may not always make the best use of the GPU when absolute performance isn't the goal.
In a game you may see similar effects but they're usually done in a compromised fashion in effort to maximize the frame rate. These optimizations extend to everything you see in the game. The issue becomes, "How can we create the most spectacular and realistic scene with the least amount of processing power?" It's what makes game programmers some of the best optimizers around.
- Scene management: kd-trees, frustum culling, BSPs, hierarchical bounding boxes, potentially visible sets.
- LOD. Switching in lower-detail versions to substitute for far-away objects.
- Impostors. Like LOD, but not even an object, just a picture or 'billboard'.
- SIMD.
- Custom memory management. Aligned memory, less fragmentation.
- Custom data structures (i.e. no STL, relatively minimal templating).
- Assembly in places, mainly for SIMD.
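As a small illustration of the LOD item in the list above, the selection step can be as simple as comparing the camera distance against a couple of thresholds. The function name and the default distances below are invented for the example; real engines often pick thresholds per model or use projected screen size instead.

```cpp
#include <cassert>

// Returns 0 for the high-detail model, 1 for medium, 2 for low.
// nearLimit and farLimit are assumed, scene-specific distances.
int selectLod(double distance, double nearLimit = 50.0, double farLimit = 200.0) {
    assert(distance >= 0.0 && nearLimit < farLimit);
    if (distance < nearLimit) return 0;
    if (distance < farLimit) return 1;
    return 2;
}
```

An impostor is then just one more tier beyond the lowest-detail mesh, drawn as a textured quad.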
Among all the qualified and good answers given, the one that matters is still missing: the CPU utilization counter of Windows is not very reliable. I guess that this simple teapot demo just calls the rendering function in its idle loop, blocking at the buffer swap.
Now the Windows CPU utilization counter just looks at how much CPU time is spent within each process, but not how this CPU time is used. Try adding a
Sleep(0);
just after returning from the rendering function, and compare.
In addition, there are many many tricks from an artistic standpoint to save computational power. In many games, especially older ones, shadows are precalculated and "baked" right into the textures of the map. Many times, the artists tried to use planes (two triangles) to represent things like trees and special effects when it would look mostly the same. Fog in games is an easy way to avoid rendering far-off objects, and often, games would have multiple resolutions of every object for far, mid, and near views.
The core of any answer should be this: the transformations that 3D engines perform are mostly specified as additions and multiplications (linear algebra, with no branches or jumps), and the operations for drawing a single frame are often specified in a way that lets many such add-mul jobs be done in parallel. GPU cores are very good at add-muls, and GPUs have dozens or hundreds of add-mul cores.
The CPU is left with doing simple stuff -- like AI and other game logic.
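To show what "mostly additions and multiplications" means in practice, here is a minimal sketch of the basic building block, a 4x4 matrix applied to one vertex. The row-major layout and the type aliases are my assumptions. The loops have fixed trip counts and no data-dependent branches, and every vertex in a mesh can be processed independently, which is why the work spreads so well across many GPU cores.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<Vec4, 4>;  // row-major: m[row][column]

// Transforms one homogeneous vertex: 16 multiplies and 16 adds, nothing else.
Vec4 transform(const Mat4& m, const Vec4& v) {
    Vec4 out{};  // zero-initialised accumulator
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            out[r] += m[r][c] * v[c];
        }
    }
    return out;
}
```

For example, a translation by (1, 2, 3) moves the point (1, 1, 1) to (2, 3, 4).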
How can a great big PC game like GTA IV use 50% of my CPU and run at 60fps while a DX demo of a rotating Teapot @ 60fps uses a whopping 30%?
While GTA is quite likely to be more efficient than the DX demo, measuring CPU efficiency this way is essentially broken. Efficiency could be defined, e.g., by how much work you do per given time. A simple counterexample: spawn one thread per logical CPU and let a simple infinite loop run on it. You will get CPU usage of 100%, but it is not efficient, as no useful work is done.
This also leads to an answer: how can a game be efficient? When programming "great big games", a huge effort is dedicated to optimizing the game in all aspects (which nowadays usually also includes multi-core optimizations). As for the DX demo, its point is not running fast, but rather demonstrating concepts.
I think you should take a look at GPU utilisation rather than CPU... I bet the graphics card is much busier in GTA IV than in the Teapot sample (where it should be practically idle).
Maybe you could use something like this monitor to check that:
http://downloads.guru3d.com/Rivatuner-GPU-Monitor-Vista-Sidebar-Gadget-download-2185.html
Also the framerate is something to consider, maybe the teapot sample is running at full speed (maybe 1000fps) and most games are limited to the refresh frequency of the monitor (about 60fps).
Look at the answer on vsync; that is why they are running at same frame rate.
Secondly, CPU usage is misleading in a game. A simplified explanation is that the main game loop is just an infinite loop:
while (1) {
    update();
    render();
}
Even if your game (or in this case, teapot) isn't doing much you are still eating up CPU in your loop.
The 50% CPU in GTA is "more productive" than the 30% in the demo: more than likely the demo is not doing much at all, while GTA is updating tons of details. Even adding a "Sleep(10)" to the demo will probably drop its CPU usage by a ton.
Lastly, look at GPU usage. The demo is probably taking <1% on a modern video card, while GTA will probably be taking the majority during gameplay.
In short, your benchmarks and measurements aren't accurate.
The DX teapot demo is not using 30% of the CPU doing useful work. It's busy-waiting because it has nothing else to do.
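A portable way to see the difference between busy-waiting and yielding is a loop that sleeps until the next frame deadline instead of spinning. The sketch below uses standard C++ only (std::this_thread::sleep_until in place of the Win32 Sleep call); `runPacedLoop` and its parameters are hypothetical, and a real renderer would more likely block on vsync than on a timer.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Runs `frames` iterations of the callback, sleeping until each frame's
// deadline so the thread gives up the CPU instead of spinning.
// Returns the total wall-clock time spent.
std::chrono::steady_clock::duration runPacedLoop(
        int frames, std::chrono::milliseconds frameTime,
        const std::function<void()>& updateAndRender) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto deadline = start;
    for (int i = 0; i < frames; ++i) {
        updateAndRender();
        deadline += frameTime;
        std::this_thread::sleep_until(deadline);  // idle here, not in a busy loop
    }
    return clock::now() - start;
}
```

With a cheap callback, nearly all of the wall-clock time is spent asleep, so a CPU usage counter would show the process close to idle even though it still produces one frame per interval.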
From what I know of the Unreal series, some conventions, like encapsulation, are broken. Code is compiled to bytecode or directly into machine code depending on the game. Also, objects are rendered and packaged in the form of meshes, and things such as textures, lighting and shadows are precalculated, whereas a pure 3D animation requires doing this in real time. When the game is actually running there are also some optimizations, such as rendering only the visible parts of an object and displaying texture detail only when close up. Finally, it's probable that video games are designed to get the best out of a platform at a given time (e.g. Intel x86 MMX/SSE, DirectX, ...).
I think there is an important part of the answer missing here. Most of the answers tell you to "Know your data". The fact is that you must, in the same way and with the same degree of importance, also know your:
- CPU (clock and caches)
- Memory (frequency and latency)
- Hard drive (in terms of speed and seek times)
- GPU (#cores, clock and its memory/caches)
- Interfaces: SATA controllers, PCI revisions, etc.
BUT, on top of that, with current computers you would never be able to play a real 1080p video at >>30 fps if every pixel were stored at 64 bits (a single 1080p image at 64 bits per pixel takes about 16,200 KB, or roughly 15.8 MB). The reason is the sampling/precision. A video game would never use double precision (64 bits) for pixels, images, data, etc., but rather a lower custom precision (~4-8 bits), and sometimes even less precision rescaled with interpolation techniques, to allow reasonable computation time.
There are other techniques as well, such as clipping the data (both with the OpenGL standard and in software implementations), data compression, etc. Keep in mind, too, that current GPUs can be >300 times faster than current CPUs in terms of hardware capability. However, a good programmer may get only a 10-20x factor unless the problem is fully optimized and completely parallelizable (particularly task parallelizable).
From experience, I can tell you that optimization is like an exponential curve: the time required to reach optimal performance may be incredibly large.
So, to get back to the teapot, you should look at how its geometry is represented and sampled, and with what precision, versus what you see in GTA 5 in terms of geometry/textures and, most important, the details (precision, sampling, etc.).
