Low Flash AS3 Performance

I'm working on a game coded in AS3 using the Alternativa3D 7.8 engine, and it just doesn't have the FPS I was hoping to achieve; I'm trying to fully understand why. I understand that having 3D objects in a scene can be very taxing on performance, but I'm using only a very limited number of 3D objects, and each of those has a relatively small polygon count.
I'm wondering if there is something else like a memory leak causing this on top of the actual rendering of the scene.
I'd like to figure out a way to view how the performance is being distributed in my code to see if there are certain areas that are causing this. I usually only get about 10-15 FPS on my computer and I'd like to get that to around a constant 20-24 or higher if possible.

I don't think that this question should be downvoted necessarily, though it is a bit broad. OP is asking about general performance tips for AS3 applications.
It's true that we can't give specific pointers without seeing the code, but we can still provide some more general tips and tricks. Here's some fairly general analysis:
I don't think your performance problems necessarily have anything to do with your 3D, though they might. The instant the game world comes on screen even the mouse movement is tremendously slowed, whereas the instant I pause it the framerate improves - which suggests to me that you are doing a lot of iteration and calculation on every frame.
I'd start with this: do you have any computationally intensive loops going on inside of your main game loop? For instance, I see that you're working with sea level as it affects landmass - are you doing something like calculating all of your water properties on every frame?
Having a lot of "3D" objects isn't necessarily a problem, because a 3D object is just a set of points. They're more intensive to position than 2D objects because you're including an additional dimension, but not so much more intensive that a few 3D objects would cause this kind of performance. I don't think that they are your problem (though I could be wrong).
Rather, it's what kind of calculations you're performing. Look for loops, figure out what you can comment out and instantly see better performance, and then once you've isolated it see what you can do about caching the outputs of those computations so that you don't have to recalc them on every frame.
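For example, a minimal dirty-flag caching sketch (written in C++ for brevity, though the same pattern applies directly in AS3; the sea-level/water names are hypothetical stand-ins, not from the OP's project):

#include <cstddef>
#include <vector>

// Hypothetical per-tile water data, recomputed only when its inputs change.
struct WaterCache {
    float seaLevel = 0.0f;
    bool dirty = true;                  // set whenever an input changes
    std::vector<float> depthPerTile;    // cached output

    void setSeaLevel(float level) {
        if (level != seaLevel) { seaLevel = level; dirty = true; }
    }

    // Called every frame: recomputes only when something actually changed.
    const std::vector<float>& depths(const std::vector<float>& terrainHeights) {
        if (dirty) {
            depthPerTile.resize(terrainHeights.size());
            for (std::size_t i = 0; i < terrainHeights.size(); ++i)
                depthPerTile[i] = seaLevel - terrainHeights[i];
            dirty = false;
        }
        return depthPerTile;            // cheap on frames where nothing changed
    }
};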
Cheers,
mb

Related

Game performance optimization interview

This question came up:
You're searching for bottlenecks in your game, but nothing you're changing is making the game any faster, be it anything in the GPU pipeline or the CPU. Nothing is spiking, and the slowness appears to be distributed across everywhere. What do you do next?
I was flummoxed. Is it a trick question? When fixing perf issues, I've always assumed that this is the point at which you need to scale everything back. I don't think it's memory allocation, as that would show up in CPU perf.
I would have asked for more information. "Slow" is a poor indicator of bad performance and is a classification of a symptom rather than a symptom itself. For example, you might describe "slow" as being:
Low frame rate
Poor responsiveness to input
High responsiveness and smooth framerate, but slow game mechanics (i.e.: the player and entities move smoothly but very slowly)
In the case of networked games, apparent network lag
All of these problems have different potential causes and solutions:
Low but consistent frame rate may be due to inefficiencies in your game loop. Simply running your favorite profiler may indicate that large amounts of time are spent in one particular piece of code. In a game I wrote, for instance, I discovered that low FPS was the result of a bad loop that calculated distances between entities multiple times without caching. In another game, I discovered that the data structure I was using to perform lookups against the terrain was O(N) rather than O(1) (python stdlib...ick). You can't diagnose a problem you can't see, and profiling is the first line of defense.
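As a loose illustration of that last point (the tile/terrain names here are invented, not from that game): replacing a linear scan with a hash map keyed on grid coordinates turns an O(N) terrain query into an O(1) average-case lookup.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tile { int x, y; int terrainType; };

// O(N): scan every tile to find the one under an entity.
int terrainAtSlow(const std::vector<Tile>& tiles, int x, int y) {
    for (const Tile& t : tiles)
        if (t.x == x && t.y == y) return t.terrainType;
    return -1;
}

// Pack a grid coordinate into a single 64-bit key.
std::uint64_t key(int x, int y) {
    return (std::uint64_t(std::uint32_t(x)) << 32) | std::uint32_t(y);
}

// Build the index once (or whenever the map changes)...
std::unordered_map<std::uint64_t, int> buildIndex(const std::vector<Tile>& tiles) {
    std::unordered_map<std::uint64_t, int> index;
    for (const Tile& t : tiles) index[key(t.x, t.y)] = t.terrainType;
    return index;
}

// ...then every per-frame query is an O(1) average-case hash lookup.
int terrainAtFast(const std::unordered_map<std::uint64_t, int>& index, int x, int y) {
    auto it = index.find(key(x, y));
    return it == index.end() ? -1 : it->second;
}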
Poor responsiveness may be due to a number of things. If the FPS is high but the controls are sluggish to respond, the API that you're using to access the controls may simply be bad. Some controllers may have crappy drivers that can kill responsiveness. It might even be your game loop: you might simply not be checking for input from the controller frequently enough (perhaps you're not checking on every tick). In one of the aforementioned games, I had an issue where certain actions had a delayed effect: you'd use an item and the game would respond a half second or so later. It turned out that the issue was caused by the client making a full round-trip to the server to perform the action, verify that it happened, and wait for the server to broadcast back that the item was used. Simply having the behavior take place instantaneously on the client remedied the issue.
Slow game mechanics might indicate that game constants simply aren't set high enough. If everything is smooth and beautiful but everything just moves very slowly, it's quite possible that default velocities or accelerations aren't turned up enough.
Network lag can be caused by any number of things: the router you're connected to might be failing, the VPS you're developing against might be on a host that's being DDoSed, you might be using a protocol that's overly (but uniformly) chatty, or you're simply sending too much data over the wire. In a piece of simulation software I wrote in college, the computations were performed on some beefy computers in a lab, while the visualizations were being run on my MBP in my dorm. It turned out that the sheer amount of data that I was sending from the lab computers to my dorm was enough to overload the cheap network switches in the building and drop packets, resulting in horrible lag but perfectly reasonable log output.
So I guess the answer here is to have the interviewer describe the symptoms more fully. #Ali's answer is great, but it could be that there's a more nuanced problem at hand that requires some coaxing to diagnose.
You're searching for bottlenecks in your game, but nothing you're changing is making the game any faster, be it anything in the GPU pipeline or the CPU. Nothing is spiking, and the slowness appears to be distributed across everywhere.
It pretty much sounds like the definition of Uniformly Slow Code. Let's assume it is really what is meant by this (and not some I/O bottleneck, creation of unnecessary objects in a loop, a poor choice of data structures or algorithms, etc.).
To make uniformly slow code faster, you usually have to go against good practices, and that is why I usually stop optimizing my code when it is uniformly slow. (I suppose "stop optimizing" is not a good answer at an interview...)
One way to make things faster is to identify an appropriate sequence of small operations, collect them together in one place, and then manually improve them; sort of "manually inlining" these operations and then doing high-level simplifications on the code that emerges. It requires good intuition about where this might be worth doing and an excellent understanding of the involved code. This answer calls it bunching and horizontal optimization.
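A rough sketch of what such bunching can look like, assuming the hot path is several small, tidy passes over the same array (the operations here are purely illustrative):

#include <cstddef>
#include <vector>

// Before: three single-purpose passes -- readable, but the data is walked three times.
void updateSeparate(std::vector<float>& pos, const std::vector<float>& vel, float dt) {
    for (std::size_t i = 0; i < pos.size(); ++i) pos[i] += vel[i] * dt;              // integrate
    for (std::size_t i = 0; i < pos.size(); ++i) if (pos[i] > 100.f) pos[i] = 100.f; // clamp
    for (std::size_t i = 0; i < pos.size(); ++i) pos[i] *= 0.999f;                   // damp
}

// After "bunching": the small operations are inlined into one pass.
// Less modular and less reusable, but the data is touched only once.
void updateFused(std::vector<float>& pos, const std::vector<float>& vel, float dt) {
    for (std::size_t i = 0; i < pos.size(); ++i) {
        float p = pos[i] + vel[i] * dt;
        if (p > 100.f) p = 100.f;
        pos[i] = p * 0.999f;
    }
}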
Another thing that might be worth looking into, if you really have uniformly slow code, is Andrei Alexandrescu's optimization tips.
Maybe this is about thinking about more efficient algorithms. "Micro-optimization" has its limits; you can perfectly optimize a bubble sort, for example, but to get a really big speedup you'd need to switch to a different sorting algorithm.
Also, in games you may introduce different kinds of adjustable quality/speed (or precision/speed) tradeoffs. Typically all games have some settings that change graphic detail level.
anecdotal:
i can tell you what the problem is without actually knowing the answer to the question ;p
sloppy directx calls. too many objects. especially bad on some old dx9 games, since dx9 needed to make a new draw call for every object. or something like that, the story goes. basically resulted in the cpu waiting idle for the gpu to process all the messages.
although def not the solution to every issue, i thought it was worth mentioning as an interesting piece of information ;p didn't see it in the other comments.
it's almost like having too many pixel shaders, except at least the gpu works at 100% with a mass of those :D good for frying omelettes. (also, using occlusion to save performance and then adding a mass of pixel shaders to that model is a BAD idea)
i hope you can see the humor in this ;p

Performance comparison

In OpenGL ES, I have come across two ways to do light effects:
1. using a light map
2. using stencil buffers
Which is the more efficient way in terms of performance?
The answer is it depends. If you are heavily using stencilling, then your application may become fill rate limited. Doing a texture lookup on a light map can slow down your fragment/vertex shaders, also making your application fill rate limited. Generally though, light maps are more efficient, as they don't involve extra passes like most stencilling effects do. Moral of the story is, only benchmarking will accurately tell you which is more efficient.

Data-oriented design in practice?

There has been one more question on what data-oriented design is, and there's one article which is often referred to (and I've read it like 5 or 6 times already). I understand the general concept of this, especially when dealing with, for example, 3d models, where you'd like to keep all vertexes together, and not pollute your faces with normals, etc.
However, I do have a hard time visualizing how data-oriented design might work for anything but the most trivial cases (3d models, particles, BSP-trees, and so on). Are there any good examples out there which really embrace data-oriented design and show how this might work in practice? I can plow through large code-bases if needed.
What I'm especially interested in is the mantra "where there's one there are many", which I can't really seem to connect with the rest here. Yes, there are always more than one enemy, yet, you still need to update each enemy individually, cause they're not moving the same way now are they? Same goes for the 'balls'-example in the accepted answer to the question above (I actually asked this in a comment to that answer, but haven't gotten a reply yet). Is it simply that the rendering will only need the positions, and not the velocities, whereas the game simulation will need both, but not the materials? Or am I missing something? Perhaps I've already understood it and it's a far more straightforward concept than I thought.
Any pointers would be much appreciated!
So, what is DOD all about? Obviously, it's about performance, but it's not just that. It's also about well-designed code that is readable, easy to understand and even reusable.
Now Object Oriented design is all about designing code and data to fit into encapsulated virtual "objects". Each object is a separate entity with variables for properties that object might have and methods to take action on itself or other objects in the world. The advantage of OO design is that it's easy to mentally model your code into objects because the whole (real) world around us seems to work in the same way: objects with properties that can interact with each other.
Now the problem is that the CPU in your computer works in a completely different way. It works best when you let it do the same things again and again. Why is that? Because of a little thing called cache. Accessing RAM on a modern computer can take 100 or 200 CPU cycles (and the CPU has to wait all that time!), which is way too long. So there's this small portion of memory on the CPU that can be accessed really quickly: cache memory. Problem is, it's only a few MB tops. So every time you need data that isn't in cache, you still need to go the long way to RAM. That's not just the case for data; the same goes for code. Trying to execute a function that's not in the instruction cache will cause a stall while the code is loaded from RAM.
Back to OO programming. Objects are big, but most functions need only a small portion of that data, so we're wasting cache by loading unnecessary data. Methods call other methods which call other methods, thrashing your instruction cache. Still, we often do a lot of the same stuff over and over again. Let's take a bullet from a game for example. In a naive implementation each bullet could be a separate object. There might be a bullet manager class. It calls the first bullet's update function. It updates the 3D position using the direction/velocity. This causes a lot of other data from the object to be loaded into the cache. Next, we call the World Manager class to check for a collision with other objects. This loads lots of other stuff into the cache, maybe it even causes code from the original bullet manager class to get dropped from the instruction cache. Now we return to the bullet update, there was no collision, so we return to bullet manager. It might need to load some code again. Next up, bullet #2 update. This loads lots of data into the cache, calls world... etc. So in this hypothetical situation, we've got 2 stalls for loading code and let's say 2 stalls for loading data. That's at least 400 cycles wasted, for 1 bullet, and we haven't taken bullets that hit something else into account. Now a CPU runs at 3+ GHz so we're not going to notice one bullet, but what if we've got 100 bullets? Or even more?
So this is the "where there's one, there's many" story. Yes, there are some cases where you've only got one object: your manager classes, file access, etc. But more often, there are a lot of similar cases. Naive, or even not-so-naive, object oriented design will lead to lots of problems. So enter data oriented design. The key of DOD is to model your code around your data, not the other way around as with OO design. This starts at the first stages of design. You do not first design your OO code and then optimize it. You start by listing and examining your data and thinking out how you want to modify it (I'll get to a practical example in a moment). Once you know how your code is going to modify the data, you can lay it out in a way that makes it as efficient as possible to process. Now you may think this can only lead to a horrible soup of code and data everywhere, but that is only the case if you design it badly (bad design is just as easy with OO programming). If you design it well, code and data can be neatly designed around specific functionality, leading to very readable and even very reusable code.
So back to our bullets. Instead of creating a class for each bullet, we only keep the bullet manager. Each bullet has a position and a velocity. Each bullet's position needs to be updated. Each bullet has to have a collision check and all bullets that have hit something need to take some action accordingly. So just by taking a look at this description I can design this whole system in a much better way. Let's put the positions of all bullets in an array/vector. Let's put the velocity of all bullets in an array/vector. Now let's start by iterating along those two arrays and updating each position value with its corresponding velocity. Now, all data loaded into the data cache is data we're going to use. We can even put in a smart pre-loading command to pre-load some array data in advance so the data's in cache when we get to it. Next, collision check. I'm not going into detail here, but you can imagine how updating all bullets after each other can help. Also note that if there's a collision, we're not going to call a new function or do anything. We just keep a vector with all bullets that had a collision and when collision checking is done, we can update all those after each other. See how we just went from lots of memory access to almost none by laying our data out differently? Did you also notice how our code and data, even though not designed in an OO way any more, are still easy to understand and easy to reuse?
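A minimal structure-of-arrays sketch of the bullet example described above (the collision test is a stand-in, just to show the batched flow):

#include <cstddef>
#include <vector>

// All bullet data laid out as parallel arrays ("structure of arrays").
struct Bullets {
    std::vector<float> posX, posY;
    std::vector<float> velX, velY;
};

// Pass 1: integrate every position; the CPU streams through contiguous floats.
void integrate(Bullets& b, float dt) {
    for (std::size_t i = 0; i < b.posX.size(); ++i) {
        b.posX[i] += b.velX[i] * dt;
        b.posY[i] += b.velY[i] * dt;
    }
}

// Pass 2: collect the indices of bullets that hit something, and handle them
// afterwards in one batch instead of branching into other systems mid-loop.
std::vector<std::size_t> collide(const Bullets& b, float worldEdge) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < b.posX.size(); ++i)
        if (b.posX[i] < 0.f || b.posX[i] > worldEdge)   // stand-in collision test
            hits.push_back(i);
    return hits;
}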
So to get back to the "where there's one there's many". When designing OO code you think about one object, the prototype/class. A bullet has a velocity, a bullet has a position, a bullet will move each frame by its velocity, a bullet can hit something, etc. When you think about that, you will think about a class, with velocity, position, and an update function which moves the bullet and checks for collision. However, when you have multiple objects you need to think about all of them. Bullets have positions, velocities. Some bullets may have a collision. Do you see how we're not thinking about an individual object any longer? We're thinking about all of them and are designing code a lot differently now.
I hope this helps answer your second part of the question. It's not about whether you need to update each enemy or not, it's about the most efficient way to update them. And while designing only your enemies using DOD may not help gain much performance, designing the entire game around these principles (only where applicable!) may lead to a lot of performance gains!
So onto the first part of the question, that is other examples of DOD. I'm sorry but I don't have that much there. There is one really good example though, I came across this some time ago, a series on data oriented design of a behavior tree by Bjoern Knafla: http://bjoernknafla.com/data-oriented-behavior-tree-overview You probably want to start at the first one in the series of 4, links are in the article itself.
Hope this still helps, despite the old question. Or maybe some other SO users will come across this question and get some use out of this answer.
I read the question you linked to and the article.
I've read one book on the subject of data driven design.
I'm pretty much in the same boat as you.
The way I understand Noel's article is that you design your game in the typical object oriented way. You have classes and methods that work on the classes.
After you've done your design, you ask yourself the following question:
How can I arrange all of the data I've designed in one huge blob?
Think of it in terms of writing your entire design as one functional method, with lots of subordinate methods. It reminds me of the massive 500,000 line Cobol programs of my youth.
Now, you probably won't write the entire game as one huge functional method. Really, in the article, Noel is talking about the rendering portion of a game. Think of it as a game engine (the one huge functional method) and the code to drive the game engine (the OOP code).
What I'm especially interested in is the mantra "where there's one there are many", which I can't really seem to connect with the rest here. Yes, there are always more than one enemy, yet, you still need to update each enemy individually, cause they're not moving the same way now are they?
You're thinking in terms of objects. Try thinking in terms of functionality.
Each enemy update is an iteration of a loop.
What's important is that the enemy data is structured to be in one memory location, rather than spread across enemy object instantiations.

How are 3D games so efficient? [closed]

There is something I have never understood. How can a great big PC game like GTA IV use 50% of my CPU and run at 60fps, while a DX demo of a rotating Teapot @ 60fps uses a whopping 30%?
Patience, technical skill and endurance.
First point is that a DX Demo is primarily a teaching aid so it's done for clarity not speed of execution.
It's a pretty big subject to condense but games development is primarily about understanding your data and your execution paths to an almost pathological degree.
Your code is designed around two things - your data and your target hardware.
The fastest code is the code that never gets executed - sort your data into batches and only do expensive operations on data you need to.
How you store your data is key - aim for contiguous access; this allows you to batch process at high speed.
Parallelise everything you possibly can.
Modern CPUs are fast, modern RAM is very slow. Cache misses are deadly.
Push as much to the GPU as you can - it has fast local memory so can blaze through the data but you need to help it out by organising your data correctly.
Avoid doing lots of render state switches (again, batch similar vertex data together) as this causes the GPU to stall.
Swizzle your textures and ensure they are powers of two - this improves texture cache performance on the GPU.
Use levels of detail as much as you can -- low/medium/high versions of 3D models, switched based on distance from the camera/player - there's no point rendering a high-res version if it's only 5 pixels on screen.
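As a sketch of that last point, LOD selection can be as simple as a squared-distance check per object (the thresholds here are made up):

enum class Lod { High, Medium, Low, Skip };

// Pick a detail level from the offset between object and camera.
// Squared distance avoids a sqrt per object; the cut-offs are arbitrary examples.
Lod selectLod(float dx, float dy, float dz) {
    float d2 = dx * dx + dy * dy + dz * dz;
    if (d2 < 50.f * 50.f)   return Lod::High;
    if (d2 < 200.f * 200.f) return Lod::Medium;
    if (d2 < 800.f * 800.f) return Lod::Low;
    return Lod::Skip;   // too far / too small on screen: don't render it at all
}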
In general, it's because
The games are being optimal about what they need to render, and
They take special advantage of your hardware.
For instance, one easy optimization you can make involves not actually trying to draw things that can't be seen. Consider a complex scene like a cityscape from Grand Theft Auto IV. The renderer isn't actually rendering all of the buildings and structures. Instead, it's rendering only what the camera can see. If you could fly around to the back of those same buildings, facing the original camera, you would see a half-built hollowed-out shell structure. Every point that the camera cannot see is not rendered -- since you can't see it, there's no need to try to show it to you.
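A rough sketch of that idea: test a bounding volume against the view frustum before ever submitting the object to the renderer (this assumes the six frustum planes have already been extracted from the camera elsewhere):

struct Plane  { float nx, ny, nz, d; };   // plane equation n·p + d = 0, normal pointing inward
struct Sphere { float x, y, z, radius; };

// True if the sphere is at least partly inside all six frustum planes.
bool sphereInFrustum(const Plane (&frustum)[6], const Sphere& s) {
    for (const Plane& p : frustum) {
        float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius)   // entirely behind this plane: cull, don't draw
            return false;
    }
    return true;                // potentially visible: submit it for rendering
}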
Furthermore, optimized instructions and special techniques exist when you're developing against a particular set of hardware, to enable even better speedups.
The other part of your question is why a demo uses so much CPU:
... while a DX demo of a rotating Teapot @ 60fps uses a whopping 30%?
It's common for demos of graphics APIs (like dxdemo) to fall back to what's called a software renderer when your hardware doesn't support all of the features needed to show a pretty example. These features might include things like shadows, reflection, ray-tracing, physics, et cetera.
This mimics the function of a completely full-featured hardware device which is unlikely to exist, in order to show off all the features of the API. But since the hardware doesn't actually exist, it runs on your CPU instead. That's much more inefficient than delegating to a graphics card -- hence your high CPU usage.
3D games are great at tricking your eyes. For example, there is a technique called screen space ambient occlusion (SSAO) which will give a more realistic feel by shadowing those parts of a scene that are close to surface discontinuities. If you look at the corners of your wall, you will see they appear slightly darker than the centers in most cases.
The very same effect can be achieved using radiosity, which is based on rather accurate simulation. Radiosity will also take into account more effects of bouncing lights, etc. but it is computationally expensive - it's a ray tracing technique.
This is just one example. There are hundreds of algorithms for real time computer graphics and they are essentially based on good approximations and typically make a lot of assumptions. For example, spatial sorting must be chosen very carefully depending on the speed, the typical position of the camera, as well as the amount of change to the scene geometry.
These 'optimizations' are huge - you can implement an algorithm efficiently and make it run 10 times faster, but choosing a smart algorithm that produces a similar result ("cheating") can make you go from O(N^4) to O(log(N)).
Optimizing the actual implementation is what makes games even more efficient, but that is only a linear optimization.
Eeeeek!
I know that this question is old, but it's exciting that no one has mentioned VSync!
You compared the CPU usage of the game at 60fps to CPU usage of the teapot demo at 60fps.
Isn't it apparent, that both run (more or less) at exactly 60fps? That leads to the answer...
Both apps run with vsync enabled! This means (dumbed-down) that the rendering framerate is locked to the "vertical blank interval" of your monitor. The graphics hardware (and/or driver) will only render at max. 60fps. 60fps = 60Hz (Hz=per second) refresh rate. So you probably use a rather old, flickering CRT or a common LCD display. On a CRT running at 100Hz you will probably see framerates of up to 100Hz. VSync also applies in a similar way to LCD displays (they usually have a refresh rate of 60Hz).
So, the teapot demo may actually run much more efficiently! If it uses 30% of CPU time (compared to 50% CPU time for GTA IV), then it probably uses less CPU time each frame, and just waits longer for the next vertical blank interval. To compare both apps, you should disable vsync and measure again (you will measure much higher fps for both apps).
Sometimes it's OK to disable vsync (most games have an option in their settings). Sometimes you will see "tearing artefacts" when vsync is disabled.
You can find details of it and why it is used at wikipedia: http://en.wikipedia.org/wiki/Vsync
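If you want to try the comparison suggested above yourself, here is a small sketch assuming an SDL2/OpenGL setup (other frameworks expose equivalent calls):

#include <SDL2/SDL.h>

int main(int argc, char** argv) {
    SDL_Init(SDL_INIT_VIDEO);
    SDL_Window* win = SDL_CreateWindow("fps test",
        SDL_WINDOWPOS_CENTERED, SDL_WINDOWPOS_CENTERED, 800, 600, SDL_WINDOW_OPENGL);
    SDL_GLContext ctx = SDL_GL_CreateContext(win);

    SDL_GL_SetSwapInterval(0);   // 0 = vsync off, 1 = vsync on: measure FPS in both modes

    // ... render loop with SDL_GL_SwapWindow(win) and an FPS counter goes here ...

    SDL_GL_DeleteContext(ctx);
    SDL_DestroyWindow(win);
    SDL_Quit();
    return 0;
}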
Whilst many answers here provide excellent indications of how, I will instead answer the simpler question of why.
GTA4 took $400 million in its first week.
Crytek wrote an extremely impressive graphics demo to allow nVidia to 'show off' at a trade show. The resulting impressions got them the leg up to create what would become Far Cry.
Valve's 2005 revenue and operating profit have been stated as 70 and 55 million USD respectively.
Perhaps the best example (certainly one of the best known) is id Software. They realised very early, in the days of Commander Keen (well before 3D), that coming up with a clever way to achieve something [1] that was graphically superior to the competition, even if it relied on modern hardware (in this case an EGA graphics card!), would make your game stand out. This was true, but they further realised that, rather than having to come up with new games and content themselves, they could licence the technology, thus getting income from others whilst being able to develop the next generation of engine and so leapfrog the competition again.
The abilities of these programmers (coupled with business savvy) are what made them rich.
That said it is not necessarily money that motivates such people. It is likely just as much the desire to achieve, to accomplish. The money they earned in the early days simply means that they now have time to devote to what they enjoy. And whilst many have outside interests almost all still program and try to work out ways to do better than the last iteration.
Put simply the person who wrote the teapot demo likely had one or more of the following issues:
less time
less resources
less reward incentive
less internal and external competition
lesser goals
less talent
The last may sound harsh [2], but clearly there are some who are better than others; bell curves sometimes have extreme ends, and those people tend to be attracted to the corresponding extreme ends of what is done with that skill.
The lesser goals one is actually likely to be the main reason. The target of the teapot demo was just that, a demo. But not a demo of the programmer's skill [3]. It would be a demo of one small facet of a (big) OS, in this case DX rendering.
To those viewing the demo it wouldn't matter if it used way more CPU than required, so long as it looked good enough. There would be no incentive to eliminate waste when there would be no beneficiary. In comparison, a game would love to have spare cycles for better AI, better sound, more polygons, more effects.
1. In that case, smooth scrolling on PC hardware.
2. Likely more than me, so we're clear about that.
3. Strictly speaking it would have been a demo to his/her manager too, but again the drive here would be time and/or visual quality.
Because of a few reasons
3D game engines are highly optimized
most of the work is done by your graphics adapter
50%? Hm, let me guess: you have a dual core and only one core is used ;-)
EDIT: To give a few numbers
2.8 Ghz Athlon-64 with NV-6800 GPU. The results are:
CPU: 72.78 Mflops
GPU: 2440.32 Mflops
Sometimes a scene may have more going on than it appears. For example, a rotating teapot with thousands of vertices, environment mapping, bump mapping, and other complex pixel shaders all being rendered simultaneously amounts to a whole lot of processing. A lot of times these teapot demos are simply meant to show off some sort of special effect. They also may not always make the best use of the GPU when absolute performance isn't the goal.
In a game you may see similar effects but they're usually done in a compromised fashion in effort to maximize the frame rate. These optimizations extend to everything you see in the game. The issue becomes, "How can we create the most spectacular and realistic scene with the least amount of processing power?" It's what makes game programmers some of the best optimizers around.
Scene management. kd-trees, frustum culling, BSPs, hierarchical bounding boxes, partial visibility sets.
LOD. Switching out lower detail versions to substitute in for far away objects.
Impostors. Like LOD, but not even an object - just a picture or 'billboard'.
SIMD (see the sketch after this list).
Custom memory management. Aligned memory, less fragmentation.
Custom data structures (i.e. no STL, relatively minimal templating).
Assembly in places, mainly for SIMD.
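As a small illustration of the SIMD point from the list, here is a sketch using SSE intrinsics to advance four positions per iteration (x86 is assumed, and unaligned loads are used for simplicity):

#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// positions += velocities * dt, four floats at a time.
void integrateSimd(float* pos, const float* vel, std::size_t count, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128 p = _mm_loadu_ps(pos + i);
        __m128 v = _mm_loadu_ps(vel + i);
        _mm_storeu_ps(pos + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    for (; i < count; ++i)       // scalar tail for the leftover elements
        pos[i] += vel[i] * dt;
}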
By all the qualified and good answers given, the one that matters is still missing: the CPU utilization counter of Windows is not very reliable. I guess that this simple teapot demo just calls the rendering function in its idle loop, blocking at the buffer swap.
Now the Windows CPU utilization counter just looks at how much CPU time is spent within each process, but not how this CPU time is used. Try adding a
Sleep(0);
just after returning from the rendering function, and compare.
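In context, that suggestion might look roughly like the following sketch (a Win32-style loop is assumed, and renderTeapot is a hypothetical frame function):

#include <windows.h>

void renderTeapot();   // hypothetical: draws one frame and swaps buffers

void runLoop(volatile bool& running) {
    while (running) {
        renderTeapot();
        Sleep(0);      // yield the rest of the timeslice after each frame,
                       // then watch the process's figure in Task Manager change
    }
}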
In addition, there are many many tricks from an artistic standpoint to save computational power. In many games, especially older ones, shadows are precalculated and "baked" right into the textures of the map. Many times, the artists tried to use planes (two triangles) to represent things like trees and special effects when it would look mostly the same. Fog in games is an easy way to avoid rendering far-off objects, and often, games would have multiple resolutions of every object for far, mid, and near views.
The core of any answer should be this -- the transformations that 3D engines perform are mostly specified as additions and multiplications (linear algebra, no branches or jumps), and the operations for drawing a single frame are often specified in a way that lets many such add-mul jobs be done in parallel. GPU cores are very good at add-muls, and there are dozens or hundreds of add-mul cores.
The CPU is left with doing simple stuff -- like AI and other game logic.
How can a great big PC game like GTA IV use 50% of my CPU and run at 60fps while a DX demo of a rotating Teapot @ 60fps uses a whopping 30%?
While GTA is quite likely to be more efficient than the DX demo, measuring CPU efficiency this way is essentially broken. Efficiency could be defined, e.g., by how much work you do per given time. A simple counterexample: spawn one thread per logical CPU and let a simple infinite loop run on it. You will get CPU usage of 100%, but it is not efficient, as no useful work is done.
This also leads to an answer: how can a game be efficient? When programming "great big games", a huge effort is dedicated to optimize the game in all aspects (which nowadays usually also includes multi-core optimizations). As for the DX demo, its point is not running fast, but rather demonstrating concepts.
I think you should take a look at GPU utilisation rather than CPU... I bet the graphics card is much busier in GTA IV than in the Teapot sample (it should be practically idle).
Maybe you could use something like this monitor to check that:
http://downloads.guru3d.com/Rivatuner-GPU-Monitor-Vista-Sidebar-Gadget-download-2185.html
Also the framerate is something to consider, maybe the teapot sample is running at full speed (maybe 1000fps) and most games are limited to the refresh frequency of the monitor (about 60fps).
Look at the answer on vsync; that is why they are running at the same frame rate.
Secondly, CPU usage is misleading in a game. A simplified explanation is that the main game loop is just an infinite loop:
while(1) {
update();
render();
}
Even if your game (or in this case, teapot) isn't doing much, you are still eating up CPU in your loop.
The 50% CPU in GTA is "more productive" than the 30% in the demo: the demo more than likely isn't doing much at all, while GTA is updating tons of details. Even adding a Sleep(10) to the demo will probably drop its CPU usage by a ton.
Lastly, look at GPU usage. The demo is probably taking <1% on a modern video card, while GTA will probably be taking the majority during gameplay.
In short, your benchmarks and measurements aren't accurate.
The DX teapot demo is not using 30% of the CPU doing useful work. It's busy-waiting because it has nothing else to do.
From what I know of the Unreal series, some conventions are broken, like encapsulation. Code is compiled to bytecode or directly into machine code depending on the game. Also, objects are rendered and packaged in the form of meshes, and things such as textures, lighting and shadows are precalculated, whereas pure 3D animation requires this to be done in real time. When the game is actually running there are also some optimizations, such as rendering only the visible parts of an object and displaying texture detail only when close up. Finally, it's probable that video games are designed to get the best out of a platform at a given time (e.g. Intel x86 MMX/SSE, DirectX, ...).
I think there is an important part of the answer missing here. Most of the answers tell you to "Know your data". The fact is that you must, in the same way and with the same degree of importance, also know your:
CPU (clock and caches)
Memory (frequency and latency)
Hard drive (in terms of speed and seek times)
GPU (#cores, clock and its Memory/Caches)
Interfaces: Sata controllers, PCI revisions, etc.
BUT, on top of that, with current modern computers, you would never be able to play a real 1080p video at >>30 fps (a single 1080p image at 64 bits would take 15 000 KB / 14.9 MB). The reason for that is the sampling/precision. A video game would never use double precision (64 bits) for pixels, images, data, etc., but rather a lower custom precision (~4-8 bits), and sometimes less precision rescaled with interpolation techniques to allow reasonable computation time.
There are other techniques as well, such as clipping the data (both with the OpenGL standard and in software implementations), data compression, etc. Keep in mind also that current GPUs can be >300 times faster than current CPUs in terms of raw hardware capability, yet a good programmer may only get a 10-20x factor out of that, unless the problem is fully optimized and completely parallelizable (particularly task-parallelizable).
From experience, I can tell you that optimization is like an exponential curve: the time required to reach optimal performance can be enormous.
So, to get back to the teapot: you should look at how its geometry is represented, sampled and with what precision, versus what you see in GTA 5 in terms of geometry/textures and, most importantly, the details (precision, sampling, etc.).

Why is GUI code so computationally expensive?

All you Stackoverflowers,
I was wondering why GUI code is responsible for sucking away many, many CPU cycles. In principle, the graphical rendering is far less complex than Doom (although most corporate GUIs will introduce lots of window dressing). The event handling layer is also seemingly a heavy cost; however, it seems that a well-written implementation should switch between contexts efficiently on modern processors with a lot of memory/cache.
If anybody has run a profiler on their big GUI application, or a common API itself, I'm interested in where the bottlenecks lie.
Possible explanations (that I imagine) may be:
High levels of abstraction between hardware and application interface
Lots of levels of indirection to the correct code to execute
Low priority (compared to other processes)
Misbehaving applications flooding API with calls
Excessive object orientation?
Complete poor design choices in API (not just issues, but design philosophy)
Some GUI frameworks are much better than others, so I'd like to hear varied perspectives. For example, the Unix/X11 system is much different than Windows and even than WinForms.
Edit: Now a community wiki - go for it. I have one more thing to add -- I'm an algorithms guy in school and would be interested if there are inefficient algorithms in GUI code and which they are. Then again, it's probably just the implementation overhead.
I've no idea generally, but I'd like to add another item to your list - font rendering and calculations. Finding vector glyphs in a font and converting them to bitmap representations with anti-aliasing is no small task. And often it needs to be done twice - first to calculate the width/height of the text for positioning, and then actually drawing the text at the right coordinates.
Also, most drawing code today relies on clipping mechanisms to update just a part of the GUI. So, if just one part needs to be redrawn, the code actually redraws the whole window behind the scenes, and then takes just the needed part to actually update.
Added:
In the comments I found this:
I'm also very interested in this. It can't be that the gui is rendered using only the cpu because if you don't have proper drivers for your gfx-card, desktop graphics render incredibly slow. If you have gfx-drivers however desktop-gfx go kinda fast but never as fast as a directx/opengl app.
Here's the deal as I understand it: every graphic card out there today supports a generic interface for drawing. I'm not sure if it's called "VESA", "SVGA", or if those are just old names from the past. Anyway, this interface involves doing everything through interrupts. For every pixel there is an interrupt call. Or something like that. The proper VGA driver however is able to take advantage of DMA and other enhancements that make the whole process WAY less CPU-intensive.
Added 2: Ah, and for OpenGL/DirectX - that's another feature of today's graphics cards. They are optimized for 3D operations in exclusive mode. That's why the speed. The normal GUI just utilizes basic 2D drawing procedures. So it gets to send the contents of the whole screen every time it wants an update. 3D applications however send a bunch of textures and triangle definitions to the VRAM (video-RAM) and then just reuse them for drawing. They just say something like "take the triangle set #38 with the texture set #25 and draw them". All these things are cached in the VRAM so this is again way faster.
I'm not sure, but I would suspect that the modern 3D-accelerated GUIs (Vista Aero, compiz on Linux, etc.) also might take advantage of this. They could send common bitmaps to the VGA up front and then just reuse them directly from the VRAM. Any application-drawn surfaces however would still need to be sent directly every time for updates.
Added 3: More ideas. :) The modern GUI's for Windows, Linux, etc. are widget-oriented (that's control-oriented for Windows speakers). The problem with this is that each widget has its own drawing code and associated drawing surface (more or less). When the window needs to get redrawn, it calls the drawing code for all its child-widgets, who in turn call the drawing code for their child-widgets, etc.. Every widget redraws its whole surface, even though some of it is obscured by other widgets. With above mentioned clipping techniques some of this drawn information is immediately discarded to reduce flickering and other artifacts. But still it's lots of manual drawing code that includes bitmap blitting, stretching, skewing, drawing lines, text, flood-filling, etc.. And all this gets translated to a series of putpixel calls that get filtered through clipping filters/masks and other stuff. Ah, yes, and alpha blending has also become popular today for nice effects which means even more work. So... yes, you could say this is because of lots of abstraction and indirection. But... could you really do it any better? I don't think so. Only 3D techniques might help, because they take advantage of GPU for alpha-calculations and clipping.
Let's begin by saying that writing libraries is much harder than writing stand-alone code. The requirement that your abstraction be reusable in as many contexts as possible, including contexts which you haven't thought of yet, makes the task challenging even for experienced programmers.
Amongst libraries, writing a GUI toolkit library is a famously difficult problem. This is because the programs which use GUI libraries range over a very wide variety of domains with very different needs. Mr Why and Martin DeMollo discussed the requirements placed on GUI libraries a little while ago.
Writing GUI widgets themselves is difficult because computer users are very sensitive to minute details of the behavior of the interface. Non-native widgets never feel right, do they? In order to get non-native widgets right -- in order to get any widget right, in fact -- you need to spend an inordinate amount of time tweaking the details of the behavior.
So, GUIs are slow because of the inefficiencies introduced by the abstraction mechanisms used to create highly-reusable components, added to the shortness of time available to optimize the code once so much time has been spent just getting the behavior right.
Uhm, that's quite a lot.
The simplest, but probably most obvious, answer is that the programmers behind these GUI apps are really bad programmers. You can go a long way in writing code which does the most bizarre things and it will be faster, but few people seem to care how to do this, or they deem it to be an expensive, non-profitable waste of effort.
To set things straight, off-loading computations to the GPU won't necessarily fix any problems. The GPU is just like the CPU except it's less general-purpose and more of a data-parallel processor. It can do graphics computations exceptionally well. Whatever graphics API/OS and driver combination you have doesn't really matter that much... well OK, with Vista as an example, they changed the desktop composition engine. This engine is far better at compositing only that which has changed, and since the number one bottleneck for GUI apps is redrawing, that is a neat optimization strategy: virtualize your computational needs and only update the smallest change every time.
Win32 sends WM_PAINT messages to windows when they need to be redrawn; this can be a result of windows occluding each other. However, it's up to the window itself to figure out what has actually changed. More often than not, nothing changed, or the change was trivial enough that it could simply have been performed on top of whatever topmost surface you had.
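For instance, a window can at least restrict its drawing to the update region Windows hands it. A minimal sketch of a WM_PAINT helper (the real drawing is elided; FillRect is just a placeholder):

#include <windows.h>

// Called from the window procedure's WM_PAINT branch: repaint only the
// rectangle Windows reports as invalid, not the whole client area.
void onPaint(HWND hwnd) {
    PAINTSTRUCT ps;
    HDC dc = BeginPaint(hwnd, &ps);   // ps.rcPaint = the invalidated region
    FillRect(dc, &ps.rcPaint, (HBRUSH)(COLOR_WINDOW + 1));
    EndPaint(hwnd, &ps);
}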
This kind of graphics handling doesn't necessarily exist today. I would say that people have refrained from writing really efficient and virtualizing rendering solutions because the benefit/cost ratio is rather poor (low benefit, high cost).
Something Windows Presentation Foundation (WPF) does, which I think is far superior to most other GUI APIs, is that it splits layout updates and rendering updates into two separate passes. And while WPF is managed code, the rendering engine is not. What happens with rendering is that the managed WPF layer builds a command queue (this is what DirectX and OpenGL do) which is then handed off to the native rendering engine. What's a bit more elegant here is that WPF will then try to retain any computation which didn't change the visual state. A trick, if you will, where you avoid costly rendering calls for things that don't have to be rendered (virtualizing).
In contrast to WM_PAINT which tells a Win32 window to repaint itself a WPF app would check what parts of that window requires repainting and only repaint the smallest change.
Now WPF is not supreme, it's a solid effort from Microsoft but it's not the holy grail yet... the code which runs the pipeline could still be improved and the memory footprint of any managed app is still more than I would want. But I hope this is the kind of answer you are looking for.
WPF is able to do some things asynchronously rather decently, which is a huge deal if you want to make a really responsive, low-latency/low-CPU UI. Asynchronous operation is about more than offloading work onto a different thread.
To summarize: a slow and expensive GUI means too much repainting, and the kind of repainting that is very expensive, i.e. repainting the entire surface area.
It does to some degree depend on the language. You might have noticed that Java and RealBasic applications are a fair bit slower than their C-based (C++, C#, Objective-C) counterparts.
However GUI applications are much more complex than command line apps. The Terminal window needs only to draw a simple window that doesn't support buttons.
There are also multiple loops for extra inputs and features.
I think that you can find some interesting thoughts on this topic in "Window System Design: If I had it to do over again in 2002" by James Gosling (the Java guy, also known for his work on pre-X11 windowing systems). Available online here[pdf].
The article focuses on the positive side (how to make it fast), not on the negative side (what's making it slow), but it is still a good read on the topic.
