DirectX9 - Efficiently Drawing Sprites - performance

I'm trying to create a platformer game, and I am taking various sprite blocks, and piecing them together in order to draw the level. This requires drawing a large number of sprites on the screen every single frame. A good computer has no problem handling drawing all the sprites, but it starts to impact performance on older computers. Since this is NOT a big game, I want it to be able to run on almost any computer. Right now, I am using the following DirectX function to draw my sprites:
D3DXVECTOR3 center(0.0f, 0.0f, 0.0f);
D3DXVECTOR3 position(static_cast<float>(x), static_cast<float>(y), z);
(my LPD3DXSPRITE object)->Draw((sprite texture pointer), NULL, &center, &position, D3DCOLOR_ARGB(a, r, g, b));
Is there a more efficient way to draw these pictures on the screen? Is there a way that I can use less complex picture files (I'm using regular png's right now) to speed things up?
To sum it up: What is the most performance friendly way to draw sprites in DirectX? thanks!

The ID3DXSPRITE interface you are using is already pretty efficient. Make sure all your sprite draw calls happen in one batch if possible between the sprite begin and end calls. This allows the sprite interface to arrange the draws in the most efficient way.
For extra performance you can load multiple smaller textures in to one larger texture and use texture coordinates to get them out. This makes it so textures don't have to be swapped as frequently. See:
http://nexe.gamedev.net/directknowledge/default.asp?p=ID3DXSprite
The file type you are using for the textures does not matter as long as they are are preloaded into textures. Make sure you load them all in to textures once when the game/level is loading. Once you have loaded them in to textures it does not matter what format they were originally in.
If you still are not getting the performance you want, try using PIX to profile your application and find where the bottlenecks really are.
Edit:
This is too long to fit in a comment, so I will edit this post.
When I say swapping textures I mean binding them to a texture stage with SetTexture. Each time SetTexture is called there is a small performance hit as it changes the state of the texture stage. Normally this delay is fairly small, but can be bad if DirectX has to pull the texture from system memory to video memory.
ID3DXsprite will reorder the draws that are between begin and end calls for you. This means SetTexture will typically only be called once for each texture regardless of the order you draw them in.
It is often worth loading small textures into a large one. For example if it were possible to fit all small textures in to one large one, then the texture stage could just stay bound to that texture for all draws. Normally this will give a noticeable improvement, but testing is the only way to know for sure how much it will help. It would look terrible, but you could just throw in any large texture and pretend it is the combined one to test what performance difference there would be.

I agree with dschaeffer, but would like to add that if you are using a large number different textures, it may better to smush them together on a single (or few) larger textures and adjust the texture coordinates for different sprites accordingly. Texturing state changes cost a lot and this may speed things up on older systems.

Related

Dividing a sphere into multiple texture

I have a sphere with texture of earth that I generate on the fly with the canvas element from an SVG file and manipulate it.
The texture size is 16384x8192 , and less than this - it's look blurry on close zoom.
But this is a huge texture size and causing memory problems... (But it's look very good when it is working)
I think a better approach would be to split the sphere into 32 separated textures, each in size of 2048x2048
A few questions:
How can I split the sphere and assign the right textures?
Is this approach better in terms of memory and performance from a single huge texture?
Is there a better solution?
Thanks
You could subdivide a cube, and cubemap this.
Instead of having one texture per face, you would have NxN textures. 32 doesn't sound like a good number, but 24 for example does, (6x2x2).
You will still use the same amount of memory. If the shape actually needs to be spherical you can further subdivide the segments and normalize the entire shape (spherify it).
You probably cant even use such a big texture anyway.
notice the top sphere (cubemap, ignore isocube):
Typically, that's not something you'd do programmatically, but in a 3D program like Blender or 3D max. It involves some trivial mesh separation, UV mapping and material assignment. One other approach that's worth experimenting with would be to have multiple materials but only one mesh - you'd still get (somewhat) progressive loading. BUT
Are you sure you'd be better off with "chunks" loading sequentially rather than one big texture taking a huge amount of time? Sure, it'll improve a bit in terms of timeouts and caching, but the tradeoff is having big chunks of your mesh be textureless, which is noticeable and unasthetic.
There are a few approaches that would mitigate your problem. First, it's important to understand that texture loading optimization techniques - while common in game engines - aren't really part of threejs or what it's built for. You'll never get the near-seamless LODs or GPU optimization techniques that you'll get with UE4 or Unity. Furthermore webGL - while having made many strides over the past decade - is not ideal for handling vast texture sizes, not at the GPU level (since it's based on OpenGL ES, suited primarily for mobile devices) and certainly not at the caching level - we're still dealing with broswers here. You won't find a lot of webGL work done with vast textures of the dimensions you refer to.
Having said that,
A. A loader will let you do other things while your textures are loading so your user isn't staring at an 'unfinished mesh'. It lets you be pretty clever with dynamic loading times and UX design. Additionally, take a look at this gist to give you an idea for what a progressive texture loader could look like. A much more involved technique, that's JPEG specific, can be found here but I wouldn't approach it unless you're comfortable with low-level graphics programming.
B. Threejs does have a basic implementation of LOD although I haven't tinkered with it myself and am not sure it's useful for textures; that said, the basic premise to inquire into is whether you can load progressively higher-resolution files on a per-need basis, just like Google Earth does it for example.
C. This is out of the scope of your question - but I'd look into what happens under the hood in Unity's webgl export (which is based on threejs), and what kind of clever tricks are being employed there for similar purposes.
Finally, does your project have to be in webgl? For something ambitious and demanding, sometimes "proper" openGL / DX makes much more sense.

SDL accelerated rendering

I am trying to understand the whole 2D accelerated rendering process using SDL 2.0.
So my question is which would be the most efficient way to draw circles in the screen and why?
Some ways would be:
First to create a software surface and then draw the necessary pixels on that surface then create a texture out of that surface and lastly copy that texture to the rendering target.
Also another implementation would be to draw a circle using multiple times SDL_RenderDrawLine.And I think this is the way it is being implemented in SDL 2.0 gfx
Or there is a more efficient way to do all of this?
Take this question more generally in means of if I would wanted to draw other shapes manually, which probably, couldn't be rendered easily with the 2D rendering API that SDL provides(using draw line or rectangle).
With the example of circles this is a fairly complicated question, it is more based on the visual quality you wish to achieve which will drive performance. Drawing lots of short lines will vary vastly based on how close to a circle you wish to get, if you are happy to use say, 60 lines, which will work on small shapes nearly seamlessly but if scaled up will begin to appear not to be a circle, the performance will likely be better (depending on the user's hardware). Note also SDL_RenderDrawLines will be much much faster for many lines as it avoids lots of context switches for rendering calls.
However if you need a very accurate circle with thousands of lines to get a good approximation it will be faster to simply use a bitmap and scale and blit it. This will also give you a 'smoother' feel to the circle.
In my personal opinion I do not think the hardware accelerated render API has much use outside of some special uses such as graph rendering and perhaps very simple GUI drawing. For anything more complex I would usually use bitmap based drawing.
With regards to the second part, it again depends on the accuracy of any arcs you need to draw. If you can easily approximate the shape into a few tens of lines it will be fast, otherwise the pixel method is better.

OpenGL tile rendering: most efficient way?

I am creating a tile-based 2D game as a way of learning basic "modern" OpenGL concepts. I'm using shaders with OpenGL 2.1., and am familiar with the rendering pipeline and how to actually draw geometry on-screen. What I'm wondering is the best way to organize a tilemap to render quickly and efficiently. I have thought of several potential methods:
1.) Store the quad representing a single tile (vertices and texture coordinates) in a VBO and render each tile with a separate draw* call, translating it to the correct position onscreen and using uniform2i to give the location in the texture atlas for that particular tile;
2.) Keep a VBO containing every tile onscreen (already-computed screen coordinates and texture atlas coordinates), using BufferSubData to update the tiles every frame but using a single draw* call;
3.) Keep VBOs containing static NxN "chunks" of tiles, drawing however many chunks of tiles are at least partially visible onscreen and translating them each into position.
*I'd like to stay away from the last option if possible unless rendering chunks of 64x64 is not too inefficient. Tiles are loaded into memory in blocks of that size, and even though only about 20x40 tiles are visible onscreen at a time, I would have to render up to four chunks at once. This method would also complicate my code in several other ways.
So, which of these is the most efficient way to render a screen of tiles? Are there any better methods?
You could do any one of these and they would probably be fine; what you're proposing to render is very, very simple.
#1 will definitely be worse in principle than the other options, because you would be drawing many extremely simple “models” rather than letting the GPU do a whole lot of batch work on one draw call. However, if you have only 20×40 = 800 tiles visible on screen at once, then this is a trivial amount of work for any modern CPU and GPU (unless you're doing some crazy fragment shader).
I recommend you go with whichever is simplest to program for you, so that you can continue work on your game. I imagine this would be #1, or possibly #2. If and when you find yourself with a performance problem, do whichever of #2 or #3 (64×64 sounds like a fine chunk size) lets you spend the least CPU time on your program's part of drawing (i.e. updating the buffer(s)).
I've been recently learning modern OpenGL myself, through OpenGL ES 2.0 on Android. The OpenGL ES 2.0 Programming Guide recommends an "array of structures", that is,
"Store vertex attributes together in a single buffer. The structure represents all attributes of a vertex and we have an array of these attributes per vertex."
While this may seem like it would initially consume a lot of space, it allows for efficient rendering using VBOs and flexibility in texture mapping each tile. I recently did a tiled hex grid using interleaved arrays containing vertex, normals, color, and texture data for a 20x20 tile hex grid on a Droid 2. So far things are running smoothly.

OpenGL - Fast Textured Quads?

I am trying to display as many textured quads as possible at random positions in the 3D space. In my experience so far, I cannot display even a couple of thousands of them without dropping the fps significantly under 30 (my camera movement script becomes laggy).
Right now I am following an ancient tutorial. After initializing OpenGL:
glEnable(GL_TEXTURE_2D);
glShadeModel(GL_SMOOTH);
glClearColor(0, 0, 0, 0);
glClearDepth(1.0f);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_NICEST);
I set the viewpoint and perspective:
glViewport(0,0,width,height);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(60.0f,(GLfloat)width/(GLfloat)height,0.1f,100.0f);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
Then I load some textures:
glGenTextures(TEXTURE_COUNT, &texture[0]);
for (int i...){
glBindTexture(GL_TEXTURE_2D, texture[i]);
glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MAG_FILTER,GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MIN_FILTER,GL_LINEAR_MIPMAP_NEAREST);
gluBuild2DMipmaps(GL_TEXTURE_2D,3,TextureImage[0]->w,TextureImage[0]->h,GL_RGB,GL_UNSIGNED_BYTE,TextureImage[0]->pixels);
}
And finally I draw my GL_QUADS using:
glBindTexture(GL_TEXTURE_2D, q);
glTranslatef(fDistanceX,fDistanceZ,-fDistanceY);
glBegin(GL_QUADS);
glNormal3f(a,b,c);
glTexCoord2f(d, e); glVertex3f(x1, y1, z1);
glTexCoord2f(f, g); glVertex3f(x2, y2, z2);
glTexCoord2f(h, k); glVertex3f(x3, y3, z3);
glTexCoord2f(m, n); glVertex3f(x4, y4, z4);
glEnd();
glTranslatef(-fDistanceX,-fDistanceZ,fDistanceY);
I find all that code very self explaining. Unfortunately that way to do things is deprecated, as far as I know. I read some vague things about PBO and vertexArrays on the internet, but i did not find any tutorial on how to use them. I don't even know if these objects are suited to realize what I am trying to do here (a billion quads on the screen without a lag). Perhaps anyone here could give me a definitive suggestion, of what I should use to achieve the result? And if you happen to have one more minute of spare time, could you give me a short summary of how these functions are used (just as i did with the deprecated ones above)?
Perhaps anyone here could give me a definitive suggestion, of what I should use to achieve the result?
What is "the result"? You have not explained very well what exactly it is that you're trying to accomplish. All you've said is that you're trying to draw a lot of textured quads. What are you trying to do with those textured quads?
For example, you seem to be creating the same texture, with the same width and height, given the same pixel data. But you store these in different texture objects. OpenGL does not know that they contain the same data. Therefore, you spend a lot of time swapping textures needlessly when you render quads.
If you're just randomly drawing them to test performance, then the question is meaningless. Such tests are pointless, because they are entirely artificial. They test only this artificial scenario where you're changing textures every time you render a quad.
Without knowing what you are trying to ultimately render, the only thing I can do is give general performance advice. In order (ie: do the first before you do the later ones):
Stop changing textures for every quad. You can package multiple images together in the same texture, then render all of the quads that use that texture at once, with only one glBindTexture call. The texture coordinates of the quad specifies which image within the texture that it uses.
Stop using glTranslate to position each individual quad. You can use it to position groups of quads, but you should do the math yourself to compute the quad's vertex positions. Once those glTranslate calls are gone, you can put multiple quads within the space of a single glBegin/glEnd pair.
Assuming that your quads are static (fixed position in model space), consider using a buffer object to store and render with your quad data.
I read some vague things about PBO and vertexArrays on the internet, but i did not find any tutorial on how to use them.
Did you try the OpenGL Wiki, which has a pretty good list of tutorials (as well as general information on OpenGL)? In the interest of full disclosure, I did write one of them.
I heard, in modern games milliards of polygons are rendered in real time
Actually its in the millions. I presume you're German: "Milliarde" translates into "Billion" in English.
Right now I am following an ancient tutorial.
This is your main problem. Contemporary OpenGL applications don't use ancient rendering methods. You're using the immediate mode, which means that you're going through several function calls to just submit a single vertex. This is highly inefficient. Modern applications, like games, can reach that high triangle counts because they don't waste their CPU time on calling as many functions, they don't waste CPU→GPU bandwidth with the data stream.
To reach that high counts of triangles being rendered in realtime you must place all the geometry data in the "fast memory", i.e. in the RAM on the graphics card. The technique OpenGL offers for this is called "Vertex Buffer Objects". Using a VBO you can draw large batches of geometry using a single drawing call (glDrawArrays, glDrawElements and their relatives).
After getting the geometry out of the way, you must be nice to the GPU. GPUs don't like it, if you switch textures or shaders often. Switching a texture invalidates the contents of the cache(s), switching a shader means stalling the GPU pipeline, but worse it means invalidating the execution path prediction statistics (the GPU takes statistics which execution paths of a shader are the most probable to be executed and which memory access patterns it exhibits, this used to iteratively optimize the shader execution).

graphics: best performance with floating point accumulation images

I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating point image buffer, converting to unsigned chars at the last minute then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slow down is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32's in a fixed point 16.16 configuration
Optimize float operations using SSE2 assembly (image buffer is a 1024*768*3 array of floats)
Use OpenGL Accumulation Buffer instead of float array
Use OpenGL floating-point FBO's instead of float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data that you move around, and the cache can't help you much because the image is much larger than your cache. Let's assume you touch every pixel five times. If so you move 45mb of data in and out of the slow main memory. 45mb does not sound like much data, but consider that almost each memory access will be a cache miss. The CPU will spend most of the time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE for non temporary loads and stores may help, but they will complicate your task quite a bit (you have to align your reads and writes).
Try break up your rendering into tiles. E.g. do everything on smaller rectangles (256*256 or so). The idea behind this is, that you actually get a benefit from the cache. After you've cleared your rectangle for example the entire bitmap will be in the cache. Rendering and converting to bytes will be a lot faster now because there is no need to get the data from the relative slow main memory anymore.
Last resort: Reduce the resolution of your particle effect. This will give you a good bang for the buck at the cost of visual quality.
The best solution is to move the rendering onto the graphic card. Render to texture functionality is standard these days. It's a bit tricky to get it working with OpenGL because you have to decide which extension to use, but once you have it working the performance is not an issue anymore.
Btw - do you really need floating point render-targets? If you get away with 3 bytes per pixel you will see a nice performance improvement.
It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (eg, accumulate their position per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), rendering to an accumulation buffer and back, or simply generate a bunch of additional sprites on the CPU representing the trail and throw them at the rasterizer.
Try to replace the manual code with sprites: An OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten of them in the same place to get the full glow).
If you by "manual" mean that you are using the CPU to poke pixels, I think pretty much anything you can do where you draw textured polygons using OpenGL instead will represent a huge speedup.

Resources