XNA Particle system performance - performance

I have a particle system using basic spritebatch, where the particles are being created and destroyed based on decremental alpha value till 0.
The perfomance of the system is quite poor on pc, and very poor on xbox, with about a hundred particles on screen before significant fps slow down, I've read around regarding how to improve performance but does anyone have any tips on how to implment them, for exmaple what is the best way to - reuse particles rather than kill()? Does the image size of each particle make a difference? if I don't rotate each particle will this help?
I've played around with each of these suggestions but don't receive any significant improvement - does anyone have any advice - is it worth going gpu rather than cpu based?

From what I recall destroying and creating particles slows down the performance substantially.
You might want to reuse particles.
Not sure about image size or rotation drastically reducing performance as long as the image isn't substantially large.
I would have an array with swapping dead particles to the end of the active particles therefore processing only active particles.
For example:
Make an array of MAX particles;
When you need a particle grab particle_array[count];
Increment count.
When a particle dies, decrement count, swap the particle with particle_array[count];
Update only count particles;
Hope this helps.

I entirely agree with subsonic's answer, but I wanted to expand on it.
Creating a new particle every time and destroying (or letting go of) old particles creates a LOT of garbage. The garbage avoidance mantra in C#, particularly on Xbox (due to the compact framework's poor garbage handling), is to NEVER new a class type in your update/draw loop. ALWAYS pre-create into pools. Shawn H explains: http://blogs.msdn.com/b/shawnhar/archive/2007/07/02/twin-paths-to-garbage-collector-nirvana.aspx.
One other thing to consider is that you using multiple textures can cause slowdown for sprite batch due to multiple draw calls. Try to merge multiple particle textures into one and use the source rectangle SpriteBatch.Draw parameter.

Related

Marchingcube planet vegetation; Large amount of meshes performance

I would like to inquire some insights into rendering a large amount of meshes with the best performance.
I'm working on generative mine-able planets incorporating marching cube chunked terrain. Currently I'm trying to add vegetation/rocks to spruce up the planet surfaces (get it?). I am using the actual chunk loading to (next to the terrain) also load smaller rocks and some grass stuff. That runs pretty well. I am having issues with tree's and boulders (visible on the entire planet surface but LODed, obviously).
Testing different methods have lead me on the road of;
Custom shaders with material clipping based on camera distance; Works okay for about half a million trees made from 2 perpendicular planes (merged into one single bufferGeometry). But those 'models' are not good enough.
THREE.LOD's; Which sucks up fps like crazy, to slow for large amounts of meshes.
THREE.InstancedMesh's; Works pretty well, however I'd have to disable frustumCulling, since the originpoint of the vegetation is not always on screen. Which makes it inefficient.
THREE.InstancedGeometry combined with the custom clipping shaders; I had high hopes for this, it gives the best performance while using actual models. But it still eats up half of the frameRate. The vertexshader still has to process all the vertices to determine if it is within clipping range. Also the same frustumCulling issue applies.
Material.clippingPlanes? Combined with InstancedMeshes; This is what I'm trying now, did not have any luck with it, still trying to figure out exactly how that works..
Does anyone have experience with rendering large amounts of meshes or has some advice for me? Is there a technique I do not yet know about?
Would it help to split up the trees in multiple InstancedMeshes? Would the clippingPlanes give me better performance?

OpenGL ES, Z-Buffer, 2D sprites, discard, performance

I have a retro-looking 2D game with a lot of sprites (reminiscent of Sega's Super Scaler arcades) which do not use semi-transparency. I have thought about using the Z-Buffer over sorting to simplify things. Ok, but by default writes are done to the Z-buffer even though alpha is zero, giving the effect illustrated here:
http://i.stack.imgur.com/ubLlp.png
Now, since I'm in OpenGL ES 2, I don't have alpha testing, so from what I understand my only possibility is to discard the pixel from the fragment shader if alpha is 0 so that it doesn't get written to the Z-Buffer. But in terms of performance this is SO wrong: not only the if is slow, but the discard basically kills the purpose since it disables early depth testing and the result is way worse than doing it in software.
if (val.a < 0.5) {
discard;
}
Is there any other solution I could use which would not kill the performance? Do all 2D games sort sprites themselves and not use depth buffer?
It's a tradeoff really. If you let the z-buffer do the sorting and use discard in your shaders then it's more expensive on the GPU because of branching and late depth testing as you say.
If you do the depth sorting yourself, then you'll find it's harder to issue your draw calls in an optimal order (e.g. you'll keep having to change texture). Draw calls on GLES2 have a very significant CPU hit on lower end devices and the count will probably go up.
If performance is a big concern, then probably the second option is better if you do it in conjunction with a big effort on the texture atlasing front to minimize your draw call count, this might be particularly effective if your sprites are low resolution retro sprites because you'll be able to get a lot of sprites per texture atlas. It isn't a clear winner by any stretch and I can imagine that different games take different approaches.
Also, you should take into account that the vast majority of target hardware is going to perform just fine whichever path you choose, and maybe you should just choose the one that is faster to implement and makes your code simpler (which is probably letting the z-buffer do the sorting).
If you fancy a technical challenge, I've often thought the best approach might be divide up your sprites into fully opaque sections and sections with transparency and render the two parts as separate meshes (they won't be quads any more). You'd have to do a lot of preprocessing and draw a lot more triangles, but by being able to do some rendering with fully-opaque parts then you can take advantage of the hidden-surface-removal tech in all iOS devices and lots of Android devices. Certainly by doing this you should be able to reduce your fill rate burden, but at a cost of increased draw calls, and there might be an unnecessarily high amount of added complexity to your code and your tools.

Performance: Offline surface for hit testing vs Triangle Intersection

First, a disclaimer. I'm well aware of the std answer for X vs Y - "it depends". However, I'm working on a very general purpose product, and I'm trying to figure out "it depends on what". I'm also not really able to test the wide variety of hardware, so try-and-see is an imperfect measure at best.
I've been doing some googling, and I've found very little reference to using an offline render target/surface for hittesting. I'm not sure of the nomenclature, what I'm talking about though is using very simple shaders to render a geometry ID (for example) to a buffer, then reading the pixel value under the mouse to see what geometry is directly under the mouse pointer.
I have however, found 101 different tutorials on doing triangle intersection, a la D3DXIntersect & DirectX sample "Pick".
I'm a little curious on this - I would have thought using HW was the standard method. By all rights, it should be many orders of magnitude faster, and should scale far better.
I'm relatively new to graphics programming, so here are my assumptions, for you to disabuse.
1) A simple shader that does geometry transform & writes a Node + UV value should be nearly free.
2) The main cost in the HW pick method would be the buffer fetch, when getting the rendered surface back off the GPU for the CPU to read over. I have no idea how costly this is. us? ms? seconds? minutes?
3) This may be obvious, but I am assuming that Triangle Intersection (D3DXIntersect) is only possible on the CPU.
4) A possible cost people want to avoid is the cost of the extra render target(s) (zbuffer+surface). I'm a'guessing about 10 megs for 1024x1280 (std screen size?). This is acceptable to me, although if I could render a smaller surface (trade accuracy for memory) I would do so (is that possible?).
This all leads to a few thoughts.
1) For very simple scenes, triangle intersection may be faster. Quite what is simple/complex is hard to guess at this point. I'm looking at possible 100s of tris to 10000s. Probably not much more than that.
2) The HW buffer needs to be rendered regardless of whether or not its used (in my case). However, it can be reused without cost (ie, click-drag, where mouse tracks across a static scene)
2a) Possibly, triangle intersection may be preferable if my scene updates every frame, or if I have limited mouse interaction.
Now I've finished writing, I see a similar question has been asked: (3D Graphics Picking - What is the best approach for this scenario). My problem with this is (a) why would you need to re-render your picking surface for click-drag as your scene hasn't actually changed, and (b) wouldn't it still be faster than triangle intersection?
I welcome thoughts, criticism, and any manner of side-tracking :-)

Efficient way of drawing in OpenGL ES

In my application I draw a lot of cubes through OpenGL ES Api. All the cubes are of same dimensions, only they are located at different coordinates in space. I can think of two ways of drawing them, but I am not sure which is the most efficient one. I am no OpenGL expert, so I decided to ask here.
Method 1, which is what I use now: Since all the cubes are of identical dimensions, I calculate vertex buffer, index buffer, normal buffer and color buffer only once. During a refresh of the scene, I go over all cubes, do bufferData() for same set of buffers and then draw the triangle mesh of the cube using drawElements() call. Since each cube is at different position, I translate the mvMatrix before I draw. bufferData() and drawElements() is executed for each cube. In this method, I probably save a lot of memory, by not calculating the buffers every time. But I am making lot of drawElements() calls.
Method 2 would be: Treat all cubes as set of polygons spread all over the scene. Calculate vertex, index, color, normal buffers for each polygon (actually triangles within the polygons) and push them to graphics card memory in single call to bufferData(). Then draw them with single call to drawElements(). The advantage of this approach is, I do only one bindBuffer and drawElements call. The downside is, I use lot of memory to create the buffers.
My experience with OpenGL is limited enough, to not know which one of the above methods is better from performance point of view.
I am using this in a WebGL app, but it's a generic OpenGL ES question.
I implemented method 2 and it wins by a landslide. The supposed downside of high amount of memory seemed to be only my imagination. In fact the garbage collector got invoked in method 2 only once, while it was invoked for 4-5 times in method 1.
Your OpenGL scenario might be different from mine, but if you reached here in search of performance tips, the lesson from this question is: Identify the parts in your scene that don't change frequently. No matter how big they are, put them in single buffer set (VBOs) and upload to graphics memory minimum number of times. That's how VBOs are meant to be used. The memory bandwidth between client (i.e. your app) and graphics card is precious and you don't want to consume it often without reason.
Read the section "Vertex Buffer Objects" in Ch. 6 of "OpenGL ES 2.0 Programming Guide" to understand how they are supposed to be used. http://opengles-book.com/
I know that this question is already answered, but I think it's worth pointing out the Google IO presentation about WebGL optimization:
http://www.youtube.com/watch?v=rfQ8rKGTVlg
They cover, essentially, this exact same issue (lot's of identical shapes with different colors/positions) and talk about some great ways to optimize such a scene (and theirs is dynamic too!)
I propose following approach:
On load:
Generate coordinates buffer (for one cube) and load it into VBO (gl.glGenBuffers, gl.glBindBuffer)
On draw:
Bind buffer (gl.glBindBuffer)
Draw each cell (loop)
2.1. Move current position to center of current cube (gl.glTranslatef(position.x, position.y, position.z)
2.2. Draw current cube (gl.glDrawArrays)
2.3. Move position back (gl.glTranslatef(-position.x, -position.y, -position.z))

graphics: best performance with floating point accumulation images

I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating point image buffer, converting to unsigned chars at the last minute then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slow down is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32's in a fixed point 16.16 configuration
Optimize float operations using SSE2 assembly (image buffer is a 1024*768*3 array of floats)
Use OpenGL Accumulation Buffer instead of float array
Use OpenGL floating-point FBO's instead of float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data that you move around, and the cache can't help you much because the image is much larger than your cache. Let's assume you touch every pixel five times. If so you move 45mb of data in and out of the slow main memory. 45mb does not sound like much data, but consider that almost each memory access will be a cache miss. The CPU will spend most of the time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE for non temporary loads and stores may help, but they will complicate your task quite a bit (you have to align your reads and writes).
Try break up your rendering into tiles. E.g. do everything on smaller rectangles (256*256 or so). The idea behind this is, that you actually get a benefit from the cache. After you've cleared your rectangle for example the entire bitmap will be in the cache. Rendering and converting to bytes will be a lot faster now because there is no need to get the data from the relative slow main memory anymore.
Last resort: Reduce the resolution of your particle effect. This will give you a good bang for the buck at the cost of visual quality.
The best solution is to move the rendering onto the graphic card. Render to texture functionality is standard these days. It's a bit tricky to get it working with OpenGL because you have to decide which extension to use, but once you have it working the performance is not an issue anymore.
Btw - do you really need floating point render-targets? If you get away with 3 bytes per pixel you will see a nice performance improvement.
It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (eg, accumulate their position per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), rendering to an accumulation buffer and back, or simply generate a bunch of additional sprites on the CPU representing the trail and throw them at the rasterizer.
Try to replace the manual code with sprites: An OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten of them in the same place to get the full glow).
If you by "manual" mean that you are using the CPU to poke pixels, I think pretty much anything you can do where you draw textured polygons using OpenGL instead will represent a huge speedup.

Resources