Non power of two textures are very slow in OpenGL ES 2.0.
But in every "render-to-texture" tutorial I've seen, people just take the screen size (which is never a power of two) and create a texture of that size.
Should I render to a power-of-two texture (with a projection matrix correction), or is there some kind of magic with FBOs?
I don't buy into the "non power of two textures are very slow" premise in your question. First of all, these kinds of performance characteristics can be highly hardware dependent. So saying that this is true for ES 2.0 in general does not really make sense.
I also doubt that any GPU architectures developed within the last 5 to 10 years would be significantly slower when rendering to NPOT textures. If there's data that shows otherwise, I would be very interested in seeing it.
Unless you have conclusive data that shows POT textures to be faster for your target platform, I would simply use the natural size for your render targets.
If you're really convinced that you want to use POT textures, you can use glViewport() to render to just part of them, as @MaticOblak also points out in a comment.
There's one slight caveat to the above: ES 2.0 places some limitations on how NPOT textures can be used. According to the standard, they do not support mipmapping, and only a subset of the wrap modes is allowed. The GL_OES_texture_npot extension, which is supported on many devices, removes these limitations.
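If you do go the POT route anyway, here is a minimal sketch (sizes are made up, error checking omitted) of rendering a 960x640 screen into the corner of a 1024x1024 texture:

GLuint tex, fbo;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
glViewport(0, 0, 960, 640);          // only draw into the screen-sized sub-rectangle
// ...render the scene...
// When sampling later, scale texture coordinates by (960/1024, 640/1024)
// so that only the rendered region is read.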
Related
I have two ways to continue programming a hexagonal map at the moment, and I don't know which one is better. Maybe you can help me :)
I use a texture to represent the "grid", so the quad with this texture is static and is not moved or edited at runtime.
On the one hand, I have a single texture of 7700x6736 pixels, yet its file size is only 3,131 KB. When I run it in an engine (Unity in this case) the frame rate is fine (a constant 60 fps with VSync and 100+ without VSync).
This texture is assigned to one transparent material on the quad (2 triangles).
With the second approach, I have 14 textures of 550x496 pixels, 21 KB each. But this way I need 14 quads (28 triangles instead of 2) and 14 materials with different textures, instead of 1 the other way.
Also, with this second approach, I need to check the distance of every quad to decide whether or not to hide it (a simple form of culling).
Which way is better, in your opinion?
While your 7k texture works on your dev machine, it may not be supported on some of the platforms you'll target. I'd use 2048^2 as a safe maximum, or even 1024^2.
The second problem is that while it may take 3 MB as a JPG/PNG-compressed file, in video memory it will be stored uncompressed (unless you use a GPU texture compression format, but then you may run into platform-support problems again).
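To put numbers on that: a 7700x6736 RGBA8 texture needs about 7700 * 6736 * 4 bytes ≈ 198 MB of video memory once decoded, regardless of how small the PNG/JPG is on disk (and roughly a third more if mipmaps are generated).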
Additionally, you should consider whether you really need non-power-of-two textures; officially they should be supported by now, but you can still run into problems on some older hardware.
In general, your solution depends on the platforms you want to target, and especially on whether you plan to target mobile devices (and which ones).
I wanted to come up with a crude way to "benchmark" the performance improvement of a tweak I made to a fragment shader (to be specific, I wanted to test the performance impact of the removal of the computation of the gamma for the resulting color using pow in the fragment shader).
So I figured that if a frame takes 1 ms to render an opaque cube model with my shader, then if I call glDisable(GL_DEPTH_TEST) and loop my render call 100 times, the frame should take 100 ms to render.
I was wrong. Rendering it 100 times only results in about a 10x slowdown. Obviously, if depth testing were still enabled, most if not all of the fragments in the second and subsequent draw calls would not be computed because they would fail the depth test.
However, a lot of fragments must still be getting culled even with depth testing off.
My question is about whether my hardware (in this particular situation it is an iPad3 on iOS6.1 that I am experiencing this on -- a PowerVR SGX543MP4) is just being incredibly smart and is actually able to use the geometry of later draw calls to occlude and discard fragments from the earlier geometry. If this is not what's happening, then I cannot explain the better-than-expected performance that I am seeing. The question applies to all flavors of OpenGL and desktop GPUs as well, though.
Edit: I think an easy way to "get around" this optimization might be glEnable(GL_BLEND) or something of that sort. I will try this and report back.
PowerVR hardware is based on tile-based deferred rendering. It does not begin drawing fragments until after it receives all of the geometry information for a tile on screen. This is a more advanced hidden-surface removal technique than z-buffering, and what you have actually discovered here is that enabling alpha blending breaks the hardware's ability to exploit this.
Alpha blending is very order-dependent, and so no longer can rasterization and shading be deferred to the point where only the top-most geometry in a tile has to be drawn. Without alpha blending, since there is no data dependency on the order things are drawn in, completely obscured geometry can be skipped before expensive per-fragment operations occur. It is only when you start blending fragments that a true order-dependent situation arises and completely destroys the hardware's ability to defer/cull fragment processing for hidden surfaces.
In all honesty, if you are trying to optimize for a platform based on PowerVR hardware you should probably make this one of your goals. By that, I mean, before optimizing shaders first consider whether you are drawing things in an order and/or with states that hurt the PowerVR hardware's ability to do TBDR. As you have just discovered, blending is considerably more expensive on PowerVR hardware than other hardware... the operation itself is no more complicated, it just prevents PVR hardware from working the special way it was designed to.
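To make that concrete, the usual pattern (a sketch; drawOpaqueGeometry() and drawBlendedGeometry() are hypothetical helpers standing in for your own draw calls) is to submit all opaque geometry with blending disabled and only then switch blending on for the transparent part of the scene:

// Opaque pass: the TBDR hardware can still remove hidden surfaces here.
glDisable(GL_BLEND);
glEnable(GL_DEPTH_TEST);
drawOpaqueGeometry();
// Transparent pass: order-dependent, so keep it as small as possible.
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
glDepthMask(GL_FALSE);               // test against opaque depth, but don't write it
drawBlendedGeometry();               // ideally sorted back to front
glDepthMask(GL_TRUE);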
I can confirm that only after adding both lines:
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA,GL_ONE_MINUS_SRC_ALPHA);
did the frame render time increase in a linear fashion in response to the repeated draw calls. Now back to my crude benchmarking.
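For what it's worth, this is the kind of crude timing loop I mean (a sketch: drawCube() is a hypothetical helper for my draw call, and the GL header include depends on your platform):

#include <sys/time.h>
#include <stdio.h>
// #include <OpenGLES/ES2/gl.h> on iOS, or your platform's GL header

void drawCube(void);                 // hypothetical helper defined elsewhere

static double nowMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

void benchmarkRepeatedDraws(int repetitions)
{
    glFinish();                      // drain any pending GPU work first
    double start = nowMs();
    for (int i = 0; i < repetitions; ++i)
        drawCube();                  // issues the cube's draw call
    glFinish();                      // block until the GPU has actually finished
    printf("%d draws took %.2f ms\n", repetitions, nowMs() - start);
}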
I am finally making the move to OpenGL ES 2.0 and am taking advantage of a VBO to load all of the scene data into the graphics card's memory. However, my scene is only around 200,000 vertices in size (and I know it depends somewhat on hardware), but does anyone think an octree would make any sense in this instance? (Incidentally, because of the viewpoint, at least 60% of the scene is visible most of the time.) Clearly I am trying to avoid having to implement an octree at such an early stage of my GLSL coding life!
There is no need to worry about optimization and performance if the app you are coding is for learning purposes only. But given your question, you apparently intend to make a commercial app.
Using VBOs alone will not solve your app's performance problems, especially since you mention that you want it to run on mobile devices. OpenGL ES offers the GL_TRIANGLE_STRIP drawing mode, which is particularly worthwhile for complex geometry because adjacent triangles share vertices.
Another technique worth adding is bump mapping, in case your model is textured, since it lets you fake surface detail instead of spending geometry on it. With these two approaches your app will improve considerably.
Since you mention that most of your scene is visible most of the time, you should also use level of detail (LOD). To implement geometry LOD you need a separate mesh for each LOD level, each with fewer polygons than the level closer to the camera. You can build the geometry for each level yourself, or use 3D software to generate it automatically.
There are also free tools that can automatically perform generic optimizations directly on your GLSL ES code, and they are worth checking out.
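Going back to the LOD suggestion, a minimal sketch of distance-based LOD selection (the Mesh type, the threshold values, and drawMesh() are all hypothetical placeholders for your own data):

typedef struct Mesh Mesh;            // your mesh representation
void drawMesh(Mesh *m);              // provided elsewhere (hypothetical)

typedef struct {
    Mesh *levels[3];                 // level 0 is the most detailed
    float distances[2];              // switch-over distances
} LodModel;

void drawWithLod(const LodModel *model, float distanceToCamera)
{
    int lod = 0;
    if (distanceToCamera > model->distances[0]) lod = 1;   // medium detail
    if (distanceToCamera > model->distances[1]) lod = 2;   // lowest detail
    drawMesh(model->levels[lod]);
}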
I have analyzed my game with the OpenGL Analyzer in Xcode. I am using Cocos2d 2.0 as a static library in my game and wonder whether any of the following suggestions will improve my performance. I have read some posts in other forums saying that I should not worry about this, but as I do have some performance issues I would like to understand whether these suggestions are likely to improve them.
Suggestions: (analyzer output not reproduced here)
Overview: (analyzer output not reproduced here)
Thinking:
In particular, I am referring to the suggestion that says:
"recommended using VAO and VBO"
I also wonder why there are "Many small batch draw calls". I am using a sprite batch node, which should avoid this issue.
The other suggestions also seem to make sense, but these are the most frequent ones, so I would like to start by analyzing them.
A "small batch draw call" is anything with fewer than n-many vertices. I am not sure the exact threshold used, but it is probably on the order of 100-200. What spritebatches really do is eliminate the need to split your draw calls up multiple times in order to switch bound textures, this does not automatically imply that each draw call is going to have more than 100 (or whatever n is defined as in this context) vertices; it is a strong possibility, but not necessary.
I would be far more concerned about non-VBO draw calls and not using VAOs to be honest, especially if you want your code to be forward-compatible.
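For reference, a minimal VBO + VAO setup under ES 2.0 looks like this (the VAO calls come from the OES_vertex_array_object extension available on iOS; the vertex layout, attribute locations, and the vertices array are assumptions for the sketch):

GLuint vbo, vao;
// Upload the vertex data once.
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
// Record the attribute setup in a VAO so it is not respecified on every draw.
glGenVertexArraysOES(1, &vao);
glBindVertexArrayOES(vao);
glEnableVertexAttribArray(0);        // position at location 0 (assumed)
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 5 * sizeof(GLfloat), (void *)0);
glEnableVertexAttribArray(1);        // texcoord at location 1 (assumed)
glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 5 * sizeof(GLfloat), (void *)(3 * sizeof(GLfloat)));
glBindVertexArrayOES(0);
// At draw time:
glBindVertexArrayOES(vao);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);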
The "Logical Buffer Load" and "Mipmapping Usage" warnings are very likely related; probably both having to do with FBOs. One of them is related to not using glClear (...) properly and the other is related to using a texture that does not have mipmaps.
Regarding logical buffer loads, you should look into GL_EXT_discard_framebuffer, clearing the framebuffer this way is a really healthy optimization strategy for Tile-Based Deferred Rendering GPUs (such as the ones used by all iOS devices).
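As a sketch, discarding depth at the end of the frame (so the tile's depth contents never have to be written back to memory) looks like this with GL_EXT_discard_framebuffer (fbo is assumed to be your framebuffer object):

// After the color buffer has been resolved/presented:
const GLenum discards[] = { GL_DEPTH_ATTACHMENT };
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glDiscardFramebufferEXT(GL_FRAMEBUFFER, 1, discards);
// And at the start of the next frame, clear every buffer you render into:
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);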
As for the mipmap usage warning, I believe this is being triggered because you are drawing into an FBO texture and then sampling that texture with a mipmap filter mode. The mip chain for rendered FBO textures has to be built manually using glGenerateMipmap (...).
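If you really do want mipmapped sampling from a render target, the chain has to be rebuilt after drawing into it (a sketch; fboTexture stands in for your FBO's color texture):

// After rendering into the FBO texture:
glBindTexture(GL_TEXTURE_2D, fboTexture);
glGenerateMipmap(GL_TEXTURE_2D);     // rebuilds the mip chain from level 0
                                     // (core ES 2.0 requires a power-of-two texture for this)
// Otherwise, simply sample it without mipmaps:
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);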
If you can point me to some individual lines that trigger these warnings, I would be happy to explain them in further detail for you.
I'm trying to get an idea of the practicality of WebGL for rendering large interior scenes consisting of hundreds of thousands of triangles. These triangles are distributed over many objects, and there are many materials in the scene. On the other hand, there are no moving parts, and the materials tend to be fairly simple, mostly based on texture maps. There is a lot of texture map sharing; for example, all the chairs in the scene will share a common map. There is also some multitexturing, with up to three textures overlaid in a material.
I've been doing a little experimentation and reading, and gather that frequently switching materials during a rendering pass will slow things down. For example, a scene with 200K triangles will have significant performance differences, depending on whether there are 10 or 1000 objects, assuming that each time an object is displayed a new material is set up.
So it seems that if performance is important the scene should be sorted by materials so as to minimize material switching. What I'm looking for is guidelines on how to think of the overhead of various state changes, and where do I get the biggest bang for the buck. For example,
what are the relative performance costs of, say, gl.useProgram(), gl.uniformMatrix4fv(), gl.drawElements()
should I try to write ubershaders to minimize shader switching?
should I try to aggregate geometry to minimize the number of gl.drawElements() calls
I realize that mileage may vary depending on browser, OS, and graphics hardware. And I'm also not looking for heroic measures. Just some guidelines from people who have already had some experience in making scenes fast. I'll add that while I've had some experience with fixed-pipeline OpenGL programming in the past, I'm rather new to the WebGL/OpenGL ES 2.0 way of doing things.
Have you read "Batch, Batch, Batch"? Admittedly, it focuses on DirectX, but the reasoning applies, to a lesser extent, to OpenGL/WebGL as well: each API call has significant overhead on the CPU. The advice is to use all of the API's options to share textures, use instancing (if available), and write complex shaders to avoid many draw calls. So if you can draw the whole house as a single mesh in a single call, that would be better than 1000 calls, one for each room. Writing ubershaders is recommended, but mostly because it may allow you to remove draw calls, not because GPU state switching is expensive.
This assumes recent hardware. For low-end platforms (iPad?) or Intel GMA chips, the bottlenecks will be elsewhere (such as in software vertex processing).
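As a concrete illustration of the "aggregate geometry" point (written here with ES 2.0-style C calls; all names are made up for the sketch): pack everything that shares a material into one vertex/index buffer pair, bake the per-object transforms into the vertices, and issue a single draw call instead of one per object:

// One merged VBO/IBO for all chairs that share a texture.
glBindBuffer(GL_ARRAY_BUFFER, mergedVbo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, mergedIbo);
// ...set up vertex attributes once...
glUseProgram(chairProgram);
glBindTexture(GL_TEXTURE_2D, chairTexture);
glUniformMatrix4fv(viewProjLoc, 1, GL_FALSE, viewProj);   // per-object transforms already baked in
glDrawElements(GL_TRIANGLES, totalIndexCount, GL_UNSIGNED_SHORT, 0);

The same idea carries over one-to-one to WebGL's gl.* calls.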