Why is a Sprite Batcher faster? - opengl-es

I am reading Beginning Android Games (Mario Zechner) at the moment.
While reading about 2D games with OpenGL ES 1.0, the author introduces the concept of a SpriteBatcher that takes, for each sprite it is asked to render, the sprite's position and an angle. The SpriteBatcher then calculates the final coordinates of the sprite rectangle and writes them into a single big buffer.
In the render method the SpriteBatcher sets the state for all the sprites once (texture, blending, vertex buffer, texture coordinates buffer). All sprites use the same texture but not the same texture coordinates.
The advantages of this behavior are:
The rendering pipeline does not stall, since there are no state changes while rendering all the sprites.
There are fewer OpenGL calls (= less JNI overhead).
But I see a major disadvantage:
For rotation the CPU has to calculate the sine and cosine and perform 16 multiplications for each sprite. As far as I know, calculating sine and cosine is very expensive and slow.
Yet the SpriteBatcher approach is much faster than rendering the sprites one by one with lots of glRotate/glTranslate calls.
Finally my questions:
Why is it faster? Are OpenGL state changes really that expensive?
The GPU is optimized for vector multiplications and rotations, while the CPU is not. Why doesn't that matter?
Would one use a SpriteBatcher on a desktop with a dedicated GFX-card?
Is there a point where the SpriteBatcher becomes inefficient?

But I see a major disadvantage:
For rotation the CPU has to calculate the sine and cosine and perform 16 multiplications for each sprite. As far as I know, calculating sine and cosine is very expensive and slow.
Actually sin and cos are quite fast; on modern architectures they take only a few clock cycles, as long as the pipeline hasn't stalled beforehand. However, if each sprite is rotated individually while an ordinary frustum perspective projection is used, the author of this code is overlooking a linear-algebra shortcut.
The whole task can be simplified a lot if one recalls that the modelview matrix maps local/world coordinates to eye space. The rotation sits in the upper-left 3×3 submatrix, whose columns form the local basis vectors. Taking the inverse of this submatrix gives you exactly the vectors you need as the sprite basis to map a planar quad into eye space. If only rotations (and perhaps uniform scaling) are applied, the inverse of the upper-left 3×3 is its transpose; so by using the rows of the upper-left 3×3 as the sprite basis you get that effect without doing any trigonometry at all:
/* Populates the currently bound VBO with billboarded sprite geometry.
   A sprite here is a center position plus a scale (half-size). */
struct sprite_t {
    GLfloat x, y, z;
    GLfloat scale;
};

void populate_sprites_VBO(std::vector<sprite_t> const &sprites)
{
    GLfloat mv[16];
    GLfloat sprite_left[3];
    GLfloat sprite_up[3];

    /* OpenGL matrices are column major: element (row, col) is mv[col*4 + row].
       The rows of the upper-left 3x3 are exactly the "right" and "up"
       vectors we need for the sprite basis. */
    glGetFloatv(GL_MODELVIEW_MATRIX, mv);
    for(int i = 0; i < 3; i++) {
        sprite_left[i] = mv[i*4];     /* first row:  mv[0], mv[4], mv[8] */
        sprite_up[i]   = mv[i*4 + 1]; /* second row: mv[1], mv[5], mv[9] */
    }

    std::vector<GLfloat> sprite_geom;
    sprite_geom.reserve(sprites.size() * 12);

    for(std::vector<sprite_t>::const_iterator sprite = sprites.begin(), end = sprites.end();
        sprite != end;
        ++sprite) {
        /* four corners of the quad */
        sprite_geom.push_back(sprite->x + (-sprite_left[0] - sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + (-sprite_left[1] - sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + (-sprite_left[2] - sprite_up[2])*sprite->scale);

        sprite_geom.push_back(sprite->x + ( sprite_left[0] - sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + ( sprite_left[1] - sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + ( sprite_left[2] - sprite_up[2])*sprite->scale);

        sprite_geom.push_back(sprite->x + ( sprite_left[0] + sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + ( sprite_left[1] + sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + ( sprite_left[2] + sprite_up[2])*sprite->scale);

        sprite_geom.push_back(sprite->x + (-sprite_left[0] + sprite_up[0])*sprite->scale);
        sprite_geom.push_back(sprite->y + (-sprite_left[1] + sprite_up[1])*sprite->scale);
        sprite_geom.push_back(sprite->z + (-sprite_left[2] + sprite_up[2])*sprite->scale);
    }

    glBufferData(GL_ARRAY_BUFFER,
                 sprite_geom.size() * sizeof(GLfloat), &sprite_geom[0],
                 GL_STREAM_DRAW);
}
If shaders are available, then instead of rebuilding the sprite data on the CPU each frame, one could use a geometry shader or the vertex shader. A geometry shader would take a vector of position, scale, texture coordinates, etc. and emit the quads. Using a vertex shader, you'd send a lot of [-1,1] quads, where each vertex would carry the center position of the sprite it belongs to as an additional vec3 attribute.
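A minimal sketch of that vertex-shader variant might look like this (GLSL ES 1.00; the attribute and uniform names are my own, not taken from the book):
// Billboarding in the vertex shader: each vertex carries the sprite center
// plus a corner offset in [-1,1]. u_Right/u_Up are the camera's right and up
// vectors in world space (the rows of the modelview's upper-left 3x3).
attribute vec3 a_Center;    // sprite center position
attribute vec2 a_Corner;    // quad corner in [-1,1]
attribute vec2 a_TexCoord;
attribute float a_Scale;

uniform mat4 u_ModelViewProjection;
uniform vec3 u_Right;
uniform vec3 u_Up;

varying vec2 v_TexCoord;

void main()
{
    vec3 world = a_Center + (u_Right * a_Corner.x + u_Up * a_Corner.y) * a_Scale;
    gl_Position = u_ModelViewProjection * vec4(world, 1.0);
    v_TexCoord = a_TexCoord;
}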
Finally my questions:
Why is it faster? Are OpenGL state changes really that expensive?
Some state changes are extremely expensive; you'll want to avoid those wherever possible. Switching textures is very expensive, switching shaders mildly so.
The GPU is optimized for vector multiplications and rotations, while the CPU is not. Why doesn't that matter?
This is not the real difference between GPU and CPU. Where a GPU differs from a CPU is that it performs the same sequence of operations on a huge batch of records in parallel (every pixel of the framebuffer being rendered to), whereas a CPU runs the program one record at a time.
But CPUs do vector operations just as well as, if not better than, GPUs. Especially where precision matters, CPUs are still preferred over GPUs. MMX, SSE and 3DNow! are vector math instruction sets.
Would one use a SpriteBatcher on a desktop with a dedicated GFX-card?
Probably not in this form, since today one has geometry and vertex shaders available, freeing the CPU for other things. But more importantly, this saves bandwidth between CPU and GPU. Bandwidth is the tighter bottleneck; processing power is not the number one problem these days (of course one never has enough processing power).
Is there a point where the SpriteBatcher becomes inefficient?
Yes, namely at the CPU → GPU transfer bottleneck. Today one uses geometry shaders and instancing to do this kind of thing really fast.
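For comparison, an instanced draw on OpenGL ES 3.0 or desktop GL 3.3+ might look roughly like this (a sketch; the buffer handles and the attribute locations 0/1 are assumptions of mine and must match your shader):
#include <GLES3/gl3.h>

// Draws sprite_count billboards with a single call. The quad corners are a
// shared 4-vertex buffer; the per-sprite data (center.xyz, scale) advances
// once per instance instead of once per vertex.
void draw_sprites_instanced(GLuint quad_corner_vbo, GLuint per_sprite_vbo, GLsizei sprite_count)
{
    glBindBuffer(GL_ARRAY_BUFFER, quad_corner_vbo);   // 4 x vec2 in [-1,1]
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, 0);

    glBindBuffer(GL_ARRAY_BUFFER, per_sprite_vbo);    // one vec4 per sprite
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 0, 0);
    glVertexAttribDivisor(1, 1);                      // advance once per instance

    // the vertex shader builds the billboard from the corner offset and the
    // per-instance center/scale, just as in the CPU version above
    glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, sprite_count);
}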

I don't know about SpriteBatcher, but looking at the information you provided here are my thoughts:
It is faster because it uses fewer state changes and, more importantly, fewer draw calls. Mobile platforms have especially strict constraints on the number of draw calls per frame.
That doesn't matter because, most likely, they are doing the rotations on the CPU. Personally, I see no reason not to use the GPU for that, which would be way faster and would remove the bandwidth load.
I guess it would still be a good optimization considering point 1.
I can think of two extreme cases: when there are too few sprites, or when the compound texture (containing all the rotated sprites) grows too big (mobile devices have lower texture size limits).

Related

Optimizing texture fetches with higher mip levels

Let's say I have some shader program in DirectX or OpenGL rendering a full screen quad. And in a pixel/fragment shader I sample some huge textures at random texture coordinates. That is, the same texture coordinate is used for all texture samplings within one shader invocation, but it varies between different shader invocations. These fetch operations cause a performance drop; I even think that, due to the size of the textures, the GPU texture cache is not big enough and is not used efficiently.
Now I have a theoretical question: can I improve performance by using low-resolution (say 32x32) mask textures, built by mipmapping the large textures, so that if the value in a mask texture at the given texture coordinate at some higher mip level is not appropriate, I don't need to perform the texture fetch at full-size level 0? Something like this in HLSL (the GLSL code is pretty similar, but there is no [branch] attribute):
float2 tc = calculateTexCoordinates();
bool performHeavyComputations = testValue(largeMipmappedTexture.SampleLevel(sampler, tc, 5));
float result = 0;
[branch]
if (performHeavyComputations)
{
    result += largeMipmappedTexture.SampleLevel(sampler, tc, 0);
}
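For completeness, the GLSL version I have in mind is roughly the following (textureLod is available in fragment shaders from GLSL 1.30 / GLSL ES 3.00):
vec2 tc = calculateTexCoordinates();
bool performHeavyComputations = testValue(textureLod(largeMipmappedTexture, tc, 5.0));
float result = 0.0;
if (performHeavyComputations)
{
    result += textureLod(largeMipmappedTexture, tc, 0.0).r;
}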
About 50% of texels at mip level 5 will not pass the test. And so a lot of shader invocations should not sample the full-size textures.
But I am introducing branching in the code. Might this branching hurt performance even more than sampling the full-size texture when it isn't needed? Different GPUs may behave differently, and some may not even support branching; will they perform two fetches instead of one?
I can test this code on some machines later, but my question is theoretical.
And can you suggest other optimizations in case this doesn't work properly?

How to use raw gyroscope data (°/s) for calculating 3D rotation?

My question may seem trivial, but the more I read about it - the more confused I get... I have started a little project where I want to roughly track the movements of a rotating object. (A basketball to be precise)
I have a 3-axis accelerometer (low-pass-filtered) and a 3-axis gyroscope measuring °/s.
I know about the issues of a gyro, but as the measurements will only last several seconds and the angles tend to be huge, I don't care about drift and gimbal lock right now.
My gyro gives me the rotation speed around all 3 axes. As I want to integrate the acceleration twice to get the position at each timestep, I wanted to convert the sensor's coordinate system into an earthbound system.
For the first try, I want to keep things simple, so I decided to go with the big standard rotation matrix.
But as my results are horrible, I wonder if this is the right way to do it. If I understood correctly, the matrix is simply 3 matrices multiplied in a certain order. As the rotation of a basketball doesn't have any "natural" order, this may not be a good idea. My sensor measures 3 angular velocities at once. If I feed them into my system "step by step" it will not be correct, since my second matrix calculates the rotation around the "new y-axis", but my sensor actually measured an angular velocity around the "old y-axis". Is that correct so far?
So how can I correctly calculate the 3D rotation?
Do I need to go for quaternions? But how do I get one from 3 different rotations? And don't I have the same issue here again?
I start with an identity matrix ((1, 0, 0)(0, 1, 0)(0, 0, 1)) multiplied with the acceleration vector to give me the first movement.
Then I want to use the rotation matrix to find out where the next acceleration is really heading, so I can simply add the accelerations together.
But right now I am just too confused to find a proper way.
Any suggestions?
By the way, sorry for my poor English, I am tired and (obviously) not a native speaker ;)
Thanks,
Alex
Short answer
Yes, go for quaternions and use a first-order linearization of the rotation to calculate how the orientation changes. This reduces to the following pseudocode:
float pose_initial[4]; // quaternion describing original orientation
float g_x, g_y, g_z; // gyro rates
float dt; // time step. The smaller the better.
// quaternion with "pose increment", calculated from the first-order
// linearization of continuous rotation formula
delta_quat = {1, 0.5*dt*g_x, 0.5*dt*g_y, 0.5*dt*g_z};
// final orientation at start time + dt
pose_final = quaternion_hamilton_product(pose_initial, delta_quat);
This solution is used in PixHawk's EKF navigation filter (it is open source, check out the formulation here). It is simple, cheap, stable and accurate enough.
The identity matrix (describing a "null" rotation) is equivalent to the quaternion [1 0 0 0]. You can get the quaternion describing other poses using a suitable conversion formula (for example, if you have Euler angles you can go for this one).
Notes:
Quaternions follow the [w, i, j, k] notation.
These equations assume angular speeds in SI units, that is, radians per second.
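To make the pseudocode above concrete, here is a minimal C-style sketch of the Hamilton product, the normalization (also needed by the gradient method further below) and one integration step; the function names follow the pseudocode, everything else is my own illustration. An Euler-to-quaternion conversion in one common convention is included because the short answer refers to such a formula:
#include <math.h>
#include <string.h>

/* q = [w, x, y, z], i.e. the [w, i, j, k] notation from the notes above */
void quaternion_hamilton_product(const float a[4], const float b[4], float out[4])
{
    out[0] = a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3];
    out[1] = a[0]*b[1] + a[1]*b[0] + a[2]*b[3] - a[3]*b[2];
    out[2] = a[0]*b[2] - a[1]*b[3] + a[2]*b[0] + a[3]*b[1];
    out[3] = a[0]*b[3] + a[1]*b[2] - a[2]*b[1] + a[3]*b[0];
}

void quaternion_normalize(float q[4])
{
    float n = sqrtf(q[0]*q[0] + q[1]*q[1] + q[2]*q[2] + q[3]*q[3]);
    q[0] /= n; q[1] /= n; q[2] /= n; q[3] /= n;
}

/* One first-order integration step as in the pseudocode above:
   pose is updated in place from gyro rates (rad/s) over time step dt. */
void integrate_gyro(float pose[4], float g_x, float g_y, float g_z, float dt)
{
    float delta_quat[4] = { 1.0f, 0.5f*dt*g_x, 0.5f*dt*g_y, 0.5f*dt*g_z };
    float pose_final[4];
    quaternion_hamilton_product(pose, delta_quat, pose_final);
    quaternion_normalize(pose_final);   /* keep it a unit quaternion */
    memcpy(pose, pose_final, sizeof(pose_final));
}

/* Euler angles in radians (ZYX convention: yaw about z, pitch about y,
   roll about x) to a [w, i, j, k] quaternion. */
void euler_to_quaternion(float roll, float pitch, float yaw, float q[4])
{
    float cr = cosf(roll*0.5f),  sr = sinf(roll*0.5f);
    float cp = cosf(pitch*0.5f), sp = sinf(pitch*0.5f);
    float cy = cosf(yaw*0.5f),   sy = sinf(yaw*0.5f);
    q[0] = cr*cp*cy + sr*sp*sy;
    q[1] = sr*cp*cy - cr*sp*sy;
    q[2] = cr*sp*cy + sr*cp*sy;
    q[3] = cr*cp*sy - sr*sp*cy;
}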
Long answer
A gyroscope describes the rotational speed of an object as a decomposition into three rotational speeds around the orthogonal local axes XYZ. However, you could equivalently describe the rotational speed as a single rate around a certain axis, either in a reference system that is local to the rotated body or in a global one.
The three rotational speeds affect the body simultaneously, continuously changing the rotation axis.
Here we have the problem of switching from the continuous-time real world to a simpler discrete-time formulation that can be easily solved using a computer. When discretizing, we are always going to introduce errors. Some approaches will lead to bigger errors, while others will be notably more accurate.
Your approach of concatenating three simultaneous rotations around orthogonal axes works reasonably well with small integration steps (let's say smaller than 1/1000 s, although it depends on the application), so that you are approximating the continuous change of the rotation axis. However, this is computationally expensive, and the error grows as you make the time steps bigger.
As an alternative to first-order linearization, you can calculate pose increments as a small delta of angular speed gradient (also using quaternion representation):
quat_gyro = {0, g_x, g_y, g_z};
q_grad = 0.5 * quaternion_product(pose_initial, quat_gyro);
// Important to normalize result to get unit quaternion!
pose_final = quaternion_normalize(pose_initial + q_grad*dt);
This technique is used in the Madgwick orientation filter (here is an implementation), and it works pretty well for me.

Summed area table in GLSL and GPU fragment shader execution

I am trying to compute the integral image (aka summed area table) of a texture I have in the GPU memory (a camera capture), the goal being to compute the adaptive threshold of said image. I'm using OpenGL ES 2.0, and still learning :).
I did a test with a simple Gaussian blur shader (vertical/horizontal pass), which is working fine, but I need a much bigger variable averaging area for it to give satisfactory results.
I did implement a version of that algorithm on CPU before, but I'm a bit confused on how to implement that on a GPU.
I tried to do a (completely incorrect) test with just something like this for every fragment:
#version 100
#extension GL_OES_EGL_image_external : require
precision highp float;

uniform sampler2D u_Texture;        // The input texture.
varying lowp vec2 v_TexCoordinate;  // Interpolated texture coordinate per fragment.
uniform vec2 u_PixelDelta;          // Pixel delta

void main()
{
    // get neighboring pixels values
    float center = texture2D(u_Texture, v_TexCoordinate).r;
    float a = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, 0.0)).r;
    float b = texture2D(u_Texture, v_TexCoordinate + vec2(0.0, u_PixelDelta.y * 1.0)).r;
    float c = texture2D(u_Texture, v_TexCoordinate + vec2(u_PixelDelta.x * -1.0, u_PixelDelta.y * 1.0)).r;

    // compute value
    float pixValue = center + a + b - c;

    // Result stores value (R) and original gray value (G)
    gl_FragColor = vec4(pixValue, center, center, 1.0);
}
And then another shader to fetch the area that I want and then compute the average. This is obviously wrong, as there are multiple execution units operating at the same time.
I know that the common way of computing a prefix sum on a GPU is to do it in two passes (vertical/horizontal, as discussed in this thread or here), but isn't there a problem here, as there is a data dependency of each cell on the previous (top or left) one?
I can't seem to understand the order in which the multiple execution units on a GPU will process the different fragments, and how a two-pass filter can solve that issue. As an example, if I have some values like this:
2 1 5
0 3 2
4 4 7
The two passes should give (first columns, then rows):
2 1 5 2 3 8
2 4 7 -> 2 6 13
6 8 14 6 14 28
How can I be sure that, as an example, the value [0;2] will be computed as 6 (2 + 4) and not 4 (0 + 4, if the 0 hasn't been updated yet)?
Also, as I understand it, fragments are not pixels (if I'm not mistaken); would the values I store into one of my textures in the first pass be the same in another pass if I use the exact same coordinates passed from the vertex shader, or will they be interpolated in some way?
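For reference, on the CPU the two passes are trivial because each cell can read its already-updated neighbor; this sketch is just to show the dependency I'm asking about, not a GPU solution:
#include <vector>

// In-place summed area table: first cumulative sums down each column,
// then cumulative sums along each row (the two passes from the example).
void summed_area_table(std::vector<float> &img, int width, int height)
{
    for (int x = 0; x < width; ++x)           // first pass: columns
        for (int y = 1; y < height; ++y)
            img[y*width + x] += img[(y - 1)*width + x];

    for (int y = 0; y < height; ++y)          // second pass: rows
        for (int x = 1; x < width; ++x)
            img[y*width + x] += img[y*width + x - 1];
}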
Tommy and Bartvbl address your questions about a summed-area table, but your core problem of an adaptive threshold may not need that.
As part of my open source GPUImage framework, I've done some experimentation with optimizing blurs over large radii using OpenGL ES. Generally, increasing blur radii leads to a significant increase in texture sampling and calculations per pixel, with an accompanying slowdown.
However, I found that for most blur operations you can apply a surprisingly effective optimization to cap the number of blur samples. If you downsample the image before blurring, blur at a smaller pixel radius (radius / downsampling factor), and then linearly upsample, you can arrive at a blurred image that is the equivalent of one blurred at a much larger pixel radius. In my tests, these downsampled, blurred, and then upsampled images look almost identical to the ones blurred based on the original image resolution. In fact, precision limits can lead to larger-radii blurs done at a native resolution breaking down in image quality past a certain size, where the downsampled ones maintain the proper image quality.
By adjusting the downsampling factor to keep the downsampled blur radius constant, you can achieve near constant-time blurring speeds in the face of increasing blur radii. For an adaptive threshold, the image quality should be good enough to use for your comparisons.
I use this approach in the Gaussian and box blurs within the latest version of the above-linked framework, so if you're running on Mac, iOS, or Linux, you can evaluate the results by trying out one of the sample applications. I have an adaptive threshold operation based on a box blur that uses this optimization, so you can see if the results there are what you want.
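Once you have that blurred local-mean texture, the threshold comparison itself is a tiny fragment shader along these lines (a sketch; the uniform names and the offset constant are illustrative, not taken from GPUImage):
// Adaptive threshold: compare each pixel against its blurred
// neighborhood mean minus a small offset.
precision mediump float;

uniform sampler2D u_Image;        // original grayscale image
uniform sampler2D u_BlurredMean;  // downsampled, blurred and upsampled mean
varying vec2 v_TexCoordinate;

void main()
{
    float pixel  = texture2D(u_Image, v_TexCoordinate).r;
    float mean   = texture2D(u_BlurredMean, v_TexCoordinate).r;
    float offset = 0.05;                       // tuning constant
    float result = step(mean - offset, pixel); // 1.0 if pixel >= mean - offset
    gl_FragColor = vec4(vec3(result), 1.0);
}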
As per the above, it's not going to be fantastic on a GPU. But assuming the cost of shunting data between the GPU and CPU is more troubling, it may still be worth persevering.
The most obvious prima facie solution is to split horizontal/vertical as discussed. Use an additive blending mode, create a quad that draws the whole source image, then, e.g. for the horizontal step on a bitmap of width n, issue a call that requests the quad be drawn n times, the 0th time at x = 0 and the mth time at x = m. Then ping-pong via an FBO, switching the target buffer of the horizontal draw into the source texture for the vertical step.
Memory accesses are probably O(n^2) (i.e. you'll probably cache quite well, but that's hardly a complete relief), so it's a fairly poor solution. You could improve it with divide and conquer, doing the same thing in bands: e.g. for the vertical step, independently sum bands of 8 rows, after which the error in every row below a band's final row is the failure to include the totals accumulated above it. So perform a second pass to propagate those.
However an issue with accumulating in the frame buffer is clamping to avoid overflow — if you're expecting a value greater than 255 anywhere in the integral image then you're out of luck because the additive blending will clamp and GL_RG32I et al don't reach ES prior to 3.0.
The best solution I can think of to that, without using any vendor-specific extensions, is to split up the bits of your source image and combine channels after the fact. Supposing your source image were 4 bit and your image less than 256 pixels in both directions, you'd put one bit each in the R, G, B and A channels, perform the normal additive step, then run a quick recombine shader as value = A + (B*2) + (G*4) + (R*8). If your texture is larger or smaller in size or bit depth then scale up or down accordingly.
(platform specific observation: if you're on iOS then you've hopefully already got a CVOpenGLESTextureCache in the loop, which means you have CPU and GPU access to the same texture store, so you might well prefer to kick this step off to GCD. iOS is amongst the platforms supporting EXT_shader_framebuffer_fetch; if you have access to that then you can write any old blend function you like and at least ditch the combination step. Also you're guaranteed that preceding geometry has completed before you draw so if each strip writes its totals where it should and also to the line below then you can perform the ideal two-pixel-strips solution with no intermediate buffers or state changes)
What you attempt to do cannot be done in a fragment shader. GPUs are by nature very different from CPUs: they execute their instructions in parallel, on massive numbers of items at the same time. Because of this, OpenGL does not make any guarantees about execution order, because the hardware physically doesn't allow it.
So there is not really any defined order other than "whatever the GPU thread block scheduler decides".
Fragments are pixels, sorta-kinda. They are pixels that potentially end up on screen. If one triangle ends up in front of another, the previously calculated colour value is discarded. This happens regardless of whatever colour was stored at that pixel in the colour buffer previously.
As for creating the summed area table on the GPU, I think you may first want to look at GLSL "Compute Shaders", which are specifically made for this sort of thing.
I think you may be able to get this to work by creating a single thread for each row of pixels in the table, then having every thread "lag behind" by 1 pixel compared to the previous row.
In pseudocode:
int row_id = thread_id()
for column_index in (image.cols + image.rows):
    int my_current_column_id = column_index - row_id
    if my_current_column_id >= 0 and my_current_column_id < image.width:
        // calculate sums
The catch of this method is that all threads should be guaranteed to execute their instructions simultaneously without getting ahead of one another. This is guaranteed in CUDA, but I'm not sure whether it is in OpenGL compute shaders. It may be a starting point for you, though.
It may look surprising to a beginner, but the prefix sum (SAT calculation) is well suited to parallelization. While the Hensley algorithm is the most intuitive to understand (and is also implemented in OpenGL), more work-efficient parallel methods are available; see CUDA scan. The paper from Sengupta discusses a parallel method with reduce and down-sweep phases, which seems to be the state-of-the-art efficient approach. These are valuable materials, but they do not cover OpenGL shader implementations in detail. The closest document is the presentation you have found (it refers to the Hensley publication), since it has some shader snippets. The job is doable entirely in a fragment shader with FBO ping-pong. Note that the FBO and its texture need to have a high-precision internal format - GL_RGB32F would be best, but I am not sure if it is supported in OpenGL ES 2.0.

Opengl ES 2.0: Model Matrix vs Per Vertex Calculation

I may be asking a silly question, but I'm a bit curious about OpenGL ES 2.0 performance.
Let's say I have a drawing object that contains a Vertex Array "VA", a Buffer Array "BA", and/or a Model Matrix "MM", and I want to do at least one translation and one rotation per frame. So, what is the best alternative?
Do the operations (Rot and Trans) on VA and pass to BA.
Do the operations (Rot and Trans) directly on BA.
Do the operations on MM and pass it to Opengl Vertex Shader.
My concern is about performance, the processing/memory ratio. I think that the 3rd option may be the best because of the GPU, but also the most expensive in terms of memory, because every object would have to have an MM, right?
Another solution I thought of was to pass the translation and rotation parameters to the shader and assemble the MM in the shader.
How is this best done?
It is far from a silly question, but unfortunately it all depends on the case. Generally, even using vertex buffers on the GPU might not be the best idea if the vertex data is constantly changing, but I guess that is not the case here.
So the two main differences in what you are thinking would be:
Modify each of the vertex in the CPU and then send the vertex data to the GPU.
Leaving the data on the GPU as it is and change them in the vertex shader by using a matrix.
So the first option is actually good if the vertex data are changing beyond what you can represent with a matrix or any other type of analytically described vertex transformation, for instance if you kept generating random positions on the CPU. In such cases there is actually little sense in even using a vertex buffer, since you will need to keep streaming the vertex data every frame anyway.
The second one is great in cases where the base vertex data are relatively static (not changing too much every frame). You push the vertex data to the GPU once (or once every now and then) and then use the vertex shader to transform the vertex data for you. The vertex shader on the GPU is very effective at doing so and will be much faster than applying the same algorithm on the CPU.
So about your questions:
The third option would most likely be the best if you have a significant amount of vertex data, and I wouldn't say it is expensive in terms of memory: a matrix consists of only 16 floats, which is relatively small (six 3D vertex positions already take more memory than that), so you should not worry about that at all. If anything, you should worry about how much data you stream to the GPU, which again is lowest with this option; a short sketch of this follows below.
Passing a translation and rotation to the vertex shader and then composing the matrix there is probably not the best idea. You gain a little in traffic to the GPU by sending 4+3 floats instead of 16, but to begin with you send them in two chunks, which can produce overhead. Beyond that, you consume rather more memory than less, since you need to create the matrix in the shader anyway. And if you do that, you will be computing a new matrix in every vertex shader invocation, which means for each and every vertex.
Now, about these matrices and memory: it is hard to say whether they will actually have any influence on memory at all. The stack size is usually fixed, or at least rounded, so adding a matrix to the shader or not will most likely make no difference to memory consumption.
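As a rough illustration of the third option, the per-object, per-frame cost is a single uniform upload before the draw call. This is only a sketch: the Matrix4/Object types, the attribute location and the u_MVP uniform name are placeholders, and the vertex shader is assumed to do gl_Position = u_MVP * a_Position.
// Per object, per frame: compose the matrix once on the CPU and hand it to
// the vertex shader as one uniform; the vertex data itself stays on the GPU.
GLint mvpLocation = glGetUniformLocation(program, "u_MVP");

for (const Object &obj : objects) {
    Matrix4 mvp = projection * view * obj.modelMatrix;    // 16 floats per object
    glUniformMatrix4fv(mvpLocation, 1, GL_FALSE, mvp.data());
    glBindBuffer(GL_ARRAY_BUFFER, obj.vbo);
    glVertexAttribPointer(positionAttrib, 3, GL_FLOAT, GL_FALSE, 0, 0);
    glEnableVertexAttribArray(positionAttrib);
    glDrawArrays(GL_TRIANGLES, 0, obj.vertexCount);
}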
When it comes to OpenGL and performance, you primarily need to watch:
Memory consumption. This is mostly taken up by textures; a 1024x1024 RGBA texture takes about 4MB, which equals a million floats, or about 350k vertices containing 3D position vectors, so something like a matrix really has little effect.
Data stream. This is how much data you need to pass to the GPU on every frame for processing. This should be reduced as much as possible but again sending up to a few MB should not be a problem at all.
Overall efficiency in the shader
Number of draw calls. If possible, try to pack similar data together to reduce the number of draw calls.

Is it better to use a single texture or multiple textures for a YUV image

This question is for OpenGL ES 2.0 (on Android) but may be more general to OpenGL.
Ultimately all performance questions are implementation-dependent, but if anyone can answer this question in general or based on their experience that would be helpful. I'm writing some test code as well.
I have a YUV (12bpp) image I'm loading into a texture and color-converting in my fragment shader. Everything works fine but I'd like to see where I can improve performance (in terms of frames per second).
Currently I'm actually loading three textures for each image: one for the Y component (of type GL_LUMINANCE), and one each for the U and V components (both of type GL_LUMINANCE and of course 1/4 the size of the Y component).
Assuming I can get the YUV pixels in any arrangement (e.g. the U and V in separate planes or interspersed), would it be better to consolidate the three textures into only two or only one? Obviously it's the same number of bytes to push to the GPU no matter how you do it, but maybe with fewer textures there would be less overhead. At the very least, it would use fewer texture units. My ideas:
If the U and V pixels were interspersed with each other, I could load them in a single texture of type GL_LUMINANCE_ALPHA which has two components.
I could load the entire YUV image as a single texture (of type GL_LUMINANCE but 3/2 the size of the image) and then in the fragment shader I could call texture2D() three times on the same texture, doing a bit of arithmetic to figure out the correct co-ordinates to pass to texture2D for the Y, U and V components.
I would combine the data into as few textures as possible. Fewer textures is usually a better option for a few reasons.
Fewer state changes to setup the draw call.
The fewer texture fetches in a fragment shader the better.
Less upload time.
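For example, with the Y plane in a full-size GL_LUMINANCE texture and interleaved U/V in a half-width, half-height GL_LUMINANCE_ALPHA texture (the first idea from the question), the conversion shader could look roughly like this (a sketch using BT.601 full-range coefficients; the texture and varying names are mine):
// YUV -> RGB with two textures: Y in u_TexY (.r), U/V interleaved in
// u_TexUV (U in .r, V in .a of a GL_LUMINANCE_ALPHA texture).
precision mediump float;

uniform sampler2D u_TexY;
uniform sampler2D u_TexUV;
varying vec2 v_TexCoordinate;

void main()
{
    float y  = texture2D(u_TexY, v_TexCoordinate).r;
    vec2  uv = texture2D(u_TexUV, v_TexCoordinate).ra - vec2(0.5);
    vec3 rgb = vec3(y + 1.402 * uv.y,
                    y - 0.344 * uv.x - 0.714 * uv.y,
                    y + 1.772 * uv.x);
    gl_FragColor = vec4(rgb, 1.0);
}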
Sources:
I understand some of these are focused on more specific hardware, but the principles apply to most Mobile graphics architectures.
Best Practices for Working with Texture Data
Optimize OpenGL for Tegra
Optimizing performance of a heavy fragment shader
"Binding to a texture takes time for OpenGL ES to process. Apps that reduce the number of changes they make to OpenGL ES state perform better. "
"In my experience mobile GPU performance is roughly proportional to the number of texture2D calls." "There are two texture loads, so the minimum cycle count for the texture sub-unit is two." (Tegra has a texture unit which has to run a cycle for reach texture read)
"making calls to the glTexSubImage and glCopyTexSubImage functions particularly expensive" - upload operations must stall the pipeline until textures are uploaded. It is faster to batch these into a single upload than block a bunch of separate times.
