Why is the compute shader slowing down the rendering API calls? - opengl-es

I am using a compute shader to process input buffer data and store the result in an output texture using imageStore().
After executing the compute shader, I issue 3 render calls sequentially.
Compute Shader Code:
#version 310 es
precision mediump image2D;

layout(std430) buffer;          // sets the default layout for SSBOs
layout(local_size_x = 256) in;  // 256 threads per work group

layout(binding = 0) readonly buffer InputBuf
{
    uint input_buff[];
} inputbuff;

layout(rgba32f, binding = 1) uniform writeonly image2D out_teximg;

void main()
{
    int idx = int(gl_GlobalInvocationID.x);
    int idy = int(gl_GlobalInvocationID.y);
    uint inputpix = inputbuff.input_buff[1024 * idy + idx];
    // some calculation on inputpix; the output is rcolor, bcolor, gcolor
    imageStore(out_teximg, ivec2(idx, idy), vec4(rcolor, bcolor, gcolor, 1.0));
    barrier();
}
Code:
void initCompute()
{
    glGenTextures(1, &computeOutTex);
    glGenBuffers(1, &inSSBOId);
}
uint inputBuffData[] = { .... }; // input buffer data
void execute_compute()
{
    // compute shader code starts...
    glUseProgram(computePgmId);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, computeOutTex);
    glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA32F, width, height);
    glBindImageTexture(1, computeOutTex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F); // binding is 1
    glUniform1i(glGetUniformLocation(computePgmId, "out_teximg"), 0);
    uint inputBuffSize = 1024 * 512 * 3;
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, inSSBOId);
    glBufferData(GL_SHADER_STORAGE_BUFFER, inputBuffSize, inputBuffData, GL_STATIC_DRAW);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, inSSBOId); // binding is 0
    glDispatchCompute(width / 256, height, 1);
    glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
    // glFinish();
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);
    glBindImageTexture(1, 0, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F); // binding is 1
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, 0); // binding is 0
}
int draw()
{
    glBindFramebuffer(GL_FRAMEBUFFER, m_FBOId); // offscreen rendering
    glClear(GL_COLOR_BUFFER_BIT);
    glUseProgram(compute_pgm);
    execute_compute();

    glUseProgram(render_pgm1);
    glViewport(0, 0, w, h);
    glActiveTexture(GL_TEXTURE0);
    glBindTexture(GL_TEXTURE_2D, computeOutTex);
    glDrawElements(); // render the texture data

    // 2nd draw call
    glUseProgram(render_pgm2);
    ....
    ....
    glDrawElements();

    // 3rd draw call
    glUseProgram(render_pgm3);
    ....
    ....
    glDrawElements();

    glBindFramebuffer(GL_FRAMEBUFFER, 0); // unbind FBO
}
Here, only the 2nd draw call takes more time after the compute shader is used.
If glFinish() is called after glMemoryBarrier(), then only the execute_compute() call is slowed down instead.
Why does the compute shader slow down the subsequent draw calls?
Is glFinish() really needed?

The compute shader does not slow down the subsequent draw call. However, the compute shader itself takes some time to execute, and since you set a memory barrier, the subsequent draws have to wait for it.
OpenGL commands are queued and are not necessarily executed at the moment they are called. The GPU and CPU work in parallel: the CPU submits instructions to the GPU, and the GPU processes them as quickly as possible.
glFinish does not return until all previously issued commands have completed. glFinish itself is not "costly". It only seems costly when you measure time on the CPU, because you are really measuring how long the previously issued OpenGL commands take to complete.
In any case, glFinish is not needed here. All you need is the memory barrier. With the memory barrier in place, the following OpenGL commands that depend on it appear to take longer to complete. They don't actually take any longer; they just have to wait until the condition signaled by the barrier is met.
In your case you need to use GL_TEXTURE_FETCH_BARRIER_BIT (or GL_ALL_BARRIER_BITS), which makes incoherent memory writes performed before the barrier (e.g. image stores) visible to texture fetches issued after the barrier.
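As a minimal sketch of the ordering this implies (reusing the handles from the question's code; not complete code):

// Dispatch the compute shader that fills computeOutTex via imageStore().
glUseProgram(computePgmId);
glDispatchCompute(width / 256, height, 1);

// The texture written with imageStore() is read via texture fetches in the
// subsequent draw calls, so the barrier has to cover texture fetches.
glMemoryBarrier(GL_TEXTURE_FETCH_BARRIER_BIT);   // or GL_ALL_BARRIER_BITS

// No glFinish() is needed: the draws below simply wait until the compute
// results are visible, which takes exactly as long as the compute work itself.
glUseProgram(render_pgm1);
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, computeOutTex);
// ... glDrawElements() as in draw() ...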

Related

OpenGL ES: Handling a large amount of matrix data to improve performance

I am using instancing in my OpenGL app, and since only one draw call is made I have to compute one larger matrix that consists of smaller matrices. That larger matrix is sent to the shader, where gl_InstanceID distinguishes between the successive matrices.
It is uploaded with the following call:
GLES30.glUniformMatrix4fv(mMVPMatrixHandleBall, nBalls, false, mMVPMatrixMajor, 0);
and in the shader the multiplication is done by
gl_Position = u_MVPMatrix[gl_InstanceID] * a_Position;
Simple!
On the client side the larger matrix is created by the following code:
private void setLargeMVPmatrix() {
    int cnt = 0;
    for (Iterator<Ball> shapeIterator = arrayListBalls.iterator(); shapeIterator.hasNext(); ) {
        Ball ball = shapeIterator.next();
        mModelMatrix = ball.getmModelMatrix();
        // multiply
        Matrix.multiplyMM(mMVPMatrix, 0, mViewMatrix, 0, mModelMatrix, 0);
        Matrix.multiplyMM(mMVPMatrix, 0, mProjectionMatrix, 0, mMVPMatrix, 0);
        // copy the matrix data into a larger array, i.e. we get one big matrix
        // that contains several smaller matrices
        for (int i = 0; i < CreateGLContext.MATRIX_SIZE; i++) {
            mMVPMatrixMajor[i + CreateGLContext.MATRIX_SIZE * cnt] = mMVPMatrix[i];
        }
        cnt++;
    }
}
If I have moving objects on the screen, like 100 balls bouncing around, I have to continuously translate their positions each frame, which in turn means I have to call this method every frame. The consequence is that it becomes a real performance bottleneck. I know this because just commenting out the method gives a real performance boost, but of course the balls no longer move.
So my question: is there a solution to this problem? If I use instancing, I have to send a large matrix as described above.
Edit:
I've even tried the following, which I thought could at least partially solve my problem. In the draw method:
int cnt = 0;
for (Iterator<Ball> it = arrayListBalls.iterator(); it.hasNext(); ) {
    Ball ball = it.next();
    mModelMatrix = ball.getmModelMatrix();
    // multiply
    Matrix.multiplyMM(mMVPMatrix, 0, mViewMatrix, 0, mModelMatrix, 0);
    Matrix.multiplyMM(mMVPMatrix, 0, mProjectionMatrix, 0, mMVPMatrix, 0);
    GLES30.glUniformMatrix4fv((mMVPMatrixHandleBall + cnt), 1, false, mMVPMatrix, 0);
    cnt++;
}
Thanks in advance!!!
If the data that change are positions and rotations, then that is what you should upload to the shader.
Doing most of the matrix math on the CPU is slow, unless the required operations are tiny, such as computing the new view and projection matrices, which are the same for all objects and cheap to pass as uniforms.
Every frame I would re-fill a buffer, for example with glMapBufferRange or glBufferSubData, with the new positions and rotations.
Then, in the shader, build the required matrices and do the matrix multiplications there.
If the initial positions and rotations are needed to build the new matrices, you must also pass them in another buffer, although you only need to update that one for the first frame.
With the proper attribute order you read these positions and rotations in the shader. gl_InstanceID is then not needed for the gl_Position calculation, though it may still be useful for other per-object properties.
If you need help building matrices inside the shaders, look up glRotate and glTranslate in the OpenGL 2.1 docs, where you can find the matrix definitions.
Also note that passing one big matrix array for all objects as a uniform may exceed the limit on total uniform storage.
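As a rough sketch of that approach, written with the C API (the GLES30 Java bindings used in the question map to these calls one-to-one); instanceVbo, ballPosRot, indexCount and the attribute index are placeholders for illustration, not part of the original code:

// One vec4 per ball: xyz = position, w = rotation angle, re-uploaded every frame.
GLuint instanceVbo;
glGenBuffers(1, &instanceVbo);
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glBufferData(GL_ARRAY_BUFFER, nBalls * 4 * sizeof(GLfloat), NULL, GL_DYNAMIC_DRAW);

// Attribute 3 (for example) advances once per instance instead of once per vertex.
glEnableVertexAttribArray(3);
glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE, 0, 0);
glVertexAttribDivisor(3, 1);

// Every frame: upload only the small position/rotation array ...
glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, nBalls * 4 * sizeof(GLfloat), ballPosRot);
// ... and in the vertex shader build the model matrix from this attribute
// (with small helper functions based on the glTranslate/glRotate definitions),
// then multiply by the view-projection matrix passed once as a uniform.
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0, nBalls);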

Unbinding a WebGL buffer, worth it?

In various sources I've seen recommendations to 'unbind' buffers after use, i.e. set the binding to null. I'm curious whether there is really a need for this, e.g.
var buffer = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
// ... buffer related operations ...
gl.bindBuffer(gl.ARRAY_BUFFER, null); // unbinding
On the one hand, it's likely better for debugging as you'll probably get better error messages, but is there any significant performance loss from unbinding buffers all the time? It's generally recommended to reduce WebGL calls where possible.
The reason people often unbind buffers and other objects is to minimize the side effects of functions/methods. It's a general software development principle that functions should only perform their advertised operations, and not have any unexpected side effects. Therefore, it's a common practice that if a function binds objects, it unbinds them before returning.
Let's look at a typical example (with no particular language syntax). First, we define a function that creates a texture without any defined content:
function GLuint createEmptyTexture(int texWidth, int texHeight) {
    GLuint texId;
    glGenTextures(1, &texId);
    glBindTexture(GL_TEXTURE_2D, texId);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, texWidth, texHeight, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, 0);
    return texId;
}
Then, let's have another function to create a texture. But this one fills the texture with data from a buffer (which I believe is not supported in WebGL yet, but it still helps illustrate the general principle):
function GLuint createTextureFromBuffer(int texWidth, int texHeight,
                                        GLuint bufferId) {
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, bufferId);
    GLuint texId;
    glGenTextures(1, &texId);
    glBindTexture(GL_TEXTURE_2D, texId);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, texWidth, texHeight, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, 0);
    return texId;
}
Now, I can call these functions, and everything works as expected:
GLuint tex1 = createEmptyTexture(width, height);
GLuint tex2 = createTextureFromBuffer(width, height, bufferId);
But see what happens if I call them in the opposite order:
GLuint tex1 = createTextureFromBuffer(width, height, bufferId);
GLuint tex2 = createEmptyTexture(width, height);
This time, both textures will be filled with the buffer content, because the pixel unpack buffer was still bound after the first function returned, and therefore when the second function was called.
One way of avoiding this is to unbind the pixel unpack buffer at the end of the function that binds it. And to make sure that similar issues can not happen because the texture is still bound, it can unbind that one as well:
function GLuint createTextureFromBuffer(int texWidth, int texHeight,
                                        GLuint bufferId) {
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, bufferId);
    GLuint texId;
    glGenTextures(1, &texId);
    glBindTexture(GL_TEXTURE_2D, texId);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, texWidth, texHeight, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
    glBindTexture(GL_TEXTURE_2D, 0);
    return texId;
}
With this implementation, both call sequences of using these two functions will produce the same result.
There are other approaches to address this. For example:
1. Each function documents its preconditions and side effects, and the caller is responsible for making any state changes needed to meet the preconditions of the next function after calling a function with side effects.
2. Each function is completely responsible for setting up all of its own state. In the example above, this would mean that the createEmptyTexture() function would have to unbind the pixel unpack buffer, because it relies on none being bound.
Approach 1 does not really scale well, and will be painful to maintain in larger systems. Approach 2 is also unsatisfactory because OpenGL has a lot of state, and having to set up all relevant state in every function would be verbose and inefficient.
This is really part of a bigger question: How do you deal with the state based nature of OpenGL in a modular software architecture? Buffer bindings are just one example of state you need to deal with. This is typically not very difficult to handle in small programs that you write by yourself, but is a possible trouble spot in larger systems. It gets worse if components from different sources (e.g. different vendors) are mixed.
I don't think there's one single approach that is ideal in all possible scenarios. The important thing is that you pick one clearly defined strategy, and use it consistently. How to handle this best in various scenarios is somewhat beyond the scope of an answer here.
While unbinding buffers should be fairly cheap, I'm not a fan of unnecessary calls. So I would try to avoid those calls, unless you really feel you need them to enforce a clear and consistent policy for the software you are writing.

Image rotation using OpenGL ES

I'm working with OpenGL ES 2.0 on an OMAP3530 development board running Windows CE 7.
My task is to load a 24-bit image file, rotate it by an angle about the z-axis, and export the image (buffer).
For this I've created an FBO for off-screen rendering, loaded the image file as a texture with glTexImage2D(), applied that texture to a quad, rotated the quad with the PVRTMat4::RotationZ() API, and read it back with glReadPixels(). Since it is a single-frame process I run only one loop iteration.
Here are the problems I'm facing:
1) Every API takes a different amount of processing time on every run, i.e. each time I run my application I get different timings for all the calls.
2) glDrawArrays() is taking too much time (~50 ms - 80 ms).
3) glReadPixels() is also taking too much time, ~95 ms for an 800x600 image.
4) Loading a 32-bit image is much faster than a 24-bit image, so a conversion is needed.
If anybody has faced or solved a similar problem, kindly suggest an approach.
Here is the code snippet of my application.
void BindTexture() {
    glGenTextures(1, &m_uiTexture);
    glBindTexture(GL_TEXTURE_2D, m_uiTexture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, ImageWidth, ImageHeight, 0, GL_RGBA, GL_UNSIGNED_BYTE, pTexData);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
}
int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, TCHAR *lpCmdLine, int nCmdShow)
{
    // Fragment and vertex shader code
    char* pszFragShader = "Same as in RenderToTexture sample";
    char* pszVertShader = "Same as in RenderToTexture sample";
    CreateWindow(ImageWidth, ImageHeight); // for this I've referred to the OGLES2HelloTriangle_Windows.cpp example
    LoadImageBuffers();
    BindTexture();
    GenerateAndBindFrameAndRenderBuffers(); // pseudocode: generate & bind the FBO and renderbuffer
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, m_auiFbo, 0);
    glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT16, ImageWidth, ImageHeight);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_RENDERBUFFER, m_auiDepthBuffer);
    BindTexture();
    GLfloat Angle = 0.02f;
    GLfloat afVertices[] = { /* vertices to draw a quad */ };
    glGenBuffers(1, &ui32Vbo);
    LoadVBOs(); // pseudocode: APIs to load the VBOs
    // Draws the quad for 1 frame
    while (g_bDemoDone == false)
    {
        glBindFramebuffer(GL_FRAMEBUFFER, m_auiFbo);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        PVRTMat4 mRot, mTrans, mMVP;
        mTrans = PVRTMat4::Translation(0, 0, 0);
        mRot = PVRTMat4::RotationZ(Angle);
        glBindBuffer(GL_ARRAY_BUFFER, ui32Vbo);
        glDisable(GL_CULL_FACE);
        int i32Location = glGetUniformLocation(uiProgramObject, "myPMVMatrix");
        mMVP = mTrans * mRot;
        glUniformMatrix4fv(i32Location, 1, GL_FALSE, mMVP.ptr());
        // Pass the vertex data
        glEnableVertexAttribArray(VERTEX_ARRAY);
        glVertexAttribPointer(VERTEX_ARRAY, 3, GL_FLOAT, GL_FALSE, m_ui32VertexStride, 0);
        // Pass the texture coordinate data
        glEnableVertexAttribArray(TEXCOORD_ARRAY);
        glVertexAttribPointer(TEXCOORD_ARRAY, 2, GL_FLOAT, GL_FALSE, m_ui32VertexStride, (void*) (3 * sizeof(GLfloat)));
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
        glReadPixels(0, 0, ImageWidth, ImageHeight, GL_RGBA, GL_UNSIGNED_BYTE, pOutTexData);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
        eglSwapBuffers(eglDisplay, eglSurface);
    }
    DeInitAll();
}
The PowerVR architecture cannot render a single frame and let the ARM read it back quickly. It is simply not designed to be used that way - it is a tile-based deferred rendering architecture. The execution times you are seeing are to be expected, and using an FBO is not going to make it faster either. Also, beware that the OpenGL ES drivers on OMAP for Windows CE are of really poor quality; consider yourself lucky if they work at all.
A better design would be to display the OpenGL ES rendering directly on the DSS and avoid using glReadPixels() and the FBO completely.
I got improved performance for rotating an image buffer by using multiple FBOs and PBOs.
Here is a pseudocode snippet of my application.
InitGL()
{
    GenerateShaders();
    Generate3Textures();  // generate 3 null textures
    Generate3FBO();       // generate 3 FBOs & attach each texture to one FBO
    Generate3PBO();       // generate 3 PBOs to read back from the FBOs
}

DrawGL()
{
    BindFBO1;
    BindTexture1;
    UploadToTexture1;
    Do some processing & draw it into FBO1;

    BindFBO2;
    BindTexture2;
    UploadToTexture2;
    Do some processing & draw it into FBO2;

    BindFBO3;
    BindTexture3;
    UploadToTexture3;
    Do some processing & draw it into FBO3;

    BindFBO1;
    ReadPixelsFromFBO1;
    UnpackToPBO1;

    BindFBO2;
    ReadPixelsFromFBO2;
    UnpackToPBO2;

    BindFBO3;
    ReadPixelsFromFBO3;
    UnpackToPBO3;
}

DeinitGL();
DeallocateAll();
This way I achieved a 50% increase in overall processing performance.
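For reference, a rough sketch of the asynchronous read-back path that pseudocode relies on, assuming a context where pixel pack buffers are available (OpenGL ES 3.0, or ES 2.0 with a suitable extension); fbo, pbo, width and height are placeholders:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
// With a pack buffer bound, glReadPixels queues the copy and returns
// immediately instead of stalling the CPU.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
// ... submit the work for the other FBOs here ...

// Map only when the data is really needed; this is where any waiting happens.
void* pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                width * height * 4, GL_MAP_READ_BIT);
// ... use 'pixels' ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);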

How can I improve performance of Direct3D when I'm writing to a single vertex buffer thousands of times per frame?

I am trying to write an OpenGL wrapper that will allow me to use all of my existing graphics code (written for OpenGL) and will route the OpenGL calls to Direct3D equivalents. This has worked surprisingly well so far, except performance is turning out to be quite a problem.
Now, I admit I am most likely using D3D in a way it was never designed for. I am updating a single vertex buffer thousands of times per render loop. Every time I draw a "sprite" I send 4 vertices to the GPU with texture coordinates, etc., and when the number of "sprites" on the screen at one time gets to around 1k to 1.5k, the FPS of my app drops below 10.
Using the VS2012 Performance Analysis (which is awesome, btw), I can see that the ID3D11DeviceContext->Draw method is taking up the bulk of the time:
Screenshot Here
Is there some setting I'm not using correctly while setting up my vertex buffer, or during the draw method? Is it really, really bad to use the same vertex buffer for all of my sprites? If so, what other options do I have that wouldn't drastically alter the architecture of my existing graphics code base (which is built around the OpenGL paradigm: send EVERYTHING to the GPU every frame!)?
The biggest FPS killer in my game is when I'm displaying a lot of text on the screen. Each character is a textured quad, and each one requires a separate update to the vertex buffer and a separate call to Draw. If D3D or hardware doesn't like many calls to Draw, then how else can you draw a lot of text to the screen at one time?
Let me know if there is any more code you'd like to see to help me diagnose this problem.
Thanks!
Here's the hardware I'm running on:
Core i7 @ 3.5GHz
16 gigs of RAM
GeForce GTX 560 Ti
And here's the software I'm running:
Windows 8 Release Preview
VS 2012
DirectX 11
Here is the draw method:
void OpenGL::Draw(const std::vector<OpenGLVertex>& vertices)
{
    auto matrix = *_matrices.top();
    _constantBufferData.view = DirectX::XMMatrixTranspose(matrix);
    _context->UpdateSubresource(_constantBuffer, 0, NULL, &_constantBufferData, 0, 0);
    _context->IASetInputLayout(_inputLayout);
    _context->VSSetShader(_vertexShader, nullptr, 0);
    _context->VSSetConstantBuffers(0, 1, &_constantBuffer);
    D3D11_PRIMITIVE_TOPOLOGY topology = D3D11_PRIMITIVE_TOPOLOGY_TRIANGLESTRIP;
    ID3D11ShaderResourceView* texture = _textures[_currentTextureId];
    // Set shader texture resource in the pixel shader.
    _context->PSSetShader(_pixelShaderTexture, nullptr, 0);
    _context->PSSetShaderResources(0, 1, &texture);
    D3D11_MAPPED_SUBRESOURCE mappedResource;
    D3D11_MAP mapType = D3D11_MAP::D3D11_MAP_WRITE_DISCARD;
    auto hr = _context->Map(_vertexBuffer, 0, mapType, 0, &mappedResource);
    if (SUCCEEDED(hr))
    {
        OpenGLVertex* pData = reinterpret_cast<OpenGLVertex*>(mappedResource.pData);
        memcpy(&(pData[_currentVertex]), &vertices[0], sizeof(OpenGLVertex) * vertices.size());
        _context->Unmap(_vertexBuffer, 0);
    }
    UINT stride = sizeof(OpenGLVertex);
    UINT offset = 0;
    _context->IASetVertexBuffers(0, 1, &_vertexBuffer, &stride, &offset);
    _context->IASetPrimitiveTopology(topology);
    _context->Draw(vertices.size(), _currentVertex);
    _currentVertex += (int)vertices.size();
}
And here is the method that creates the vertex buffer:
void OpenGL::CreateVertexBuffer()
{
    D3D11_BUFFER_DESC bd;
    ZeroMemory(&bd, sizeof(bd));
    bd.Usage = D3D11_USAGE_DYNAMIC;
    bd.ByteWidth = _maxVertices * sizeof(OpenGLVertex);
    bd.BindFlags = D3D11_BIND_VERTEX_BUFFER;
    bd.CPUAccessFlags = D3D11_CPU_ACCESS_FLAG::D3D11_CPU_ACCESS_WRITE;
    bd.MiscFlags = 0;
    bd.StructureByteStride = 0;
    D3D11_SUBRESOURCE_DATA initData;
    ZeroMemory(&initData, sizeof(initData));
    _device->CreateBuffer(&bd, NULL, &_vertexBuffer);
}
Here is my vertex shader code:
cbuffer ModelViewProjectionConstantBuffer : register(b0)
{
    matrix model;
    matrix view;
    matrix projection;
};

struct VertexShaderInput
{
    float3 pos : POSITION;
    float4 color : COLOR0;
    float2 tex : TEXCOORD0;
};

struct VertexShaderOutput
{
    float4 pos : SV_POSITION;
    float4 color : COLOR0;
    float2 tex : TEXCOORD0;
};

VertexShaderOutput main(VertexShaderInput input)
{
    VertexShaderOutput output;
    float4 pos = float4(input.pos, 1.0f);

    // Transform the vertex position into projected space.
    pos = mul(pos, model);
    pos = mul(pos, view);
    pos = mul(pos, projection);
    output.pos = pos;

    // Pass through the color without modification.
    output.color = input.color;
    output.tex = input.tex;

    return output;
}
What you need to do is batch vertices as aggressively as possible, then draw in large chunks. I've had very good luck retrofitting this into old immediate-mode OpenGL games. Unfortunately, it's kind of a pain to do.
The simplest conceptual solution is to use some sort of device state (which you're probably tracking already) to create a unique stamp for a particular set of vertices. Something like blend modes and bound textures is a good set. If you can find a fast hashing algorithm to run on the struct they're in, you can store it pretty efficiently.
Next, you need to do the vertex caching. There are two ways to handle that, both with advantages. The most aggressive, most complicated, and, in the case of many sets of vertices with similar properties, most efficient approach is to make a struct of device states, allocate a large (say 4KB) buffer, and store vertices with matching states in that array. You can then dump the entire array into a vertex buffer at the end of the frame and draw chunks of the buffer (to recreate the original order). Keeping track of all the buffers, states, and ordering is difficult, however.
The simpler method, which can provide a good bit of caching under good circumstances, is to cache vertices in a large buffer until the device state changes. At that point, prior to actually changing state, dump the array into a vertex buffer and draw. Then reset the array index, commit the state changes, and go again.
If your application has large numbers of similar vertices, which is very possible when working with sprites (texture coordinates and colors may change, but good sprites will use a single texture atlas and few blending modes), even the second method can give some performance boosts.
The trick here is to build up a cache in system memory, preferably a large chunk of pre-allocated memory, then dump it to video memory just prior to drawing. This lets you perform far fewer writes to video memory and far fewer draw calls, both of which tend to be expensive (especially together). As you've seen, the sheer number of calls becomes the bottleneck, and batching stands a good chance of helping with that. The key points are to not allocate memory each frame if you can help it, to batch chunks large enough to be worthwhile, and to maintain correct device state and order for each draw.
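As a rough illustration of the simpler caching strategy described above, assuming the sprites are expanded into triangle lists so a whole batch can go through one Draw call; the names below (FlushBatch, batch, vertexBuffer) are placeholders for illustration, not the question's actual members:

#include <d3d11.h>
#include <vector>
#include <cstring>

// Flush the sprite vertices accumulated since the last state change in one
// Map/Draw pair. 'vertexBuffer' is a dynamic buffer sized for a whole batch.
void FlushBatch(ID3D11DeviceContext* ctx, ID3D11Buffer* vertexBuffer,
                std::vector<OpenGLVertex>& batch)
{
    if (batch.empty())
        return;
    D3D11_MAPPED_SUBRESOURCE mapped;
    // One WRITE_DISCARD map per flush instead of one per sprite.
    if (SUCCEEDED(ctx->Map(vertexBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
    {
        std::memcpy(mapped.pData, batch.data(), sizeof(OpenGLVertex) * batch.size());
        ctx->Unmap(vertexBuffer, 0);
        ctx->Draw((UINT)batch.size(), 0); // one draw call for the whole batch
    }
    batch.clear();
}

// Call FlushBatch() just before any texture or blend-state change and once at
// the end of the frame; in between, only push vertices into 'batch'.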

Optimizing an OpenGL ES operation?

I'm using the following code to draw characters on screen (each UTF8 character is a texture):
int row = 0;
glTexCoordPointer(2, GL_FLOAT, 0, texCoords);
glVertexPointer(2, GL_FLOAT, 0, vertices);
for (StdStr* line in _lines) {
    const char* str = [line cString];
    for (int i = 0; i < strlen(str); i++) {
        ((XX::OGL::GLESContext*)context)->viewport(C_WIDTH * i,
                                                   C_HEIGHT * row,
                                                   C_WIDTH,
                                                   C_HEIGHT);
        glColor4f(1.0, 1.0, 1.0, 1.0);
        glBindTexture(GL_TEXTURE_2D, _textures[0] + *(str + i));
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    }
    row++;
}
When there are a lot of characters, the code takes longer to run. In that case, almost 99% of the time is spent in the glDrawArrays routine. Is it possible to minimise the number of calls to glDrawArrays? The OpenGL ES version is 1.1.
Actually I think you should try to limit the number of calls to viewport, glBindTexture, and glDrawArrays.
Technically, you should pack all your characters into a single texture (a texture atlas), so that you can bind it once.
Then you can compute the vertices and texcoords in a loop like you do now, but doing the viewport math yourself and accumulating the results in a CPU array. Once your array is built, submit a single draw call with that array; a sketch along these lines is shown after the links below.
You can probably find inspiration here:
http://www.angelcode.com/products/bmfont/
http://sourceforge.net/projects/oglbmfont/
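Along those lines, a rough sketch of building one vertex/texcoord array per line of text and issuing a single draw call; atlasTexture, atlasU(), atlasV(), glyphW and glyphH are hypothetical placeholders for however the atlas is laid out:

// Hypothetical sketch: draw one line of text with a single glDrawArrays call.
// Assumes a GL ES 1.1 context with GL_VERTEX_ARRAY / GL_TEXTURE_COORD_ARRAY
// client states enabled, and all glyphs packed into one atlas texture.
void drawLine(const char* str, int numChars, int row, GLuint atlasTexture)
{
    std::vector<GLfloat> verts, uvs;
    verts.reserve(numChars * 12);
    uvs.reserve(numChars * 12);
    for (int i = 0; i < numChars; ++i) {
        float x0 = C_WIDTH * i,  y0 = C_HEIGHT * row;
        float x1 = x0 + C_WIDTH, y1 = y0 + C_HEIGHT;
        float u0 = atlasU(str[i]), v0 = atlasV(str[i]); // glyph's cell in the atlas
        float u1 = u0 + glyphW,    v1 = v0 + glyphH;
        // Two triangles (6 vertices) per character, so the whole line is one array.
        const GLfloat quad[12] = { x0,y0, x1,y0, x1,y1,   x0,y0, x1,y1, x0,y1 };
        const GLfloat tex[12]  = { u0,v0, u1,v0, u1,v1,   u0,v0, u1,v1, u0,v1 };
        verts.insert(verts.end(), quad, quad + 12);
        uvs.insert(uvs.end(), tex, tex + 12);
    }
    glBindTexture(GL_TEXTURE_2D, atlasTexture);               // one bind per line
    glVertexPointer(2, GL_FLOAT, 0, verts.data());
    glTexCoordPointer(2, GL_FLOAT, 0, uvs.data());
    glDrawArrays(GL_TRIANGLES, 0, (GLsizei)verts.size() / 2); // one draw per line
}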
