OpenGL slow on Windows

I am currently writing a small game engine using OpenGL.
The mesh data is uploaded to VBOs using GL_STATIC_DRAW.
Since I read that glBindBuffer is rather slow, I tried to minimize its use by accumulating the information needed for rendering and then drawing each VBO multiple times with only one glBindBuffer call per VBO (a sort of batch rendering).
Here is the code I use for the actual rendering:
int lastID = -1;
for(list<R_job>::iterator it = jobs.begin(); it != jobs.end(); ++it) {
    if(lastID == -1) {
        lastID = *it->vboID;
        glBindTexture(GL_TEXTURE_2D, it->texID);
        glBindBuffer(GL_ARRAY_BUFFER, *it->vboID);
        glVertexPointer(3, GL_FLOAT, 4*(3+2+3), 0);
        glTexCoordPointer(2, GL_FLOAT, 4*(3+2+3), (void*)(4*3));
        glNormalPointer(GL_FLOAT, 4*(3+2+3), (void*)(4*(3+2)));
    }
    if(lastID != *it->vboID) {
        glBindTexture(GL_TEXTURE_2D, it->texID);
        glBindBuffer(GL_ARRAY_BUFFER, *it->vboID);
        glVertexPointer(3, GL_FLOAT, 4*(3+2+3), 0);
        glTexCoordPointer(2, GL_FLOAT, 4*(3+2+3), (void*)(4*3));
        glNormalPointer(GL_FLOAT, 4*(3+2+3), (void*)(4*(3+2)));
        lastID = *it->vboID;
    }
    glPushMatrix();
    glMultMatrixf(value_ptr(*it->mat)); // the model matrix
    glDrawArrays(GL_TRIANGLES, 0, it->size); // render
    glPopMatrix();
}
The list is sorted by VBO ID, and the vertex data is interleaved.
My question is about speed. This code renders about 800 VBOs (actually the same VBO; only glDrawArrays is called multiple times) at 30 fps on my 2010 MacBook. On my PC (Phenom II X4 955 / HD 5700), the fps drop below 30 at only 400 draw calls. Could someone explain this to me? I had hoped for a speedup on the PC. I am using GLFW and GLEW on both machines, with Xcode on the Mac and VS2012 on the PC.
EDIT:
The mesh I am rendering has about 600 verts.

I'm sure you were running in Release mode, but you might also consider running in Release mode without the debugger attached. I think you will find that doing so solves your performance issue with list::sort. In my experience the VS debugger can have a significant performance impact while attached - far more so than gdb.
2000 entities is a reasonable place to start seeing some FPS drop. At that point you are making nearly 10,000 API calls per frame. To comfortably get higher than that, you will need to start doing something like instancing, so that you are drawing multiple entities with one call.
Finally, I would like to say that glBindBuffer is not an expensive operation, and not really something that you should be batching to avoid. If you are going to batch, batch to avoid changing shaders and shader state (uniforms). But don't batch just to avoid changing buffer objects.
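To illustrate the instancing suggestion above, here is a minimal sketch. It assumes a shader-based pipeline rather than the fixed-function calls in the question, an OpenGL 3.3 context loaded via GLEW, and a vertex shader that reads a per-instance model matrix from attribute locations 3-6; names such as modelMatrices and vertexCount are illustrative, not from the original code:
#include <GL/glew.h>
#include <glm/glm.hpp>
#include <vector>

// Hypothetical: one model matrix per entity, filled elsewhere each frame.
extern std::vector<glm::mat4> modelMatrices;
extern GLsizei vertexCount;   // vertex count of the shared mesh VBO

void drawInstanced() {
    // Upload the per-instance matrices into their own buffer
    // (in real code, create this buffer once and only update it here).
    GLuint instanceVBO;
    glGenBuffers(1, &instanceVBO);
    glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
    glBufferData(GL_ARRAY_BUFFER, modelMatrices.size() * sizeof(glm::mat4),
                 modelMatrices.data(), GL_DYNAMIC_DRAW);

    // A mat4 attribute occupies four consecutive vec4 slots (here 3..6).
    for (int i = 0; i < 4; ++i) {
        glEnableVertexAttribArray(3 + i);
        glVertexAttribPointer(3 + i, 4, GL_FLOAT, GL_FALSE, sizeof(glm::mat4),
                              (void*)(sizeof(glm::vec4) * i));
        glVertexAttribDivisor(3 + i, 1); // advance once per instance, not per vertex
    }

    // One call draws every entity that shares this mesh.
    glDrawArraysInstanced(GL_TRIANGLES, 0, vertexCount,
                          (GLsizei)modelMatrices.size());
}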

So I guess I found the answer.
Enabling VSync in the AMD Control Center increased the fps greatly (probably because of GLFW's swapBuffers), though they were still below the values on my MacBook. After some debugging I found that list::sort is somehow VERY slow on Windows and was pulling my fps down...
So right now (without list::sort) the fps drop below 60 at around 2000 entities: 600 * 2000 = 1,200,000 vertices per frame.
Should that be my graphics card's limit?
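If you do want to keep the jobs sorted by VBO, one alternative (a sketch only, assuming the R_job type from the question, where vboID is a pointer as in *it->vboID) is to store the jobs in a std::vector and sort it with std::sort, which in practice tends to be much faster than list::sort, especially in MSVC debug builds:
#include <algorithm>
#include <vector>

// Comparator: order jobs by the VBO they use, so consecutive jobs share a binding.
static bool byVboID(const R_job& a, const R_job& b) {
    return *a.vboID < *b.vboID;
}

std::vector<R_job> jobs;   // filled each frame instead of the std::list

void sortJobs() {
    std::sort(jobs.begin(), jobs.end(), byVboID);
}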

Related

Chrome WebGL performance wildly inconsistent?

function render(time, scene) {
    if (useFramebuffer) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, scene.fb);
    }
    gl.viewport(0.0, 0.0, canvas.width, canvas.height);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);
    gl.enable(gl.DEPTH_TEST);
    renderScene(scene);
    gl.disable(gl.DEPTH_TEST);
    if (useFramebuffer) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, null);
        copyFBtoBackBuffer(scene.fb);
    }
    window.requestAnimationFrame(function(time) {
        render(time, scene);
    });
}
I'm not able to share the exact code I use, but a mockup will illustrate my point.
I'm rendering a fairly complex scene and am also doing some ray tracing in WebGL. I've noticed two very strange performance issues.
1) Inconsistent frame rate between runs.
Sometimes, when the page starts, the first ~100 frames render in 25 ms each; then the frame time suddenly jumps to 45 ms, without any user input or changes to the scene. I'm not updating any buffer or texture data within a frame, only shader uniforms. When this happens, GPU memory usage stays constant.
2) Rendering to the default framebuffer is slower than using an extra pass.
If I render to a created framebuffer and then blit to the HTML canvas (the default framebuffer), I get a 10% performance increase. So in the code snippet, performance is better when useFramebuffer == true, which seems very counterintuitive.
Edit 1:
Due to changes in requirements, the scene will always be rendered to a framebuffer and then copied to the canvas. This makes question 2) a non-issue.
Edit 2:
System specs of the PC this was tested on:
OS: Windows 10
CPU: Intel i7-7700
GPU: Nvidia GTX 1080
RAM: 16 GB
Edit 3:
I profiled the scene using chrome://tracing. The first ~100-200 frames render in 16.6 ms.
Then it starts dropping frames.
I'll try profiling everything with timer queries, but I'm afraid each render actually takes the same amount of time and the buffer swap randomly takes twice as long.
Another thing I noticed is that this starts happening after I've used Chrome for a while. Once the problem starts, clearing the browser cache or killing the Chrome process doesn't help; only a system reboot does.
Is it possible that Chrome is throttling the GPU on a whim?
P.S.
The frame times changed because of some optimizations, but the core problem persists.

No smooth animation for processing sketch, yet normal GPU/CPU load and framerate

I'm working on the visualizations of an interactive installation, as seen here: http://vimeo.com/78977964. But I'm running into some issues with the smoothness of the animation. While it tells me it runs at a steady 30 or 60 fps, the actual image is not smooth at all; imagine a 15 fps animation with an unsteady clock. Can you give me some pointers on where to look when optimizing my sketch?
What I'm doing is receiving relative coordinates (0.0-1.0 on the x and y axes) through oscP5. This goes through a data handler that checks that there hasn't been input in that area for a certain amount of time. If all is OK, a new Wave object is created, which draws an expanding (modulated) circle at its location. As the installation had to be very flexible, all visual parameters are adjustable through a controlP5 GUI.
All of this is running on a computer with an i7 3770 @ 3.4 GHz, 8 GB RAM and two Radeon HD 7700s driving 4 to 10 Panasonic EX600 XGA projectors over VGA (simply drawing a 3072x1536 window). The CPU and GPU load is reasonable (http://imgur.com/a/usNVC), but the performance is not what we want it to be.
We tried a number of solutions, including changing the rendering mode, trying a different GPU, different drawing methods, changing the process priority and exporting to an application, but nothing made a noticeable improvement. So now I'm guessing it's either Processing/Java not being able to run smoothly over multiple monitors, or something in my code causing this...
How I draw the waves within the Wave class (this is called from the main draw loop for every wave object):
public void draw() {
    this.diameter = map(this.frequency, lowLimitFrequency, highLimitFrequency, speedLowFreq, speedHighFreq) * (millis() - date) / 5f;
    strokeWeight(map(this.frequency, lowLimitFrequency, highLimitFrequency, lineThicknessLowFreq, lineThicknessHighFreq) * map(this.diameter, 0, this.maxDiameter, 1., 0.1) * 50);
    stroke(255, 255, 255, constrain((int)map(this.diameter, 0, this.maxDiameter, 255, 0), 0, 255));
    pushMatrix();
    beginShape();
    translate(h * this.x * width, v * this.y * height);
    // this draws a circle from line segments, and is modulated by a sine wave
    for (int i = 0; i < segments; i++) {
        vertex(
            (this.distortion * sin(map(i, 0, segments, 0, this.periods * TWO_PI)) + 1) * this.diameter * sin(i * TWO_PI / segments),
            (this.distortion * sin(map(i, 0, segments, 0, this.periods * TWO_PI)) + 1) * this.diameter * cos(i * TWO_PI / segments)
        );
    }
    vertex(
        (this.distortion * sin(map(0, 0, segments, 0, this.periods * TWO_PI)) + 1) * this.diameter * sin(0 * TWO_PI / segments),
        (this.distortion * sin(map(0, 0, segments, 0, this.periods * TWO_PI)) + 1) * this.diameter * cos(0 * TWO_PI / segments)
    );
    endShape();
    popMatrix();
}
I hope I've provided enough information to grasp what's going wrong!
My colleagues and I have had similar issues here running a PowerWall (6x3 monitors) from one PC using an Eyefinity setup. The short version is that, as you've discovered, there are a lot of problems running Processing sketches across multiple cards.
We've tended to work around it by using a different approach - multiple copies of the application, which each span one monitor only, render a subsection and sync themselves up. This is the approach people tend to use when driving large displays from multiple machines, but it seems to sidestep these framerate problems as well.
For Processing, there're a couple of libraries that support this: Dan Shiffman's Most Pixels Ever and the Massive Pixel Environment from the Texas Advanced Computing Center. They've both got reasonable examples that should help you through the setup phase.
One proviso though, we kept encountering crashes from JOGL if we tried this with OpenGL rendering - this was about 6 months ago, so maybe that's fixed now. Your draw loop looks like it'll be OK using Java2D, so hopefully that won't be an issue for you.

OpenGL rendering performance - default shader or no shader?

I have some code for rendering a video, so the OpenGL side of it (once the rendered frame is available in the target texture) is very simple: Just render it to the target rectangle.
What complicates things a bit is that I am using a third-party SDK to render the UI, so I cannot know what state changes it makes, and therefore every time I am rendering a frame I have to make sure all the states I need are set correctly.
I am using a vertex and a texture coordinate buffer to draw my rectangle like this:
glActiveTexture(GL_TEXTURE0);
glEnable(GL_TEXTURE_RECTANGLE_ARB);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texHandle);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);
glPushClientAttrib( GL_CLIENT_VERTEX_ARRAY_BIT );
glEnableClientState( GL_VERTEX_ARRAY );
glEnableClientState( GL_TEXTURE_COORD_ARRAY );
glBindBuffer(GL_ARRAY_BUFFER, m_vertexBuffer);
glVertexPointer(4, GL_FLOAT, 0, 0);
glBindBuffer(GL_ARRAY_BUFFER, m_texCoordBuffer);
glTexCoordPointer(2, GL_FLOAT, 0, 0);
glDrawArrays(GL_QUADS, 0, 4);
glPopClientAttrib();
(Is there anything that I can skip - even when not knowing what is happening inside the UI library?)
Now I wonder (and this is more theoretical, as I suppose there won't be much difference when drawing just one quad) whether it is faster to render like the above, or to instead write a simple default vertex and fragment shader that does maybe nothing more than return ftransform() for the position and uses the default path for the fragment color too.
I wonder if by using a shader I could skip certain state changes, or generally speed things up. Or whether, with the above code, OpenGL internally just does that anyway and the outcome is exactly the same?
If you are worried about clobbering the UI SDK state, you should wrap the code with glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT) ... glPopAttrib() as well.
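As a sketch of that wrapping, applied around the draw code from the question (the attribute bits shown are an assumption about which state that snippet actually touches):
void drawVideoQuad() {
    // Save the server-side enable/texture state and the client-side array state
    // that the code below changes, so the UI SDK sees its own state afterwards.
    glPushAttrib(GL_ENABLE_BIT | GL_TEXTURE_BIT);
    glPushClientAttrib(GL_CLIENT_VERTEX_ARRAY_BIT);

    // ... the glActiveTexture / glEnable / glBindTexture / glTexEnvi /
    //     glVertexPointer / glTexCoordPointer / glDrawArrays calls
    //     from the question go here, unchanged ...

    glPopClientAttrib();
    glPopAttrib();
}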
You could simplify the state management code a bit by using a vertex array object.
As to using a shader, for this simple program I wouldn't bother. It would be one more bit of state you'd have to save & restore, and you're right that internally OpenGL is probably doing just that for the same outcome.
On speeding things up: performance is going to be dominated by the cost of sending tens? hundreds? of kilobytes of video frame data to the GPU, and adding or removing OpenGL calls is very unlikely to make a difference. I'd look first at possible differences in frame rate between the UI and video stream: for example, if the frame rate is faster, arrange for the video data to be copied once and re-used instead of copying it each time the UI is redrawn.
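A sketch of that last point, with hypothetical names (newFrameAvailable, framePixels, frameWidth, frameHeight) and an assumed BGRA frame format: the texture is updated only when the decoder has produced a new frame, while UI-triggered redraws simply reuse the already-uploaded texture:
void drawVideoFrame() {
    if (newFrameAvailable) {   // hypothetical flag set by the video decoder
        glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texHandle);
        glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0,
                        frameWidth, frameHeight,
                        GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, framePixels);
        newFrameAvailable = false;
    }
    // ... then draw the textured rectangle exactly as in the question ...
}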
Hope this helps.

OpenGL - Fast Textured Quads?

I am trying to display as many textured quads as possible at random positions in 3D space. In my experience so far, I cannot display even a couple of thousand of them without the fps dropping significantly below 30 (my camera movement script becomes laggy).
Right now I am following an ancient tutorial. After initializing OpenGL:
glEnable(GL_TEXTURE_2D);
glShadeModel(GL_SMOOTH);
glClearColor(0, 0, 0, 0);
glClearDepth(1.0f);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
glHint(GL_PERSPECTIVE_CORRECTION_HINT, GL_NICEST);
Then I set the viewport and the perspective projection:
glViewport(0,0,width,height);
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
gluPerspective(60.0f,(GLfloat)width/(GLfloat)height,0.1f,100.0f);
glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
Then I load some textures:
glGenTextures(TEXTURE_COUNT, &texture[0]);
for (int i...){
    glBindTexture(GL_TEXTURE_2D, texture[i]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_NEAREST);
    gluBuild2DMipmaps(GL_TEXTURE_2D, 3, TextureImage[0]->w, TextureImage[0]->h, GL_RGB, GL_UNSIGNED_BYTE, TextureImage[0]->pixels);
}
And finally I draw my GL_QUADS using:
glBindTexture(GL_TEXTURE_2D, q);
glTranslatef(fDistanceX, fDistanceZ, -fDistanceY);
glBegin(GL_QUADS);
    glNormal3f(a, b, c);
    glTexCoord2f(d, e); glVertex3f(x1, y1, z1);
    glTexCoord2f(f, g); glVertex3f(x2, y2, z2);
    glTexCoord2f(h, k); glVertex3f(x3, y3, z3);
    glTexCoord2f(m, n); glVertex3f(x4, y4, z4);
glEnd();
glTranslatef(-fDistanceX, -fDistanceZ, fDistanceY);
I find all that code very self-explanatory. Unfortunately, that way of doing things is deprecated, as far as I know. I read some vague things about PBOs and vertex arrays on the internet, but I did not find any tutorial on how to use them. I don't even know if those objects are suited to what I am trying to do here (a billion quads on the screen without lag). Perhaps someone here could give me a definitive suggestion of what I should use to achieve that? And if you happen to have one more minute of spare time, could you give me a short summary of how those functions are used (just as I did with the deprecated ones above)?
Perhaps anyone here could give me a definitive suggestion, of what I should use to achieve the result?
What is "the result"? You have not explained very well what exactly it is that you're trying to accomplish. All you've said is that you're trying to draw a lot of textured quads. What are you trying to do with those textured quads?
For example, you seem to be creating the same texture, with the same width and height, given the same pixel data. But you store these in different texture objects. OpenGL does not know that they contain the same data. Therefore, you spend a lot of time swapping textures needlessly when you render quads.
If you're just randomly drawing them to test performance, then the question is meaningless. Such tests are pointless, because they are entirely artificial. They test only this artificial scenario where you're changing textures every time you render a quad.
Without knowing what you are trying to ultimately render, the only thing I can do is give general performance advice, in order (i.e. do the first before you do the later ones):
1. Stop changing textures for every quad. You can package multiple images together in the same texture (a texture atlas), then render all of the quads that use that texture at once, with only one glBindTexture call. The texture coordinates of each quad select which image within the texture it uses.
2. Stop using glTranslate to position each individual quad. You can use it to position groups of quads, but you should do the math yourself to compute each quad's vertex positions. Once those glTranslate calls are gone, you can put multiple quads within a single glBegin/glEnd pair.
3. Assuming that your quads are static (fixed position in model space), consider using a buffer object to store and render your quad data; see the sketch after this list.
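For illustration, a minimal sketch of point 3 (which also folds in point 2): quad corners are precomputed in model space, stored once in a static VBO together with atlas texture coordinates, and drawn with a single call. Names such as QuadVertex and atlasTexture are illustrative, and it keeps the fixed-function vertex arrays the question already uses:
#include <vector>

struct QuadVertex { GLfloat x, y, z, u, v; }; // interleaved: position + atlas texcoord

// Filled once at load time: 4 vertices per quad, positions already offset
// by each quad's world position, texcoords pointing into the atlas.
std::vector<QuadVertex> verts;
GLuint quadVBO;
GLuint atlasTexture; // created with glGenTextures/gluBuild2DMipmaps as in the question

void uploadQuads() {
    glGenBuffers(1, &quadVBO);
    glBindBuffer(GL_ARRAY_BUFFER, quadVBO);
    glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(QuadVertex),
                 &verts[0], GL_STATIC_DRAW); // upload once
}

void drawQuads() {
    // Per frame: one texture bind (the atlas), one buffer bind, one draw call.
    glBindTexture(GL_TEXTURE_2D, atlasTexture);
    glBindBuffer(GL_ARRAY_BUFFER, quadVBO);
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(3, GL_FLOAT, sizeof(QuadVertex), (void*)0);
    glTexCoordPointer(2, GL_FLOAT, sizeof(QuadVertex), (void*)(3 * sizeof(GLfloat)));
    glDrawArrays(GL_QUADS, 0, (GLsizei)verts.size());
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}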
I read some vague things about PBO and vertexArrays on the internet, but i did not find any tutorial on how to use them.
Did you try the OpenGL Wiki, which has a pretty good list of tutorials (as well as general information on OpenGL)? In the interest of full disclosure, I did write one of them.
I heard, in modern games milliards of polygons are rendered in real time
Actually, it's in the millions. I presume you're German: "Milliarde" translates to "billion" in English.
Right now I am following an ancient tutorial.
This is your main problem. Contemporary OpenGL applications don't use such ancient rendering methods. You are using immediate mode, which means you go through several function calls just to submit a single vertex. This is highly inefficient. Modern applications, like games, can reach such high triangle counts because they don't waste CPU time calling that many functions and don't waste CPU→GPU bandwidth streaming vertex data every frame.
To reach such high triangle counts in real time you must place all the geometry data in "fast memory", i.e. in the RAM on the graphics card. The technique OpenGL offers for this is called "vertex buffer objects" (VBOs). Using a VBO you can draw large batches of geometry with a single draw call (glDrawArrays, glDrawElements and their relatives).
After getting the geometry out of the way, you must be nice to the GPU. GPUs don't like it if you switch textures or shaders often. Switching a texture invalidates the contents of the cache(s); switching a shader stalls the GPU pipeline and, worse, invalidates the execution-path prediction statistics (the GPU keeps statistics about which execution paths of a shader are most likely to be taken and which memory access patterns it exhibits, and uses them to iteratively optimize shader execution).

glReadPixels() slow on reading GL_DEPTH_COMPONENT

My application is dependent on reading depth information back from the framebuffer. I've implemented this with glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data)
However, this runs unreasonably slowly; it brings my application from a smooth 30 fps down to a laggy 3 fps. If I read back other dimensions or other data, it runs at an acceptable speed.
To give an overview:
No glReadPixels -> 30 frames per second
glReadPixels(0, 0, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 3 frames per second, not acceptable
Why should the last one be so slow compared to the other calls? Is there any way to remedy it?
width x height is approximately 100 x 1000, and the call gets increasingly slower as I increase the dimensions.
I've also tried using pixel buffer objects, but that has no significant effect on performance; it only delays the stall until the glMapBuffer() call.
(I've tested this on a MacBook Air with nVidia 320M graphics on OS X 10.6; strangely enough, my old MacBook with an Intel GMA X3100 got ~15 fps reading the depth buffer.)
UPDATE: leaving GLUT_MULTISAMPLE out of the glutInitDisplayMode options made a world of difference, bringing the application back to a smooth 20 fps. I don't know what that option does in the first place - can anyone explain?
If your main framebuffer is MSAA-enabled (GLUT_MULTISAMPLE is present), then 2 actual framebuffers are created - one with MSAA and one regular.
The first one is the one you actually render into; it contains front and back color surfaces, plus depth and stencil. The second one only has to contain the color surface, produced by resolving the corresponding MSAA surface.
However, when you are trying to read depth using glReadPixels the driver is forced to resolve the MSAA-enabled depth surface too, which probably causes your slowdown.
What is the storage format you chose for your depth buffer?
If it is not GLfloat, then you're asking GL to convert every single value in the depth buffer to float while reading it. (And it's the same for your third bullet with GL_RED: was your color buffer a float buffer?)
Whether it is GL_FLOAT or GL_UNSIGNED_BYTE, glReadPixels is still very slow. If you use a PBO to read back RGB values, it is very fast.
When using a PBO to read RGB values, the CPU usage is about 4%, but it rises to about 50% when reading depth values. I've tried GL_FLOAT, GL_UNSIGNED_BYTE, GL_UNSIGNED_INT and GL_UNSIGNED_INT_24_8, so I conclude that PBOs don't help for reading back depth values.
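For reference, here is the usual asynchronous readback pattern the PBO comments above refer to (a sketch only; names like pbo, frame, width and height are illustrative, and it assumes GL_ARB_pixel_buffer_object). The glReadPixels call returns immediately and the data is mapped a frame later, which hides the transfer itself but not a forced MSAA resolve or format conversion:
// Setup: two pixel-pack buffers large enough for one depth readback each.
GLuint pbo[2];
glGenBuffers(2, pbo);
for (int i = 0; i < 2; ++i) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[i]);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * sizeof(GLfloat),
                 NULL, GL_STREAM_READ);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// Each frame: start a read into one PBO, map the PBO filled last frame.
int writeIdx = frame % 2, readIdx = (frame + 1) % 2;

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[writeIdx]);
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, 0); // returns without waiting

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo[readIdx]);
GLfloat* depth = (GLfloat*)glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (depth) {
    // ... use last frame's depth data here ...
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);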
