Optimized GPU to CPU data transfer - performance

I'm a bit out of my depth here (best way to be, methinks), but I am poking around looking for an optimization that could reduce GPU to CPU data transfer for my application.
I have an application that performs some modifications to vertex data in the GPU. Occasionally the CPU has to read back parts of the modified vertex data and then compute some parameters which then get passed back into the GPU shader via uniforms, forming a loop.
It takes too long to transfer all the vertex data back to the CPU and then sift through it on the CPU (millions of points), so I have a "hack" in place that reduces the workload to something usable, although not optimal.
What I do:
CPU: read image
CPU: generate 1 vertex per pixel, Z based on colour information/filter etc
CPU: transfer all vertex data to GPU
GPU: transform feedback used to update GL_POINTS vertex coords in realtime based on some uniform parameters set from the CPU.
When I wish to read only a rectangular "section", I use glMapBufferRange to map the entire rows that comprise the desired rect (bad diagram alert):
This is supposed to represent the image/set of vertices in the GPU. My "hack" involves having to read all the blue and red vertices. This is because I can only specify 1 continuous range of data to read back.
Does anyone know a clever way to efficiently get at the red, without the blue? (without having to issue a series of glMapBufferRange calls)
EDIT-
The use case is that I render the image into a 3D world as GLPoints, coloured and offset in the Z by an amount based on the colour info (sized etc according to distance). Then the user can modify the vertex Z data with a mouse cursor brush. The logic behind some of the brush application code needs to know the Z's of the area under the mouse (brush circle), eg. min/max/average etc so that the CPU can control the shaders modification of data by setting a series of uniforms that feed into the shader. So for example the user can say, I want all points under the cursor to set to the average value. This could all probably be done entirely in the GPU, but the idea is that once I get the CPU-GPU "loop" (optimised as far as I can reasonably do), I can then expand out the min/max/avg stuff to do interesting things on the CPU that would be cumbersome (probably) to do entirely on the GPU.
Cheers!
Laythe

To get any data from the GPU to the CPU you need to map the GPU memory in any case, which means the OpenGL application will have to use something like mmap under the hood. I have checked the implementation of that for both x86 and ARM, and it looks like it is page-aligned, so you cannot map less than one contiguous page of GPU memory at a time. Even if you could request to map just the red areas, you would quite likely get the blue ones as well (depending on your page and pixel data sizes).
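The page-alignment point can be illustrated with a little arithmetic. This is only a sketch: the 4096-byte page size and the helper name `pages_touched` are assumptions for illustration, and a real driver may use a different page size or stage reads through its own buffers.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; the real value is platform-dependent


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return (first_page, page_count) covering bytes [offset, offset + length)."""
    if length <= 0:
        return (offset // page_size, 0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return (first, last - first + 1)


# Two 16-byte vertices that straddle a page boundary still need two pages.
print(pages_touched(4096 - 16, 32))  # (0, 2)
```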
Solution 1
Just use glReadPixels, as this allows you to select a window of the framebuffer. I assume a GPU vendor like Intel would optimize the driver, so it would map as few pages as possible, however this is not guaranteed, and in some cases you may need to map 2 pages just for 2 pixels.
Solution 2
Create a compute shader or use several glCopyBufferSubData calls to copy your region of interest into a contiguous buffer in GPU memory. If you know the height and width you want, you can then un-mangle and get a 2D buffer back on the CPU side.
Which of the above solutions works better depends on your hardware and driver implementation. If GPU->CPU is the bottleneck and GPU->GPU is fast, then the second solution may work well, however you would have to experiment.
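For the glCopyBufferSubData variant of Solution 2, each row of the rectangle needs its own read offset, write offset and size. The sketch below computes those values under some assumptions that are not in the question: vertices are stored row-major, one per pixel, with a fixed stride of 12 bytes (three floats). The helper name `rect_copy_ranges` is hypothetical.

```python
def rect_copy_ranges(image_width, x, y, w, h, stride=12):
    """Yield (read_offset, write_offset, size) in bytes for each row of the rect.

    Assumes vertices are stored row-major, one per pixel, `stride` bytes each.
    """
    row_bytes = w * stride
    for row in range(h):
        read_offset = ((y + row) * image_width + x) * stride
        write_offset = row * row_bytes
        yield (read_offset, write_offset, row_bytes)


# Each triple corresponds to the readOffset, writeOffset and size arguments
# of one glCopyBufferSubData call into the compact buffer.
ranges = list(rect_copy_ranges(image_width=1000, x=10, y=2, w=4, h=3))
```

The compact destination buffer is then w * h * stride bytes and can be mapped with a single glMapBufferRange call.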
Solution 3
As suggested in the comments, do everything on the GPU. This depends heavily on whether the work parallelizes well, but if copying the memory is too slow for you, then you don't have much other choice.

I suppose you are asking because you cannot do all the work in shaders, right?
If you render to a framebuffer object and then bind it as GL_READ_FRAMEBUFFER, you can read a block of it with glReadPixels.

Related

Opengl ES 2.0: Model Matrix vs Per Vertex Calculation

I may be asking a silly question, but I'm a bit curious about OpenGL ES 2.0 performance.
Let's say I have a drawing object that contains a Vertex Array "VA", a Buffer Array "BA", and/or a Model Matrix "MM", and I want to do at least one Translation and one Rotation per frame. So, what is the best alternative?
Do the operations (Rot and Trans) on VA and pass to BA.
Do the operations (Rot and Trans) directly on BA.
Do the operations on MM and pass it to the OpenGL Vertex Shader.
My concern is about performance, the processing/memory ratio. I think that the 3rd option may be the best because of the GPU, but also the most expensive in terms of memory because every object would have to have a MM, right?
Another solution that I thought of was to pass the translation and rotation parameters to the shaders and assemble the MM in the shader.
How is this best done?
It is far from a silly question but unfortunately it all depends on the case. Generally even using the vertex buffers on the GPU might not be the best idea if the vertex data is constantly changing but I guess this is not the case you are having.
So the two main differences in what you are thinking would be:
Modify each of the vertex in the CPU and then send the vertex data to the GPU.
Leaving the data on the GPU as it is and change them in the vertex shader by using a matrix.
So the first option is actually good if the vertex data are changing beyond what you can represent with a matrix or any other type of analytically expressed vertex transformation, for instance if you kept generating random positions on the CPU. In such cases there is actually little sense in even using a vertex buffer, since you will need to keep streaming the vertex data every frame anyway.
The second one is great in cases where the base vertex data are relatively static (not changing too much on every frame). You push the vertex data to the GPU once (or once every now and then) and then use the vertex shader to transform the vertex data for you. The vertex shader on the GPU is very effective at doing so and will be much faster than applying the same algorithm on the CPU.
So about your questions:
The third option would most likely be the best if you have a significant amount of vertex data, and I wouldn't say it is expensive in terms of memory: a matrix consists of 16 floats, which is relatively small, since even 6 3D vertex positions take more memory than that. You should not worry about that at all. If anything you should worry about how much data you stream to the GPU, which again is the least with this option.
To pass a translation and rotation to the vertex shader and then compose the matrix for every vertex is probably not the best idea. You gain a little in traffic to the GPU by sending 4+3 floats instead of 16 floats, but to begin with you send them in two chunks, which can produce overhead. On top of that you consume more memory rather than less, since you need to create the matrix in the shader anyway. And if you do that you will be computing a new matrix on every vertex shader invocation, which means for each and every vertex.
Now about these matrices and the memory: it is hard to say whether they will have any influence on memory at all. The stack size is usually fixed or at least rounded, so adding a matrix to the shader will most likely make no difference in memory consumption.
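The advice to build the model matrix once on the CPU and pass the 16 floats as a single uniform can be sketched as follows. This is an illustration under assumptions that are not in the question: the rotation is about the Z axis only, and the matrix is laid out column-major, the layout glUniformMatrix4fv expects when transpose is GL_FALSE. The helper names are hypothetical.

```python
import math


def model_matrix(tx, ty, tz, angle_rad):
    """Return a column-major 4x4 matrix: rotate about Z by angle_rad, then translate."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        c,   s,   0.0, 0.0,   # column 0
        -s,  c,   0.0, 0.0,   # column 1
        0.0, 0.0, 1.0, 0.0,   # column 2
        tx,  ty,  tz,  1.0,   # column 3 (translation)
    ]


def transform_point(m, x, y, z):
    """Apply the column-major matrix m to the point (x, y, z, 1)."""
    return tuple(m[i] * x + m[4 + i] * y + m[8 + i] * z + m[12 + i] for i in range(3))
```

The matrix is computed once per object per frame; the per-vertex multiplication that `transform_point` imitates is the part the vertex shader takes over.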
When it comes to openGL and performance you primarily need to watch for:
Memory consumption. This is mostly taken up by textures: a 1024x1024 RGBA texture takes about 4 MB, which equals a million floats or about 350k vertices each containing a 3D position vector, so something like a matrix really has little effect.
Data stream. This is how much data you need to pass to the GPU on every frame for processing. This should be reduced as much as possible but again sending up to a few MB should not be a problem at all.
Overall efficiency in the shader
Number of draw calls. If possible try to pack as much similar data as possible to reduce the draw calls.
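The memory figures in the list above can be checked with a few lines of arithmetic. The sketch assumes 8 bits per texture channel, 32-bit floats and no mipmaps.

```python
BYTES_PER_FLOAT = 4

texture_bytes = 1024 * 1024 * 4                        # RGBA, 1 byte per channel
floats_equivalent = texture_bytes // BYTES_PER_FLOAT    # how many floats fit
vec3_vertices = texture_bytes // (3 * BYTES_PER_FLOAT)  # how many 3D positions fit
matrix_bytes = 16 * BYTES_PER_FLOAT                     # one 4x4 matrix

print(texture_bytes, floats_equivalent, vec3_vertices, matrix_bytes)
# 4194304 1048576 349525 64
```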

Is it better to use a single texture or multiple textures for a YUV image

This question is for OpenGL ES 2.0 (on Android) but may be more general to OpenGL.
Ultimately all performance questions are implementation-dependent, but if anyone can answer this question in general or based on their experience that would be helpful. I'm writing some test code as well.
I have a YUV (12bpp) image I'm loading into a texture and color-converting in my fragment shader. Everything works fine but I'd like to see where I can improve performance (in terms of frames per second).
Currently I'm actually loading three textures for each image - one for the Y component (of type GL_LUMINANCE), one for the U component (of type GL_LUMINANCE and of course 1/4 the size of the Y component), and one for the V component (of type GL_LUMINANCE and of course 1/4 the size of the Y component).
Assuming I can get the YUV pixels in any arrangement (e.g. the U and V in separate planes or interspersed), would it be better to consolidate the three textures into only two or only one? Obviously it's the same number of bytes to push to the GPU no matter how you do it, but maybe with fewer textures there would be less overhead. At the very least, it would use fewer texture units. My ideas:
If the U and V pixels were interspersed with each other, I could load them in a single texture of type GL_LUMINANCE_ALPHA which has two components.
I could load the entire YUV image as a single texture (of type GL_LUMINANCE but 3/2 the size of the image) and then in the fragment shader I could call texture2D() three times on the same texture, doing a bit of arithmetic to figure out the correct texture co-ordinates to pass to texture2D for the Y, U and V components.
I would combine the data into as few textures as possible. Fewer textures is usually a better option for a few reasons.
Fewer state changes to setup the draw call.
The fewer texture fetches in a fragment shader the better.
Less upload time.
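If the U and V planes arrive separately, the two-texture idea from the question (U and V in one GL_LUMINANCE_ALPHA texture) needs them interleaved on the CPU first. A minimal sketch, with the helper name `interleave_uv` being hypothetical; in the shader, U would then be read from a luminance channel (for example .r) and V from .a.

```python
def interleave_uv(u_plane, v_plane):
    """Interleave two equally sized byte planes as U0 V0 U1 V1 ..."""
    if len(u_plane) != len(v_plane):
        raise ValueError("U and V planes must have the same size")
    out = bytearray(2 * len(u_plane))
    out[0::2] = u_plane   # even bytes: U (luminance channel)
    out[1::2] = v_plane   # odd bytes: V (alpha channel)
    return bytes(out)
```

If the source already delivers interleaved chroma, this step is unnecessary and the data can be uploaded as is, keeping in mind which component comes first.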
Sources:
I understand some of these are focused on more specific hardware, but the principles apply to most Mobile graphics architectures.
Best Practices for Working with Texture Data
Optimize OpenGL for Tegra
Optimizing performance of a heavy fragment shader
"Binding to a texture takes time for OpenGL ES to process. Apps that reduce the number of changes they make to OpenGL ES state perform better. "
"In my experience mobile GPU performance is roughly proportional to the number of texture2D calls." "There are two texture loads, so the minimum cycle count for the texture sub-unit is two." (Tegra has a texture unit which has to run a cycle for each texture read)
"making calls to the glTexSubImage and glCopyTexSubImage functions particularly expensive" - upload operations must stall the pipeline until textures are uploaded. It is faster to batch these into a single upload than block a bunch of separate times.

OpenGL Optimization - Duplicate Vertex Stream or Call glDrawElements Repeatedly?

This is for an OpenGL ES 2.0 game on Android, though I suspect the right answer is generic to any opengl situation.
TL;DR - is it better to send N data to the gpu once and then make K draw calls with it; or send K*N data to the gpu once, and make 1 draw call?
More Details I'm wondering about best practices for my situation. I have a dynamic mesh whose vertices I recompute every frame - think of it as a water surface - and I need to project these vertices onto K different quads in my game. (In each case the projection is slightly different; sparing details, you could imagine them as K different mirrors surrounding the mesh.) K is in the order of 10-25; I'm still figuring it out.
I can think of two broad options:
Bind the mesh as is, and call draw K different times, either
changing a uniform for shaders or messing with the fixed function
state to render to the correct quad in place (on the screen) or to different
segments of a texture (which I can later use when rendering the quads to achieve
the same effect).
Duplicate all the vertices in the mesh K times, essentially making a
single vertex stream with K meshes in it, and add an attribute (or
few) indicating which quad each mesh clone is supposed to project
onto (and how to get there), and use vertex shaders to project. I
would make one call to draw, but send K times as much data.
The Question: of those two options, which is generally better performance wise?
(Additionally: is there a better way to do this?
I had considered a third option, where I rendered the mesh details to a texture, and created my K-clone geometry as a sort of dummy stream, which I could bind once and for all, that looked up in a vertex shader into the texture for each vertex to find out what vertex it really represented; but I've been told that texture support in vertex shaders is poor or prohibited in OpenGL ES 2.0 and would prefer to avoid that route.)
There is no perfect answer to this question, though I would suggest you think about the nature of real-time computer graphics and the OpenGL pipeline. Although "the GL" is required to produce results that are consistent with in-order execution, the reality is that GPUs are highly parallel beasts. They employ lots of tricks that work best if you actually have many unrelated tasks going on at the same time (some even split the whole pipeline up into discrete tiles). GDDR memory, for instance is really high latency, so for efficiency GPUs need to be able to schedule other jobs to keep the stream processors (shader units) busy while memory is fetched for a job that is just starting.
If you are recomputing parts of your mesh each frame, then you will almost certainly want to favor more draw calls over massive CPU->GPU data transfers every frame. Saturating the bus with unnecessary data transfers plagues even PCI Express hardware (it is far slower than the overhead that several additional draw calls would ever add), it can only get worse on embedded OpenGL ES systems. Having said that, there is no reason you could not simply do glBufferSubData (...) to stream in only the affected portions of your mesh and continue to draw the entire mesh in a single draw call.
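Streaming only the affected portions with glBufferSubData requires knowing which byte ranges changed. The sketch below merges a set of changed vertex indices into (offset, size) pairs; the helper name `dirty_byte_ranges` and the `max_gap` parameter are assumptions for illustration, not something from the question.

```python
def dirty_byte_ranges(changed_indices, stride, max_gap=0):
    """Merge changed vertex indices into (offset, size) byte ranges.

    Runs separated by at most `max_gap` untouched vertices are merged,
    trading a few redundant bytes for fewer upload calls.
    """
    ranges = []
    for idx in sorted(set(changed_indices)):
        if ranges and idx - ranges[-1][1] <= max_gap + 1:
            ranges[-1][1] = idx            # extend the current run
        else:
            ranges.append([idx, idx])      # start a new run
    return [(first * stride, (last - first + 1) * stride) for first, last in ranges]
```

Each resulting pair would supply the offset and size arguments of one glBufferSubData call. Whether a few large uploads or many small ones is faster depends on the driver, so `max_gap` is something to tune by profiling.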
You might get better cache coherency if you split (or partition the data within) the buffer and/or draw calls up, depending on your actual use case scenario. The only way to decisively tell which is going to work better in your case is to profile your software on your target hardware. But all of this fails to look at the bigger picture, which is: "Why am I doing this on the CPU?!"
It sounds like what you really want is simply vertex instancing. If you can re-work your algorithm to work completely in vertex shaders by passing instance IDs you should see a massive improvement over all of the solutions I have seen you propose so far (true instancing is actually somewhere between what you described in solutions 1 and 2) :)
The actual concept of instancing is very simple and will give you benefits whether your particular version of the OpenGL API supports it at the API level or not (you can always implement it manually with vertex attributes and extra vertex buffer data). The thing is, you would not have to duplicate your data at all if you implement instancing correctly. The extra data necessary to identify each individual vertex is static, and you can always change a shader uniform and make an additional draw call (this is probably what you will have to do with OpenGL ES 2.0, since it does not offer glDrawElementsInstanced) without touching any vertex data.
You certainly will not have to duplicate your vertices K*N times, your buffer space complexity would be more like O (K + K*M), where M is the number of new components you had to add to uniquely identify each vertex so that you could calculate everything on the GPU. For "instance," you might need to number each of the vertices in your quad 1-4 and process the vertex differently in your shader depending on which vertex you're processing. In this case, the M coefficient is 1 and it does not change no matter how many instances of your quad you need to dynamically calculate each frame; N would determine the number of draw calls in OpenGL ES 2.0, not the size of your data. None of this additional storage space would be necessary in OpenGL ES 2.0 if it supported gl_VertexID :(
Instancing is the best way to make effective use of the highly-parallel GPU and avoid CPU/GPU synchronization and slow bus transfers. Even though OpenGL ES 2.0 does not support instancing in the API sense, multiple draw calls using the same vertex buffer where the only thing you change between calls are a couple of shader uniforms is often preferable to computing your vertices on the CPU and uploading new vertex data every frame or having your vertex buffer's size depend directly on the number of instances you intend to draw (yuck). You'll have to try it out and see what your hardware likes.
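To put rough numbers on the trade-off, the sketch below compares how many vertex bytes each option from the question uploads per frame. It assumes the mesh is re-uploaded in full every frame and that option 2 interleaves a 4-byte tag with every duplicated vertex; both are assumptions for illustration, as is the helper name.

```python
def bytes_uploaded_per_frame(n_vertices, k_instances, stride, tag_bytes=4):
    """Compare per-frame vertex uploads for the two options in the question."""
    shared_mesh = n_vertices * stride                              # option 1: K draw calls
    duplicated = k_instances * n_vertices * (stride + tag_bytes)   # option 2: 1 draw call
    return shared_mesh, duplicated


# Hypothetical mesh: 10,000 vertices of 12 bytes each, projected onto 20 quads.
print(bytes_uploaded_per_frame(10_000, 20, 12))  # (120000, 3200000)
```

With the shared mesh, only the uniforms change between the K draw calls, so the upload cost stays independent of K.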
Instancing would be what you are looking for, but unfortunately it is not available in OpenGL ES 2.0. I would be in favor of sending all the vertices to the GPU and making one draw call, if all your assets can fit into GPU memory. In my experience, reducing draw calls from 100+ to 1 took performance from 15 fps to 60 fps.

How to efficiently display a large number of moving points

I have a large array of points, which updates dynamically. For the most part, only certain (relatively small) parts of the array get updated. The goal of my program is to build and display a picture using these points.
If I build a picture directly from the points it would be 8192 x 8192 pixels in size. I believe an optimization would be to reduce the array in size. My application has two screen areas (the one is a magnification/zooming in of the other). Additionally I will need to pan this picture in either of screen areas.
My approach for optimization is as follows.
Take a source array of points and reduce it with scaling factor for the first screen area
Same for the second area, but with larger scaling factor
Render these two arrays into two FBOs
Use the FBOs as textures (to provide the ability to pan the picture)
When updating a picture I re-render only changed area.
Suggest ways to speed this up as my current implementation runs extremely slow.
You will hardly be able to optimize this a lot if you don't have the hardware to run it at an adequate rate. Even if you render in different threads to FBOs and then compose the result, your bottleneck is likely to remain. 67 million data points is nothing to sneeze at, even for modern GPUs.
Try not to update unnecessarily, update only what changes, render only what's updated and visible, try to minimize the size of your components, e.g. use a shorter data type if possible.
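One way to "update only what changes" is to track dirty tiles instead of individual points. A minimal sketch, assuming a 256-pixel tile size (so an 8192 x 8192 picture has 32 x 32 tiles); the helper name `dirty_tiles` is hypothetical.

```python
def dirty_tiles(changed_points, tile_size=256):
    """Return the set of (tile_x, tile_y) tiles containing changed points."""
    return {(x // tile_size, y // tile_size) for x, y in changed_points}


# Three of the four changed points fall into distinct tiles.
tiles = dirty_tiles([(0, 0), (255, 255), (256, 0), (8191, 8191)])
```

Only the tiles in the returned set would then be re-rendered into the FBO, for example by restricting drawing to each tile's rectangle, and tiles outside the visible area can be skipped.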

graphics: best performance with floating point accumulation images

I need to speed up some particle system eye candy I'm working on. The eye candy involves additive blending, accumulation, and trails and glow on the particles. At the moment I'm rendering by hand into a floating point image buffer, converting to unsigned chars at the last minute then uploading to an OpenGL texture. To simulate glow I'm rendering the same texture multiple times at different resolutions and different offsets. This is proving to be too slow, so I'm looking at changing something. The problem is, my dev hardware is an Intel GMA950, but the target machine has an Nvidia GeForce 8800, so it is difficult to profile OpenGL stuff at this stage.
I did some very unscientific profiling and found that most of the slow down is coming from dealing with the float image: scaling all the pixels by a constant to fade them out, and converting the float image to unsigned chars and uploading to the graphics hardware. So, I'm looking at the following options for optimization:
Replace floats with uint32's in a fixed point 16.16 configuration
Optimize float operations using SSE2 assembly (image buffer is a 1024*768*3 array of floats)
Use OpenGL Accumulation Buffer instead of float array
Use OpenGL floating-point FBO's instead of float array
Use OpenGL pixel/vertex shaders
Have you any experience with any of these possibilities? Any thoughts, advice? Something else I haven't thought of?
The problem is simply the sheer amount of data you have to process.
Your float buffer is 9 megabytes in size, and you touch the data more than once. Most likely your rendering loop looks somewhat like this:
Clear the buffer
Render something on it (uses reads and writes)
Convert to unsigned bytes
Upload to OpenGL
That's a lot of data that you move around, and the cache can't help you much because the image is much larger than your cache. Let's assume you touch every pixel five times. If so, you move 45 MB of data in and out of the slow main memory. 45 MB does not sound like much data, but consider that almost every memory access will be a cache miss. The CPU will spend most of the time waiting for the data to arrive.
If you want to stay on the CPU to do the rendering there's not much you can do. Some ideas:
Using SSE non-temporal loads and stores may help, but it will complicate your task quite a bit (you have to align your reads and writes).
Try breaking up your rendering into tiles, e.g. do everything on smaller rectangles (256*256 or so). The idea is that you actually get a benefit from the cache. After you've cleared your rectangle, for example, the entire rectangle will be in the cache. Rendering and converting to bytes will be a lot faster because there is no need to fetch the data from the relatively slow main memory anymore.
Last resort: Reduce the resolution of your particle effect. This will give you a good bang for the buck at the cost of visual quality.
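The tiling idea can be sketched as a loop that runs every pass on one tile before moving to the next. Python is used here only to show the loop structure; the cache benefit would show up in a compiled implementation. The single-channel image, the 0..1 value range and the function name are assumptions for illustration.

```python
def fade_and_convert_tiled(pixels, width, height, fade, tile=256):
    """Fade a flat row-major float image in place and return it as 8-bit bytes.

    Both passes run on one tile before moving to the next, so the tile's
    data is reused while it is still likely to be in cache.
    """
    out = bytearray(width * height)
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                row = y * width
                for x in range(tx, min(tx + tile, width)):
                    pixels[row + x] *= fade                  # pass 1: fade
            for y in range(ty, min(ty + tile, height)):
                row = y * width
                for x in range(tx, min(tx + tile, width)):
                    i = row + x
                    v = min(max(pixels[i], 0.0), 1.0)        # pass 2: clamp
                    out[i] = int(v * 255.0 + 0.5)            # and quantize
    return bytes(out)
```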
The best solution is to move the rendering onto the graphics card. Render-to-texture functionality is standard these days. It's a bit tricky to get it working with OpenGL because you have to decide which extension to use, but once you have it working, performance is no longer an issue.
Btw - do you really need floating point render-targets? If you get away with 3 bytes per pixel you will see a nice performance improvement.
It's best to move the rendering calculation for massive particle systems like this over to the GPU, which has hardware optimized to do exactly this job as fast as possible.
Aaron is right: represent each individual particle with a sprite. You can calculate the movement of the sprites in space (eg, accumulate their position per frame) on the CPU using SSE2, but do all the additive blending and accumulation on the GPU via OpenGL. (Drawing sprites additively is easy enough.) You can handle your trails and blur either by doing it in shaders (the "pro" way), rendering to an accumulation buffer and back, or simply generate a bunch of additional sprites on the CPU representing the trail and throw them at the rasterizer.
Try to replace the manual code with sprites: An OpenGL texture with an alpha of, say, 10%. Then draw lots of them on the screen (ten of them in the same place to get the full glow).
If by "manual" you mean that you are using the CPU to poke pixels, I think pretty much anything you can do where you draw textured polygons using OpenGL instead will represent a huge speedup.
