I am trying to measure the performance of glTexSubImage2D().
I need to periodically update a 1920x1080 texture before rendering.
The strange thing is that glTexSubImage2D() sometimes takes around 20 ms, but sometimes it takes up to 190 ms.
A fragment of my measurement log:
22, 94, 21, 94, 22, 93, 22, 94, 36, 24, 98, 21, 94, 108, 121, 30
The values above are the milliseconds consumed by glTexSubImage2D() when the full RGBA texture is updated.
It is clear that I cannot use this for real-time video rendering.
I run my experiments on an embedded OpenGL ES 2 ROCK64 ARM board with the Mali-450 GPU enabled.
On the Raspberry Pi 3 the OpenGL ES 2 stack has its own quirks, but glTexSubImage2D is not very fast there either. So the question is: why is it so slow? And is there a different, faster way to update the texture?
It is typical for OpenGL to stall the CPU when the frame currently being rendered uses the texture you are trying to update.
A typical solution is to use a pair of textures. While the GPU is busy rendering from texture1, you update texture2 with the data for the next frame. When the frame is done, you swap the textures, so the GPU now reads from texture2 while you update texture1 in the meantime.
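A minimal sketch of that ping-pong scheme (assuming an OpenGL ES 2 context is already set up; tex[], renderFrame() and draw_with_texture() are placeholder names, not part of the original question):

GLuint tex[2];       // two 1920x1080 RGBA textures, created and allocated elsewhere
int uploadIdx = 0;   // texture we fill this frame
int drawIdx   = 1;   // texture the GPU samples this frame

void renderFrame(const void* pixels)
{
    // Upload the next video frame into the texture the GPU is NOT using.
    glBindTexture(GL_TEXTURE_2D, tex[uploadIdx]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1920, 1080,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    // Draw with the texture that was filled on the previous frame.
    glBindTexture(GL_TEXTURE_2D, tex[drawIdx]);
    draw_with_texture(tex[drawIdx]);   // placeholder for the actual draw call

    // Swap roles for the next frame.
    int tmp = uploadIdx; uploadIdx = drawIdx; drawIdx = tmp;
}

This only helps if the driver actually lets the upload and the draw overlap; on some embedded drivers you may additionally need to re-specify the texture to avoid the stall.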
Related
I'm using StretchBlt to draw a resized real-time video.
::SetStretchBltMode(hDC, HALFTONE);
::StretchBlt(hDC, 0, 0, 1225, 689, hSrcDC, 0, 0, 1364, 768, SRCCOPY); // hSrcDC is the source device context
However, the StretchBlt API is too slow. It takes about 100 ms on my computer each time StretchBlt is executed. Is there another API or any way to improve the speed?
Yes, by using HW-accelerated video processing:
Read more on IDirectXVideoProcessor::VideoProcessBlt
Unfortunately, this is a wide topic but you can read online and find samples on how to use it.
I want to resize a bunch of images down really tiny so that I can perform some image analysis on them. I want them all to contain the same number of pixels for my vector comparisons. I chose "120" because it's highly composite. I could resize every image to 12x10, but then there may be more stretching than necessary for images that do not have a 1.2 aspect ratio.
How can I choose a new width and height that most closely matches the original aspect ratio?
For reference, the divisors of 120 are: {1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 24, 30, 40, 60, 120}, so valid sizes would be 12x10, 10x12, 8x15, 15x8, 6x20, 20x6 and so forth.
Edit: 144 might have been a better choice, as it allows for square images and the popular 16:9 ratio.
The way I'd implement something like this is to store the candidate aspect ratios (1:144, 2:72, 3:48, 4:36, etc.) in a sorted array. Then, for each incoming image, calculate its aspect ratio and binary-search for the nearest stored ratio.
Even better, store the log of the aspect ratios, and do the binary search using the log of the image's aspect ratio.
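A rough sketch of that lookup in C++, using the 144-pixel budget from the edit above (the table construction, the 1920x1080 test image and the nearest-neighbour selection details are illustrative, not prescribed by the answer):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <iterator>
#include <vector>

struct Size { int w, h; double logRatio; };

int main()
{
    // Every w x h pair whose pixel count is exactly 144, keyed by log(w/h).
    std::vector<Size> sizes;
    for (int w = 1; w <= 144; ++w)
        if (144 % w == 0)
            sizes.push_back({ w, 144 / w, std::log(double(w) / (144 / w)) });
    std::sort(sizes.begin(), sizes.end(),
              [](const Size& a, const Size& b) { return a.logRatio < b.logRatio; });

    // Incoming image, e.g. 1920x1080. Comparing logs makes 2:1 and 1:2
    // equally far from 1:1.
    double target = std::log(1920.0 / 1080.0);
    auto it = std::lower_bound(sizes.begin(), sizes.end(), target,
                               [](const Size& s, double t) { return s.logRatio < t; });
    if (it == sizes.end() ||
        (it != sizes.begin() && target - std::prev(it)->logRatio < it->logRatio - target))
        --it;

    std::printf("resize to %dx%d\n", it->w, it->h);   // 16x9 for a 16:9 image
}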
I need to implement off-screen rendering to texture on an ARM device with PowerVR SGX hardware.
Everything is done (pixel buffers and the OpenGL ES 2.0 API were used); the only unsolved problem is the very slow glReadPixels function.
I'm not an expert in OpenGL ES, so I'm asking the community: is it possible to render textures directly into user-space memory? Or maybe there is some way to get the hardware address of a texture's memory region? Some other technique (EGL extensions)?
I don't need a universal solution, just a working one for PowerVR hardware.
Update: a little more information on the 'slow glReadPixels function'. Timings for copying 512x512 RGB texture data to CPU memory:
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr) takes 210 ms,
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, &arr) takes 24 ms (GL_BGRA is not standard for glReadPixels; it's a PowerVR extension),
memcpy(&arr, &arr2, WIDTH * HEIGHT * 4) takes 5 ms
With bigger textures, the differences are bigger too.
Solved.
The way to force PowerVR hardware to render into user-allocated memory:
http://processors.wiki.ti.com/index.php/Render_to_Texture_with_OpenGL_ES#Pixmaps
An example of how to use it:
https://gforge.ti.com/gf/project/gleslayer/
After all of this I can get the rendered image in as little as 5 ms.
When you call OpenGL functions, you're queuing commands in a render queue. Those commands are executed by the GPU asynchronously. When you call glReadPixels, the CPU must wait for the GPU to finish its rendering, so the call may be stalled waiting for that draw to finish. On most hardware (at least the hardware I work on), memory is shared by the CPU and the GPU, so the readback itself should not be that slow once rendering is done.
If you can wait for the result, or defer the readback to the next frame, you might not see that delay anymore.
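As an illustration of the "defer it to the next frame" idea, here is a sketch that alternates between two render-to-texture framebuffers and reads back the one finished a frame earlier (fbo[], drawScene() and the RGBA readback are placeholders, not code from the question):

GLuint fbo[2];   // two render-to-texture framebuffer objects, created elsewhere
int cur = 0;

void frame(unsigned char* pixels, int w, int h)
{
    // Render the new frame into one FBO...
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[cur]);
    drawScene();

    // ...and read back the FBO rendered on the PREVIOUS frame. The GPU has had
    // a whole frame to finish it, so the stall is much smaller.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[1 - cur]);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    cur = 1 - cur;
}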
Frame buffer objects are what you are looking for. They are supported on OpenGL ES and on PowerVR SGX.
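For reference, a minimal render-to-texture FBO setup on OpenGL ES 2 could look like this (error handling trimmed; the 512x512 size is just the one from the question):

GLuint tex, fbo;

// Color texture that will receive the rendering.
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 512, 512, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

// Framebuffer object with that texture as its color attachment.
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, tex, 0);
if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    // handle incomplete framebuffer
}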
EDIT:
Keep in mind that GPU/CPU hardware is heavily optimized for moving data in one direction, from the CPU side to the GPU side. The path back from GPU to CPU is often much slower (it's simply not a priority to spend hardware resources on). So whatever technique you use (e.g. FBO/getTexImage), you're going to run up against this limit.
My application depends on reading depth information back from the framebuffer. I've implemented this with glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data).
However, this runs unreasonably slowly; it brings my application from a smooth 30 fps down to a laggy 3 fps. If I read back other dimensions or other data, it runs at an acceptable level.
To give an overview:
No glReadPixels -> 30 frames per second
glReadPixels(0, 0, 1, 1, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_RED, GL_FLOAT, &depth_data); -> 20 frames per second, acceptable
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_FLOAT, &depth_data); -> 3 frames per second, not acceptable
Why is the last one so slow compared to the other calls? Is there any way to remedy it?
width x height is approximately 100 x 1000, and the call gets increasingly slower as I increase the dimensions.
I've also tried pixel buffer objects, but they have no significant effect on performance; they only delay the slowness until the glMapBuffer() call.
(I've tested this on a MacBook Air with nVidia 320M graphics on OS X 10.6; strangely enough, my old MacBook with Intel GMA X3100 got ~15 fps reading the depth buffer.)
UPDATE: leaving GLUT_MULTISAMPLE out of the glutInitDisplayMode options made a world of difference, bringing the application back to a smooth 20 fps. I don't know what the option does in the first place; can anyone explain?
If your main framebuffer is MSAA-enabled (GLUT_MULTISAMPLE is present), then two actual framebuffers are created: one with MSAA and one regular.
The first one is the one you render into. It contains front and back color surfaces, plus depth and stencil. The second one only has to contain the color that is produced by resolving the corresponding MSAA surface.
However, when you try to read depth using glReadPixels, the driver is forced to resolve the MSAA-enabled depth surface too, which is probably what causes your slowdown.
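In practice that just means requesting a non-multisampled framebuffer when you need fast depth readback, e.g. (the exact flag set depends on what your application already requests):

glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH);                      // no MSAA depth resolve on readback
// instead of
// glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH | GLUT_MULTISAMPLE);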
What storage format did you choose for your depth buffer?
If it is not GLfloat, then you're asking GL to convert every single depth value in the depth buffer to float when reading it. (And the same goes for your third bullet with GL_RED: was your color buffer a float buffer?)
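For example, if the depth buffer is stored as 24-bit fixed point, reading it back as integers avoids the per-pixel conversion to float (a sketch only; whether this actually helps depends on the driver):

std::vector<GLuint> depth(width * height);
// GL_UNSIGNED_INT keeps the fixed-point depth values as integers instead of
// asking the driver to convert every value to float.
glReadPixels(0, 0, width, height, GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, depth.data());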
Whether it is GL_FLOAT or GL_UNSIGNED_BYTE, glReadPixels is still very slow. If you use a PBO to get RGB values, it is very fast.
When using a PBO to read RGB values, the CPU usage is 4%, but it increases to 50% when reading depth values. I've tried GL_FLOAT, GL_UNSIGNED_BYTE, GL_UNSIGNED_INT and GL_UNSIGNED_INT_24_8, so I conclude that PBOs are useless for reading the depth value.