I used the code from How to make an OpenGL rendering context with transparent background? to create a window with a transparent background. My problem is that the frame rate is very low: I get around 20 frames/sec even when I draw a single quad (made from 2 triangles). I tried to find out why, and glFlush() alone takes around 0.047 seconds. Do you have any idea why? The same scene renders in a window without a transparent background at 6000 fps (when I remove the 60 fps limitation). It also pushes one CPU core to 100%. I'm testing on a Q9450 @ 2.66 GHz with an ATI Radeon 4800 under Win7.
I don't think you can get good performance this way. In the linked example there is the following code:
void draw(HDC pdcDest)
{
    assert(pdcDIB);
    // Copy the CPU-side DIB into the destination DC every frame.
    verify(BitBlt(pdcDest, 0, 0, w, h, pdcDIB, 0, 0, SRCCOPY));
}
BitBlt is a function executed on the CPU, whereas the OpenGL calls are executed by the GPU. So the rendered data has to crawl back from the GPU to main memory, and the bandwidth from GPU to CPU is rather limited (even more so because the data has to travel again once it is BitBlt'ed).
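To illustrate where the cost goes, here is roughly what such a per-frame path can look like. This is a sketch only: I'm assuming the frame ends up in a 32-bpp DIB section selected into pdcDIB, and showing glReadPixels as one possible way the pixels get there; the linked example may do this differently, and the names are illustrative.

// Sketch: how rendered pixels typically travel back to the CPU each frame.
// Assumes pdcDIB is a memory DC with a 32-bpp DIB section selected into it and
// dibBits points at that DIB's pixel memory (names are illustrative).
#include <windows.h>
#include <GL/gl.h>
#ifndef GL_BGRA_EXT
#define GL_BGRA_EXT 0x80E1   // from glext.h; matches the DIB's byte order
#endif

void presentFrame(HDC pdcDest, HDC pdcDIB, void* dibBits, int w, int h)
{
    // 1) GPU -> CPU: wait for rendering to finish, then copy the framebuffer
    //    into system memory. This is the expensive, bandwidth-limited step.
    glReadPixels(0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, dibBits);

    // 2) CPU -> screen: BitBlt the DIB onto the destination DC.
    BitBlt(pdcDest, 0, 0, w, h, pdcDIB, 0, 0, SRCCOPY);
}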
If you really want a transparent window with rendered content, you might want to look at Direct2D and/or Direct3D; maybe there is a way to do it without the performance penalty of moving the data around.
function render(time, scene) {
    if (useFramebuffer) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, scene.fb);
    }

    gl.viewport(0.0, 0.0, canvas.width, canvas.height);
    gl.clear(gl.COLOR_BUFFER_BIT | gl.DEPTH_BUFFER_BIT);

    gl.enable(gl.DEPTH_TEST);
    renderScene(scene);
    gl.disable(gl.DEPTH_TEST);

    if (useFramebuffer) {
        gl.bindFramebuffer(gl.FRAMEBUFFER, null);
        copyFBtoBackBuffer(scene.fb);
    }

    window.requestAnimationFrame(function(time) {
        render(time, scene);
    });
}
I'm not able to share the exact code I use, but a mockup will illustrate my point.
I'm rendering a fairly complex scene and am also doing some ray tracing in WebGL. I've noticed two very strange performance issues.
1) Inconsistent frame rate between runs.
Sometimes, when the page starts, the first ~100 frames render in 25 ms each, and then the frame time suddenly jumps to 45 ms without any user input or changes to the scene. I'm not updating any buffer or texture data during a frame, only shader uniforms, and GPU memory usage stays constant when this happens.
2) Rendering to the default framebuffer is slower than using an extra pass.
If I render to a framebuffer I created and then blit it to the HTML canvas (the default framebuffer), I get a 10% performance increase. So in the code snippet, performance is better when useFramebuffer == true, which seems very counterintuitive.
Edit 1:
Due to changes in requirements, the scene will always be rendered to a framebuffer and then copied to the canvas. This makes question 2) a non-issue.
Edit 2:
System specs of the PC this was tested on:
OS: Win 10
CPU: Intel i7-7700
GPU: Nvidia GTX 1080
RAM: 16 GB
Edit 3:
I profiled the scene using chrome://tracing. The first ~100-200 frames render in 16.6 ms each.
Then it starts dropping frames.
I'll try profiling everything with timer queries, but I'm afraid each render actually takes the same amount of time and it's the buffer swap that randomly takes twice as long.
Another thing I noticed is that this starts happening after I've been using Chrome for a while. Once the problems start, clearing the browser cache or killing the Chrome process doesn't help; only a system reboot does.
Is it possible that Chrome is throttling the GPU on a whim?
P.S.
The frame times changed because of some optimizations, but the core problem persists.
I am drawing an animation using double-buffered GDI on a window, on a system where DWM composition is enabled, and seeing clearly visible tearing onscreen. Is there a way to prevent this?
Details
The animation takes the same image and moves it right to left across the screen. The horizontal offset is derived from the time elapsed between the animation's start and end times (via timeGetTime, with 1 ms resolution), giving a fraction complete that is applied to the whole window width. The animation draws in a loop without processing application messages; it calls the (VCL library) method Repaint, which internally invalidates and then calls UpdateWindow for the window in question, calling directly into the message procedure with WM_PAINT. The VCL implementation of the paint handler uses BeginBufferedPaint. Painting is itself double-buffered.
The aim is to have as high a frame rate as possible so the animation moves smoothly across the screen. Double-buffering removes flicker and ensures a whole image or frame is onscreen at any one time; invalidating and updating directly through the message procedure avoids other message processing; and painting uses modern techniques (e.g. BeginBufferedPaint) for Aero composition. Within this, painting is done with a couple of BitBlt calls: one for the left side of the animation (what's moving offscreen) and one for the right side (what's moving onscreen).
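For reference, a buffered WM_PAINT handler along these lines typically looks something like this. It's an illustrative sketch, not the actual VCL implementation, and the names are made up.

// Sketch of the BeginBufferedPaint pattern. Requires <uxtheme.h>, linking
// Uxtheme.lib, and a prior BufferedPaintInit() call on this thread.
#include <windows.h>
#include <uxtheme.h>

void PaintBuffered(HWND hwnd)
{
    PAINTSTRUCT ps;
    HDC hdcTarget = BeginPaint(hwnd, &ps);

    RECT rc;
    GetClientRect(hwnd, &rc);

    HDC hdcBuffer = NULL;
    HPAINTBUFFER hpb = BeginBufferedPaint(hdcTarget, &rc, BPBF_TOPDOWNDIB,
                                          NULL, &hdcBuffer);
    if (hpb)
    {
        // Draw the whole frame into hdcBuffer here (e.g. the two BitBlt calls),
        // so only a complete frame is ever copied to the window's DC.
        EndBufferedPaint(hpb, TRUE);   // TRUE: copy the buffer to hdcTarget
    }

    EndPaint(hwnd, &ps);
}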
When watching the animation, there is clearly visible tearing. This occurs on Windows Vista, 7 and 8.1 on multiple systems with different graphics cards.
My approach to handling this has been to reduce the rate at which it draws, or to try to wait for VSync before painting again. This might be the wrong approach, so the answer to this question might be "Do something else completely: X". If so, great :)
(What I'd really like is a way to ask the DWM to compose / use only fully-painted frames for this specific window.)
I've tried the following approaches, none of which removes all visible tearing. So the question is: is it possible to avoid tearing when using DWM composition, and if so, how?
Approaches tried:
Getting the monitor refresh rate via GetDeviceCaps(Application.MainForm.Handle, VREFRESH); sleeping for 1 / refresh rate milliseconds. Slightly improved over painting as fast as possible, but may be wishful thinking. Perceptually slightly less smooth animation rate. (Tweaks: normal Sleep and a high-resolution spin-wait using timeGetTime.)
Using DwmSetPresentParameters to try to limit updating to the same rate at which the code draws. (Variations: lots of buffers (cBuffer = 8), no visible effect; specifying a source rate of monitor refresh rate / 1 and sleeping using the above code, the same as just trying the sleeping approach; specifying a refresh per frame of 1, 10, etc., no visible effect; changing the source frame coverage, no visible effect.)
Using DwmGetCompositionTimingInfo in a variety of ways:
While cFramesPending > 0, spin;
Get cFrame (frame composed) and spin while this number doesn't change;
Get cFrameDisplayed and spin while this doesn't change (a sketch of this follows below);
Calculating a time to sleep to by adding qpcVBlank + qpcRefreshPeriod, and then while QueryPerformanceCounter returns a time less than this, spin
All these approaches have also been varied by painting, then spinning/sleeping before painting again; or the reverse: sleeping and then painting.
Few of these have any visible effect, and what effect there is is hard to quantify and may just be the result of a lower frame rate. None prevents tearing, i.e. none makes the DWM compose the window from a "whole" copy of the contents of the window's DC.
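To make the cFrameDisplayed variant from the list above concrete, the wait loop I'm describing looks roughly like this (a sketch only; I use Sleep(0) here rather than a pure spin):

// Sketch of the "spin until cFrameDisplayed changes" approach.
// Needs <dwmapi.h> and Dwmapi.lib.
#include <windows.h>
#include <dwmapi.h>

void WaitForNextComposedFrame()
{
    DWM_TIMING_INFO info = {};
    info.cbSize = sizeof(info);
    if (FAILED(DwmGetCompositionTimingInfo(NULL, &info)))
        return;   // composition off or call not supported

    const DWM_FRAME_COUNT lastDisplayed = info.cFrameDisplayed;
    do
    {
        Sleep(0);   // yield rather than burning a whole core
        info.cbSize = sizeof(info);
        if (FAILED(DwmGetCompositionTimingInfo(NULL, &info)))
            return;
    } while (info.cFrameDisplayed == lastDisplayed);
}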
Advice appreciated :)
Since you're using BitBlt, make sure your DIBs are 4 bytes/pixel. With 3 bytes/pixel, GDI is horribly slow while DWM is running; that could be the source of your tearing. Another BitBlt issue I've run into: if your DIB is fairly large, the BitBlt call may take an unexpectedly long time. Splitting it into several smaller calls that each draw only a portion of the data might help. Both of these helped in my case, simply because BitBlt itself was running too slowly and causing video artifacts.
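In case it helps, a 4-bytes-per-pixel DIB section can be created roughly like this (a sketch; your buffer setup may differ):

// Sketch: create a 32-bpp (4 bytes/pixel), top-down DIB section for BitBlt.
// 24-bpp (3 bytes/pixel) DIBs tend to hit much slower GDI paths under DWM.
#include <windows.h>

HBITMAP Create32bppDib(HDC hdcCompatible, int width, int height, void** bits)
{
    BITMAPINFO bmi = {};
    bmi.bmiHeader.biSize        = sizeof(BITMAPINFOHEADER);
    bmi.bmiHeader.biWidth       = width;
    bmi.bmiHeader.biHeight      = -height;   // negative height = top-down rows
    bmi.bmiHeader.biPlanes      = 1;
    bmi.bmiHeader.biBitCount    = 32;        // 4 bytes per pixel (BGRX/BGRA)
    bmi.bmiHeader.biCompression = BI_RGB;
    return CreateDIBSection(hdcCompatible, &bmi, DIB_RGB_COLORS, bits, NULL, 0);
}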
I'm working on the visualizations for an interactive installation, as seen here: http://vimeo.com/78977964. But I'm running into some issues with the smoothness of the animation. While it reports a steady 30 or 60 fps, the actual image is not smooth at all; imagine a 15 fps animation with an unsteady clock. Can you guys give me some pointers on where to look when optimizing my sketch?
What I'm doing is receiving relative coordinates (0.-1. on the x and y axes) through oscP5. These go through a data handler that checks that there hasn't been input in that area for x amount of time. If all is OK, a new Wave object is created, which draws an expanding (modulated) circle at its location. As the installation had to be very flexible, all visual parameters are adjustable through a controlP5 GUI.
All of this runs on a computer with an i7 3770 @ 3.4 GHz, 8 GB RAM and two Radeon HD 7700s driving 4 to 10 Panasonic EX600 XGA projectors over VGA (simply drawing a 3072x1536 window). The CPU and GPU load is reasonable (http://imgur.com/a/usNVC), but the performance is not what we want it to be.
We tried a number of solutions, including changing the rendering mode, trying a different GPU, different drawing methods, changing process priority and exporting to an application, but nothing made a noticeable improvement. So now I'm guessing it's either Processing/Java simply not being able to run smoothly over multiple monitors, or something in my code causing this...
How I draw the waves within the Wave class (this is called from the main draw loop for every wave object):
public void draw() {
    this.diameter = map(this.frequency, lowLimitFrequency, highLimitFrequency, speedLowFreq, speedHighFreq) * (millis()-date)/5f;

    strokeWeight(map(this.frequency, lowLimitFrequency, highLimitFrequency, lineThicknessLowFreq, lineThicknessHighFreq) * map(this.diameter, 0, this.maxDiameter, 1., 0.1) * 50);
    stroke(255, 255, 255, constrain((int)map(this.diameter, 0, this.maxDiameter, 255, 0), 0, 255));

    pushMatrix();
    beginShape();
    translate(h*this.x*width, v*this.y*height);

    // This draws a circle from line segments, modulated by a sine wave.
    for (int i = 0; i < segments; i++) {
        vertex(
            (this.distortion*sin(map(i, 0, segments, 0, this.periods*TWO_PI))+1) * this.diameter * sin(i*TWO_PI/segments),
            (this.distortion*sin(map(i, 0, segments, 0, this.periods*TWO_PI))+1) * this.diameter * cos(i*TWO_PI/segments)
        );
    }
    // Repeat the first vertex to close the circle.
    vertex(
        (this.distortion*sin(map(0, 0, segments, 0, this.periods*TWO_PI))+1) * this.diameter * sin(0*TWO_PI/segments),
        (this.distortion*sin(map(0, 0, segments, 0, this.periods*TWO_PI))+1) * this.diameter * cos(0*TWO_PI/segments)
    );
    endShape();
    popMatrix();
}
I hope I've provided enough information to grasp what's going wrong!
My colleagues and I have had similar issues here running a PowerWall (6x3 monitors) from one PC using an Eyefinity setup. The short version is that, as you've discovered, there are a lot of problems running Processing sketches across multiple cards.
We've tended to work around it by using a different approach - multiple copies of the application, which each span one monitor only, render a subsection and sync themselves up. This is the approach people tend to use when driving large displays from multiple machines, but it seems to sidestep these framerate problems as well.
For Processing, there are a couple of libraries that support this: Dan Shiffman's Most Pixels Ever and the Massive Pixel Environment from the Texas Advanced Computing Center. Both have reasonable examples that should help you through the setup phase.
One proviso, though: we kept encountering crashes from JOGL when we tried this with OpenGL rendering. That was about 6 months ago, so maybe it's fixed now. Your draw loop looks like it'll be OK using Java2D, so hopefully that won't be an issue for you.
I need to implement off-screen rendering to texture on an ARM device with PowerVR SGX hardware.
Everything is done (pixel buffers and the OpenGL ES 2.0 API were used). The only unsolved problem is the very slow glReadPixels function.
I'm not an expert in OpenGL ES, so I'm asking the community: is it possible to render textures directly into user-space memory? Or maybe there is some way to get the hardware address of a texture's memory region? Some other technique (EGL extensions)?
I don't need a universal solution, just one that works on PowerVR hardware.
Update: a little more information on the 'slow glReadPixels function'. Copying 512x512 RGB texture data to CPU memory:
glReadPixels(0, 0, WIDTH, HEIGHT, GL_RGBA, GL_UNSIGNED_BYTE, &arr) takes 210 ms,
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, &arr) takes 24 ms (GL_BGRA is not standard for glReadPixels; it's a PowerVR extension),
memcpy(&arr, &arr2, WIDTH * HEIGHT * 4) takes 5 ms.
With bigger textures, the differences get bigger too.
Solved.
Here is how to force PowerVR hardware to render into user-allocated memory:
http://processors.wiki.ti.com/index.php/Render_to_Texture_with_OpenGL_ES#Pixmaps
An example of how to use it:
https://gforge.ti.com/gf/project/gleslayer/
With this in place, I can get the rendered image in as little as 5 ms.
When you call OpenGL functions, you're queuing commands in a render queue, and those commands are executed by the GPU asynchronously. When you call glReadPixels, the CPU must wait for the GPU to finish its rendering, so the call might be waiting for that drawing to complete. On most hardware (at least the hardware I work on), memory is shared by the CPU and the GPU, so the pixel read itself should not be that slow once rendering is done.
If you can wait for the result, or defer it to the next frame, you might not see that delay anymore.
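A minimal sketch of the "defer it to the next frame" idea, assuming you render into two FBOs in alternation and read back the one finished a frame ago (fbo, drawScene and the sizes are illustrative, not your actual code):

// Sketch: read back the previous frame so glReadPixels doesn't have to wait
// for the rendering that was just issued (OpenGL ES 2.0).
static GLuint fbo[2];        // two render-to-texture FBOs, created elsewhere
static int current = 0;

void renderAndReadDeferred(int width, int height, void* dst)
{
    // Render this frame into fbo[current]; the GPU processes it asynchronously.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[current]);
    drawScene();   // placeholder for the existing rendering code

    // Read back the frame rendered last time, which the GPU has long finished,
    // so the call should not stall on pending work.
    glBindFramebuffer(GL_FRAMEBUFFER, fbo[1 - current]);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, dst);

    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    current = 1 - current;
}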
Framebuffer objects are what you are looking for. They are supported in OpenGL ES and on the PowerVR SGX.
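A minimal ES 2.0 render-to-texture setup looks roughly like this (a sketch; error handling and texture parameters trimmed to the essentials):

// Sketch: OpenGL ES 2.0 framebuffer object that renders into a texture.
GLuint tex, fbo;

void createRenderTarget(int width, int height)
{
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);   // allocate storage, no data
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                           GL_TEXTURE_2D, tex, 0);
    if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
        // handle an incomplete framebuffer here
    }
    // Subsequent draw calls now render into 'tex' instead of the window surface.
}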
EDIT:
Keep in mind that GPU/CPU hardware is heavily optimized for moving data in one direction, from the CPU side to the GPU side. The path back from GPU to CPU is often much slower (it's just not a priority to spend hardware resources on). So whatever technique you use (e.g. FBO/getTexImage), you're going to run up against this limit.
My video card is a Mobile Intel 4 Series. I'm updating a texture with changing data every frame; here's my main loop:
for (;;) {
    Timer timer;
    glBindTexture(GL_TEXTURE_2D, tex);
    glBegin(GL_QUADS); ... /* draw textured quad */ ... glEnd();
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);
    swapBuffers();
    cout << timer.Elapsed();
}
Every iteration takes 120 ms. However, inserting a glFlush before glTexSubImage2D brings the iteration time down to 2 ms.
The issue is not in the pixel format. I've tried the pixel formats BGRA, RGBA and ABGR_EXT together with the pixel types UNSIGNED_BYTE, BYTE, UNSIGNED_INT_8_8_8_8 and UNSIGNED_INT_8_8_8_8_EXT. The texture's internal pixel format is RGBA.
The order of calls matters. Moving the texture upload before the quad drawing, for example, fixes the slowness.
I also tried this on a GeForce GT 420M card, and it works fast there. My real app does have performance problems on non-Intel cards that are fixed by glFlush calls, but I haven't distilled those into a test case yet.
Any ideas on how to debug this?
One issue is that glTexImage2D performs a full reinitialization of the texture object. If only the data changes and the format stays the same, use glTexSubImage2D to speed things up (just a reminder).
The other issue is that, despite its name, immediate mode, i.e. glBegin(…) … glEnd(), is not synchronous: the drawing calls return long before the GPU is done drawing. Adding a glFinish() will synchronize. But so will any call that modifies data still required by queued operations, so in your case glTexImage2D (and glTexSubImage2D) must wait for the drawing to finish.
Usually it's best to do all volatile resource uploads either at the beginning of the drawing function, or during the SwapBuffers block in a separate thread, through buffer objects. Buffer objects were introduced for that very reason: to allow asynchronous yet tight operation.
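A rough sketch of that buffer-object upload path, using a pixel unpack buffer (desktop GL; entry points loaded via e.g. GLEW; the texture, buffer and sizes are assumed to be set up elsewhere):

// Sketch: stream texture data through a pixel unpack buffer (PBO) so the
// upload can overlap with drawing instead of stalling in glTexSubImage2D.
#include <GL/glew.h>   // or any loader providing the buffer-object entry points
#include <cstring>     // memcpy

void uploadViaPbo(GLuint pbo, GLuint tex, const void* src, size_t size)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

    // Orphan the old storage so the driver need not wait for previous users.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW);
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        memcpy(dst, src, size);
        glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    }

    // With an unpack buffer bound, the last argument is an offset into the
    // PBO, so this call can return without touching 'src' again.
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, (const void*)0);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}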
I assume you're actually using that texture for one or more of your quads?
Uploading textures is one of the most expensive operations possible. Since your texture data changes every frame, the upload is unavoidable, but you should try to do it while the texture isn't in use by shaders. Remember that glBegin(GL_QUADS); ... glEnd(); doesn't actually draw quads; it requests that the GPU render the quads. Until that rendering completes, the texture is locked. Depending on the implementation, this might make the texture upload wait (a la glFlush), but it could also cause the upload to fail, in which case you've wasted megabytes of PCIe bandwidth and the driver has to retry.
It sounds like you already have a solution: upload all new textures at the beginning of the frame. So what's your question?
NOTE: Intel integrated graphics are horribly slow anyway.
When you make a draw call (glDrawElements or others), the driver simply adds the call to a command buffer and lets the GPU consume those commands when it can.
If this buffer had to be consumed entirely at glSwapBuffers, the GPU would be idle afterwards, waiting for you to send new commands.
Drivers solve this by letting the GPU lag one frame behind. This is the first reason glTexSubImage2D blocks: the driver waits until the GPU is no longer using the texture (from the previous frame) before starting the transfer, so that you never get half-updated data.
The other reason is that glTexSubImage2D is synchronous. It will also block for the whole duration of the transfer.
You can solve the first issue by keeping two textures: one for the current frame, one for the previous frame. Upload into the former, but draw with the latter.
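A sketch of that two-texture ping-pong (drawTexturedQuad and swapBuffers stand in for your existing code):

// Sketch: alternate between two textures so drawing never samples the texture
// that is currently being uploaded.
GLuint tex[2];    // both created and sized elsewhere
int write = 0;    // index of the texture being uploaded this frame

void frame(const void* data)
{
    // Upload the new data into the "write" texture.
    glBindTexture(GL_TEXTURE_2D, tex[write]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 512, 512,
                    GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data);

    // Draw with the texture that was uploaded last frame.
    glBindTexture(GL_TEXTURE_2D, tex[1 - write]);
    drawTexturedQuad();   // placeholder for the existing quad drawing

    write = 1 - write;
    swapBuffers();        // placeholder, as in the original loop
}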
You can solve the second issue by doing the transfer through a Pixel Buffer Object (a buffer bound to GL_PIXEL_UNPACK_BUFFER), which allows asynchronous transfers.
In your case, I suspect that calling glTexSubImage2D just before SwapBuffers adds an extra synchronization in the driver, whereas drawing the quad just before SwapBuffers simply appends the command to the buffer. 120 ms is probably a driver bug, though: even an Intel GMA doesn't need 120 ms to upload a 512x512 texture.