I’m seeing some strange results when trying to time some WebGL using the disjoint timer query extension.
I have written some simple WebGL code to reproduce the issue we are seeing here: https://jsfiddle.net/d79q3mag/2/.
const texImage2DQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, texImage2DQuery);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, 512, 512, 0, gl.RGBA, gl.FLOAT, buffer);
gl.endQuery(ext.TIME_ELAPSED_EXT);
tex2dQuerys.push(texImage2DQuery);
const drawQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, drawQuery);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
gl.endQuery(ext.TIME_ELAPSED_EXT);
drawQuerys.push(drawQuery);
tryGetNextQueryResult(tex2dQuerys, tex2dQueryResults);
tryGetNextQueryResult(drawQuerys, drawQueryResults);
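(tryGetNextQueryResult isn't shown here; see the fiddle for the exact version, but it's essentially the usual polling helper, roughly along these lines:)
function tryGetNextQueryResult(querys, results) {
    if (querys.length === 0) {
        return;
    }
    // Only the oldest pending query is checked each frame.
    const query = querys[0];
    const available = gl.getQueryParameter(query, gl.QUERY_RESULT_AVAILABLE);
    const disjoint = gl.getParameter(ext.GPU_DISJOINT_EXT);
    if (available && !disjoint) {
        // QUERY_RESULT is in nanoseconds; convert to milliseconds.
        const nanoseconds = gl.getQueryParameter(query, gl.QUERY_RESULT);
        results.push(nanoseconds / 1e6);
    }
    if (available || disjoint) {
        gl.deleteQuery(query);
        querys.shift();
    }
}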
The render function uses the timer extension to individually time the texImage2D call and the drawArrays call. If I graph the results of the draw call, I see some pretty large spikes (some as high as 12 ms, but the majority in the 2 ms to 4 ms range).
However, if I increase the framerate from 30 fps to 60 fps, the results improve (largest spike 1.8 ms, most spikes between 0.4 ms and 1 ms).
I have also noticed that if I don't time the texImage2D call (https://jsfiddle.net/q01aejnv/2/), the spikes in the drawArrays timings also disappear at 30 fps (times between 0.2 ms and 1.6 ms).
I’m using a Quadro P4000 with Chrome 81.
In the Nvidia control panel, Low Latency Mode is set to Ultra and the power management mode is set to Prefer maximum performance.
These results were gathered using D3D11 as the ANGLE graphics backend in Chrome.
There seem to be two confusing things here. First, a higher framerate seems to improve the draw times. Second, timing the texImage2D call seems to affect the draw times.
When timing things you probably need to call gl.flush after gl.endQuery.
Explanation: WebGL functions add commands to a command buffer. Some other process reads that buffer, but there is overhead in telling that other process "hey! I added some commands for you!". So, in general, WebGL commands don't always do the "Hey!" part; they just insert the commands. WebGL will do an auto-flush (a "Hey! Execute this!") at various points automatically, but when timing you may need to add a gl.flush yourself.
Note: like I said above, there is overhead to flushing. In a normal program it's rarely important to flush manually.
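Applied to the draw timing from the question, that would look something like this (a sketch reusing the names from the question):
const drawQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, drawQuery);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
gl.endQuery(ext.TIME_ELAPSED_EXT);
// Tell the GPU process to start executing the queued commands now
// instead of waiting for an implicit flush later in the frame.
gl.flush();
drawQuerys.push(drawQuery);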
I need to draw 53000000 observations from a standard normal distribution. My current code takes a long time to run in Julia (in fact, it's been running for the past twenty minutes) and I'm wondering if there's anything I can do to speed it up. Here's what I tried:
using Distributions
d = Normal()
shock = rand(d, 1, 53000000)
The code works instantaneously when I execute it in the REPL (I am working in Juno/Atom), but lags at this point (drawing from the standard normal) when I step through it in the debugger. So I think the debugger may be the real culprit here.
It may be that the 1/2 gig of memory used by the allocation of the variable shock is sometimes causing swapping when the debugger is loaded.
To check, try running this in the debugger:
using Distributions, Base.Sys
println("Free memory is $(Int(Sys.free_memory()))")
d = Normal()
shock = rand(d, 1, 53000000)
println("shock uses $(sizeof(shock)) bytes.")
println("Free memory is $(Int(Sys.free_memory()))")
Are you close to running out of memory?
We have been porting some of our CPU pipeline to Metal to speed up some of the slowest parts, with success. However, since we are only porting parts of it, we are transferring data back and forth to the GPU, and I want to know how much time this actually takes.
The frame capture in Xcode informs me that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same command buffer).
Using Instruments I see some quite different numbers:
The entire set of operations takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed by the CPU being stalled for the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area); the last 800 ms is then spent in MTLTexture getBytes, which maxes out that CPU thread.
Using the Metal instruments I get a few more measurements: 46 ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I don't see how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: Xcode 6 - OpenGL Driver Monitor Parameters
This makes me think it could be the part that copies the memory off the GPU, but then there would be no reason for getBytes to take that long. And I can't see where the 149.5 ms from Xcode fits into the data from Instruments.
Questions
When exactly does it transfer the data? If this cannot be inferred from the measurements I did, how do I acquire those?
Does the GPU code actually take only 149.5 ms to execute, or is Xcode lying to me? If it isn't lying, then where is the remaining 660 − 149.5 ms being used?
I'm experimenting with writing a simplistic, single-AU, play-through-based, (almost) no-latency tracking phase vocoder prototype in C. It's a standalone program. I want to find out how much processing load a single render callback can safely bear, so I prefer to keep off async DSP.
My concept is to have only one pre-determined value, the window step (also called hop size or decimation factor in different literature sources). This number would equal inNumberFrames, which somehow depends on the device sampling rate (and what else?). All other parameters, such as window size and FFT size, would be set in relation to the window step. This seems the simplest method for keeping everything inside one callback.
Is there a safe, machine-independent method to guess or query what inNumberFrames will be before actual rendering starts, i.e. before calling AudioOutputUnitStart()?
The phase vocoder algorithm is mostly standard and very simple, using vDSP functions for the FFT plus custom phase integration, and I have no problems with it.
Additional debugging info
This code is monitoring timings within the input callback:
static Float64 prev_stime; // previous sample time
static UInt64  prev_htime; // previous host time

printf("inBus: %u\tframes: %u\tHtime: %llu\tStime: %7.2lf\n",
       (unsigned int)inBusNumber,
       (unsigned int)inNumberFrames,
       inTimeStamp->mHostTime - prev_htime,
       inTimeStamp->mSampleTime - prev_stime);

prev_htime = inTimeStamp->mHostTime;
prev_stime = inTimeStamp->mSampleTime;
Curiously enough, the difference in inTimeStamp->mSampleTime between callbacks actually shows the number of rendered frames (the name of the argument seems somewhat misleading). This number is always 512, no matter whether another sampling rate has been set through AudioMIDISetup.app at runtime, as if the value had been programmatically hard-coded. On the one hand, the
inTimeStamp->mHostTime - prev_htime
interval changes dynamically with the selected sampling rate in a mathematically clear way. As long as the sampling rate is a multiple of 44100 Hz, actual rendering goes on. On the other hand, multiples of 48 kHz produce the rendering error -10863 (= kAudioUnitErr_CannotDoInCurrentContext). I must have missed a very important point.
The number of frames is usually the sample rate times the buffer duration. There is an Audio Unit API to request a sample rate and a preferred buffer duration (such as 44100 Hz and 5.8 ms, resulting in 256 frames), but not all hardware on all OS versions honors all requested buffer durations or sample rates.
Assuming audioUnit is an input audio unit:
UInt32 inNumberFrames = 0;
UInt32 propSize = sizeof(UInt32);

// Query the I/O buffer size (in frames) the device is currently configured to use.
AudioUnitGetProperty(audioUnit,
                     kAudioDevicePropertyBufferFrameSize,
                     kAudioUnitScope_Global,
                     0,
                     &inNumberFrames,
                     &propSize);
This number would equal inNumberFrames, which somehow depends on the device sampling rate (and what else?)
It depends on what you attempt to set it to. You can set it.
// attempt to set duration
NSTimeInterval _preferredDuration = ...
NSError* err;
[[AVAudioSession sharedInstance]setPreferredIOBufferDuration:_preferredDuration error:&err];
// now get the actual duration it uses
NSTimeInterval _actualBufferDuration;
_actualBufferDuration = [[AVAudioSession sharedInstance] IOBufferDuration];
It would use a value roughly around the preferred value you set. The actual value used is a time interval based on a power of 2 and the current sample rate.
If you are looking for consistency across devices, choose a value around 10 ms. The worst-performing reasonable modern device is the iOS iPod touch 16 GB without the rear-facing camera; even that device can do roughly 10 ms callbacks with no problem. On some devices you can set the duration quite low and get very fast callbacks, but often it will crackle because the processing is not finished before the next callback happens.
I am building an application where I want to replay the movement of several users (up to 20).
Each user has a list of X,Y positions (ranging from 20 to 400 positions). A replay ranges from 1 to 10 minutes.
The replay is drawn at 8 FPS, which is all I require. At each frame I delete the layer showing the user's movement, and redraw everything up to the next point in time.
This application uses a lot of memory, and if I re-run a replay, the memory consumption keeps increasing (up to 8 GB). I have tried using the Profiler in Google Chrome (version 27), and it seems there is a build-up of layers in memory, even though I constantly remove() the old layers.
The following code shows a quick mockup of what the application does.
function draw()
{
    stage.removeChildren();

    var userLayer = new Kinetic.Layer();
    /*
      iterate all data and create lines to signify the movement of a user,
      and add it to userLayer
    */
    stage.add(userLayer);
}

setInterval(draw, 125); // 8 FPS
My question is: do stage.removeChildren() and Kinetic.Layer().remove() not remove the layers from memory? Or do I need to handle this in an entirely different manner?
Yes. "Remove" is removing from parent container. But object still exists. You have to use "destroy" instead.
I'm currently trying to do some GPGPU image processing on a mobile device (Nokia N9 with OMAP 3630/PowerVR SGX 530) with OpenGL ES 2.0. Basically my application's pipeline uploads a color image to video memory, converts it to grayscale, computes an integral image and extracts some features with the help of several fragment shaders.
The output is correct, however the runtime of the program is somewhat confusing. When I push the same image through my pipeline 3+ times, the timings are something like this (after the 3rd time the timings stay the same):
RGB-to-gray conversion: 7.415769 ms
integral image computation: 818.450928 ms
feature extraction: 285.308838 ms
RGB-to-gray conversion: 1471.252441 ms
integral image computation: 825.012207 ms
feature extraction: 1.586914 ms
RGB-to-gray conversion: 326.080353 ms
integral image computation: 2260.498047 ms
feature extraction: 2.746582 ms
If I exclude the feature extraction, the timings for the integral image computation change to something reasonable:
RGB-to-gray conversion: 7.354737 ms
integral image computation: 814.392090 ms
RGB-to-gray conversion: 318.084717 ms
integral image computation: 812.133789 ms
RGB-to-gray conversion: 318.145752 ms
integral image computation: 812.103271 ms
If I additionally exclude the integral image computation from the pipeline, this happens (also reasonable):
RGB-to-gray conversion: 7.751465 ms
RGB-to-gray conversion: 9.216308 ms
RGB-to-gray conversion: 8.514404 ms
The timings I would expect are more like:
RGB-to-gray conversion: ~8 ms
integral image computation: ~800 ms
feature extraction: ~250 ms
Basically, the timings are differing from my expectations in two points:
The rgb2gray conversion takes 300 ms instead of 8 ms when I extend the pipeline
The integral image computation takes 2200 ms instead of 800 ms when I extend the pipeline further
I suspect a shader switch to be the cause of the performance drop in 1). But can this really have so much of an influence? Especially considering that the feature extraction step consists of multiple passes with different fragment shaders and FBO switches, yet is still as fast as expected.
Particularly odd is the performance drop 2) during the integral image computation, because it's a multipass operation using only one shader and ping-pong render targets. If I measure the performance of glDraw*() for each pass, the drop happens only once among all the passes, and always at the same pass (nothing special happens in that pass, though).
I also suspected memory constraints to be the cause, since I'm using quite a few textures/FBOs for my output data, but altogether I'm occupying ~6 MB of video memory, which really isn't that much.
I've tried glDrawElements(), glDrawArrays(), and glDrawArrays() with VBOs, with the same outcome every time.
All timings have been captured with:
glFinish();
timer.start();
render();
glFinish();
timer.stop();
If I leave out the calls to glFinish(), the timings are the same, though.
Does anyone have an idea what I could be doing wrong? I'm not too savvy with OpenGL, so maybe someone can point me in a direction or toward something I should look out for. I know this is hard to answer without any code samples, which is why I'm asking for rather general suggestions. If you need more info on what I'm doing precisely, I'll be glad to provide some code or pseudo code. I just didn't want to bloat this question too much...
Edit
I think I found the reason for the performance drops: there seems to be some kind of waiting time between two shaders, where the OpenGL pipeline waits for a previous fragment shader to finish before it hands the output to the next fragment shader. I experimented a bit with the rgb2gray conversion shader and could isolate two cases:
1.) The second rendering with the rgb2gray shader depends on the output of the first rendering with it:
|inImg| -> (rgb2gray) -> |outImg1| -> (rgb2gray) -> |outImg2|
2.) The second rendering does not depend on it:
|inImg| -> (rgb2gray) -> |outImg1|
|inImg| -> (rgb2gray) -> |outImg2|
It is of course obvious that variant 2.) will most likely be faster than 1.). However, I don't understand why the pipeline completes with a reasonable runtime the first time it is executed, but has those strange delays afterwards.
Also I think that the runtime measurement of the last pipeline step is always inaccurate, so I assume ~280 ms to be a more correct measurement of the feature extraction step (not ~3 ms).
I think the problem might be with the measurement method. Timing an individual GL command or even a render is very difficult because the driver will be trying to keep all stages of the GPU's pipeline busy by running different parts of multiple renders in parallel. For this reason the driver is probably ignoring glFinish and will only wait for the hardware to finish if it must (e.g. glReadPixels on a render target).
Individual renders might appear to complete very quickly if the driver is just adding them to the end of a queue, but very slowly if it needs to wait for space in the queue and has to wait for an earlier render to finish.
A better method would be to run a large number of frames (e.g. 1000) and measure the total time for all of them.