Strange performance drops with glDrawArrays()/glDrawElements() - opengl-es

I'm currently trying to do some GPGPU image processing on a mobile device (Nokia N9 with OMAP 3630/PowerVR SGX 530) with OpenGL ES 2.0. Basically my application's pipeline uploads a color image to video memory, converts it to grayscale, computes an integral image and extracts some features with the help of several fragment shaders.
The output is correct, however the runtime of the program is somewhat confusing. When I push the same image through my pipeline 3+ times, the timings are something like this (after the 3rd time the timings stay the same):
RGB-to-gray conversion: 7.415769 ms
integral image computation: 818.450928 ms
feature extraction: 285.308838 ms
RGB-to-gray conversion: 1471.252441 ms
integral image computation: 825.012207 ms
feature extraction: 1.586914 ms
RGB-to-gray conversion: 326.080353 ms
integral image computation: 2260.498047 ms
feature extraction: 2.746582 ms
If I exclude the feature extraction, the timings for the integral image computation change to something reasonable:
RGB-to-gray conversion: 7.354737 ms
integral image computation: 814.392090 ms
RGB-to-gray conversion: 318.084717 ms
integral image computation: 812.133789 ms
RGB-to-gray conversion: 318.145752 ms
integral image computation: 812.103271 ms
If I additionally exclude the integral image computation from the pipeline, this happens (also reasonable):
RGB-to-gray conversion: 7.751465 ms
RGB-to-gray conversion: 9.216308 ms
RGB-to-gray conversion: 8.514404 ms
The timings I would expect are more like:
RGB-to-gray conversion: ~8 ms
integral image computation: ~800 ms
feature extraction: ~250 ms
Basically, the timings differ from my expectations in two points:
1.) The rgb2gray conversion takes ~300 ms instead of ~8 ms when I extend the pipeline
2.) The integral image computation takes ~2200 ms instead of ~800 ms when I extend the pipeline further
I suspect a shader switch to be the cause of the performance drop for 1.). But can this really have this much of an influence? Especially when considering that the feature extraction step consists of multiple passes with different fragment shaders and FBO switches, but is still as fast as expected.
Particularly odd is the performance drop 2.) during the integral image computation, because it's a multipass operation using only one shader and ping-pong render targets. If I measure the performance of glDraw*() for each pass, the drop happens only once among all passes and always at the same pass (nothing special is happening in this pass, though).
I also suspected memory constraints to be the cause, since I'm using quite a few textures/FBOs for my output data, but altogether I'm occupying ~6 MB of video memory, which really isn't that much.
I've tried glDrawElements(), glDrawArrays() and glDrawArrays() with VBOs with the same outcome every time.
All timings have been captured with:
glFinish();
timer.start();
render();
glFinish();
timer.stop();
If I leave out the calls to glFinish(), the timings are the same, though.
Does anyone have an idea what I could be doing wrong? I'm not too savvy with OpenGL, so maybe someone can point me in a direction or to something I should look out for. I know this is hard to answer without any code samples; that's why I'm asking for rather general suggestions. If you need more info on what I'm doing precisely, I'll be glad to provide some code or pseudo code. I just didn't want to bloat this question too much...
Edit
I think I found the reason for the performance drops: there seems to be some kind of waiting time between two shaders, where the OpenGL pipeline waits for a previous fragment shader to finish before it hands the output to the next fragment shader. I experimented a bit with the rgb2gray conversion shader and could isolate two cases:
1.) The second rendering with the rgb2gray shader depends on the output of the first rendering with it:
|inImg| -> (rgb2gray) -> |outImg1| -> (rgb2gray) -> |outImg2|
2.) The second rendering does not depend:
|inImg| -> (rgb2gray) -> |outImg1|
|inImg| -> (rgb2gray) -> |outImg2|
It is of course obvious that variant 2.) will most likely be faster than 1.), however, I don't understand why the pipeline completes with a reasonable runtime the first time it is executed, but has those strange delays afterwards.
Also I think that the runtime measurement of the last pipeline step is always inaccurate, so I assume ~280 ms to be a more correct measurement of the feature extraction step (not ~3 ms).

I think the problem might be with the measurement method. Timing an individual GL command or even a render is very difficult because the driver will be trying to keep all stages of the GPU's pipeline busy by running different parts of multiple renders in parallel. For this reason the driver is probably ignoring glFinish and will only wait for the hardware to finish if it must (e.g. glReadPixels on a render target).
Individual renders might appear to complete very quickly if the driver is just adding them to the end of a queue but very slowly if it needs to wait for space in the queue and has to wait for an earlier render to finish.
A better method would be to run a large number of frames (e.g. 1000) and measure the total time for all of them.
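As a rough sketch of that batch-timing idea (in Python for brevity; `render` and `sync` are hypothetical callbacks, not part of any GL binding): synchronize once before and once after the whole batch, then divide the total by the frame count. `sync` stands for whatever actually forces the GPU to finish, for example a small glReadPixels on the final render target.

```python
import time

def average_frame_time(render, sync, frames=1000):
    """Return the mean seconds per frame, timing the whole batch at once."""
    sync()                          # drain work queued before the measurement
    start = time.perf_counter()
    for _ in range(frames):
        render()                    # may only enqueue work and return at once
    sync()                          # a single blocking sync at the very end
    return (time.perf_counter() - start) / frames

# Stand-in callbacks so the sketch runs without a GL context.
calls = []
avg = average_frame_time(lambda: calls.append("draw"),
                         lambda: calls.append("sync"),
                         frames=1000)
print(f"{avg * 1e3:.6f} ms per frame over {calls.count('draw')} frames")
```

Averaging this way hides per-stage numbers, so to compare stages you would time the full pipeline with and without a given stage rather than timing each draw call.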

Related

In a normal image classification using CNNs, what should be the value of the units in the dense layer?

I am creating a simple image classifier for rock-paper-scissors. I am training on my local GPU, which is not a high-end one. When I began training the model, it kept giving the error:
ResourceExhaustedError: OOM when allocating tensor with shape.
I googled this error, and the suggestion was to decrease my batch size, which I did. It did not solve anything. Later I changed my image size from 200*200 to 50*50, and then it started training with an accuracy of 99%.
Later I wanted to see whether I could do it with 150*150 images, as I had found a tutorial on the official TensorFlow channel on YouTube. I followed their exact code and it still did not work. I reduced the batch size; still no solution. Then I decreased the number of units in the dense layer from 512 to 200, and it worked, but now the accuracy is very poor. Is there any way I could tune my model to fit my GPU without hurting accuracy? And how does the number of units in the dense layer matter?
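from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     MaxPool2D, Flatten, Dropout, Dense)
from tensorflow.keras.models import Model

# X_train is the training image array, shape (N, height, width, channels)
i = Input(shape=X_train[0].shape)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(i)
x = BatchNormalization()(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((2, 2))(x)
x = Flatten()(x)
x = Dropout(0.2)(x)
x = Dense(512, activation='relu')(x)  # by far the largest layer in the model
x = Dropout(0.2)(x)
x = Dense(3, activation='softmax')(x)
model = Model(i, x)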
When I run this with an image size of 150*150, it throws that error. If I change the image size to 50*50 and reduce the batch size to 8, it works and gives me an accuracy of 99%. If I instead keep 150*150 and reduce the number of units in the dense layer to 200 (chosen arbitrarily), it runs, but the accuracy is very bad.
I am using a low-end NVIDIA GeForce MX230 GPU.
It has 4 GB of VRAM.
For 200x200 images, the output of the last MaxPool has a shape of (50,50,128), which is then flattened and serves as the input of the Dense layer, giving you a total of 50*50*128*512 = 163,840,000 weights in that layer alone. This is a lot.
To reduce the amount of parameters you can do one of the following:
- reduce the amount of filters in the last Conv2D layer
- do a MaxPool of more than 2x2
- reduce the size of the Dense layer
- reduce the size of the input images.
You have already tried the two latter options. You will only find out by trial and error which method ultimately gives you the best accuracy. You were already at 99%, which is good.
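To see how the options above change the size of the first Dense layer, here is a small back-of-the-envelope calculation. It is only a sketch: `dense_params` is a helper name made up for this example, and it assumes the architecture from the question (two 2x2 max-pools, 128 filters in the last Conv2D, 'same' padding in the convolutions) and 4 bytes per float32 parameter.

```python
def dense_params(image_side, units=512, last_filters=128, pools=2, pool_size=2):
    """Parameters (weights + biases) of the first Dense layer after Flatten."""
    side = image_side
    for _ in range(pools):
        side //= pool_size          # max pooling rounds the spatial size down
    flat = side * side * last_filters
    return flat * units + units     # one weight per input and unit, plus biases

for side, units in [(200, 512), (150, 512), (150, 200), (50, 512)]:
    n = dense_params(side, units=units)
    mib = n * 4 / 2**20             # float32 storage for the parameters alone
    print(f"{side}x{side} input, Dense({units}): {n:,} parameters, about {mib:.0f} MiB")
```

Gradients and optimizer state come on top of the parameters themselves, so the memory actually needed during training is a multiple of the figure printed here.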
If you want a platform with more VRAM available, you can use Google Colab https://colab.research.google.com/

WebGL disjoint timer query spikes

I’m seeing some strange results when trying to time some WebGL using the disjoint timer query extension.
I have written some simple WebGL code to reproduce the issue we are seeing here : https://jsfiddle.net/d79q3mag/2/.
const texImage2DQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, texImage2DQuery);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, 512, 512, 0, gl.RGBA, gl.FLOAT, buffer);
gl.endQuery(ext.TIME_ELAPSED_EXT);
tex2dQuerys.push(texImage2DQuery);
const drawQuery = gl.createQuery();
gl.beginQuery(ext.TIME_ELAPSED_EXT, drawQuery);
gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
gl.endQuery(ext.TIME_ELAPSED_EXT);
drawQuerys.push(drawQuery);
tryGetNextQueryResult(tex2dQuerys, tex2dQueryResults);
tryGetNextQueryResult(drawQuerys, drawQueryResults);
The render function uses the timer extension to individually time the texImage2D call and the drawArrays call. If I graph the results of the draw call I see some pretty large spikes (some as high as 12ms, but the majority in the 2ms to 4ms range).
However, if I increase the framerate from 30fps to 60fps the results improve (largest spike 1.8ms, most spikes between 0.4ms and 1ms).
I have also noticed that if I don’t time the texImage2D function (https://jsfiddle.net/q01aejnv/2/) then the spikes in the times for the drawArrays call also disappear at 30FPS (spikes between 1.6ms and 0.2 ms).
I’m using a Quadro P4000 with chrome 81.
In the Nvidia control panel Low latency mode is set to Ultra and the power management mode is set to Prefer maximum performance.
These results were gathered using D3D11 as the ANGLE graphics backend in chrome.
There seems to be 2 confusing things here. First is that a higher framerate seems to improve the draw times. Second is that timing the texImage2D seems to be affecting the draw times.
When timing things you probably need to call gl.flush after gl.endQuery.
Explanation: WebGL functions add commands to a command buffer. Some other process reads that buffer, but there is overhead in telling the other process "Hey! I added some commands for you!". So, in general, WebGL commands don't always do the "Hey!" part; they just insert the commands. WebGL will do an auto flush (a "Hey! Execute this!") at various points automatically, but when doing timing you may need to add a gl.flush.
Note: As I said above, there is overhead to flushing. In a normal program it's rarely important to do a flush.

How to measure CPU/GPU data transfer overhead in Metal

We have been porting some of our CPU pipeline to Metal to speed up some of the slowest parts with success. However since it is only parts of it we are transferring data back and forth to the GPU and I want to know how much time this actually takes.
Using the frame capture in XCode it informs me that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same Command Buffer).
Using Instruments I see some quite different numbers:
The entire operation takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed with the CPU being stalled the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area), and then the last 800 ms gets used up in MTLTexture getBytes which maxes out that CPU thread.
Using the Metal instruments I'm getting a few more measurements, 46ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I'm not seeing how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: XCode 6 - OpenGL Driver Monitor Parameters
This makes me think that it could be the part that copies the memory off the GPU, but then there wouldn't be a reason getBytes takes that long. And I can't see where the 149.5 ms from XCode should fit into the data from Instruments.
Questions
When exactly does it transfer the data? If this cannot be inferred from the measurements I did, how do I acquire those?
Does the GPU code actually only take 149.5 ms to execute, or is XCode lying to me? If not, then where is the remaining 660-149.5 ms being used?

What’s the meaning of `Duration: 30.18s, Total samples = 26.26s (87.00%)` in go pprof?

As I understand it, pprof stops and samples a Go program every 10ms. So a 30s program should get 3000 samples, but what is the meaning of the 26.26s? How can a sample count be shown as a time duration?
What's more, I have even seen output where the sample time is bigger than the wall time. How can that happen?
Duration: 5.13s, Total samples = 5.57s (108.58%)
That confusing wording was reported in google/pprof issue 128:
The "Total samples" part is confusing.
Milliseconds are continuous, but samples are discrete — they're individual points, so how can you sum them up into a quantity of milliseconds?
The "sum" is the sum of a discrete number (a quantity of samples), not a continuous range (a time interval).
Reporting the sums makes perfect sense, but reporting discrete numbers using continuous units is just plain confusing.
Please update the formatting of the Duration line to give a clearer indication of what a quantity of samples reported in milliseconds actually means.
Raul Silvera's answer:
Each callstack in a profile is associated to a set of values. What is reported here is the sum of these values for all the callstacks in the profile, and is useful to understand the weight of individual frames over the full profile.
We're reporting the sum using the unit described in the profile.
Would you have a concrete suggestion for this example?
Maybe just a rewording would help, like:
Duration: 1.60s, Samples account for 14.50ms (0.9%)
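To make the "sum of values" concrete, here is a small sketch of the arithmetic (in Python, purely illustrative). It assumes Go's default CPU profiling rate of 100 Hz, so that each sample carries 10ms of CPU time; the sample counts are back-calculated from the two outputs quoted in the question and are not taken from real profiles.

```python
SAMPLE_PERIOD_S = 0.010   # default CPU profile rate in Go is 100 Hz

def total_samples(sample_count, duration_s):
    """Return (summed sample time in seconds, percentage of wall-clock duration)."""
    total_s = sample_count * SAMPLE_PERIOD_S
    return total_s, 100 * total_s / duration_s

for count, duration in [(2626, 30.18), (557, 5.13)]:
    total_s, pct = total_samples(count, duration)
    print(f"Duration: {duration}s, {count} samples of 10ms = {total_s:.2f}s ({pct:.2f}%)")
```

Samples are only taken while a thread is actually running on a CPU, so a program that is partly idle ends up below 100%, while a program keeping more than one core busy collects more than one sample per 10ms tick and can exceed 100% of the wall time.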
Further pprof improvements are still being discussed in golang/go issue 36821:
pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs).
The issue is with the use of OS timers used for sampling; OS timers are coarse-grained and have a high skid.
I will propose a design to extend CPU profiling by sampling CPU Performance Monitoring Unit (PMU) aka hardware performance counters
It includes examples where Total samples exceeds duration.
Dependence on the number of cores and length of test execution:
The results of goroutine.go test depend on the number of CPU cores available.
On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds.
However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem.
One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
Since Go 1.17 (and improved in Go 1.18), you can add pprof labels to know more:
A cool feature of Go's CPU profiler is that you can attach arbitrary key value pairs to a goroutine. These labels will be inherited by any goroutine spawned from that goroutine and show up in the resulting profile.
Let's consider the example below that does some CPU work() on behalf of a user.
By using the pprof.Labels() and pprof.Do() API, we can associate the user with the goroutine that is executing the work() function.
Additionally the labels are automatically inherited by any goroutine spawned within the same code block, for example the backgroundWork() goroutine.
func work(ctx context.Context, user string) {
	labels := pprof.Labels("user", user)
	pprof.Do(ctx, labels, func(_ context.Context) {
		go backgroundWork()
		directWork()
	})
}
How you use these labels is up to you.
You might include things such as user ids, request ids, http endpoints, subscription plan or other data that can allow you to get a better understanding of what types of requests are causing high CPU utilization, even when they are being processed by the same code paths.
That being said, using labels will increase the size of your pprof files. So you should probably start with low cardinality labels such as endpoints before moving on to high cardinality labels once you feel confident that they don't impact the performance of your application.

How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the voltage to 3.3V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8MHz. Assuming everything will run twice as slowly, a measurement cycle would still be possible because ~8000us < 9615us.
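The budget described above can be written out as a short calculation (a sketch in Python for illustration only; the numbers are the measurements from this post, and the factor of two for 8MHz is the same assumption as above, not a measured value):

```python
SAMPLE_RATE_HZ = 104
READ_US, CALIBRATE_US, WRITE_US = 3350, 5, 550   # measured at 16 MHz
SLOWDOWN = 2                                     # assumed factor at 8 MHz

period_us = 1_000_000 / SAMPLE_RATE_HZ           # time available per sample
cycle_16mhz_us = READ_US + CALIBRATE_US + WRITE_US
cycle_8mhz_us = SLOWDOWN * cycle_16mhz_us

# Longest SD write (measured at 8 MHz) that still fits in one sample period.
write_budget_8mhz_us = period_us - SLOWDOWN * (READ_US + CALIBRATE_US)

print(f"period:              {period_us:.0f} us")
print(f"cycle at 16 MHz:     {cycle_16mhz_us} us")
print(f"cycle at 8 MHz:      {cycle_8mhz_us} us")
print(f"write budget, 8 MHz: {write_budget_8mhz_us:.0f} us")
```

Any single write that takes longer than the last figure would overrun the sample period at 8MHz, regardless of how short the average write is.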
After some testing (still at 5V @ 16MHz), however, it occurred to me that every now and then, a write cycle would take ~1880us instead of ~550us. I am using the library SdFat to write and test SD cards (RawWrite example). The following results came in when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average time to write is fairly consistent, but sometimes a peak duration of 10x average occurs! According to the writer of the library, this is because the SD card needs some erase cycles in between x amount of write cycles. This causes a write delay (src:post#18&#22). This delay, however, pushes the time required for a cycle out of the available 9615us bracket, because the total measure cycle would be 10672us.
The data I am trying to write, is first put into a string using sprintf:
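// Sized for six 16-bit readings: 6 * 6 characters + 5 tabs + NUL = 42 bytes.
// With buf[20], sprintf could write past the end of the buffer.
char buf[48] = "";
snprintf(buf, sizeof(buf), "%li\t%li\t%li\t%li\t%li\t%li", rawData[0], rawData[1], rawData[2], rawData[3], rawData[4], rawData[5]);
myLog.println(buf);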
This writes the data to a txt file. At my sample rate, a throughput of only 21*104 = 2184 B/s would suffice. Lowering the speed of the RawWrite example to 6 KB/s causes the SD card to write without getting an extended write delay. Yet my code still has them, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2kB) with variables. Creating two 512 byte buffers - like suggested in the same forum - is not possible for me.
Initially, I used two strings. Merging them into one, didn't affect the write speed with any significance.
I don't know how to work around the delay, but I get more stable and faster write times when I write to a binary file instead of a ".csv" or ".txt" file.
The following link provides a good script for writing data as a binary struct to the SD card. (There are some small typos in the example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help with the time variation, but it might shorten the write time enough that the timing issue becomes negligible.
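To give a feel for the size difference, here is a sketch of the record layout (in Python, purely for illustration; the linked article does this with a C struct on the Arduino). It assumes the six readings are signed 16-bit values stored little-endian, which matches the "2 bytes each" in the question and the AVR byte order; the sample values are made up.

```python
import struct

RECORD = struct.Struct("<6h")               # six little-endian signed 16-bit values

sample = (-120, 45, 16384, -3, 0, 32767)    # made-up sensor readings
binary = RECORD.pack(*sample)
# Text line as println() would emit it: tab-separated values plus CR LF.
text = ("\t".join(str(v) for v in sample) + "\r\n").encode("ascii")

print(f"binary record: {len(binary)} bytes, always")
print(f"text line:     {len(text)} bytes, varies with the values")
print(f"binary throughput at 104 Hz: {len(binary) * 104} B/s")
print(f"records per 512-byte SD block: {512 // len(binary)}")
```

Besides being smaller, the binary record has a fixed size and needs no string formatting on the microcontroller. Reading the file back on a PC is then a matter of calling `RECORD.unpack` (or `RECORD.iter_unpack`) on the file contents.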

Resources