How to measure CPU/GPU data transfer overhead in Metal - xcode

We have been porting parts of our CPU pipeline to Metal to speed up the slowest stages, with success. However, since only parts of the pipeline run on the GPU, we transfer data back and forth between the CPU and the GPU, and I want to know how much time this actually takes.
Using the frame capture in Xcode, I am told that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same command buffer).
Using Instruments I see some quite different numbers:
The entire operation takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed by the CPU being stalled for the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area), and then the last 800 ms are used up in MTLTexture getBytes, which maxes out that CPU thread.
Using the Metal instruments I'm getting a few more measurements, 46ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I'm not seeing how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: Xcode 6 - OpenGL Driver Monitor Parameters
This makes me think that "Page Off" could be the part that copies the memory off the GPU, but then there would be no reason for getBytes to take that long. And I can't see where the 149.5 ms from Xcode fits into the data from Instruments.
Questions
When exactly does it transfer the data? If this cannot be inferred from the measurements I did, how do I acquire those?
Does the GPU code actually take only 149.5 ms to execute, or is Xcode lying to me? If it does, where is the remaining 660 - 149.5 ms being spent?
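As a sanity check on the figures above, the three CPU-side phases reported by Instruments can be summed and compared against the kernel time reported by Xcode. This is only arithmetic on the numbers quoted in the question; the phase labels are names chosen for this sketch.

```python
# Timings quoted in the question, in milliseconds.
cpu_phases_ms = {
    "replaceRegion (upload)": 180.0,
    "waitUntilCompleted (CPU stalled)": 660.0,
    "getBytes (download)": 800.0,
}
kernel_total_ms = 149.5  # sum of kernel times from the Xcode frame capture

phase_sum_ms = sum(cpu_phases_ms.values())
unexplained_wait_ms = (
    cpu_phases_ms["waitUntilCompleted (CPU stalled)"] - kernel_total_ms
)

print(f"sum of CPU-side phases: {phase_sum_ms:.1f} ms")
print(f"wait time not covered by kernels: {unexplained_wait_ms:.1f} ms")
```

The phases add up to roughly the 1.62 s total, so the open part is the portion of the wait that the kernel times do not account for.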

Related

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and is documented here. I read the counter and reread it 1 second later. The difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate for 10 seconds), the resulting value is around 300 million (EDITED), which is about 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
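The extrapolation above can be written out explicitly. The sketch uses only the numbers quoted in the question. The 64-byte figure is an assumption (that each IMC count corresponds to one cache-line-sized transfer) and is used only to turn the count into an approximate byte rate.

```python
CACHE_LINE_BYTES = 64  # assumption: one IMC count = one 64-byte transfer

perf_window_s = 10
offcore_l3_miss = 3_713_037   # offcore_response.all_data_rd.l3_miss.any_response
retired_l3_miss = 2_909_573   # mem_load_uops_retired.l3_miss
imc_reads_per_s = 30_000_000  # approximate 1-second delta of UNC_IMC_DRAM_DATA_READS

# Scale the 1-second IMC delta to the 10-second perf window.
imc_reads_in_window = imc_reads_per_s * perf_window_s
ratio_offcore = imc_reads_in_window / offcore_l3_miss
ratio_retired = imc_reads_in_window / retired_l3_miss
approx_read_gb_per_s = imc_reads_per_s * CACHE_LINE_BYTES / 1e9

print(f"IMC reads over {perf_window_s} s: {imc_reads_in_window:,}")
print(f"IMC / offcore L3 miss: {ratio_offcore:.0f}x")
print(f"IMC / retired L3 miss: {ratio_retired:.0f}x")
print(f"approximate DRAM read rate: {approx_read_gb_per_s:.2f} GB/s")
```

Depending on which core counter is used as the baseline, the IMC count is roughly 80 to 100 times larger.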
P.S.: The difference is much smaller (but still large), when the system has more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for columns READ, WRITE and IO are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO will be either READ or WRITE. In other words, during the depicted one second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB READ and WRITE requests belong to IO. Based on this explanation, the above three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The processes on the scheduler are one of swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect that (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the uops_retired.all event occurs about 15,000,000 times on average in a 1-second interval. During the same period, the associated IMC counter records around 30,000,000 read requests. In other words, assuming that all memory accesses are directly caused by CPU instructions, there would be two memory accesses for each retired micro-operation. This seems impossible, especially considering that there are multiple levels of cache. Therefore, in the idle scenario, the read accesses are perhaps caused by IO.
Actually, the traffic was mostly caused by the GPU device, which is why it does not show up in the core performance counters. Here is the relevant PCM output for a sample run on a relatively idle system with the display set (using xrandr) to a resolution of 3840x2160 and a refresh rate of 60:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).
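A rough estimate of display scan-out traffic shows why resolution matters so much here. The sketch assumes 4 bytes per pixel and that the whole framebuffer is read from DRAM once per refresh. Both are assumptions, not values taken from the PCM output.

```python
def scanout_bytes_per_second(width, height, refresh_hz, bytes_per_pixel=4):
    """Bytes read per second if the whole framebuffer is fetched once per refresh."""
    return width * height * bytes_per_pixel * refresh_hz


high = scanout_bytes_per_second(3840, 2160, 60)
low = scanout_bytes_per_second(800, 600, 60)

print(f"3840x2160 @ 60: {high / 1e9:.2f} GB/s")
print(f"800x600   @ 60: {low / 1e9:.3f} GB/s")
print(f"ratio: {high / low:.1f}x")
```

Under these assumptions the 3840x2160 case comes out at about 1.99 GB/s, in line with the IO column staying between 1.99 and 2.01. The simple model predicts a smaller drop than the one observed at 800x600, so it should be read as an order-of-magnitude check only.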

Cache blocking brings no improvement for image filter on ARM

I'm experimenting with cache blocking. To do that, I implemented 2 convolution based smoothing algorithms. The gaussian kernel I'm using looks like this:
The first algorithm is just the simple double for loop, looping from left to right, top to bottom as shown below.
Image source: (https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html)
In the second algorithm I tried cache blocking by splitting the loops into chunks, which became something like the following. I used a BLOCK size of 512x512.
Image source: (https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html)
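Since the loop diagrams are not reproduced here, the two traversal orders can be sketched as follows. This is not the poster's code: the function names, the 5x5 binomial kernel and the tiny test image are placeholders chosen for illustration, and the block size is a parameter rather than the 512 used in the question.

```python
# Hypothetical 5x5 binomial kernel (unnormalised), standing in for the
# Gaussian kernel from the question.
ROW = [1, 4, 6, 4, 1]
KERNEL = [[a * b for b in ROW] for a in ROW]
K = len(KERNEL) // 2  # kernel radius (2 for a 5x5 kernel)


def filter_pixel(img, y, x):
    """Weighted sum of the 5x5 neighbourhood centred on (y, x)."""
    return sum(
        img[y + dy][x + dx] * KERNEL[dy + K][dx + K]
        for dy in range(-K, K + 1)
        for dx in range(-K, K + 1)
    )


def smooth_naive(img):
    """Algorithm 1: scan interior pixels left to right, top to bottom."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(K, h - K):
        for x in range(K, w - K):
            out[y][x] = filter_pixel(img, y, x)
    return out


def smooth_blocked(img, block):
    """Algorithm 2: the same pixels, visited one block x block tile at a time."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for by in range(K, h - K, block):
        for bx in range(K, w - K, block):
            for y in range(by, min(by + block, h - K)):
                for x in range(bx, min(bx + block, w - K)):
                    out[y][x] = filter_pixel(img, y, x)
    return out


# Small made-up image: both traversal orders must give identical output.
image = [[(3 * x + 7 * y) % 256 for x in range(16)] for y in range(12)]
assert smooth_naive(image) == smooth_blocked(image, block=4)
```

Both functions do exactly the same arithmetic; only the order in which output pixels are visited differs, which is what cache blocking changes.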
I'm running the code on a Raspberry Pi 3B+, which has a Cortex-A53 with 32KB of L1 and 256KB of L2, I believe. I ran the two algorithms on different image sizes (2048x1536, 6000x4000, 12000x8000 and 16000x12000, all 8-bit grayscale images), but at every image size the run times of the two algorithms were very similar.
The question is: shouldn't the first algorithm suffer access latency that the second does not, especially with large images (like 12000x8000)? Based on the description of cache blocking in this link, when the 1st algorithm processes data at the end of the image rows, the data at the beginning of the rows should already have been evicted from the L1 cache. Taking the 12000x8000 image as an example: since we are using a 5x5 kernel, 5 rows of data are needed, which is 12000x5=60KB, already larger than the 32KB L1. When we start processing a new row, 4 rows of previous data are still needed, but they are likely gone from L1 and need to be re-fetched. The second algorithm shouldn't have this problem because the block size is small. Can anyone please tell me what I am missing?
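The working-set argument above can be checked with a few lines of arithmetic. The cache size and image width are the ones quoted in the question. The 2-pixel halo on each side of a block is an assumption that follows from a 5x5 kernel, and the function name is made up for this sketch.

```python
L1_BYTES = 32 * 1024  # L1 data cache size quoted in the question


def window_bytes(row_pixels, kernel_rows=5, bytes_per_pixel=1):
    """Bytes of input touched while sweeping one output row of the given width."""
    return row_pixels * kernel_rows * bytes_per_pixel


full_row = window_bytes(12000)     # unblocked: the whole image width
one_block = window_bytes(512 + 4)  # blocked: 512 columns plus a 2-pixel halo per side

print(f"unblocked 5-row window: {full_row} bytes, fits in L1: {full_row <= L1_BYTES}")
print(f"blocked 5-row window:   {one_block} bytes, fits in L1: {one_block <= L1_BYTES}")
```

So the 5-row window of a full 12000-pixel row exceeds L1, while the same window inside a 512-wide block is only a few kilobytes.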
I also profiled the algorithm using oprofile with the following data:
Algorithm 1
event                count
L1D_CACHE_REFILL     13,933,254
PREFETCH_LINEFILL    13,281,559

Algorithm 2
event                count
L1D_CACHE_REFILL     9,456,369
PREFETCH_LINEFILL    8,725,250
So it looks like the 1st algorithm does have more cache misses than the second, as reflected by the L1D_CACHE_REFILL counts. But it also has a higher prefetch count, which may be due to the simple access pattern of the loop. So does the usual story about cache blocking not take data prefetching into account?
Conceptually, you're right: blocking will reduce cache misses by keeping the input window in cache.
I suspect the main reason you're not seeing a speedup is that the hardware prefetcher is fetching from all 5 input rows. Your performance counters show more prefetch line fills in the unblocked implementation. I suspect many textbook examples are out of date, since cache prefetching has kept getting better. I think Intel's L2 prefetcher could already detect and prefetch from up to 16 linear streams about 10 years ago.
Assume the filter takes 5 * 5 cycles. So that would be 20.8 ns = 25 / 1.2GHz on RPI3. The IO cost will be reading a 5 high column of new input pixels. The amortized IO cost will be 5 bytes / 20.8ns = 229 MiB/s, which is much less than the ~2 GiB/s DRAM bandwidth. So in theory, the relatively slow computation combined with prefetching (I'm not certain how effective) means that memory access isn't a bottleneck.
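The estimate above, written out step by step. The one-cycle-per-tap cost and the 1.2 GHz clock are the answer's own assumptions; the variable names are chosen for this sketch.

```python
CLOCK_HZ = 1.2e9          # clock frequency used in the estimate
CYCLES_PER_PIXEL = 5 * 5  # assumed: one cycle per kernel tap
NEW_BYTES_PER_PIXEL = 5   # one new 5-high column of 8-bit pixels per output pixel

seconds_per_pixel = CYCLES_PER_PIXEL / CLOCK_HZ
bytes_per_second = NEW_BYTES_PER_PIXEL / seconds_per_pixel
mib_per_second = bytes_per_second / 2**20

print(f"time per output pixel: {seconds_per_pixel * 1e9:.1f} ns")
print(f"amortized input rate:  {mib_per_second:.0f} MiB/s")
```

This reproduces the figures of about 20.8 ns per pixel and about 229 MiB/s, roughly a tenth of the ~2 GiB/s DRAM bandwidth mentioned above.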
Try increasing the filter height; the cache can only detect and prefetch from a limited number of streams. Or try vectorizing the computation so that memory access becomes the bottleneck.

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
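The step-to-step speedups quoted in the list above can be recomputed from the raw timings. This is plain arithmetic on the posted numbers; the function name is made up for this sketch.

```python
# (work-items, kernel time in ms) as measured in the question
timings_ms = [(1, 35735), (2, 11822), (10, 2380), (100, 239), (200, 122)]
CPU_MS = 580  # average CPU time for the same operation


def step_speedups(timings):
    """Speedup of each configuration relative to the previous one."""
    return [
        (n_next, round(t_prev / t_next, 2))
        for (_, t_prev), (n_next, t_next) in zip(timings, timings[1:])
    ]


for items, speedup in step_speedups(timings_ms):
    print(f"{items:>4} work-items: {speedup}x faster than the previous row")

# Configurations in which the GPU beats the CPU.
faster_than_cpu = [n for n, t in timings_ms if t < CPU_MS]
print("GPU faster than CPU with work-items:", faster_than_cpu)
```

Going from 1 to 2 work-items doubles the count but gives about a 3x speedup, while every other step scales at or just below the increase in work-items.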
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing. With more work items, neighbouring accesses can be combined into fewer memory transactions.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
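To illustrate the point about group sizes, the sketch below rounds a work-item count up to the next multiple of a hardware group width and reports how many lanes would sit idle in the last group if the count is not padded. The widths 16 and 64 are the example values from the answer, not properties queried from the GT 525M, and the function names are made up.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def idle_lanes(n, width):
    """Lanes left unused in the last hardware group when n items are launched."""
    return round_up(n, width) - n


for width in (16, 64):
    print(
        f"width {width}: 1000 work-items -> padded to {round_up(1000, width)}, "
        f"{idle_lanes(1000, width)} idle lanes in the last group"
    )
```

If the count is padded in practice, the extra work-items would have to be guarded in the kernel so that they do no real work.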
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.

How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the voltage to 3.3V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8MHz. Assuming everything will run twice as slowly, a measurement cycle would still be possible because ~8000us < 9615us.
After some testing (still at 5V @ 16MHz), however, I noticed that every now and then a write cycle takes ~1880us instead of ~550us. I am using the SdFat library to write to and test SD cards (RawWrite example). Testing the card gave the following results:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average write time is fairly consistent, but occasionally a peak of about 10x the average occurs! According to the author of the library, this is because the SD card needs to perform erase cycles after a certain number of write cycles, which causes a write delay (source: posts #18 and #22). This delay, however, pushes the time required for a cycle beyond the available 9615us, because the total measurement cycle would be 10672us.
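The timing budget can be laid out as a small calculation. The sketch uses the rounded durations quoted above and the question's own assumption that everything takes twice as long at 8MHz, so the totals are approximate; the names are chosen for illustration.

```python
SAMPLE_RATE_HZ = 104
period_us = 1_000_000 / SAMPLE_RATE_HZ  # time available per sample

READ_US, CALIBRATE_US = 3350, 5
WRITE_TYPICAL_US, WRITE_SLOW_US = 550, 1880
SLOWDOWN_AT_8MHZ = 2  # assumption from the question: everything takes twice as long


def cycle_us(write_us, slowdown=1):
    """Total time for one read + calibrate + write cycle."""
    return (READ_US + CALIBRATE_US + write_us) * slowdown


print(f"available per sample: {period_us:.0f} us")
for label, write_us in (("typical", WRITE_TYPICAL_US), ("slow", WRITE_SLOW_US)):
    for slowdown in (1, SLOWDOWN_AT_8MHZ):
        total = cycle_us(write_us, slowdown)
        verdict = "fits" if total < period_us else "overruns"
        print(f"{label} write, slowdown x{slowdown}: {total} us ({verdict})")
```

With these inputs only the combination of a slow write and the halved clock exceeds the sample period, which is the case described above.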
The data I am trying to write, is first put into a string using sprintf:
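char buf[48] = ""; // six values of up to 6 characters each, 5 tabs and the terminator need 42 bytes
snprintf(buf, sizeof(buf), "%li\t%li\t%li\t%li\t%li\t%li", rawData[0], rawData[1], rawData[2], rawData[3], rawData[4], rawData[5]); // snprintf truncates instead of overrunning buf
myLog.println(buf);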
This writes the data to a txt file. At my sample rate, only 21*104=2184 B/s is needed. Lowering the rate of the RawWrite example to 6 KB/s lets the SD card write without any extended write delays. Yet my code still shows them, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2kB) with variables. Creating two 512 byte buffers, as suggested in the same forum, is not possible for me.
Initially, I used two strings. Merging them into one didn't affect the write speed significantly.
I don't know how to work around the delay, but I get more stable and faster write times when I write to a binary file instead of a ".csv" or ".txt" file.
The following link provides a good script for writing data to the SD card as a binary struct. (There are a few small typos in the example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help with the time variation, but it might reduce the write time enough to make the timing issue negligible.
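To give a feel for the size difference, the sketch below packs six 16-bit readings into a fixed-size binary record and compares it with the tab-separated text line. It runs on a PC and only illustrates the record layout; the sample values, the format string and the function name are made up, and it is not the code from the linked post.

```python
import struct

RECORD_FORMAT = "<6h"  # six little-endian signed 16-bit values, no padding
SAMPLE_RATE_HZ = 104


def pack_record(values):
    """Pack six sensor readings into one fixed-size binary record."""
    return struct.pack(RECORD_FORMAT, *values)


sample = (-32768, 1234, -5, 0, 32767, 42)  # made-up sensor readings
binary = pack_record(sample)
# Text form: tab-separated values plus the CR LF that println() appends.
text = ("\t".join(str(v) for v in sample) + "\r\n").encode("ascii")

print(f"binary record: {len(binary)} bytes, {len(binary) * SAMPLE_RATE_HZ} B/s")
print(f"text line:     {len(text)} bytes for this sample")
```

The binary record always has the same length regardless of the values, and no string formatting is needed per sample.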

Strange performance drops with glDrawArrays()/glDrawElements()

I'm currently trying to do some GPGPU image processing on a mobile device (Nokia N9 with OMAP 3630/PowerVR SGX 530) with OpenGL ES 2.0. Basically my application's pipeline uploads a color image to video memory, converts it to grayscale, computes an integral image and extracts some features with the help of several fragment shaders.
The output is correct, however the runtime of the program is somewhat confusing. When I push the same image through my pipeline 3+ times, the timings are something like this (after the 3rd time the timings stay the same):
RGB-to-gray conversion: 7.415769 ms
integral image computation: 818.450928 ms
feature extraction: 285.308838 ms
RGB-to-gray conversion: 1471.252441 ms
integral image computation: 825.012207 ms
feature extraction: 1.586914 ms
RGB-to-gray conversion: 326.080353 ms
integral image computation: 2260.498047 ms
feature extraction: 2.746582 ms
If I exclude the feature extraction, the timings for the integral image computation change to something reasonable:
RGB-to-gray conversion: 7.354737 ms
integral image computation: 814.392090 ms
RGB-to-gray conversion: 318.084717 ms
integral image computation: 812.133789 ms
RGB-to-gray conversion: 318.145752 ms
integral image computation: 812.103271 ms
If I additionally exclude the integral image computation from the pipeline, this happens (also reasonable):
RGB-to-gray conversion: 7.751465 ms
RGB-to-gray conversion: 9.216308 ms
RGB-to-gray conversion: 8.514404 ms
The timings I would expect are more like:
RGB-to-gray conversion: ~8 ms
integral image computation: ~800 ms
feature extraction: ~250 ms
Basically, the timings are differing from my expectations in two points:
1.) The rgb2gray conversion takes 300 ms instead of 8 ms when I extend the pipeline
2.) The integral image computation takes 2200 ms instead of 800 ms when I extend the pipeline further
I suspect a shader switch to be the cause of the performance drop for 1.). But can this really have this much of an influence? Especially when considering that the feature extraction step consists of multiple passes with different fragment shaders and FBO switches, but is still as fast as expected.
Particularly odd is the performance drop 2.) during the integral image computation, because it's a multipass operation using only one shader and ping-pong render targets. If I measure the performance of glDraw*() for each pass, the drop happens only once among all passes and always at the same pass (nothing special is happening in this pass, though).
I also suspected memory constraints to be the cause, since I'm using quite a few textures/FBOs for my output data, but altogether I'm occupying ~6 MB of video memory, which really isn't that much.
I've tried glDrawElements(), glDrawArrays() and glDrawArrays() with VBOs with the same outcome every time.
All timings have been captured with:
glFinish();
timer.start();
render();
glFinish();
timer.stop();
If I leave out the calls to glFinish(), the timings are the same, though.
Does anyone have an idea what I could be doing wrong? I'm not too savvy with OpenGL, so maybe someone can point me in a direction or to something I should look out for. I know this is hard to answer without any code samples, which is why I'm asking for rather general suggestions. If you need more info on what I'm doing precisely, I'll be glad to provide some code or pseudo code. I just didn't want to bloat this question too much...
Edit
I think I found what causes the performance drops: it seems to be some kind of waiting time between two shaders, where the OpenGL pipeline waits for a previous fragment shader to finish before it hands the output to the next fragment shader. I experimented a bit with the rgb2gray conversion shader and could isolate two cases:
1.) The second rendering with the rgb2gray shader depends on the output of the first rendering with it:
|inImg| -> (rgb2gray) -> |outImg1| -> (rgb2gray) -> |outImg2|
2.) The second rendering does not depend:
|inImg| -> (rgb2gray) -> |outImg1|
|inImg| -> (rgb2gray) -> |outImg2|
It is of course obvious that variant 2.) will most likely be faster than 1.), however, I don't understand why the pipeline completes with a reasonable runtime the first time it is executed, but has those strange delays afterwards.
Also I think that the runtime measurement of the last pipeline step is always inaccurate, so I assume ~280 ms to be a more correct measurement of the feature extraction step (not ~3 ms).
I think the problem might be with the measurement method. Timing an individual GL command or even a render is very difficult because the driver will be trying to keep all stages of the GPU's pipeline busy by running different parts of multiple renders in parallel. For this reason the driver is probably ignoring glFinish and will only wait for the hardware to finish if it must (e.g. glReadPixels on a render target).
Individual renders might appear to complete very quickly if the driver is just adding them to the end of a queue but very slowly if it needs to wait for space in the queue and has to wait for an earlier render to finish.
A better method would be to run a large number of frames (e.g. 1000) and measure the total time for all of them.
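The suggested measurement could look roughly like the sketch below. It is written in Python only to show the structure: `render` and `finish` are hypothetical callables standing in for the application's own render function and a blocking call such as glFinish, and the dummy workload at the end is a placeholder.

```python
import time


def average_frame_ms(render, frames=1000, finish=lambda: None):
    """Average wall-clock time per frame over many back-to-back frames."""
    finish()  # drain any pending work before the timer starts
    start = time.perf_counter()
    for _ in range(frames):
        render()
    finish()  # wait once, at the very end, for all queued work
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / frames


# Placeholder workload; in the real application this would be the GL pipeline.
print(f"{average_frame_ms(lambda: sum(range(1000)), frames=1000):.4f} ms per frame")
```

Because the timer only stops after all frames have been submitted and completed, queueing effects between individual renders average out.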

Resources