Performance Monitoring on Xilinx SDK

I am using a Zynq UltraScale+ ZCU106 FPGA board and I'm trying to monitor cache misses when running my own code on it.
I'm running a very simple Z = Ax + B where A, B, and Z are 2D arrays.
I'm running this code because I know this will cause cache misses.
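A minimal sketch of the kind of loop I have in mind (purely illustrative: the names and sizes are mine, and x is treated as a scalar):

    // Illustrative only: element-wise Z = A*x + B over 2D arrays.
    // With N = 1024 each float array is 4 MB, well beyond typical L1/L2 sizes,
    // so the streaming reads of A and B should generate plenty of cache misses.
    constexpr int N = 1024;
    static float A[N][N], B[N][N], Z[N][N];

    void compute(float x) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                Z[i][j] = A[i][j] * x + B[i][j];
    }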
Following this link:
https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18842421/Zynq+UltraScale+MPSoC+-+System+Performance+Modelling
I should be able to monitor cache misses when I run my code on the board, but when I do, the tool doesn't catch any. Am I misunderstanding what this does? Or is there a better way to monitor cache misses?
I asked this question on Xilinx's forums but haven't heard back, so I figured I'd ask here.

Related

sys_icache_invalidate() slow on M1

Following up on an earlier question of mine. I'm writing, testing and benchmarking code on a MacBook Air with the M1 CPU running macOS 13.2.
I implemented the code generation approach I suggested in my question and got all tests working, compared to a "conventional" (no code generation) approach to the same problem. As usual, I had to enable writes to executable pages using pthread_jit_write_protect_np(0) prior to generating the code, followed by write-protecting the pages again using pthread_jit_write_protect_np(1), and then call sys_icache_invalidate() prior to running the generated code, due to cache coherency issues between the L1 I- and D-caches.
If I run the full code with the call to sys_icache_invalidate() commented out, it takes a few hundred nanoseconds, which is quite competitive with the conventional approach. This is after a few straightforward optimizations, and after working on it more, I am certain I'd be able to beat the conventional approach.
However, the code of course doesn't work with sys_icache_invalidate() commented out. Once I add it back and benchmark, it's adding almost 3 µs to the execution time. This makes the codegen approach hopelessly slower than the conventional approach.
Looking at Apple's code for sys_icache_invalidate(), it seems simple enough: for each cache line, with the starting address in a register xN, it runs ic ivau, xN. Afterwards, it runs dsb ish and isb. It occurred to me that I could run ic ivau, xN after each cache line is generated in my codegen function, and then dsb ish and isb at the end. My thought is that perhaps each ic ivau, xN instruction could run in parallel with the rest of the codegen.
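In inline-assembly form, my understanding of what that routine boils down to is roughly the following (the 64-byte line size is an assumption on my part; the real code derives it from CTR_EL0):

    // Rough re-creation, not Apple's actual source: one ic ivau per cache line,
    // then dsb ish and isb at the end.
    #include <cstdint>
    #include <cstddef>

    void invalidate_icache_range(const void* start, size_t len) {
        constexpr uintptr_t kLine = 64;  // assumed I-cache line size
        uintptr_t p   = reinterpret_cast<uintptr_t>(start) & ~(kLine - 1);
        uintptr_t end = reinterpret_cast<uintptr_t>(start) + len;
        for (; p < end; p += kLine)
            __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");
        __asm__ volatile("dsb ish" ::: "memory");
        __asm__ volatile("isb" ::: "memory");
    }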
Unfortunately, the code still failed, and moreover, it only shaved a couple hundred ns off the execution time. I then decided to add a call to pthread_jit_write_protect_np(1) before each ic ivau, xN, followed by a call to pthread_jit_write_protect_np(0), which finally fixed the code. At this point, it added a further 5 µs to the execution time, which renders the approach completely unfeasible. Scratch that, I made a mistake: even with the calls to pthread_jit_write_protect_np(), I simply can't get it to work unless I call Apple's sys_icache_invalidate().
While I've made peace with the fact that I will need to abandon the codegen approach, I just wanted to be sure:
Does ic ivau, xN "block" the M1, i.e. prevent other instructions from executing in parallel, or perhaps flush the pipeline?
Does ic ivau, xN really not work if the page is writeable? Or perhaps pthread_jit_write_protect_np() is doing some other black magic under the hood, unrelated to its main task of write-protecting the page, that I could also do without actually write-protecting the page? For reference, here is the source to Apple's pthread library, but it essentially calls os_thread_self_restrict_rwx_to_rx() or os_thread_self_restrict_rwx_to_rw(), which I assume are Apple-internal functions whose source I was unable to locate.
Is there some other approach to cache line invalidation to reduce this overhead?
Sorry, these sound like rather loosely formulated questions to me and I'm not sure I've understood them correctly, but I will try to answer.
Does ic ivau, xN "block" the M1, i.e. prevents other instructions from
executing in parallel, or perhaps flushes the pipeline?
'Parallel' does not make sense to me in this context. Is it blocking other CPUs, or other instructions on the same CPU?
ARMv8 covers both 'in-order' processors (e.g. Cortex-A53) and 'out-of-order' processors (e.g. Cortex-A57).
In both cases, 'in-order' and 'out-of-order' CPUs have internal multi-stage pipelines. A pipeline, in this context, means that a few instructions are executed 'in parallel', i.e. at the same time; more precisely, execution of inst2 can start before inst1 has completed (though I'm not sure the question means 'parallel' of this kind).
There is also a difference between issuing an instruction for execution and completing it. For example, issuing ic ivau starts the invalidation of the cache, but it does not block the pipeline until it completes; synchronisation for completion is provided by the dsb/isb barrier instructions.
The "Ordering and completion of data and instruction cache instructions" section in the ARMv8 reference manual describes all cache-related ordering in detail.
So, considering all of the above, the answers to the original questions are:
prevents other instructions from executing in parallel
No
or perhaps flushes the pipeline
No
^^ Disclaimer: all of the above is true for a generic ARMv8 CPU (also sometimes referred to as 'arm64'). The M1 might have its own hardware bugs or implementation specifics that would affect execution.
Does ic ivau, xN really not work if the page is writeable?
No, the ic instruction itself is not affected by memory page attributes.
Is there some other approach to cache line invalidation to reduce this overhead?
If the memory block to invalidate is big enough, it might be faster to invalidate the whole cache at once instead of looping over the memory region.
side note:
However, the code of course doesn't work with sys_icache_invalidate()
commented out. Once I add it back and benchmark, it's adding almost 3
µs to the execution time
And why does that surprise you? A cache is all about repeated access. The roughly 3 µs comes from the need to access a higher (slower) memory level. However, after the first execution, once the instructions/data have been fetched back into the cache, performance will be back to normal.
Also, you mention invalidating the instruction cache without mentioning flushing the data cache anywhere.
Any self-modifying-code / code-loading sequence is (see the sketch after the list):
write code to memory
flush data cache
invalidate instruction cache
jump to execute new code
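A minimal sketch of that sequence for a generic ARMv8 core (the line size is assumed to be 64 bytes here; real code should derive it from CTR_EL0):

    // Clean the D-cache to the point of unification, invalidate the I-cache,
    // then issue barriers before jumping to the freshly written code.
    #include <cstdint>
    #include <cstddef>

    void sync_code_region(void* start, size_t len) {
        constexpr uintptr_t kLine = 64;                            // assumed cache line size
        uintptr_t begin = reinterpret_cast<uintptr_t>(start) & ~(kLine - 1);
        uintptr_t end   = reinterpret_cast<uintptr_t>(start) + len;

        for (uintptr_t p = begin; p < end; p += kLine)
            __asm__ volatile("dc cvau, %0" :: "r"(p) : "memory");  // flush data cache
        __asm__ volatile("dsb ish" ::: "memory");                  // wait for the cleans

        for (uintptr_t p = begin; p < end; p += kLine)
            __asm__ volatile("ic ivau, %0" :: "r"(p) : "memory");  // invalidate instruction cache
        __asm__ volatile("dsb ish" ::: "memory");                  // wait for the invalidates
        __asm__ volatile("isb" ::: "memory");                      // resynchronise the pipeline
    }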

Stall all memory accesses of an application

I want to analyze the effect of using slower memories on applications and need a means of adding delay to all memory accesses. So far I have investigated Intel PIN and other software, but they seem to be overkill for what I need. Is there any tool to do this?
Is adding NOP operations in the binary code of the application right before each LOAD/STORE a feasible way?
Your best bet is to run your application under an x86 simulator such as MARSSx86 or Sniper. Using these simulators you can smoothly vary the modeled memory latency or any other parameters of the system1 and see how your application performance varies. This is a common approach in academia (often a generic machine will be modeled, rather than x86, which gives you access to more simulator implementations).
The primary disadvantage of using a simulator is that even good simulators are not completely accurate, and how accurate they are depends on the code in question. Certain types of variance from actual performance aren't particularly problematic when answering the question "how does performance vary with latency", but a simulator that doesn't model the memory access path well might produce an answer that is far from reality.
If you really can't use simulation, you could use a binary re-writing tool like PIN to instrument the memory access locations. nop would be a bad choice because it executes very quickly and because you cannot add a dependency between the memory load result and the nop instruction. That latter issue means it only adds additional "work" at the location of each load, but the work is independent of the load itself, so it doesn't simulate increased memory latency.
A better approach would be to follow each load with a long latency operation that uses the result of the load as input and output (but doesn't modify it). Maybe something like imul reg, reg, 1 if reg received the result of the load (but this only adds 3 cycles, so you might hunt for longer latency instructions if you want to add a lot of latency).
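As a purely illustrative sketch of that idea (GCC/Clang inline assembly on x86-64; the instruction mix and count would need tuning to the latency you want to model):

    // Follow each load with imuls that read and write the loaded value, so the
    // added latency is serialized behind the load rather than running independently.
    #include <cstdint>

    inline uint64_t load_with_extra_latency(const uint64_t* p) {
        uint64_t v = *p;
        __asm__ volatile(
            "imul $1, %0, %0\n\t"   // ~3 cycles each, each depending on the previous result
            "imul $1, %0, %0\n\t"
            "imul $1, %0, %0\n\t"
            : "+r"(v));
        return v;
    }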
1 At least within the set of things modeled by the simulator.

When to use cudaHostRegister() and cudaHostAlloc()? What is the meaning of "Pinned or page-locked" memory? Which are the equivalent in OpenCL?

I am new to these Nvidia APIs and some expressions are not clear to me. I was wondering if somebody could help me understand when and how to use these CUDA commands in a simple way. To be more precise:
While studying how it is possible to speed up some applications with parallel execution of a kernel (with CUDA, for example), at some point I ran into the problem of speeding up the host-device interaction.
I have some information, gathered from the web, but I am a little bit confused.
It is clear that you can go faster when it is possible to use cudaHostRegister() and/or cudaHostAlloc(). Here it is explained that
"you can use the cudaHostRegister() command to take some data (already allocated) and pin it avoiding extra copy to take into the GPU".
What is the meaning of "pin the memory"? Why is it so fast? How can I do it in practice? Then, in the same video linked above, they continue explaining that
"if you are transferring PINNED memory, you can use the asynchronous memory transfer, cudaMemcpyAsync(), which lets the CPU keep working during the memory transfer".
Are the PCIe transactions managed entirely by the CPU? Is there a bus manager that takes care of this?
Partial answers are also really appreciated; I can put the pieces of the puzzle together at the end.
Links to the equivalent APIs in OpenCL would also be appreciated.
What is the meaning of "pin the memory"?
It means making the memory page-locked, that is, telling the operating system's virtual memory manager that the memory pages must stay in physical RAM so that they can be directly accessed by the GPU across the PCI Express bus.
Why is it so fast? 
In one word, DMA. When the memory is page locked, the GPU DMA engine can directly run the transfer without requiring the host CPU, which reduces overall latency and decreases net transfer times.
Are the PCIe transaction managed entirely from the CPU?
No. See above.
Is there a manager of a bus that takes care of this?
No. The GPU manages the transfers. In this context there is no such thing as a bus master.
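As a rough sketch of how this is used in practice (my own illustration, not from the question's video; error checking omitted): pin an existing host buffer with cudaHostRegister(), then let cudaMemcpyAsync() run the DMA transfer while the CPU keeps working.

    #include <cuda_runtime.h>
    #include <vector>

    void pinned_async_copy(size_t n) {
        std::vector<float> host(n, 1.0f);                   // ordinary pageable allocation
        float* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));

        cudaHostRegister(host.data(), n * sizeof(float),    // pin (page-lock) it in place
                         cudaHostRegisterDefault);

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cudaMemcpyAsync(dev, host.data(), n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);    // DMA engine performs the copy
        // ... the CPU is free to do other work here ...
        cudaStreamSynchronize(stream);                      // wait before reusing/unpinning

        cudaHostUnregister(host.data());
        cudaStreamDestroy(stream);
        cudaFree(dev);
    }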
EDIT: It seems CUDA treats pinned and page-locked as the same, as per the Pinned Host Memory section in this blog post written by Mark Harris. This means my answer is moot and the best answer should be taken as is.
I bumped into this question while looking for something else. For all future users, I think #talonmies answers the question perfectly, but I'd like to point out a slight difference between locking and pinning pages: the former ensures that the memory is not pageable, but the kernel is free to move it around, whereas the latter ensures that it stays in memory (i.e. is non-pageable) and is also kept mapped at the same address.
Here's a reference to the same.

What is the best way to detect CPU cache misses when running an algorithm?

We have an algorithm which is performing poorly and we believe it's because of CPU cache misses. Nevertheless, we can't prove it because we don't have any way of detecting them. Is there any way to tell how many CPU cache misses an algorithm produces? We can port it to any language which could allow us to detect them.
Thanks in advance.
The easiest way to find this kind of issue is to use a profiler and collect cache-related performance counters.
I would recommend checking the following tools:
Intel® VTune™ Amplifier XE (supports: linux and windows; C/C++, Java, .NET) - http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/
OProfile - http://oprofile.sourceforge.net/
Is it possible to see the overall structure of your algorithm (if it is not too long)?
Intel CPUs keep performance counters that you can extract with some assembler instructions.
Could you (1) baseline cache misses on a quiescent system, (2) run the program and compare?
See Volume 3B of the Intel Instruction Set Reference, Section 18, page 15 (18-15), for the assembler you would have to write up.
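As an illustration of the counter-based idea without hand-written assembler (my own sketch, using Linux's perf_event_open(2) rather than reading the MSRs directly):

    // Count last-level-cache read misses around a region of interest.
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr{};
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_LL |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = static_cast<int>(perf_event_open(&attr, 0, -1, -1, 0));
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        // ... run the algorithm under test here ...
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t misses = 0;
        read(fd, &misses, sizeof(misses));
        std::printf("LLC read misses: %llu\n", (unsigned long long)misses);
        close(fd);
        return 0;
    }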

Profiling for analyzing the low level memory accesses of my program

I have to analyze the memory accesses of several programs. What I am looking for is a profiler that allows me to see which of my programs is more memory-intensive rather than compute-intensive. I am very interested in the number of accesses to the L1 data cache, the L2 cache, and main memory.
It needs to be for Linux and, if possible, usable from the command line only. The programming language is C++. If there is any problem with my question, for example if I have not understood what you mean or you need more data, please comment below.
Thank you.
Update with the solution
I have accepted Crashworks' answer because it is the only one that provided some of what I was looking for. But the question is still open; if you know a better solution, please answer.
It is not really possible (or meaningful) to capture every access to memory: fetching the next instruction is a memory access (the program resides in memory), and so is every read or write of a variable, so your program is accessing memory practically all the time.
What could be more interesting for you is to follow the memory usage of your program (both heap and stack). For this you can use the standard top command.
You could also monitor system calls (e.g. writing to disk or attaching/allocating a shared memory segment). For this you should use the strace command.
More complete control over everything would be debugging your program with the gdb debugger. It lets you control your program, for example by setting watchpoints on a variable so the program is interrupted whenever it is read or written (maybe this is what you were looking for). On the other hand, GDB can be tricky to learn, so DDD, a GTK graphical frontend, will help you get started with it.
Update: What you are looking for is really low-level memory access information that is not available at user level (that is the task of the operating system's kernel). I am not sure whether even L1 cache management is handled transparently by the CPU and hidden from the kernel.
What is clear is that you need to go down to kernel level, so KDB, explained here, or KDBG, explained here.
Update 2: It seems that the Linux kernel does handle the CPU cache, but only the L1 cache. The book Understanding the Linux Virtual Memory Manager explains how memory management in the Linux kernel works. This chapter explains some of the guts of L1 cache handling.
If you are running Intel hardware, then VTune for Linux is probably the best and most full-featured tool available to you.
Otherwise, you may be obliged to read the performance-counter MSRs directly, using the perfctr library. I haven't any experience with this on Linux myself, but I found a couple of papers that may help you (assuming you are on x86 -- if you're running PPC, please reply and I can provide more detailed answers):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/11169/35961/01704008.pdf?temp=x
http://www.cise.ufl.edu/~sb3/files/pmc.pdf
In general these tools can't tell you exactly which lines your cache misses occur on, because they work by polling a counter. What you will need to do is poll the "L1 cache miss" counter at the beginning and end of each function you're interested in to see how many misses occur inside that function, and of course you may do so hierarchically. This can be simplified by, e.g., writing a class that records the counter value on entering a scope and computes the delta on leaving it.
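A sketch of that scope-based idea (read_l1_miss_counter() is a hypothetical reader that would be backed by perfctr, rdpmc, or a similar mechanism):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical platform-specific counter reader; stubbed out so the sketch compiles.
    uint64_t read_l1_miss_counter() { return 0; }

    class ScopedMissCounter {
    public:
        explicit ScopedMissCounter(const char* label)
            : label_(label), start_(read_l1_miss_counter()) {}
        ~ScopedMissCounter() {
            std::printf("%s: %llu L1 misses\n", label_,
                        (unsigned long long)(read_l1_miss_counter() - start_));
        }
    private:
        const char* label_;
        uint64_t start_;
    };

    void hot_function() {
        ScopedMissCounter guard("hot_function");   // counts misses for this scope
        // ... body whose misses we want to attribute ...
    }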
VTune's instrumented mode does this for you automatically across the whole program. The equivalent AMD tool is CodeAnalyst. Valgrind claims to be an open-source cache profiler, but I've never used it myself.
Perhaps cachegrind (part of the valgrind suite) may be suitable.
Do you need something more than what the Unix command top will provide? It shows the CPU usage and memory usage of Linux programs in an easy-to-read format.
If you need something more specific, perhaps a profiler, the programming language (Java/C++/etc.) will help determine which profiler is best for your situation.
