Following up on an earlier question of mine. I'm writing, testing and benchmarking code on a MacBook Air with the M1 CPU running macOS 13.2.
I implemented the code generation approach I suggested in my question, got all tests passing, and compared it against a "conventional" (no code generation) approach to the same problem. As usual, I had to enable writes to the executable pages with pthread_jit_write_protect_np(0) before generating the code, write-protect the pages again with pthread_jit_write_protect_np(1), and then call sys_icache_invalidate() before running the generated code, because of cache coherency issues between the L1 I- and D-caches.
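Concretely, the sequence looks roughly like the sketch below (a simplified illustration rather than my actual code; emit_code(), the buffer size and the missing error handling are placeholders):

    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <libkern/OSCacheControl.h>   /* sys_icache_invalidate() */

    /* emit_code() stands in for my real generator; it writes machine code
       into 'out' and returns the number of bytes written. */
    extern size_t emit_code(uint8_t *out);

    typedef long (*jit_fn)(void);

    long generate_and_run(void)
    {
        static uint8_t *buf;
        const size_t buf_size = 1 << 20;              /* illustrative size */
        if (buf == NULL)                              /* error handling omitted */
            buf = mmap(NULL, buf_size, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANON | MAP_JIT, -1, 0);

        pthread_jit_write_protect_np(0);   /* make the JIT region writable */
        size_t len = emit_code(buf);       /* generate the code */
        pthread_jit_write_protect_np(1);   /* write-protect it again */

        sys_icache_invalidate(buf, len);   /* the call in question */

        return ((jit_fn)buf)();            /* run the generated code */
    }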
If I run the full code with the call to sys_icache_invalidate() commented out, it takes a few hundred nanoseconds, which is quite competitive with the conventional approach. This is after a few straightforward optimizations, and with more work I'm confident I could beat the conventional approach.
However, the code of course doesn't work with sys_icache_invalidate() commented out. Once I add it back and benchmark, it's adding almost 3 µs to the execution time. This makes the codegen approach hopelessly slower than the conventional approach.
Looking at Apple's code for sys_icache_invalidate(), it seems simple enough: for each cache line, with the starting address in a register xN, it runs ic ivau, xN; afterwards, it runs dsb ish and isb. It occurred to me that I could run ic ivau, xN on each cache line as it is generated in my codegen function, and then dsb ish and isb at the end. My thought was that each ic ivau, xN might be able to run in parallel with the rest of the codegen.
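The idea was something like this (a simplified sketch, not my exact code; the helpers are called from the emitter, and the 64-byte line size is an assumption here, the authoritative value comes from CTR_EL0):

    #include <stdint.h>

    /* Stride at which the emitter calls invalidate_line(); 64 bytes is an
       assumption - the authoritative line size comes from CTR_EL0. */
    #define ICACHE_LINE 64

    /* Called on each cache line's start address right after it is emitted. */
    static inline void invalidate_line(const void *line)
    {
        __asm__ volatile("ic ivau, %0" : : "r"(line) : "memory");
    }

    /* Called once, after the last instruction has been emitted. */
    static inline void finish_invalidate(void)
    {
        __asm__ volatile("dsb ish\n\tisb" : : : "memory");
    }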
Unfortunately, the code still failed, and moreover this only shaved a couple hundred ns off the execution time. I then tried adding a call to pthread_jit_write_protect_np(1) before each ic ivau, xN and a call to pthread_jit_write_protect_np(0) after it, which finally fixed the code. At that point it added a further 5 µs to the execution time, which renders the approach completely infeasible. Scratch that, I made a mistake: even with the calls to pthread_jit_write_protect_np(), I simply can't get it to work unless I call Apple's sys_icache_invalidate().
While I've made peace with the fact that I will need to abandon the codegen approach, I just wanted to be sure:
Does ic ivau, xN "block" the M1, i.e. prevents other instructions from executing in parallel, or perhaps flushes the pipeline?
Does ic ivau, xN really not work if the page is writeable? Or perhaps pthread_jit_write_protect_np() is doing some other black magic under the hood, unrelated to its main task of write-protecting the page, that I could also do without actually write-protecting the page? For reference, here is the source to Apple's pthread library, but it essentially calls os_thread_self_restrict_rwx_to_rx() or os_thread_self_restrict_rwx_to_rw(), which I assume are Apple-internal functions whose source I was unable to locate.
Is there some other approach to cache line invalidation to reduce this overhead?
Sorry, these sound like rather loosely formulated questions to me; I'm not sure I've understood them correctly, but I'll try to answer.
Does ic ivau, xN "block" the M1, i.e. prevents other instructions from
executing in parallel, or perhaps flushes the pipeline?
'Parallel' doesn't quite make sense to me in this context. Do you mean blocking other CPUs, or other instructions on the same CPU?
ARMv8 covers both in-order processors (e.g. Cortex-A53) and out-of-order processors (e.g. Cortex-A57).
In both cases, in-order and out-of-order CPUs have internal multi-stage pipelines. A pipeline, in this context, means that several instructions are in flight 'in parallel', i.e. at the same time. More precisely, execution of inst2 can start before inst1 has completed (though I'm not sure the question means 'parallel' in this sense).
There is also a difference between issuing an instruction for execution and completing it. For example, issuing ic ivau starts the cache invalidation, but it does not block the pipeline until it completes. Synchronizing on completion is done with the isb barrier instruction.
The "Ordering and completion of data and instruction cache instructions" section in the ARMv8 reference manual describes all cache-related ordering in detail.
So, considering all of the above, the answers to the original questions are:
prevents other instructions from executing in parallel
No
or perhaps flushes the pipeline
No
^^ Disclaimer: all of the above is true for a generic ARMv8 CPU (also sometimes referred to as 'arm64'). The M1 might have its own hardware bugs or implementation-specific behaviour that affects execution.
Does ic ivau, xN really not work if the page is writeable?
No, the ic instruction itself is not affected by memory page attributes.
Is there some other approach to cache line invalidation to reduce this overhead?
If the memory block to invalidate is big enough, it might be faster to invalidate the whole cache at once instead of looping over the memory region.
side note:
However, the code of course doesn't work with sys_icache_invalidate()
commented out. Once I add it back and benchmark, it's adding almost 3
µs to the execution time
And why does that surprise you? A cache is about repeated access. Those 3 µs come from having to access a higher (slower) memory level. After the first execution, once the instructions/data have been fetched back into the cache, performance will be back to normal.
Also, you mention invalidating the instruction cache, without mentioning flushing the data cache anywhere.
Any self-modifying-code or code-loading sequence is (see the sketch after this list):
write code to memory
flush data cache
invalidate instruction cache
jump to execute new code
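On the compiler side, steps 2 and 3 are what GCC/Clang's __builtin___clear_cache covers; a minimal sketch of the whole sequence (the function and buffer names are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*generated_fn)(void);

    void publish_and_run(uint8_t *code, size_t len)
    {
        /* 1. the code has already been written to 'code' (data side)     */
        /* 2+3. clean the D-cache to the point of unification and         */
        /*      invalidate the I-cache for the range (GCC/Clang builtin)  */
        __builtin___clear_cache((char *)code, (char *)code + len);
        /* 4. jump to the new code (assumed to end in a return)           */
        ((generated_fn)code)();
    }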
Related
Dynamically generating code is a pretty well-known technique, used for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly) or high-level, you can find libraries to help you out.
Note the distinction between self-modifying code and dynamically-generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code, which doesn't exist statically in the process binary on disk, is written to memory and then executed (but will not necessarily ever be modified). The distinction may matter below, if only because people treat self-modifying code as a smell, but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine, however, that your use case is generating code that will execute exactly once, and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy speed). In this case, the actual mechanics of writing the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be 10s of GBs or more. Clearly you don't want to write it all out to a giant buffer without any re-use: this would imply writing 10 GB to memory and perhaps also reading 10 GB (depending on how generation and execution are interleaved). Instead you'd probably want to use some reasonably sized buffer (say, one that fits in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code, and so on.
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way of generating a large amount of dynamically generated straight-line code on x86 and executing it?
I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.
The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).
Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.
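Here's a hedged, Linux-flavoured sketch of that idea (memfd_create, the chunk size and the fixed base address are illustrative assumptions; on Windows you'd map the same section object at successive addresses instead):

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK (256 * 1024)    /* a fraction of L2, as suggested above */

    typedef void (*chunk_fn)(void);

    /* Placeholder: writes the next chunk of straight-line code (ending in a
       ret) into buf and returns its length, or 0 when nothing is left. */
    extern size_t generate_chunk(uint8_t *buf, size_t cap);

    static void serialize(void)
    {
        /* CPUID is a serializing instruction (Intel SDM 8.1.3 / 11.6) */
        uint32_t a = 0, b, c, d;
        __asm__ volatile("cpuid" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
    }

    void run_all(void)
    {
        int fd = memfd_create("jit", 0);   /* one set of physical pages */
        ftruncate(fd, CHUNK);              /* error handling omitted */

        uint8_t *base = (uint8_t *)0x200000000000ull;  /* illustrative base */
        for (size_t i = 0; ; i++) {
            /* map the SAME physical pages at the next higher virtual address */
            uint8_t *p = mmap(base + i * CHUNK, CHUNK,
                              PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_SHARED | MAP_FIXED, fd, 0);
            size_t n = generate_chunk(p, CHUNK);
            if (n == 0) {
                munmap(p, CHUNK);
                break;
            }
            serialize();                   /* before executing freshly written code */
            ((chunk_fn)p)();
            munmap(p, CHUNK);              /* the data stays in the physically tagged caches */
        }
        close(fd);
    }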
I'm not sure you'll stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops, since there's significant overhead in continually thrashing the instruction cache for so long, and the overhead of conditional jumps has gotten much better over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still always avoid call instructions if you need to for simplicity, even for tree recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack"), in the worst case.
That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs so much code. #2 is quite uncommon in practice, though possible in a computer theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)
Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!
Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing than, say, a Branch.
However, from an outside perspective, if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question is what it means for a program to "stop". Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop that does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's utilization.
So while in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process that can put the CPU to sleep, throttle down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because certain circuitry is idled and the cycles never reach it. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute instructions before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don’t want to include instructions whose execution never actually completes, for example because the processor guessed wrong and had to flush those instructions after a branch mispredict. So a better metric is counting those instructions that actually complete. Instructions that complete are termed “retired”.
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
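As a rough illustration of the same ratio using Linux's generic counters instead of the VTune event names (a sketch assuming Linux and perf_event_open; the loop is an arbitrary stand-in workload):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof attr;
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;          /* count user-mode work only */
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);    /* ~unhalted cycles  */
        int instrs = open_counter(PERF_COUNT_HW_INSTRUCTIONS);  /* ~retired instrs   */

        ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(instrs, PERF_EVENT_IOC_ENABLE, 0);

        volatile double x = 0.0;          /* arbitrary workload under test */
        for (int i = 0; i < 10000000; i++)
            x += i * 0.5;

        ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(instrs, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t c = 0, n = 1;
        read(cycles, &c, sizeof c);
        read(instrs, &n, sizeof n);
        printf("cycles=%llu retired=%llu CPI=%.2f\n",
               (unsigned long long)c, (unsigned long long)n,
               (double)c / (double)n);
        return 0;
    }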
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don’t directly contribute to computing the solution (such as control instructions), is very subjective and difficult to gather statistics on. (There are some you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which is the average number of vector elements active per SIMD instruction (SSE or AVX) executed. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your friendly local Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I’m not suggesting you need to get VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program’s performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling back into the OS once they've finished processing. Most OSes will also preempt a process if it hogs the CPU for too long and starves the others.
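A minimal sketch of that shape, using poll() on standard input (the handler is a placeholder):

    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Placeholder handler: consume whatever arrived on the descriptor. */
    static void handle_input(int fd)
    {
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            printf("processed %zd bytes\n", n);
    }

    int main(void)
    {
        struct pollfd fds[1] = { { .fd = 0, .events = POLLIN } };   /* stdin */

        for (;;) {
            /* yield to the OS; the process sleeps until there is work to do */
            if (poll(fds, 1, -1) > 0 && (fds[0].revents & POLLIN))
                handle_input(fds[0].fd);
        }
    }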
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not running it anymore).
I've written a small code coverage utility to log which basic blocks are hit in an x86 executable. It runs without source code or debugging symbols for the target, and just takes a list of basic blocks to monitor.
However, it is becoming the bottleneck in my application, which involves repeated coverage snapshots of a single executable image.
It has gone through a couple of phases as I've tried to speed it up. I started off just placing an INT3 at the start of each basic block, attaching as a debugger, and logging hits. Then I tried to improve performance by patching a counter into any block bigger than 5 bytes (the size of a JMP REL32): I wrote a small stub ('mov [blah], 1 / jmp backToTheBasicBlockWeCameFrom') into the process's memory space and patched a JMP to it. This greatly speeds things up, since there's no exception and no debugger break, but I'd like to go faster still.
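To illustrate the mechanics (a simplified 32-bit sketch, not the tool's actual code; the stub layout and names are made up, and a real tool also has to relocate the displaced instructions correctly):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Write a 5-byte "jmp rel32" at 'site' that lands on 'target'.
       rel32 is relative to the end of the 5-byte jump instruction. */
    static void write_jmp_rel32(uint8_t *site, const uint8_t *target)
    {
        int32_t rel = (int32_t)(target - (site + 5));
        site[0] = 0xE9;                       /* JMP rel32 */
        memcpy(site + 1, &rel, sizeof rel);
    }

    /* Build the stub: mov byte ptr [hit_flag], 1 ; <displaced bytes> ; jmp back.
       In 32-bit mode, C6 05 <abs32> 01 encodes the byte store. */
    static size_t write_stub(uint8_t *stub, uint8_t *hit_flag,
                             const uint8_t *displaced, size_t displaced_len,
                             const uint8_t *return_to)
    {
        size_t n = 0;
        stub[n++] = 0xC6;                     /* MOV r/m8, imm8 */
        stub[n++] = 0x05;                     /* ModRM: absolute disp32 */
        memcpy(stub + n, &hit_flag, 4);       /* 32-bit pointers assumed */
        n += 4;
        stub[n++] = 0x01;                     /* the flag value */
        memcpy(stub + n, displaced, displaced_len);   /* bytes the JMP overwrote */
        n += displaced_len;
        write_jmp_rel32(stub + n, return_to);
        return n + 5;
    }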
I'm thinking of one of the following:
1) Pre-instrument the target binary with my patched counters (at the moment I do this at runtime). I could make a new section in the PE, put my counters in it, patch in all the hooks I need, then just read the data out of that section with my debugger after each execution. That'll gain me some speed (about 16% by my estimate), but there are still those pesky INT3s which I need in the smaller blocks, and they are really going to cripple performance.
2) Instrument the binary to include its own UnhandledExceptionFilter and handle its own int3's in conjunction with the above. This would mean there's no process switch from the debuggee to my coverage tool on every int3, but there'd still be the breakpoint exception raised and the subsequent kernel transition - am I right in thinking this wouldn't actually gain me much performance?
3) Try to do something clever using Intel's hardware branch profiling instructions. This sounds pretty awesome but I'm not clear on how I'd go about it - is it even possible in a windows usermode application? I might go as far as to write a kernel-mode driver if it's fairly straightforward but I'm not a kernel coder (I dabble a bit) and would probably cause myself lots of headaches. Are there any other projects using this approach? I see the Linux kernel has it to monitor the kernel itself, which makes me think that monitoring a specific usermode application will be difficult.
4) Use an off-the-shelf application. It'd need to work without any source or debugging symbols, be scriptable (so I can run in batches), and preferably be free (I'm pretty stingy). For-pay tools aren't off the table, however (if I can spend less on a tool and increase perf enough to avoid buying new hardware, that'd be good justification).
5) Something else. I'm running in VMWare on Windows XP, on fairly old hardware (Pentium 4-ish) - is there anything I've missed, or any leads I should read up on? Can I get my JMP REL32 down to less than 5 bytes (and catch smaller blocks without the need for an int3)?
Thanks.
If you insist on instrumenting binaries, pretty much your fastest coverage is the 5-byte jump-out jump-back trick. (You're covering standard ground for binary instrumentation tools.)
The INT 3 solution will always involve a trap. Yes, you could handle the trap in your own process instead of in a separate debugger process, and that would speed it up, but it will never be close to competitive with the jump-out/jump-back patch. You may need it as a backup anyway, if the function you are instrumenting happens to be shorter than 5 bytes (e.g., "inc eax/ret"), because then you don't have 5 bytes you can patch.
What you might do to optimize things a little is examine the code being patched. Without such examination, original code like this:
instrn 1
instrn 2
instrn N
next:
is patched, in general, to look like this:
jmp patch
xxx
next:
and generally needs a patch like:
patch: pushf
inc count
popf
instrn1
instrn2
instrnN
jmp back
If all you want is coverage, you don't need to increment, and that means you don't need to save the flags:
patch: mov byte ptr covered,1
instrn1
instrn2
instrnN
jmp back
You should use a byte rather than a word to keep the patch size down. You should align the patch on a cache line so the processor doesn't have to fetch two cache lines to execute the patch.
If you insist on counting, you can analyze instrn1/2/N to see if they care about the flags that the "inc" clobbers, and only pushf/popf if needed; or you can insert the increment between two instructions in the patch that don't care. You must be analyzing these to some extent anyway, to handle complications such as one of the instrns being a ret; and then you can generate a better patch (e.g., no "jmp back").
You may find that using add count,1 is faster than inc count, because add avoids partial condition-code updates and the consequent pipeline interlocks. This will affect your cc-impact analysis a bit, since inc doesn't set the carry bit and add does.
Another possibility is PC sampling. Don't instrument the code at all; just interrupt the thread periodically and take a sample PC value. If you know where the basic blocks are, a PC sample anywhere in the basic block is evidence the entire block got executed. This won't necessarily give precise coverage data (you may miss critical PC values), but the overhead is pretty low.
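A POSIX-flavoured sketch of the sampling side, for x86-64 Linux (on Windows you would periodically SuspendThread/GetThreadContext instead; the sampling rate and buffer size are illustrative):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <ucontext.h>

    #define MAX_SAMPLES 65536
    static volatile uintptr_t samples[MAX_SAMPLES];
    static volatile size_t nsamples;

    static void on_prof(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si;
        ucontext_t *uc = ctx;
        if (nsamples < MAX_SAMPLES)               /* record the interrupted PC */
            samples[nsamples++] = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_prof;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigaction(SIGPROF, &sa, NULL);

        struct itimerval it = { {0, 1000}, {0, 1000} };   /* ~1 kHz of CPU time */
        setitimer(ITIMER_PROF, &it, NULL);

        volatile double x = 0;                    /* the program under test */
        for (long i = 0; i < 200000000; i++)
            x += i;

        /* each sampled PC is then attributed to the basic block containing it */
        printf("collected %zu samples (x=%g)\n", nsamples, x);
        return 0;
    }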
If you are willing to patch source code, you can do better: just insert "covered[i]=true;" at the beginning of the ith basic block, and let the compiler take care of all the various optimizations. No patches needed. The really cool part of this is that if you have basic blocks inside nested loops and you insert source probes like this, the compiler will notice that the probe assignments are idempotent with respect to the loop and lift the probe out of the loop. Voilà, zero probe overhead inside the loop. What more could you want?
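For example (a sketch; the array and block index are illustrative):

    /* covered[] is the illustrative probe array; 17 is the block's index. */
    extern _Bool covered[];

    long sum(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++) {
            covered[17] = 1;   /* probe for the loop-body basic block; the
                                  optimizer can hoist this idempotent store */
            s += a[i];
        }
        return s;
    }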
I'm wondering whether the L2 cache is flushed between multiple kernel invocations. For example, I have a kernel that does some preprocessing on data and a second one that uses it. Is it possible to achieve greater performance if the data size is less than 768 KB? I see no reason for the NVIDIA guys to implement it otherwise, but maybe I'm wrong. Does anybody have experience with this?
Assuming you are talking about L2 data cache in Fermi.
I think the caches are flushed after each kernel invocation. In my experience, running two consecutive launches of the same kernel with a lot of memory accesses (and L2 cache misses) doesn't make any substantial difference to the L1/L2 cache statistics.
For your problem, I think, depending on the data dependencies, it may be possible to put the two stages into one kernel (with some synchronization) so the second part of the kernel can reuse the data processed by the first part.
Here is another trick: you know the GPU has, say, N SMs; you can perform the first part using the first N * M1 blocks and the next N * M2 blocks for the second part. Make sure all the blocks in the first part finish at (almost) the same time using synchronization. In my experience, the block scheduling order is really deterministic.
Hope it helps.
In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (especially interrupts) needs to be constant.
How do you ensure that time variability is not introduced into the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior because of the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced by other aspects of the processor architecture is nowhere near the magnitude of the cache's effect.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software, code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps it could be done only in certain places, so speed may be better than with the caches totally disabled.
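A sketch of the junk-fill idea (the sizes are assumptions: make the buffer comfortably larger than the largest cache on your part, and note that adaptive replacement policies make this a strong hint rather than a guarantee):

    /* Illustrative sizes: stride by the cache line, cover well more than the
       largest cache level on the part (both numbers are assumptions). */
    #define CACHE_LINE   64u
    #define EVICT_BYTES  (8u * 1024u * 1024u)

    static volatile unsigned char evict_buf[EVICT_BYTES];

    /* Touch every line of a large buffer so that whatever the critical code
       reads next is (almost certainly) a miss - predictable, if not fast. */
    void fill_cache_with_junk(void)
    {
        for (unsigned i = 0; i < EVICT_BYTES; i += CACHE_LINE)
            evict_buf[i]++;
    }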
Finally, if speed does matter - carefully design the software and data as if in the old days of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in the L1 cache. I'm always amazed at how on-chip caches these days are bigger than all the RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to the x86 processor family, which is not built with real-time systems in mind, so there is no real guarantee of constant-time execution (the CPU may reorder micro-instructions; then there is branch prediction and the instruction prefetch queue, which is flushed each time the CPU mispredicts a conditional jump...).
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
Make all the function calls in the critical code 'inline', and minimize the number of variables you have, so that you can give them the 'register' type.
This should improve the running time of your program. (You'll probably have to compile it in a special way, since compilers these days tend to disregard your 'register' hints.)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instructions that could make your execution time vary.
If an interrupt happens during your code's execution, it WILL take longer. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.