Dynamically executing large volumes of execute-once, straight-line x86 code - performance

Dynamically generating code is a pretty well-known technique, for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly) or high-level, you can find libraries to help you out.
Note the distinction between self-modifying code and dynamically generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code that doesn't exist statically in the process binary on disk is written to memory and then executed (but will not necessarily ever be modified). The distinction might matter below, if only because people treat self-modifying code as a smell but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine, however, that your use case is generating code that will execute exactly once, and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy speed). In this case, the actual mechanics of writing the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be 10s of GBs or more. Clearly you don't want to just write it all out to a giant buffer without any re-use: this would imply writing 10GB to memory and perhaps also reading 10GB (depending on how generation and execution were interleaved). Instead you'd probably want to use some reasonably sized buffer (say, small enough to fit in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code, and so on.
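For concreteness, a minimal sketch of what such a reuse loop might look like on Linux/x86-64 (this is an illustration, not part of the question; emit_chunk() and have_more_code() are hypothetical stand-ins for the code generator):

// Sketch only: one RWX buffer, refilled and re-executed chunk by chunk.
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>

extern size_t emit_chunk(uint8_t* buf, size_t cap);  // writes code, returns bytes used
extern bool have_more_code();

int main() {
    const size_t kBuf = 128 * 1024;   // roughly L2-sized, tune to taste
    uint8_t* buf = static_cast<uint8_t*>(
        mmap(nullptr, kBuf, PROT_READ | PROT_WRITE | PROT_EXEC,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    while (have_more_code()) {
        size_t n = emit_chunk(buf, kBuf - 1);
        buf[n] = 0xC3;                                // ret, so the chunk returns to the caller
        reinterpret_cast<void (*)()>(buf)();          // execute the chunk exactly once
    }
    return 0;
}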
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way to generate and execute a large amount of dynamically generated straight-line code on x86?

I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.
The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).
Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.
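To make that concrete, here is a rough sketch of the suggestion on Linux/x86-64 (my illustration, not part of the original answer): emit_chunk() and have_more_code() are hypothetical stand-ins for the code generator, memfd_create provides pages that can be mapped again and again, and __cpuid from <cpuid.h> supplies the serializing instruction.

// Sketch only: same physical pages, a fresh virtual address per chunk, CPUID to serialize.
#include <sys/mman.h>   // memfd_create, mmap (glibc; g++ defines _GNU_SOURCE by default)
#include <unistd.h>     // ftruncate
#include <cpuid.h>      // __cpuid
#include <stdint.h>
#include <stddef.h>

extern size_t emit_chunk(uint8_t* buf, size_t cap);  // hypothetical generator
extern bool have_more_code();

int main() {
    const size_t kChunk = 128 * 1024;                 // a fraction of L2
    int fd = memfd_create("gen-code", 0);
    ftruncate(fd, kChunk);

    // Reserve a large span of virtual address space up front; each chunk gets its own
    // window into it, and every window is backed by the same physical pages.
    uint8_t* base = static_cast<uint8_t*>(
        mmap(nullptr, size_t(1) << 36, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0));

    for (size_t i = 0; have_more_code(); ++i) {
        uint8_t* window = static_cast<uint8_t*>(
            mmap(base + i * kChunk, kChunk, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_SHARED | MAP_FIXED, fd, 0));     // same pages, never-executed address

        size_t n = emit_chunk(window, kChunk - 1);
        window[n] = 0xC3;                             // ret, so the chunk returns

        unsigned a, b, c, d;
        __cpuid(0, a, b, c, d);                       // serialize before executing the new code
        reinterpret_cast<void (*)()>(window)();       // run this chunk exactly once
    }
    return 0;
}

Whether the aliased mapping actually sidesteps the self-modifying-code machinery is microarchitecture-dependent, so it is worth measuring against a plain single-buffer version.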

I'm not sure you stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops, since there's significant overhead in continually thrashing the instruction cache for so long, and the cost of conditional jumps has come down a lot over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still avoid call instructions if you need to for simplicity, even for tree-recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack") in the worst case.
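For example, here is a minimal sketch of the "a stack" idea applied to a tree-recursive sum (Node and sum_tree are illustrative names, not something from the question):

// Replace call/ret recursion with an explicit stack of pending nodes.
#include <vector>

struct Node { int value; Node* left; Node* right; };

long sum_tree(Node* root) {
    long total = 0;
    std::vector<Node*> stack;              // "a stack", instead of call frames on "the stack"
    if (root) stack.push_back(root);
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        total += n->value;
        if (n->right) stack.push_back(n->right);
        if (n->left)  stack.push_back(n->left);
    }
    return total;
}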
That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs that much code. #2 is quite uncommon in practice, though possible in a theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)
Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!

Related

Dhrystone benchmark on 32-bit microcontroller

Currently I am doing a performance comparison of two 32-bit microcontrollers. I used the Dhrystone benchmark on both microcontrollers. One microcontroller has a 4KB I-cache while the second controller has an 8KB I-cache. Both microcontrollers use the same toolchain. As much as possible I kept the same static and run-time settings on both microcontrollers. But the microcontroller with the 4KB cache is faster than the one with the 8KB cache. Both microcontrollers are from the same vendor and based on the same CPU.
Could anyone explain why the microcontroller with the 4KB cache is faster than the other?
Benchmarks are in general useless. Dhrystone is one of the oldest ones, and it may have had a little bit of value back then, before pipelines and aggressive compiler optimization. I think I gave up on Dhrystone roughly 15 years ago, about the time I started using it.
It is trivial to demonstrate that this code
.globl ASMDELAY
ASMDELAY:
    subs r0,r0,#1   @ decrement and set flags (plain sub would not set them)
    bne ASMDELAY    @ loop until r0 reaches zero
    bx lr
which is essentially a two-instruction loop, can vary WIDELY in execution time on the same chip, if you understand how modern processors work. The simple trick to seeing this is to turn off the caches and prefetchers and such, place this code at offset 0x0000, and call it with some value. Then place it at 0x0004 and repeat, then at 0x0008, and repeat; keep doing this. You can put one, two, etc. nops between the subtract and branch. Try it at various offsets.
THEN, turn the caches on and off for each of these alignments, and if you have a prefetcher outside the processor for the flash, turn that on and off.
THEN, vary your clocks, especially for those MCUs where you have to adjust the wait states based on clock rate.
On a SINGLE MCU, you are going to see those essentially two instructions vary in execution time by a very large amount. Let's just say 20 times longer in some cases than in others.
Now take a small program, or a small fraction of the Dhrystone program, and compile that for your MCU. How many instructions do you see? Make minor to major optimization and other variations on the compile command line; how much does the code change? If two instructions can vary by, let's call it, 20 times in execution time, how bad can it get for 200 instructions or 2000 instructions? It can get pretty bad.
If you take the Dhrystone program you have right now with the compiler options you have right now, go into your bootstrap and add one nop (causing the entire binary to shift by one instruction in flash), then run again. Add two, three, four. You are still not comparing different MCUs, just running your benchmark on one system.
Run with and without the D-cache, and with and without the I-cache, if you have each of those. Turn the flash prefetch on and off if you have one, and if you have a write buffer you can turn on and off, try that too. Still remaining on the same compiler, same options, same MCU.
Take the different functions in the Dhrystone source code and re-arrange them in the source. Instead of Proc_1, Proc_2, Proc_3 make it Proc_1, Proc_3, Proc_2. Do all of the above again. Re-arrange again, repeat.
Before leaving this MCU, you should now see that the same, completely unmodified source code (other than perhaps re-arranging functions) can and will have vastly different execution times.
If you then start changing compiler options, or keep the same source and change compilers, you will see even more of a difference in execution time.
How is it possible that Dhrystone benchmarks, today or way back, had a single result for each platform? Simple: that result was just one point in a wide range, not really representing the platform.
So then, if you try to compare two different hardware platforms, be it the same ARM core inside a different MCU from the same vendor or from different vendors: even assuming (which is not a safe assumption) the core is built from the same source with the same compile/build options, and even assuming the same Verilog compile and synthesis were used, that same core can change based on ARM-provided options. And however the vendor, be it the same vendor in two instances or two different vendors, wraps that same core, you will see variations. Then take a completely different core, be it another ARM or a MIPS, etc. How could you gain any value comparing those using a program like this that itself varies widely on each platform?
You can't. What you can do is use benchmarks to give the illusion of one thing being better than another: one computer is faster than another, one compiler is faster than another, in order to sell computers or compilers. Sprint's coverage is within one percent of Verizon's... does that tell us anything useful? Nope.
If you were to eliminate the compiler from the equation, and if these are truly the same "CPU", same rev of source from ARM, built the same way, then they should fetch the same. But the size of the cache is part of that, so it may already be a different CPU implementation, as the width or depth of the cache can affect things. In software it is like needing a 32-bit pointer rather than a 16-bit pointer (really 17 bits instead of 16, but you can't generally have a 17-bit type in software; in logic you can).
Anyway, if you compile the code under test one time for an address space that is common to both platforms, and use exactly that same binary for that space (you can attach different bootstrap code as needed), note that the C library calls, strcpy, etc. also have to be the same in the same space between platforms to keep the compiler and alignment from messing you up. This may or may not level the playing field.
If you want to believe these are the same CPU, then turn the caches off and eliminate the compiler variations by doing the above. See if they execute the same. Copy the program to RAM and run it in RAM to eliminate the flash issues. I assume you have them both clocked the same, with the same wait states in the flash?
If they are the same CPU, and the chip vendor has made the memory system of these two chips take the same number of clocks, say, for RAM accesses, then you should be able to get the same time by eliminating the optimizations (caching, flash prefetching, alignment).
What you are probably seeing is some form of alignment effect between how the code lies in memory from the compiler and the cache lines, or it could be much simpler: it could be just the differences in the caches, how the hits and misses work, and the 4KB cache is just luckier than the 8KB one for this particular program, compiled a certain way, aligned in memory a certain way, etc.
With the simple two-instruction loop above it is easy to see some of the reasons why performance varies on the same system. If your modern CPU fetches 8 instructions at a time and your loop gets too close to the tail end of that fetch, the prefetcher may think it needs to fetch another 8 beyond that, costing you those clock cycles. Certainly, if you exactly straddle two "fetch lines", as I call them, with those two instructions, it is going to cost you a lot more cycles per loop even with a cache. The same problem happens when those two instructions approach a cache line boundary (as you vary their alignment per test): eventually it takes two cache line reads instead of one to fetch those two instructions, at least the first time through, when there is an extra cache line read. The extra clocks for that first time through are something you can see using a simple benchmark like that while playing with alignment.
Michael Abrash, Zen of Assembly Language: there is an epub/etc. of this book you can build from GitHub. The 8088 was obsolete when this book came out; if all you see is 8088 stuff, then you are completely missing the point. It applies to the most modern processors today: how to view the problem, how to test, how to time the test, how to interpret the results. All the stuff I have mentioned so far, and all the things I know about this that I have not mentioned, came from that book's knowledge applied for however many decades I have been doing this.
So again, if you have truly eliminated the compiler, alignment, the CPU, the memory system tied to that CPU, etc., and only the size of the cache varies, then it is probably related to how the cache line fetches hit and miss differently based on the alignment of the code relative to the cache lines of the two caches. One is hitting more and missing less, and/or evicting better, for this particular binary. You can try rearranging the functions, you can add nops, or, if you can't get at the bootstrap, add whole functions or more code (another printf, etc.) at a lower address in the binary, causing the linker to slide the code under test through different addresses and changing how the program lines up with the cache lines. Since the functions in the code under test are big (more than a few instructions each), you would have to start modifying the program in order to get a finer-grained adjustment of the binary relative to cache lines.
You most definitely should see execution time differences on the two platforms if you adjust alignment and/or wholesale re-arrange the binary by re-arranging functions.
Bottom line: benchmarks don't really tell you much; the results have more of a negative stink to them than a positive joy. Without a re-write, a particular benchmark or application may just do better on one platform (be it one where everything is the same but the size of the cache, or two completely different architectures), even when you try to vary the results through alignment and turning prefetching, write buffering, branch prediction, etc. on and off. Pulling out all the tricks you can come up with, one may vary from x to y, the other from n to m, and maybe there is some overlap in the ranges. For these two platforms that you say are similar except for cache size, I would hope you could find a combination where sometimes A is faster than B and at least one combination where B is faster than A, with the same features turned on and off for both in that comparison. If/when you switch to some other application/benchmark, this all starts over again; there is no reason to assume Dhrystone results predict any other code under test. The only program that matters, especially on an MCU, is the final build of your application. Just remember that changing a single line of code, or even adding a single nop in the bootstrap, can have dramatic results in performance, sometimes several TIMES slower or faster from a single nop.

Optimizing code where "problems" are in libc

I have some C++ code and I am playing with Intel's VTune. I ran the General Exploration analysis and have no idea how to interpret the results. It flags the number of Retire Stalls as an issue.
On its own, that is enough to confuse me, because I'm probably in over my head. But the functions that it lists as having an abnormal amount of retire stalls are _int_malloc and malloc_consolidate, both in libc. So it's not something I can find by looking at my own code, and it's not something that I can really begin to change.
Is there a way to use that information to improve my own code? Or does it really just mean that I should find ways to allocate less or less often?
(Note: the specific code at hand isn't the issue; I'm looking for strategies to interpret the data and improve things when the hotspots, or the stalls, or whatever the "problem" may be, occur in code outside my control.)
Is there a way to use that information to improve my own code? Or does it really just mean that I should find ways to allocate less or less often?
Yes, it pretty much sounds like you should make changes in your code so that malloc gets called less often.
Is the heap allocation really necessary?
Is there a buffer that you can reuse?
Is using a memory pool an option?
Can you do stack allocation instead? For example, if those are arrays, do you happen to know the maximum size of those arrays at compile time?
Depending on your application, memory allocation can be expensive. I once made a program 20x faster by removing memory allocations from a tight loop. The application wasn't that slow on Linux but it was a disaster on Windows. After my changes, it was also OK on Windows.
Find out which lines of code are calling malloc the most.
Avoid repeated allocation and deallocation.
Potentially use thread-local storage together with the previous point.
Write your own allocator which only returns memory when you tell it to and otherwise keeps freed memory blocks in a list (use list::splice to move list elements from one list into another); see the sketch after this list.
Use allocators from Boost, which potentially do the same as the previous point.
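A rough sketch of the "keep freed blocks in a list" idea (Block and its 4KB payload are illustrative assumptions): the pool moves nodes between a used list and a free list with list::splice, so no allocation happens in the steady state, and memory is only released when you ask for it.

// Sketch of a splice-based block pool; not a drop-in replacement for malloc.
#include <list>

struct Block { char data[4096]; };

class BlockPool {
    std::list<Block> free_, used_;
public:
    std::list<Block>::iterator acquire() {
        if (free_.empty()) free_.emplace_back();             // allocate only when the free list is empty
        used_.splice(used_.begin(), free_, free_.begin());   // relink the node, no malloc/free
        return used_.begin();
    }
    void release(std::list<Block>::iterator it) {
        free_.splice(free_.begin(), used_, it);              // hand the node back to the free list
    }
    void trim() { free_.clear(); }                           // return memory only when told to
};

Since splice just relinks list nodes, acquire and release never touch the heap once the free list is warm.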

When should we use a scatter/gather(vectored) IO?

The Windows file system supports scatter/gather I/O. (Of course, other platforms do too.)
But I don't know when to use this I/O mechanism.
Could you explain a proper use case?
And what benefit can we get from using this I/O mechanism? (Just saving a few small I/O requests?)
You use Scatter/Gather IO when you are doing lots of random (i.e. non-sequential) reads / writes, and you want to save on context switches / syscalls - Scatter/Gather is a form of batching in this sense. However, unless you've got a very fast disk (or more likely, a large array of disks), the syscall cost is negligible.
If you were writing a Database server, you might care about this, but anything less than a big-iron machine handling thousands or millions of requests a second won't see any benefit.
Paul -- one extra note: one additional advantage is that you hand multiple requests to the disk driver at the same time. The driver then can sort the requests and issue them in the optimal order. While syscall time is small, seek time (many milliseconds) can be punitive (that's less than 1000 I/O's/sec).
Chris's comment about demonstrating the efficiency is pragmatic. Mother nature never lies. Well, almost never.
I would imagine that you would use scatter/gather IO when you (a) suspected your application had a performance bottleneck, and (b) built a performance analysis framework that could show a significant improvement from using it.
Unless you can show a provable improvement, the additional code complexity is just a risk, and there's no magic recipe that says that, when some condition is met, an application will automatically benefit in a significant way from some programming cleverness.
Or - to put it another way - don't base major architectural decisions on the statements of 'some guy on an internet forum'. Create a test, and find out.
In POSIX, readv and writev read into or write from discontiguous memory, but to read and write discontiguous file ranges to and from discontiguous memory in one go you want readx and writex, which were among the proposed POSIX additions.
Doing a readx is faster than doing a lot of reads, as it's only one system call and it lets the disk scheduler have the most I/Os to reorder. I remember someone saying that the ext2/3/.. fsck program wanted this, as it knows which ranges it wants.
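For illustration, a minimal readv sketch (the file name and buffer sizes are just examples): two discontiguous buffers are filled with a single system call.

// One readv() call fills two separate buffers from consecutive file bytes.
#include <sys/uio.h>   // readv, struct iovec
#include <fcntl.h>     // open
#include <unistd.h>    // close
#include <cstdio>

int main() {
    char header[16];
    char body[4096];

    int fd = open("data.bin", O_RDONLY);   // example file name
    if (fd < 0) return 1;

    iovec iov[2] = {
        { header, sizeof(header) },        // the first 16 bytes land here
        { body,   sizeof(body)   },        // the following bytes land here
    };
    ssize_t n = readv(fd, iov, 2);         // one syscall instead of two read() calls
    std::printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}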

Does JIT performance suffer due to writeable pages?

In the book Linkers and Loaders, it's mentioned that one of the reasons for executables to have a separate code section is that the code section can be kept in read-only pages, which results in a performance increase. Is this still true for a modern OS? Seeing as just-in-time compilers generate code on the fly, I assume they need writable pages. Does this mean that JIT-generated code will always suffer a performance hit in comparison? If so, how significant a hit is it?
Effects of memory management aside (which are explained in other answers), with read-only code the CPU doesn't need to continuously check whether the current stream of instructions has been modified, throw away intermediate results in the pipeline, and read the new code. In the case of JIT compilation, this scenario may occur often, depending on the design of the compiler, the depth of the CPU pipeline, the size of the code cache on the CPU, and the number of other CPUs which may modify that code. It isn't normally allowed to occur in well-designed modern systems, where code is generated to a writable page and marked as executable and read-only afterwards. This is not unique to JIT, of course; it can happen in all kinds of self-modifying code.
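To illustrate the "generate to a writable page, then mark it executable and read-only" pattern, here is a small POSIX sketch; the emitted bytes are just a stub that returns 42 on x86-64.

// Write the code while the page is RW, then flip it to R+X before running it.
#include <sys/mman.h>
#include <cstring>
#include <cstdio>
#include <cstddef>

int main() {
    const std::size_t kPage = 4096;
    void* mem = mmap(nullptr, kPage, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    const unsigned char stub[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };  // mov eax, 42 ; ret
    std::memcpy(mem, stub, sizeof(stub));

    mprotect(mem, kPage, PROT_READ | PROT_EXEC);     // no longer writable
    int result = reinterpret_cast<int (*)()>(mem)();
    std::printf("%d\n", result);                     // prints 42
    return 0;
}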
Yes, it should suffer some sort of hit because the code in memory is not backed directly by the executable, so it would have to be paged out instead of just dropped. Having said that, various forms of linking can also dirty up regular code pages so that they no longer match the disk image, with the same consequences, so I'm not sure that this is a big deal.
The performance increase is not because of whether the pages are read only or not. The advantage is that read only pages can be shared between processes, so you use less memory which means less swapping (both to L1/L2/L3 caches as well as to disk in extreme cases).
JITs try to mitigate this by not needlessly JITting, but only JITting the hot functions. This results in only a modest increase in memory, since the number of hot functions is relatively small.
A JIT compiler could also be smart and cache the result of the JITting so it could (theoretically) be shared. But I don't know whether this is done in practice.

How can you ensure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you ensure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced by other aspects of the processor architecture is nowhere near the magnitude of the cache's.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, in certain places the software could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps it could be done only in certain places, so speed may be better than with caches totally disabled.
Finally, if speed does matter - carefully design the software and data as if in the old day of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
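A rough sketch of the "fill the cache with junk" idea, assuming the 8MB buffer used here is at least as large as the last-level cache on the target:

// Evict previously cached lines by streaming through a buffer larger than the LLC.
#include <cstddef>

static unsigned char junk[8 * 1024 * 1024];   // assumed >= last-level cache size

void thrash_cache() {
    volatile unsigned char sink = 0;
    for (std::size_t i = 0; i < sizeof(junk); i += 64)   // one touch per 64-byte line
        sink = sink + junk[i];
    (void)sink;
}

After this runs, the code that follows starts from a cold cache, so its first-access misses are at least consistent.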
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to the x86 processor family, which is not built with real-time systems in mind, so there is no real guarantee of constant-time execution (the CPU may reorder micro-instructions; then there is branch prediction and the instruction prefetch queue, which is flushed each time the CPU mispredicts a conditional jump...).
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing: sometimes it takes time to get the information into the processor where your code can use it, including the time it takes to fetch and decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
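Along those lines, a hedged sketch of warming the data cache just before the critical section, using GCC/Clang's __builtin_prefetch (critical_data is an illustrative name for whatever working set the critical code touches):

// Touch/prefetch the critical working set so the timing-sensitive code starts warm.
#include <cstddef>

struct Sample { float value; int flags; };
static Sample critical_data[1024];            // illustrative working set

void warm_cache() {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(critical_data);
    for (std::size_t off = 0; off < sizeof(critical_data); off += 64)
        __builtin_prefetch(p + off, 0, 3);    // read prefetch, high temporal locality
}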
Make all the function calls in the critical code 'inline' and minimize the number of variables you have, so that you can give them the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way, since compilers these days tend to disregard 'register' hints.)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instructions that could affect your running time.
If an interrupt happens during your code's execution, it WILL take longer. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.

Resources