Optimizing code where "problems" are in libc - performance

I have some C++ code and I am playing with Intel's VTune. I ran the General Exploration analysis and have no idea how to interpret the results. It flags the number of Retire Stalls as an issue.
On its own, that is enough to confuse me because I'm probably in over my head. But the functions that it lists as having an abnormal amount of retire stalls are _int_malloc and malloc_consolidate, both in libc. So it's not even something that I can look at in my own code and try to figure out, and it's not something that I can really begin to change.
Is there a way to use that information to improve my own code? Or does it really just mean that I should find ways to allocate less or less often?
(Note: the specific code at hand isn't the issue; I'm looking for strategies to interpret the data and improve things when the hotspots, the stalls, or whatever the "problem" may be occur in code outside my control.)

Is there a way to use that information to improve my own code? Or does it really just mean that I should find ways to allocate less or less often?
Yes, it pretty much sounds like you should make changes in your code so that malloc gets called less often.
Is the heap allocation really necessary?
Is there a buffer that you can reuse?
Is using a memory pool an option?
Can you do stack allocation instead? For example, if those are arrays, do you happen to know the maximum size of those arrays at compile time?
Depending on your application, memory allocation can be expensive. I once made a program 20x faster by removing memory allocations from a tight loop. The application wasn't that slow on Linux but it was a disaster on Windows. After my changes, it was also OK on Windows.
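As a minimal sketch of the "reuse a buffer" idea (the names and sizes here are illustrative, not from the original code): hoist the allocation out of the hot loop so malloc runs once instead of once per iteration.

#include <cstddef>
#include <vector>

// Hypothetical per-record processing that needs scratch space.
void process_record(std::vector<char>& scratch, std::size_t needed) {
    scratch.clear();             // keeps the capacity already allocated
    scratch.resize(needed);      // reallocates only if this record is larger than any before
    // ... fill and use scratch ...
}

void process_all(std::size_t record_count) {
    std::vector<char> scratch;
    scratch.reserve(64 * 1024);  // one up-front allocation instead of one per record
    for (std::size_t i = 0; i < record_count; ++i)
        process_record(scratch, /*needed=*/1024);
}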

know which lines of code are calling malloc the most
avoid repeated allocation and deallocation
potentially use thread-local-storage together with the previous point
write your own allocator which only returns memory when you tell it to and otherwise keeps freed memory blocks in a list (use list::splice to move list elements from one list into another; a rough sketch follows this list)
use allocators from boost which potentially do the same as the previous point
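Here is a minimal sketch of that splice-based idea, assuming fixed-size blocks and illustrative names: freed blocks are recycled through a free list with std::list::splice, and memory is only really released when trim() is called.

#include <cstddef>
#include <list>
#include <vector>

class BlockRecycler {
public:
    explicit BlockRecycler(std::size_t block_size) : block_size_(block_size) {}

    std::vector<char>& acquire() {
        if (free_.empty())
            free_.emplace_back(block_size_);                // allocate a fresh block
        // Move one node from the free list to the in-use list without copying or reallocating.
        in_use_.splice(in_use_.begin(), free_, free_.begin());
        return in_use_.front();
    }

    void release(std::vector<char>& block) {
        // Splice the node holding this block back onto the free list.
        for (auto it = in_use_.begin(); it != in_use_.end(); ++it) {
            if (&*it == &block) {
                free_.splice(free_.begin(), in_use_, it);
                return;
            }
        }
    }

    void trim() { free_.clear(); }                          // only now is memory really freed

private:
    std::size_t block_size_;
    std::list<std::vector<char>> free_, in_use_;
};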


How could I make a Go program use more memory? Is that recommended?

I'm looking for an option similar to -Xmx in Java, that is, to set the maximum runtime memory that my Go application can utilise. I was checking the runtime package, but I'm not entirely sure that is the way to go.
I tried setting something like this with func SetMaxStack() (likely very stupid):
debug.SetMaxStack(5000000000) // bytes
model.ExcelCreator()
The reason why I am looking to do this is that currently there is an ample amount of RAM available but the application won't consume more than 4-6%. I might be wrong here, but it could be forcing GC to happen much more often than needed, leading to performance issues.
What I'm doing
Getting a large dataset from an RDBMS and processing it to write out to Excel.
Another reason why I am looking for such an option is to limit the maximum usage of RAM on the server where it will be ultimately deployed.
Any hints on this would be greatly appreciated.
The current stable Go (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
Having said that I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean that normally there's nothing wrong with your program not consuming lots of RAM: if the collector happens to sweep the garbage fast enough without hampering performance in a significant way, I see no reason to worry about it. Go's GC is one of the most intensely fine-tuned parts of the runtime, and it works very well in practice.
Hence you may try to take another route:
Profile memory allocations of your program.
Analyze the profile and try to figure out where the hot spots are, and whether (and how) they can be optimized. You might start here and continue with the gazillion other intros to this stuff.
Optimize. Typically this amounts to making certain buffers reusable across different calls to the same function(s) consuming them, preallocating slices instead of growing them gradually, using sync.Pool where deemed useful, etc. Such measures may actually increase the memory truly used (that is, by live objects, as opposed to garbage) but they may lower the pressure on the GC.

Dynamically executing large volumes of execute-once, straight-line x86 code

Dynamically generating code is a pretty well-known technique, for example to speed up interpreted languages, domain-specific languages and so on. Whether you want to work low-level (close to 1:1 with assembly) or high-level, you can find libraries to help you out.
Note the distinction between self-modifying code and dynamically-generated code. The former means that some code that has executed will be modified in part and then executed again. The latter means that some code, that doesn't exist statically in the process binary on disk, is written to memory and then executed (but will not necessarily ever be modified). The distinction might be important below, if only because people treat self-modifying code as a smell, but dynamically generated code as a great performance trick.
The usual use-case is that the generated code will be executed many times. This means the focus is usually on the efficiency of the generated code, and to a lesser extent the compilation time, and least of all the mechanics of actually writing the code, making it executable and starting execution.
Imagine, however, that your use case is generating code that will execute exactly once and that this is straight-line code without loops. The "compilation" process that generates the code is very fast (close to memcpy speed). In this case, the actual mechanics of writing the code to memory and executing it once become important for performance.
For example, the total amount of code executed may be tens of GBs or more. Clearly you don't want to just write it all out to a giant buffer without any re-use: this would imply writing 10 GB to memory and perhaps also reading 10 GB (depending on how generation and execution are interleaved). Instead you'd probably want to use some reasonably sized buffer (say, one that fits in the L1 or L2 cache): write out a buffer's worth of code, execute it, then overwrite the buffer with the next chunk of code and so on.
The problem is that this seems to raise the spectre of self-modifying code. Although the "overwrite" is complete, you are still overwriting memory that was at one point already executed as instructions. The newly written code has to somehow make its way from the L1D to the L1I, and the associated performance hit is not clear. In particular, there have been reports that simply writing to the code area that has already been executed may suffer penalties of 100s of cycles and that the number of writes may be important.
What's the best way to generate a large amount of straight-line code dynamically on x86 and execute it?
I think you're worried unnecessarily. Your case is more like when a process exits and its pages are reused for another process (with different code loaded into them), which shouldn't cause self-modifying code penalties. It's not the same as when a process writes into its own code pages.
The self-modifying code penalties are significant when the overwritten instructions have been prefetched or decoded to the trace cache. I think it is highly unlikely that any of the generated code will still be in the prefetch queue or trace cache by the time the code generator starts overwriting it with the next bit (unless the code generator is trivial).
Here's my suggestion: Allocate pages up to some fraction of L2 (as suggested by Peter), fill them with code, and execute them. Then map the same pages at the next higher virtual address and fill them with the next part of the code. You'll get the benefit of cache hits for the reads and the writes but I don't think you'll get any self-modifying code penalty. You'll use 10s of GB of virtual address space, but keep using the same physical pages.
Use a serializing operation such as CPUID before each time you start executing the modified instructions, as described in sections 8.1.3 and 11.6 of the Intel SDM.
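As a minimal Linux-specific sketch of that suggestion (assumptions: memfd_create is available, a hard-coded virtual address range is free, writable-and-executable mappings are permitted, a single RET stands in for the generated code, and error handling is omitted), the same physical pages are mapped at successive virtual addresses, filled, serialized with CPUID, and executed once:

#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

constexpr std::size_t kChunk = 256 * 1024;                      // roughly a fraction of L2

int main() {
    // One physical backing object, mapped repeatedly at growing virtual addresses.
    int fd = static_cast<int>(syscall(SYS_memfd_create, "jit-chunk", 0));
    ftruncate(fd, static_cast<off_t>(kChunk));

    std::uintptr_t base = 0x200000000000;                       // assumed-free VA region
    for (int chunk = 0; chunk < 4; ++chunk) {
        void* want = reinterpret_cast<void*>(base + chunk * kChunk);
        auto* code = static_cast<std::uint8_t*>(
            mmap(want, kChunk, PROT_READ | PROT_WRITE | PROT_EXEC,
                 MAP_SHARED | MAP_FIXED, fd, 0));

        code[0] = 0xC3;                                         // stand-in "generated code": RET

        // Serialize before executing freshly written instructions (Intel SDM 8.1.3 / 11.6).
        unsigned a, b, c, d;
        __asm__ volatile("cpuid" : "=a"(a), "=b"(b), "=c"(c), "=d"(d) : "a"(0));

        reinterpret_cast<void (*)()>(code)();                   // run this chunk exactly once
        munmap(code, kChunk);                                   // the physical pages live on in fd
    }
    close(fd);
    return 0;
}

Whether the fresh mapping actually sidesteps the self-modifying-code machinery would still need to be measured on the target microarchitecture.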
I'm not sure you'll stand to gain much performance by using a gigantic amount of straight-line code instead of much smaller code with loops, since there's significant overhead in continually thrashing the instruction cache for so long, and the overhead of conditional jumps has gotten much better over the past several years. I was dubious when Intel made claims along those lines, and some of their statements were rather hyperbolic, but it has improved a lot in common cases. You can still always avoid call instructions if you need to for simplicity, even for tree recursive functions, by effectively simulating "the stack" with "a stack" (possibly itself on "the stack"), in the worst case.
That leaves two reasons I can think of that you'd want to stick with straight-line code that's only executed once on a modern computer: 1) it's too complicated to figure out how to express what needs to be computed with less code using jumps, or 2) it's an extremely heterogeneous problem being solved that actually needs so much code. #2 is quite uncommon in practice, though possible in a computer theoretical sense; I've just never encountered such a problem. If it's #1 and the issue is just how to efficiently encode the jumps as either short or near jumps, there are ways. (I've also just recently gotten back into x86-64 machine code generation in a side project, after years of not touching my assembler/linker, but it's not ready for use yet.)
Anyway, it's a bit hard to know what the stumbling block is, but I suspect that you'll get much better performance if you can figure out a way to avoid generating gigabytes of code, even if it may seem suboptimal on paper. Either way, it's usually best to try several options and see what works best experimentally if it's unclear. I've sometimes found surprising results that way. Best of luck!

Windows memory allocation questions

I am currently looking into malloc() implementation under Windows. But in my research I have stumbled upon things that puzzled me:
First, I know that at the API level, Windows mostly uses the HeapAlloc() and VirtualAlloc() calls to allocate memory. I gather from here that the Microsoft implementation of malloc() (the one included in the CRT, the C runtime) basically calls HeapAlloc() for blocks > 480 bytes and otherwise manages a special area allocated with VirtualAlloc() for small allocations, in order to prevent fragmentation.
Well, that is all good and well. But then there are other implementations of malloc(), for instance nedmalloc, which claim to be up to 125% faster than Microsoft's malloc.
All this makes me wonder a few things:
Why can't we just call HeapAlloc() for small blocks? Does it perform poorly with regard to fragmentation (for example by doing "first-fit" instead of "best-fit")?
Actually, is there any way to know what is going on under the hood of the various API allocation calls? That would be quite helpful.
What makes nedmalloc so much faster than Microsoft's malloc?
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true? Or is the malloc() "wrapper" just needed because of fragmentation? One would think that system calls like this would be quick - or at least that some thought would have been put into them to make them efficient.
If it is true, why is it so?
On average, how many (an order of magnitude) memory reads/writes are performed by a typical malloc call (probably a function of the number of already allocated segments)? I would intuitively say it's in the tens for an average program, am I right?
Calling HeapAlloc() directly doesn't sound cross-platform. MS is free to change their implementation whenever they wish; I'd advise staying away from it. :)
It is probably using memory pools more effectively, much like the Loki library does with its "small object allocator" (a minimal pool sketch follows this list).
Heap allocations, which are general purpose by nature, are always slow via any implementation. The more "specialized" the allocator, the faster it will be. This returns us to point #2, which deals with memory pools (and allocation sizes that are specific to your application).
Don't know.
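For illustration, here is a minimal fixed-size pool in the spirit of such a "small object allocator" (the names are made up for this sketch, not Loki's API): blocks are carved out of larger slabs and recycled through an intrusive free list, so allocate/deallocate are just a pointer pop and push.

#include <cstddef>
#include <vector>

template <std::size_t BlockSize, std::size_t BlocksPerSlab = 1024>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "block must be able to hold a free-list link");
public:
    void* allocate() {
        if (!free_head_) grow();
        void* p = free_head_;
        free_head_ = *static_cast<void**>(free_head_);   // pop the next free slot
        return p;
    }
    void deallocate(void* p) {
        *static_cast<void**>(p) = free_head_;            // push the slot back
        free_head_ = p;
    }
private:
    void grow() {
        slabs_.emplace_back(BlockSize * BlocksPerSlab);  // one large underlying allocation instead of many small ones
        char* slab = slabs_.back().data();
        for (std::size_t i = 0; i < BlocksPerSlab; ++i)
            deallocate(slab + i * BlockSize);            // thread every slot onto the free list
    }
    void* free_head_ = nullptr;
    std::vector<std::vector<char>> slabs_;
};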
From the above, I got the impression that HeapAlloc()/VirtualAlloc() are so slow that it is much faster for malloc() to call them only once in a while and then to manage the allocated memory itself. Is that assumption true?
The OS-level system calls are designed and optimized for managing the entire memory space of processes. Using them to allocate 4 bytes for an integer is indeed suboptimal - you get overall better performance and memory usage by managing tiny allocations in library code, and letting the OS optimize for larger allocations. At least as far as I understand it.

Alternative for Garbage Collector

I'd like to know the best alternative to a garbage collector, with its pros and cons. My priority is speed; memory is less important. If there is a garbage collector which doesn't introduce any pauses, let me know.
I'm working on a safe language (i.e. a language with no dangling pointers, with bounds checking, etc.), and garbage collection or an alternative to it has to be used.
I suspect you will be best sticking with garbage collection (as per the JVM) unless you have a very good reason otherwise. Modern GCs are extremely fast, general purpose and safe. Unless you can design your language to take advantage of a very specific special case (as in one of the above allocators) then you are unlikely to beat the JVM.
The only really compelling reason I see nowadays as an argument against modern GC is latency issues caused by GC pauses. These are small, rare and not really an issue for most purposes (e.g. I've successfully written 3D engines in Java), but they still can cause problems in very tight realtime situations.
Having said that, there may still be some special cases where a different memory allocation scheme may make sense so I've listed a few interesting options below:
An example of a very fast, specialised memory management approach is the "per frame" allocator used in many games. This works by incrementing a single pointer to allocate memory, and at the end of a time period (typically a visual "frame") all objects are discarded at once by simply setting the pointer back to the base address and overwriting them in the next allocation. This can be "safe"; however, the constraints on object lifetime would be very strict. It might be a winner if you can guarantee that all memory allocation is bounded in size and only valid for the scope of handling e.g. a single server request. (A minimal sketch of this allocator follows the list of options.)
Another very fast approach is to have dedicated object pools for different classes of object. Released objects can just be recycled in the pool, using something like a linked list of free object slots. Operating systems often used this kind of approach for common data structures. Again however you need to watch object lifetime and explicitly handle disposals by returning objects to the pool.
Reference counting looks superficially good but usually doesn't make sense because you frequently have to dereference and update the count on two objects whenever you change a pointer value. This cost is usually worse than the advantage of having simple and fast memory management, and it also doesn't work in the presence of cyclic references.
Stack allocation is extremely fast and can run safely. Depending on your language, it is possible to make do without a heap and run entirely on a stack based system. However I suspect this will somewhat constrain your language design so that might be a non-starter. Still might be worth considering for certain DSLs.
Classic malloc/free is pretty fast and can be made safe if you have sufficient constraints on object creation and lifetime which you may be able to enforce in your language. An example would be if e.g. you placed significant constraints on the use of pointers.
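Here is a minimal sketch of the "per frame" allocator mentioned above (illustrative names, and only suitable for trivially destructible data): allocation is a pointer bump, and the whole frame is discarded by resetting one offset.

#include <cstddef>
#include <vector>

class FrameArena {
public:
    explicit FrameArena(std::size_t bytes) : buffer_(bytes) {}

    void* allocate(std::size_t size, std::size_t align = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + size > buffer_.size()) return nullptr;   // frame budget exceeded
        offset_ = aligned + size;
        return buffer_.data() + aligned;
    }

    // End of frame: everything allocated this frame is invalidated at once.
    void reset() { offset_ = 0; }

private:
    std::vector<char> buffer_;
    std::size_t offset_ = 0;
};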
Anyway - hope this is useful food for thought!
If speed matters but memory does not, then the fastest and simplest allocation strategy is to never free. Allocation is simply a matter of bumping a pointer up. You cannot get faster than that.
Of course, never releasing anything has a huge potential for overflowing available memory. It is very rare that memory is truly "unimportant". Usually there is a large but finite amount of available memory. One strategy is called "region based allocation". Namely you allocate memory in a few big blocks called "regions", with the pointer-bumping strategy. Release occurs only by whole regions. This strategy can be applied with some success if the problem at hand can be structured into successive "tasks", each having its own region.
For more generic solutions, if you want real-time allocation (i.e. guaranteed limits on the response time of allocation requests) then garbage collection is the way to go. A real-time GC may look like this: objects are allocated with a pointer-bumping strategy. Also, on every allocation, the allocator performs a little bit of garbage collection, in which "live" objects are copied somewhere else. In a way the GC runs "at the same time" as the application. This implies a bit of extra work for accessing objects, because you cannot move an object and update all pointers to point to the new location while keeping the "real-time" promise. Solutions may involve barriers, e.g. an extra indirection. Generational GCs allow barrier-free access to most objects while keeping pause times under strict bounds.
This article is a must-read for whoever wants to study memory allocation, in particular garbage collection.
With C++ it's possible to make a heap allocation ONCE for your objects, then reuse that memory for subsequent objects, I've seen it work and it was blindingly fast.
It's only applicable to a certain set of problems, and it's difficult to do it right, but it is possible.
One of the joys of C++ is you have complete control over memory management, you can decide to use classic new/delete, or implement your own reference counting or Garbage Collection.
However - here be dragons - you really, really need to know what you're doing.
If memory doesn't matter, then what @Thomas says applies. Considering the gargantuan memory spaces of modern hardware, this may very well be a viable option -- it really depends on the process.
Manual memory management doesn't necessarily solve your problems directly, but it does give you complete control over WHEN memory events happen. Generic malloc, for example, is not an O(1) operation. It does all sorts of potentially horrible things in there, both within the heap managed by malloc itself as well as the operating system. For example, ya never know when "malloc(10)" may cause the VM to page something out, now your 10 bytes of RAM have an unknown disk I/O component -- oops! Even worse, that page out could be YOUR memory, which you'll need to immediately page back in! Now c = *p is a disk hit. YAY!
But if you are aware of these, then you can safely set up your code so that all of the time critical parts effectively do NO memory management, instead they work off of pre-allocated structures for the task.
With a GC system, you may have a similar option -- it depends on the collector. I don't think the Sun JVM, for example, has the ability to be "turned off" for short periods of time. But if you work with pre-allocated structures, and call all of your own code (or know exactly what's going on in the library routine you call), you probably have a good chance of not hitting the memory manager.
Because the crux of the matter is that memory management is a lot of work. If you want to get rid of memory management, then write old-school FORTRAN with ARRAYs and COMMON blocks (one of the reasons FORTRAN can be so fast). Of course, you can write "FORTRAN" in most any language.
With modern languages, modern GCs, etc., memory management has been pushed aside and become a "10%" problem. We are now pretty sloppy with creating garbage, copying memory, etc., because the GCs et al. make it easy for us to be sloppy. And for 90% of programs, this is not an issue, so we don't worry about it. Nowadays, it's a tuning issue, late in the process.
So, your best bet is set it all up at once, use it, then toss it all away. The "use it" part is where you will get consistent, reliable results (assuming enough memory on the system of course).
As an "alternative" to garbage collection, C++ specifically has smart pointers. boost::shared_ptr<> (or std::tr1::shared_ptr<>) works exactly like Python's reference counted garbage collection. In my eyes, shared_ptr IS garbage collection. (although you may need to do a few weak_ptr<> stuff to make sure that circular references don't happen)
I would argue that auto_ptr<> (or, in C++0x, unique_ptr<>) is a viable alternative, with its own set of benefits and tradeoffs. auto_ptr has a clunky syntax and can't be used in STL containers... but it gets the job done. At compile time, you "move" the ownership of the pointer from variable to variable. If a variable owns the pointer when it goes out of scope, it will call the destructor and free the memory. Only one auto_ptr<> (or unique_ptr<>) is allowed to own the real pointer (at least, if you use it correctly). A small sketch of this ownership transfer follows below.
As another alternative, you can store everything on the stack and just pass references around to all the functions you need.
These alternatives don't really solve the general memory management problem that garbage collection solves. Nonetheless, they are efficient and well tested. An auto_ptr doesn't use any more space than the pointer did originally... and there is no overhead on dereferencing an auto_ptr. "Movement" (or assignment in Auto_ptr) has a tiny amount of overhead to keep track of the owner. I haven't done any benchmarks, but I'm pretty sure they're faster than garbage collection / shared_ptr.
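For reference, a minimal sketch of the move-only ownership described above, using C++11's std::unique_ptr (the Widget type is made up for this example):

#include <memory>
#include <utility>

struct Widget { int value = 42; };

std::unique_ptr<Widget> make_widget() {
    return std::unique_ptr<Widget>(new Widget);   // sole owner of the allocation
}

int main() {
    std::unique_ptr<Widget> a = make_widget();
    std::unique_ptr<Widget> b = std::move(a);     // ownership moves; a is now empty
    int v = b->value;                             // no reference count, no GC
    (void)v;
    return 0;                                     // Widget is freed when b goes out of scope
}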
If you truly want no pauses at all, disallow all memory allocation except for stack allocation, region-based buffers, and static allocation. Despite what you may have been told, malloc() can actually cause severe pauses if the free list becomes fragmented, and if you often find yourself building massive object graphs, naive manual free can and will lose to stop-and-copy; the only way to really avoid this is to amortize over preallocated pages, such as the stack or a bump-allocated pool that's freed all at once. I don't know how useful this is, but I know that the proprietary graphical programming language LabVIEW by default allocates a static region of memory for each subroutine-equivalent, requiring programmers to manually enable stack allocation; this is the kind of thing that's useful in a hard-real-time environment where you need absolute guarantees on memory usage.
If what you want is to make it easy to reason about pauses and give your developers control over allocation and placement, then there is already a language called Rust that has the same stated goals as your language; while not a completely safe language, it does have a safe subset, allowing you to create safe abstractions for raw bit-twiddling. It uses pointer type annotations to eliminate use-after-free bugs. It also doesn't have null pointers in safe code, because null pointers cost a billion dollars at least.
If bounded pauses are enough, though, there are a wide variety of algorithms that will work. If you really have a small working set compared to available memory, then I would recommend the MOS collector (aka the Train Algorithm), which collects incrementally and provably always makes progress toward freeing unreferenced objects.
It's a common fallacy that managed languages are not suitable for high performance low latency scenarios. Yes, with limited resources (such as an embedded platform) and sloppy programming you can shoot yourself in the foot just as spectacularly as with C++ (and that can be VERY VERY spectacular).
This problem has come up whilst developing games in Java/C#, and the solution was to utilise a memory pool and not let objects die, hence not needing the garbage collector to run when you don't expect it. This is really the same approach as with low-latency unmanaged systems - TO TRY REALLY REALLY HARD NOT TO ALLOCATE MEMORY.
So, considering the fact that implementing such a system in Java/C# is very similar to C++, the advantage of doing it the managed way is that you have the "niceness" of other language features that free up your mental clock cycles to concentrate on important things.

Win32/MFC: How to find free memory (RAM) available?

Any suggestions/hints/links/tutorials would be appreciated! :)
There really is no answer to this. Under normal circumstances, the OS will keep something in essentially all the memory on the system. Basically, once it's read something into memory, it'll keep a copy of it in memory until something else needs memory, so the first one gets kicked out. There are a number of functions that can get you information about memory, but none of them even attempts to really return an amount of memory that's completely unused. The closest I'm aware of is GlobalMemoryStatusEx, which does return a number for the amount of memory that's available.
That means whatever is currently in that memory is both in memory and on disk, so the copy in memory can be thrown away without having to write it to disk first. For example, if you ran a program, most of its code will stay in memory (until something else wants the memory), in case you decide to run it again. Since it's just a copy of the program on disk, it can be thrown away, and (if necessary) reloaded from disk when needed.
If you want more detail, you can use things like VirtualQueryEx to get it -- but it'll usually overload you with information, telling you about each block of memory used in a given process, instead of giving a nice, simple number saying "x bytes free".
GlobalMemoryStatus/GlobalMemoryStatusEx
http://msdn.microsoft.com/en-us/library/aa366586(VS.85).aspx
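A minimal sketch of calling GlobalMemoryStatusEx to read the "available" numbers discussed above:

#include <windows.h>
#include <cstdio>

int main() {
    MEMORYSTATUSEX status = {};
    status.dwLength = sizeof(status);                 // required before the call
    if (GlobalMemoryStatusEx(&status)) {
        std::printf("physical: %llu MB total, %llu MB available\n",
                    status.ullTotalPhys / (1024 * 1024),
                    status.ullAvailPhys / (1024 * 1024));
        std::printf("memory load: %lu%%\n", status.dwMemoryLoad);
    }
    return 0;
}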
That's pretty easy to answer: free RAM is always sufficiently close to 0 to consider it zero and not bother. Unused RAM is always used by the file system cache; you can see this in Taskmgr.exe, on the Performance tab.
If you actually mean "free virtual memory", the only number you'd really care about, then the answer is "not really possible". You'd need HeapWalk(), a very awkward and dangerous function to use. Only HeapWalk() can detect blocks in the heap that are marked free but are still mapped. The number you'd arrive at is meaningless anyway. A program never runs out of free virtual memory blocks; it always runs out of large-enough memory blocks first.
Detecting this condition is easy enough. Malloc returns NULL, the new operator throws std::bad_alloc. Dealing with the condition is not easy. Solving it takes less than two hundred bucks, roughly the license fee for a 64-bit version of Windows.
