I have written an OCaml implementation of B-trees. It is slow, and takes ~1 minute to add about 100k small ~16 byte records. When I profiled the implementation, I discovered that the program is spending most (~58 seconds) of its time in garbage collection.
The specific functions are:
caml_garbage_collection, which consumes 58 seconds; of this,
caml_major_collection_slice consumes 63% and
caml_gc_dispatch consumes 23.5%.
What could be the cause of this overactive garbage collection, and how would I go about debugging it?
I ended up solving this problem by using Spacetime, a memory profiler for OCaml, and by following the instructions here: https://blog.janestreet.com/a-brief-trip-through-spacetime/
It was a very smooth experience. I identified the issue to be a debugging data structure: I was keeping a list of all the entries around in a mutable record field, which I was updating as follows:
t.orig_items <- new_entry :: t.orig_items
It turns out that prepending like this allocates a new cons cell on every insert and, more importantly, keeps every entry reachable, so the major heap keeps growing and each GC slice has more and more live data to trace. Keeping a mutable "remember everything" list in this way is a bad idea, it seems.
It's widely accepted that one of the main things holding Go back from C++ level performance is the garbage collector. I'd like to get some intuition to help reason about the overhead of Go's GC in different circumstances. For example, is there nontrivial GC overhead if a program never touches the heap, or just allocates a large block at setup to use as an object pool with self-management? Is a call made to the GC every x seconds, or on every allocation?
As a related question: is my initial assumption correct that Go's GC is the main impediment to C++ level performance, or are there some things that Go just does slower?
The pause time (stop-the-world) for garbage collection in Golang is on the order of a few milliseconds, or in more recent Golang even less than that (see
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md)
C++ does not have a garbage collector, so these kinds of pauses do not happen. However, C++ is not magic, and memory management must occur if memory for storing objects is to be managed. Memory management is still happening somewhere in your program regardless of the language.
Using a static block of memory in C++ and not dealing with any memory-management issues is one approach, but Go can do this too. For an outline of how this is done in a real, high-performance Go program, see this video:
https://www.youtube.com/watch?time_continue=7&v=ZuQcbqYK0BY
There is a program that my MATLAB runs; since there are two gigantic nested for-loops, we expect this program to run for more than 10 hours. We ask MATLAB to print out the loop number on every iteration.
Initially (the first hour), we see the loop number increment very fast on the screen; as time goes by, it gets slower and slower... Now (more than 20 consecutive hours of executing the same ".m" file, and it still hasn't finished), it is almost 20 times slower than it was initially.
The RAM usage was initially about 30%; right now, after 20 hours of execution, it is as shown below:
My computer spec is below.
What can I do to let MATLAB maintain its initial speed?
I can only guess, but my bet is that you have some array variables that have not been preallocated, and thus their size increases at each iteration of the for loop. As a result of this, Matlab has to reallocate memory in each iteration. Reallocating slows things down, and more so the larger those variables are, because Matlab needs to find an ever larger chunk of contiguous memory. This would explain why the program seems to run slower as time passes.
If this is indeed the cause, the solution would be to preallocate those variables. If their size is not known beforehand, you can make a guess and preallocate to an approximate size, thus avoiding at least some of the reallocating. Or, if your program is not memory-limited, maybe you can use an upper-bound on variable size when preallocating; then, after the loop, trim the arrays by removing unused entries.
Some general hints; if they don't help, I suggest adding the code to the question.
Don't print to the console; this slows down execution, and the output is kept in memory. Write to a log file instead if you need it. For a simple status indicator, use waitbar.
Verify that you are preallocating all variables.
Check which function calls depend on the loop index, for example variables whose size increases with it.
If any MEX functions are used, double-check them for memory leaks. The standard procedure for this: call the function with random example data and don't store the output. If the memory usage increases, there is a memory leak in the function.
Use the profiler. Profile your code for the first n iterations (where n corresponds to about 10 minutes) and generate an HTML report. Then let it run for about 2 hours and generate a report for n iterations again. Now compare both reports and see where the time is lost.
I'd like to point everybody to the following page in the MATLAB documentation: Strategies for Efficient Use of Memory. This page contains a collection of techniques and good practices that MATLAB users should be aware of.
The OP reminded me of it when saying that memory usage tends to increase over time. Indeed there's an issue with long running instances of MATLAB on win32 systems, where a memory leak exists and is exacerbated over time (this is described in the link too).
I'd like to add to Luis' answer the following piece of advice that a friend of mine once received during a correspondence with Yair Altman:
Allocating largest vars first helps by assigning the largest free contiguous blocks to the highest vars, but it is easily shown that in some cases this can actually be harmful, so it is not a general solution.
The only sure way to solve memory fragmentation is by restarting Matlab and even better to restart Windows.
More information on memory allocation performance can be found in the following Undocumented Matlab posts:
Preallocation performance
Allocation performance take 2
I've been experimenting with programming language design, and have come to the point of needing to implement a garbage collection system. Now the first thing that came to mind was reference counting, but this won't handle reference loops. Most of the pages that I come across when searching for algorithms are references on tuning the garbage collectors in existing languages, such as Java. When I do find anything describing specific algorithms, I'm not getting enough detail for implementation. For example, most of the descriptions include "when your program runs low on memory...", which isn't likely to happen anytime soon on a 4 GB system with plenty of swap. So what I'm looking for is some tutorials with good implementation details, such as how to decide when to kick off the garbage collector (e.g., collect after X memory allocations, or every Y minutes, etc.).
To give a couple more details on what I'm trying to do: I'm starting off by writing a stack-based interpreter similar to PostScript, and my next attempt will probably be an S-expression language based on one of the Lisp dialects. I am implementing in straight C. My goal is both self-education and to document the various stages in a "how to design and write an interpreter" tutorial.
As for what I've done so far, I've written a simple interpreter which implements a C-style imperative language, which gets parsed and processed by a stack-machine-style VM (see lang2e.sourceforge.net). But this language doesn't allocate new memory on entering any function, and doesn't have any pointer data types, so there wasn't really a need at the time for any kind of advanced memory management. For my next project I'm thinking of starting off with reference counting for non-pointer-type objects (integers, strings, etc.), and then keeping track of any pointer-type objects (which can create circular references) in a separate memory pool. Then, whenever the pool has grown by more than X allocation units since the end of the previous garbage-collection cycle, kick off the collector again.
My requirement is that it not be too inefficient, yet be easy to implement and document clearly (remember, I want to develop this into a paper or book for others to follow). The algorithm currently at the front of my list is tri-color marking; it looks like a generational collector would be a bit better, but also harder to document and understand. So I'm looking for some clear reference material (preferably available online) that includes enough implementation details to get me started.
There's a great book about garbage collection: Garbage Collection: Algorithms for Automatic Dynamic Memory Management. It's excellent, and I've read it, so I'm not recommending it just because you can find it with Google. Look at it here.
For simple prototyping, use mark-and-sweep or any simple non-generational, non-incremental compacting collector. Incremental collectors are only good if you need to provide "real-time" response from your system. As long as your system is allowed to lag arbitrarily much at any particular point in time, you don't need an incremental one. Generational collectors reduce average garbage-collection overhead at the expense of assuming something about the life cycles of objects.
I have implemented all of these (generational/non-generational, incremental/non-incremental), and debugging garbage collectors is quite hard. Because you want to focus on the design of your language, and maybe not so much on debugging a more complex garbage collector, you could stick to a simple one. I would most likely go for mark-and-sweep.
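To give a feel for how little machinery a basic, non-incremental mark-and-sweep pass needs, here is a minimal sketch (written in Haskell purely for brevity; the heap representation, object IDs and root set are made-up stand-ins for whatever your interpreter actually uses). Note that, unlike reference counting, it handles cycles:

import qualified Data.Map as M
import qualified Data.Set as S

type ObjId = Int
-- Toy heap: each object just lists the objects it points to.
type Heap  = M.Map ObjId [ObjId]

-- Mark phase: compute everything reachable from the roots.
mark :: Heap -> [ObjId] -> S.Set ObjId
mark heap = go S.empty
  where
    go marked [] = marked
    go marked (x:xs)
      | x `S.member` marked = go marked xs            -- already visited (handles cycles)
      | otherwise =
          let children = M.findWithDefault [] x heap
          in  go (S.insert x marked) (children ++ xs)

-- Sweep phase: drop every object that was not marked.
sweep :: S.Set ObjId -> Heap -> Heap
sweep marked = M.filterWithKey (\k _ -> k `S.member` marked)

collect :: [ObjId] -> Heap -> Heap
collect roots heap = sweep (mark heap roots) heap

-- Example: objects 1 and 2 form a cycle reachable from the root;
-- object 3 references itself but is unreachable, so it is swept:
-- collect [1] (M.fromList [(1,[2]),(2,[1]),(3,[3])])
--   == M.fromList [(1,[2]),(2,[1])]

A real collector works on your VM's object table instead of a pure map, and frees the unmarked objects during the sweep rather than building a new heap, but the control flow is exactly this.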
When you use garbage collection, you do not need reference counting. Throw it away.
When to kick off the collector is probably wide open -- you could GC when a memory allocation would otherwise fail, or you could GC every time a reference is dropped, or anywhere in between.
Waiting until you've got no choice may mean you never GC, if the running code is fairly well contained. Or, it may introduce a gigantic pause into your environment and demolish your response time or animations or sound playback completely.
Running the full GC on every free() could amortize the cost across more operations, though the entire system may run slower as a result. You could be more predictable, but slower overall.
If you'd like to test the thing by artificially limiting memory, you can simply run with very restrictive resource limits in place. Run ulimit -v 1024 and every process spawned by that shell will only ever have one megabyte of memory to work with.
Currently I'm experimenting with a little Haskell web server written in Snap that loads and makes available to the client a lot of data. And I have a very, very hard time gaining control over the server process. At random moments the process uses a lot of CPU for seconds to minutes and becomes unresponsive to client requests. Sometimes memory usage spikes (and sometimes drops) by hundreds of megabytes within seconds.
Hopefully someone has more experience with long running Haskell processes that use lots of memory and can give me some pointers to make the thing more stable. I've been debugging the thing for days now and I'm starting to get a bit desperate here.
A little overview of my setup:
On server startup I read about 5 gigabytes of data into a big (nested) Data.Map-like structure in memory. The nested map is value-strict, and all values inside the map are of datatypes with all their fields made strict as well. I've put a lot of time into ensuring no unevaluated thunks are left. The import (depending on my system load) takes around 5-30 minutes. The strange thing is that the fluctuation between consecutive runs is way bigger than I would expect, but that's a different problem.
The big data structure lives inside a 'TVar' that is shared by all client threads spawned by the Snap server. Clients can request arbitrary parts of the data using a small query language. The amount of data requested is usually small (up to 300 KB or so) and only touches a small part of the data structure. All read-only requests are done using a 'readTVarIO', so they don't require any STM transactions.
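For concreteness, the sharing pattern looks roughly like this (a minimal sketch; AppData, newAppState and handleQuery are made-up names, and Data.Map.Strict stands in for the real value-strict nested structure):

import Control.Concurrent.STM (TVar, newTVarIO, readTVarIO)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for the real nested, value-strict structure.
type AppData = M.Map B.ByteString (M.Map B.ByteString B.ByteString)

-- One TVar holds the whole structure and is shared by all client threads.
newAppState :: AppData -> IO (TVar AppData)
newAppState = newTVarIO

-- Read-only requests just take a snapshot; no STM transaction is needed.
handleQuery :: TVar AppData -> B.ByteString -> IO (Maybe (M.Map B.ByteString B.ByteString))
handleQuery state key = do
  snapshot <- readTVarIO state
  return (M.lookup key snapshot)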
The server is started with the following flags: +RTS -N -I0 -qg -qb. This starts the server in multi-threaded mode and disables the idle-time GC and the parallel GC. This seems to speed up the process a lot.
The server mostly runs without any problems. However, every now and then a client request times out and the CPU spikes to 100% (or even over 100%) and keeps doing this for a long while. Meanwhile the server does not respond to requests anymore.
There are a few reasons I can think of that might cause the high CPU usage:
The request just takes a lot of time because there is a lot of work to be done. This is somewhat unlikely, because it sometimes happens for requests that have proven to be very fast in previous runs (by fast I mean 20-80 ms or so).
There are still some unevaluated thunks that need to be computed before the data can be processed and sent to the client. This is also unlikely, for the same reason as the previous point.
Somehow garbage collection kicks in and starts scanning my entire 5 GB heap. I can imagine this can take up a lot of time.
The problem is that I have no clue how to figure out what exactly is going on and what to do about it. Because the import process takes such a long time, profiling results don't show me anything useful. There seems to be no way to conditionally turn the profiler on and off from within code.
I personally suspect the GC is the problem here. I'm using GHC 7, which seems to have a lot of options to tweak how the GC works.
What GC settings do you recommend when using large heaps with generally very stable data?
Large memory usage and occasional CPU spikes are almost certainly the GC kicking in. You can see if this is indeed the case by using RTS options like -B, which causes GHC to beep whenever there is a major collection; -t, which will tell you statistics after the fact (in particular, see if the GC times are really long); or -Dg, which turns on debugging info for GC calls (though you need to compile with -debug).
There are several things you can do to alleviate this problem:
On the initial import of the data, GHC is wasting a lot of time growing the heap. You can tell it to grab all of the memory you need at once by specifying a large -H.
A large heap with stable data will get promoted to an old generation. If you increase the number of generations with -G, you may be able to get the stable data to be in the oldest, very rarely GC'd generation, whereas you have the more traditional young and old heaps above it.
Depending on the memory usage of the rest of the application, you can use -F to tweak how much GHC will let the old generation grow before collecting it again. You may be able to tweak this parameter so that the old generation is effectively never collected.
If there are no writes, and you have a well-defined interface, it may be worthwhile making this memory un-managed by GHC (use the C FFI) so that there is no chance of a super-GC ever.
These are all speculation, so please test with your particular application.
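One way to do that testing is to read the RTS's own GC counters from inside the program, for instance before and after a request that feels slow. A minimal sketch, assuming a GHC recent enough to provide GHC.Stats.getRTSStats (8.2 or later; older GHCs have a similar getGCStats) and assuming the binary is run with +RTS -T so statistics are collected:

import Control.Monad (when)
import GHC.Stats (RTSStats(..), getRTSStats, getRTSStatsEnabled)

-- Log how many major GCs have run and how much wall-clock time the RTS
-- has spent in GC so far. Needs the program to be run with "+RTS -T",
-- otherwise the stats are not collected.
logGCTime :: IO ()
logGCTime = do
  enabled <- getRTSStatsEnabled
  when enabled $ do
    s <- getRTSStats
    let gcSecs    = fromIntegral (gc_elapsed_ns s) / 1e9 :: Double
        totalSecs = fromIntegral (elapsed_ns s)    / 1e9 :: Double
    putStrLn $ "major GCs: " ++ show (major_gcs s)
            ++ ", time in GC: " ++ show gcSecs ++ "s of " ++ show totalSecs ++ "s"

If the GC share of elapsed time jumps at the same moments the CPU spikes, that points squarely at the collector.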
I had a very similar issue with a 1.5GB heap of nested Maps. With the idle GC on by default I would get 3-4 secs of freeze on every GC, and with the idle GC off (+RTS -I0), I would get 17 secs of freeze after a few hundred queries, causing a client time-out.
My "solution" was first to increase the client time-out and asking that people tolerate that while 98% of queries were about 500ms, about 2% of the queries would be dead slow. However, wanting a better solution, I ended up running two load-balanced servers and taking them offline from the cluster for performGC every 200 queries, then back in action.
Adding insult to injury, this was a rewrite of an original Python program, which never had such problems. In fairness, we did get about a 40% performance increase, dead-easy parallelization and a more stable codebase. But this pesky GC problem...
I have a Haskell program which processes a text file and builds a Map (with several million elements). The whole thing can run for 2-3 minutes. I found that tweaking the -H and -A options makes a big difference in running time.
There is documentation about this functionality of the RTS, but it's a hard read for me since I don't know the algorithms and terms from GC theory. I'm looking for a less technical explanation, preferably specific to Haskell/GHC. Are there any references about choosing sensible values for these options?
EDIT:
Here's the code; it builds a trie for a given list of words.
buildTrie :: [B.ByteString] -> MyDFA
buildTrie l = fst3 $ foldl' step (emptyDFA, B.empty, 1) $ sort $ map B.reverse l where
    step :: (MyDFA, B.ByteString, Int) -> B.ByteString -> (MyDFA, B.ByteString, Int)
    step (dfa, lastWord, newIndex) newWord = (insertNewStates, newWord, newIndex + B.length newSuffix) where
        (pref, lastSuffix, newSuffix) = splitPrefix lastWord newWord
        branchPoint = transStar dfa pref
        --new state labels for the newSuffix path
        newStates = [newIndex .. newIndex + B.length newSuffix - 1]
        --insert newStates
        insertNewStates = (foldl' (flip insertTransition) dfa $ zip3 (branchPoint:init newStates) (B.unpack newSuffix) newStates)
Generally speaking, garbage collection is a space/time tradeoff. Give the GC more space, and it will take less time. There are (many) other factors in play, cache in particular, but the space/time tradeoff is the most important one.
The tradeoff works like this: the program allocates memory until it reaches some limit (decided by the GC's automatic tuning parameters, or explicitly via RTS options). When the limit is reached, the GC traces all the data that the program is currently using, and reclaims all the memory used by data that is no longer required. The longer you can delay this process, the more data will have become unreachable ("dead") in the meantime, so the GC avoids tracing that data. The only way to delay a GC is to make more memory available to allocate into; hence more memory equals fewer GCs, equals lower GC overhead. Roughly speaking, GHC's -H option lets you set a lower bound on the amount of memory used by the GC, so can lower the GC overhead.
GHC uses a generational GC, which is an optimisation to the basic scheme, in which the heap is divided into two or more generations. Objects are allocated into the "young" generation, and the ones that live long enough get promoted into the "old" generation (in a 2-generation setup). The young generation is collected much more frequently than the old generation, the idea being that "most objects die young", so young-generation collections are cheap (they don't trace much data), but they reclaim a lot of memory. Roughly speaking, the -A option sets the size of the young generation - that is, the amount of memory that will be allocated before the young generation is collected.
The default for -A is 512k: it's a good idea to keep the young generation smaller than the L2 cache, and performance generally drops if you exceed the L2 cache size. But working in the opposite direction is the GC space/time tradeoff: using a very large young generation size might outweigh the cache benefits by reducing the amount of work the GC has to do. This doesn't always happen, it depends on the dynamics of the application, which makes it hard for the GC to automatically tune itself. The -H option also increases the young generation size, so can also have an adverse effect on cache usage.
The bottom line is: play around with the options and see what works. If you have plenty of memory to spare, you might well be able to get a performance increase by using either -A or -H, but not necessarily.
It is probably possible to reproduce the issue with smaller data sets, where it will be easier to see what is going on. In particular, I suggest gaining some familiarity with profiling:
the profiling chapter in the Real World Haskell book provides a good overview, including typical issues which one might encounter
the profiling chapter in GHC manual documents the available functionality
Then, check whether the memory profiles you see match your expectations (you don't need to know about all the profiling options to get useful graphs). The combination of strict foldl' with a non-strict tuple as the accumulator would be the first thing I'd look at: if the components of the tuple are not forced, that recursion is building up unevaluated expressions.
Btw, you can build up a good intuition about such things by trying to evaluate your code by hand for really tiny data sets. A few iterations will suffice to see whether expressions get evaluated or remain unevaluated in the way that fits your application.
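To make the point about foldl' with a non-strict tuple concrete, here is a tiny, self-contained example (deliberately unrelated to the trie code above). The first version leaks because foldl' only forces the pair constructor, not its components; the second forces both components with bang patterns and runs in constant space:

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Lazy accumulator: "l + 1" and "t + x" pile up as chains of thunks.
meanLazy :: [Double] -> Double
meanLazy xs = total / fromIntegral len
  where
    (len, total) = foldl' (\(l, t) x -> (l + 1, t + x)) (0 :: Int, 0) xs

-- Strict accumulator: both components are forced at every step.
meanStrict :: [Double] -> Double
meanStrict xs = total / fromIntegral len
  where
    (len, total) = foldl' step (0 :: Int, 0) xs
    step (!l, !t) x = (l + 1, t + x)

Evaluating a few steps of each by hand, as suggested above, shows the difference immediately.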