What do these Turbofan memory zones mean?

I was playing around with the V8 engine and stumbled upon the zone stats tool. I understand the zone memory concept, but what do the different zones in the zone stats mean?
When I upload a trace zone memory log into it, I see that there are four zones for Turbofan:
graph-zone
instruction-zone
pipeline-compilation-job-zone
register-allocation-zone
I understand that the compilation job zone is used for compiling (maybe? Correct me if I am wrong), but what are the other zones?

(V8 developer here.) Since Turbofan is a compiler, all the zones it uses are used for compiling.
It uses more than one zone so that temporary data that's only needed for a short time can be freed afterwards, whereas the main zone sticks around until the entire compilation job is completed. This is most obvious for the "register-allocation-zone", which is used by the register allocator.
I'm not exactly sure what the differences between the other zones are, and to be honest I don't think it's particularly important: if you're specifically working on reducing Turbofan's peak memory consumption, then you probably already know (or can easily find out by reading a bit of code); and otherwise, there's no need for you to care. The specific set of zones may also change at any time. As a JavaScript developer, these kinds of compiler internals definitely don't concern you.


How could I make a Go program use more memory? Is that recommended?

I'm looking for an option similar to -Xmx in Java, that is, a way to set the maximum runtime memory my Go application can utilise. I was checking the runtime package, but I'm not entirely sure that is the way to go.
I tried setting something like this with func SetMaxStack() (likely very stupid):
debug.SetMaxStack(5000000000) // bytes
model.ExcelCreator()
The reason why I am looking to do this is that there is currently an ample amount of RAM available, but the application won't consume more than 4-6% of it. I might be wrong here, but it could be that GC is being forced to happen much more often than needed, leading to a performance issue.
What I'm doing
Getting a large dataset from an RDBMS, processing it, and writing the result out to Excel.
Another reason why I am looking for such an option is to limit the maximum usage of RAM on the server where it will be ultimately deployed.
Any hints on this would be greatly appreciated.
The current stable Go (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
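For illustration, a minimal sketch of flipping that knob from inside the program with runtime/debug (SetGCPercent is the documented call; the surrounding program and the restore step are just an example):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running the process with GOGC=200: the next collection
	// triggers when the heap has grown to roughly twice the live data left
	// after the previous collection, so the GC runs less often but the
	// process may use about twice as much memory.
	previous := debug.SetGCPercent(200)
	fmt.Println("previous GOGC setting:", previous)

	// ... do the memory-hungry work here, e.g. the Excel export ...

	// Optionally restore the old setting afterwards.
	debug.SetGCPercent(previous)
}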
Having said that I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean that normally there's nothing wrong with your program not consuming lots of RAM: if the collector happens to sweep the garbage fast enough without hampering performance in a significant way, I see no reason to worry. Go's GC is one of the most intensely fine-tuned parts of the runtime, and it works very well in fact.
Hence you may try to take another route:
Profile memory allocations of your program (a minimal pprof sketch follows this list).
Analyze the profile and try to figure out where the hot spots are, and whether (and how) they can be optimized. You might start with the official Go blog post on profiling Go programs and continue with the gazillion other intros to this stuff.
Optimize (a second sketch follows below). Typically this amounts to making certain buffers reusable across different calls to the same function(s) consuming them, preallocating slices instead of growing them gradually, using sync.Pool where deemed useful, etc. Such measures may actually increase the memory truly used (that is, by live objects, as opposed to garbage), but they may lower the pressure on the GC.
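For the profiling step, here is a minimal sketch of exposing the heap profile over HTTP with net/http/pprof (the import path, the localhost:6060 address and the /debug/pprof endpoints are the package's standard ones; doWork is a made-up placeholder for your real workload):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func doWork() {
	// Placeholder for the real workload, e.g. reading from the RDBMS
	// and writing the Excel file.
	select {}
}

func main() {
	// Serve the profiling endpoints on a side port while the work runs.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	doWork()
}

While the program runs, go tool pprof http://localhost:6060/debug/pprof/heap shows which call sites hold the most live memory.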
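For the optimization step, a rough sketch of the two most common measures mentioned above, preallocating a slice and reusing buffers through sync.Pool (buildRow and its data are invented for the example; sync.Pool and preallocation via make are the standard-library facilities being shown):

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// A pool of reusable buffers: callers borrow one instead of allocating a
// fresh buffer on every call, which lowers the pressure on the GC.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// buildRow joins the cells of one row with tabs, using a pooled buffer.
func buildRow(cells []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	for _, c := range cells {
		buf.WriteString(c)
		buf.WriteByte('\t')
	}
	return buf.String()
}

func main() {
	const n = 100000
	// Preallocate the result slice instead of growing it append by append.
	rows := make([]string, 0, n)
	for i := 0; i < n; i++ {
		rows = append(rows, buildRow([]string{"col1", "col2", "col3"}))
	}
	fmt.Println("rows built:", len(rows))
}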

"Well-parallelized" algorithm not sped up by multiple threads

I'm sorry to ask a question on a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
Background:
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
Reasoning:
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
Question:
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.
It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is the CPU cache. If the threads are updating the same cache line, that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables that are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
    int value;          /* the variable that is actually shared between threads */
    char padding[128];  /* pad so the next variable lands on a different cache line */
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
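For illustration, here is the same padding idea sketched in Go, with two goroutines hammering adjacent counters; the 64-byte line size and the struct layout are assumptions, so measure on your own hardware:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Adjacent counters: both fields live on the same cache line, so writes
// from different goroutines keep invalidating each other (false sharing).
type packed struct {
	a uint64
	b uint64
}

// Padded counters: each field gets its own cache line (assuming 64-byte lines).
type padded struct {
	a uint64
	_ [56]byte
	b uint64
	_ [56]byte
}

// run has two goroutines increment two separate counters and reports the time.
func run(a, b *uint64, iters int) time.Duration {
	var wg sync.WaitGroup
	wg.Add(2)
	start := time.Now()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			atomic.AddUint64(a, 1)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			atomic.AddUint64(b, 1)
		}
	}()
	wg.Wait()
	return time.Since(start)
}

func main() {
	const iters = 50000000
	var p packed
	var q padded
	fmt.Println("packed:", run(&p.a, &p.b, iters))
	fmt.Println("padded:", run(&q.a, &q.b, iters))
}

On typical multi-core x86 hardware the padded version should run noticeably faster, because the two counters no longer share a cache line.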
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.
Performance versus number of cores is often an S-like curve: at first it obviously increases, but as locking, shared cache and the like take their toll, the additional cores do not add as much and may even degrade performance. Hence nothing mysterious. If we knew more details about the algorithm, it might be possible to find an idea to speed it up.

Have you ever used NSZoneMalloc() instead of malloc()?

Cocoa provides for page-aligned memory areas that it calls Memory Zones, and provides a few memory management functions that take a zone as an argument.
Let's assume you need to allocate a block of memory (not for an object, but for arbitrary data). If you call malloc(size), the buffer will always be allocated in the default zone. However, somebody may have used allocWithZone: to allocate your object in another zone besides the default. In that case, it would seem better to use NSZoneMalloc([self zone], size), which keeps your buffer and owning object in the same area of memory.
Do you follow this practice? Have you ever made use of memory zones?
Update: I think there is a tendency on Stack Overflow to respond to questions about low-level topics with a lecture about premature optimization. I understand that zones probably mattered more in 1993 on NeXT hardware than they do today, and a Google search makes it pretty clear that virtually nobody is concerned with them. I am asking anyway, to see if somebody could describe a project where they made use of memory zones.
I've written software for NeXTStep, GNUstep on Linux and Cocoa on Mac OS X, and have never needed to use custom memory zones. The condition which would suggest it as a good improvement to the software has either never arisen, or never been detected as significant.
You're absolutely right in your entire question, but in practice, nobody really uses zones. As the page you link to puts it:
In most circumstances, using the default zone is faster and more efficient than creating a separate zone.
The benefit of making your own zone is:
If a page fault occurs when trying to access one of the objects, loading the page brings in all of the related objects, which could significantly reduce the number of future page faults.
If a page fault occurs, that means that the system was recently paging things out and is therefore slow anyway, and that either your app is not responsible or the solution is in the part of your app that allocated too much memory at once in the first place.
So, basically, the question is “can you prove that you really do need to create your own zone to fix a performance problem or make your app wicked fast”, and the answer is “no”.
If you find yourself doing this, you're probably operating at a lower level than you really ought to be. The runtime and frameworks pretty much ignore zones; any calls to +alloc or such will get you objects in the default zone. malloc and NSAllocateCollectable are all you need to know.

When might you not want to use garbage collection? [closed]

Garbage collection has been around since the early days of LISP, and now - several decades on - most modern programming languages utilize it.
Assuming that you're using one of these languages, what reasons would you have to not use garbage collection, and instead manually manage the memory allocations in some way?
Have you ever had to do this?
Please give solid examples if possible.
I can think of a few:
Deterministic deallocation/cleanup
Real time systems
Not giving up half the memory or processor time - depending on the algorithm
Faster memory alloc/dealloc and application-specific allocation, deallocation and management of memory. Basically writing your own memory stuff - typically for performance sensitive apps. This can be done where the behavior of the application is fairly well understood. For general purpose GC (like for Java and C#) this is not possible.
EDIT
That said, GC has certainly been good for much of the community. It allows us to focus more on the problem domain rather than nifty programming tricks or patterns. I'm still an "unmanaged" C++ developer though. Good practices and tools help in that case.
Memory allocations? No, I think the GC is better at it than I am.
But scarce resource allocations, like file handles, database connections, etc.? I write the code to close those when I'm done. GC won't do that for you.
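In Go, for instance, that explicit cleanup is usually a deferred call right after acquiring the resource (the file name here is made up):

package main

import (
	"log"
	"os"
)

func main() {
	f, err := os.Open("report.csv") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	// The GC will eventually reclaim the *os.File object, but it makes no
	// promise about when, so the descriptor is closed explicitly on return.
	defer f.Close()

	// ... read from f ...
}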
I do a lot of embedded development, where the question is more likely to be whether to use malloc or static allocation and garbage collection is not an option.
I also write a lot of PC-based support tools and will happily use GC where it is available & fast enough and it means that I don't have to use pedant::std::string.
I write a lot of compression & encryption code and GC performance is usually not good enough unless I really bend the implementation. GC also requires you to be very careful with address aliasing tricks. I normally write performance sensitive code in C and call it from Python / C# front ends.
So my answer is that there are reasons to avoid GC, but the reason is almost always performance and it's then best to code the stuff that needs it in another language rather than trying to trick the GC.
If I develop something in MSVC++, I never use garbage collection. Partly because it is non-standard, but also because I've grown up without GC in C++ and automatically design in safe memory reclamation. Having said this, I think that C++ is an abomination which fails to offer the translation transparency and predictability of C or the scoped memory safety (amongst other things) of later OO languages.
Real time applications are probably difficult to write with a garbage collector. Maybe with an incremental GC that works in another thread, but this is an additional overhead.
One case I can think of is when you are dealing with large data sets amounting to hundreds of megabytes or more. Depending on the situation, you might want to free this memory as soon as you are done with it, so that other applications can use it.
Also, when dealing with some unmanaged code there might be a situation where you might want to prevent the GC from collecting some data because it's still being used by the unmanaged part. Though I still have to think of a good reason why simply keeping a reference to it might not be good enough. :P
One situation I've dealt with is image processing. While working on an algorithm for cropping images, I've found that managed libraries just aren't fast enough to cut it on large images or on multiple images at a time.
The only way to do processing on an image at a reasonable speed was to use non-managed code in my situation. This was while working on a small personal side-project in C# .NET where I didn't want to learn a third-party library because of the size of the project and because I wanted to learn it to better myself. There may have been an existing third-party library (perhaps Paint.NET) that could do it, but it still would require unmanaged code.
Two words: Space Hardening
I know it's an extreme case, but still applicable. One of the coding standards that applied to the core of the Mars rovers actually forbade dynamic memory allocation. While this is indeed extreme, it illustrates a "deploy and forget about it with no worries" ideal.
In short, have some sense as to what your code is actually doing to someone's computer. If you do, and you are conservative .. then let the memory fairy take care of the rest. While you develop on a quad core, your user might be on something much older, with much less memory to spare.
Use garbage collection as a safety net, be aware of what you allocate.
There are two major types of real time systems, hard and soft. The main distinction is that hard real time systems require that an algorithm always finish within a particular time budget, whereas a soft system would merely like it to happen normally. Soft systems can potentially use well designed garbage collectors, although a normal one would not be acceptable. However, if a hard real time system's algorithm did not complete in time, lives could be in danger. You will find such systems in nuclear reactors, aeroplanes and space shuttles, and even then only in the specialist software that the operating systems and drivers are made of. Suffice to say this is not your common programming job.
People who write these systems don't tend to use general purpose programming languages. Ada was designed for the purpose of writing these sorts of real time systems. Despite Ada being a special language for such systems, in some systems the language is cut down further to a subset known as SPARK. SPARK is a special safety-critical subset of the Ada language, and one of the features it does not allow is the creation of a new object. The new keyword for objects is totally banned for its potential to run out of memory and its variable execution time. Indeed, all memory access in SPARK is done with absolute memory locations or stack variables, and no new allocations on the heap are made. A garbage collector is not only totally useless but harmful to the guaranteed execution time.
These sorts of systems are not exactly common, but where they exist some very special programming techniques are required and guaranteed execution times are critical.
Just about all of these answers come down to performance and control. One angle I haven't seen in earlier posts is that skipping GC gives your application more predictable cache behavior in two ways.
In certain cache sensitive applications, having the language automatically trash your cache every once in a while (although this depends on the implementation) can be a problem.
Although GC is orthogonal to allocation, most implementations give you less control over the specifics. A lot of high performance code has data structures tuned for caches, and implementing stuff like cache-oblivious algorithms requires more fine grained control over memory layout. Although conceptually there's no reason GC would be incompatible with manually specifying memory layout, I can't think of a popular implementation that lets you do so.
Assuming that you're using one of these languages, what reasons would you have to not use garbage collection, and instead manually manage the memory allocations in some way?
Potentially, several possible reasons:
Program latency due to the garbage collector is unacceptably high.
Delay before recycling is unacceptably long, e.g. allocating a big array on .NET puts it in the Large Object Heap (LOH) which is infrequently collected so it will hang around for a while after it has become unreachable.
Other overheads related to garbage collection are unacceptably high, e.g. the write barrier.
The characteristics of the garbage collector are unacceptable, e.g. repeatedly doubling arrays on .NET fragments the Large Object Heap (LOH), causing out of memory when the 32-bit address space is exhausted even though there is theoretically plenty of free space. In OCaml (and probably most GC'd languages), functions with deep thread stacks run asymptotically slower. Also in OCaml, threads are prevented from running in parallel by a global lock on the GC, so (in theory) parallelism can be achieved by dropping to C and using manual memory management.
Have you ever had to do this?
No, I have never had to do that. I have done it for fun. For example, I wrote a garbage collector in F# (a .NET language) and, in order to make my timings representative, I adopted an allocationless style in order to avoid GC latency. In production code, I have had to optimize my programs using knowledge of how the garbage collector works but I have never even had to circumvent it from within .NET, much less drop .NET entirely because it imposes a GC.
The nearest I have come to dropping garbage collection was dropping the OCaml language itself because its GC impedes parallelism. However, I ended up migrating to F# which is a .NET language and, consequently, inherits the CLR's excellent multicore-capable GC.
I don't quite understand the question. Since you ask about a language that uses GC, I assume you are asking for examples like
Deliberately hang on to a reference even when I know it's dead, maybe to reuse the object to satisfy a future allocation request.
Keep track of some objects and close them explicitly, because they hold resources that can't easily be managed with the garbage collector (open file descriptors, windows on the screen, that sort of thing).
I've never found a reason to do #1, but #2 is one that comes along occasionally. Many garbage collectors offer mechanisms for finalization, which is an action that you bind to an object and the system runs that action before the object is reclaimed. But oftentimes the system provides no guarantees about whether or if finalizers actually run, so finalization can be of limited utility.
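To make the finalization point concrete, here is a small sketch in Go (chosen only as one garbage-collected language; runtime.SetFinalizer is the real API, and its documentation makes no promise that a finalizer runs before the program exits):

package main

import (
	"fmt"
	"runtime"
	"time"
)

type resource struct {
	name string
}

func main() {
	r := &resource{name: "demo"}

	// Ask the runtime to run a cleanup action once r becomes unreachable.
	// There is no guarantee the finalizer runs at all before the program
	// exits, which is why finalizers are only useful as a safety net.
	runtime.SetFinalizer(r, func(r *resource) {
		fmt.Println("finalizing", r.name)
	})

	r = nil                            // drop the last reference
	runtime.GC()                       // encourage, but not force, a collection
	time.Sleep(100 * time.Millisecond) // give the finalizer goroutine a chance
	fmt.Println("exiting; the finalizer may or may not have printed")
}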
The main thing I do in a garbage-collected language is to keep a tight watch on the number of allocations per unit of other work I do. Allocation is usually the performance bottleneck, especially in Java or .NET systems. It is less of an issue in languages like ML, Haskell, or LISP, which are typically designed with the idea that the program is going to allocate like crazy.
EDIT: longer response to comment.
Not everyone understands that when it comes to performance, the allocator and the GC must be considered as a team. In a state-of-the-art system, allocation is done from contiguous free space (the 'nursery') and is as quick as test and increment. But unless the object allocated is incredibly short-lived, the object incurs a debt down the line: it has to be copied out of the nursery, and if it lives a while, it may be copied through several generations. The best systems use contiguous free space for allocation and at some point switch from copying to mark/sweep or mark/scan/compact for older objects. So if you're very picky, you can get away with ignoring allocations if
You know you are dealing with a state-of-the-art system that allocates from contiguous free space (a nursery).
The objects you allocate are very short-lived (less than one allocation cycle in the nursery).
Otherwise, allocated objects may be cheap initially, but they represent work that has to be done later. Even if the cost of the allocation itself is a test and increment, reducing allocations is still the best way to improve performance. I have tuned dozens of ML programs using state-of-the-art allocators and collectors and this is still true; even with the very best technology, memory management is a common performance bottleneck.
And you'd be surprised how many allocators don't deal well even with very short-lived objects. I just got a big speedup from Lua 5.1.4 (probably the fastest of the scripting languages, with a generational GC) by replacing a sequence of 30 substitutions, each of which allocated a fresh copy of a large expression, with a simultaneous substitution of 30 names, which allocated one copy of the large expression instead of 30. The performance problem disappeared.
In video games, you don't want to run the garbage collector in the middle of a frame.
For example, the Big Bad is in front of you and you are down to 10 life. You decided to run towards the Quad Damage powerup. As soon as you pick up the powerup, you prepare yourself to turn towards your enemy to fire with your strongest weapon.
When the powerup disappeared, it would be a bad idea to run the garbage collector just because the game world has to delete the data for the powerup.
Video games usually manage their objects by figuring out what is needed in a certain map (this is why it takes a while to load maps with a lot of objects). Some game engines would call the garbage collector after certain events (after saving, when the engine detects there's no threat in the vicinity, etc.).
Other than video games, I don't find any good reasons to turn off garbage collecting.
Edit: After reading the other comments, I realized that embedded systems and Space Hardening (Bill's and tinkertim's comments, respectively) are also good reasons to turn off the garbage collector
The more critical the execution, the more you want to postpone garbage collection, but the longer you postpone garbage collection, the more of a problem it will eventually be.
Use the context to determine the need:
1. Garbage collection is supposed to protect against memory leaks. Do you need more state than you can manage in your head?
2. Returning memory by destroying objects with no references can be unpredictable. Do you need more pointers than you can manage in your head?
3. Resource starvation can be caused by garbage collection. Do you have more CPU and memory than you can manage in your head?
4. Garbage collection cannot address files and sockets. Do you have I/O as your primary concern?
In systems that use garbage collection, weak pointers are sometimes used to implement a simple caching mechanism because objects with no strong references are deallocated only when memory pressure triggers garbage collection. However, with ARC, values are deallocated as soon as their last strong reference is removed, making weak references unsuitable for such a purpose.
References
GC FAQ
Smart Pointer Guidelines
Transitioning to ARC Release Notes
Accurate Garbage Collection with LLVM
Memory management in various languages
jwz on Garbage Collection
Apple Could Power the Web
How Do The Script Garbage Collectors Work?
Minimize Garbage Generation: GC is your Friend, not your Servant
Garbage Collection in IE6
Slow web browser performance when you view a Web page that uses JScript in Internet Explorer 6
Transitioning to ARC Release Notes: Which classes don’t support weak references?
Automatic Reference Counting: Weak References

How can you ensure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (especially interrupts) needs to be constant.
How do you ensure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced by the specific processor architecture is nowhere near the magnitude of cache effects.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software, code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps it could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter, carefully design the software and data as if in the old days of programming for an ancient 8-bit CPU: keep it small enough for it all to fit in the L1 cache. I'm always amazed at how on-board caches these days are bigger than all the RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to the x86 processor family, which is not built with real-time systems in mind, so there is no real guarantee of constant-time execution (the CPU may reorder micro-instructions, then there is branch prediction, and the instruction prefetch queue is flushed each time the CPU wrongly predicts a conditional jump...).
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain, you can even further just mask over or hide the variance. Computer graphics people have been rendering to off-screen buffers for years to hide variance in the time to render each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
If you make all the function calls in the critical code 'inline' and minimize the number of variables you have, you can let them have the 'register' type. This should improve the running time of your program. (You probably have to compile it in a special way, since compilers these days tend to disregard your 'register' tags.)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instructions that could change your running time.
If an interrupt happens during your code's execution, it WILL take longer. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.
