I am learning memory management in Windows. I know that a process in Windows has a default heap, which can grow later. A process can also create additional (private) heaps. Why does Windows allow creating private heaps? What is the benefit of this approach? As I understand it, using the default heap (with possible reallocations) is enough. Or is it perhaps another way to optimize reallocations?
If you look at HeapCreate you will see that it has multiple options that change how the heap works. For example, HEAP_NO_SERIALIZE will make it faster, but you then have to handle thread synchronization on your own.
Having multiple heaps can also be beneficial if you allocate objects of different sizes with different lifetimes. If you have a high churn of small objects that are allocated and de-allocated as part of your work, you might want to put large, long-lived objects on their own heap to reduce fragmentation (and lock contention, if you are multithreaded).
As noted in a comment, you can call HeapDestroy to free every allocation and the heap itself in one call, but this only makes sense if you have full control over everything allocated there. You are not allowed to destroy the default heap, so you must create your own private heap to use this trick.
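The HeapDestroy trick can be illustrated without the Windows API. The following is a minimal, portable C sketch, not the Win32 implementation: a hypothetical `private_heap` type links every allocation into the heap object, so `heap_destroy` releases all of them plus the heap in one call. The names `heap_create`, `heap_alloc` and `heap_destroy` are made up for this example, and individual frees (the HeapFree equivalent) are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header placed in front of every allocation; links it into its heap. */
typedef struct block {
    struct block *next;
    max_align_t payload[];  /* user data starts here, suitably aligned */
} block;

/* Hypothetical private heap: just the head of the allocation list.
   Like a heap created with HEAP_NO_SERIALIZE, it does no locking. */
typedef struct {
    block *head;
} private_heap;

private_heap *heap_create(void) {
    return calloc(1, sizeof(private_heap));
}

void *heap_alloc(private_heap *h, size_t size) {
    assert(h != NULL);
    if (size > SIZE_MAX - sizeof(block)) return NULL;  /* avoid size overflow */
    block *b = malloc(sizeof(block) + size);
    if (b == NULL) return NULL;
    b->next = h->head;  /* link the new allocation into this heap */
    h->head = b;
    return b->payload;
}

/* Releases every allocation and the heap itself in one call.
   Returns the number of allocations that were freed. */
size_t heap_destroy(private_heap *h) {
    size_t freed = 0;
    block *b = h->head;
    while (b != NULL) {
        block *next = b->next;
        free(b);
        b = next;
        freed++;
    }
    free(h);
    return freed;
}
```

Callers never free the individual objects; one `heap_destroy` call at the end of a unit of work cleans up everything, which is only safe if nothing outside that work still points into the heap.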
I recently ran into an issue in a FreePascal project I'm developing: the application requires a look-up array which may become very large at runtime (a few million entries). Each array element is about 8 bytes in size.
I observed the following behavior in my application: if the array is already quite large (~130 MB), another enlargement results in a peak in memory consumption and possibly also a sudden rise in used RAM.
As far as I have read, the peak may be explained by the internal behavior of SetLength(), which allocates memory for the new array size and then copies the old array to its new location in memory.
But while examining the sudden increase in used memory, it seemed that there were situations where the "old" memory was not freed, resulting in doubled RAM usage.
I was able to reproduce this behavior more clearly when I increased the step size by which the array was enlarged.
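The allocate-copy-free behavior described above can be modelled with a little arithmetic. This is a hypothetical C sketch, not Free Pascal's actual SetLength code: `simulate_growth` counts how many resize-and-copy steps a given growth policy needs, how many bytes get copied, and the largest old+new footprint that exists while one copy is in progress. For the ~130 MB array in the question, that transient footprint is the old block plus the new one; if the old block is then neither reused nor returned to the OS, the footprint stays near that peak.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a copy-on-grow array: each resize allocates the new block,
   copies the old contents, then frees the old block, so both blocks
   are alive at the same time during the copy. */
typedef struct {
    size_t resizes;      /* number of reallocate-and-copy steps */
    size_t bytes_copied; /* total bytes moved by all copies */
    size_t peak_bytes;   /* largest old+new footprint during one resize */
} grow_stats;

/* Grow from `initial` to `target` elements, either by a fixed `step`
   (factor <= 1) or by multiplying the capacity by `factor`. */
grow_stats simulate_growth(size_t initial, size_t target, size_t step,
                           unsigned factor, size_t elem_size) {
    assert(initial > 0);
    assert(factor > 1 || step > 0);  /* otherwise the loop would never end */
    grow_stats s = {0, 0, 0};
    size_t cap = initial;
    while (cap < target) {
        size_t next = factor > 1 ? cap * factor : cap + step;
        if (next > target) next = target;
        size_t transient = (cap + next) * elem_size;  /* old and new both live */
        if (transient > s.peak_bytes) s.peak_bytes = transient;
        s.bytes_copied += cap * elem_size;
        s.resizes++;
        cap = next;
    }
    return s;
}
```

With 8-byte elements, growing from 100 to 1000 entries in fixed steps of 100 takes 9 copies (36,000 bytes copied), while doubling takes 4 (12,000 bytes); the peak during the final copy is similar in both cases, because it is dominated by the last old+new pair.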
To get rid of this problem, I switched the memory manager to CMem and the issue was gone.
Unfortunately I did not find a clear description of the Free Pascal memory manager. I can only guess that the space of the "old" (smaller) array is not reused because the built-in memory manager tries to keep the heap unfragmented at all times, but I could not prove that.
Does anyone have a source that describes the basic functionality of the Free Pascal memory manager and the C memory manager, and/or the differences between the two?
Thank you very much, kind regards
Alex
I'm writing an application using JavaFX that scrolls a large amount of image content on and off the screen every 20-30 seconds. It's meant to run for multiple hours, pulling in completely new content and discarding old content every couple of minutes. I have 512 MB of graphics memory on my system, and after several minutes all of that memory has been consumed by JavaFX; no matter what I do with my JavaFX scene, none of it is released. I've been very careful to discard nodes when they drop off the scene, and at most I have 50-60 image nodes in memory at one time. I really need to be able to do a hard release of the graphics memory that was backing these images, but haven't been able to figure out how to accomplish that, as the Image class in JavaFX seems to be very high level. JavaFX will continue to run fine, but other graphics-heavy applications fail to load due to limited resources.
I'm looking for something like the flush() method on java.awt.image.Image:
http://docs.oracle.com/javase/7/docs/api/java/awt/Image.html#flush()
I'm running Java 7u13 on Linux.
EDIT:
I managed to work out a potential workaround (see below), but I have also filed a JavaFX JIRA ticket to request the functionality described above:
RT-28661
Add explicit access to a native resource cleanup function on nodes.
The best workaround that I could come up with was to set my JVM's max heap to half the size of my graphics card's memory. (I have 512 MB of graphics memory, so I set it to -Xmx256m.) This forces the GC to be more proactive in cleaning up my discarded javafx.scene.image.Image objects, which in turn seems to trigger graphics memory cleanup on the part of JavaFX.
Previously my max heap was set to 512 MB (I have 4 GB of system memory, so this is a very manageable limit). The problem seems to be that the JVM was very lazy about cleaning up my images until it started approaching this 512 MB limit. Since all of my image data was copied into graphics memory, I had most likely exhausted my graphics memory before the JVM really started caring about cleanup.
I did try some of the suggestions by jewelsea:
I am calling setCache(false), so this may be having a positive effect, but I didn't notice an improvement until I dropped my max heap size.
I tried running with Java 8 with some positive results. It did seem to behave better in graphics memory management, but it still ate up all of my memory and didn't seem to start caring about graphics memory until I was almost out. If reducing the application's heap limit is not feasible, then evaluating the Java 8 pre-release may be worthwhile.
I will be posting some feature requests to the JavaFX project and will provide links to the JIRA tickets.
Perhaps you are encountering behaviour related to the root cause of the following issue:
RT-16011 Need mechanism for PG nodes to know when they are no longer part of a scene graph
From the issue description:
Some PG nodes contain handles to non-heap resources, such as GPU textures, which we would want to aggressively reclaim when the node is no longer part of a scene graph. Unfortunately, there is no mechanism to report this state change to them so that they can release their resources so we must rely on a combination of GC, Ref queues, and sometimes finalization to reclaim the resources. Lazy reclamation of some of these resources can result in exceptions when garbage collection gets behind and we run out of these limited resources.
There are numerous other related issues you can see when you look at the issue page I linked (an account is required to view the issue, but anybody can sign up).
A sample related issue is:
RT-15516 image data associated with cached nodes that are removed from a scene are not aggressively released
On which a user commented:
I found a workaround for my app just settihg up an using of cashe to false for all frequently using nodes. 2 days working without any crashes.
So try calling setCache(false) on your nodes.
Also try using a Java 8 preview release where some of these issues have been fixed and see if it increases the stability of your application. Though currently, even in the Java 8 branch, there are still open issues such as the following:
RT-25323 Need a unified Texture resource management system for Prism
Currently texture resources are managed separately in at least 2 places depending on how it is used; one is a texture cache for images and the other is the ImagePool for RTTs. This approach is flawed in its design, i.e. the 2 caches are unaware of each other and it assumes system has unlimited native resources.
Using a video card with more memory may either reduce or eliminate the issue.
You may also wish to put together a minimal executable example which demonstrates your issue and raise a bug request against the JavaFX Runtime project so that a JavaFX developer can investigate your scenario and see if it is new or a duplicate of a known issue.
I have been tasked with reducing the memory footprint of a Windows CE 5.0 application. I came across Rob Tiffany's highly cited article, which recommends using managed DLLs to keep code out of the process's slot. But there is something I don't understand.
The article says that
The JIT compiler is running in your slot and it pulls in IL from the 1 GB space as needed to compile the current call stack.
This means that all the code in the managed DLL could eventually end up in the process's slot. While this helps other processes, since the code is not loaded into the common area, how does it help this process? FWIW, the article does mention that
It also reduces the amount of memory that has to be allocated inside your
My only thought is that just as the code is pulled into the slot it is also pushed/swapped out. But that is just a wild guess and probably completely false.
CF assemblies aren't loaded into the process slot like native DLLs are. They're actually accessed as memory-mapped files. This means that the size of the DLL is effectively irrelevant.
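The memory-mapped access described above is not specific to the Compact Framework. As a rough analogy only (POSIX C, not Windows CE or CF code), the sketch below maps a file read-only with `mmap`: the mapping covers the whole file, but on typical demand-paged systems pages are read from the file on first access instead of being copied into private memory up front. `map_file` is a hypothetical helper name, and the sketch does not model CE's slot layout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps `path` read-only and returns a pointer to its bytes, storing the
   length in *len_out. Returns NULL on failure or for an empty file.
   The caller releases the mapping with munmap(ptr, len). */
const unsigned char *map_file(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

Reading through the returned pointer touches only the pages actually used, so a large file costs address space but not necessarily physical memory.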
The managed heap also lies in shared memory, not your process slot, so object allocations are far less likely to cause process slot fragmentation or OOMs.
The JITter also doesn't just JIT code and hold on to it forever. It compiles what is necessary, and during a GC it may very well pitch compiled code that is not being used, or that hasn't been used in a while. You're never going to see an entire assembly JITted and pulled into the process slot (well, if it's a small assembly maybe, but that's certainly not typical).
Obviously some process slot memory has to be used to create some pointers, stack storage, etc etc, but by and large managed code has way less impact on the process slot limitations than native code. Of course you can still hit the limit with large stacks, P/Invokes, native allocations and the like.
In my experience, the area where people most often get into trouble with CF apps and memory is GDI objects and drawing. Bitmaps take up a lot of memory. Even though it's largely in shared memory, creating lots of them (along with brushes, pens, etc.) without caching and reusing them is what most often gives a large managed app its memory footprint.
For a bit more detail this MSDN webcast on Compact Framework Memory Management, while old, is still very relevant.
The llvm documentation says:
In practice, however, the locality and performance benefits of using aggressive garbage collection techniques dominates any low-level losses.
So what is it, exactly, that causes the performance gain when using garbage collection as opposed to manually managing memory? (besides the obvious decrease in code writing time) Is the benefit solely that performing heap compaction increases spatial locality and cache utilization? Or is there something else that helps more, like deleting everything at once?
On modern processors the memory caches are King. Suffering a cache miss can stall the processor for hundreds of cpu cycles, waiting for the slow bus to supply the data.
Making the caches effective requires locality of reference. In other words, if the next memory access is close to the previous one then the odds that the data is already in the cache are high.
A garbage collector can help a lot to make that work out well. The big win is not the collection, it is its ability to rebuild the object graph and reorganize the data structure while doing so. Compacting.
Imagine a typical data structure: an array of pointers to objects, slowly built up while, say, reading a bunch of strings from a file and turning them into field values of an object. The allocated objects end up scattered across the address space, with the long-lived objects pointed to by the array separated by short-lived worker objects such as strings. Iterating that array later is going to be pretty slow.
That is, until the garbage collector runs and rebuilds the data structure, putting all of the pointed-to objects in order.
Now iterating the collection is very fast, since accessing element N makes it very likely that element N+1 is readily available. If not in the L1 cache then very good odds for L2 or L3 (if you have it).
Very big win; it is the one feature that made garbage collection competitive with explicit memory management. Explicit memory management has the problem that it cannot move objects, because moving one would invalidate the pointers to it.
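The compaction step described above can be sketched in C for the single array-of-pointers case. This is a simplified, hypothetical illustration, not any particular collector: `compact_refs` evacuates each object referenced from the array into one contiguous buffer, in array order, frees the old copy and rewrites the reference. It assumes each object is referenced exactly once from the array; a real copying collector (for example Cheney's algorithm) traces the whole object graph and leaves forwarding pointers so that shared objects are copied only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical long-lived object pointed to by the array. */
typedef struct {
    long id;
    double value;
} record;

/* Moves every object referenced by refs[0..n) into one contiguous
   buffer in array order and rewrites the references. Each refs[i] must
   have been allocated with malloc and be referenced only from here.
   Returns the new buffer (owned by the caller) or NULL on failure. */
record *compact_refs(record **refs, size_t n) {
    assert(n > 0);
    record *to_space = malloc(n * sizeof *to_space);
    if (to_space == NULL) return NULL;
    for (size_t i = 0; i < n; i++) {
        memcpy(&to_space[i], refs[i], sizeof(record));  /* evacuate in array order */
        free(refs[i]);                                   /* old copy is garbage now */
        refs[i] = &to_space[i];                          /* rewrite the reference */
    }
    return to_space;
}
```

After compaction, `refs[i]` and `refs[i + 1]` are adjacent in memory, so iterating the array walks one contiguous block instead of chasing pointers scattered among interleaved worker allocations.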
I can only speak for the Oracle (ex-Sun) and IBM JVMs; their efficiency relies on the fact that newly-created objects are unlikely to live very long. So segregating them into their own area allows that area to be frequently compacted, since with few survivors that's a cheap operation. Frequent compaction means that free space can be kept contiguous, so object creation is also cheap because there's no free chain to traverse and no memory fragmentation.
Manual memory management schemes are rarely this efficient because this is a relatively complex way of doing things that is unlikely to be reinvented for each application. These garbage collectors have evolved and been optimised over a longer period and with more effort than individual applications ever receive. It would be surprising and disappointing if they weren't much more performant.
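The cheap allocation described in the JVM answer above comes from bump-pointer allocation in a frequently emptied young-generation area. Below is a minimal C sketch of that idea; the `nursery` type and its functions are hypothetical, and the step where a real collector copies survivors out before resetting is omitted. Allocating is a bounds check plus a pointer increment, and emptying the area is a single assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical young-generation area with a single "top" offset. */
typedef struct {
    unsigned char *start;  /* region, assumed aligned for max_align_t */
    size_t capacity;       /* region size in bytes */
    size_t top;            /* offset of the next free byte */
} nursery;

void nursery_init(nursery *n, void *region, size_t capacity) {
    assert(region != NULL);
    n->start = region;
    n->capacity = capacity;
    n->top = 0;
}

/* No free list to search: round up, check the bound, bump the pointer. */
void *nursery_alloc(nursery *n, size_t size) {
    const size_t align = _Alignof(max_align_t);
    if (size > n->capacity) return NULL;
    size_t rounded = (size + align - 1) & ~(align - 1);  /* keep objects aligned */
    if (rounded > n->capacity - n->top) {
        return NULL;  /* full: a generational collector would run a minor GC here */
    }
    void *p = n->start + n->top;
    n->top += rounded;
    return p;
}

/* Once survivors have been copied elsewhere, the whole area is free again. */
void nursery_reset(nursery *n) {
    n->top = 0;
}
```

Consecutive allocations land next to each other, which is also why objects created together tend to start out close in memory.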
I doubt locality helps performance at all. Admittedly, small objects tend to be created at the same time in the same area of the heap (but this applies to C as well), and over time the small objects that remain will be compacted into a closely related area of the heap; it is supposedly this that gives you an advantage over C-style allocations. However, show me a program that uses just these small objects and I'll show you a program that does sod all. Show me a program that passes all objects that are to be used on the stack and I'll show you one that screams with speed.
Not having to de-allocate memory is a short-term performance benefit, since objects do not need to be freed individually. However, when the garbage collector does kick in, this benefit disappears. Usually, though, the collection occurs when nothing else is happening in the system (theoretically), so the cost is effectively nullified.
Compaction of the heap also helps allocation: all allocations can come from the start of the free space, and the memory manager doesn't have to walk the heap looking for the next free block of the right size. However, traditional systems can gain the same speed by using multiple fixed-block heaps. You always allocate from the heap for the block size you want, and every block in it has the same size, so walking the heap is just a matter of finding the first free block, and even that can be made cheap by using a bitmap.
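A fixed-block heap with a bitmap, as described above, might look like the following C sketch. `fixed_pool`, `pool_alloc` and `pool_free` are hypothetical names, and the pool holds only 64 blocks of 32 bytes so that a single 64-bit word serves as the bitmap. The loop that finds the lowest clear bit stands in for a count-trailing-zeros instruction (such as GCC's `__builtin_ctzll`) that a tuned allocator would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCKS 64  /* one bit per block in a single 64-bit word */
#define BLOCK_SIZE 32   /* every allocation from this pool is exactly this size */

typedef struct {
    uint64_t used;  /* bit i set => block i is allocated; zero means all free */
    _Alignas(max_align_t) unsigned char storage[POOL_BLOCKS][BLOCK_SIZE];
} fixed_pool;

/* Returns the first free block, or NULL if the pool is exhausted. */
void *pool_alloc(fixed_pool *p) {
    if (p->used == UINT64_MAX) return NULL;  /* every bit set */
    for (unsigned i = 0; i < POOL_BLOCKS; i++) {
        uint64_t bit = (uint64_t)1 << i;
        if ((p->used & bit) == 0) {  /* lowest clear bit = first free block */
            p->used |= bit;
            return p->storage[i];
        }
    }
    return NULL;  /* not reached: some bit was clear */
}

/* Freeing only clears the block's bit; no list needs updating. */
void pool_free(fixed_pool *p, void *ptr) {
    size_t offset = (size_t)((unsigned char *)ptr - &p->storage[0][0]);
    assert(offset % BLOCK_SIZE == 0);
    size_t index = offset / BLOCK_SIZE;
    assert(index < POOL_BLOCKS);
    p->used &= ~((uint64_t)1 << index);
}
```

A general-purpose allocator built this way keeps one such pool (or a chain of them) per size class and rounds each request up to the nearest class.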
So all in all, there isn't much of a benefit at all, except in benchmarks of course. In my experience the GC can and will jump in and slow you down dramatically at just the wrong time, usually when the system memory is getting filled because the user has done something like load a new page that required a lot of memory allocations.... which in turn required a collection.
It also has a tendency to use a lot of memory - 'memory is cheap' is the mantra of GC languages, so programs are written with this in mind, which means memory allocations are much more common, especially for temporaries and intermediate objects. Just look to StringBuilder classes for the evidence that this is well known. Strings may be 'solved' using this, but many other objects are still allocated with wild abandon. Any program that uses a lot of memory will find itself struggling with RAM IO - all that memory has to be brought into the CPU caches to be used, the more memory you use, the more IO your CPU MM will have to do and that can kill performance in the wrong circumstances.
In addition, when a GC occurs, you have to handle Finalised objects too, this isn't quite as bad as it used to be, but it can still halt your program while the finalisers are run.
Old Java GCs were dreadful for performance; a lot of research has made them significantly better, but they are still not perfect.
EDIT:
One more thing about locality: imagine creating an array and adding a few items, then doing a load of other allocations, then adding another item to the array. With a GC system, the added array element will not be localised; even after a compaction, each object in the array is stored as an individual item on the heap. This is why I think the locality issue is not as big a deal as it's made out to be. Now compare that to an array that is allocated as a buffer, with the objects allocated within the buffer space. That may require a realloc and copy to add a new item, but reading and modifying it is super fast.
One factor not yet mentioned is that, especially in multi-threaded systems, it can sometimes be difficult to predict with certainty which object will end up holding the last surviving reference to some other object. If one doesn't have to worry about object graphs that might contain cycles, it's possible to use reference counts for this purpose. Before copying a reference to an object, increment its reference count. Before destroying a reference to an object, decrement its reference count. If decrementing the reference count makes it hit zero, destroy the object as well as the reference. Such an approach works well on computers with only one CPU core; if only one thread can actually be running at any given time, one doesn't have to worry about what will happen if two threads try to adjust the same object's reference count simultaneously. Unfortunately, in systems with multiple CPU cores, any CPU that wants to adjust a reference count has to coordinate that action with all the other CPUs to ensure that two CPUs never hit the counter at the exact same time. Such coordination is "free" with a single CPU, but is relatively expensive in multi-core systems.
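The reference-counting scheme described above, made safe for multiple cores, can be sketched with C11 atomics. The `shared_obj` type and the `obj_*` functions are hypothetical; the point is that every retain and release is an atomic read-modify-write, which is exactly the per-reference inter-CPU coordination the paragraph refers to. The release uses acquire-release ordering so that writes made by other owners are visible before the last owner frees the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    atomic_size_t refs;  /* number of live references */
    int payload;
} shared_obj;

shared_obj *obj_new(int payload) {
    shared_obj *o = malloc(sizeof *o);
    if (o == NULL) return NULL;
    atomic_init(&o->refs, 1);  /* the creator holds the first reference */
    o->payload = payload;
    return o;
}

/* Call before copying a reference. Relaxed ordering suffices because the
   caller already holds a reference, so the object cannot vanish meanwhile. */
shared_obj *obj_retain(shared_obj *o) {
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
    return o;
}

/* Call before dropping a reference. Returns true if this was the last
   reference and the object has been freed. */
bool obj_release(shared_obj *o) {
    assert(atomic_load_explicit(&o->refs, memory_order_relaxed) > 0);
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) == 1) {
        free(o);
        return true;
    }
    return false;
}
```

Note that this sketch, like the scheme described, cannot reclaim cycles: two objects that retain each other never reach zero.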
When using a batch-mode garbage collector, object references may generally be freely assigned, copied, and destroyed, without inter-CPU coordination. It will periodically be necessary to have all the CPUs stop and run a garbage-collection cycle, but requiring all the CPUs to coordinate with each other once every few seconds or so is a lot cheaper than requiring them to coordinate with each other on every single object-reference assignment.
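For contrast, here is a minimal single-threaded mark-and-sweep sketch in C (all names hypothetical). Reference assignments are ordinary pointer stores with no counters or atomics; all the bookkeeping happens in `gc_collect`, which marks everything reachable from the roots and frees the rest. In a multi-threaded runtime that call is where all threads would be stopped, which is the occasional coordination described above. It also reclaims cycles, which plain reference counting cannot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object with up to two outgoing references. */
typedef struct obj {
    struct obj *left, *right;
    struct obj *next_alloc;  /* intrusive list of every allocated object */
    bool marked;
} obj;

static obj *all_objects = NULL;

obj *gc_new(void) {
    obj *o = calloc(1, sizeof *o);
    if (o == NULL) return NULL;
    o->next_alloc = all_objects;
    all_objects = o;
    return o;
}

/* Recursion keeps the sketch short; real collectors use an explicit mark stack. */
static void mark(obj *o) {
    if (o == NULL || o->marked) return;
    o->marked = true;
    mark(o->left);
    mark(o->right);
}

/* Stop-the-world collection: mark from the roots, then free everything
   unmarked. Returns the number of objects freed. */
size_t gc_collect(obj **roots, size_t nroots) {
    assert(roots != NULL || nroots == 0);
    for (size_t i = 0; i < nroots; i++) mark(roots[i]);
    size_t freed = 0;
    obj **link = &all_objects;
    while (*link != NULL) {
        obj *o = *link;
        if (o->marked) {
            o->marked = false;  /* reset for the next cycle */
            link = &o->next_alloc;
        } else {
            *link = o->next_alloc;  /* unlink and free unreachable object */
            free(o);
            freed++;
        }
    }
    return freed;
}
```

For example, if objects c and d point only at each other and nothing reachable from the roots points at them, one `gc_collect` call frees both, even though each still has an incoming reference.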