How can too much Virtual Memory negatively impact performance? [duplicate] - performance

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Does anyone have experience with using very large heaps, 12 GB or higher in Java?
Does the GC make the program unusable?
What GC params do you use?
Which JVM, Sun or BEA would be better suited for this?
Which platform, Linux or Windows, performs better under such conditions?
In the case of Windows is there any performance difference to be had between 64 bit Vista and XP under such high memory loads?

If your application is not interactive, and GC pauses are not an issue for you, there shouldn't be any problem for 64-bit Java to handle very large heaps, even in hundreds of GBs. We also haven't noticed any stability issues on either Windows or Linux.
However, when you need to keep GC pauses low, things get really nasty:
Forget the default throughput, stop-the-world GC. It will pause you application for several tens of seconds for moderate heaps (< ~30 GB) and several minutes for large ones (> ~30 GB). And buying faster DIMMs won't help.
The best bet is probably the CMS collector, enabled by -XX:+UseConcMarkSweepGC. The CMS garbage collector stops the application only for the initial marking phase and remarking phases. For very small heaps like < 4 GB this is usually not a problem, but for an application that creates a lot of garbage and a large heap, the remarking phase can take quite a long time - usually much less then full stop-the-world, but still can be a problem for very large heaps.
When the CMS garbage collector is not fast enough to finish operation before the tenured generation fills up, it falls back to standard stop-the-world GC. Expect ~30 or more second long pauses for heaps of size 16 GB. You can try to avoid this keeping the long-lived garbage production rate of you application as low as possible. Note that the higher the number of the cores running your application is, the bigger is getting this problem, because the CMS utilizes only one core. Obviously, beware there is no guarantee the CMS does not fall back to the STW collector. And when it does, it usually happens at the peak loads, and your application is dead for several seconds. You would probably not want to sign an SLA for such a configuration.
Well, there is that new G1 thing. It is theoretically designed to avoid the problems with CMS, but we have tried it and observed that:
Its throughput is worse than that of CMS.
It theoretically should avoid collecting the popular blocks of memory first, however it soon reaches a state where almost all blocks are "popular", and the assumptions it is based on simply stop working.
Finally, the stop-the-world fallback still exists for G1; ask Oracle, when that code is supposed to be run. If they say "never", ask them, why the code is there. So IMHO G1 really doesn't make the huge heap problem of Java go away, it only makes it (arguably) a little smaller.
If you have bucks for a big server with big memory, you have probably also bucks for a good, commercial hardware accelerated, pauseless GC technology, like the one offered by Azul. We have one of their servers with 384 GB RAM and it really works fine - no pauses, 0-lines of stop-the-world code in the GC.
Write the damn part of your application that requires lots of memory in C++, like LinkedIn did with social graph processing. You still won't avoid all the problems by doing this (e.g. heap fragmentation), but it would be definitely easier to keep the pauses low.

I am CEO of Azul Systems so I am obviously biased in my opinion on this topic! :) That being said...
Azul's CTO, Gil Tene, has a nice overview of the problems associated with Garbage Collection and a review of various solutions in his Understanding Java Garbage Collection and What You Can Do about It presentation, and there's additional detail in this article: http://www.infoq.com/articles/azul_gc_in_detail.
Azul's C4 Garbage Collector in our Zing JVM is both parallel and concurrent, and uses the same GC mechanism for both the new and old generations, working concurrently and compacting in both cases. Most importantly, C4 has no stop-the-world fall back. All compaction is performed concurrently with the running application. We have customers running very large (hundreds of GBytes) with worse case GC pause times of <10 msec, and depending on the application often times less than 1-2 msec.
The problem with CMS and G1 is that at some point Java heap memory must be compacted, and both of those garbage collectors stop-the-world/STW (i.e. pause the application) to perform compaction. So while CMS and G1 can push out STW pauses, they don't eliminate them. Azul's C4, however, does completely eliminate STW pauses and that's why Zing has such low GC pauses even for gigantic heap sizes.

We have an application that we allocate 12-16 Gb for but it really only reaches 8-10 during normal operation. We use the Sun JVM (tried IBMs and it was a bit of a disaster but that just might have been ignorance on our part...I have friends that swear by it--that work at IBM). As long as you give your app breathing room, the JVM can handle large heap sizes with not too much GC. Plenty of 'extra' memory is key.
Linux is almost always more stable than Windows and when it is not stable it is a hell of a lot easier to figure out why. Solaris is rock solid as well and you get DTrace too :)
With these kind of loads, why on earth would you be using Vista or XP? You are just asking for trouble.
We don't do anything fancy with the GC params. We do set the minimum allocation to be equal to the maximum so it is not constantly trying to resize but that is it.

I have used over 60 GB heap sizes on two different applications under Linux and Solaris respectively using 64-bit versions (obviously) of the Sun 1.6 JVM.
I never encountered garbage collection problems with the Linux-based application except when pushing up near the heap size limit. To avoid the thrashing problems inherent to that scenario (too much time spent doing garbage collection), I simply optimized memory usage throughout the program so that peak usage was about 5-10% below a 64 GB heap size limit.
With a different application running under Solaris, however, I encountered significant garbage-collection problems which made it necessary to do a lot of tweaking. This consisted primarily of three steps:
Enabling/forcing use of the parallel garbage collector via the -XX:+UseParallelGC -XX:+UseParallelOldGC JVM options, as well as controlling the number of GC threads used via the -XX:ParallelGCThreads option. See "Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning" for more details.
Extensive and seemingly ridiculous setting of local variables to "null" after they are no longer needed. Most of these were variables that should have been eligible for garbage collection after going out of scope, and they were not memory leak situations since the references were not copied. However, this "hand-holding" strategy to aid garbage collection was inexplicably necessary for some reason for this application under the Solaris platform in question.
Selective use of the System.gc() method call in key code sections after extensive periods of temporary object allocation. I'm aware of the standard caveats against using these calls, and the argument that they should normally be unnecessary, but I found them to be critical in taming garbage collection when running this memory-intensive application.
The three above steps made it feasible to keep this application contained and running productively at around 60 GB heap usage instead of growing out of control up into the 128 GB heap size limit that was in place. The parallel garbage collector in particular was very helpful since major garbage-collection cycles are expensive when there are a lot of objects, i.e., the time required for major garbage collection is a function of the number of objects in the heap.
I cannot comment on other platform-specific issues at this scale, nor have I used non-Sun (Oracle) JVMs.

12Gb should be no problem with a decent JVM implementation such as Sun's Hotspot.
I would advice you to use the Concurrent Mark and Sweep colllector ( -XX:+UseConcMarkSweepGC) when using a SUN VM.Otherwies you may face long "stop the world" phases, were all threads are stopped during a GC.
The OS should not make a big difference for the GC performance.
You will need of course a 64 bit OS and a machine with enough physical RAM.

I recommend also considering taking a heap dump and see where memory usage can be improved in your app and analyzing the dump in something such as Eclipse's MAT . There are a few articles on the MAT page on getting started in looking for memory leaks. You can use jmap to obtain the dump with something such as ...
jmap -heap:format=b pid

As mentioned above, if you have a non-interactive program, the default (compacting) garbage collector (GC) should work well. If you have an interactive program, and you (1) don't allocate memory faster than the GC can keep up, and (2) don't create temporary objects (or collections of objects) that are too big (relative to the total maximum JVM memory) for the GC to work around, then CMS is for you.
You run into trouble if you have an interactive program where the GC doesn't have enough breathing room. That's true regardless of how much memory you have, but the more memory you have, the worse it gets. That's because when you get too low on memory, CMS will run out of memory, whereas the compacting GCs (including G1) will pause everything until all the memory has been checked for garbage. This stop-the-world pause gets bigger the more memory you have. Trust me, you don't want your servlets to pause for over a minute. I wrote a detailed StackOverflow answer about these pauses in G1.
Since then, my company has switched to Azul Zing. It still can't handle the case where your app really needs more memory than you've got, but up until that very moment it runs like a dream.
But, of course, Zing isn't free and its special sauce is patented. If you have far more time than money, try rewriting your app to use a cluster of JVMs.
On the horizon, Oracle is working on a high-performance GC for multi-gigabyte heaps. However, as of today that's not an option.

If you switch to 64-bit you will use more memory. Pointers become 8 bytes instead of 4. If you are creating lots of objects this can be noticeable seeing as every object is a reference (pointer).
I have recently allocated 15GB of memory in Java using the Sun 1.6 JVM with no problems. Though it is all only allocated once. Not much more memory is allocated or released after the initial amount. This was on a Linux but I imagine the Sun JVM will work just as well on 64-bit Windows.

You should try running visualgc against your app. It´s a heap visualization tool that´s part of the jvmstat download at http://java.sun.com/performance/jvmstat/
It is a lot easier than reading GC logs.
It quickly helps you understand how the parts (generations) of the heap are working. While your total heap may be 10GB, the various parts of the heap will be much smaller. GCs in the Eden portion of the heap are relatively cheap, while full GCs in the old generation are expensive. Sizing your heap so that that the Eden is large and the old generation is hardly ever touched is a good strategy. This may result in a very large overall heap, but what the heck, if the JVM never touches the page, it´s just a virtual page, and doesn´t have to take up RAM.

A couple of years ago, I compared JRockit and the Sun JVM for a 12G heap. JRockit won, and Linux hugepages support made our test run 20% faster. YMMV as our test was very processor/memory intensive and was primarily single-threaded.

here's an article on gc FROM one of Java Champions --
http://kirk.blog-city.com/is_your_concurrent_collector_failing_you.htm
Kirk, the author writes
"Send me your GC logs
I'm currently interested in studying Sun JVM produced GC logs. Since these logs contain no business relevent information it should be ease concerns about protecting proriatary information. All I ask that with the log you mention the OS, complete version information for the JRE, and any heap/gc related command line switches that you have set. I'd also like to know if you are running Grails/Groovey, JRuby, Scala or something other than or along side Java. The best setting is -Xloggc:. Please be aware that this log does not roll over when it reaches your OS size limit. If I find anything interesting I'll be happy to give you a very quick synopsis in return. "

An article from Sun on Java 6 can help you: https://www.oracle.com/java/technologies/javase/troubleshooting-javase.html

The max memory that XP can address is 4 gig(here). So you may not want to use XP for that(use a 64 bit os).

sun has had an itanium 64-bit jvm for a while although itanium is not a popular destination. The solaris and linux 64-bit JVMs should be what you should be after.
Some questions
1) is your application stable ?
2) have you already tested the app in a 32 bit JVM ?
3) is it OK to run multiple JVMs on the same box ?
I would expect the 64-bit OS from windows to get stable in about a year or so but until then, solaris/linux might be better bet.

Related

How fast is the go 1.5 gc with terabytes of RAM?

Java cannot use terabytes of RAM because the GC pause is way too long (minutes). With the recent update to the Go GC, I'm wondering if its GC pauses are short enough for use with huge amounts of RAM, such as a couple of terabytes.
Are there any benchmarks of this yet? Can we use a garbage-collected language with this much RAM now?
tl;dr:
You can't use TBs of RAM with a single Go process right now. Max is 512 GB on Linux, and most that I've seen tested is 240 GB.
With the current background GC, GC workload tends to be more important than GC pauses.
You can understand GC workload as pointers * allocation rate / spare RAM. Of apps using tons of RAM, only those with few pointers or little allocation will have a low GC workload.
I agree with inf's comment that huge heaps are worth asking other folks about (or testing). JimB notes that Go heaps have a hard limit of 512 GB right now, and 18 240 GB is the most I've seen tested.
Some things we know about huge heaps, from the design document and the GopherCon 2015 slides:
The 1.5 collector doesn't aim to cut GC work, just cut pauses by working in the background.
Your code is paused while the GC scans pointers on the stack and in globals.
The 1.5 GC has a short pause on a GC benchmark with a roughly 18GB heap, as shown by the rightmost yellow dot along the bottom of this graph from the GopherCon talk:
Folks running a couple production apps that initially had about 300ms pauses reported drops to ~4ms and ~20ms. Another app reported their 95th percentile GC time went from 279ms to ~10ms.
Go 1.6 added polish and pushed some of the remaining work to the background. As a result, tests with heaps up to a bit over 200GB still saw a max pause time of 20ms, as shown in a slide in an early 2016 State of Go talk:
The same application that had 20ms pause times under 1.5 had 3-4ms pauses under 1.6, with about an 8GB heap and 150M allocations/minute.
Twitch, who use Go for their chat service, reported that by Go 1.7 pause times had been reduced to 1ms with lots of running goroutines.
1.8 took stack scanning out of the stop-the-world phase, bringing most pauses well under 1ms, even on large heaps. Early numbers look good. Occasionally applications still have code patterns that make a goroutine hard to pause, effectively lengthening the pause for all other threads, but generally it's fair to say the GC's background work is now usually much more important than GC pauses.
Some general observations on garbage collection, not specific to Go:
The frequency of collections depends on how quickly you use up the RAM you're willing to give to the process.
The amount of work each collection does depends in part on how many pointers are in use.
(That includes the pointers within slices, interface values, strings, etc.)
Rephrased, an application accessing lots of memory might still not have a GC problem if it only has a few pointers (e.g., it handles relatively few large []byte buffers), and collections happen less often if the allocation rate is low (e.g., because you applied sync.Pool to reuse memory wherever you were chewing through RAM most quickly).
So if you're looking at something involving heaps of hundreds of GB that's not naturally GC-friendly, I'd suggest you consider any of
writing in C or such
moving the bulky data out of the object graph. For example, you could manage data in an embedded DB like bolt, put it in an outside DB service, or use something like groupcache or memcache if you want more of a cache than a DB
running a set of smaller-heap'd processes instead of one big one
just carefully prototyping, testing, and optimizing to avoid memory issues.
The new Java ZGC garbage collector can now use 16 Terrabytes of memory and garbage collect in under 10ms.

is the jvm faster under load?

Lots of personal experience, anecdotal evidence, and some rudimentary analysis suggests that a Java server (running, typically, Oracle's 1.6 JVM) has faster response times when it's under a decent amount of load (only up to a point, obviously).
I don't think this is purely hotspot, since response times slow down a bit again when the traffic dies down.
In a number of cases we can demonstrate this by averaging response times from server logs ... in some cases it's as high as 20% faster, on average, and with a smaller standard deviation.
Can anyone explain why this is so? Is it likely a genuine effect, or are the averages simply misleading? I've seen this for years now, through several jobs, and tend to state it as a fact, but have no explanation for why.
Thanks,
Eric
EDIT a fairly large edit for wording and adding more detail throughout.
A few thoughts:
Hotspot kicks in when a piece of code is being executed significantly more than other pieces (it's the hot spot of the program). This makes that piece of code significantly faster (for the normal path) from that point forward. The rate of call after the hotspot compilation is not important, so I don't think this is causing the effect you are mentioning.
Is the effect real? It's very easy to trick yourself with statistics. Not saying you are, but be sure that all your runs are included in the result, and that all other effects (such as other programs, activity, and your monitoring program are the same in all cases. I have more than one had my monitoring program, such as top, cause a difference in behaviour). On one occasion, the performance of the application went up appreciably when the caches warmed up on the database - there was memory pressure from other applications on the same DB instance.
The Operating System and/or CPU may well be involved. The OS and CPU both actively and passively do things to improve the responsiveness of the main program as it moves from being mainly running to being mainly waiting for I/O and vice versa, including:
OS paging memory to disk while it's not being used, and back to RAM when the program is running
OS will cache frequently used disk blocks, which again may improve the application performance
CPU instruction and memory caches fill with the active program's instruction and data
Java applications particularly sensitive to memory paging effects because:
A typical Java application server will pre-allocate almost all free memory to Java. The large memory makes the application inherently more sensitive to memory effects
The generational garbage collector used to manage Java memory ends up creating new objects over a lot of pages, so each request to the application will need more page requests than in other languages. (this is true principally for 'new' objects that have not been through many garbage collections. Objects promoted to the permanent generation are actually very compactly stored)
As most available physical memory is allocated on the system, there is always a pressure on memory, and the largest, least recently run application is a perfect candidate to be pages out.
With these considerations, there is much more probability that there will be page misses and therefore a performance hit than environments with smaller memory requirements. These will be particularly manifest after Java has been idle for some time.
If you use Solaris or Mac, the excellent dTrace can trace memory and disk paging specific to an application. The JVM has numerous dTrace hooks that can be used as triggers to start and stop page monitoring.
On Solaris, you can use large memory pages (even over 1GB in size) and pin them to RAM so they will never be paged out. This should eliminate the memory page problem stated above. Remember to leave a good chunk of free memory for disk caching and for other system/maintenance/backup/management apps. I am sure that other OSes support similar features.
TL/DR: The currently running program in modern operating systems will appear to run faster after a few seconds as the OS brings the program and data pages back from disk, places frequently used disk pages in disk cache and the OS instruction and data caches will tend to be "warmer" for the main program. This effect is not unique to the JVM but is more visible due to the memory requirements of typical Java applications and the garbage collection memory model.

many-core CPU's: Programming techniques to avoid disappointing scalability

We've just bought a 32-core Opteron machine, and the speedups we get are a little disappointing: beyond about 24 threads we see no speedup at all (actually gets slower overall) and after about 6 threads it becomes significantly sub-linear.
Our application is very thread-friendly: our job breaks down into about 170,000 little tasks which can each be executed separately, each taking 5-10 seconds. They all read from the same memory-mapped file of size about 4Gb. They make occasional writes to it, but it might be 10,000 reads to each write - we just write a little bit of data at the end of each of the 170,000 tasks. The writes are lock-protected. Profiling shows that the locks are not a problem. The threads use a lot of JVM memory each in non-shared objects and they make very little access to shared JVM objects and of that, only a small percentage of accesses involve writes.
We're programming in Java, on Linux, with NUMA enabled. We have 128Gb RAM. We have 2 Opteron CPU's (model 6274) of 16 cores each. Each CPU has 2 NUMA nodes. The same job running on an Intel quad-core (i.e. 8 cores) scaled nearly linearly up to 8 threads.
We've tried replicating the read-only data to have one-per-thread, in the hope that most lookups can be local to a NUMA node, but we observed no speedup from this.
With 32 threads, 'top' shows the CPU's 74% "us" (user) and about 23% "id" (idle). But there are no sleeps and almost no disk i/o. With 24 threads we get 83% CPU usage. I'm not sure how to interpret 'idle' state - does this mean 'waiting for memory controller'?
We tried turning NUMA on and off (I'm referring to the Linux-level setting that requires a reboot) and saw no difference. When NUMA was enabled, 'numastat' showed only about 5% of 'allocation and access misses' (95% of cache misses were local to the NUMA node). [Edit:] But adding "-XX:+useNUMA" as a java commandline flag gave us a 10% boost.
One theory we have is that we're maxing out the memory controllers, because our application uses a lot of RAM and we think there are a lot of cache misses.
What can we do to either (a) speed up our program to approach linear scalability, or (b) diagnose what's happening?
Also: (c) how do I interpret the 'top' result - does 'idle' mean 'blocked on memory controllers'? and (d) is there any difference in the characteristics of Opteron vs Xeon's?
I also have a 32 core Opteron machine, with 8 NUMA nodes (4x6128 processors, Mangy Cours, not Bulldozer), and I have faced similar issues.
I think the answer to your problem is hinted at by the 2.3% "sys" time shown in top. In my experience, this sys time is the time the system spends in the kernel waiting for a lock. When a thread can't get a lock it then sits idle until it makes its next attempt. Both the sys and idle time are a direct result of lock contention. You say that your profiler is not showing locks to be the problem. My guess is that for some reason the code causing the lock in question is not included in the profile results.
In my case a significant cause of lock contention was not the processing I was actually doing but the work scheduler that was handing out the individual pieces of work to each thread. This code used locks to keep track of which thread was doing which piece of work. My solution to this problem was to rewrite my work scheduler avoiding mutexes, which I have read do not scale well beyond 8-12 cores, and instead use gcc builtin atomics (I program in C on Linux). Atomic operations are effectively a very fine grained lock that scales much better with high core counts. In your case if your work parcels really do take 5-10s each it seems unlikely this will be significant for you.
I also had problems with malloc, which suffers horrible lock issues in high core count situations, but I can't, off the top of my head, remember whether this also led to sys & idle figures in top, or whether it just showed up using Mike Dunlavey's debugger profiling method (How can I profile C++ code running in Linux?). I suspect it did cause sys & idle problems, but I draw the line at digging through all my old notes to find out :) I do know that I now avoid runtime mallocs as much as possible.
My best guess is that some piece of library code you are using implements locks without your knowledge, is not included in your profiling results, and is not scaling well to high core-count situations. Beware memory allocators!
I'm sure the answer will lie in a consideration of the hardware architecture. You have to think of multi core computers as if they were individual machines connected by a network. In fact that's all that Hypertransport and QPI are.
I find that to solve these scalability problems you have to stop thinking in terms of shared memory and start adopting the philosophy of Communicating Sequential Processes. It means thinking very differently, ie imagine how you would write the software if your hardware was 32 single core machines connected by a network. Modern (and ancient) CPU architectures are not designed to give unfettered scaling of the sort you're after. They are designed to allow many different processes to get on with processing their own data.
Like everything else in computing these things go in fashions. CSP dates back to the 1970s, but the very modern and Java derived Scala is a popular embodiment of the concept. See this section on Scala concurrency on Wikipedia.
What the philosophy of CSP does is force you to design a data distribution scheme that fits your data and the problem you're solving. That's not necessarily easy, but if you manage it then you have a solution that will scale very well indeed. Scala may make it easier to develop.
Personally I do everything in CSP and in C. It's allowed me to develop a signal processing application that scales perfectly linearly from 8 cores to several thousand cores (the limit being how big my room is).
The first thing you're going to have to do is actually use NUMA. It isn't a magic setting that you turn on, you have to exploit it in your software's architecture. I don't know about Java, but in C one would bind a memory allocation to a specific core's memory controller (aka memory affinity), and similarly for threads (core affinity) in cases where the OS doesn't get the hint.
I presume that your data doesn't break down into 32 neat, discrete chunks? It's difficult to give advice without knowing exactly the data flows implicit in your program. But think about it in terms of data flow. Draw it out even; Data Flow Diagrams are useful for this (another ancient graphical formal notation). If your picture shows all your data going through a single object (eg through a single memory buffer) then it's going to be slow...
I assume you have optimized your locks, and synchronization made a minimum. In such a case, it still depends a lot on what libraries you are using to program in parallel.
One issue that can happen even if you have no synchronization issue, is memory bus congestion. This is very nasty and difficult to get rid of.
All I can suggest is somehow make your tasks bigger and create fewer tasks. This depends highly on the nature of your problem. Ideally you want as many tasks as the number of cores/threads, but this is not easy (if possible) to achieve.
Something else that can help is to give more heap to your JVM. This will reduce the need to run Garbage Collector frequently, and speeds up a little.
does 'idle' mean 'blocked on memory controllers'
No. You don't see that in top. I mean if the CPU is waiting for memory access, it will be shown as busy. If you have idle periods, it is either waiting for a lock, or for IO.
I'm the Original Poster. We think we've diagnosed the issue, and it's not locks, not system calls, not memory bus congestion; we think it's level 2/3 CPU cache contention.
To reiterate, our task is embarrassingly parallel so it should scale well. However, one thread has a large amount of CPU cache it can access, but as we add more threads, the amount of CPU cache each process can access gets lower and lower (the same amount of cache divided by more processes). Some levels on some architectures are shared between cores on a die, some are even shared between dies (I think), and it may help to get "down in the weeds" with the specific machine you're using, and optimise your algorithms, but our conclusion is that there's not a lot we can do to achieve the scalability we thought we'd get.
We identified this as the cause by using 2 different algorithms. The one which accesses more level 2/3 cache scales much worse than the one which does more processing with less data. They both make frequent accesses to the main data in main memory.
If you haven't tried that yet: Look at hardware-level profilers like Oracle Studio has (for CentOS, Redhat, and Oracle Linux) or if you are stuck with Windows: Intel VTune. Then start looking at operations with suspiciously high clocks per instruction metrics. Suspiciously high mean a lot higher than the same code on a single-numa, single-L3-cache machine (like current Intel desktop CPUs).

Memory management for intentionally memory intensive applications

Note: I am aware of the question Memory management in memory intensive application, however that question appears to be about applications that make frequent memory allocations, whereas my question is about applications intentionally designed to consume as much physical memory as is safe.
I have a server application that uses large amounts of memory in order to perform caching and other optimisations (think SQL Server). The application runs on a dedicated machine, and so can (and should) consume as much memory as it wants / is able to in order to speed up and increase throughput and response times without worry of impacting other applications on the system.
The trouble is that if memory usage is underestimated, or if load increases its possible to end up with nasty failures as memory allocations fail - in this situation obviously the best thing to do is to free up memory in order to prevent the failure at the expense of performance.
Some assumptions:
The application is running on a dedicated machine
The memory requirements of the application exceed the physical memory on the machine (that is, if additional memory was available to the application it would always be able to use that memory to in some way improve response times or throughput)
The memory is effectively managed in a way such that memory fragmentation is not an issue.
The application knows what memory can be safely freed, and what memory should be freed first for the least performance impact.
The app runs on a Windows machine
My question is - how should I handle memory allocations in such an application? In particular:
How can I predict whether or not a memory allocation will fail?
Should I leave a certain amount of memory free in order to ensure that core OS operations remain responsive (and don't in that way adversely impact the applications performance), and how can I find out how much memory that is?
The core objective is to prevent failures as a result of using too much memory, while at the same time using up as much memory as possible.
I'm a C# developer, however my hope is that the basic concepts for any such app are the same regardless of the language.
In linux, the memory usage percentage is divided into following levels.
0 - 30% - no swapping
30 - 60% - swap dirty pages only
60 - 90% - swap clean pages also based on LRU policy.
90% - Invoke OOM(Out of memory) killer and kill the process consuming maximum memory.
check this - http://linux-mm.org/OOM_Killer
In think windows might have similar policy, so you can check the memory stats and make sure you never get to the max threshold.
One way to stop consuming more memory is to go to sleep and give more time for memory cleanup threads to run.
That is a very good question, and bound to be subjective as well, because the very nature of the fundamental of C# is that all memory management is done by the runtime, i.e. Garbage Collector. The Garbage Collector is a non-deterministic entity that manages and sweeps the memory for reclaiming, depending on how often the memory gets fragmented, the GC will kick in hence to know in advance is not easy thing to do.
To properly manage the memory sounds tedious but common sense applies, such as the using clause to ensure an object gets disposed. You could put in a single handler to trap the OutOfMemory Exception but that is an awkward way, since if the program has run out of memory, does the program just seize up, and bomb out, or should it wait patiently for the GC to kick in, again determining that is tricky.
The load of the system can adversely affect the GC's job, almost to a point of a Denial of Service, where everything just grind to a halt, again, since the specifications of the machine, or what is the nature of that machine's job is unknown, I cannot answer it fully, but I'll assume it has loads of RAM..
In essence, while an excellent question, I think you should not worry about it and leave it to the .NET CLR to handle the memory allocation/fragmentation as it seems to do a pretty good job.
Hope this helps,
Best regards,
Tom.
Your question reminds me of an old discussion "So what's wrong with 1975 programming ?". The architect of varnish-cache argues, that instead of telling the OS to get out of the way and manage all memory yourself, you should rather cooperate with the OS and let it understand what you intend to do with memory.
For example, instead of simply reading data from disk, you should use memory-mapped files. This allows the OS to apply its LRU algorithm to write-back data to disk when memory becomes scarce. At the same time, as long as there is enough memory, your data will stay in memory. Thus, your application may potentially use all memory, without risking getting killed.

Is it reasonable for modern applications to consume large amounts of memory?

Applications like Microsoft Outlook and the Eclipse IDE consume RAM, as much as 200MB. Is it OK for a modern application to consume that much memory, given that few years back we had only 256MB of RAM? Also, why this is happening? Are we taking the resources for granted?
Is it acceptable when most people have 1 or 2 gigabytes of RAM on their PCS?
Think of this - although your 200mb is small and nothing to worry about given a 2Gb limit, everyone else also has apps that take masses of RAM. Add them together and you find that the 2Gb I have very quickly gets all used up. End result - your app appears slow, resource hungry and takes a long time to startup.
I think people will start to rebel against resource-hungry applications unless they get 'value for ram'. you can see this starting to happen on servers, as virtualised systems gain popularity - people are complaining about resource requirements and corresponding server costs.
As a real-world example, I used to code with VC6 on my old 512Mb 1.7GHz machine, and things were fine - I could open 4 or 5 copies along with Outlook, Word and a web browser and my machine was responsive.
Today I have a dual-processor 2.8Ghz server box with 3Gb RAM, but I cannot realistically run more than 2 copies of Visual Studio 2008, they both take ages to start up (as all that RAM still has to be copied in and set up, along with all the other startup costs we now have), and even Word take ages to load a document.
So if you can reduce memory usage you should. Don't think that you can just use whatever bloated framework/library/practice you want with impunity.
http://en.wikipedia.org/wiki/Moore%27s_law
also:
http://en.wikipedia.org/wiki/Wirth%27s_law
There's a couple of things you need to think about.
1/ Do you have 256M now? I wouldn't think so - my smallest memory machine is 2G so a 200M application is not much of a problem.
2a/ That 200M you talk about might not be "real" memory. It may just be address space in which case it might not all be in physical memory at once. Some bits may only be pulled in to physical memory when you choose to do esoteric things.
2b/ It may also be shared between other processes (such as a DLL). This means it could be only held in physical memory as one copy but be present in the address space of many processes. That way, the usage is amortized over those many processes. Both 2a and 2b depend on where your figure of 200M actually came from (which I don't know and, running Linux, I'm unlikel to find out without you telling me :-).
3/ Even if it is physical memory, modern operating systems aren't like the old DOS or Windows 3.1 - they have virtual memory where bits of applications can be paged out (data) or thrown away completely (code, since it can always reload from the executable). Virtual memory gives you the ability to use far more memory than your actual physical memory.
Many modern apps will take advantage of the existance of more memory to cache more. Some like firefox and SQL server have explicit settings for how much memory they will use. In my opinion, it's foolish to not use available memory - what's the point of having 2GB of RAM if your apps all sit around at 10MB leaving 90% of your physical memory unused. Of course, if your app does use caching like this, it better be good at releasing that memory if page file thrashing starts, or allow the user to limit the cache size manually.
You can see the advantage of this by running a decent-sized query against SQL server. The first time you run the query, it may take 10 seconds. But when you run that exact query again, it takes less than a second - why? The query plan was only compiled the first time and cached for use later. The database pages that needed to be read were only loaded from disk the first time - the second time, they were still cached in RAM. If done right, the more memory you use for caching (until you run into paging) the faster you can re-access data. You'll see the same thing in large documents (e.g. in Word and Acrobat) - when you scroll to new areas of a document, things are slow, but once it's been rendered and cached, things speed up. If you don't have enough memory, that cache starts to get overwritten and going to the old parts of the document gets slow again.
If you can make good use of the RAM, it is your responsability to use it.
Yes, it is perfectly normal. Also something big was changed since 256MB were normal... and do not forget that before that 640Kb were supposed to be enough for everybody!
Now most software solutions are build with a garbage collector: C#, Java, Ruby, Python... everybody love them because certainly development can be faster, however there is one glitch.
The same program can be memory leak free with either manual or automatic memory deallocation. However in the second case it is likely for the memory consumption to grow. Why? In the first case memory is deallocated and kept clean immediately after something becomes useless (garbage). However it takes time and computing power to detect that automatically, hence most collectors (except for reference counting) wait for garbage to accumulate in order to make worth the cost of the exploration. The more you wait the more garbage you can sweep with the cost of one blow, but more memory is needed to accumulate that garbage. If you try to force the collector constantly, your program would spend more time exploring memory than working on your problems.
You can be completely sure than as long as programmers get more resources, they will sacrifice them using heavier tools in exchange for more freedom, abstraction and faster development.
A few years ago 256 MB was the norm for a PC, then Outlook consumed about 30 - 35 MB or so of memory, that's around 10% of the available memory, Now PC's have 2 GB or more as a norm, and outlook consumes 200 MB of memory, that's about 10% also.
The 1st conclusion: as more memory is available applications use more of it.
The 2nd conclusion: no matter what time frame you pick there are applications that are true memory hogs (like Outlook) and applications that are very efficient memory wise.
The 3rd conclusion: memory consumption of a app can't go down with time, else 640K would have been enough even today.
It completely depends on the application.

Resources