Configure GWT compiler to use less memory? - maven

I am trying to build a large Maven-GWT project on a virtual host, which has a limited amount of RAM and cannot use swap space.
The GWT compile stage (where it computes permutations) uses a massive amount of CPU and memory, and I was wondering if there is any way I could impose a limit on how much of each it uses, even if that makes the compile take much longer.
Thanks

If you are using more than one worker thread, decrease it to just one worker thread - that will decrease the memory required. However, the compile will be correspondingly slower. Otherwise, there isn't much you can do to reduce the memory requirements. Setting -Xmx to a lower number will work too, but it will cause an OutOfMemoryError if it is too low. I think about 256m is the minimum, though 128m works for most small to medium-sized projects.
Add -Dgwt-plugin.localWorkers="1" and -Dgwt-plugin.extraJvmArgs="-Xmx128m -Xms16m" to your MAVEN_OPTS, and tweak those numbers until it works nicely.
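If you are using the gwt-maven-plugin, those two settings can also be pinned in the plugin's configuration instead of MAVEN_OPTS. A minimal sketch, assuming the org.codehaus.mojo gwt-maven-plugin and its localWorkers/extraJvmArgs parameters (check the docs of your plugin version for the exact element names):

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>gwt-maven-plugin</artifactId>
  <configuration>
    <!-- assumed parameter names; verify against your plugin version -->
    <localWorkers>1</localWorkers>
    <extraJvmArgs>-Xmx128m -Xms16m</extraJvmArgs>
  </configuration>
</plugin>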

Related

How could I make a Go program use more memory? Is that recommended?

I'm looking for an option similar to -Xmx in Java, that is, a way to set the maximum runtime memory that my Go application can utilise. I was checking the runtime package, but I'm not entirely sure if that is the way to go.
I tried setting something like this with func SetMaxStack() (likely very stupid):
debug.SetMaxStack(5000000000) // bytes
model.ExcelCreator()
The reason why I am looking to do this is that there is currently an ample amount of RAM available, but the application won't consume more than 4-6%. I might be wrong here, but it could be that garbage collection is being forced to happen much more often than needed, leading to a performance issue.
What I'm doing
Getting a large dataset from an RDBMS and processing it to write out to Excel.
Another reason why I am looking for such an option is to limit the maximum usage of RAM on the server where it will be ultimately deployed.
Any hints on this would be greatly appreciated.
The current stable Go (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
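For example, a minimal sketch of turning this knob; the environment variable is the usual route, and debug.SetGCPercent is the run-time equivalent:

package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	// GOGC is normally set in the environment before the program starts, e.g.
	//   GOGC=200 ./myapp
	// which lets the heap grow to roughly twice the live data between
	// collections (the default is 100).
	fmt.Println("GOGC from environment:", os.Getenv("GOGC"))

	// The same knob can be changed at run time; SetGCPercent returns the
	// previous setting.
	prev := debug.SetGCPercent(200)
	fmt.Println("previous GC target percentage:", prev)
}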
Having said that, I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean that normally there's nothing wrong with your program not consuming lots of RAM - if the collector happens to sweep the garbage fast enough without hampering performance in a significant way, I see no reason to worry: Go's GC is one of the most intensely tuned parts of the runtime, and in fact it works very well.
Hence you may try to take another route:

1. Profile the memory allocations of your program.
2. Analyze the profile and try to figure out where the hot spots are, and whether (and how) they can be optimized. You might start here and continue with the gazillion other intros to this stuff.
3. Optimize. Typically this amounts to making certain buffers reusable across different calls to the same function(s) consuming them, preallocating slices instead of growing them gradually, using sync.Pool where deemed useful, etc. (a sketch follows below).

Such measures may actually increase the memory truly used (that is, by live objects, as opposed to garbage), but they may lower the pressure on the GC.
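A rough sketch of the kind of reuse meant in step 3 (the formatRow helper and the row count are made up for illustration): a sync.Pool of scratch buffers plus a preallocated result slice.

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// A pool of reusable scratch buffers: instead of allocating a fresh buffer
// for every record (and generating garbage), callers take one from the pool
// and put it back when done.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// formatRow is a stand-in for whatever per-record formatting the exporter does.
func formatRow(id int) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset()
	fmt.Fprintf(buf, "row-%d", id)
	return buf.String()
}

func main() {
	// Preallocate the result slice instead of growing it append by append.
	const rows = 100000
	out := make([]string, 0, rows)
	for i := 0; i < rows; i++ {
		out = append(out, formatRow(i))
	}
	fmt.Println(len(out), "rows formatted")
}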

maxed CPU performance - stack all tasks or aim for less than 100%?

I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed out core (such as overhead, swapping across tasks) or is it faster to queue everything?
Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU wastes time at every time slice by going into the kernel, saving all the registers, swapping the memory mappings and starting the next task. Then that task has to reload its CPU caches, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.
Yes, it's called queueing theory: https://en.wikipedia.org/wiki/Queueing_theory. There are many different models (https://en.wikipedia.org/wiki/Category:Queueing_theory) for a range of different problems; I'd suggest you scan them, pick the one most applicable to your workload, and then read up on how to avoid the worst outcomes for that model - or pick a different, better model for dispatching your workload.
Although the graph at this link, https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png, applies to traffic, it will give you an idea of what happens to response times as your CPU utilisation increases. It shows that you'll reach an inflection point after which things get slower and slower.
More work arrives than can be processed, and subsequent work waits longer and longer before it can be dispatched.
The more cores you have, the further to the right you push the inflection point, but the faster things go bad once you reach it.
I would also note that unless you've got some really serious cooling in place you are going to cook your CPU. Depending on its design it will either slow itself down, making your problem worse, or you'll trigger its thermal overload protection.
So a simplistic design for 8 cores would be: 1 thread to manage things and add tasks to the work queue, and 7 threads pulling tasks from the work queue. If the tasks need to be performed within a certain time, you can add a TimeToLive value so that they can be discarded rather than executed needlessly. As you are almost certainly running your application on an OS that uses a pre-emptive threading model, consider things like using processor affinity where possible, because as @Zan-Lynx says, task/context switching hurts. Be careful not to try to rebuild your OS's thread management, as you'll probably wind up in conflict with it.
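A minimal sketch of that design in Go (goroutines standing in for threads; the Go standard library does not expose processor affinity directly, so that part is left out):

package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

type task struct {
	id       int
	deadline time.Time // a simple TimeToLive: stale tasks are dropped
}

func main() {
	workers := runtime.NumCPU() - 1 // leave one core for the manager and the OS
	if workers < 1 {
		workers = 1
	}

	queue := make(chan task, 64) // the work queue
	var wg sync.WaitGroup

	// Worker goroutines pull tasks from the shared queue until it is closed.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for t := range queue {
				if time.Now().After(t.deadline) {
					continue // expired: discard rather than execute needlessly
				}
				// ... CPU-intensive work for t would go here ...
				fmt.Printf("worker %d finished task %d\n", w, t.id)
			}
		}(w)
	}

	// The manager: enqueue the 12 tasks, then close the queue to signal
	// that no more work is coming.
	for i := 0; i < 12; i++ {
		queue <- task{id: i, deadline: time.Now().Add(time.Minute)}
	}
	close(queue)
	wg.Wait()
}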
tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At an app level they each process a thousand customer records or whatever. That is fixed; it is a constant no matter what happens on the hardware.
At the language level, again it is fixed: C++, Java, or Python will execute a fixed number of app instructions or bytecodes. We'll gloss over GC overhead here, and page fault and scheduling details.
At the assembly level, again it is fixed, some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions, you only care about how long it takes to execute those instructions. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components to implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it be a miss in last-level cache so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make elapsed time quite noticeably longer.
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.
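A self-contained Go sketch you could use to see this on your own hardware: it runs 12 memory-touching tasks at different concurrency levels and reports the wall-clock time. The buffer size and pass count are arbitrary; the actual numbers depend entirely on your cache sizes and memory bandwidth.

package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

var sink int64 // keeps the compiler from discarding the work

// touch strides through a buffer larger than a typical per-core share of
// last-level cache, so concurrent copies compete for that cache.
func touch(buf []int64) int64 {
	var sum int64
	for pass := 0; pass < 16; pass++ {
		for i := 0; i < len(buf); i += 8 { // one 64-byte cache line per step
			sum += buf[i]
		}
	}
	return sum
}

// run executes `tasks` jobs with at most `parallel` running at once and
// returns the elapsed wall-clock time.
func run(tasks, parallel int) time.Duration {
	sem := make(chan struct{}, parallel)
	var wg sync.WaitGroup
	start := time.Now()
	for t := 0; t < tasks; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}
			defer func() { <-sem }()
			buf := make([]int64, 4<<20) // ~32 MB working set per task
			atomic.AddInt64(&sink, touch(buf))
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	cores := runtime.NumCPU()
	fmt.Println("cores:", cores)
	for _, p := range []int{cores, 2 * cores} {
		fmt.Printf("12 tasks at parallelism %2d: %v\n", p, run(12, p))
	}
}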

Cache miss in an Out of order processor

Imagine an application is running on an out-of-order processor and it has a lot of last-level cache (LLC) misses (more than 70%). Do you think that if we decrease the frequency of the processor and set it to a smaller value, the execution time of the application will increase significantly, or will it not be affected much? Could you please explain your answer.
Thanks and regards
As is the case with most micro-architectural features, the safe answer would be - "it might, and it might not - depends on the exact characteristics of your application".
Take, for example, a workload that runs through a large graph that resides in memory - each node needs to be fetched and processed before the next node can be chosen. If you reduce the frequency you would harm the execution phase, which is latency-critical since the next set of memory accesses depends on it.
On the other hand, a workload that is bandwidth-bounded (i.e.- performs as fast as the system memory BW limits), is probably not fully utilizing the CPU and would therefore not be harmed as much.
Basically the question boils down to how well your application utilizes the CPU, or rather - where between the CPU and the memory, can you find the performance bottleneck.
By the way, note that even if reducing the frequency does impact the execution time, it could still be beneficial for your power/performance ratio, depending on where along the power/perf curve you're located and on the exact values.
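A hedged Go sketch of the two extremes described above: a dependent pointer chase, where every load waits on the previous miss (latency-bound), versus an independent sequential scan of the same data, which the prefetcher and the out-of-order engine can overlap (closer to bandwidth-bound).

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Build a random cycle over ~8M int32 slots (~32 MB), bigger than most
	// last-level caches, so nearly every dependent load is an LLC miss.
	const n = 1 << 23
	next := make([]int32, n)
	perm := rand.Perm(n)
	for i := 0; i < n; i++ {
		next[perm[i]] = int32(perm[(i+1)%n])
	}

	// Dependent chain: each load's address comes from the previous load, so
	// the misses cannot be overlapped.
	start := time.Now()
	idx := int32(0)
	for i := 0; i < n; i++ {
		idx = next[idx]
	}
	fmt.Println("dependent pointer chase:", time.Since(start), idx)

	// Independent sequential scan over the same data: misses overlap and the
	// hardware prefetcher helps, so memory bandwidth is the limit instead.
	start = time.Now()
	var sum int64
	for i := 0; i < n; i++ {
		sum += int64(next[i])
	}
	fmt.Println("independent sequential scan:", time.Since(start), sum)
}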

cache memory size limitations

I know that cache memory stores frequently used data to speed up process execution, instead of fetching it from main memory - which is slower - every time. Its size is always small compared to main memory because it is an expensive technology and because the working set actually being processed at any one time is much smaller than the whole set of data held in main memory.
But are there any limitations or constraints on cache memory size for a given CPU speed or main memory size? Theoretically, if we increased the cache memory a lot, would that have an adverse effect, or would it just be a wasted increase?
Indeed, the performance gain becomes less and less significant beyond a cache size of 64 KB.
Here is a graph from Wikipedia showing that, regardless of the set-associativity scheme, the miss rate decreases only slightly as the cache size increases past 64 KB.
Caches are small because the silicon used to build them is quite expensive and, especially on CISC-type CPUs, there might not be enough space on the chip to hold them. Also, making chips bigger has its cost, and there's the possibility that the chip won't fit in its socket, which adds many more issues. It's not that simple ;)
EDIT:
Well, I haven't got any papers about this, but I'll explain my opinion anyway with a simple question: if a program needs x bytes of memory, what would be the difference between a cache of 10 * x bytes and one of 100 * x? Once all the data is loaded into the cache (which doesn't depend on its size at all), the difference is all in the cache's access speed. And given locality of reference, it's not necessary to have everything in cache.
Also, having big caches requires better algorithms for searching the requested data in them. For example, accessing data in a fully associative cache will become slower than accessing main memory as the cache size increases (which implies there are more and more places to look for the data). Considering a multitasking system, though, introduces other issues which I don't actually know much about.
To conclude, the performance gain from increasing the cache size becomes smaller as the cache approaches the usual amount of data used by all the software running on a given machine.
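A small Go sketch of the classic way to see this on your own machine: chase pointers through working sets of growing size and watch the time per access step up as each cache level is exceeded (the sizes and step count are arbitrary).

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// chase follows a random cycle through next for the given number of steps.
func chase(next []int32, steps int) int32 {
	idx := int32(0)
	for i := 0; i < steps; i++ {
		idx = next[idx]
	}
	return idx
}

func main() {
	const steps = 1 << 24
	// Sweep working sets from 16 KB to 64 MB; once a working set no longer
	// fits in a given cache level, the cost per access jumps.
	for kb := 16; kb <= 64*1024; kb *= 4 {
		n := kb * 1024 / 4 // number of int32 slots in this working set
		next := make([]int32, n)
		perm := rand.Perm(n)
		for i := 0; i < n; i++ {
			next[perm[i]] = int32(perm[(i+1)%n])
		}
		start := time.Now()
		idx := chase(next, steps)
		nsPerAccess := float64(time.Since(start).Nanoseconds()) / float64(steps)
		fmt.Printf("%6d KB working set: %5.1f ns/access (%d)\n", kb, nsPerAccess, idx)
	}
}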

Java heap bottleneck - how to identify the cause?

I have a J2EE project running on JBoss, with a maximum heap size of 2048m, which is giving strange results under load testing. I've benchmarked the heap and CPU usage and received the following results (series 1 is heap usage, series 2 is CPU usage):
It seems as if the heap is being used properly and getting garbage collected properly around A. When it gets to B however, there appears to be some kind of a bottleneck as there is heap space available, but it never breaks that imaginary line. At the same time, at C, the cpu usage drops dramatically. During this period we also receive an "OutOfMemoryError (GC overhead limit exceeded)," which does not make much sense to me as there is heap space available.
My guess is that there is some kind of bottleneck, but what exactly I can't even imagine. How would you suggest going about finding the cause of the issue? I've profiled the memory usage and noticed that there are quite a few instances of the one class (around a million), but the total size of these instances is fairly small (around 50MB if I remember correctly).
Edit: The server is dedicated to this application and the CPU usage given is only for the JVM (there should not be any significant CPU usage outside of the JVM). The memory usage is only for the heap; it does not include the permgen space. This problem is reproducible. My main concern is the limit encountered around B, for which I have not found a plausible explanation yet.
Conclusion: Turns out this was caused by a bunch of long running SQL queries being called concurrently. The returned ResultSets were also very large, possibly explaining the OOME. I still have no reasonable explanation for why there appears to be some limit at B.
From the error message it appears that the JVM is using the parallel scavenger algorithm for garbage collection. The message is dumped along with an OOME error when a lot of time is spent on GC, but not a lot of the heap is recovered.
The document from Sun does not specify if the 98% of the total time consumed is to be read as 98% of the CPU utilization of the process or that of the CPU itself. In either case, I have to draw the following inferences (with limited information):
The garbage collector or the JVM process does not have enough CPU utilization, most likely due to other processes consuming CPU at the same time.
The garbage collector does not have enough CPU utilization since it is a low priority thread, and another memory intensive (but not CPU intensive) thread in the JVM is doing work at the same time, which results in the failure to de-allocate memory.
Based on the above inferences (all, one or none of them could be true), it would be worthwhile to correlate the graph that you've obtained with the runtime behavior of the application as far as users are concerned. In other words, you might find it useful to determine whether other processes are kicked off when your problem occurs, or which part of the application is in operation when the problem occurs.
In any case, the page referenced above does give an option to disable the GC overhead limit used by the GC algorithm.
EDIT: If the problem occurs periodically, and can be reproduced, it might turn out to be a memory leak, otherwise (i.e. it occurs sporadically), you are better off tuning the GC algorithm or even changing it.
If I want to know where the "bottlenecks" are, I just get a few stackshots. There's no need to wonder and guess and play detective. They will just tell you.
Usually memory problems and performance problems go hand in hand, so if you fix the performance problems, you will also fix the memory problems (not for certain, though).

Resources