Compile with Haskell Stack in an environment with limited memory

By default, the Haskell Stack tool seems to build projects in parallel, using as many cores as are available. In environments with limited memory (little memory per core), this can lead to out-of-memory failures or swap thrashing.
Is there a way to run stack install <package> while reducing the number of threads? (I see no such option in the help message.)
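For what it's worth, recent Stack versions accept a global --jobs flag (also settable as jobs: in ~/.stack/config.yaml); assuming your version supports it, something like the following limits the build to one concurrent job:
stack --jobs 1 install <package>
The same cap can be made permanent by putting jobs: 1 into ~/.stack/config.yaml.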

Related

Why would the JVM suddenly not allocate up to the maximum heap setting, even when running low on memory and with plenty of free OS memory?

My JVM is set to have a maximum heap size of 2 GB. It is currently running slowly due to being low on memory, but it will not allocate beyond 1841 MB (even though it has done so earlier in this run). I have over 16 GB of memory free.
Why would this suddenly happen to a running JVM? Could it be because it is "fenced in" - it cannot get a larger contiguous range of physical memory?
This is for Java 1.8.0_73 (64-bit) on Windows 10, but I have seen this now and then with other Java versions and on Windows 7 and XP too.
32-bit JVMs usually struggle to use more than about 1800 MB. Exactly how much they can allocate depends on your operating system and how it lays out the 32-bit address space (which can vary between runs).
Use a 64-bit JVM to get more.
Start the JVM with
java -Xmx2048m -Xms2048m
This preallocates the full 2 GB at JVM startup (even if it is not needed yet).
You cannot make a program run faster by just increasing the heap memory. A program may be slow due to various reasons.
In your case, maybe it is not because of memory usage: increasing the heap does not by itself cause the program to use that memory, or to run faster. The heap only fills up if you create many objects that remain in use and cannot be garbage collected.
Another reason for the slowness could be that some parts of the program use up the processing power (poorly performing algorithms?).
It could also be due to slow I/O operations (file reads/writes?).
These are only a few possible reasons; pinning down the slowness requires knowing more about your program.
You could look for slow-running parts of your code by going through its logs (if any) or by using profiling tools like JConsole (shipped with the JDK), VisualVM, etc.
You could also tune your JVM by passing various parameters to customize garbage collection, the various parts of the heap, the thread stack size, etc.
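For illustration, a command line along these lines sizes the heap and thread stacks and logs garbage-collection activity (myapp.jar is a placeholder here, and flag availability varies by JVM version):
java -Xms2048m -Xmx2048m -Xss512k -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -jar myapp.jar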

Methods to decrease memory consumption in ArangoDB

We currently have an ArangoDB cluster running version 3.0.10 for a POC, with about 81 GB stored on disk and main-memory consumption of about 98 GB distributed across 5 primary DB servers. There are about 200 million vertices and 350 million edges, in 3 edge collections and 3 document collections; most of the memory (80%) is consumed by the edges.
I'm exploring methods to decrease the main memory consumption. I'm wondering if there are any methods to compress/serialize the data so that less main memory is used.
The reason for decreasing memory is to reduce infrastructure costs, I'm willing to trade-off on speed for my use case.
Can you please let me know if there are any methods to reduce main memory consumption for ArangoDB?
It took us a while to find out that our original recommendation to set vm.overcommit_memory to 2 is not good in all situations.
It seems that there is an issue with the bundled jemalloc memory allocator in ArangoDB in some environments.
With a vm.overcommit_memory kernel setting of 2, the allocator had a problem splitting existing memory mappings, which made the number of memory mappings of an arangod process grow over time. This could lead to the kernel refusing to hand out more memory to the arangod process, even if physical memory was still available. The kernel will only grant up to vm.max_map_count memory mappings to each process, which defaults to 65530 in many Linux environments.
Another issue when running jemalloc with vm.overcommit_memory set to 2 is that for some workloads the amount of memory that the Linux kernel tracks as "committed memory" also grows over time and does not decrease. So eventually the ArangoDB daemon process (arangod) may not get any more memory simply because it reaches the configured overcommit limit (physical RAM * overcommit_ratio + swap space).
So the solution here is to modify the value of vm.overcommit_memory from 2 to either 1 or 0 (0 is the Linux kernel default, by the way). This will fix both of these problems.
We are still observing ever-increasing virtual memory consumption when using jemalloc with any overcommit setting, but in practice this should not cause problems.
Another way to address the problem, which however requires compiling ArangoDB from source, is to build without jemalloc (-DUSE_JEMALLOC=Off when running cmake). I am just listing this as an alternative here for completeness. With the system's libc allocator you should see quite stable memory usage. We also tried another allocator, namely the one from libmusl, and it also shows quite stable memory usage over time. The main thing that makes exchanging the allocator non-trivial is that jemalloc otherwise has very nice performance characteristics.
(quoting Jan Steemann as can be found on github)
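For reference, on most Linux systems the overcommit setting can be changed at runtime with sysctl and persisted across reboots in /etc/sysctl.conf (run as root):
sysctl -w vm.overcommit_memory=0
echo "vm.overcommit_memory = 0" >> /etc/sysctl.conf
And the jemalloc-free source build mentioned above would be configured along these lines (the source path is a placeholder):
cmake -DUSE_JEMALLOC=Off /path/to/arangodb/source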
Several additions to the RocksDB storage engine have been made in the meantime; we demonstrate how memory management works in RocksDB.
Many of the RocksDB storage engine's settings are exposed to the outside via startup options.
Discussion and research by a user led to further options being exposed for configuration with ArangoDB 3.7:
--rocksdb.cache-index-and-filter-blocks-with-high-priority
--rocksdb.pin-l0-filter-and-index-blocks-in-cache
--rocksdb.pin-top-level-index-and-filter
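As a sketch, these are booleans passed on the arangod command line (the values here are illustrative, not tuned recommendations):
arangod --rocksdb.cache-index-and-filter-blocks-with-high-priority true \
        --rocksdb.pin-l0-filter-and-index-blocks-in-cache true \
        --rocksdb.pin-top-level-index-and-filter true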

User space vs kernel space program performance difference

I have a sequential user space program (some kind of memory-intensive search data structure). The program's performance, measured in CPU cycles, depends on the memory layout of the underlying data structures and on the data cache size (LLC).
So far my user space program is tuned to death; now I am wondering if I can gain performance by moving the user space code into the kernel (as a kernel module). I can think of the following factors that might improve performance in kernel space ...
No system call overhead (how many CPU cycles are gained per system call?). This is less critical, as I barely use any system calls in my program except for allocating memory, and that only when the program starts.
Control over scheduling: I can create a kernel thread and make it run on a given core without being moved off it.
I can use kmalloc for memory allocation and thus have more control over the allocated memory; I may also be able to control cache coloring more precisely by controlling the allocated memory. Is it worth trying?
My questions to the kernel experts...
Have I missed any factors in the above list that can improve performance further?
Is it worth trying, or is it already known that I will NOT get much of a performance improvement?
If a performance gain is possible in the kernel, is there any estimate of how large it could be (any theoretical guess)?
Thanks.
Regarding point 1: kernel threads can still be preempted, so unless you're making lots of syscalls (which you aren't) this won't buy you much.
Regarding point 2: you can pin a thread to a specific core by setting its affinity, using sched_setaffinity() on Linux.
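A minimal sketch of that on Linux (the core number is arbitrary and error handling is kept to a perror):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);                      // pin to core 2 (arbitrary choice)
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  // pid 0 = calling thread
        perror("sched_setaffinity");
    // ... run the memory-intensive search here ...
    return 0;
}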
Regarding point 3: What extra control are you expecting? You can already allocate page-aligned memory from user space using mmap(). This already lets you control for the cache's set associativity, and you can use inline assembly or compiler intrinsics for any manual prefetching hints or non-temporal writes. The main difference between memory allocated in the kernel and in user space is that kmalloc() allocates wired (non-pageable) memory. I don't see how this would help.
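A sketch of such a page-aligned allocation from user space (the size is illustrative):
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    size_t len = 1 << 20;                   // 1 MiB, illustrative
    // Anonymous mappings are always page-aligned
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    // ... lay out the search data structure inside buf ...
    munmap(buf, len);
    return 0;
}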
I suspect you'll see much better ROI from parallelising with SIMD or multithreading, or from making further algorithmic or memory optimisations.
Create a dedicated cpuset for your program and move all other processes out of it. Then bump your process' priority to realtime with FIFO scheduling policy using something like:
#include <sched.h>

struct sched_param schedparams;
// Be portable - don't just set priority to 99 :)
schedparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &schedparams);  // pid 0 = the calling process
Don't do that on a single-core system!
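To create the dedicated cpuset mentioned above, one way is via the legacy cgroup-v1 cpuset controller (assuming it is mounted at /sys/fs/cgroup/cpuset; run as root, and the core number and $PID are illustrative placeholders):
mkdir /sys/fs/cgroup/cpuset/mine
echo 3 > /sys/fs/cgroup/cpuset/mine/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/mine/cpuset.mems
echo $PID > /sys/fs/cgroup/cpuset/mine/tasks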
Reserve large enough stack space with alloca(3) and touch all of the allocated stack memory, map more than enough heap space and then use mlock(2) or mlockall(2) to pin process memory.
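A sketch of the locking part (mlockall typically requires root or CAP_IPC_LOCK):
#include <sys/mman.h>
#include <stdio.h>

int main(void) {
    // Pin all current and future pages of this process into RAM
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");
    // ... allocate and touch stack and heap memory here ...
    return 0;
}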
Even if your program is a sequential one, if run on a multisocket Nehalem or post-Nehalem Intel system or an AMD64 system, NUMA effects can slow your program down. Use API functions from numa(3) to allocate and keep memory as close to the NUMA node where your program executes as possible.
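A sketch of node-local allocation with libnuma (compile with -lnuma; the size is illustrative):
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>

int main(void) {
    if (numa_available() < 0) return 1;           // no NUMA support on this box
    int node = numa_node_of_cpu(sched_getcpu());  // node we are running on
    void *buf = numa_alloc_onnode(1 << 20, node); // keep the data on that node
    // ... build the data structure in buf ...
    numa_free(buf, 1 << 20);
    return 0;
}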
Try other compilers - some of them might optimise better than the one you are currently using. Intel's compiler, for example, is very aggressive in laying out instructions so as to benefit from out-of-order execution, pipelining and branch prediction.

How to analyze main memory and cache access patterns?

I am looking for a way to analyze main memory access times. Such a method should give me a distribution of RAM and cache accesses, to analyze CPU stalls over time. I wonder if this is possible entirely in software (a kernel module?), or whether a virtual machine could provide such feedback.
The performance counters in modern x86_64 CPUs are perfect for determining what code is executing when events like cache misses, branch mispredictions, instruction/data TLB misses, prefetches, etc. occur.
On Linux, there are tools like perf and oprofile. AMD and Intel both offer commercial tools (for Linux and other platforms) to record and analyze these same performance counters.
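For example, with perf (./myprogram is a placeholder), you can get a counter summary and then a per-instruction profile of cache misses:
perf stat -e cycles,cache-references,cache-misses,dTLB-load-misses ./myprogram
perf record -e cache-misses ./myprogram
perf report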

Memory management and processes

Is the heap local to a process? In other words, we have the stack, which is always local to a process and separate for each process. Does the same apply to the heap? Also, if the heap is local, I believe the heap size should change at runtime as we request more and more memory from the operating system, so who puts an upper limit on how much memory can be requested?
Heaps are indeed local to the process. Limits are placed by the operating system. Memory can also be limited by the number of bits used for addressing (e.g. a 32-bit address space can only address 4 GB of memory at a time).
Yes, on modern operating systems there exists a separate heap for each process. There is, by the way, not just a separate stack for every process; there is a separate stack for every thread in the process. Thus a process can have quite a number of independent stacks.
But not all operating systems and not all hardware platforms offer this feature. You need a memory management unit (in hardware) for it to work. But desktop computers have had that feature since... well... a while back... the 386 CPU? (leave a comment if you know better). You may, though, find yourself on some kind of microprocessor that does not have it.
Anyway: the heap size is mainly limited by the operating system and the hardware. The hardware limits it especially through the finite address space it provides. For example, a 32-bit CPU will not address more than 4 GB (2^32). A CPU that features physical address extension (PAE), which current CPUs do support, can address up to 64 GB, but a single process cannot make use of this; it will always see 4 GB max.
Additionally, the operating system can limit the memory as it sees fit. On Linux you can see and set limits using the ulimit command. If you are running code not natively but in an interpreter/virtual machine (such as Java or PHP), that environment can additionally limit the heap size.
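For example (the value is illustrative; -v takes kilobytes):
ulimit -a          # show all limits for the current shell
ulimit -v 1048576  # cap virtual memory for this shell and its children at ~1 GB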
The heap is local to a process, but it is shared among that process's threads; the stack is not shared, it is per-thread.
Regarding the limit: on Linux, for example, it is set by ulimit (see its manpage).
On a modern, preemptively multi-tasking OS, each process gets its own address space. The set of memory pages that it can see are separate from the pages that other processes can see. As a result, yes, each process sees its own stack and heap, because the stack and heap are just areas of memory.
On an older, cooperatively multi-tasking OS, each process shared the same address space, so the heap was effectively shared among all processes.
The heap is defined by the collection of things in it, so the heap size only changes as memory is allocated and freed. This is true regardless of how the OS is managing memory.
The top limit of how much memory can be requested is determined by the memory manager. In a machine without virtual memory, the top limit is simply how much memory is installed in the computer. With virtual memory, the top limit is defined by physical memory plus the size of the swap file on the disk.
