Redis support 3 memory allocator: libc, jemalloc, tcmalloc. When i do memory usage test, i find that mem_fragmentation_ratio in INFO MEMORY could be less than 1 with libc allocator. With jemalloc or tcmalloc, this value is greater or equal than 1 as it should be.
Could anyone explain why mem_fragmentation_ratio is less than 1 with libc?
Redis version:2.6.12. CentOS 6
Update:
I forgot to mention that one possible reason is that swap happens and mem_fragmentation_ratio will be < 1.
But when i do my test, i adjust swapiness, even turn swap off. The result is the same. And my redis instance actually do not cost too much memory.
Generally, you will have less fragmentation with jemalloc or tcmalloc, than with libc malloc. This is due to 4 factors:
more granular allocation classes for jemalloc and tcmalloc. It reduces internal fragmentation, especially when Redis has to allocate a lot of very small objects.
better algorithms and data structures to prevent external fragmentation (especially for jemalloc). Obviously, the gain depends on your long term memory allocation patterns.
support of "malloc size". Some allocators offer an API to return the size of allocated memory. With glibc (Linux), malloc does not have this capability, so it is emulated by explicitly adding an extra prefix to each allocated memory block. It increases internal fragmentation. With jemalloc and tcmalloc (or with the BSD libc malloc), there is no such
overhead.
jemalloc (and tcmalloc with some setting changes) can be more aggressive than glibc to release memory to the OS - but again, it depends on the allocation patterns.
Now, how is it possible to get inconsistent values for mem_fragmentation_ratio?
As stated in the INFO documentation, the mem_fragmentation_ratio value is calculated as the ratio between memory resident set size of the process (RSS, measured by the OS), and the total number of bytes allocated by Redis using the allocator.
Now, if more memory is allocated with libc (compared to jemalloc,tcmalloc), or if more memory is used by some other processes on your system during your benchmarks, Redis memory may be swapped out by the OS. It will reduce the RSS (since a part of Redis memory will not be in main memory anymore). The resulting fragmentation ratio will be less than 1.
In other words, this ratio is only relevant if you are sure Redis memory has not been swapped out by the OS (if it is not the case, you will have performance issues anyway).
Other than swap, I know 2 ways to make "memory fragmentation ratio" to be less than 1:
Have a redis instance with little or no data, but thousands of idling client connections. From my testing, it looks like redis will have to allocate about 20 KB of memory for each client connections, but most of it won't actually be used (i.e. won't appear in RSS) until later.
Have a master-slave setup with let's say 8 GB of repl-backlog-size. The 8 GB will be allocated as soon as the replication starts (on master only for version <4.0, on both master and slave otherwise), but the memory will only be used as we start writing to the master. So the ratio will be way below 1 initially, and then get closer and closer to 1 as the replication backlog get filled.
Related
We currently have a ArangoDB cluster running on version 3.0.10 for a POC with about 81 GB stored on disk and Main memory consumption of about 98 GB distributed across 5 Primary DB servers. There are about 200 million vertices and 350 million Edges. There are 3 Edge collections and 3 document collections, most of the memory(80%) is consumed due to the presence of the edges
I'm exploring methods to decrease the main memory consumption. I'm wondering if there are any methods to compress/serialize the data so that less amount of main memory is utilized.
The reason for decreasing memory is to reduce infrastructure costs, I'm willing to trade-off on speed for my use case.
Please can you let me know, if there any Methods to reduce main memory consumption for ArangoDB
It took us a while to find out that our original recommendation to set vm.overcommit_memory to 2 this is not good in all situations.
It seems that there is an issue with the bundled jemalloc memory allocator in ArangoDB with some environments.
With an vm.overcommit_memory kernel settings value of 2, the allocator had a problem of splitting existing memory mappings, which made the number of memory mappings of an arangod process grow over time. This could have led to the kernel refusing to hand out more memory to the arangod process, even if physical memory was still available. The kernel will only grant up to vm.max_map_count memory mappings to each process, which defaults to 65530 on many Linux environments.
Another issue when running jemalloc with vm.overcommit_memory set to 2 is that for some workloads the amount of memory that the Linux kernel tracks as "committed memory" also grows over time and does not decrease. So eventually the ArangoDB daemon process (arangod) may not get any more memory simply because it reaches the configured overcommit limit (physical RAM * overcommit_ratio + swap space).
So the solution here is to modify the value of vm.overcommit_memory from 2 to either 1 or 0. This will fix both of these problems.
We are still observing ever-increasing virtual memory consumption when using jemalloc with any overcommit setting, but in practice this should not cause problems.
So when adjusting the value of vm.overcommit_memory from 2 to either 0 or 1 (0 is the Linux kernel default btw.) this should improve the situation.
Another way to address the problem, which however requires compilation of ArangoDB from source, is to compile a build without jemalloc (-DUSE_JEMALLOC=Off when cmaking). I am just listing this as an alternative here for completeness. With the system's libc allocator you should see quite stable memory usage. We also tried another allocator, precisely the one from libmusl, and this also shows quite stable memory usage over time. The main problem here which makes exchanging the allocator a non-trivial issue is that jemalloc has very nice performance characteristics otherwise.
(quoting Jan Steemann as can be found on github)
Several new additions to the rocksdb storage engine were made meanwhile. We demonstrate how memory management works in rocksdb.
Many Options of the rocksdb storage engine are exposed to the outside via options.
Discussions and a research of a user led to two more options to be exposed for configuration with ArangoDB 3.7:
--rocksdb.cache-index-and-filter-blocks-with-high-priority
--rocksdb.pin-l0-filter-and-index-blocks-in-cache
--rocksdb.pin-top-level-index-and-filter
I have heard in embedded system, we should use some preallocated fixed-size memory chunks(like buddy memory system?). Could somebody give me a detailed explanation why?
Thanks,
In embedded systems you have very limited memory. Therefore, if you occasionally lose only one byte of memory (because you allocate it , but you dont free it), this will eat up the system memory pretty quickly (1 GByte of RAM, with a leak rate of 1/hour will take its time. If you have 4kB RAM, not as long)
Essentially the behaviour of avoiding dynamic memory is to avoid the effects of bugs in your program. As static memory allocation is fully deterministic (while dynamic memory alloc is not), by using only static memory allocation one can counteract such bugs. One important factor for that is that embedded systems are often used in security-critical application. A few hours of downtime could cost millions or an accident could happen.
Furthermore, depending on the dynamic memory allocator, the indeterminism also might take an indeterminate amount of time, which can lead to more bugs especially in systems relying on tight timing (thanks to Clifford for mentioning this). This type of bug is often hard to test and to reproduce because it relies on a very specific execution path.
Additionally, embedded systems don't usually have MMUs, so there is nothing like memory protection. If you run out of memory and your code to handle that condition doesn't work, you could end up executing any memory as instruction (bad things could happen! However this case is only indirectly related to dynamic mem allocation).
As Hao Shen mentioned, fragmentation is also a danger. Whether it may occur depends on your exact usecase, but in embedded systems it is quite easy to loose 50% of your RAM due to fragmentation. You can only avoid fragmentation if you allocate chunks that always have the exact same size.
Performance also plays a role (depends on the usecase - thanks Hao Shen). Statically allocated memory is allocated by the compiler whereas malloc() and similar need to run on the device and therefore consume CPU time (and power).
Many embedded OSs (e.g. ChibiOS) support some kind of dynamic memory allocator. But using it only increases the possibility of unexpected issues to occur.
Note that these arguments are often circumvented by using smaller statically allocated memory pools. This is not a real solution, as one can still run out of memory in those pools, but it will only affect a small part of the system.
As pointed out by Stephano Sanfilippo, some system don't even have enough resources to support dynamic memory allocation.
Note: Most coding standard, including the JPL coding standard and DO-178B (for critical avionics code - thanks Stephano Sanfilippo) forbid the use of malloc.
I also assume the MISRA C standard forbids malloc() because of this forum post -- however I don't have access to the standard itself.
The main reasons not to use dynamic heap memory allocation here are basically:
a) Determinism and, correlated,
b) Memory fragmentation.
Memory leaks are usually not a problem in those small embedded applications, because they will be detected very early in development/testing.
Memory fragmentation can however become non-deterministic, causing (best case) out-of-memory errors at random times and points in the application in the field.
It may also be non-trivial to predict the actual maximum memory usage of the application during development with dynamic allocation, whereas the amount of statically allocated memory is known at compile time and it is absolutely trivial to check if that memory can be provided by the hardware or not.
Allocating memory from a pool of fixed size chunks has a couple advantages over dynamic memory allocation. It prevents heap fragmentation and it is more deterministic.
With dynamic memory allocation, dynamically sized memory chunks are allocated from a fixed size heap. The allocations aren't necessarily freed in the same order that they're allocated. Over time this can lead to a situation where the free portions of the heap are divided up between allocated portions of the heap. As this fragmentation occurs, it can become more difficult to fulfill requests for larger allocations of memory. If a request for a large memory allocation is made, and there is no contiguous free section in the heap that's large enough then the allocation will fail. The heap may have enough total free memory but if it's all fragmented and there is not a contiguous section then the allocation will fail. The possibility of malloc() failing due to heap fragmentation is undesirable in embedded systems.
One way to combat fragmentation is rejoin the smaller memory allocations into larger contiguous sections as they are freed. This can be done in various ways but they all take time and can make the system less deterministic. For example, if the memory manager scans the heap when a memory allocation is freed then the amount of time it takes free() to complete can vary depending on what types of memory are adjacent to the allocation being freed. That is non-deterministic and undesirable in many embedded systems.
Allocating from a pool of fixed sized chunks does not cause fragmentation. So long as there is some free chunks then an allocation won't fail because every chunk is the right size. Plus allocating and freeing from a pool of fixed size chunks is simpler. So the allocate and free functions can be written to be deterministic.
Redis is used to save data but it costs a lot of memory, and its memory usage up to 52.5%.
I deleted half of the keys in redis, and the return code of the delete operation is ok, but its memory usage doesn't reduce.
What's the reason? Thanks in Advance.
My operation code is as below:
// save data
m_pReply = (redisReply *)redisCommand(m_pCntxt, "set %b %b", mykey.data(), mykey.size(), &myval, sizeof(myval));
// del data
m_pReply = (redisReply *)redisCommand(m_pCntxt, "del %b", mykey.data(), mykey.size());
The redis info:
redis 127.0.0.1:6979> info
redis_version:2.4.8
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.4.6
process_id:28799
uptime_in_seconds:1289592
uptime_in_days:14
lru_clock:127925
used_cpu_sys:148455.30
used_cpu_user:38023.92
used_cpu_sys_children:23187.60
used_cpu_user_children:123989.72
connected_clients:22
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:31903334872
used_memory_human:29.71G
used_memory_rss:34414981120
used_memory_peak:34015653264
used_memory_peak_human:31.68G
mem_fragmentation_ratio:1.08
mem_allocator:jemalloc-2.2.5
loading:0
aof_enabled:0
changes_since_last_save:177467
bgsave_in_progress:0
last_save_time:1343456339
bgrewriteaof_in_progress:0
total_connections_received:820
total_commands_processed:2412759064
expired_keys:0
evicted_keys:0
keyspace_hits:994257907
keyspace_misses:32760132
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:11672476
vm_enabled:0
role:slave
master_host:192.168.252.103
master_port:6479
master_link_status:up
master_last_io_seconds_ago:0
master_sync_in_progress:0
db0:keys=66372158,expires=0
Please refer to Memory allocation section on the following link:
http://redis.io/topics/memory-optimization
I quoted it here:
Redis will not always free up (return) memory to the OS when keys are
removed. This is not something special about Redis, but it is how most
malloc() implementations work. For example if you fill an instance
with 5GB worth of data, and then remove the equivalent of 2GB of data,
the Resident Set Size (also known as the RSS, which is the number of
memory pages consumed by the process) will probably still be around
5GB, even if Redis will claim that the user memory is around 3GB. This
happens because the underlying allocator can't easily release the
memory. For example often most of the removed keys were allocated in
the same pages as the other keys that still exist.
Since Redis 4.0.0 there's a command for this:
MEMORY PURGE
Should do the trick: https://redis.io/commands/memory-purge
Note however that command docs state:
This command is currently implemented only when using jemalloc as an allocator, and evaluates to a benign NOOP for all others.
And the README reminds us that:
Redis is compiled and linked against libc
malloc by default, with the exception of jemalloc being the default on Linux
systems. This default was picked because jemalloc has proven to have fewer
fragmentation problems than libc malloc.
A good starting point is to use the Redis CLI command: MEMORY DOCTOR.
It can give you very valuable information and point you to the potential issue.
some useful links:
MEMORY DOCTOR command docs
What is defragmentation and what are the Redis defragmentation configs
example:
Peak memory: In the past this instance used more than 150% the memory that is currently using. The allocator is normally not able to release memory after a peak, so you can expect to see a big fragmentation ratio, however this is actually harmless and is only due to the memory peak, and if the Redis instance Resident Set Size (RSS) is currently bigger than expected, the memory will be used as soon as you fill the Redis instance with more data. If the memory peak was only occasional and you want to try to reclaim memory, please try the MEMORY PURGE command, otherwise the only other option is to shutdown and restart the instance.
High total RSS: This instance has a memory fragmentation and RSS overhead greater than 1.4 (this means that the Resident Set Size of the Redis process is much larger than the sum of the logical allocations Redis performed). This problem is usually due either to a large peak memory (check if there is a peak memory entry above in the report) or may result from a workload that causes the allocator to fragment memory a lot. If the problem is a large peak memory, then there is no issue. Otherwise, make sure you are using the Jemalloc allocator and not the default libc malloc. Note: The currently used allocator is "jemalloc-5.1.0".
High allocator fragmentation: This instance has an allocator external fragmentation greater than 1.1. This problem is usually due either to a large peak memory (check if there is a peak memory entry above in the report) or may result from a workload that causes the allocator to fragment memory a lot. You can try enabling 'activedefrag' config option.
I am writing a C++ program that essentially works with very large arrays. On Windows, I am using VirtualAlloc to allocate memory to my arrays. Now I fully understand the difference between reserving and committing memory using VirutalAlloc; however, I am wondering whether there is any benefit in committing memory page-by-page to a reserved region. In particular, MSDN (http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx) contains the following explanation for the MEM_COMMIT option:
Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.
My experiments confirm this: I can reserve and commit several GB of memory wihtout increasing memory usage of my process (as shown in Task Manager); actual memory gets allocated only when I actually access memory.
Now I saw quite a few examples arguing that one should reserve a large portion of the address space and then commit memory page-by-page (or in some larger blocks, depending on the app's logic). As explained above, however, memory does not seem to be committed before one accesses it; thus, I'm wondering whether there is any real benefit in committing memory page-by-page. In fact, committing memory page-by-page might actually slow my program down due to many system calls for actually comitting memory. If I commit the entire region at once, I pay for just one system call, but the kernel seems to be smart enough to actually allocate only memory that I actually use.
I would appreciate it if someone could explain to me which strategy is better.
The difference is that commit "backs" the memory against the page file. To give an example:
Given 2GB of physical ram and 2GB of swap (assume fixed-size swap for this purpose).
Reserve 6GB - OK.
Commit first 2GB - OK.
Commit remaining 4GB - fails.
Extend swap file to 8GB
Commit remaining 4GB - succeeds.
The reason for using MEM_COMMIT would primarily be for runtime error suppression (app stability). If you have a process that commits pages on-demand then there's always a chance that a commit along-the-way could fail if it exceeds amount of memory+swap available. When memory has been backed by the page file then you have a strong guarantee that the memory is available for use from now until the point that you release it.
There's a number of reasons to go one way or the other, and I don't think there's any perfect science to deciding which. MEM_RESERVE alone is only needed for very large sparse array scenarios, ex: multi-gigabyte array which has at most 25-33% utilization (a popular technique for accelerating hash tables, etc).
Almost everything else is gray area where you could probably go either way -- MEM_COMMIT up-front would make your own app a little more stable and essentially give it priority to physical ram over competing apps that might allocate on-demand. (if you grab the ram first then your app will be the last left standing when physical memory is exhausted) At the same time, if you're not actually using all that ram then you may end up limiting the multi-tasking potential of your client's machine or causing unnecessary wasted disk space via a growing page file.
I'm using VMMap from SysInternals to look at memory allocated by my Win32 C++ process on WinXP, and I see a bunch of allocations where portions of the allocated memory are reserved but not committed. As far as I can tell, from my reading and testing, all of the common memory allocators (e.g., malloc, new, LocalAlloc, GlobalAlloc) used in a C++ program always allocate fully committed blocks of memory.
Heaps are a common example of code that reserves memory but doesn't commit it until needed. I suspect that some of these blocks are Windows/CRT heaps, but there appears to be more of these types of blocks than I would expect for heaps. I see on the order of 30 of these blocks in my process, between 64k and 8MB in size, and I know that my code never intentionally calls VirtualAlloc to allocate reserved, uncommitted memory.
Here are a couple of examples from VMMap: http://www.flickr.com/photos/95123032#N00/5280550393/
What else would allocate such blocks of memory, where much of it is reserved but not committed? Would it make sense that my process has 30 heaps? Thanks.
I figured it out - it's the CRT heap that gets allocated by calls to malloc. If you allocate a large chunk of memory (e.g., 2 MB) using malloc, it allocates a single committed block of memory. But if you allocate smaller chunks (say 177kb), then it will reserve a 1 MB chunk of memory, but only commit approximately what you asked for (e.g., 184kb for my 177kb request).
When you free that small chunk, that larger 1 MB chunk is not returned to the OS. Everything but 4k is uncommitted, but the full 1 MB is still reserved. If you then call malloc again, it will attempt to use that 1 MB chunk to satisfy your request. If it can't satisfy your request with the memory that it's already reserved, it will allocate a new chunk of memory that's twice the previous allocation (in my case it went from 1 MB to 2 MB). I'm not sure if this pattern of doubling continues or not.
To actually return your freed memory to the OS, you can call _heapmin. I would think that this would make a future large allocation more likely to succeed, but it would all depend on memory fragmentation, and perhaps heapmin already gets called if an allocation fails (?), I'm not sure. There would also be a performance hit since heapmin would release the memory (taking time) and malloc would then need to re-allocate it from the OS when needed again. This information is for Windows/32 XP, your mileage may vary.
UPDATE: In my testing, heapmin did absolutely nothing. And the malloc heap is only used for blocks that are less than 512kb. Even if there are MBs of contiguous free space in the malloc heap, it will not use it for requests over 512kb. In my case, this freed, unused, yet reserved malloc memory chewed up huge parts of my process' 2GB address space, eventually leading to memory allocation failures. And since heapmin doesn't return the memory to the OS, I haven't found any solution to this problem, other than restarting my process or writing my own memory manager.
Whenever a thread is created in your application a certain (configurable) amount of memory will be reserved in the address space for the call stack of the thread. There's no need to commit all the reserved memory unless your thread is actually going to need all of that memory. So only a portion needs to be committed.
If more than the committed amount of memory is required, it will be possible to obtain more system memory.
The practical consideration is that the reserved memory is a hard limit on the stack size that reduces address space available to the application. However, by only committing a portion of the reserve, we don't have to consume the same amount of memory from the system until needed.
Therefore it is possible for each thread to have a portion of reserved uncommitted memory. I'm unsure what the page type will be in those cases.
Could they be the DLLs loaded into your process? DLLs (and the executable) are memory mapped into the process address space. I believe this initially just reserves space. The space is backed by the files themselves (at least initially) rather than the pagefile.
Only the code that's actually touched gets paged in. If I understand the terminology correctly, that's when it's committed.
You could confirm this by running your application in a debugger and looking at the modules that are loaded and comparing their locations and sizes to what you see in VMMap.