Disk persistent cache in ehcache 3.4 is using (leaking?) direct memory

I am running a web application that makes use of Ehcache 3.4.0. I have a cache configuration that defines a simple default of 1000 in-memory objects:
<cache-template name="default">
<key-type>java.lang.Object</key-type>
<value-type>java.lang.Object</value-type>
<heap unit="entries">1000</heap>
</cache-template>
I then have some disk-based caches that use this default template but override all of its values (the configurations are generated programmatically, which is why they use the default template at all), like so:
<cache alias='runViewCache' uses-template='default'>
<key-type>java.lang.String</key-type>
<value-type>java.lang.String</value-type>
<resources>
<heap unit='entries'>1</heap>
<disk unit='GB' persistent='true'>1</disk>
</resources>
</cache>
As data is written into my disk-based cache, direct/off-heap memory is used by the JVM, and never freed. Even clearing the cache does not free the memory. The memory used is directly related (nearly byte-for-byte as far as I can tell) to the data written to the disk-based cache.
The authoritative tier for this cache is an instance of org.ehcache.impl.internal.store.disk.OffHeapDiskStore.
This appears to be a memory leak (memory is consumed and never freed), but I am by no means an expert at configuring Ehcache. Can anyone suggest a configuration change that will cause my disk tier to NOT use off-heap memory? Or is there something else I am completely misunderstanding that someone can point out?
Thank you!

How do you measure "used"?
TL;DR: No, the disk tier does not waste RAM.
As of v3.0.0 Ehcache uses memory mapped files for disk persistence:
Replacement of the port of Ehcache 2.x open source disk store by one that leverages the offheap library and memory mapped files.
This means Ehcache uses virtual address space to access the files on disk. This consumes essentially 0 bytes of your RAM. (At least not directly. As @louis-jacomet already stated, the OS can decide to cache parts of the files in RAM.)
When you're running on Linux you should compare the VIRT and RES values of your process. VIRT is the amount of virtual address space used by the process; RES is the amount of real RAM (RESident set) used by the process. VIRT should increase while the disk store cache is populated, but RES should remain fairly stable.
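To observe this directly, here is a minimal C sketch of my own (not part of Ehcache) that memory-maps an existing large file, hypothetically /tmp/bigfile, and prints the process's VmSize (VIRT) and VmRSS (RES) from /proc/self/status before and after the mapping. The virtual size jumps by the file size while the resident size barely moves:

/* Demo: mapping a large file grows virtual size (VmSize/VIRT), but
 * resident memory (VmRSS/RES) stays small until pages are touched.
 * Assumes Linux and an existing large file at /tmp/bigfile (hypothetical path). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void print_mem(const char *label) {
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    printf("--- %s ---\n", label);
    while (fgets(line, sizeof line, f))
        if (!strncmp(line, "VmSize", 6) || !strncmp(line, "VmRSS", 5))
            fputs(line, stdout);
    fclose(f);
}

int main(void) {
    int fd = open("/tmp/bigfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    print_mem("before mmap");
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    print_mem("after mmap");   /* VmSize jumps by the file size, VmRSS barely moves */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}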

Related

What part of the RAM is used by the system file cache in Windows?

According to general notions about the page cache and this answer, the system file cache essentially uses all the RAM not used by any other process. This is, as far as I know, the case for the page cache in Linux.
Since the notion of "free RAM" is a bit blurry in Windows, my question is: what part of the RAM does the system file cache use? For example, is it the same as "Available RAM" in the Task Manager?
Yes, the RAM used by the file cache is essentially the RAM displayed as available in the Task Manager. But not exactly. I'll go into details and explain how to measure it more precisely.
The file cache is not listed as a process in the Task Manager. However, since Windows Vista, its memory is managed like that of a process. Thus I'll explain a bit of memory management for processes, the file cache being a special case.
In Windows, the RAM used by a process has essentially two states: "Active" and "Standby":
"Active" RAM is displayed in the Task Manager and resource monitor as "In Use". It is also the RAM displayed for each process in the Task Manager.
"Standby" RAM is visible in the Resource monitor globally and for each process with RAMMap.
"Standby" + "Free" RAM is what is called "Available" in the task manager. "Free" RAM tends to be near 0 in Windows but you can meaningfully consider Standby RAM is free as well.
Standby RAM is considered "not used for a while by the process". It is the part of the RAM that will be used to give new memory to processes that need it. But it still belongs to the process and could be used directly if the owning process suddenly accesses it (which the system considers unlikely).
Thus the file cache has "Active" RAM and "Standby" RAM. The "Active" RAM is, roughly, the cache for data accessed recently; the "Standby" RAM is the cache for data accessed a while ago. The "Active" RAM of the file cache is usually relatively small. The Standby RAM of the file cache is most often the rest of the RAM of your computer: Total RAM - Active RAM of all processes. Indeed, other processes rarely have much Standby RAM, because it tends to go to the file cache if you do a fair amount of disk I/O.
This is the info displayed by RAMMap for a busy server doing a lot of I/O and computation:
The file cache is the second row called "Mapped file". See that most of the 32 GB is either in the Active part of other processes, or in the Standby part of the file cache.
So finally, yes, the RAM used by the file cache is essentially the RAM displayed as available in the Task Manager. If you want to measure with more certainty, you can use RAMMap.
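If you want a programmatic number instead of the GUI tools, here is a small C sketch I'm adding for illustration (not part of the answer above) that uses the Win32 call GlobalMemoryStatusEx; its ullAvailPhys field reports the same "available" figure, i.e. free plus standby pages:

/* Sketch: query the "available" figure the Task Manager shows
 * (free + standby pages) via GlobalMemoryStatusEx. Windows only. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    MEMORYSTATUSEX ms = { .dwLength = sizeof ms };
    if (!GlobalMemoryStatusEx(&ms)) return 1;
    printf("Total physical RAM:     %llu MiB\n", ms.ullTotalPhys / (1024 * 1024));
    printf("Available physical RAM: %llu MiB\n", ms.ullAvailPhys / (1024 * 1024));
    /* ullAvailPhys ~= Free + Standby, i.e. the pool the file cache lives in */
    return 0;
}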
Your answer is not entirely true.
The file cache, also called the system cache, describes a range of virtual addresses. It has a physical working set that is tracked by MmSystemCacheWs, and that working set is a subset of all the mapped-file physical pages on the system.
The system cache is a range of virtual addresses, hence PTEs, that point to mapped file pages. The mapped file pages are brought in by a process creating a mapping or brought in by the system cache manager in response to a file read.
Existing pages that are needed by the file cache in response to a read become part of the system working set. If a page in a mapped file is not present then it is paged in and it becomes part of the system working set. When a page is in more than one working set (i.e. system and a process or process and another process), it is considered to be in a shared working set on programs like VMMap.
The actual mapped file pages themselves are controlled by a section object, one per file, a data control area (for the file) and subsection objects for the file, and a segment object for the file with prototype PTEs for the file. These get created the first time a process creates a mapping object for the file, or the first time the system cache manager creates the mapping object (section object) for the file due to it needing to access the file in response to a file IO operation performed by a process.
When the system cache manager needs to read from the file, it maps 256 KiB views of the file at a time and keeps track of each view in a VACB object. A process maps a variable-sized view of a file, typically the size of the whole file, and keeps track of this view in the process VAD. Mapping the view simply means filling in PTEs to point to physical pages containing the file that are already resident, by looking at the prototype PTE for that range of the file and seeing what it contains. If the prototype PTE does not point to a physical page, the PTE is instead initialised to point to the prototype PTE and left invalid, and this fault is resolved on demand, page by page, when the read from the view is actually performed.
The VACBs keep track of the 256 KiB views of files that the cache manager has opened and the virtual address range of each view, which describes the range of 64 PTEs that service that range of virtual addresses. There is no virtual external fragmentation or page-table external fragmentation, because all views are the same size, and there is no physical external fragmentation, because all pages in a view are 4 KiB. 256 KiB is the size chosen because, if it were smaller, there would be too many VACB objects (64 times as many, taking up space), and if it were larger, there would effectively be a lot of internal fragmentation from reads and hence large virtual address pollution. Also, the VACB uses the lower bits of the virtual address to store the number of I/O operations currently being performed on that range, so with a larger view size the VACB would have to grow by a few bits or it would be able to handle fewer concurrent I/O operations.
If the view were the whole size of the file, there would quickly be a lot of virtual address pollution, because it would be mapping in the whole of every file that is read, and file mappings are supposed to be for user processes which knowingly map a whole file view into its virtual address space, expecting the whole of the file to be accessed. There would also be a lot of virtual external fragmentation, because the views wouldn't be the same size.
As for executable images, they are mapped in separately with separate prototype PTEs and separate physical pages, separate control area, separate segment and subsection object to the data file map for the file. The process maps the image in, but the kernel also maps images for ntoskrnl.exe, hal.dll in large pages, and then driver images are on the system PTE working set.

Coherent DMA memory on ARM

I'm new to ARM/Linux and there are things that aren't clear to me. (I might be completely off on this.)
I'm trying to get coherent memory allocated for my device driver (i.e., a region that is non-cached or write-through).
So I attempt to do that with dma_alloc_coherent in Linux.
When I inspect the page table attributes, I notice that I get "Shareable device" memory type.
There are a few memory types regarding the cache policy as in the link below:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Cacgehgd.html
I was expecting that I would get non-cacheable or write-through memory. What is the cache policy of the "Shareable Device" type, and how does it differ from the explicit non-cacheable and write-through memory types?
Actually, depending on the ARM architecture release, it is possible that cached memory regions remain coherent after DMA transfers. There is an extension to the AMBA spec (ACE, the AXI Coherency Extensions) that keeps cache memories coherent after another master has performed a transfer; in other words, after another core or a DMA engine performs a transfer, your cache will have the updated values (or at least the affected lines are marked invalid).
This means that, if the Linux kernel is aware of your ARM architecture release, it relies on the coherency mechanism to update the caches, and thus the pages are marked as shareable.
Please see issue D of the ACE Protocol Specification on the ARM site (registration required) for more information.
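For context, here is a rough C sketch of how such a coherent allocation typically appears in a Linux platform driver; the device variable, the buffer size and the surrounding probe function are hypothetical, and error handling is trimmed. The point is that the mapping attributes (Normal non-cacheable, Shareable Device, or even cacheable on hardware-coherent systems) are picked by the architecture code, not by the driver:

/* Sketch of a coherent DMA allocation in a Linux platform driver probe.
 * "my_pdev" and BUF_SIZE are hypothetical; error handling trimmed. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/platform_device.h>

#define BUF_SIZE (64 * 1024)

static int my_probe(struct platform_device *my_pdev)
{
    dma_addr_t dma_handle;   /* bus address to program into the device */
    void *cpu_addr;          /* kernel virtual address for the CPU */

    cpu_addr = dma_alloc_coherent(&my_pdev->dev, BUF_SIZE, &dma_handle, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;

    /* ... program dma_handle into the device, use cpu_addr from the CPU ... */

    dma_free_coherent(&my_pdev->dev, BUF_SIZE, cpu_addr, dma_handle);
    return 0;
}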

What's the reason why we must flush data cache after copy code from flash to ram?

In an embedded system, a boot-loader is used to initialize the board and load the image. Usually the boot-loader runs from NOR flash during the first stage and needs to copy itself (.text + .data) from flash to RAM, then jump into RAM to execute the code.
My question is: when copying code from flash to RAM with the caches enabled, must we flush the data cache and invalidate the instruction cache? I found that U-Boot and other bootloaders perform this operation, but if I don't do it, the system still boots successfully. What is the reason we must flush the data cache after copying code from flash to RAM?
Simple embedded MCUs usually do not have any means to "snoop" the bus checking if anybody (even itself) invalidates cache contents with writes to cached memory addresses.
If your MCU has separate data and instruction caches (which most modern MCUs have) and you copy code as data from flash to RAM, you need to flush the data cache (to ensure everything you copied is physically written to RAM) and invalidate the instruction cache (which might contain "old" information from before the copy) to really execute the code you just copied instead of executing what was there before and still resides in instruction cache.
You might get away with not doing the latter if you can be sure your MCU has never "seen" the memory area you just copied into (since it will not have cached anything and needs to physically read RAM anyway), but it's good practice to do the data cache flush and instruction cache invalidation nevertheless, to stay on the safe side.
Copying code from flash to RAM is a special case of self modifying code and you as a programmer need to make sure it's doing no damage.
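As an illustration of that sequence (clean the D-cache, then invalidate the I-cache), here is a hedged C sketch assuming a Cortex-M7 built against CMSIS; the symbols code_in_flash, code_in_ram and code_size are hypothetical, and the CMSIS header is normally pulled in via your device header:

/* Sketch: copy code from flash to RAM, then do the cache maintenance.
 * Assumes a Cortex-M7 with CMSIS-Core (core_cm7.h) available. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "core_cm7.h"   /* usually included via the vendor's device header */

extern const uint8_t code_in_flash[];   /* hypothetical linker symbols */
extern uint8_t code_in_ram[];
extern const size_t code_size;

void copy_and_prepare(void)
{
    memcpy(code_in_ram, code_in_flash, code_size);

    /* Clean (flush) the D-cache so the copied bytes really reach RAM. */
    SCB_CleanDCache_by_Addr((uint32_t *)code_in_ram, (int32_t)code_size);

    /* Invalidate the I-cache so stale instructions are not executed. */
    SCB_InvalidateICache();

    /* Barriers: make sure the maintenance completes before branching. */
    __DSB();
    __ISB();

    /* ... now it is safe to branch to code_in_ram (mind the Thumb bit). */
}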
I think one main reason shows up with multi-core CPUs. It is especially important in the case of asymmetrical cores, e.g. the i.MX6SoloX (Cortex-A9 and Cortex-M4 on a single chip).
For example, on the i.MX6SoloX, if the slave core (M4) runs from RAM (DDR), the main core (A9) is the CPU that has to load the M4's code into RAM at the correct position. Those cores have separate D-caches that don't see each other. If the A9 core doesn't flush its D-cache after the flash-to-RAM copy, part of the code is not actually written to RAM, because it is still sitting in the D-cache. If you perform this copy from U-Boot, you can see that the A9 (which is running U-Boot) sees all the data correctly copied, but the M4 sees all the code except the part that is still in the A9's D-cache.
In your case (a single core, I guess) it is not mandatory that U-Boot flush the D-cache (after the kernel copy, I guess), because the owner of the D-cache is the core itself: it can see its own code in all of its memories.
In the end, the reason is that to guarantee that a copy has completely written the data to a specific address, you have to flush the D-cache; otherwise some of the data may still be sitting only in the cache.

Why does copy-on-write virtual memory need to be backed by a disk page?

Reading about copy-on-write in Windows memory management, it says that the system will find a free page in RAM for the shared memory (backed immediately by a disk page).
Why is it necessary to back the RAM page with a disk page? It is not swapped out, it has just been created.
I remember RAM pages only get swapped out when there are not enough free RAM pages.
The system needs the guarantee that when the write eventually happens, space will be available. An allocation cannot be allowed to succeed now if the system may run out of disk space when the write occurs later.
That doesn't mean the disk is written to; the page reservation is merely bookkeeping.
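As a concrete illustration (my own C sketch, not part of the answer): on Windows, passing INVALID_HANDLE_VALUE to CreateFileMapping creates a section backed by the paging file; the commit charge for the whole region is reserved up front, even though no physical page is used and nothing is written to disk until the memory is actually touched:

/* Sketch: an anonymous (pagefile-backed) shared mapping on Windows.
 * The commit charge for the whole region is reserved at creation time;
 * that reservation is the "backed by a disk page" bookkeeping. */
#include <windows.h>
#include <stdio.h>

int main(void) {
    const SIZE_T size = 16 * 1024 * 1024;   /* 16 MiB, arbitrary */

    HANDLE h = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                  0, (DWORD)size, NULL);
    if (!h) { fprintf(stderr, "CreateFileMapping failed: %lu\n", GetLastError()); return 1; }

    unsigned char *view = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, size);
    if (!view) { fprintf(stderr, "MapViewOfFile failed: %lu\n", GetLastError()); return 1; }

    view[0] = 42;   /* first touch: only now does a physical page get used */

    UnmapViewOfFile(view);
    CloseHandle(h);
    return 0;
}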

redis bgsave failed because fork Cannot allocate memory

All:
Here is my server memory info from 'free -m':
             total       used       free     shared    buffers     cached
Mem:         64433      49259      15174          0          3         31
-/+ buffers/cache:       49224      15209
Swap:         8197         184       8012
My redis-server has used 46 GB of memory; there is almost 15 GB of memory left free.
To my knowledge, fork is copy-on-write, so it should not fail when there is 15 GB of free memory, which is enough to allocate the necessary kernel structures.
Besides, when redis-server used 42 GB of memory, bgsave was OK and fork was OK too.
Is there any VM parameter I can tune to make fork succeed?
More specifically, from the Redis FAQ
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does to write to disk, so the OS may pre-emptively fail the fork.
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then reload the settings with:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf
From proc(5) man pages:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4, any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size of the swap space, RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.
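To make the failure mode concrete, here is a small C sketch of my own (not from the question) that mimics what Redis does: allocate and touch a lot of anonymous memory, then fork a child that only reads it. Under overcommit_memory=2 (or 0 with little free memory) the fork can fail with ENOMEM even though copy-on-write means the child would hardly use any extra RAM; under overcommit_memory=1 it succeeds. Assumes a 64-bit Linux machine, and the 4 GiB size is arbitrary:

/* Sketch: why fork() of a large process can fail with ENOMEM.
 * The child never writes, so COW means almost no extra RAM is needed,
 * but strict overcommit still demands enough commit to duplicate it all. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 4UL * 1024 * 1024 * 1024;   /* 4 GiB of anonymous memory */
    char *buf = malloc(size);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 1, size);                     /* touch it so it is really resident */

    pid_t pid = fork();                       /* analogous to Redis' bgsave */
    if (pid < 0) {
        perror("fork");                       /* ENOMEM here under strict overcommit */
        return 1;
    }
    if (pid == 0) {
        return buf[42];                       /* child only reads, COW copies nothing */
    }
    int status;
    waitpid(pid, &status, 0);
    free(buf);
    return 0;
}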
Redis's fork-based snapshotting method can effectively double physical memory usage and easily cause an OOM in cases like yours. Reliance on Linux virtual memory for snapshotting is problematic, because Linux has no visibility into Redis data structures.
Recently a new redis-compatible project Dragonfly has been released. Among other things, it solves the OOM problem entirely. (disclosure - I am the author of this project).
