ext4 commit= mount option and dirty_writeback_centisecs - performance

I'm tring to understand the way bytes go from write() to the phisical disk plate to tune my picture server performance.
Thing I don't understand is what is the difference between these two: commit= mount option and dirty_writeback_centisecs. Looks like they are about the same procces of writing changes to the storage device, but still different.
I do not get it clear which one fires first on the way to the disk for my bytes.

Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user#chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(edit: This dirty and writeback is rather high, I had a compile running when I ran this.)
So data to be written out is dirty. Dirty data can still be eliminated (if say, a temporary file is created, used, and deleted before it goes to writeback, it'll never have to be written out). As dirty data is moved into writeback, the kernel tries to combine smaller requests that may be into dirty into single larger I/O requests, this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) Enough data is cached to get up to vm.dirty_background_ratio. b) As data gets to be vm.dirty_writeback_centisecs centiseconds old (3000 default is 30 seconds) it is put into writeback. vm.dirty_writeback_centisecs, a writeback daemon is run by default every 500 centiseconds (5 seconds) to actually flush out anything in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to try to balance disk reads and writes, it stalls the device doing 100% writes until the sync completes.
commit=5 default ext4 mount option actually forces a sync() every 5 seconds on that filesystem. This is intended to ensure that writes are not unduly delayed if there's heavy read activity (ideally losing a maximum of 5 seconds of data if power is cut or whatever.) What I found with an Ubuntu install on SDCard (in a Chromebook) is that this actually just leads to massive filesystem stalls like every 5 seconds if you're writing much to the card, ChromeOS uses commit=600 and I applied that Ubuntu-side to good effect.

The dirty_writeback_centisecs, configures the daemons of the kernel Linux related to the virtual memory (that's why the vm). Which are in charge of making a write back from the RAM memory to all the storage devices, so if you configure the dirty_writeback_centisecs and you have 25 different storage devices mounted on your system it will have the same amount of time of writeback for all the 25 storage systems.
While the commit is done per storage device (actually is per filesystem) and is related to the sync process instead of the daemons from the virtual memory.
So you can see it as:
dirty_writeback_centisecs
writing from RAM to all filesystems
commit
each filesystem fetches from RAM

Related

OpenZFS on Windows: less available space than capacity in single disk pool

Creating a new pool by using the instructions from readme, as follows:
zpool create -O casesensitivity=insensitive -O compression=lz4 -O atime=off -o ashift=12 tank PHYSICALDRIVE1
I get less available space showing up in file explorer and zpool, than the disk capacity itself: 1.76TiB vs 1.81TiB
zpool list and zfs list -r poolname show the difference:
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1,81T 360K 1,81T - - 0% 0% 1.00x ONLINE -
zfs list -r tank
NAME USED AVAIL REFER MOUNTPOINT
tank 300K 1,76T 96K /tank
I'm not sure of the reason. Is there something that ZFS uses the space for?
Does it ever become available for use, or is it reserved, e.g. for root like on ext4?
Because it is copy on write, even deleting stuff requires using a tiny bit of extra storage in ZFS: until the old data has been marked as free (which requires writing newly-created metadata), you can’t start allocating the space it was using to store new data. A small amount of storage is reserved so that if you completely fill your pool it’s still possible to delete stuff to free up space. If it didn’t do this, you could get wedged, which wouldn’t be fixable unless you added more disks to your pool / made one of the disks larger if you are using virtualized storage.
There are also other small overheads (metadata storage, etc.) but I think most of the holdback you’re seeing is related to the above since it doesn’t look like you’ve written anything into the pool yet.

how metadata journal stalls write system call?

I am using CentOS 7.1.1503 with kernel linux 3.10.0-229.el7.x86_64, ext4 file system with ordered journal mode, and delalloc enabled.
When my app writes logs to a file continually (about 6M/s), I can find the write system call stalled for 100-700 ms occasionally, When I disable ext4's journal feature, or set the journal mode to writeback, or disable delay allocate, the stalling disappears. When I set linux's writeback more frequent, dirty page expire time shorter, the problem reduced.
I printed the process's stack when stalling happends, I got this:
[<ffffffff812e31f4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0195854>] ext4_da_get_block_prep+0x1a4/0x4b0 [ext4]
[<ffffffff811fbe17>] __block_write_begin+0x1a7/0x490
[<ffffffffa019b71c>] ext4_da_write_begin+0x15c/0x340 [ext4]
[<ffffffff8115685e>] generic_file_buffered_write+0x11e/0x290
[<ffffffff811589c5>] __generic_file_aio_write+0x1d5/0x3e0
[<ffffffff81158c2d>] generic_file_aio_write+0x5d/0xc0
[<ffffffffa0190b75>] ext4_file_write+0xb5/0x460 [ext4]
[<ffffffff811c64cd>] do_sync_write+0x8d/0xd0
[<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
[<ffffffff811c76b8>] SyS_write+0x58/0xb0
[<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
I read the source code of linux kernel, and found that call_rwsem_down_read_failed will call rwsem_down_read_failed which will keep waiting for rw_semaphore.
I think the reason is that metadata journal flushing must wait for related dirty pages flushed, when dirty pages flushing took long time, the journal blocked, and journal commit has this inode's rw_semaphore, write system call to this inode is stalled.
I really hope I can find evidence to prove it.

What's the reason why we must flush data cache after copy code from flash to ram?

In embedded system, boot-loader used to init the board and load image. Usually, boot-loader runs in norflash during the 1st stage and need copy itself (.txte+.date code) from flash to ram, then jump to ram execute code.
My question is: when copy code from flash to RAM and cache enable, do we must flush data cache and invalidate the instruction cache? I found uboot and other bootloader execute this operation, but if I don't do that, the system still could boot successful, what's the reason why we must flush data cache after copy code from flash to ram?
Simple embedded MCUs usually do not have any means to "snoop" the bus checking if anybody (even itself) invalidates cache contents with writes to cached memory addresses.
If your MCU has separate data and instruction caches (which most modern MCUs have) and you copy code as data from flash to RAM, you need to flush the data cache (to ensure everything you copied is physically written to RAM) and invalidate the instruction cache (which might contain "old" information from before the copy) to really execute the code you just copied instead of executing what was there before and still resides in instruction cache.
You might get away not doing the latter if you can be sure your MCU has never "seen" the memory area before you just copied (since it will not have cached anything and needs to physically read RAM anyway), but it's good practice to do data cache flushes and instruction cache invalidation nevertheless to stay on the safe side.
Copying code from flash to RAM is a special case of self modifying code and you as a programmer need to make sure it's doing no damage.
I think a main reason can be found in "more than one core" CPUs. Very important in case of asymmetrical cores, eg. i.MX6SoloX (Cortex A9 and Cortex M4 on single chip).
For example in i.MX6SoloX if the slave core (M5) run on RAM (DDR) the main core (A9) is the main cpu that has to provide the M4 core code loaded into RAM at the correct position. Those cores have different D-caches that don't see each other. If the A9 core, after the FLASH to RAM copy, doesn’t flush it’s D-Chache, some part of code is not copied to RAM actually, because of still in D-Cache memory. If you perform this copy from u-boot you can see that A9 (that is running U-Boot) see all data correctly copied, but the M4 see all code, but not the code that still in D-Cache of A9 core.
In your case (a single core like you, I guess) is not mandatory that U-Boot flush the D-cache (after the kernel copy I guess), because the owner of the D-cache is the core itself: it can see its code inside all its memories.
At the very end the reason is that to grant that a performed data copy completely wrote data to a specific address you have to flush D-Cache, otherwise some data can still in caches.

redis bgsave failed because fork Cannot allocate memory

all:
here is my server memory info with 'free -m'
total used free shared buffers cached
Mem: 64433 49259 15174 0 3 31
-/+ buffers/cache: 49224 15209
Swap: 8197 184 8012
my redis-server has used 46G memory, there is almost 15G memory left free
As my knowledge,fork is copy on write, it should not failed when there has 15G free memory,which is enough to malloc necessary kernel structures .
besides, when redis-server used 42G memory, bgsave is ok and fork is ok too.
Is there any vm parameter I can tune to make fork return success ?
More specifically, from the Redis FAQ
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can't tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 says Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
Redis doesn't need as much memory as the OS thinks it does to write to disk, so may pre-emptively fail the fork.
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then restart sysctl with:
On FreeBSD:
sudo /etc/rc.d/sysctl reload
On Linux:
sudo sysctl -p /etc/sysctl.conf
From proc(5) man pages:
/proc/sys/vm/overcommit_memory
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE set are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4
any non-zero value implies mode 1. In mode 2 (available since Linux 2.6), the total virtual address space on the system is limited to (SS + RAM*(r/100)), where SS is the size
of the swap space, and RAM is the size of the physical memory, and r is the contents of the file /proc/sys/vm/overcommit_ratio.
Redis's fork-based snapshotting method can effectively double physical memory usage and easily OOM in cases like yours. Reliance on linux virtual memory for doing snapshotting is problematic, because Linux has no visibility into Redis data structures.
Recently a new redis-compatible project Dragonfly has been released. Among other things, it solves the OOM problem entirely. (disclosure - I am the author of this project).

Memory Usage in R

After creating large objects and running out of RAM, I will try and delete the objects in my current environment using
rm(list=ls())
When I check my RAM usage, nothing has changed. Even after calling gc() nothing has changed. I can only replenish my RAM by quitting R.
Anybody have advice for dealing with memory-intensive objects within R?
Memory for deleted objects is not released immediately. R uses a technique called "garbage collection" to reclaim memory for deleted objects. Periodically, it cycles through the list of accessible objects (basically, those that have names and have not been deleted and can therefore be accessed by the user), and "tags" them for retention. The memory for any untagged objects is returned to the operating system after the garbage-collection sweep.
Garbage collection happens automatically, and you don't have any direct control over this process. But you can force a sweep by calling the command gc() from the command line.
Even then, on some operating systems garbage collection might not reclaim memory (as reported by the OS). Older versions of Windows, for example, could increase but not decrease the memory footprint of R. Garbage collection would only make space for new objects in the future, but would not reduce the memory use of R.
On Windows, the technique you describe works for me. Try the following example.
Open the Windows Task Manager (CTRL+SHIFT+ESC).
Start RGui. RGui.exe mem usage is 27 460K.
Type
gcinfo(TRUE)
x <- rnorm(1e8)
RGui.exe mem usage is now 811 100K.
Type rm("x"). RGui.exe mem usage is still 811 100K.
Type gc(). RGui.exe mem usage is now 28 332K.
Note that gc shoud be called automatically if you have removed objects from your workspace, and then you try to allocate more memory to new variables.
My impression is that multiple forms of gc() are tried before R reports failed memory allocation. I'm not aware of a solution for this at present, other than restarting R as you suggest. It appears that R does not defragment memory.
An old question, I realize, but I've found that (on OS Mojave), invoking pryr::mem_used() in the R session causes the activity monitor to immediately update the reported memory usage to reflect only the objects retained in the R environment.

Resources