Write buffer reaction on MESI-induced messages

Suppose we have the following situation:
2 CPUs with write buffers, and MESI is used as the cache coherence protocol. There is one cache line shared between the CPUs:
CPU1 cache: |I|I|S|I|I|
CPU2 cache: |I|I|S|I|I|
Now CPU1 decides to modify the shared line. It puts the change into its write buffer and sends an invalidation message to CPU2. CPU2 receives it and sends an acknowledgment:
CPU1 cache: |I|I|S|I|I|, CPU1 write buffer: change for the 3rd cache line
CPU2 cache: |I|I|I|I|I|
Is it right that on receiving the acknowledgment CPU1 doesn't have to flush the write buffer and change the cache line state to M (Modified)? If not, then let's go further.
Suppose now CPU2 wants to read this cache line again. Should the snooping CPU1 intercept this read request, flush its write buffer, flush the cache line, and send the latest value of the cache line to CPU2? Or can it ignore the request, so that CPU2 reads the old value from RAM (which hasn't been updated yet)?

In the standard MESI protocol, there is no acknowledgment of invalidate transactions. In your example, the cache line transitions to the M state in CPU1 and the write is retired from the write buffer.
When CPU2 does a read, CPU1 writes the modified line back to memory. CPU2 can get the value either from CPU1 or from memory (only after CPU1's write completes). The write to memory is needed because there is no O (Owned) state.
There are protocols like the Dragon protocol that do use a signal to indicate if a cache line is shared or not.
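To make the two transitions above concrete, here is a minimal C sketch of my own (not from any real coherence implementation); the names cache_line, local_write, and remote_read are made up for illustration. A write to a Shared line invalidates the other copy and moves the line to Modified; a later remote read forces a write-back because plain MESI has no Owned state.

#include <stdio.h>

/* MESI states for a single cache line. */
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct cache_line { enum mesi state; int data; };

/* CPU1 writes a Shared line: the other copy is invalidated and the
 * local line becomes Modified (no acknowledgment is modeled, as in
 * standard MESI). */
static void local_write(struct cache_line *own, struct cache_line *other, int v)
{
    if (own->state == SHARED || own->state == EXCLUSIVE) {
        other->state = INVALID;   /* bus invalidate */
        own->data = v;
        own->state = MODIFIED;    /* write retired from the write buffer */
    }
}

/* CPU2 reads the line again: the Modified owner writes it back to
 * memory and both copies end up Shared. */
static int remote_read(struct cache_line *own, struct cache_line *other, int *memory)
{
    if (other->state == MODIFIED) {
        *memory = other->data;    /* write-back: no Owned state in MESI */
        other->state = SHARED;
    }
    own->data = *memory;
    own->state = SHARED;
    return own->data;
}

int main(void)
{
    int memory = 1;
    struct cache_line cpu1 = { SHARED, 1 }, cpu2 = { SHARED, 1 };

    local_write(&cpu1, &cpu2, 42);            /* CPU1 modifies the line */
    printf("CPU2 reads %d\n", remote_read(&cpu2, &cpu1, &memory));
    return 0;
}

Compiling and running this prints "CPU2 reads 42", i.e. CPU2 only sees the new value after CPU1's write-back, matching the answer above.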

Related

When should I use REQ_OP_FLUSH in a kernel blockdev driver? (Do REQ_OP_FLUSH bios flush dirty RAID controller caches?)

When should I use REQ_OP_FLUSH in my kernel blockdev driver, and what is the expected behavior of the hardware that receives the REQ_OP_FLUSH (or equivalent SCSI cmd)?
In the Linux kernel, when a struct bio flagged with REQ_OP_FLUSH is passed to a RAID controller volume in writeback mode, is the RAID controller supposed to flush its dirty caches?
It seems to me that this is the purpose of REQ_OP_FLUSH but that is at odds with wanting to be fast with writeback: If the cache is battery-backed, shouldn't the controller ignore the flush?
In ext4's super.c ext4_sync_fs() function, the write skips a call to blkdev_issue_flush() when barriers are disabled via the barrier=0 mount option. This seems to imply that RAID controllers will flush their caches when they are told to...but does RAID firmware ever break the rules?
Is the flush behavior dependent on the firmware implementation and manufacturer?
Where is the SAS/SCSI specification on the subject?
Other considerations?
Christoph Hellwig on the linux-block mailing list said:
Devices with power fail protection will advertise that (using VWC flag in NVMe for example) and [the Linux kernel] will never send flushes.
Keith Busch at kernel.org:
You can check the queue attribute, /sys/block/<disk>/queue/write_cache. If the value is "write through", then the device is reporting it doesn't have a volatile cache. If it is "write back", then it has a volatile cache.
If this sounds backwards, then consider this using a RAID controller cache as an example:
1. A RAID controller with a non-volatile "writeback" cache (from the controller's perspective, ie, with battery) is a "write through" device as far as the kernel is concerned, because the controller will return the write as complete as soon as it is in the persistent cache.
2. A RAID controller with a volatile "writeback" cache (from the controller's perspective, ie, without battery) is a "write back" device as far as the kernel is concerned, because the controller will return the write as complete as soon as it is in the cache, but the cache is not persistent! So in that case flush/FUA is necessary.
[ Reference: https://lore.kernel.org/all/273d3e7e-4145-cdaf-2f80-dc61823dd6ea#ewheeler.net/ ]
From personal experience, not all RAID controllers will properly set queue/write_cache as indicated by Keith above. If you know your array has a non-volatile cache running in write-back mode, then check to make sure it is reported as "write through" so flushes will be dropped:
]# cat /sys/block/<disk>/queue/write_cache
<cache status>
and fix it if it isn't in the proper mode. The settings below might seem backwards, but if they do, re-read #1 and #2 above, because these are correct:
If you have a non-volatile cache (ie, with BBU):
]# echo "write through" > /sys/block/<disk>/queue/write_cache
If you have a volatile cache (ie, without BBU):
]# echo "write back" > /sys/block/<disk>/queue/write_cache
So the answer to the question about when to flag REQ_OP_FLUSH in your kernel code is this: whenever you think your code should commit to disk. Since the block layer can re-order any bio request, you need to:
1. Send a WRITE IO and wait for its completion.
2. Send a flush and wait for flush completion.
Then you are guaranteed to have the IO from #1 on disk.
However, if the device being written has its write_cache in "write through" mode, then the flush will complete immediately, and it's up to your controller to do its job and keep the non-volatile cache intact, even after a power loss (BBU, supercap, flashcache, etc.).
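For a user-space analogue of the write-then-flush sequence above (this is only an illustration, not kernel bio code; the file name data.log is made up), write() followed by fsync() gives the same two-step ordering: the data write completes first, and then a flush reaches the device unless it reports "write through":

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* "data.log" is just an example path for this sketch. */
    int fd = open("data.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    const char buf[] = "committed record\n";

    /* Step 1: submit the write and wait for its completion. */
    if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
        perror("write"); return EXIT_FAILURE;
    }

    /* Step 2: issue the flush and wait. The filesystem typically turns
     * this into a flush request at the block layer unless the device's
     * write_cache is "write through". */
    if (fsync(fd) != 0) { perror("fsync"); return EXIT_FAILURE; }

    close(fd);
    return 0;
}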

How does the metadata journal stall the write system call?

I am using CentOS 7.1.1503 with Linux kernel 3.10.0-229.el7.x86_64, an ext4 file system in ordered journal mode, and delalloc enabled.
When my app writes logs to a file continually (about 6M/s), I occasionally see the write system call stall for 100-700 ms. When I disable ext4's journal feature, set the journal mode to writeback, or disable delayed allocation, the stalls disappear. When I make Linux's writeback more frequent and shorten the dirty page expire time, the problem is reduced.
I printed the process's stack when a stall happens, and got this:
[<ffffffff812e31f4>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffffa0195854>] ext4_da_get_block_prep+0x1a4/0x4b0 [ext4]
[<ffffffff811fbe17>] __block_write_begin+0x1a7/0x490
[<ffffffffa019b71c>] ext4_da_write_begin+0x15c/0x340 [ext4]
[<ffffffff8115685e>] generic_file_buffered_write+0x11e/0x290
[<ffffffff811589c5>] __generic_file_aio_write+0x1d5/0x3e0
[<ffffffff81158c2d>] generic_file_aio_write+0x5d/0xc0
[<ffffffffa0190b75>] ext4_file_write+0xb5/0x460 [ext4]
[<ffffffff811c64cd>] do_sync_write+0x8d/0xd0
[<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
[<ffffffff811c76b8>] SyS_write+0x58/0xb0
[<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
I read the Linux kernel source and found that call_rwsem_down_read_failed calls rwsem_down_read_failed, which keeps waiting on the rw_semaphore.
I think the reason is that flushing the metadata journal must wait for the related dirty pages to be flushed; when flushing the dirty pages takes a long time, the journal commit blocks while holding this inode's rw_semaphore, so a write system call to the same inode stalls.
I really hope I can find evidence to prove it.
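For what it's worth, a minimal way to capture such stalls (my own sketch, not the original app; the file name, chunk size, and 100 ms threshold are arbitrary) is to time each write() with CLOCK_MONOTONIC:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (64 * 1024)          /* ~6 MB/s at ~100 writes per second */
#define STALL_MS 100.0             /* report writes slower than this    */

int main(void)
{
    int fd = open("test.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    static char buf[CHUNK];
    memset(buf, 'x', sizeof(buf));

    for (;;) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (write(fd, buf, sizeof(buf)) < 0) { perror("write"); break; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        if (ms > STALL_MS)
            printf("write() stalled for %.0f ms\n", ms);

        usleep(10000);             /* ~100 writes per second */
    }
    close(fd);
    return 0;
}

At 64 KB per write and roughly 100 writes per second this approximates the log rate described above, and any write() that blocks behind a journal commit shows up as a printed stall.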

ext4 commit= mount option and dirty_writeback_centisecs

I'm trying to understand the way bytes go from write() to the physical disk platter so I can tune my picture server's performance.
What I don't understand is the difference between these two: the commit= mount option and dirty_writeback_centisecs. They look like they are about the same process of writing changes to the storage device, but they are still different.
It isn't clear to me which one fires first on my bytes' way to the disk.
Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user@chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(Edit: these Dirty and Writeback values are rather high; I had a compile running when I ran this.)
So data to be written out is dirty. Dirty data can still be eliminated (if, say, a temporary file is created, used, and deleted before it goes to writeback, it will never have to be written out). As dirty data is moved into writeback, the kernel tries to combine smaller dirty requests into single larger I/O requests; this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) enough data is cached to reach vm.dirty_background_ratio, or b) the data becomes vm.dirty_expire_centisecs centiseconds old (the default of 3000 is 30 seconds). Per vm.dirty_writeback_centisecs, a writeback daemon runs by default every 500 centiseconds (5 seconds) to actually flush out anything in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to balance disk reads and writes; it stalls the device doing 100% writes until the sync completes.
The default ext4 mount option commit=5 actually forces a sync() of that filesystem every 5 seconds. This is intended to ensure that writes are not unduly delayed if there's heavy read activity (ideally losing a maximum of 5 seconds of data if power is cut or whatever). What I found with an Ubuntu install on SDCard (in a Chromebook) is that this actually just leads to massive filesystem stalls about every 5 seconds if you're writing much to the card. ChromeOS uses commit=600, and I applied that Ubuntu-side to good effect.
dirty_writeback_centisecs configures the kernel daemons related to virtual memory (hence the vm. prefix), which are in charge of writing data back from RAM to all the storage devices. So if you configure dirty_writeback_centisecs and you have 25 different storage devices mounted on your system, they all get the same writeback interval.
The commit option, on the other hand, is applied per storage device (actually per filesystem) and is related to the sync process rather than to the virtual memory daemons.
So you can see it as:
dirty_writeback_centisecs: writing from RAM to all filesystems
commit: each filesystem fetching its own data from RAM
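If you want to check both knobs on a running system, the vm sysctls live under /proc/sys/vm and the commit= interval (when set explicitly) shows up in /proc/mounts; the small C sketch below just prints them (illustration only):

#include <stdio.h>
#include <string.h>

static void print_file(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (f && fgets(line, sizeof(line), f))
        printf("%s: %s", path, line);
    if (f)
        fclose(f);
}

int main(void)
{
    /* Global writeback knobs (apply to every mounted filesystem). */
    print_file("/proc/sys/vm/dirty_writeback_centisecs");
    print_file("/proc/sys/vm/dirty_expire_centisecs");

    /* Per-filesystem journal commit interval, shown if set explicitly. */
    FILE *m = fopen("/proc/mounts", "r");
    char line[512];
    while (m && fgets(line, sizeof(line), m))
        if (strstr(line, "commit="))
            fputs(line, stdout);
    if (m)
        fclose(m);
    return 0;
}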

Cache line locking

I understand there is a cache line locking instruction in MIPS which prevents your data from being evicted from the cache. I am curious as to what happens when you lock down all the cache lines and a new address is read.
In that case, the data at the new address is simply read from memory and not saved in the cache. Nothing terrible happens.

Cleared RW (write protect) flag for PTEs of a process in kernel yet no segmentation fault on write

I implemented incremental process checkpointing at the page level (I just dump the data from the process address space into a file).
The approach I used is as follows. I used two system calls:
Complete checkpoint: copy the entire address space. Also, if the write bit is set for a page, clear it.
Incremental checkpoint: only dump data if the write bit is set, and clear it again. So basically, I check whether the write bit is set for an incremental checkpoint. If yes, I dump the page data.
Test program:
char a[10000];
sys_cp_range(a,a+10000);
a[3]='A';
sys_incr_cp_range(a,a+10000);
From what I know, the kernel should take a page fault on the write and handle the illegal write by killing the process with SIGSEGV. Yet the program is successfully checkpointed.
What is exactly happening here ?
If you modify a PTE when it's still cached in the TLB, the effect of the modification may be unseen for a while (until the PTE gets evicted from the TLB and has to be reread from the page table).
You need to invalidate the PTE in the TLB with the invlpg (I'm assuming x86) instruction after PTE modification. And it has to be done on all CPUs. There must be a dedicated function for this purpose in the kernel.
Also it wouldn't hurt to double check that the compiler didn't reorder or throw away anything from the above code.
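One such dedicated function is flush_tlb_page(), which on x86 boils down to invlpg plus IPIs to the other CPUs. Below is a minimal kernel-side sketch, assuming your syscall has already walked the page tables to get the pte pointer and vma and holds the needed locks (the helper name wrprotect_and_flush is made up, and exact APIs vary by kernel version):

/* Kernel-side sketch, not standalone user code. */
#include <linux/mm.h>
#include <asm/tlbflush.h>

static void wrprotect_and_flush(struct vm_area_struct *vma,
                                unsigned long addr, pte_t *ptep)
{
    pte_t pte = *ptep;                        /* current PTE value   */

    pte = pte_wrprotect(pte);                 /* clear the RW bit    */
    set_pte_at(vma->vm_mm, addr, ptep, pte);  /* store the new PTE   */

    /* Make the change visible everywhere: flush the stale TLB entry
     * on this CPU and on all other CPUs using this mm. */
    flush_tlb_page(vma, addr);
}

Without that final flush, the CPU that still has the old, writable translation cached in its TLB will happily complete the write to a[3], which is exactly the behavior observed above.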
