implementation of dirty_expire_centisecs - linux-kernel

I'm trying to understand the behavior of dirty_expire_centisecs parameter on servers with 2.6 and 3.0 kernels.
Kernel documentation says (vm.txt/dirty_expire_centisecs)
"Data which has been dirty in-memory for longer than this interval will be written out next time a flusher thread wakes up."
which implies, dirty data that has been in memory for shorter than this interval will not be written.
According to my testing, behavior of dirty_expire_centisecs is as follows: when writeback timer fires before the expire timer, then no pages will be flushed, else all pages will be flushed.
If background_bytes limit reaches, it flushes all or portion depending on the rate, independent of both timers.
My testing tells me at low write rates (less than 1MB per sec), dirty_background_bytes trigger will flush all dirty pages and at slightly higher data rates (higher than 2MB per sec), it flushes only a portion of the dirty data, independent of expiry value.
This is different from what is said in the vm.txt. It make sense not to flush the most recent data. To me, observed behavior is not logical and practically useless. What do you guys think ?
My test setup:
Server with 16GB of RAM running Suse 11 SP1, SP2 and RedHat 6.2 (multi boot setup)
vm.dirty_bytes = 50000000 // 50MB <br>
vm.dirty_background_bytes = 30000000 // 30MB <br>
vm.dirty_writeback_centisecs = 1000 // 10 seconds <br>
vm.dirty_expire_centisecs = 1500 // 15 seconds <br>
with a file writing tool where I can control the write()'s per sec rate and size.

I asked this question on the linux-kernel mailing list and got an answer from Jan Kara. The timestamp that expiration is based on is the modtime of the inode of the file. Thus, multiple pages dirtied in the same file will all be written when the expiration time occurs because they're all associated with the same inode.
http://lkml.indiana.edu/hypermail/linux/kernel/1309.1/01585.html

Related

Modifying the cache access delay in gem5 does not work

When testing the cache access latency on my gem5, the access latency of l1 is 100 cycles lower than that of l2. My modification is to modify the tag_latency, data_latency, and response_latency in the L2 class in gem5/configs/common/Caches.py. Their original value was 20. I changed them all to 5 or all to 0. Every time I recompile gem5, when I run it again, the time does not change. Why is that?
I am using classical cache
By the way, does the meaning of data_latency, tag_latency and response_latency mean data access delay, tag delay, and delay in responsing to CPU ?
gem5/build/X86/gem5.opt --debug-flags=O3CPUAll --debug-start=120000000000
--outdir=gem5/results/test/final gem5/configs/example/attack_code_config.py
--cmd=final
--benchmark_stdout=gem5/results/test/final/final.out
--benchmark_stderr=gem5/results/test/final/final.err
--mem-size=4GB --l1d_size=32kB --l1d_assoc=8 --l1i_size=32kB --l1i_assoc=8
--l2_size=256kB --l2_assoc=8 --l1d_replacement=LRU --l1i_replacement=LRU
--caches --cpu-type=DerivO3CPU
--cmd --l1d_replacement etc. are the options I added to the
option.

Call to ExAllocatePoolWithTag never returns

I am having some issues with my virtualHBA driver on Windows Server 2016. A ran the HLK crashdump support test. 3 times out of 10 the test passed. In those 3 failing tests, the crashdump hangs at 0% while taking Complete dump, or Kernel dump or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf=(struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned,pcmdqSignalSize,((ULONG)'TA1'));
I searched on the web regarding this. However, all of the found pages are focusing on this function returning NULL which in my case never returns.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult/impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (i.e. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry configurable slush. The only benefit is that the environment is stripped down so you don't need to be in your all singing, all dancing configuration.

ext4 commit= mount option and dirty_writeback_centisecs

I'm tring to understand the way bytes go from write() to the phisical disk plate to tune my picture server performance.
Thing I don't understand is what is the difference between these two: commit= mount option and dirty_writeback_centisecs. Looks like they are about the same procces of writing changes to the storage device, but still different.
I do not get it clear which one fires first on the way to the disk for my bytes.
Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user#chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(edit: This dirty and writeback is rather high, I had a compile running when I ran this.)
So data to be written out is dirty. Dirty data can still be eliminated (if say, a temporary file is created, used, and deleted before it goes to writeback, it'll never have to be written out). As dirty data is moved into writeback, the kernel tries to combine smaller requests that may be into dirty into single larger I/O requests, this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) Enough data is cached to get up to vm.dirty_background_ratio. b) As data gets to be vm.dirty_writeback_centisecs centiseconds old (3000 default is 30 seconds) it is put into writeback. vm.dirty_writeback_centisecs, a writeback daemon is run by default every 500 centiseconds (5 seconds) to actually flush out anything in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to try to balance disk reads and writes, it stalls the device doing 100% writes until the sync completes.
commit=5 default ext4 mount option actually forces a sync() every 5 seconds on that filesystem. This is intended to ensure that writes are not unduly delayed if there's heavy read activity (ideally losing a maximum of 5 seconds of data if power is cut or whatever.) What I found with an Ubuntu install on SDCard (in a Chromebook) is that this actually just leads to massive filesystem stalls like every 5 seconds if you're writing much to the card, ChromeOS uses commit=600 and I applied that Ubuntu-side to good effect.
The dirty_writeback_centisecs, configures the daemons of the kernel Linux related to the virtual memory (that's why the vm). Which are in charge of making a write back from the RAM memory to all the storage devices, so if you configure the dirty_writeback_centisecs and you have 25 different storage devices mounted on your system it will have the same amount of time of writeback for all the 25 storage systems.
While the commit is done per storage device (actually is per filesystem) and is related to the sync process instead of the daemons from the virtual memory.
So you can see it as:
dirty_writeback_centisecs
writing from RAM to all filesystems
commit
each filesystem fetches from RAM

How much delay can be achieved using jiffies in kernel

I need to emulate MDC/MDIO bus using the bit-banging for MDC line. I need to get a clock with frequency of 1.5 Mhz, 1 Mhz will also do.
I am trying to use udelay and ndelay from linux/delay.h. I am working with kernel 2.6.32 and MPC8569E processor from freescale. ndelay is not giving me dealy in nanoseconds but microseconds, I saw it using a logic analyzer on the wires. So ndelay(1) and udelay(1) are effectively behaving the same, giving 1 microsecond delay.
Now In a bit-bang model the codes gonna be something like
par_io_data_set(2/*C port*/,30 /*MDIO pin*/,val /*value*/); //write data to the line MDIO line
//clock pulse for setting the data
ndelay(MDIO_DELAY);
par_io_data_set(2/*C port*/,31 /*MDC pin*/,1 /*value*/);
ndelay(MDIO_DELAY);
par_io_data_set(2/*C port*/,31 /*MDC pin*/,0 /*value*/);
Where I have defined MDIO_DELAY as 1. So I am able to get a clock of around 0.4MHz. I want to bitbang at 1.5 Mhz, but I can't just do so until I can't give delay in nanoseconds.
So I was looking at chapter 7 of ldd, jiffies. Well my HZ is 250, so the kernel interrupts for time every 1/250 secs i.e. 4milli secs right? So jiffies gonna be incremented every 4ms. So I can't expect jiffies to get a counter of the order of nano-seconds, right?
How do I get this job done?

WSASYSCALLFAILURE with overlapped IO on Windows XP

I hit a bug in my code which uses WSARecv and WSAGetOverlapped result on an overlapped socket. Under heavy load, WSAGetOverlapped returns with WSASYSCALLFAILURE ('A system call that should never fail has failed') and my TCP stream is out of sync afterwards, causing mayhem in the upper levels of my program.
So far I have not been able to isolate it to a given set of hardware or drivers. Has somebody hit this issue as well, and found a solution or workaround?
How many connections, how many pending recvs, how many outsanding sends? What does perfmon or task manager say about the amount of non-paged pool used? How much memory in the box? Does it go away if you run the program on Vista or above? Do you have any LSPs installed?
You could be exhausting non-paged pool and causing a badly written driver to misbehave when it fails to allocate memory. This issue is less likely to bite on Vista or later as the amount of non-paged pool available has increased dramatically (see http://www.lenholgate.com/blog/2009/03/excellent-article-on-non-paged-pool.html for details). Alternatively you might be hitting the "locked pages" limit (you can only lock a fixed number of pages in memory on the OS and each pending I/O operation locks one or more pages depending on buffer size and allocation alignment).
It seems I have solved this issue by sleeping 1ms and retrying the WSAGetOverlapped result when it reports a WSASYSCALLFAILURE.
I had another issue related to overlapped events firing, even though there is no data, which I also had to solve first. The test is now running for over an hour, with a few WSASYSCALLFAILURE handled correctly. Hopefully the overnight test will succeed as well.
#Len: thanks again for your help.
EDIT: The overnight test was successful. My bug was caused by two interdependent issues:
Issue 1: WaitForMultipleObjects in ConnectionSet::select occasionally
signals data on an empty socket, causing SocketConnection::readSync to
deadlock.
Fix: Do a non-blocking read on the first byte of each packet. Reset
ConnectionSet if socket was empty
Issue 2: WSAGetOverlappedResult returns occasionally WSASYSCALLFAILURE,
causing out-of-sync on the TCP stream.
Fix: Retry WSAGetOverlappedResult after a small sleep period.
http://equalizer.svn.sourceforge.net/viewvc/equalizer?view=revision&revision=4649

Resources