the linux page cache flush order - linux-kernel

There is page cache before we write data to disk.
So if I have two operations.
write(fileA)
write(fileB)
Then if the system is suddenly shutdown. We don't initiative call the sync() call.
I want to know if it is possible that the data we wrote to fileB has flush to the disk, while the data we wrote to fileA haven't been flush to the disk?

I believe that it is possible for fileB to be written to disk before fileA, as the writes will be bundled into block I/O requests and can be reordered at the block device layer by the I/O scheduler in an attempt to minimise disk seeking.
See the kernel documentation for more info about the I/O scheduler (elevator):
http://lxr.free-electrons.com/source/Documentation/block/biodoc.txt#L885

To answer you question in short you may want to consider calling sync() or fsync() system call in your application after the write() to make sure that data is sync-ed to the disk immediately.
flush (or pdflush) kernel threads are responsible for syncing dirty pages to the disk. When the system is being shutdown properly, all the dirty buffers get synced / written to disk. However, this is not the same in case of abrupt power failures as data which is not yet flushed / sync-ed to the disk is obviously lost.
If you don’t call sync() in your application, then dirty buffers get written to the disk upon certain kernel tunables. You could control how the application data gets synced (inactive dirty pages) through sysctl kernel tunable. You may want to consider reading more about the following:
vm.dirty_expire_centisecs - how old (in 1/100th of a sec) the dirty
pages must be before writing them to the disk
vm.dirty_writeback_centisecs - how often the kernel will wake up the
BDI-flush thread to sync the dirty pages onto the disk
vm.dirty_background_ratio - percentage of system memory which when
dirty then system can start writing data to the disks
vm.dirty_ratio - percentage of system memory which when dirty the the
process doing writes should block to write out dirty pages to the
disks

Related

Can I create a file in Windows that only exists in memory - and if so, how?

This question is not a duplicate of any of these existing questions:
How can I store an object File that only exists in memory as a file inside of my storage system? - This question is not about Java's File API.
Temp file that exists only in RAM? - This is close to what I'm asking, except the OP isn't asking how to create files from memory for the purposes of sharing passing them to child-processes
I'm not asking about Win32's Memory-mapped File either - as they're essentially the opposite of what I'm after: a memory-mapped file is a file-on disk that's mapped to a process' virtual memory space - whereas what I want is a file that exists in the OS' filesystem (but not the disk's physical filesystem) like a mount-point and that file's data is mapped to an existing buffer in memory.
I.e., with Memory Mapped Files, writing/writing to a byte at a particular buffer address and offset in memory will cause the byte at the same offset from the start of the file to be modified - but the file physically exists on-disk, which isn't what I want.
To elaborate and to provide context:
I have an ASP.NET Core server-side application that receives request streams sized between 1 and 10MB on a regular basis. This program will run only on Windows / Windows Server, so using Windows-specific functionality is fine.
75% of the time my application just reads through these streams by itself and that's it.
But a minority of the time it needs to have a separate applications read the data which it starts using Process.Start and passing the file-name as a command-line argument.
It passes the data to these separate applications by saving the stream to a temporary file on-disk and passing the filename of that stream.
Unfortunately it can't write the content to the child-process's stdin because some of the those programs expect a file on-disk rather than reading from stdin.
Additionally, while the machine it's running on has lots of RAM (so keeping the streams buffered in-memory is fine) it has slow spinning-rust HDDs, which is further reason to avoid temporary files on-disk.
I'd like to avoid unnecessary buffering and copies - ideally I'd like to stream the entire 1-10MB request into a single in-memory buffer, and then expose that same buffer to other processes and use that same buffer as the backing for a temporary file.
If I were on Linux, I could use tmpfs - it isn't perfect:
To my knowledge, an existing process can't instruct the OS to take an existing region of its virtual-memory and map a file in tmpfs to that memory region, instead tmpfs still requires that the file be populated by writing (i.e. copying) all of the data to its file-descriptor - which is counter to the aim of having a zero-copy system.
Windows' built-in RAM-disk functionality is limited to providing the basis for a RAM-disk implementation via a third-party device-driver - I'm surprised that Microsoft never shipped Windows with a built-in RAM-disk GUI or API, especially given their relative simplicity.
The ImDisk program is an implementation of a RAM-disk using Microsoft's RAM-disk driver platform, but as far as I can tell while it's more like tmpfs in that it can create a file that exists only in-memory, it doesn't allow the file's data to be backed by a buffer directly accessible to a running process (or a shared-memory buffer).
CreateFileMapping with hFile = INVALID_HANDLE_VALUE "creates a file mapping object of a specified size that is backed by the system paging file instead of by a file in the file system".
From Raymond Chen's The source of much confusion: “backed by the system paging file”:
In other words, “backed by the system paging file” just means “handled like regular virtual memory.”
If the memory is freed before it ever gets paged out, then it will never get written to the system paging file.

Two views of a same memory mapped file in windows, how do they interact flush-wise?

I'm specifically interested in the flush behaviour.
Suppose we created a MMF with CreateFileMapping(), and opened two views, V1 and V2 using MapViewOfFile() with zero offset.
Then I write something to A=V1+a and something to B=V2+b such that A and B belong to different physical memory pages.
Then if I flush the whole first view using FlushViewOfFile(V1, 0), will the dirty pages of the second view also be affected?
My goal is to have 2 views of the same file, where the first view is used for very small writes and very frequent flushes, while the second view is used for massive writes and is flushed only once in a while.
It is important that flushing small writes wouldn't cause flushing of massive writes.
Is this a default behaviour? If not, how to achieve it?
Thanks
CreateFileMappingA, see also CreateProcess, DuplicateHandle and OpenFileMapping functions.
"Creating a file mapping object does not actually map the view into a
process address space. The MapViewOfFile and MapViewOfFileEx
functions map a view of a file into a process address space.
With one important exception, file views derived from any file mapping object that is backed by the same file are coherent
or identical at a specific time. Coherency is guaranteed for views
within a process and for views that are mapped by different
processes.
The exception is related to remote files. Although CreateFileMapping works with remote files, it does not keep them
coherent. For example, if two computers both map a file as writable,
and both change the same page, each computer only sees its own writes
to the page. When the data gets updated on the disk, it is not
merged."
According to the FlushViewOfFile,
"Flushing a range of a mapped view initiates writing of dirty
pages within that range to the disk. Dirty pages are those whose
contents have changed since the file view was mapped. The
FlushViewOfFile function does not flush the file metadata, and it does not wait to return until the changes are flushed from the underlying hardware disk cache and physically written to disk. To
flush all the dirty pages plus the metadata for the file and ensure
that they are physically written to disk, call FlushViewOfFile and
then call the FlushFileBuffers function."
So, when you close views in UnmapViewOfFile
"When a process has finished with the file mapping object, it should
destroy all file views in its address space by using the
UnmapViewOfFile function for each file view.
Unmapping a mapped view of a file invalidates the range occupied by the view in the address space of the process and makes the range
available for other allocations. It removes the working set entry
for each unmapped virtual page that was part of the working set of the
process and reduces the working set size of the process. It also
decrements the share count of the corresponding physical page.
Modified pages in the unmapped view are not written to disk until their share count reaches zero, or in other words, until they are
unmapped or trimmed from the working sets of all processes that share
the pages. Even then, the modified pages are written "lazily" to
disk; that is, modifications may be cached in memory and written to
disk at a later time. To minimize the risk of data loss in the event
of a power failure or a system crash, applications should explicitly
flush modified pages using the FlushViewOfFile function."

Why are read operations much faster after writing files using OS and disk buffers?

I am writing about a hundred files with a size of 50MB each sequentially to a directory on my disk using CreateFile() and WriteFile(). In a second steps, the contents of those files are read using CreateFile() and ReadFile().
I noticed some partially weird things:
If I pass FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH when writing the files, reading takes a noticably long time (usually hundreds of milliseconds). However, when I do not pass those flags (but use FlushFileBuffers() instead), writing appears to happen at roughly the same speed but reading those files after writing them is blazingly fast (less than 20 milliseconds per file!).
How is this possible? How do the flags passed when writing 5000MB of data affect reading later? Does the disk cache the whole 5GB in its cache?
When you pass FILE_FLAG_NO_BUFFERING then you are telling the system not to put the data in its disk cache. Then when you read the data, the system has to get the data from the disk.
When you omit FILE_FLAG_NO_BUFFERING, the system can put the data in its disk cache. And so when you read the data subsequently, it can be read directly from memory, which is faster than disk.
From https://support.microsoft.com/en-us/kb/99794:
The FILE_FLAG_WRITE_THROUGH flag for CreateFile() causes any writes made to that handle to be written directly to the file without being buffered. The data is cached (stored in the disk cache); however, it is still written directly to the file. This method allows a read operation on that data to satisfy the read request from cached data (if it's still there), rather than having to do a file read to get the data. The write call doesn't return until the data is written to the file. This applies to remote writes as well--the network redirector passes the FILE_FLAG_WRITE_THROUGH flag to the server so that the server knows not to satisfy the write request until the data is written to the file.
The FILE_FLAG_NO_BUFFERING takes this concept one step further and eliminates all read-ahead file buffering and disk caching as well, so that all reads are guaranteed to come from the file and not from any system buffer or disk cache.
You might find this article from Raymond Chen of interest: We’re currently using FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH, but we would like our WriteFile to go even faster. An excerpt:
A customer said that their program’s I/O pattern is to open a file and
then every so often write about 100KB of data into the file. They are
currently using the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH
flags to open a file, and they wanted to know what else they could do
to make their writes go even faster.
Um, for one thing, you stop passing those two flags!
Those two flags in combination basically mean “Give me the slowest
possible I/O performance!” because they force all I/O to go through to
the physical media right away.

Is the order of cashed writes preserved in Windows 7?

When writing to a file in Windows 7, Windows will cache the writes by default. When it completes the writes, does Windows preserve the order of writes, or can the writes happen out of order?
I have an existing application that writes continuously to a binary file. Every 20 seconds, it writes a block of data, updates the file's Table of Contents, and calls _commit() to flush the data to disk.
I am wondering if it is necessary to call commit, or if we can rely on Windows 7 to get the data to disk properly.
If the computer goes down, I'm not too worried about losing the most recent 20 seconds worth of data, but I am concerned about making the file invalid. If the file's Table of Contents is updated, but the data isn't present, then the file will not be correct. If the data is updated, but the Table of Contents isn't, then there will be extra data at the end of the file, but since it's not referenced by the Table of Contents, it is ignored when reading the file, and we have a correct file.
The writes will not necessarily happen in order. In particular if there are multiple disk I/Os outstanding, the filesystem/disk driver may reorder the I/O operations to reduce head motion. That means that there is no guarantee that data that is written to disk will be written in the order it was written to the file.
Having said that, flushing the file to disk will stall until the I/O is complete - that may mean several dozen milliseconds (or even longer) of inactivity when you application could be doing something more useful.

Custom Prefetch

Any programmatic techniques, portable or specific to NT and Linux that get the result of number of large files loading faster? I am after a 'ahead of time', a prior, whatever you prefer to call it mechanisms that I can control in code for two OS in a question.
Each file has to be processed in full, i.e. completely in size and sequentially for its contents. The aim is to speed up some batch file processing.
I don't know about NT, but one option on Linux would be to use madvise with the MADV_WILLNEED flag shortly before you actually need the next file to start reading it in early.
Alternately, a more portable option would be to simply manually do readahead in a separate thread from your buffer-processing thread - that is, read data in to fill an X MB buffer in thread A, process it as fast as you can in thread B.
I am not aware of a Win32 (NT) API similar to madvise().
However, I would suggest an approach.
First, pass the Win32 flag FILE_FLAG_SEQUENTIAL_SCAN to CreateFile(). This will allow the Windows operating system to perform better buffering of the file once you have opened it.
With FILE_FLAG_SEQUENTIAL_SCAN, your file parser may operate more quickly once the file is in memory. Unlike madvise() on Linux, the file will not begin loading into memory any earlier due to the use of the Win32 flag.
Next, we need to trigger the file to begin loading. Asynchronously read the first page of the file by calling ReadFileEx() with an OVERLAPPED structure and a FileIOCompletionRoutine function.
Your FileIOCompletionRoutine can simply return, or you can set the event in the overlapped structure -- read the MSDN details of ReadFileEx for details.
Since it would not be a critical failure if the pre-fetch hasn't completed when you actually read from the file, the easiest implementation would be to "fire and forget" -- execute the overlapped file read and then never check the result of it. Be sure that you read the data into valid buffers, though!
If you perform this operation for a file while reading the previous file, the result should be that the next file will commence paging in.
Be aware that this may slow your performance. As the next file begins to page in, the disk I/O to access that file will compete with disk I/O for the file you are currently parsing. If the two files are physically distant from each other on the same disk, the result of pre-fetching might be additional delay as the drive head seeks. Although modern drives have huge buffers which mitigate this, queuing the first page of a new file is likely to cause a head seek.
bdonlan's suggestion of a 'pre-fetch' thread which loads the files asynchronously from the processing would be a workable solution for Win32, also.

Resources