Win32: Write to file without buffering? - winapi

I need to create a new file handle so that any write operations to that handle get written to disk immediately.
Extra info: The handle will be the inherited STDOUT of a child process, so I need any output from that process to immediately be written to disk.
Studying the CreateFile documentation, the FILE_FLAG_WRITE_THROUGH flag looked like exactly what I need:
Write operations will not go through
any intermediate cache, they will go
directly to disk.
I wrote a very basic test program and, well, it's not working.
I used the flag on CreateFile then used WriteFile(myHandle,...) in a long loop, writing about 100MB of data in about 15 seconds. (I added some Sleep()'s).
I then set up a professional monitoring environment consisting of continuously hitting 'F5' in explorer. The results: the file stays at 0kB then jumps to 100MB about the time the test program ends.
Next thing I tried was to manually flush the file after each write, with FlushFileBuffers(myHandle). This makes the observed file size grow nice and steady, as expected.
My question is, then, shouldn't the FILE_FLAG_WRITE_THROUGH have done this without manually flushing the file? Am I missing something?
In the 'real world' program, I can't flush the file, 'cause I don't have any control over the child process that's using it.
There's also the FILE_FLAG_NO_BUFFERING flag, that I can't be used for the same reason - no control over the process that's using the handle, so I can't manually align the writes as required by this flag.
EDIT:
I have made a separate project specifically for watching how the size of the file changes. It uses the .NET FileSystemWatcher class. I also write less data - around 100kB in total.
Here's the output. Check out the seconds in the timestamps.
The 'builtin no-buffers' version:
25.11.2008 7:03:22 PM: 10230 bytes added.
25.11.2008 7:03:31 PM: 10240 bytes added.
25.11.2008 7:03:31 PM: 10240 bytes added.
25.11.2008 7:03:31 PM: 10240 bytes added.
25.11.2008 7:03:31 PM: 10200 bytes added.
25.11.2008 7:03:42 PM: 10240 bytes added.
25.11.2008 7:03:42 PM: 10240 bytes added.
25.11.2008 7:03:42 PM: 10240 bytes added.
25.11.2008 7:03:42 PM: 10240 bytes added.
25.11.2008 7:03:42 PM: 10190 bytes added.
... and the 'forced (manual) flush' version (FlushFileBuffers() is called every ~2.5 seconds):
25.11.2008 7:06:10 PM: 10230 bytes added.
25.11.2008 7:06:12 PM: 10230 bytes added.
25.11.2008 7:06:15 PM: 10230 bytes added.
25.11.2008 7:06:17 PM: 10230 bytes added.
25.11.2008 7:06:19 PM: 10230 bytes added.
25.11.2008 7:06:21 PM: 10230 bytes added.
25.11.2008 7:06:23 PM: 10230 bytes added.
25.11.2008 7:06:25 PM: 10230 bytes added.
25.11.2008 7:06:27 PM: 10230 bytes added.
25.11.2008 7:06:29 PM: 10230 bytes added.

I've been bitten by this, too, in the context of crash logging.
FILE_FLAG_WRITE_THROUGH only guarantees that the data you're sending gets sent to the filesystem before WriteFile returns; it doesn't guarantee that it's actually sent to the physical device. So, for example, if you execute a ReadFile after a WriteFile on a handle with this flag, you're guaranteed that the read will return the bytes you wrote, whether it got the data from the filesystem cache or from the underlying device.
If you want to guarantee that the data has been written to the device, then you need FILE_FLAG_NO_BUFFERING, with all the attendant extra work. Those writes have to be aligned, for example, because the buffer is going all the way down to the device driver before returning.
The Knowledge Base has a terse but informative article on the difference.
In your case, if the parent process is going to outlive the child, then you can:
Use the CreatePipe API to create an inheritable, anonymous pipe.
Use CreateFile to create a file with FILE_FLAG_NO_BUFFERING set.
Provide the writable handle of the pipe to the child as its STDOUT.
In the parent process, read from the readable handle of the pipe into aligned buffers, and write them to the file.

This is an old question but I thought I might add a bit to it. Actually everyone here I believe is wrong. When you write to a stream with write-through and unbuffered-io it does write to the disk but it does NOT update the metadata associated with the File System (eg what explorer shows you).
You can find a good reference on this kind of stuff here http://winntfs.com/2012/11/29/windows-write-caching-part-2-an-overview-for-application-developers/
Cheers,
Greg

Perhaps you could be satisfied enough with FlushFileBuffers:
Flushes the buffers of a specified file and causes all buffered data to be written to a file.
Typically the WriteFile and WriteFileEx functions write data to an internal buffer that the operating system writes to a disk or communication pipe on a regular basis. The FlushFileBuffers function writes all the buffered information for a specified file to the device or pipe.
They do warn that calling flush, to flush the buffers a lot, is inefficient - and it's better to just disable caching (i.e. Tim's answer):
Due to disk caching interactions within the system, the FlushFileBuffers function can be inefficient when used after every write to a disk drive device when many writes are being performed separately. If an application is performing multiple writes to disk and also needs to ensure critical data is written to persistent media, the application should use unbuffered I/O instead of frequently calling FlushFileBuffers. To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
If it's not a high-performance situation, and you won't be flushing too frequently, then FlushFileBuffers might be sufficient (and easier).

The size you're looking at in Explorer may not be entirely in-sync with what the file system knows about the file, so this isn't the best way to measure it. It just so happens that FlushFileBuffers will cause the file system to update the information that Explorer is looking at; closing it and reopening may end up doing the same thing as well.
Aside from the disk caching issues mentioned by others, write through is doing what you were hoping it is doing. It's just that doing a 'dir' in the directory may not show up-to-date information.
Answers suggesting that write-through only writes it "to the file system" are not quite right. It does write it into the file system cache, but it also sends the data down to the disk. Write-through might mean that a subsequent read is satisfied from the cache, but it doesn't mean that we skipped a step and aren't writing it to the disk. Read the article's summary very carefully. This is a confusing bit for just about everyone.

Perhaps you wanna consider memory mapping that file. As soon as you write to the memory mapped region, the file gets updated.
Win API File Mapping

Related

Equivalent of Linux sync_file_range system call on windows?

I need to fsync a range of byte I appended to a file without forcing a flush on the metadata (filesize,...).
As you've said it's ring3 and it's C++, here is the answer:
You need to call FlushViewOfFile after mapping the file. According to MSDN:
The FlushViewOfFile function does not flush the file metadata, and it
does not wait to return until the changes are flushed from the
underlying hardware disk cache and physically written to disk.
source: https://msdn.microsoft.com/en-us/library/windows/desktop/aa366563%28v=vs.85%29.aspx
An example code that writes data and uses FlushViewOfFile is here: http://forums.codeguru.com/showthread.php?367742-FlushViewOfFile-does-not-Flush

ext4 commit= mount option and dirty_writeback_centisecs

I'm tring to understand the way bytes go from write() to the phisical disk plate to tune my picture server performance.
Thing I don't understand is what is the difference between these two: commit= mount option and dirty_writeback_centisecs. Looks like they are about the same procces of writing changes to the storage device, but still different.
I do not get it clear which one fires first on the way to the disk for my bytes.
Yeah, I just ran into this investigating mount options for an SDCard Ubuntu install on an ARM Chromebook. Here's what I can tell you...
Here's how to see the dirty and writeback amounts:
user#chrubuntu:~$ cat /proc/meminfo | grep "Dirty" -A1
Dirty: 14232 kB
Writeback: 4608 kB
(edit: This dirty and writeback is rather high, I had a compile running when I ran this.)
So data to be written out is dirty. Dirty data can still be eliminated (if say, a temporary file is created, used, and deleted before it goes to writeback, it'll never have to be written out). As dirty data is moved into writeback, the kernel tries to combine smaller requests that may be into dirty into single larger I/O requests, this is one reason why dirty_expire_centisecs is usually not set too low. Dirty data is usually put into writeback when a) Enough data is cached to get up to vm.dirty_background_ratio. b) As data gets to be vm.dirty_writeback_centisecs centiseconds old (3000 default is 30 seconds) it is put into writeback. vm.dirty_writeback_centisecs, a writeback daemon is run by default every 500 centiseconds (5 seconds) to actually flush out anything in writeback.
fsync will flush out an individual file (force it from dirty into writeback and wait until it's flushed out of writeback), and sync does that with everything. As far as I know, it does this ASAP, bypassing any attempt to try to balance disk reads and writes, it stalls the device doing 100% writes until the sync completes.
commit=5 default ext4 mount option actually forces a sync() every 5 seconds on that filesystem. This is intended to ensure that writes are not unduly delayed if there's heavy read activity (ideally losing a maximum of 5 seconds of data if power is cut or whatever.) What I found with an Ubuntu install on SDCard (in a Chromebook) is that this actually just leads to massive filesystem stalls like every 5 seconds if you're writing much to the card, ChromeOS uses commit=600 and I applied that Ubuntu-side to good effect.
The dirty_writeback_centisecs, configures the daemons of the kernel Linux related to the virtual memory (that's why the vm). Which are in charge of making a write back from the RAM memory to all the storage devices, so if you configure the dirty_writeback_centisecs and you have 25 different storage devices mounted on your system it will have the same amount of time of writeback for all the 25 storage systems.
While the commit is done per storage device (actually is per filesystem) and is related to the sync process instead of the daemons from the virtual memory.
So you can see it as:
dirty_writeback_centisecs
writing from RAM to all filesystems
commit
each filesystem fetches from RAM

Size of exe file vs available memory

I have gone through How does a PE file get mapped into memory?, this is not what i am asking for.
I want to know which sections (data, text, code, ...) of a PE file are always completely loaded into memory by the loader no matter whatever the condition is?
As per my understanding, none of the sections (code,data,resources,text,...) are always loaded completely, they are loaded as and when needed, page by page. If few pages of code (in the middle or at the end), are not required to process user's request then these pages will not always get loaded.
I have tried making exe files with lots of code with/without resources both of which are not used at all, but, every time the exe loads into memory, it takes more memory than the file size. (I might have been looking at the wrong column of Memory in Task Manager)
Matt Pietrek writes here
It's important to note that PE files are not just mapped into memory
as a single memory-mapped file. Instead, the Windows loader looks at
the PE file and decides what portions of the file to map in.
and
A module in memory represents all the code, data, and resources from
an executable file that is needed by a process. Other parts of a PE
file may be read, but not mapped in (for instance, relocations). Some
parts may not be mapped in at all, for example, when debug information
is placed at the end of the file.
In a nutshell,
1- There is an exe of size 1 MB and available memory (physical + virtual) is less than 1 MB, is it consistent that loader will always refuse to load because available memory is less than the size of file?
2- If an exe of size 1 MB takes 2 MB memory when loaded (starts running first line of user code) while available memory (physical + virtual) is 1.5 MB, is it consistent that loader will always refuse to load because there is not enough memory?
3- There is an exe of size 50 MB (lots of code, data and resources) but it requires 500 KB to run the first line of user code, is it consistent that this exe will always run first line of code if available memory (physical + virtual) is 500 KB atleast?

Doing a zero-copy move of data from a Linux kernel buffer to hard disk

am trying to move data from a buffer in kernel space into the hard
disk without having to incur any additional copies from kernel buffer to
user buffers or any other kernel buffers. Any ideas/suggestions would be
most helpful.
The use case is basically a demux driver which collects data into a
demux buffer in kernel space and this buffer has to be emptied
periodically by copying the contents into a FUSE-based partition on the
disk. As the buffer gets full, a user process is signalled which then
determines the sector numbers on the disk the contents need to be copied
to.
I was hoping to mmap the above demux kernel buffer into user address
space and issue a write system call to the raw partition device. But
from what I can see, the this data is being cached by the kernel on its
way to the Hard Disk driver. And so I am assuming that involves
additional copies by the linux kernel.
At this point I am wondering if there is any other mechansim to do this
without involving additional copies by the kernel. I realize this is an
unsual usage scenario for non-embedded environments, but I would
appreciate any feedback on possible options.
BTW - I have tried using O_DIRECT when opening the raw partition, but
the subsequent write call fails if the buffer being passed is the
mmapped buffer.
Thanx!
You need to expose your demux buffer as a file descriptor (presumably, if you're using mmap() then you're already doing this - great!).
On the kernel side, you then need to implement the splice_read member of struct file_operations.
On the userspace side, create a pipe(), then use splice() twice - once to move the data from the demux file descriptor into the pipe, and a second time to move the data from the pipe to the disk file. Use the SPLICE_F_MOVE flag.
As documented in the splice() man page, it will avoid actual copies where it can, by copying references to pages of kernel memory rather than the pages themselves.

Deleting shared memory with ipcrm in Linux

I am working with a shared memory application, and to delete the segments I use the following command:
ipcrm -M 0x0000162e (this is the key)
But I do not know if I'm doing the right things, because when I run ipcs I see the same segment but with the key 0x0000000. So is the memory segment really deleted? When I run my application several times I see different memory segments with the key 0x000000, like this:
key shmid owner perms bytes nattch status
0x00000000 65538 me 666 27 2 dest
0x00000000 98307 me 666 5 2 dest
0x00000000 131076 me 666 5 1 dest
0x00000000 163845 me 666 5 0
What is actually happening? Is the memory segment really deleted?
Edit: The problem was - as said below in the accepted answer - that there were two processes using the shared memory, until all the process were closed, the memory segment is not going to disappear.
I vaguely remember from my UNIX (AIX and HPUX, I'll admit I've never used shared memory in Linux) days that deletion simply marks the block as no longer attachable by new clients.
It will only be physically deleted at some point after there are no more processes attached to it.
This is the same as with regular files that are deleted, their directory information is removed but the contents of the file only disappear after the last process closes it. This sometimes leads to log files that take up more and more space on the file system even after they're deleted as processes are still writing to them, a consequence of the "detachment" between a file pointer (the zero or more directory entries pointing to an inode) and the file content (the inode itself).
You can see from your ipcs output that 3 of the 4 still have attached processes so they won't be going anywhere until those processes detach from the shared memory blocks. The other's probably waiting for some 'sweep' function to clean it up but that would, of course, depend on the shared memory implementation.
A well-written client of shared memory (or log files for that matter) should periodically re-attach (or roll over) to ensure this situation is transient and doesn't affect the operation of the software.
You said that you used the following command
ipcrm -M 0x0000162e (this is the key)
From the man page for ipcrm
-M shmkey
Mark the shared memory segment associated with key shmkey for
removal. This marked segment will be destroyed after the
last detach.
So the behaviour of -M options does exactly what you observed, ie set the segment to be destroyed only after the last detach.

Resources