Windows (ReFS,NTFS) file preallocation hint - windows

Assume I have multiple processes writing large files (20gb+). Each process is writing its own file and assume that the process writes x mb at a time, then does some processing and writes x mb again, etc..
What happens is that this write pattern causes the files to be heavily fragmented, since the files blocks get allocated consecutively on the disk.
Of course it is easy to workaround this issue by using SetEndOfFile to "preallocate" the file when it is opened and then set the correct size before it is closed. But now an application accessing these files remotely, which is able to parse these in-progress files, obviously sees zeroes at the end of the file and takes much longer to parse the file.
I do not have control over the this reading application so I can't optimize it to take zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run Systernal's contig utility or even implement a custom "defragmenter" which would process my files and consolidate their blocks together.
Another more drastic solution would be to implement a minifilter driver which would report a "fake" filesize.
But obviously both solutions listed above are far from optimal. So I would like to know if there is a way to provide a file size hint to the filesystem so it "reserves" the consecutive space on the drive, but still report the right filesize to applications?
Otherwise obviously also writing larger chunks at a time obviously helps with fragmentation, but still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf='A';
DWORD written=0;
DWORD tstart;
std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);
printf("file extended, elapsed: %d\n",GetTickCount()-tstart);
getchar();
printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END);
WriteFile(f, &buf,1,&written,NULL);
printf("written: %d bytes, elapsed: %d\n",written,GetTickCount()-tstart);
When the application is executed and it waits for a keypress after SetEndOfFile I examined the on disc NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However the unnamed DATA attribute has StreamDataSize specified as 0.
Systernals DiskView also confirms that clusters were allocated
When pressing enter to allow the test to continue (and waiting for quite some time since the file was created on slow USB stick), the StreamDataSize field was updated
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much that answers/comments also provide an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also for sake of completeness here is an example how the DATA attribute looks like when setting the FileAllocationInfo (note that the I created a new file for this picture)

Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.

Related

Using SetFilePointer to change the location to write in the sector doesn't work?

I'm using SetFilePointer to rewrite the second half of the MBR with something, its a user-mode application and i opened a handle to PhysicalDrive
At first i tried to set the size parameter in WriteFile to 256 but the writefile gave the INVALID_PARAMETER error, as it turns out based on some search on other questions here it seems like this is because we are forced to write in multiplicand of the sector size when the handle is PhysicalDrive for some reason
then i tried to set the filePointer to 256, and Write 512 bytes, both of them return no error, but for some unknown reason it writes from the beginning of the sector! as if the SetFilePointer didn't work even tho the return value of SetFilePointer is OK and it returns 256
So my questions is :
Why the write size have to be multiplicand of sector size when the handle is PhysicalDrive? which other device handles are like this?
Why is this happening and when I set the file pointer to 256, WriteFile still writes from the start?
isn't this really redundant, considering that even if I want to change 1 byte then I have to read the entire sector, change the one byte and then write it back, instead of just writing 1 byte, it seems like 10 times more overhead! isn't there a faster way to write a few bytes in a sector?
I think you are mixing the file system and the storage (block device). File system stays above storage device stack. If your code obtains a handle to a file system device, you can write byte by byte. But if you are accessing storage device stack, you can only write sector by sector (or block size).
Directly writing to block device is definitely slow as you discovered. However, in most cases, people just talk to file systems. Most file system drivers maintain cache and use algorithms for both read and write to improve performance.
Can't comment on file pointer based offset before seeing the actual code. But I guess it might be not sector aligned or it's not used at all.

Writing last N bytes to file opened with FILE_FLAG_NO_BUFFERING

When writing lots of sequential data to disk I found that having an internal 4MB buffer and when opening the file for writing I specify [FILE_FLAG_NO_BUFFERING][1], so that my internal buffer is used.
But that also creates a requirement to write in full sector blocks (512 bytes on my machine).
How do I write the last N<512 bytes to disk?
Is there some flag to WriteFile to allow this?
Do I pad them with extra NUL characters and then truncate the file size down to the correct value?
(With SetFileValidData or similar?)
For those wondering the reason for trying this approach. Our application logs a lot. To handle this a dedicated log-thread exists, which formats and writes logs to disk. Also if we log with fullest detail we might log more per second than the disk-system can handle. (Usually noticed for customers with SAN systems that are not well tweaked.)
So, the goal is log write as much as possible, but also notice when we start to overload the system, and then hold back a bit, like reducing the details of the logs.
Hence the idea to have a fill a big memory-block and give that to the OS, hoping to reduce the overheads.
As the comments suggest, doing file writing this way is probably not the best solution for real world situations. But if writing with FILE_FLAG_NO_BUFFERING is used,
SetFileInformationByHandle is the way to mark the file shorter than whole blocks.
int data_len = len(str);
int len_last_block = BLOCKSIZE%datalen;
int padding_to_fill_block = (data_last_block == BLOCKSIZE ? 0 : (BLOCKSIZE-len_last_block);
str.append('\0', padding_to_fill_block);
ULONG bytes_written = 0;
::WriteFile(hFile, data, data_len+padding_to_fill_block, &bytes_written, NULL));
m_filesize += bytes_written;;
LARGE_INTEGER end_of_file_pos;
end_of_file_pos.QuadPart = m_filesize - padding_to_fill_block;
if (!::SetFileInformationByHandle(hFile, FileEndOfFileInfo, &end_of_file_pos, sizeof(end_of_file_pos)))
{
HRESULT hr = ::GetLastErrorMessage();
}

accessing process memory parts

I'm currently studying memory management of OS by the video lecture. The instructor says,
In fact, you may have, and it is quite often the case that there may
be several parts of the process memory, which are not even accessed at
all. That is, they are neither executed, loaded or stored from memory.
I don't understand the saying since even if in a simple C program, we access whole address space of it. Don't we?
#include <stdio.h>
int main()
{
printf("Hello, World!");
return 0;
}
Could you elucidate the saying? If possible could you provide an example program wherein "several parts of the process memory, which are not even accessed at all" when it is run.
Imagine you have a large and complicated utility (e.g. a compiler), and the user asks it for help (e.g. they type gcc --help instead of asking it to compile anything). In this case, how much of the utility's code and data is used?
Most programs have various optional parts that aren't used (e.g. maybe something that works with graphics will have some code for 16 bits per pixel and other code for 32 bits per pixel, and will determine which code to use and not use the other code). Most heap allocators are "eager" (e.g. they'll ask the OS for 20 MiB of space and then might only "malloc() 2 MiB of it). Sometimes a program will memory map a huge file but then only access a small part of it.
Even for your trivial "hello world" example code; the virtual address space probably contains a huge (several MiB) shared library to support lots of C standard library functions (e.g. puts(), fprintf(), sprintf(), ...) and your program only uses a small part of that shared library; and your program probably reserves a conservative amount of space for its stack (e.g. maybe 20 KiB of space for its stack) and then probably only uses a few hundred bytes of stack.
In a virtual memory system, the address space of the process is created in secondary store at start up. Little or nothing gets placed in memory. For example, the operating system may use the executable file as the page file for the code and static data. It just sets up an internal structure that says some range of memory is mapped to these blocks in the executable file. The same goes for shared libraries. The other data gets mapped to the page file.
As your program runs it starts page faulting rapidly because nothing is in memory and the operating system has to load it from secondary storage.
If there is something that your program does not reference, it never gets loaded into memory.
If you had global variable declared like
char somedata [1045] ;
and your program never references that variable, it will never get loaded into memory. The same goes for code. If you have pages of code that done get execute (e.g. error handling code) it does not get loaded. If you link to shared libraries, you will likely bece including a lot of functions that you never use. Likewise, they will not get loaded if you do not execute them.
To begin with, not all of the address space is backed by physical memory at all times, especially if your address space covers 248+ bytes, which your computer doesn't have (which is not to say you can't map most of the address space to a single physical page of memory, which would be of very little utility for anything).
And then some portions of the address space may be purposefully permanently inaccessible, like a few pages near virtual address 0 (to catch NULL pointer dereferences).
And as it's been pointed out in the other answers, with on-demand loading of programs, you may have some portions of the address space reserved for your program but if the program doesn't happen to need any of its code or data there, nothing needs to be loader there either.

Writing small amount of data to large number of files on GlusterFS 3.7

I'm experimenting with 2 Gluster 3.7 servers in 1x2 configuration. Servers are connected over 1 Gbit network. I'm using Debian Jessie.
My use case is as follows: open file -> append 64 bytes -> close file and do this in a loop for about 5000 different files. Execution time for such loop is roughly 10 seconds if I access files through mounted glusterfs drive. If I use libgfsapi directly, execution time is about 5 seconds (2 times faster).
However, the same loop executes in 50ms on plain ext4 disk.
There is huge performance difference between Gluster 3.7 end earlier versions which is, I believe, due to the cluster.eager-lock setting.
My target is to execute the loop in less than 1 second.
I've tried to experiment with lots of Gluster settings but without success. dd tests with various bsize values behave like that TCP no-delay option is not set, although from Gluster source code it seems that no-delay is default.
Any idea how to improve the performance?
Edit:
I've found a solution that works in my case so I'd like to share it in case anyone else faces the same issue.
The root cause of the problem is the number of roundtrips between client and Gluster server during execution of open/write/close sequence. I don't know exactly what is happening behind but timing measurements shows exactly that pattern. Now, the obvious idea would be to "pack" open/write/close sequence into a single write function. Roughly, the C prototype of such function would be:
int write(const char* fname, const void *buf, size_t nbyte, off_t offset)
But, there is already such API function glfs_h_anonymous_write in libgfapi (thanks goes to Suomya from Gluster mailing group). Kind of hidden thing there is the file identifier which is not plain file name, but something of type struct glfs_object. Clients obtain an instance of such object through API calls glfs_h_lookupat/glfs_h_creat. The point here is that glfs_object representing filename is "stateless" in a sense that corresponding inode is left intact (not ref counted). One should think of glfs_object as plain filename identifier and use it as you would use filename (actually, glfs_object stores plain pointer to corresponding inode without ref counting it).
Finally, we should use glfs_h_lookupat/glfs_h_creat once and write many times to the file using glfs_h_anonymous_write.
That way I was able to append 64 bytes to 5000 files in 0.5 seconds, which is 20 times faster than using mounted volume and open//write/close sequence.

Why does the memory mapped file ever need to be flushed when access is RDWR?

I was reading through one of golang's implementation of memory mapped files, https://github.com/edsrzf/mmap-go/. First he describes the several access modes:
// RDONLY maps the memory read-only.
// Attempts to write to the MMap object will result in undefined behavior.
RDONLY = 0
// RDWR maps the memory as read-write. Writes to the MMap object will update the
// underlying file.
RDWR = 1 << iota
// COPY maps the memory as copy-on-write. Writes to the MMap object will affect
// memory, but the underlying file will remain unchanged.
COPY
But in gommap test file I see this:
func TestReadWrite(t *testing.T) {
mmap, err := Map(f, RDWR, 0)
... omitted for brevity...
mmap[9] = 'X'
mmap.Flush()
So why does he need to call Flush to make sure the contents are written to the file if the access mode is RDWR?
Or is the OS managing this so it only writes when it thinks it should?
If the last option, could you please explain it in a little more detail - what i read is that when the OS is low in memory it writes to the file and frees up memory. Is this correct and does it apply only to RDWR or only to COPY?
Thanks
The program maps a region of memory using mmap. It then modifies the mapped region. The system isn't required to write those modifications back to the underlying file immediately, so a read call on that file (in ioutil.ReadAll) could return the prior contents of the file.
The system will write the changes to the file at some point after you make the changes.
It is allowed to write the changes to the file any time after the changes are made, but by default makes no guarantees about when it writes those changes. All you know is that (unless the system crashes), the changes will be written at some point in the future.
If you need to guarantee that the changes have been written to the file at some point in time, then you must call msync.
The mmap.Flush function calls msync with the MS_SYNC flag. When that system call returns, the system has written the modifications to the underlying file, so that any subsequent call to read will read the modified file.
The COPY option sets the mapping to MAP_PRIVATE, so your changes will never be written back to the file, even if you using msync (via the Flush function).
Read the POSIX documentation about mmap and msync for full details.

Resources