What determines bv_len inside BIO structure (for I/O request)? - linux-kernel

I built a RAM-based virtual block device driver with the blk-mq API that uses none as its I/O scheduler. I am running fio to perform random reads/writes on the device and noticed that bv_len in each bio request is always 1024 bytes. I am not aware of any place in the code that sets this value explicitly. The file system is ext4.
Is this a default config or something I could change in code?

I am not aware of any place in the code that sets this [bv_len] value explicitly.
In a 5.7 kernel, isn't it set explicitly in __bio_add_pc_page() and __bio_add_page() (both in block/bio.c)? You'll have to trace back through the callers to see how the passed len was set, though.
(I found this by searching for the bv_len identifier in LXR and then going through results)
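For reference, the assignment in __bio_add_page() looks roughly like this (paraphrased from a 5.x block/bio.c, with the surrounding sanity checks omitted; not a verbatim copy):

void __bio_add_page(struct bio *bio, struct page *page,
                    unsigned int len, unsigned int off)
{
        struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt];

        /* bv_len is simply whatever length the caller passed in */
        bv->bv_page = page;
        bv->bv_offset = off;
        bv->bv_len = len;

        bio->bi_iter.bi_size += len;
        bio->bi_vcnt++;
}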
However, @stark's comment about tune2fs is the key to any answer. You never told us the filesystem block size, and if your block device is "small", your filesystem is likely also small; by default the choice of block size depends on that. If you read the mke2fs man page, you will see it says the following:
-b block-size
Specify the size of blocks in bytes. Valid block-size values are 1024, 2048 and 4096 bytes per block. If omitted, block-size is heuristically determined by the filesystem size and the expected usage of the filesystem (see the -T option).
[...]
-T usage-type[,...]
[...]
If this option is not specified, mke2fs will pick a single default usage type based on the size of the filesystem to be created. If the filesystem size is less than or equal to 3 megabytes, mke2fs will use the filesystem type floppy. [...]
And if you look in the default mke2fs.conf, the blocksize for a floppy is 1024.
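To confirm what you actually got, a minimal check from user space (assuming the filesystem is mounted at /mnt/mydev, a hypothetical path):

#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
        struct statvfs vfs;

        if (statvfs("/mnt/mydev", &vfs) != 0) {
                perror("statvfs");
                return 1;
        }
        /* For ext4 this reports the block size chosen at mkfs time */
        printf("block size: %lu\n", (unsigned long)vfs.f_bsize);
        return 0;
}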

Related

Using SetFilePointer to change the location to write in the sector doesn't work?

I'm using SetFilePointer to rewrite the second half of the MBR. It's a user-mode application and I opened a handle to PhysicalDrive.
At first I tried to set the size parameter in WriteFile to 256, but WriteFile failed with INVALID_PARAMETER. Based on some searching through other questions here, it seems this is because we are forced to write in multiples of the sector size when the handle is to PhysicalDrive.
Then I tried to set the file pointer to 256 and write 512 bytes. Both calls return no error, but for some unknown reason it writes from the beginning of the sector, as if the SetFilePointer didn't work, even though SetFilePointer returns OK and reports 256.
So my questions are:
Why does the write size have to be a multiple of the sector size when the handle is to PhysicalDrive? Which other device handles behave like this?
Why, when I set the file pointer to 256, does WriteFile still write from the start of the sector?
Isn't this really redundant? Even if I want to change 1 byte, I have to read the entire sector, change the one byte and write it back, instead of just writing 1 byte; it seems like 10 times more overhead! Isn't there a faster way to write a few bytes within a sector?
I think you are mixing up the file system and the storage (block device). The file system sits above the storage device stack. If your code obtains a handle to a file system device, you can write byte by byte. But if you are accessing the storage device stack, you can only write sector by sector (or block by block).
Directly writing to the block device is definitely slow, as you discovered. However, in most cases people just talk to file systems. Most file system drivers maintain a cache and use read and write algorithms to improve performance.
I can't comment on the file-pointer-based offset before seeing the actual code, but I would guess it is either not sector aligned or not used at all.
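To illustrate the read-modify-write described above, here is a hedged sketch (the drive number and byte offset are placeholders; writing to a raw disk is destructive, requires administrator rights, and on modern Windows writes into regions of a mounted volume may be blocked unless the volume is locked):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    const DWORD sectorSize = 512;  /* query IOCTL_DISK_GET_DRIVE_GEOMETRY for the real value */
    BYTE sector[512];
    DWORD transferred = 0;
    LARGE_INTEGER pos;

    HANDLE h = CreateFileA("\\\\.\\PhysicalDrive1",
                           GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("open failed: %lu\n", GetLastError());
        return 1;
    }

    /* Reads and writes must start on a sector boundary and span whole sectors */
    pos.QuadPart = 0;
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);
    ReadFile(h, sector, sectorSize, &transferred, NULL);

    sector[256] = 0xAA;  /* modify the one byte we actually care about */

    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);  /* seek back to the sector start */
    WriteFile(h, sector, sectorSize, &transferred, NULL);

    CloseHandle(h);
    return 0;
}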

Windows (ReFS,NTFS) file preallocation hint

Assume I have multiple processes writing large files (20 GB+). Each process is writing its own file, and assume that the process writes x MB at a time, then does some processing and writes x MB again, and so on.
What happens is that this write pattern causes the files to become heavily fragmented, since the chunks from the different processes get allocated consecutively on the disk and the files end up interleaved.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then setting the correct size before it is closed. But an application that accesses these files remotely and is able to parse the in-progress files then sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the trailing zeros into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" that processes my files and consolidates their blocks.
Another, more drastic, solution would be to implement a minifilter driver that reports a "fake" file size.
But obviously both solutions listed above are far from optimal. So I would like to know whether there is a way to give the filesystem a file size hint so that it "reserves" consecutive space on the drive but still reports the right file size to applications.
Otherwise, writing larger chunks at a time obviously helps with fragmentation, but it still does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed, I made a small test:
#include <windows.h>
#include <cstdio>
#include <iostream>

int main()
{
    LARGE_INTEGER size;
    LARGE_INTEGER a;
    char buf = 'A';
    DWORD written = 0;
    DWORD tstart;

    std::cout << "creating file\n";
    tstart = GetTickCount();
    HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
    size.QuadPart = 100000000LL;
    SetFilePointerEx(f, size, &a, FILE_BEGIN); // move the pointer ~100 MB in
    SetEndOfFile(f);                           // extend EOF (and allocation) to that point
    printf("file extended, elapsed: %d\n", GetTickCount() - tstart);
    getchar();

    printf("writing 'A' at the end\n");
    tstart = GetTickCount();
    SetFilePointer(f, -1, NULL, FILE_END);     // seek to the last byte of the file
    WriteFile(f, &buf, 1, &written, NULL);     // forces NTFS to zero-fill everything before it
    printf("written: %d bytes, elapsed: %d\n", written, GetTickCount() - tstart);

    CloseHandle(f);
    return 0;
}
When the application was executed and waiting for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated.
When pressing Enter to allow the test to continue (and after waiting quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated.
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much if answers/comments also provided an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture):
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
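A minimal sketch of that call (the path is the one from the question's test program; error handling is trimmed):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE f = CreateFileA("e:\\test.dat", GENERIC_READ | GENERIC_WRITE,
                           FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
    if (f == INVALID_HANDLE_VALUE) {
        printf("open failed: %lu\n", GetLastError());
        return 1;
    }

    // Reserve ~100 MB of clusters without moving EndOfFile,
    // so other readers still see the true file size (0 here).
    FILE_ALLOCATION_INFO alloc;
    alloc.AllocationSize.QuadPart = 100000000LL;
    if (!SetFileInformationByHandle(f, FileAllocationInfo, &alloc, sizeof(alloc)))
        printf("SetFileInformationByHandle failed: %lu\n", GetLastError());

    // On close, the allocation is truncated back to the current end of file.
    CloseHandle(f);
    return 0;
}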
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.
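For completeness, a hedged sketch of the valid-data-length approach (it assumes the file was already extended via SetEndOfFile and that the account holds SeManageVolumePrivilege; remember it exposes whatever stale data is in those clusters):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // Enable SeManageVolumePrivilege on the process token first.
    HANDLE tok;
    TOKEN_PRIVILEGES tp = {0};
    OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &tok);
    LookupPrivilegeValueA(NULL, "SeManageVolumePrivilege", &tp.Privileges[0].Luid);
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    AdjustTokenPrivileges(tok, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(tok);

    HANDLE f = CreateFileA("e:\\test.dat", GENERIC_READ | GENERIC_WRITE, 0,
                           NULL, OPEN_EXISTING, 0, NULL);
    // Skip the lazy zero-fill up to this offset; must not exceed EndOfFile.
    if (!SetFileValidData(f, 100000000LL))
        printf("SetFileValidData failed: %lu\n", GetLastError());
    CloseHandle(f);
    return 0;
}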

SCSI Write Buffer command "Download microcode with offset and save" vs "Download microcode with save" mode

I want to use the WRITE BUFFER SCSI command to upload firmware to a tape drive (LTO-6).
As described in the IBM LTO SCSI Reference, section 5.2.41.6 ("MODE[07h] – Download microcode with offsets, save, and activate"), microcode is transferred to the device using one or more WRITE BUFFER commands and saved to nonvolatile storage (page 180).
According to the CDB (page 132), the Buffer Offset is expressed in 3 bytes, as is the Parameter List Length.
As I understand it, you may want to use more than one WRITE BUFFER command when the firmware size can't be expressed in 3 bytes (more than about 16 MB), and in that case you use the offset field for the later commands. But since the offset itself is also limited to 3 bytes, one can't write at an offset of, say, 17 MB (and therefore can't use this command more than twice in a row).
Does anybody know if this is the real use of "offset and save" mode?
You can use mode 07h (section 5.2.17.4), where the Write Buffer offset is shifted, and thus you can express offsets larger than 16 MB.
It looks like one can't upload more than 32 MB to the firmware buffer, and what was meant by two or more WRITE BUFFER commands is to issue them with a smaller value than the maximum (16 MB) if you have a DMA (Direct Memory Access) limitation.
One can use the interpretation mentioned by Baruch Even with the READ BUFFER command in mode 07h (it's not supported by all Buffer IDs; one can check by issuing READ BUFFER with mode 07h, which returns an illegal request if it's not supported).
The WRITE BUFFER command's section, on the other hand, shows no such interpretation for any of the modes.
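For what it's worth, here is a hedged sketch of issuing one such WRITE BUFFER chunk through the Linux SG_IO interface (the device path, buffer ID and timeout are placeholders; the CDB follows the 10-byte WRITE BUFFER format with 3-byte offset and length fields):

#include <fcntl.h>
#include <scsi/sg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Send one WRITE BUFFER (opcode 3Bh), mode 07h "download microcode
 * with offsets and save", for a single chunk of the firmware image.
 * Usage: int fd = open("/dev/sg0", O_RDWR); then call once per chunk. */
static int write_buffer_chunk(int fd, uint8_t *data, uint32_t offset, uint32_t len)
{
    uint8_t cdb[10] = {0};
    uint8_t sense[32] = {0};
    struct sg_io_hdr io;

    cdb[0] = 0x3B;                  /* WRITE BUFFER */
    cdb[1] = 0x07;                  /* mode: download microcode with offsets and save */
    cdb[2] = 0x00;                  /* buffer ID */
    cdb[3] = (offset >> 16) & 0xFF; /* 3-byte buffer offset, big-endian */
    cdb[4] = (offset >> 8) & 0xFF;
    cdb[5] = offset & 0xFF;
    cdb[6] = (len >> 16) & 0xFF;    /* 3-byte parameter list length */
    cdb[7] = (len >> 8) & 0xFF;
    cdb[8] = len & 0xFF;

    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_TO_DEV;
    io.dxferp = data;
    io.dxfer_len = len;
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 60000;             /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0)
        return -1;
    if ((io.info & SG_INFO_OK_MASK) != SG_INFO_OK)
        return -1;                  /* check the sense buffer for details */
    return 0;
}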

File System Block Size while creating the File System using mkfs

I am trying to use BUSE (with NBD) to create a block device in user space, and I do not clearly understand the block access patterns when creating a file system. As shown in the example, when I mount the nbd device and create an ext4 file system with a block size of 4096, I see that the reads and writes are in multiples of 1024, not 4096.
However, once the file system is created, when I mount the device and try to read/write files, the requests are sent in multiples of 4096.
So it looks like, while creating the file system using mkfs.ext4, the block device is accessed with a block size of 1024, and only after the file system is created is the user-specified block size used. Am I correct in making this inference? If so, can someone explain what happens at the backend and why 1024 is chosen initially?
Thanks and Regards,
Sharath

ext4 pointers from in-memory inode

I'm trying to retrieve, in a kernel module, the direct/indirect/etc. block addresses in an ext4 file system inode. I understand that I need to look into the ext4_inode_info struct (I do this via container_of using the relevant vfs_inode).
But which field am I supposed to look at?
Where can I find, for example, the first direct pointer? I thought it was stored in the i_data array (it is in ext3_inode_info).
But for an ext4 inode, when I examine the first entry in i_data, I get a sector address that is not remotely similar to the real sector holding the first data block.
Any help will be appreciated.
==EDIT==
OK, so I seem to have understood the basic problem: I have an extent-based ext4 file system. I wasn't aware of this change, or that it is enabled by default. So, is there a simple way to extract the physical addresses of blocks by offset? As verification, I'm trying again to look at the first physical block (logical 0) by examining the first extent, but I get some gibberish numbers (though they are consistent and unique for every inode/file, so some progress was made).
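In case it helps, a hedged sketch of reading the first extent when the tree root lives inline in i_data (this assumes a depth-0 extent tree and borrows struct definitions from fs/ext4/ext4_extents.h, which is not exported to modules, so the layout may need to be copied locally):

#include <linux/fs.h>
#include "ext4.h"          /* struct ext4_inode_info; ext4 also offers EXT4_I() */
#include "ext4_extents.h"  /* struct ext4_extent_header, struct ext4_extent */

/* Return the physical block number of the first extent, or 0 if the
 * inline tree is not a leaf (a real walk must descend the index nodes). */
static u64 first_extent_pblock(struct inode *inode)
{
        struct ext4_inode_info *ei =
                container_of(inode, struct ext4_inode_info, vfs_inode);
        struct ext4_extent_header *eh =
                (struct ext4_extent_header *)ei->i_data;
        struct ext4_extent *ex;

        if (le16_to_cpu(eh->eh_magic) != 0xF30A ||  /* EXT4_EXT_MAGIC */
            le16_to_cpu(eh->eh_depth) != 0 ||
            le16_to_cpu(eh->eh_entries) == 0)
                return 0;

        ex = (struct ext4_extent *)(eh + 1);  /* first extent follows the header */

        /* 48-bit physical *filesystem block* number; multiply by
         * (block size / 512) to get a device sector number. */
        return ((u64)le16_to_cpu(ex->ee_start_hi) << 32) |
               le32_to_cpu(ex->ee_start_lo);
}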
