Can the file size in a file's metadata be changed? (Windows)

I recently learned about metadata and how it is information about the data itself.
Since file size is among those properties, would it be possible to change the file size to something absurd and unreasonable, like 1,000 petabytes? If so, what would the effects be on a computer, specifically a Windows 11 computer?

Not all metadata is directly editable. Some of it is simply a property of the file. You can't just set the file size to an arbitrary number; you have to actually edit the file itself. So to create a 1,000-petabyte file, you need to have that much disk space first.
Another example would be the file type. You can't change a JPEG into a PNG by setting filetype=png. You have to process and convert the file, giving you an entirely new and different file with its own set of metadata/properties.
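To make that concrete, here is a minimal Python sketch. The file names, the fake payload, and the `sniff_type` helper are made up for illustration; only the JPEG and PNG signature bytes are real. It shows that renaming a file changes neither its content-derived type nor its size, and that the reported size only moves when the content does.

```python
import os
import tempfile

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # PNG signature
JPEG_MAGIC = b"\xff\xd8\xff"       # start of a JPEG stream


def sniff_type(path):
    """Guess the image type from the leading bytes, ignoring the file name."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


with tempfile.TemporaryDirectory() as tmp:
    jpg_path = os.path.join(tmp, "photo.jpg")
    with open(jpg_path, "wb") as f:
        # Hypothetical payload: a JPEG signature followed by 100 filler bytes.
        f.write(JPEG_MAGIC + b"\x00" * 100)
    size_before = os.path.getsize(jpg_path)

    # "Changing the type" by renaming only changes the name.
    png_path = os.path.join(tmp, "photo.png")
    os.rename(jpg_path, png_path)
    type_after_rename = sniff_type(png_path)
    size_after_rename = os.path.getsize(png_path)

    # The size changes only because the content changes.
    with open(png_path, "ab") as f:
        f.write(b"\x00" * 50)
    size_after_append = os.path.getsize(png_path)
```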

Related

How to create a partially modifiable binary file format?

I'm creating my own custom binary file format.
I use the RIFF standard for encoding data, and it seems to work pretty well.
But there are some additional requirements:
Binary files can be large, up to 500 MB.
Data must be saved to the binary file in real time, at intervals, whenever data in the application has changed.
The application may run in the browser.
The problem I face is that when I want to save data, it needs to read everything from memory and rewrite the whole binary file.
This isn't a problem when the data is small. But as it gets larger, the real-time saving feature does not scale.
So the main requirements for this binary file are:
Able to partially read the binary file (because the file is huge).
Able to partially write changed data into the file without rewriting the whole file.
A streaming protocol like .m3u8 is not an option. We can't split the file into chunks and point to them using separate URLs.
Any guidance on how to design a binary file format that scales in this scenario?
There was an answer here from another user that has since been deleted.
It seemed great to me.
If it was yours, you can post it again and I'll delete this one.
They said:
If we design the file to support appending, then we are able to add whatever data we want without needing to rewrite the whole file.
This idea gives me a very good starting point.
I can append more and more changes at the end of the file,
then mark the old chunks of data in the middle of the file as obsolete.
I can reuse these obsolete data slots later if I want to.
The downside is that I need to clean up the obsolete slots whenever I have a chance to rewrite the whole file.
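A rough Python sketch of that append-and-obsolete idea follows. The 9-byte record header layout, the `AppendOnlyStore` class, and its method names are all assumptions made for illustration, not part of RIFF or of the deleted answer. Each save appends one record and flips a single flag byte in the old record, so nothing else in the file is rewritten.

```python
import io
import struct

# Hypothetical record header: key id, payload length, live flag (1 = live, 0 = obsolete).
HEADER = struct.Struct(">IIB")


class AppendOnlyStore:
    def __init__(self, fileobj):
        self.f = fileobj
        self.index = {}  # key -> offset of the newest record for that key

    def put(self, key, payload):
        old = self.index.get(key)
        if old is not None:
            # Mark the previous record obsolete by overwriting its one flag byte.
            self.f.seek(old + HEADER.size - 1)
            self.f.write(b"\x00")
        self.f.seek(0, io.SEEK_END)
        offset = self.f.tell()
        self.f.write(HEADER.pack(key, len(payload), 1))
        self.f.write(payload)
        self.index[key] = offset
        return offset

    def get(self, key):
        # Partial read: seek straight to the record, read only its bytes.
        self.f.seek(self.index[key])
        _, length, _ = HEADER.unpack(self.f.read(HEADER.size))
        return self.f.read(length)

    def obsolete_bytes(self):
        """Walk the records and total the space a later compaction could reclaim."""
        self.f.seek(0, io.SEEK_END)
        end = self.f.tell()
        pos, total = 0, 0
        while pos < end:
            self.f.seek(pos)
            _, length, live = HEADER.unpack(self.f.read(HEADER.size))
            record_size = HEADER.size + length
            if not live:
                total += record_size
            pos += record_size
        return total


store = AppendOnlyStore(io.BytesIO())  # an in-memory stand-in for the real file
store.put(1, b"first version")
store.put(2, b"other chunk")
store.put(1, b"second version, longer")
```

The in-memory index would have to be rebuilt by scanning the file on open (or persisted somewhere), and `obsolete_bytes()` gives a cheap signal for when a full rewrite is worth doing.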

How do I create a CFSTR_FILEDESCRIPTOR of unknown size?

I have an email client that allows the user to export a folder of email as an MBOX file. The UX is that they drag the folder from inside the application to an Explorer folder (e.g. the Desktop) and a file copy commences via the application adding CFSTR_FILEDESCRIPTOR and CFSTR_FILECONTENTS to the data object being dropped. The issue arises when working out how to specify the size of the "file", because internally I store the email in a database and it takes quite a while to fully encode the output MBOX, especially if the folder has many emails. Until that encoding is complete I don't have an exact size, just an estimate.
Currently I return an IStream pointer to Windows and over-specify the size in the file descriptor (estimate * 3 or something). Then, when I hit the end of my data, I return an IStream::Read length less than the input buffer size, which causes Windows to give up on the copy. In Windows 7 it leaves the "partial" file there in the destination folder, which is perfect, but in XP it fails the copy completely, leaving nothing in the destination folder. Other versions may exhibit different behaviour.
Is there a way of dropping a file of unknown size onto explorer that has to be generated by the source application?
Alternatively can I just get the destination folder path and do all the copy progress + output internally to my application? This would be great, I have all the code to do it already. Problem is I'm not the process accepting the drop.
Bonus round: This also needs to work on Linux/GTK and Mac/Carbon so any pointers there would be helpful too.
Windows Explorer uses three methods to detect the size of a stream (in order of priority):
The nFileSizeHigh/nFileSizeLow fields of the FILEDESCRIPTOR structure, if the FD_FILESIZE flag is present.
Calling IStream.Seek(0, STREAM_SEEK_END, FileSize).
Calling IStream.Stat. The cbSize field of the STATSTG structure is used as the MAX file size only.
To pass Explorer a file of unknown size, you need to:
Remove the FD_FILESIZE flag from the FILEDESCRIPTOR structure.
IStream.Seek must not be implemented (must return E_NOTIMPL).
IStream.Stat must set cbSize field to -1 (0xFFFFFFFFFFFFFFFF).
Is there a way of dropping a file of unknown size onto explorer that has to be generated by the source application?
When providing CFSTR_FILEDESCRIPTOR, you don't have to provide a file size at all if you don't know it ahead of time. The FD_FILESIZE flag in the FILEDESCRIPTOR::dwFlags field is optional. Provide an exact size only if you know it, otherwise don't provide a size at all, not even an estimate. The copy will still proceed, but the target won't know the final size until IStream::Read() returns S_FALSE to indicate the end of the stream has been reached.
Alternatively can I just get the destination folder path and do all the copy progress + output internally to my application?
A drop operation does not provide any information about the target at all. And for good reason - the source doesn't need to know. A drop target does not need to know where the source data is coming from, only how to access it. The drag source does not need to know how the drop target will use the data, only how to provide the data to it.
Think of what happens if you drop a file (virtual or otherwise) onto an Explorer folder that is implemented as a Shell Namespace Extension that saves the file to another database, or uploads it to a remote server. The filesystem is not involved, so you wouldn't be able to manually copy your data to the target even if you wanted to. Only the target knows how its data is stored.
That being said, the only way I know to get the path of a filesystem folder being dropped into is to drag&drop a dummy file, and then monitor the filesystem for where the drop target creates/copies the file to. Then you can replace the dummy file with your real file. But this is not very reliable, and not very friendly to the target.

What does SetFileValidData do? How does it differ from SetEndOfFile?

I'm looking for a way to extend a file asynchronously and efficiently.
The support document "Asynchronous Disk I/O Appears as Synchronous on Windows NT, Windows 2000, and Windows XP" says:
NOTE: Applications can make the previously mentioned write operation
asynchronous by changing the Valid Data Length of the file by using
the SetFileValidData function, and then issuing a WriteFile.
According to MSDN, SetFileValidData "sets the valid data length of the specified file".
But I still don't understand what "valid data" is. How does it differ from the size of the file?
I can use SetFilePointerEx and SetEndOfFile to extend the file size, but how would I do this with SetFileValidData?
SetFileValidData does not accept an argument larger than the size of the file. In that case, what is the actual purpose of SetFileValidData?
When you use SetEndOfFile to increase the length of a file, the logical file length changes and the necessary disk space is allocated, but no data is actually physically written to the disk sectors corresponding to the new part of the file. The valid data length remains the same as it was.
This means you can use SetEndOfFile to make a file very large very quickly, and if you read from the new part of the file you'll just get zeros. The valid data length increases when you write actual data to the new part of the file.
That's fine if you just want to reserve space, and will then be writing data to the file sequentially. But if you make the file very large and immediately write data near the end of it, zeros need to be written to the new part of the file, which will take a long time. If you don't actually need the file to contain zeros, you can use SetFileValidData to skip this step; the new part of the file will then contain random data from previously deleted files.
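The "read the new part and get zeros" behaviour is easy to observe. The sketch below uses Python's portable `truncate()` as a stand-in for extending a file; it is not a call to SetEndOfFile itself, and the file name and sizes are arbitrary. Python documents the contents of the extended area as platform-dependent, zero-filled on most systems.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.bin")
    with open(path, "wb") as f:
        f.write(b"DATA")            # 4 bytes of real data
        f.truncate(1024 * 1024)     # extend the logical length to 1 MiB

    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(4)            # the bytes that were actually written
        f.seek(512 * 1024)
        middle = f.read(16)         # bytes from the never-written region
```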
Addendum:
The rules for sparse files are different.
You should not use SetFileValidData on a file that non-privileged users have read access to; this could leak content from deleted files that belonged to other users.
Please note that SetEndOfFile() doesn't write any zeros to any allocated sectors on disk, it just allocates the space pointers inside MFT records and then updates the space bitmap of the whole file system. But the OS, or FS, will record the valid/logical file length in its MFT record.
If you enlarge the file from 1GB to 2GB, then the appended 1GB should be all zeros, but the FS won't write the zeros to disk; it refers to this file's valid length to know that the 1GB should be zeros. If you try to read from this enlarged 1GB portion, it will fill in zeros directly in RAM and then hand them back to your application. But if you write any byte inside this 1GB portion, the FS has to fill with zeros from the original 1GB offset up to the position your application is writing to, but not the bytes from that position to the tail of the file. Meanwhile, it records the valid/logical length as running from 0 to the current position; the physical size and allocated size are still 2GB.
But, if you use SetFileValidData(), the FS will set the valid length to 2GB directly, and won't bother to fill any zeros. Whatever location you write to, it just writes, but whatever location you read from, you may read out some garbage data which was previously generated by other applications before the file was extended into that disk space.
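The bookkeeping described above can be captured in a toy model. This is only a sketch of the accounting, not NTFS code: the `ToyFile` class, its fields, and the simplified range check in `set_valid_data` are assumptions made for illustration.

```python
GB = 1 << 30


class ToyFile:
    """Toy model of file size vs. valid data length (VDL)."""

    def __init__(self):
        self.size = 0     # logical end of file
        self.valid = 0    # valid data length
        self.zeroed = 0   # bytes the "file system" had to zero-fill on write

    def set_end_of_file(self, new_size):
        # Changes the logical size only; nothing is zeroed here.
        self.size = new_size
        self.valid = min(self.valid, new_size)

    def set_valid_data(self, new_valid):
        # Simplified check: the new VDL may not shrink or exceed the file size.
        if not self.valid <= new_valid <= self.size:
            raise ValueError("valid data length out of range")
        self.valid = new_valid  # no zero-filling: old disk contents show through

    def write(self, offset, length):
        if offset > self.valid:
            # The gap between the old VDL and the write position gets zeroed.
            self.zeroed += offset - self.valid
        end = offset + length
        self.valid = max(self.valid, end)
        self.size = max(self.size, end)


# Extend to 2 GB, then write one 4 KB block at the very end.
plain = ToyFile()
plain.set_end_of_file(2 * GB)
plain.write(2 * GB - 4096, 4096)

# Same, but move the valid data length up first.
skipped = ToyFile()
skipped.set_end_of_file(2 * GB)
skipped.set_valid_data(2 * GB)
skipped.write(2 * GB - 4096, 4096)
```

In the first case almost the whole 2 GB has to be zero-filled before the write lands; in the second case nothing is zeroed, which is where both the speed and the data-leak risk come from.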
I agree with Harry Johnston's answer. From a practical point of view, while SetFileValidData has a performance advantage because it does not require writing zeros, it does have security implications because the file might contain data from other deleted files. So a special privilege, SE_MANAGE_VOLUME_NAME, is required, as MSDN mentions: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365544(v=vs.85).aspx
The reason is that SetFileValidData can expose other users' deleted data through that particular file, so accounts without that privilege (normal, non-administrator users) are not allowed to call it. Even privileged users still need to take care to use ACLs (access control lists) in the file system to protect that file so that it is not shared with non-privileged users.
It seems that SetEndOfFile does not really allocate reserved disk space for the target file; SetFileValidData is responsible for that job.
According to MSDN:
You can use the SetFileValidData function to create large files in very specific circumstances so that the performance of subsequent file I/O can be better than other methods. Specifically, if the extended portion of the file is large and will be written to randomly, such as in a database type of application, the time it takes to extend and write to the file will be faster than using SetEndOfFile and writing randomly.
If SetEndOfFile really allocated the space, then SetFileValidData would do no better than SetEndOfFile when writing randomly. So SetEndOfFile may just create a sparse file with hole(s), while SetFileValidData does the actual allocation.

Get file offset on disk/cluster number

I need to get any information about where a file is physically located on an NTFS disk: absolute offset, cluster ID, anything.
I need to scan the disk twice, once to get allocated files and one more time I'll need to open partition directly in RAW mode and try to find the rest of data (from deleted files). I need a way to understand that the data I found is the same as the data I've already handled previously as file. As I'm scanning disk in raw mode, the offset of the data I found can be somehow converted to the offset of the file (having information about disk geometry). Is there any way to do this? Other solutions are accepted as well.
Now I'm playing with FSCTL_GET_NTFS_FILE_RECORD, but can't make it work at the moment and I'm not really sure it will help.
UPDATE
I found the following function
http://msdn.microsoft.com/en-us/library/windows/desktop/aa364952(v=vs.85).aspx
It returns structure that contains nFileIndexHigh and nFileIndexLow variables.
Documentation says
The identifier that is stored in the nFileIndexHigh and nFileIndexLow members is called the file ID. Support for file IDs is file system-specific. File IDs are not guaranteed to be unique over time, because file systems are free to reuse them. In some cases, the file ID for a file can change over time.
I don't really understand what this is. I can't connect it to the physical location of the file. Is it possible to extract this file ID from the MFT later?
UPDATE
Found this:
This identifier and the volume serial number uniquely identify a file. This number can change when the system is restarted or when the file is opened.
This doesn't satisfy my requirements, because I'm going to open the file and the fact that ID might change doesn't make me happy.
Any ideas?
Use the Defragmentation IOCTLs. For example, FSCTL_GET_RETRIEVAL_POINTERS will tell you the extents which contain file data.
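Once you have the extents, turning them into byte offsets is plain arithmetic. The sketch below assumes the extent list has already been read out of the RETRIEVAL_POINTERS_BUFFER as (NextVcn, Lcn) pairs; the function name, the 4096-byte cluster size, and the sample extents are made up for illustration. Offsets are relative to the start of the volume, so for a raw disk scan you would still add the partition's starting offset.

```python
def extents_to_byte_ranges(starting_vcn, extents, cluster_size):
    """Convert (next_vcn, lcn) extents into (file_offset, volume_offset, length) in bytes.

    volume_offset is None where lcn == -1, i.e. a range with no clusters
    behind it (as reported for sparse or compressed regions).
    """
    ranges = []
    vcn = starting_vcn
    for next_vcn, lcn in extents:
        length = (next_vcn - vcn) * cluster_size
        volume_offset = None if lcn == -1 else lcn * cluster_size
        ranges.append((vcn * cluster_size, volume_offset, length))
        vcn = next_vcn  # the next extent starts where this one ends
    return ranges


# Hypothetical file: 16 clusters at LCN 1000, then 8 clusters at LCN 5000.
sample = extents_to_byte_ranges(0, [(16, 1000), (24, 5000)], cluster_size=4096)
```

Matching data found in the raw scan against known files then reduces to checking whether its volume offset falls inside one of these ranges.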

Are there alternatives for creating large container files that are cross platform?

Previously, I asked the question.
The problem is the demands of our file structure are very high.
For instance, we're trying to create a container with up to 4,500 files and 500 MB of data.
The file structure of this container consists of:
A SQLite DB (under 1 MB)
A text-based, XML-like file
Images inside a dynamic folder structure that make up the rest of the 4,500 or so files
After the initial creation, the image files are read-only, with the exception of deletion.
The small DB is used regularly when the container is accessed.
Tar, Zip and the like are all too slow (even with zero compression). Slow is subjective, I know, but untarring a container of this size takes over 20 seconds.
Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT).
Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on; I'll go into more detail.
2) 4,500 files and 500 MB of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. Just I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files in to a single blob. While you are concatenating them, you track their filename, the file length, and the offset that the file starts within the blob. You write that information out in to a block of data, sorted by name. We'll call this the Table of Contents, or TOC block.
Next, then, you concatenate the two files together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the beginning of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then, append at the very end, the offset to the start of the TOC. Then you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1K byte in size), then you can easily perform a binary search on the TOC. Simply fill each block with the File information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC, start in the middle, read the first file name, and go from there. Soon, you'll find the block, and then you read in the block and scan it for the file. This makes it efficient for reading without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well; disks like to work with regular-sized blocks of data, and this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look in to some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
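Here is a minimal Python sketch of the "TOC at the end" variant, under some simplifying assumptions: the entry layout (name length, offset, length) and the function names are invented for illustration, the TOC is scanned linearly rather than laid out in fixed-size blocks for binary search, and file data is not padded. All integers are packed big-endian (network order).

```python
import io
import struct

ENTRY = struct.Struct(">HQQ")   # name length, data offset, data length
TRAILER = struct.Struct(">Q")   # offset where the TOC starts


def build_container(files):
    """files: dict of name -> bytes. Returns data + TOC + trailer as one blob."""
    out = io.BytesIO()
    toc = []
    for name in sorted(files):                 # TOC entries sorted by name
        data = files[name]
        toc.append((name.encode("utf-8"), out.tell(), len(data)))
        out.write(data)
    toc_offset = out.tell()
    for raw_name, offset, length in toc:
        out.write(ENTRY.pack(len(raw_name), offset, length))
        out.write(raw_name)
    out.write(TRAILER.pack(toc_offset))        # last 8 bytes point at the TOC
    return out.getvalue()


def read_file(blob, wanted):
    """Seek to the end, back up to find the TOC, then read just one file."""
    f = io.BytesIO(blob)
    f.seek(-TRAILER.size, io.SEEK_END)
    toc_end = f.tell()
    (toc_offset,) = TRAILER.unpack(f.read(TRAILER.size))
    f.seek(toc_offset)
    target = wanted.encode("utf-8")
    while f.tell() < toc_end:
        name_len, offset, length = ENTRY.unpack(f.read(ENTRY.size))
        name = f.read(name_len)
        if name == target:
            f.seek(offset)
            return f.read(length)
    raise KeyError(wanted)


blob = build_container({
    "img/a.png": b"AAAA",
    "db.sqlite": b"SQLITE",
    "notes.xml": b"<x/>",
})
```

With a real file you would open it in binary mode instead of wrapping a bytes object; the seeks are the same.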
Working on the assumption that you're only going to need read-only access to the files why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length. All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language but it's pretty straight forward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank you for expanding your question; it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008 so I'm not positive of the capabilities of SQLite but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend putting files inside the database, but given that the total size of all images is around 500 MB for 4,500 images, you're looking at a little over 100 KB per image, right? If you're using a dynamic path to store the images, then in a slightly more normalized database you could have an "ImagePaths" table that maps each path to an ID; then you can look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also be in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OS X without issue. You can simply rely on your SQLite engine to provide the performance and compatibility you need.
How you optimize it depends on your usage, for example if you're frequently needing to get all images at a certain path then having a PathID (as an integer for performance) would be fast, but if you're showing all images that start with "A" and simply show the path as a property then an index on the ImageName column would be of more use.
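A small sketch of that layout using Python's built-in sqlite3 module. The "ImagePaths" table, PathID, and ImageName come from the description above; the "Images" table name, the Data column, the index names, and the helper functions are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real container would be a file on disk
conn.executescript("""
CREATE TABLE ImagePaths (
    PathID INTEGER PRIMARY KEY,
    Path   TEXT UNIQUE NOT NULL
);
CREATE TABLE Images (
    ImageID   INTEGER PRIMARY KEY,
    PathID    INTEGER NOT NULL REFERENCES ImagePaths(PathID),
    ImageName TEXT NOT NULL,
    Data      BLOB NOT NULL
);
-- One index per access pattern mentioned above.
CREATE INDEX idx_images_path ON Images(PathID);
CREATE INDEX idx_images_name ON Images(ImageName);
""")


def add_image(conn, path, name, data):
    """Store one image, creating the path row on first use."""
    conn.execute("INSERT OR IGNORE INTO ImagePaths (Path) VALUES (?)", (path,))
    (path_id,) = conn.execute(
        "SELECT PathID FROM ImagePaths WHERE Path = ?", (path,)
    ).fetchone()
    conn.execute(
        "INSERT INTO Images (PathID, ImageName, Data) VALUES (?, ?, ?)",
        (path_id, name, data),
    )


def images_in_path(conn, path):
    """Return (name, bytes) for every image stored under one path."""
    rows = conn.execute(
        "SELECT i.ImageName, i.Data FROM Images i "
        "JOIN ImagePaths p ON p.PathID = i.PathID "
        "WHERE p.Path = ? ORDER BY i.ImageName",
        (path,),
    )
    return [(name, bytes(data)) for name, data in rows]


add_image(conn, "covers/2009", "apple.png", b"\x89PNG-a")
add_image(conn, "covers/2009", "banana.png", b"\x89PNG-b")
add_image(conn, "thumbs", "apple.png", b"tiny")
conn.commit()

in_covers = images_in_path(conn, "covers/2009")
starting_with_a = conn.execute(
    "SELECT COUNT(*) FROM Images WHERE ImageName LIKE 'a%'"
).fetchone()[0]
```

Deleting an image is then a single DELETE, with no archive rebuild.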
I am a little concerned, though, that this sounds like premature optimization. You really need to find a solution that works 'fast enough' and abstract the mechanics of it, so that your application (or both apps, if you have both Mac and PC versions) uses a simple repository or similar. Then you can change the storage/retrieval method at will without any implication for your application.
Check Solid File System - it seems to be what you need.

Resources