I am currently trying to write to different locations of a pre-allocated file.
I first allocated my file like so:
File.open("file", "wb") { |file| file.truncate(size) }
Here size is the total size of the file.
Afterwards I receive data of size XX which fits into location Y of that file. Keep in mind this portion of the process is forked. Each fork has its own unique socket, opens its own unique file handle, writes to the file, then closes it, like so:
data = socket.read(256)
File.open("file", "wb") do |output|
output.seek(location * 256, IO::SEEK_SET)
output.write(data)
end
This should in turn allow the forked processes to open a file handle, seek to the correct location (if location is 2 and data_size is 256, then the write location is 512 -> 768), and write the chunk of data that they received.
What this is actually doing, though, is beyond my comprehension. I monitor the file's size as it is being populated, and it bounces around between different sizes, which should not be happening.
When analyzing the file with a hex editor, the region at the top where the file data header should be is filled with null bytes (likewise with about 1/4 of the file). However, if I limit the forked processes to writing only one file chunk each and then exiting, the writes are fine and at their proper locations.
I have done some other testing, such as dumping the part locations and the start locations of the data, and my equation for seeking to the correct location in the file appears to be correct as well.
Is there something I am missing here, or is there another way to have multiple threads/processes open a file handle to a file, seek to a specific location, and then write a chunk of data?
I have also attempted to use flock on the file, and it yields the same results, likewise with using the main process instead of forking.
I have tested the same application, but rather than opening/closing the file handle each time I need to write data in rapid succession (transferring close to 70 MB/s), I created one file handle per forked process and kept it open. This fixed the problem, resulting in a 1:1 duplication of the file with matching checksums.
So the question is, why is opening/writing/closing file handles to a file in rapid succession causing this behavior?
It's your file mode.
File.open("file", "wb")
"wb" means "upon opening, truncate the file to zero length".
I suggest "r+b", which means "reading and writing, no truncation". Read more about available modes here: http://ruby-doc.org/core-2.2.2/IO.html#method-c-new
BTW, "b" in those modes means "binary" (as opposed to default "t" (text))
Related
I recently learned about metadata and how it's information about the data itself.
Seeing as file size is included among those statistics, would it be possible to change the file size to something absurd and unreasonable like 1,000 petabytes? If so, what would the effects be on a computer, and how would it affect a Windows 11 computer?
Not all metadata is directly editable. Some of it consists simply of properties of the file. You can't just set the file size to an arbitrary number; you have to actually edit the file itself. So to create a 1,000-petabyte file, you would need to have that much disk space first.
Another example would be the file type. You can't change a JPEG into a PNG by setting filetype=png. You have to process and convert the file, giving you an entirely new and different file with its own set of metadata/properties.
I have a Python program which performs a simple operation on a file:
with open(self.cache_filename_url, "a", encoding="utf8") as f:
w = csv.writer(f, delimiter=',', quotechar='"', lineterminator='\n')
w.writerow([cache_url, rpd_products])
As you can see it just opens the file and appends a CSV line to it. It does this a lot, in a loop.
I accidentally ran two copies of this program simultaneously, so I think they would have been appending to the file simultaneously. I am trying to determine the worst-case-scenario for file corruption.
Do you think the writes would at least be atomic operations in this case? For example this wouldn't be a problem for me:
old line
old line
new line written by instance 1
new line written by instance 2
new line written by one
This would be a problem for me:
old line
old line
[half of new line written by instance 1] [half of new line by instance 2]
etc
To put it another way, is it possible for the two append operations to "interfere" with each other?
EDIT: I am using Windows 7
Opening the same file multiple times in shared write mode can definitely be problematic. And if they don't open in shared mode, one of them will throw an exception saying it cannot open the file.
If in SHARED mode:
Both instances will have their own internal file pointer. In most cases they will probably write independently, but you could get:
Process A opens file, sets pointer to end (byte 1024)
Process B opens file, sets pointer to end (byte 1024)
Process B writes at byte 1024 and closes file
Process A writes at byte 1024 and closes file.
Both processes will have written to the file at the same location, so you've basically lost the record from Process B. And depending on how the close works (whether it truncates), if the two lines are different lengths you could be left with a trailing fragment of Process B's line if it was the longer one.
If it is in EXCLUSIVE mode, one process will fail to open the file, and whatever exception handling you have will kick in.
Which mode you are in can be system dependent, as Python doesn't seem to provide any mechanisms for controlling the share mode.
Update: I ran a check on my file, and I did indeed have corrupted partial lines (the case under "This would be a problem for me" in my question)
It's unfortunate, especially since it implies you could have problems even when you intend to share a file between two processes.
I am still interested in any pointers on how to avoid this outcome. I will hold off on marking an answer as accepted for now. (The other answer is good, but doesn't provide enough details on these modes or how to determine which will be used.)
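One way to avoid the interleaving (a sketch only, assuming every writer cooperates) is to serialize the appends with a separate lock file created atomically via os.O_CREAT | os.O_EXCL; the lock file name and retry interval below are arbitrary choices for illustration:

import csv
import os
import time

LOCK_PATH = "cache.lock"  # hypothetical lock-file name

def append_row(csv_path, row):
    # Spin until we manage to create the lock file; O_CREAT | O_EXCL fails
    # if the file already exists, so only one process "wins" at a time.
    while True:
        try:
            fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(0.01)
    try:
        with open(csv_path, "a", encoding="utf8") as f:
            w = csv.writer(f, delimiter=',', quotechar='"', lineterminator='\n')
            w.writerow(row)
    finally:
        os.close(fd)
        os.remove(LOCK_PATH)

# Usage, mirroring the snippet in the question:
# append_row(self.cache_filename_url, [cache_url, rpd_products])

The obvious trade-off is that a process which crashes while holding the lock leaves the lock file behind, so some cleanup or timeout policy is needed on top of this.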
I am looking for a way to extend a file asynchronously and efficiently.
The support document Asynchronous Disk I/O Appears as Synchronous on Windows NT, Windows 2000, and Windows XP says:
NOTE: Applications can make the previously mentioned write operation
asynchronous by changing the Valid Data Length of the file by using
the SetFileValidData function, and then issuing a WriteFile.
On MSDN, SetFileValidData is described as a function that "sets the valid data length of the specified file".
But I still do not understand what the "valid data" is. What is the difference between it and the size of the file?
I can use SetFilePointerEx and SetEndOfFile to extend the file size, but how do I do this with SetFileValidData?
SetFileValidData cannot take an argument larger than the size of the file. In that case, what is the point of SetFileValidData?
When you use SetEndOfFile to increase the length of a file, the logical file length changes and the necessary disk space is allocated, but no data is actually physically written to the disk sectors corresponding to the new part of the file. The valid data length remains the same as it was.
This means you can use SetEndOfFile to make a file very large very quickly, and if you read from the new part of the file you'll just get zeros. The valid data length increases when you write actual data to the new part of the file.
That's fine if you just want to reserve space, and will then be writing data to the file sequentially. But if you make the file very large and immediately write data near the end of it, zeros need to be written to the new part of the file, which will take a long time. If you don't actually need the file to contain zeros, you can use SetFileValidData to skip this step; the new part of the file will then contain random data from previously deleted files.
Addendum:
The rules for sparse files are different.
You should not use SetFileValidData on a file that non-privileged users have read access to; this could leak content from deleted files that belonged to other users.
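As a rough illustration of the sequence described above (not a definitive implementation, and assuming the process token already has SeManageVolumePrivilege enabled; the file name and the 2 GB size are placeholders):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE h = CreateFileW(L"big.dat", GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    size.QuadPart = 2LL * 1024 * 1024 * 1024;   /* 2 GB logical length */

    /* Extend the logical file size; no data is written to disk yet. */
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);

    /* Raise the valid data length so the FS does not zero-fill on the
       first write near the end of the file. Requires
       SeManageVolumePrivilege in the caller's token. */
    if (!SetFileValidData(h, size.QuadPart))
        printf("SetFileValidData failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}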
Please note that SetEndOfFile() doesn't write any zeros to the allocated sectors on disk; it just allocates the space pointers inside the MFT records and then updates the space bitmap of the whole file system. The OS, or rather the FS, also records the valid/logical file length in the file's MFT record.
If you enlarge the file, say from 1 GB to 2 GB, the appended 1 GB should be all zeros, but the FS won't write those zeros to disk; it refers to the file's valid length to know that the extra 1 GB should read as zeros. If you read from this enlarged 1 GB portion, the FS fills zeros directly in RAM and returns them to your application. But if you write any byte inside this 1 GB portion, the FS has to fill with zeros from the original 1 GB offset up to the location your application is writing to (but not the bytes from that location to the tail of the file). Meanwhile, it records the valid/logical length as running from 0 to that location; the physical and allocated size is still 2 GB.
But if you use SetFileValidData(), the FS sets the valid length to 2 GB directly and doesn't bother to fill in any zeros. Wherever you write, it just writes; but wherever you read, you may read out garbage data that was previously generated by other applications before the file was extended into that disk space.
I agree with Harry Johnston's answer, and from a practical point of view, while SetFileValidData has a performance advantage because it does not require writing zeros, it does have security implications because the file might contain data from other deleted files. That is why a special privilege, SE_MANAGE_VOLUME_NAME, is required, as MSDN mentions: http://msdn.microsoft.com/en-us/library/windows/desktop/aa365544(v=vs.85).aspx
The reason is that, if the user account of the running program doesn't have that privilege, using SetFileValidData could expose other users' deleted data within that particular file, so normal (non-administrator) users are not allowed to do it. Even privileged users still need to take care to use ACLs (access control lists) in the file system to protect the file so that it is not shared with non-privileged users.
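For completeness, a hedged sketch of how a process might try to enable that privilege on its own token before calling SetFileValidData; the privilege still has to be granted to the account in the first place (for example through Local Security Policy):

#include <windows.h>

/* Try to enable SeManageVolumePrivilege on the current process token.
   Returns TRUE on success, FALSE if the privilege is absent or cannot be enabled. */
BOOL EnableManageVolumePrivilege(void)
{
    HANDLE token;
    TOKEN_PRIVILEGES tp;

    if (!OpenProcessToken(GetCurrentProcess(),
                          TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token))
        return FALSE;

    if (!LookupPrivilegeValueW(NULL, L"SeManageVolumePrivilege",
                               &tp.Privileges[0].Luid)) {
        CloseHandle(token);
        return FALSE;
    }

    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;

    BOOL ok = AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL) &&
              GetLastError() == ERROR_SUCCESS;  /* ERROR_NOT_ALL_ASSIGNED means it failed */
    CloseHandle(token);
    return ok;
}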
It seems that SetEndOfFile does not really allocate reserved disk space for the target file; SetFileValidData is responsible for that job.
Referring to MSDN:
You can use the SetFileValidData function to create large files in very specific circumstances so that the performance of subsequent file I/O can be better than other methods. Specifically, if the extended portion of the file is large and will be written to randomly, such as in a database type of application, the time it takes to extend and write to the file will be faster than using SetEndOfFile and writing randomly.
If SetEndOfFile really allocated space, then SetFileValidData would do nothing better than SetEndOfFile when writing randomly. So SetEndOfFile may just create a sparse file with hole(s), while SetFileValidData does the actual allocation.
I have a custom file type that is implemented in sections, with a header at the top that shows the offset and length of each section within the file.
Currently, whenever I want to interact with the file, I must either load and parse the entire thing up front, or else pick only the sections that I need and load just them.
What I would like to do is to achieve a hybrid approach where each of the sections is loaded on-demand.
It seems however that doing this has a lot of potential downsides in terms of leaving filesystem handles open for longer than I would like and the additional code complexity that I would incur.
Are there any standard patterns for this sort of thing? It seems that my options are to:
Just load the entire file and stop grousing about the cycles/memory wasted
Load the entire file into memory as raw bytes and then satisfy any requests for unloaded sections from the memory buffer rather than disk. This saves me the cost of parsing the unneeded sections and requires less memory (since the disk representation is much more compact than the object model around it), but still means that I waste memory for sections that I never end up loading.
Load whatever sections I need right away and close the file but hold onto the source location of the file. Then if another section is requested, re-open the file and load the data. In this case I could get strange results if the underlying file is changed.
Same as the above but leave a file handle open (perhaps allowing read sharing).
Load the file using Memory-Mapped IO and leave a view on the file open.
Any thoughts?
If possible, MMAP-ing the whole file is usually the easiest thing to do if you have a random-access pattern. This way you just delegate the loading/unloading issue to the OS and you have 1 & 2 for free.
If you have very special access patterns, you can even use something like fadvise() (I don't know the exact Win32 equivalent) to tell the OS your access intent.
If your file is more than 2 GB, you can either go the 64-bit way or mmap() portions of the file on demand.
If the file is relatively small, mmap-ing the entire file is good enough. If the file is large, you could leave a mmap view open, and just move it around the file and resize it to view each section when needed.
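To make that concrete, here is a minimal POSIX-style sketch of mapping the whole file and treating a section as a pointer into the mapping (on Win32 the equivalent calls are CreateFileMapping/MapViewOfFile); the file name and the section offset value are placeholders for whatever your header actually contains:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sections.bin", O_RDONLY);      /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; pages are faulted in lazily, so
       sections that are never touched are never read from disk. */
    const unsigned char *base =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);   /* the mapping stays valid after the descriptor is closed */

    /* A section is then just a pointer into the mapping, using the
       offset/length values parsed from the file's own header: */
    size_t section_offset = 4096;                 /* placeholder value */
    const unsigned char *section = base + section_offset;
    printf("first byte of section: 0x%02x\n", (unsigned)section[0]);

    munmap((void *)base, st.st_size);
    return 0;
}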
I've been writing a Ruby program that merges the contents of two files.
For example, if a torrent has been downloaded twice separately, it tries to merge their contents for the blocks that have been completed.
So, I've been looking for a method which modifies a stream only at the place required and saves only that block instead of saving the whole stream again.
I'm reading the file in blocks of 16 KiB. How do I "replace" (not append) the contents of one 16 KiB block, so that only those bytes are written to disk and the whole file isn't re-written each time?
Kind of,
#Doesn't exist unfortunately.
#By default it appends instead of replacing, so file size grows.
IO.write(file_name, content, offset, :replace => true)
Does there exist a method which achieves that kind of functionality?
Open the file in "r+b" mode, seek to the location and just write to it:
f = File.new("some.existing.file", "r+b")
f.seek(1024)
f.write("test\n")
f.close
This will overwrite 5 characters of the file, starting at offset 1024.
If the file is shorter than your seek offset, the gap is filled with an appropriate number of null characters before your data is written.
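Applied to the 16 KiB blocks from the question, that looks something like the sketch below; the file name, block size constant, and block_index parameter are illustrative assumptions:

BLOCK_SIZE = 16 * 1024

# Overwrite the block_index-th 16 KiB block in place; only these bytes
# are rewritten, the rest of the file stays untouched on disk.
def write_block(file_name, block_index, data)
  File.open(file_name, "r+b") do |f|
    f.seek(block_index * BLOCK_SIZE, IO::SEEK_SET)
    f.write(data)
  end
end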