Worst-case scenario: launched two copies of a program which appends lines to a file - parallel-processing

I have a Python program which performs a simple operation on a file:
with open(self.cache_filename_url, "a", encoding="utf8") as f:
    w = csv.writer(f, delimiter=',', quotechar='"', lineterminator='\n')
    w.writerow([cache_url, rpd_products])
As you can see it just opens the file and appends a CSV line to it. It does this a lot, in a loop.
I accidentally ran two copies of this program simultaneously, so I think they would have been appending to the file simultaneously. I am trying to determine the worst-case-scenario for file corruption.
Do you think the writes would at least be atomic operations in this case? For example this wouldn't be a problem for me:
old line
old line
new line written by instance 1
new line written by instance 2
new line written by one
This would be a problem for me:
old line
old line
[half of new line written by instance 1] [half of new line by instance 2]
etc
To put it another way, is it possible for the two append operations to "interfere" with each other?
EDIT: I am using Windows 7

Opening the same file from multiple processes in shared write mode can definitely be problematic. And if they don't open it in shared mode, one of them will throw an exception because it cannot open the file.
If SHARED mode:
Both instances will have their own internal pointer. In most cases, they will probably write independently. You could get:
Process A opens file, sets pointer to end (byte 1024)
Process B opens file, sets pointer to end (byte 1024)
Process B writes at byte 1024 and closes file
Process A writes at byte 1024 and closes file.
Both processes will have written to the file at the same location. You've essentially lost the record from Process B: Process A overwrote it, and, depending on how the close works (whether it truncates), if Process B's line was longer than Process A's you could still see the tail of B's line left dangling after A's.
If it is in EXCLUSIVE mode, one process will fail to open the file, and whatever exception handling you have will kick in.
Which mode you end up in can be system-dependent, as Python doesn't seem to provide any mechanism for controlling the share mode.

Update: I ran a check on my file, and I did indeed have corrupted partial lines (the case under "This would be a problem for me" in my question)
It's unfortunate, especially since it implies you could have problems even when you intend to share a file between two processes.
I am still interested in any pointers on how to avoid this outcome. I will hold off on marking an answer as accepted for now. (The other answer is good, but doesn't provide enough details on these modes or how to determine which will be used.)

Related

How to protect a file with Win32 API from being corrupted if the power is reset?

In a C++ Win32 app I write a large file by appending blocks of about 64K, using code like this:
auto h = ::CreateFile(
"uncommited.dat",
FILE_APPEND_DATA, // open for writing
FILE_SHARE_READ, // share for reading
NULL, // default security
CREATE_NEW, // create new file only
FILE_ATTRIBUTE_NORMAL, // normal file
NULL); // no attr. template
char block[64 * 1024] = { /* ... */ };  // one 64K block of data
DWORD written = 0;
for (int i = 0; i < 10000; ++i) { ::WriteFile(h, block, sizeof block, &written, NULL); }
As far as I can see, if the process is terminated unexpectedly, some blocks with numbers i >= N are lost, but blocks with numbers i < N are valid and I can read them when the app restarts, because the blocks themselves are not corrupted.
But what happens if the power is reset? Is it true that the entire file can be corrupted, or even end up with zero length?
Is it a good idea to do
FlushFileBuffers(h);
MoveFile("uncommited.dat", "commited.dat");
assuming that MoveFile is some kind of atomic operation, and, when the app restarts, open "commited.dat" as valid and delete "uncommited.dat" as corrupted? Or is there a better way?
MoveFile can work all right in the right situation. It has a few problems, though; for example, a file with the new name must not already exist.
If that can happen (you're basically updating an existing file that you want to ensure won't get corrupted, by making a copy, modifying the copy, then replacing the old file with the new one), you probably want ReplaceFile rather than MoveFile.
With ReplaceFile, you write your data to uncommited.dat (or whatever name you prefer). Then yes, you probably want to call FlushFileBuffers, and finally ReplaceFile to replace the old file with the new one. This makes use of NTFS journaling (which applies to file-system metadata, not to the contents of your files) to assure that only one of two outcomes is possible: either you have the old file (entirely intact) or the new one (also entirely intact). If power dies in the middle of the change, NTFS uses its journal to roll back the transaction.
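A rough sketch of that sequence, continuing from the question's code (commit_file is a made-up helper name; h is the handle from the CreateFile call above, and the file names are the ones from the question):

#include <windows.h>

/* Sketch only: finish the uncommited.dat -> commited.dat swap described above. */
BOOL commit_file(HANDLE h)
{
    FlushFileBuffers(h);                 /* force the written data out to disk */
    CloseHandle(h);                      /* the file must be closed before the swap */
    if (ReplaceFileA("commited.dat",     /* existing file to be replaced */
                     "uncommited.dat",   /* replacement we just wrote */
                     NULL, 0, NULL, NULL))
        return TRUE;
    /* First run: there is no commited.dat yet, so fall back to a plain rename. */
    if (GetLastError() == ERROR_FILE_NOT_FOUND)
        return MoveFileA("uncommited.dat", "commited.dat");
    return FALSE;
}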
NTFS does also support transactions, but Microsoft generally recommends against applications trying to use this directly. It apparently hasn't been used much since they added it (in Windows Vista), and MSDN hints that it's likely to be removed in some future version of Windows.
For an append-only scenario you can split the data into blocks (of constant or variable size). Each block should be accompanied by some form of checksum (SHA, MD5, CRC).
After a crash you can read each block sequentially and verify its checksum. The first damaged block and everything after it should be treated as lost (you can inspect those blocks later and possibly recover data from them manually).
To append more data, truncate the file to the end of the last correct block.
You can also write two copies in parallel and, after a crash, keep the one with more good blocks.
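A minimal sketch of the append side of such a scheme (the record layout and the append_block helper are invented for illustration; zlib's crc32 stands in for whichever checksum you pick):

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>   /* crc32() */

/* Illustrative record layout: [4-byte length][payload][4-byte CRC32 of payload]. */
static int append_block(FILE *f, const void *data, uint32_t len)
{
    uint32_t crc = (uint32_t)crc32(0L, (const Bytef *)data, len);
    if (fwrite(&len, sizeof len, 1, f) != 1) return -1;
    if (fwrite(data, 1, len, f) != len)      return -1;
    if (fwrite(&crc, sizeof crc, 1, f) != 1) return -1;
    return fflush(f);   /* plus FlushFileBuffers/fsync if durability is required */
}

Recovery after a crash is the mirror image: read the length, payload, and stored CRC of each record in turn, recompute the CRC, and truncate the file at the start of the first record that is short or fails verification.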

Closing all pipes of a process

I am working on a program that acts in a similar way to a shell, but supports only foreground processes and pipes. I have multiple processes writing to the same pipe and some other properties that differ from the normal usage of pipes. Anyhow, my question is:
Is there any easy (automatic) way to close all file descriptors of a process except the three basic ones?
I am asking this question because I have a lot of difficulty keeping track of all the file descriptors for every process, and sometimes they behave in ways that I find unpredictable. That could also be because I don't have a very thorough understanding of them.
Is there any easy (automatic) way to close all file descriptors of a process except the three basic ones?
The normal way to do this is to simply iterate over all of them and close them:
for (i = getdtablesize(); i > 3;) close(--i);
That's already a one-liner. It doesn't get any more "automatic" than that.
I am asking this question since I have a lot of difficulty keeping track of all file descriptors for every process.
It will be worth your time to think about the life cycle of each file descriptor you open, when it gets duplicated (e.g. dup2() and fork()), how it gets used, and make sure you account for how each one is going to get closed when it is no longer needed. Papering over a problem of leaked file descriptors by indiscriminately closing them all is not going to be sustainable.
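For instance, a typical pattern in a shell-like program is to duplicate a pipe end onto stdout in the child and then immediately close every original descriptor that is no longer needed, in both the child and the parent. A rough sketch (ls merely stands in for whatever command you exec):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    pipe(fds);

    pid_t pid = fork();
    if (pid == 0) {                  /* child: its stdout becomes the pipe */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]);               /* the original pipe fds are no longer needed here */
        close(fds[1]);
        execlp("ls", "ls", (char *)NULL);
        _exit(127);                  /* exec failed */
    }
    close(fds[1]);                   /* parent keeps only the read end open */

    char buf[256];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    close(fds[0]);                   /* nothing left to track */
    return 0;
}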
I have multiple processes writing to the same pipe
If you do this, then you need to be aware that the order in which data arrive at the other end of the pipe is unpredictable. Individual writes of at most PIPE_BUF bytes are guaranteed not to be interleaved with each other, but beyond that it will be difficult to keep the data stream from being corrupted.
Use the closefrom(3) C library function.
From the manpage:
The closefrom() system call deletes all open file descriptors greater
than or equal to lowfd from the per-process object reference table.
Any errors encountered while closing file descriptors are ignored.
Example usage:
#include <stdio.h>
#include <unistd.h>   /* on Linux with libbsd, closefrom() comes from <bsd/unistd.h> */

int main() {
    // Close everything except stdin, stdout and stderr
    closefrom(3);     // Where 3 is the lowest file descriptor you wish to close
    printf("Clear of all but the three basic file descriptors!\n");
    return 0;
}
This works on most Unix systems, but on Linux it requires the libbsd support library.

Write to specific part of preallocated file

I am currently trying to write to different locations of a pre-allocated file.
I first allocated my file like so:
File.open("file", "wb") { |file| file.truncate(size) }
Size being the total size of the file.
Afterwards I receive data of size XX which fits into location Y of that file. Keep in mind this portion of the process is forked. Each fork has its own unique socket, opens its own unique file handle, writes to the file, then closes it, like so:
data = socket.read(256)
File.open("file", "wb") do |output|
output.seek(location * 256, IO::SEEK_SET)
output.write(data)
end
This should in turn allow the forked processes to open a file handle, seek to the correct location (If location is 2 and data_size is 256, then the write location is 512 -> 768) and write the chunk of data that it received.
What this is actually doing, though, is beyond my comprehension. I monitor the file's size as it is being populated, and it bounces around between different sizes, which should not be happening.
When analyzing the file with a hex editor, the area at the top where the file's data header should be is filled with null bytes (likewise about 1/4 of the file). However, if I limit the forked processes to writing only one chunk each and then exiting, the writes are fine and land at their proper locations.
I have done some other testing, such as dumping the part locations and the start locations of the data, and my equation for seeking to the correct location in the file seems to be correct as well.
Is there something I am missing here or is there another way to have multiple threads/processes open a file handle to a file, seek to a specific location, and then write a chunk of data?
I have also attempted to use FLOCK on the file, and it yields the same results, likewise with using the main process instead of forking.
I have tested the same application, but rather than opening/closing the file handle each time I need to write data in rapid succession (transferring close to 70 MB/s), I created one file handle per forked process and kept it open. This fixed the problem, resulting in a 1:1 duplicate of the file with matching checksums.
So the question is, why is opening/writing/closing file handles to a file in rapid succession causing this behavior?
It's your file mode.
File.open("file", "wb")
"wb" means "upon opening, truncate the file to zero length".
I suggest "r+b", which means "reading and writing, no truncation". Read more about available modes here: http://ruby-doc.org/core-2.2.2/IO.html#method-c-new
By the way, the "b" in those modes means "binary" (as opposed to the default "t" for text).

MPI file view and IO

I am writing an MPI program that needs to read a part of a file into memory, one piece at a time, with each piece going to an available process. I am therefore using a shared file pointer. The first part of the file is a header which I want to read and distribute to all processes; I have managed to do this by reading it on the master process and broadcasting it to all the others.
The next part of the file is a long (in theory up to several gigabytes) array of float triples. I want to set the fileview for all the processes so that it starts at the beginning of this array, and each process should be able to see the whole array. Furthermore, and this is my real problem, I do not want the processes to see beyond this array, so that after they encounter the last set of 3 floats they report EOF. So in practice each process just sees one long 3-float array and nothing else.
After the header has been read this is my code:
MPI_Datatype particle_type;
MPI_Type_contiguous(3,MPI_FLOAT,&particle_type);
MPI_Type_commit(&particle_type);
MPI_Offset cur_file_pos;
MPI_File_get_position_shared(fh,&cur_file_pos);
MPI_File_set_view(fh, cur_file_pos, particle_type, particle_type, (char *) "native", MPI_INFO_NULL); /* fh is the file-handle from MPI_File_open */
As I understand, this simply skips the header, but the file view does not stop after the array, it continues into the next part of the file which I am not interested in. Can anyone help me with this simple problem? I have not been able to find any thorough explanations (with examples) of file views anywhere.
Unfortunately, MPI_File_set_view won't do this for you; once you go beyond the filetype the filetype repeats. While MPI_File_set_view will allow you to partition the view of the file between processes, it won't let you "truncate" the view of the file like this.
If you're using the shared file pointer, presumably the simplest thing to do is to loop until the new position == number of particles (once the view is set, the file pointer is in units of etypes).
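A sketch of that loop, assuming the particle count was read from the header into a variable called num_particles (the name is made up; fh and particle_type are as set up in the question):

float buf[3];
MPI_Status status;
MPI_Offset pos;
int nread;

for (;;) {
    MPI_File_get_position_shared(fh, &pos);     /* in etype units, i.e. whole particles */
    if (pos >= num_particles)
        break;                                  /* logical end of the array */
    MPI_File_read_shared(fh, buf, 1, particle_type, &status);
    MPI_Get_count(&status, MPI_FLOAT, &nread);
    if (nread < 3)
        break;                                  /* short read: array ends at physical EOF */
    /* ... process the particle (3 floats) in buf ... */
}

One caveat: because the position check and the read are two separate calls, two ranks can both pass the test before either of them reads, so in the worst case a rank may consume one record past the logical end. If the data that follows the array must never be touched, partitioning the array among ranks explicitly and using MPI_File_read_at is the safer route.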

Is appending to a file atomic with Windows/NTFS?

If I'm writing a simple text log file from multiple processes, can they overwrite/corrupt each other's entries?
(Basically, this question Is file append atomic in UNIX? but for Windows/NTFS.)
You can get atomic append on local files. Open the file with FILE_APPEND_DATA access (documented in the WDK). When you omit FILE_WRITE_DATA access, all writes ignore the current file pointer and are done at end of file. Alternatively, you may use FILE_WRITE_DATA access and, for append writes, specify it in the OVERLAPPED structure (Offset = FILE_WRITE_TO_END_OF_FILE and OffsetHigh = -1; also documented in the WDK).
Append behavior is properly synchronized between writes through different handles. I use this regularly for logging from multiple processes. I write a BOM at offset 0 on every open, and all other writes are appended. Timestamps are not a problem; they can be sorted when needed.
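A minimal sketch of that handle setup (append_line and the file name are placeholders, the BOM detail from above is omitted, and error handling is reduced to the bare minimum):

#include <windows.h>

/* Sketch: append one whole record. Every such write goes to end-of-file atomically
   because the handle has FILE_APPEND_DATA but not FILE_WRITE_DATA. */
void append_line(const char *path, const char *line, DWORD len)
{
    HANDLE h = CreateFileA(path,
                           FILE_APPEND_DATA,                    /* append-only access */
                           FILE_SHARE_READ | FILE_SHARE_WRITE,  /* other writers allowed */
                           NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return;
    DWORD written;
    WriteFile(h, line, len, &written, NULL);   /* file pointer is ignored; data lands at EOF */
    CloseHandle(h);
}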
Even if append is atomic (which I don't believe it is), it may not give you the results you want. For example, assuming a log includes a timestamp, it seems reasonable to expect more recent logs to be appended after older logs. With concurrency, this guarantee doesn't hold - if multiple processes are waiting to write to the same file, any one of them might get the write lock - not just the oldest one waiting. Thus, logs can be written out of sequence.
If this is not desirable behaviour, you can avoid it by publishing log entries from all processes to a shared queue, such as a named pipe. You then have a single process that reads from this queue and writes to the log file. This avoids the concurrency issues, ensures that logs are written in order, and works even when file appends are not atomic, since the file is only written to directly by one process.
From this MSDN page on creating and opening Files:
An application also uses CreateFile to specify whether it wants to share the file for reading, writing, both, or neither. This is known as the sharing mode. An open file that is not shared (dwShareMode set to zero) cannot be opened again, either by the application that opened it or by another application, until its handle has been closed. This is also referred to as exclusive access.
and:
If you specify an access or sharing mode that conflicts with the modes specified in the previous call, CreateFile fails.
So if you use CreateFile rather than, say, File.Open (which doesn't offer the same level of control over file access), you should be able to open the file in such a way that it can't be corrupted by other processes.
You'll obviously have to add code to your processes to cope with the case where they can't get exclusive access to the log file.
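For example, a sketch of such an exclusive open (open_log_exclusive and the file name are placeholders; the retry policy is up to you):

#include <windows.h>

/* Sketch: exclusive open. A second process's CreateFile on the same file
   fails with ERROR_SHARING_VIOLATION until this handle is closed. */
HANDLE open_log_exclusive(const char *path)
{
    HANDLE h = CreateFileA(path,
                           GENERIC_WRITE,
                           0,                   /* dwShareMode = 0: no sharing */
                           NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE && GetLastError() == ERROR_SHARING_VIOLATION) {
        /* another process has the file open: back off and retry, or queue the entry */
    }
    return h;   /* INVALID_HANDLE_VALUE on failure */
}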
No, it isn't. If you need this, there is Transactional NTFS in Windows Vista/7.

Resources