What is a good algorithm for saving a file periodically and ensuring there's a backup file as well

One of our processes (the Writer) creates a file routinely. Other processes (let's call them Readers) read this file at different times, which raises the following questions.
a) While the Writer is writing to the file, how do we make sure that a Reader (independent of the Writer) doesn't read an incomplete version of the same file?
b) Should we create a backup file (file.bin~)? What happens at the instant we rename the old file (file.bin) to the backup file (file.bin~) and create the new file (file.bin)?
This is somewhat similar to a backup program that backs up a file while someone is saving it with an editor.
SUMMARY: while all the file saving and backup file creation is going on, how do you ensure that the Reader program never gets an incomplete file (put another way, how do you ensure that the Reader program always gets a complete file)?
Thank you,

Find a way to let the readers know that the writer is writing, so they can stop reading. (Or alternatively make the writer wait for the readers to finish)
Make the writer write the data to a new file (file_new.bin)
Copy/move the old file (file.bin) to a new location (file_old.bin)
Replace the old file (file.bin) with the new file (file_new.bin)
You might consider keeping the old version (file_old.bin) as a simple backup, sending it to a separate folder (or server) with a datestamp for a more advanced backup, or simply deleting it to save storage
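The write-new, keep-old, replace steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function name save_with_backup and the ".new" suffix are made up here, the backup uses the file.bin~ name from the question, and it relies on os.replace swapping the file in as a single rename, which holds when the temporary file is on the same filesystem as the target.

```python
import os
import shutil
import tempfile


def save_with_backup(path, data, keep_backup=True):
    """Write data to path so a reader sees either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new version to a temporary file in the same directory,
    # so the final rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".new")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # Copy (rather than move) the old version aside, so the main file
        # never disappears from a reader's point of view.
        if keep_backup and os.path.exists(path):
            shutil.copy2(path, path + "~")
        # Swap the new file into place with a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the old file is copied aside instead of renamed away, there is no instant at which file.bin is missing; whether readers still need to be told that a write is in progress depends on how they open the file.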

Related

how to know a file's create time by others

My Dear Friends,
I have a question which has puzzled me for quite a long time. It is about the creation time of a file. Someone creates a file on his PC, and the file has a creation time associated with it.
Then, if he copies this file to other folders or sends it to others by email, the creation time changes. So this creation time does not mean the time the file was initially created by that person, but the time the file was moved to the folder.
Here comes the question: how can I know the correct initial creation time of the file (ideally independent of the system)?
Thanks so much for your reply.
There is no general way to do this. The create time for a file is stored on the filesystem or in an archive (ZIP files store the last modification date and time only, for example).
Sometimes, but not always, a file's creation and modification times are updated when it is copied to another filesystem, device, or archive. This behavior depends on the tool used to do the copying. If the original date/time are not preserved during the copy, then that information is lost.
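To illustrate how the copying tool decides what survives, here is a small Python sketch comparing shutil.copy (content and permission bits only) with shutil.copy2 (which also tries to preserve timestamps). It looks at the modification time, because that is the timestamp the standard library can set portably; the helper name copy_and_compare and the file names are made up for the example.

```python
import os
import shutil
import tempfile


def copy_and_compare(src):
    """Copy src two ways; return (original, plain, preserved) modification times."""
    folder = tempfile.mkdtemp()
    # shutil.copy copies the data and permission bits, not the timestamps.
    plain = shutil.copy(src, os.path.join(folder, "plain"))
    # shutil.copy2 additionally attempts to carry over file metadata.
    preserved = shutil.copy2(src, os.path.join(folder, "preserved"))
    return (
        os.stat(src).st_mtime,
        os.stat(plain).st_mtime,
        os.stat(preserved).st_mtime,
    )
```

With the plain copy the original timestamp is simply gone from the destination, which is the "information is lost" case described above.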

How to reliably overwrite a file in Windows

I want to overwrite the content of a file atomically. I need this to maintain integrity when overwriting a config file, so an update should either pass or fail and never leave the file half-written or corrupted.
I went through multiple iterations to solve this problem; here is my current solution.
Steps to overwrite file "foo.config":
1) Enter a global mutex (unique per file name)
2) Write the new content in "foo.config.tmp"
3) Call FlushFileBuffers on the file handle before closing the file to flush the OS file buffers
4) Call ReplaceFile which will internally
   a) rename "foo.config" to "foo.config.bak"
   b) rename "foo.config.tmp" to "foo.config"
5) Delete "foo.config.bak"
6) Release the global mutex
I thought this solution was robust, but the dreaded issue occurred again in production after a power failure. The config file was found corrupted, filled with NUL characters, and neither the .tmp nor the .bak file existed.
My theory is that the original file content was zeroed out when deleting "foo.config.bak", but the filesystem metadata update caused by the ReplaceFile call was not flushed to disk. So after reboot, "foo.config" points to the original file content, which has been zeroed out. Is that even possible, since ReplaceFile is called before DeleteFile?
The file was stored on an SSD (SanDisk X110).
Do you see a flaw in my file overwrite procedure? Could it be a hardware failure in the SSD? Do you have an idea how to guarantee the atomicity of the file overwrite even in case of power failure? Ideally I'd like to delete the tmp and bak files after the overwrite.
Thanks,
Use MoveFileEx with the MOVEFILE_WRITE_THROUGH flag when renaming the file. This should tell Windows to complete the move on disk right away rather than caching it.
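As a rough sketch of that suggestion, the following Python code writes the temporary file, flushes it, and then calls MoveFileExW through ctypes with MOVEFILE_REPLACE_EXISTING and MOVEFILE_WRITE_THROUGH. The function names are invented for the example, and the non-Windows branch is purely an assumption added so the sketch can run elsewhere. It has not been verified against the power-failure scenario from the question.

```python
import ctypes
import os

MOVEFILE_REPLACE_EXISTING = 0x1
MOVEFILE_WRITE_THROUGH = 0x8


def replace_write_through(tmp_path, final_path):
    """Rename tmp_path over final_path, asking Windows to flush the move."""
    if os.name == "nt":
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.MoveFileExW.argtypes = [
            ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint32]
        kernel32.MoveFileExW.restype = ctypes.c_int
        flags = MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH
        if not kernel32.MoveFileExW(tmp_path, final_path, flags):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        # Assumption: outside Windows, fall back to a plain rename.
        os.replace(tmp_path, final_path)


def overwrite_config(final_path, content):
    """Write content to a .tmp sibling, flush it, then move it into place."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as tmp:
        tmp.write(content)
        tmp.flush()
        os.fsync(tmp.fileno())  # flush OS buffers for the file data
    replace_write_through(tmp_path, final_path)
```

The global mutex from the question is left out here; it would wrap the call to overwrite_config.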

How to process an open file using MapReduce framework

I have a file that gets aggregated and written into HDFS. This file stays open for an hour before it is closed. Is it possible to process this file with the MapReduce framework while it is open? I tried it, but it's not picking up all the appended data. I can query the data in HDFS and it is available, but not when it is read by MapReduce. Is there any way I could force MapReduce to read an open file? Perhaps by customizing the FileInputFormat class?
You can read what was physically flushed. Since close() makes the final flush of the data, your reads may miss some of the most recent data regardless of how you access it (MapReduce or command line).
As a solution I would recommend periodically closing the current file and then opening a new one (with some incremented index suffix). You can run your MapReduce job on multiple files. You would still end up with some data missing in the most recent file, but at least you can control it through the frequency of your file "rotation".
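The rotation idea can be sketched like this. Note the assumptions: the example writes to the local filesystem rather than HDFS, it rotates after a fixed number of records instead of on a timer so that its behaviour is deterministic, and the class name RotatingWriter and the five-digit suffix are invented for the illustration.

```python
import os


class RotatingWriter:
    """Append records, closing the current file after a fixed number of them."""

    def __init__(self, directory, prefix, records_per_file):
        self.directory = directory
        self.prefix = prefix
        self.records_per_file = records_per_file
        self.index = 0           # incremented suffix for the next file
        self.count = 0           # records written to the current file
        self.current = None      # open file object, or None
        self.current_path = None
        self.closed_paths = []   # files that are complete and safe to process

    def write(self, record):
        if self.current is None:
            name = "%s.%05d" % (self.prefix, self.index)
            self.current_path = os.path.join(self.directory, name)
            self.current = open(self.current_path, "w")
            self.index += 1
            self.count = 0
        self.current.write(record + "\n")
        self.count += 1
        if self.count >= self.records_per_file:
            self.rotate()

    def rotate(self):
        """Close the current file so that all of its data becomes visible."""
        if self.current is not None:
            self.current.close()
            self.closed_paths.append(self.current_path)
            self.current = None
```

A job would then take closed_paths as its input and ignore the file that is still open.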

File Unlocking and Deleting as single operation

Please note this is not a duplicate of File r/w locking and unlink. (The difference is the platform. File operations like locking and deletion have totally different semantics, so the solution would be different.)
I have the following problem. I want to create a file-system-based session storage where each session's data is stored in a simple file named with the session id.
I want the following API: write(sid,data,timeout), read(sid,data,timeout), remove(sid)
where sid==file name. I also want to have some kind of GC that may remove all timed-out sessions.
This is quite a simple task if you work with a single process, but absolutely not trivial when working with multiple processes or even over shared folders.
The simplest solution I thought about was:
write/read:
    handle=CreateFile
    LockFile(handle)
    read/write data
    UnlockFile(handle)
    CloseHandle(handle)
GC (for each file in directory):
    handle=CreateFile
    LockFile(handle)
    check if timeout occurred
    DeleteFile
    UnlockFile(handle)
    CloseHandle(handle)
But AFAIK I can't call DeleteFile on an opened, locked file (unlike in Unix, where file locking is not mandatory and unlink is allowed for opened files).
But if I put DeleteFile outside of the locked section, a bad scenario may happen:
GC - CreateFile/LockFile/Unlock/CloseHandle,
write - CreateFile/LockFile/WriteUpdatedData/Unlock/CloseHandle
GC - DeleteFile
Does anybody have an idea how such an issue may be solved? Are there any tricks that allow combining file locking and file removal, or making an operation on a file atomic (Win32)?
Notes:
I don't want to use Database,
I look for a solution for Win32 API for NT 5.01 and above
Thanks.
I don't really understand how this is supposed to work. However, deleting a file that's opened by another process is possible. The process that creates the file has to use the FILE_SHARE_DELETE flag for the dwShareMode argument of CreateFile(). A subsequent DeleteFile() call will succeed. The file doesn't actually get removed from the file system until the last handle on it is closed.
You currently have data in the record that allows the GC to determine if the record is timed out. How about extending that housekeeping info with a "TooLateWeAlreadyTimedItOut" flag?
GC sets TooLateWeAlreadyTimedItOut = true
Release lock
<== writer comes in here, sees the "TooLate" flag and so does not write
GC deletes
In other words, we're using a kind of optimistic locking approach. This does require some additional complexity in the Writer, but now you're not dependent upon any OS-specific wrinkles.
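A minimal sketch of the flag protocol above, under clearly artificial assumptions: the session record is a small JSON file, a single in-process threading.Lock stands in for LockFile/UnlockFile on the session file, and the names write_session, gc_mark, gc_delete and the "too_late" key are all invented for the illustration.

```python
import json
import os
import threading

# Stand-in for LockFile/UnlockFile on the session file (assumption).
_lock = threading.Lock()


def write_session(path, data, timeout_at):
    """Return True if written, False if the GC has already flagged the file."""
    with _lock:
        if os.path.exists(path):
            with open(path) as f:
                record = json.load(f)
            if record.get("too_late"):
                return False  # writer sees the flag and does not write
        with open(path, "w") as f:
            json.dump({"data": data, "timeout_at": timeout_at,
                       "too_late": False}, f)
        return True


def gc_mark(path, now):
    """Phase 1, under the lock: flag a timed-out session. True if flagged."""
    with _lock:
        with open(path) as f:
            record = json.load(f)
        if record["timeout_at"] > now:
            return False
        record["too_late"] = True
        with open(path, "w") as f:
            json.dump(record, f)
        return True


def gc_delete(path):
    """Phase 2, after the lock is released: remove the flagged file."""
    os.remove(path)
```

A writer that arrives between gc_mark and gc_delete gets False back and has to handle it the same way it would handle a missing file.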
I'm not clear what happens in the case:
GC checks timeout
GC deletes
Writer attempts write, and finds no file ...
Whatever you have planned for this case can also be used in the "TooLate" case
Edited to add:
You have said that it's valid for this sequence to occur:
GC Deletes
(Very slightly later) Writer attempts a write, sees no file, creates a new one
The writer can treat the "tooLate" flag as identical to this case. It just creates a new file with a different name, using a version number as a trailing part of its name. Opening a session file the first time requires a directory search, but then you can stash the latest name in the session.
This does assume that there can only be one Writer thread for a given session, or that we can mediate between two Writer threads creating the file, but that must be true for your simple GC/Writer case to work.
For Windows, you can use the FILE_FLAG_DELETE_ON_CLOSE option to CreateFile, which will cause the file to be deleted when you close the handle. But I'm not sure that this satisfies your semantics (because I don't believe you can clear the delete-on-close attribute).
Here's another thought. What about renaming the file before you delete it? You simply can't close the window where the write comes in after you decided to delete the file but what if you rename the file before deleting it? Then when the write comes in it'll see that the session file doesn't exist and recreate it.
The key thing to keep in mind is that you simply can't close the window in question. IMHO there are two solutions:
Adding a flag like djna mentioned or
Require that a per-session named mutex be acquired which has the unfortunate side effect of serializing writes on the session.
What is the downside of having a TooLate flag? In other words, what goes wrong if you delete the file prematurely? After all your system has to deal with the file not being present...

Should I write a temp file to a temp dir? or write a temp file to the final directory?

When an application saves a file, a typical model is to save the file to a temporary location, then move the temporary file to the final location. In some cases that "move" becomes "replace". In pseudo code:
Save temp file;
if final file exists
    delete final file;
move temp file to final filename;
There's a window in there where the delete might succeed, but the move may not, so you can handle that by something like :
Save temp file;
if final file exists
    move final file to parking lot
move temp file to final filename;
if move succeeded
    delete previous final file.
else
    restore previous final file.
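For illustration, the "parking lot" pseudo code could look like this in Python. The function name and the ".parked" suffix are made up for the sketch, and the parked file is kept in the same directory as the final file so that every step is a rename within one filesystem.

```python
import os


def replace_with_parking(temp_path, final_path):
    """Move temp_path to final_path, restoring the old file on failure."""
    parked = final_path + ".parked"  # hypothetical parking-lot name
    had_previous = os.path.exists(final_path)
    if had_previous:
        # Move the current final file to the parking lot.
        os.replace(final_path, parked)
    try:
        # Move the temp file to the final filename.
        os.rename(temp_path, final_path)
    except OSError:
        # The move failed: restore the previous final file.
        if had_previous:
            os.replace(parked, final_path)
        raise
    if had_previous:
        # The move succeeded: delete the previous final file.
        os.remove(parked)
```

There is still a short window between the two renames in which the final filename does not exist, so readers have to tolerate that.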
Now to my questions:
Is it preferred to save the temporary file to a temporary directory and then move it, as opposed to saving the temporary file to the final directory? (If so, why?)
Is there a difference in attributes and permissions on a file that is first saved to a temp dir and then moved to the final file in a different directory, as compared to a file that is saved to a temp file in the final directory and then renamed within the directory?
If the answers to both are YES, then how can I do the preferred thing while getting the appropriate ACL on a file which was first saved to a temporary directory and then moved to a final directory?
Create a temp file in the temp folder if it is just a temporary file. Otherwise, create it in its final destination.
Caveats:
1) This may not work if the final destination is a 'pickup' folder (unless the 'pickup' process checks for locked files (which it should))
2) The final destination has special permissions that have to be created in code and applied before being able to move to the final destination.
Microsoft Word saves a temp file to the original directory starting with a tilde (~). I would just follow that convention.
If these are temp files that turn into permanent files, create them in the same location to avoid any risk of having to "move" files across disks/partitions, which will result in more I/O (as a copy followed by a delete).
If these are temp files that are truly temporary, create (and leave them) in the temp dir.
A reason why you might want to never write a file to one directory and move it to another is that those directories might be on different filesystems. Although this is less often a problem on Windows, it is still reasonably possible so long as the parent filesystem is NTFS. On Unix, it is standard practice for /tmp to be a different filesystem.
The reason this could be a problem is that the file then has to be copied from one place to another. This significantly impacts performance for files of substantial size, and will certainly require many more seeks, even if the file is small. Additionally, there are many more ways for this to fail when moving a file across filesystem boundaries. Of course, access permissions could be different, but the target filesystem could also be full, or you could hit any number of other complications that you are now deferring until much later.
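One way to check for this situation up front is to compare the device identifiers of the two directories. The sketch below is a best-effort illustration: the function names are invented, the "use the system temp dir only if it is on the same filesystem" policy is just one possible choice, and st_dev is assumed to identify the volume on the platform in use.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """Best-effort check that two existing paths live on the same volume."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def pick_temp_dir(final_dir):
    """Choose a temp directory from which a move to final_dir is a rename."""
    system_tmp = tempfile.gettempdir()
    if same_filesystem(system_tmp, final_dir):
        return system_tmp
    # Different filesystems: a move would become copy + delete,
    # so create the temp file next to its final destination instead.
    return final_dir
```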
It is preferable to create a temp file using the GetTempFile routines because this creates temp files in predefined locations (e.g. C:\temp) that utilities can clean up if your app crashes or leaves corrupt files there. If the same thing happens in your final directory, it is unrecoverable.
Yes, attributes could be different if the target file's attributes or ACL has been edited. This could happen even if you create the temp file in the same folder.
You fix this by using the File.Replace routine, which performs an atomic replacement of one file with another, replacing the new file's attributes and ACLs with the old file's.
A C# method that does this is an answer to Safe stream update of file.
I prefer saving the temporary file to the final directory:
It avoids the potential permission problems that you've described.
The final directory might be on a different volume, in which case the move (of the temporary to the final file) is really a copy + delete -- which incurs a lot of overhead if you do it often or if the file is big.
You can always rename the existing file to a second temporary file, rename the new temporary file to the existing file's name, and rollback on error. That seems to me to be the safest combination.
EDITED: I see that your "parking lot" already described my suggestion, so I'm not sure I've added much here.
1. Yes, it is preferred to save to a temporary file first,
because the final file will then never be in a corrupt state should the creation of the file fail for any reason. If you write directly to the final file and your program crashes midway, it will leave the final file in an invalid state.
2. Yes.
The "inherited" attributes and permissions will, of course, be different. But temporary directories on most systems are usually pre-configured for all applications to use. The "final file" directory might, however, need to be configured: the "Program Files" folder under Vista UAC, for example.
3. Copy the ACL from the final file to the temp file prior to replacing?
By default Android places .tmp as the suffix when the suffix param is set to null in File.createTempFile(). I would suggest you just use that.
File file = File.createTempFile(imageFileName, null, storageDir);
You should call file.delete() yourself as soon as you're done with your .tmp file in your app. You shouldn't depend on file.deleteOnExit(), since there's absolutely no guarantee it'll be honored by the Android system/VM.
Why not make it user configurable? Some users don't like temp files polluting their current directory.

Resources