I've got an edge case where two files have the same name but different contents and are written to the same tarball. This causes there to be two entries in the tarball. I'm wondering if there's anything I can do to make the tar overwrite the file if it already exists in the tarball as opposed to creating another file with the same name.
No way, as the first file has already been written by the time you ask to write the second one, and the stream position has advanced. Remember that tar files are accessed sequentially.
You should do deduplication before starting to write.
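A minimal sketch of deduplicating before writing, using Python's standard tarfile module. It assumes the entries are available in memory as (name, data) pairs and that the last entry for a name should win; the function name write_deduplicated_tar is made up for this example.

```python
import io
import tarfile


def write_deduplicated_tar(fileobj, entries):
    """Write (name, data) pairs to a tar stream, keeping the last entry per name."""
    # A dict holds one value per key, so a later duplicate replaces the earlier one.
    latest = {}
    for name, data in entries:
        latest[name] = data

    # Only now start writing: each name is appended to the archive exactly once.
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for name, data in latest.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


buffer = io.BytesIO()
write_deduplicated_tar(
    buffer,
    [("a.txt", b"old"), ("b.txt", b"other"), ("a.txt", b"new")],
)
```

If "first entry wins" is the desired policy instead, skipping names that are already in the dict would do it.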
I have files with different extensions, some are text files, others are zipped files or images. How can I programmatically add a checksum to the files?
For example, my idea was to add a checksum somewhere in the metadata of the files. I tried doing it with PowerShell, but the properties of the files are read-only. I don't want to create a separate file that contains the checksum of the files. I want the checksum itself to be included somewhere in the file itself or in its metadata.
On Windows, with an NTFS filesystem, you can use Alternate Data Streams.
They act exactly like files, but are hidden and attached to the main file, until the file is copied to a non-NTFS partition, at which point they are lost.
Otherwise, you can't just add a checksum to a file (even a short CRC32) without consequences: how would you be SURE that the last N bytes are your checksum, and not the file's data? You'd need to add a header (so even more bytes), and that can break loading of the file. Just think of a simple plain text file with N bytes of binary data appended at the end!
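As a rough sketch of the Alternate Data Streams idea in Python: on NTFS, the path "name:stream" addresses a named stream of the file, so an ordinary open() call can write the checksum there. The stream name "sha256" and the function names are arbitrary choices for this example. On a non-NTFS filesystem the same code would just create a separate file whose name contains a colon, so it only does what is intended on Windows with NTFS.

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 digest of the file's main data stream."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_checksum(path, stream="sha256"):
    """Store the file's checksum in a named stream (an ADS on NTFS)."""
    checksum = sha256_of(path)
    # "path:stream" is the NTFS syntax for an alternate data stream.
    with open(f"{path}:{stream}", "w", encoding="ascii") as f:
        f.write(checksum)
    return checksum


def verify_checksum(path, stream="sha256"):
    """Return True if the stored checksum matches the current file contents."""
    with open(f"{path}:{stream}", "r", encoding="ascii") as f:
        stored = f.read().strip()
    return stored == sha256_of(path)
```

Writing to the named stream leaves the main contents untouched, so the text file, zip, or image still opens normally.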
Background--we are trying to read different file types (csv or parquet) into pyspark, and I have the task of writing a program that will determine file type.
It appears that parquet files are always directories: a parquet file shows up in HDFS as a directory.
We have some csv files that are also directories, where the file name is the directory name and the directory contains several part files. What processes do this?
Why are some files plain files and others directories?
It will depend on what process produced those files. For example, when MapReduce produces output, it always produces a directory and then creates one output file per reducer within that directory. This is done so that each reducer can create its output independently.
Judging from Spark's CSV package, it expects to output to a single file. So perhaps the single-file CSVs are being generated by Spark and the directories by MapReduce.
To be as generic as possible, it may be a good idea to do the following: check if the file in question is a directory. If not, check the extension. If yes, look at the extension of the files inside of the directory. This should work for each of your situations.
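The check described above could look roughly like this in Python. This sketch uses os.path, so it only works on a local (or locally mounted) filesystem; against HDFS the same logic would have to go through an HDFS client instead. The fallback to the directory's own name when the part files carry no extension (e.g. part-00000) is an assumption added here, and the names KNOWN_FORMATS and detect_format are made up.

```python
import os

# Extensions this sketch knows about (assumed; extend as needed).
KNOWN_FORMATS = {".csv": "csv", ".parquet": "parquet"}


def format_from_name(name):
    """Map a file or directory name to a format by its last extension."""
    ext = os.path.splitext(name)[1].lower()
    return KNOWN_FORMATS.get(ext)


def detect_format(path):
    """Return "csv", "parquet", or None if the format cannot be determined."""
    path = os.path.normpath(path)
    if not os.path.isdir(path):
        # Plain file: the extension decides.
        return format_from_name(path)

    # Directory: look at the files inside, skipping markers such as
    # _SUCCESS and hidden files such as .part-00000.crc.
    for entry in sorted(os.listdir(path)):
        if entry.startswith(("_", ".")):
            continue
        fmt = format_from_name(entry)
        if fmt is not None:
            return fmt

    # Part files without an extension: fall back to the directory name.
    return format_from_name(path)
```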
Note that some input formats (e.g. MapReduce input formats) will only accept directories as inputs, while others (e.g. Spark's textFile) also accept single files and globs of files. You need to be aware of what is expected from the libraries you are interacting with.
All the data on your hard drive consists of files and folders. The basic difference between the two is that files store data, while folders store files and other folders.
Hadoop execution engines generally create a directory and write multiple part files as output, based on the number of reducers or executors used.
When you name an output file abc.csv, it doesn't mean that it's a single file with the data. It's just the output location, which MapReduce (generally) interprets as a new directory to be created, within which it creates the output files (part files).
In the case of Spark, when you are writing a file (maybe using .saveAsTextFile), it may create only a single file.
In my Go application, instead of writing to a file directly, I would like to write to a temporary file that is renamed into the final file when everything is done. This is to avoid leaving partially written content in the file if the application crashes.
Currently I use ioutil.TempFile, but the issue is that it creates the file with the 0600 permission, not 0666. Thus with typical umask values one gets the 0600 permission, not the expected 0644 or 0660. This is not a problem if the destination file already exists, as I can fix the permissions on the temporary file to match the existing ones, but if the file does not exist, then I need to somehow deduce the current umask.
I suppose I can just duplicate the ioutil.TempFile implementation to pass 0666 into os.OpenFile, but that does not sound nice. So the question is: is there a better way?
I don't quite grok your problem.
Temporary files must be created with permissions as tight as possible, because the whole idea of having them is to provide your application with a secure means of temporarily storing data which is too big to fit in memory (or to hand the generated file over to another process). (Note that on POSIX systems, where an opened file counts as a live reference to it, it's even customary to immediately remove the file while having it open so that there's no way to modify its data other than writing it from the process which created it.)
So in my opinion you're trying to use the wrong solution to your problem.
So what I do in a case like yours is:
Create a file with the same name as old one but with the ".temp" suffix appended.
Write data there.
Close, rename it over the old one.
If you feel like using a fixed suffix is lame, you can "steal" the implementation of picking a unique non-conflicting file name from ioutil.TempFile(). But IMO this would be overengineering.
You can use ioutil.TempDir to get the folder where temporary files should be stored and then create the file on your own with the right permissions.
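The three steps above can be sketched as follows. The question is about Go, but the idea is the same in any language; here it is in Python, creating the temporary file with mode 0666 so that the process umask decides the final permissions. The function name and the ".temp" default are just illustrations of the answer's suggestion.

```python
import os


def write_file_atomically(path, data, suffix=".temp"):
    """Write data to path via a sibling temp file and an atomic rename."""
    temp_path = path + suffix
    # os.open applies the process umask to 0o666, so a umask of 022 yields 0644.
    # O_EXCL makes creation fail if a leftover temp file already exists.
    fd = os.open(temp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        # Same directory, so this is a rename and not a copy.
        os.replace(temp_path, path)
    except BaseException:
        # Do not leave a partially written temp file behind.
        try:
            os.unlink(temp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the temporary file lives next to the destination, the rename never crosses a filesystem boundary.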
Suppose I have a folder with a few files (images, texts, whatever); it only matters that there are multiple files and the folder is rather large (> 100 MB). Now I want to update five files in this folder, but I want to do this atomically. Normally I would just create a temporary folder, write everything into it, and if that succeeds, replace the existing folder. But because I/O is expensive, I don't really want to go this way (resaving hundreds of files just to update five seems like a huge overhead). So how am I supposed to write these five files atomically? Note, I want the writing of all files to be atomic, not each file separately.
You could adapt your original solution:
Create a temporary folder full of hard links to the original files.
Save the five new files into the temporary folder.
Delete the original folder and move the folder of hard links in its place.
Creating a few links should be speedy, and it avoids rewriting all the files.
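A rough Python sketch of the hard-link approach, under a few assumptions: the folder is flat (subdirectories are ignored here), hard links are supported on the filesystem, and the ".new" and ".old" sibling names are free. Updated files are written as fresh files rather than through a link, since writing through a hard link would modify the original too. The final swap is two renames, so there is still a brief moment where the folder name does not exist.

```python
import os
import shutil


def update_folder_via_hard_links(folder, updates):
    """Replace folder with a copy in which the files in `updates` are rewritten.

    updates maps a file name to its new contents (bytes).
    """
    folder = os.path.normpath(folder)
    staging = folder + ".new"   # hypothetical sibling names
    retired = folder + ".old"

    os.mkdir(staging)
    # Hard-link every unchanged file; no file data is copied.
    for name in os.listdir(folder):
        source = os.path.join(folder, name)
        if name not in updates and os.path.isfile(source):
            os.link(source, os.path.join(staging, name))

    # Write the changed files as new files, leaving the originals untouched.
    for name, data in updates.items():
        with open(os.path.join(staging, name), "wb") as f:
            f.write(data)

    # Swap the folders, then drop the old one.
    os.rename(folder, retired)
    os.rename(staging, folder)
    shutil.rmtree(retired)
```

If anything fails before the swap, the original folder is still intact and the staging folder can simply be deleted.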
When an application saves a file, a typical model is to save the file to a temporary location, then move the temporary file to the final location. In some cases that "move" becomes "replace". In pseudo code:
Save temp file;
if final file exists
    delete final file;
move temp file to final filename;
There's a window in there where the delete might succeed but the move might not, so you can handle that with something like:
Save temp file;
if final file exists
    move final file to parking lot;
move temp file to final filename;
if move succeeded
    delete previous final file;
else
    restore previous final file;
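For illustration, the second pseudocode variant might be written like this in Python. It assumes the temp file has already been saved and that the parking path (a hypothetical name chosen by the caller) does not exist yet.

```python
import os


def replace_with_parking(temp_path, final_path, parking_path):
    """Move temp_path to final_path, restoring the old file if the move fails."""
    had_previous = os.path.exists(final_path)
    if had_previous:
        # Park the existing file instead of deleting it.
        os.rename(final_path, parking_path)
    try:
        os.rename(temp_path, final_path)
    except OSError:
        # Move failed: put the previous final file back.
        if had_previous:
            os.rename(parking_path, final_path)
        raise
    # Move succeeded: the parked copy is no longer needed.
    if had_previous:
        os.remove(parking_path)
```

When both paths are on the same filesystem, os.replace(temp_path, final_path) overwrites the destination in a single step and makes the parking lot unnecessary.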
Now to my questions:
Is it preferred to save the temporary file to a temporary directory, and then move it, as opposed to saving the temporary file to the final directory? (If so, why?)
Is there a difference in attributes and permissions on a file that is first saved to a temp dir, then moved to the final file in a different directory, as compared to a file that is saved to a temp file in the final directory, and then renamed within the directory?
If the answers to both are YES, then how can I do the preferred thing while getting the appropriate ACL on file which was first saved to a temporary directory and then moved to a final directory?
Create a temp file in the temp folder if it is just a temporary file. Otherwise, create it in its final destination.
Caveats:
1) This may not work if the final destination is a 'pickup' folder (unless the 'pickup' process checks for locked files (which it should))
2) The final destination has special permissions that have to be created in code and applied before being able to move to the final destination.
Microsoft Word saves a temp file to the original directory starting with a tilde (~). I would just follow that convention.
If these are temp files that turn into permanent files, create them in the same location to avoid any risk of having to "move" files across disks/partitions, which will result in more I/O (as a copy followed by a delete).
If these are temp files that are truly temporary, create (and leave them) in the temp dir.
A reason why you might never want to write a file to one directory and move it to another is that those directories might be on different filesystems. Although this is less often a problem on Windows, it is still reasonably possible so long as the parent filesystem is NTFS. On Unix, it is standard practice for /tmp to be a different filesystem.
The reason this could be a problem is that the file then has to be copied from one place to another. This significantly impacts performance for files of substantial size, and will certainly require many more seeks, even if the file is small. Additionally, there are many more ways for this to fail when moving a file across filesystem boundaries. Of course, access permissions could be different, but the target filesystem could also be full, or there could be any number of other complications that you are now deferring until much later.
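One way to check up front whether a move would cross a filesystem boundary is to compare device IDs. This is a small Python sketch that relies on os.stat().st_dev, which is well defined on POSIX systems; the function name is made up.

```python
import os
import tempfile


def on_same_filesystem(dir_a, dir_b):
    """Return True if both directories live on the same device."""
    return os.stat(dir_a).st_dev == os.stat(dir_b).st_dev


# Example: would moving a file from the temp dir into the current directory
# be a plain rename, or a copy followed by a delete?
cheap_move = on_same_filesystem(tempfile.gettempdir(), os.getcwd())
```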
It is preferable to create a temp file using the GetTempFile routines because this creates temp files in predefined locations (e.g. C:\temp) that utilities can clean up if your app crashes or leaves corrupt files there. If the same thing happens in your final directory, it is unrecoverable.
Yes, attributes could be different if the target file's attributes or ACL has been edited. This could happen even if you create the temp file in the same folder.
You fix this by using the File.Replace routine, which performs an atomic replacement of one file with another, replacing the new file's attributes and ACLs with the old file's.
A C# method that does this is an answer to Safe stream update of file.
I prefer saving the temporary file to the final directory:
It avoids the potential permission problems that you've described.
The final directory might be on a different volume, in which case the move (of the temporary to the final file) is really a copy + delete -- which incurs a lot of overhead if you do it often or if the file is big.
You can always rename the existing file to a second temporary file, rename the new temporary file to the existing file's name, and rollback on error. That seems to me to be the safest combination.
EDITED: I see that your "parking lot" already described my suggestion, so I'm not sure I've added much here.
1. Yes, it is preferred to save to a temporary file first
Because the final file will never be in a corrupt state should the creation of the file fail for any reason. If you write directly to the final file and your program crashes midway, it will leave the final file in an invalid state.
2. Yes
The "inherited" attributes and permissions will, of course, be different. But temporary directories on most systems usually are pre-configured for all applications to use. The "final file" directory might, however, need to be configured. Say the "Program Files" folder and Vista UAC, for example.
3. Copy ACL from the final file to the temp file prior to replacing?
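As a loose analogue of point 3 in Python: shutil.copymode copies the permission bits of the existing final file onto the temp file before the replace. Note the assumption here: this covers POSIX mode bits only, not Windows ACLs, which would need a platform-specific API. The function name is hypothetical.

```python
import os
import shutil


def replace_keeping_mode(temp_path, final_path):
    """Replace final_path with temp_path, preserving the old permission bits."""
    if os.path.exists(final_path):
        # Copy only the permission bits (not owner, timestamps, or ACLs).
        shutil.copymode(final_path, temp_path)
    os.replace(temp_path, final_path)
```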
By default Android places .tmp as the suffix when the suffix param is set to null in File.createTempFile(). I would suggest you just use that.
File file = File.createTempFile(imageFileName, null, storageDir);
You should call file.delete() yourself as soon as you're done with your .tmp file in your app. You shouldn't depend on file.deleteOnExit() since there's absolutely no guarantee it'll be used by the Android system/VM.
Why not make it user configurable? Some users don't like temp files polluting their current directory.