Strategies for backing up packages on Mac OS X

I am writing a program that synchronizes files across file systems, much like rsync, but I'm stuck when it comes to handling packages. These are folders that are identified by the system as containing a coherent set of files. Pages and Numbers can use packages rather than monolithic files, and applications are actually packages, for example. My problem is that I want to keep the most recent version and also keep a backup copy. As far as I can see I have two options:
(1) I can just treat the whole thing as a regular folder and handle the contents entry by entry.
(2) I can look at all the modification dates of all the contents and keep the complete folder tree for the one that has the most recently modified contents.
I was going for (2) and then I found that the iPhoto library is actually stored as a package and that would mean I would copy the whole library (10s, or even 100s of gigabytes) even if only one photograph was altered.
My worry with (1) is that handling the content files individually might break things. I haven't really come up with a good solution that will guarantee that the package will work and won't involve unnecessarily huge backup files in some cases. If it is just iPhoto then I can probably put in a special case, or perhaps change strategy if the package is bigger than some user-specified limit.
Packages are surprisingly mysterious, and what the system treats as a package does not seem to be just a matter of setting an extended attribute on a folder.

It depends on how you treat the "backup" version. Do you keep two versions of each file (the current and first previous), or two versions of the sync snapshot (i.e. if a file hasn't changed between the last two syncs, you only store one version)?
If it's two versions of the sync, packages shouldn't be a big problem -- just provide a way to restore the "backup" version, which if necessary splices together the changed files from the "backup" with the unchanged files from the current sync. There are some things to watch out for, though: make sure you correctly handle files that're deleted or added between the two snapshots.
If you're storing two versions of each file, things are much more complicated -- you need some way to record which versions of the files within the package "go together". I think in this case I'd be tempted to only store backup versions of files within the package from the last time something within the package changed. So, for example, say you sync a package called preso.key. On the second sync, preso.key/index.apxl.gz and preso.key/splash.png are modified, so the old versions of those two files get stored in the backup. On the third sync, preso.key/index.apxl.gz is modified again, so you store a new backup version of it and remove the backup version of preso.key/splash.png.
BTW, another way to save space would be hard-linking. If you want to store two "full" versions of a big package without wasting space, just store one copy of each unchanged file and hard-link it into both backups.
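The hard-linking idea could be sketched roughly as follows in Python. This is only an illustration under some assumptions: the function name snapshot_with_hardlinks and the three directory arguments are made up for the example, both snapshots live on the same volume (hard links cannot cross volumes), and "unchanged" is decided by a byte-for-byte comparison against the previous snapshot.

```python
import filecmp
import os
import shutil
from pathlib import Path


def snapshot_with_hardlinks(source: Path, previous: Path, new: Path) -> None:
    """Copy `source` into `new`, hard-linking files unchanged since `previous`.

    Hypothetical helper: `previous` is the last snapshot (it may not exist yet),
    and `new` must be on the same volume as `previous`.
    """
    new.mkdir(parents=True, exist_ok=True)
    for src_file in source.rglob("*"):
        rel = src_file.relative_to(source)
        target = new / rel
        if src_file.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        old = previous / rel
        # shallow=False compares file contents, not just size and mtime.
        if old.is_file() and filecmp.cmp(src_file, old, shallow=False):
            os.link(old, target)            # unchanged: share the stored data
        else:
            shutil.copy2(src_file, target)  # new or changed: store a fresh copy
```

Each snapshot then looks like a complete package on disk, while unchanged files occupy space only once. Deleting an old snapshot does not affect the newer one, because the data is freed only when its last link goes away.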

Related

Is there a way to limit my executable's ability to delete to only files it has created?

I'm on Windows writing a C++ executable that deletes and replaces some files in a directory it creates during an earlier run session. Maybe I'm a little panicky, but since my directory and file arguments for the deletions are generated by parsing an input file's path, I worry about the parse throwing out a much higher or different directory due to an oversight and systematically deleting unrelated files unintentionally.
Is there a way to limit my executable's reign to only include write/delete access to files it has created during earlier run sessions, while retaining read access to everything else? Or at least provide a little extra peace of mind that, even if I really mis-speak my strings to DeleteFileA() and RemoveDirectoryA() I'll avoid causing catastrophic damage?
It doesn't need to be a restriction to the entire executable, it's good enough if it limits the function calls to delete and remove in some way.

Checksum File Comparison Tool

So I am looking for a tool that can compare files in folders based on checksums (this is common, not hard to find); however, my use-case is that the files can exist in pretty deep folder paths that can change, I am expected to compare them every few months and ONLY create a package of the different files. I don't care what folders the files are in, the same file can move between folders regularly and files wouldn't change names much, only content (so checksums are a must).
My issue is that almost all of the tools I can find do care about the folder paths when they compare folders; I don't, and I actually want the tool to ignore the folder paths. I would rather not develop anything, or at least only have to develop a small part of the process, to save time.
To be clear, the order in which I want things to happen is:
Program scans directory from 1/1/2020 (A).
Program scans directory from 4/1/2020 (B).
Program finds all files in B whose checksums don't exist in A and makes a new folder with the differences (C).
Any ideas?
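If no ready-made tool turns up, the three steps above are small enough to sketch in Python. This is a rough illustration, not a recommendation of a specific tool: the names file_digest and collect_new_files are hypothetical, SHA-256 is an assumed choice of checksum, and prefixing the copied file names with part of the digest is just one way to avoid name clashes once folder paths are ignored.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def collect_new_files(dir_a: Path, dir_b: Path, dir_c: Path) -> list:
    """Copy into dir_c every file in dir_b whose content is absent from dir_a."""
    # Folder paths are deliberately ignored: only content checksums are kept.
    known = {file_digest(p) for p in dir_a.rglob("*") if p.is_file()}
    dir_c.mkdir(parents=True, exist_ok=True)
    copied = []
    for p in sorted(dir_b.rglob("*")):
        if not p.is_file():
            continue
        digest = file_digest(p)
        if digest in known:
            continue
        known.add(digest)  # identical content elsewhere in B is copied once
        dest = dir_c / f"{digest[:12]}_{p.name}"
        shutil.copy2(p, dest)
        copied.append(dest)
    return copied
```

A file that merely moved between folders produces the same digest and is skipped, while a file whose content changed is copied even if its name stayed the same.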

How should I mark a folder as processed in a script?

A script shall process files in a folder on a Windows machine and mark it as done once it is finished in order to not pick it up in the next round of processing.
My tendency is to let the script rename the folder to a different name, like adding "_done".
But on Windows, renaming a folder is not possible if some process has the folder or a file within it open. In this setup, there is a minor chance that some user may have the folder open.
Alternatively I could just write a stamp-file into that folder.
Are there better alternatives?
Is there a way to force the renaming anyway, in particular when it is on a shared drive or some NAS drive?
You have several options:
1. Put a token file of some sort in each processed folder and skip the folders that contain said file
2. Keep track of the last folder processed and only process ones newer (either by time stamp or, since they're numbered sequentially, by sequence number)
3. Rename the folder
Since you've already stated that other users may already have the folder/files open, we can rule out #3.
In this situation, I'm in favor of option #1, even though you'll end up with extra files. If someone needs to figure out which folders have already been processed, they have a quick, easy way of seeing that with the naked eye, rather than trying to find a counter somewhere in a different file. It's also a bit less code to write, so fewer pieces to break.
Option #2 is good in this situation as well (I've used both depending on the circumstances), but I tend to favor it for things that a human wouldn't really need to care about or need to look for very often.
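A minimal sketch of the token-file approach, written in Python purely for illustration. The marker name ".processed" and the helper names are assumptions, not anything from the question; the process argument stands in for whatever the real script does with a folder.

```python
from datetime import datetime, timezone
from pathlib import Path

MARKER = ".processed"  # hypothetical token file name


def unprocessed_folders(root: Path) -> list:
    """Return subfolders of root that do not yet contain the token file."""
    return sorted(
        p for p in root.iterdir() if p.is_dir() and not (p / MARKER).exists()
    )


def mark_processed(folder: Path) -> None:
    """Drop the token file, recording when the folder was finished."""
    (folder / MARKER).write_text(datetime.now(timezone.utc).isoformat())


def run(root: Path, process) -> list:
    """Process every unmarked folder and return the names that were handled."""
    done = []
    for folder in unprocessed_folders(root):
        process(folder)
        mark_processed(folder)  # written only after processing succeeds
        done.append(folder.name)
    return done
```

Because the token is written only after process returns, a folder whose processing raised an exception stays unmarked and is picked up again on the next round. Creating a new file inside the folder also works while another user has the folder open, which is the case where renaming fails.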

Flat or nested directory structure for an image cache?

My Mac app keeps a collection of objects (with Core Data), each of which has a cover image, and to which I assign a UUID upon creation. I had originally been storing the cover images as a field in my Core Data store, but recently started storing them on disk in the file system, instead.
Initially, I'm storing the covers in a flat directory, using the UUID to name the file, as below. This gives me O(1) fetching, as I know exactly where to look.
...
/.../Covers/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
...
I've looked at the way other applications handle this task, though, and noticed a multi-level scheme, as below (for instance). This could still be implemented in O(1) time.
...
/.../Covers/A/B/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/C/D/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
...
What might be the reason to do it this way? Does OS X limit the number of files in a directory? Is it in some way faster to retrieve them from disk? It would make the code used to calculate the file's name more complicated, so I want to find out if there is a good reason to do it that way.
On certain file systems (and I believe HFS+ too), having too many files in the same directory will cause performance issues.
I used to work at an ISP where they would break up the home directories (they had 90k+ of them) using a multi-directory scheme. You can partition your directories by using, say, the first two characters of the UUID, then the next two, e.g.:
/.../Covers/3B/72/3B723A52-C228-4C5F-A71C-3169EBA33677.jpg
/.../Covers/6B/EC/6BEC2FC4-B9DA-4E28-8A58-387BC6FF8E06.jpg
That way you don't need to calculate any extra characters or codes, just use the ones you have already to break it up. Since your UUIDs will be different every time, this should suffice.
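Deriving the nested path from the UUID is only a couple of string slices. Here is a small sketch in Python (the app itself would do the same thing in its own language); cover_path is a hypothetical name, and upper-casing the UUID is an assumption made so that the same UUID always maps to the same folder.

```python
from pathlib import Path


def cover_path(covers_dir: Path, uuid_str: str) -> Path:
    """Build <covers_dir>/<chars 1-2>/<chars 3-4>/<UUID>.jpg for a cover image."""
    name = uuid_str.upper()  # assumed normalization so lookups are consistent
    return covers_dir / name[0:2] / name[2:4] / f"{name}.jpg"


example = cover_path(Path("Covers"), "3B723A52-C228-4C5F-A71C-3169EBA33677")
print(example.as_posix())
```

The lookup is still a direct path computation, so fetching stays O(1); the only extra work is creating the two intermediate directories when a cover is first written.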
The main reason is that with the latter scheme, as you've mentioned, disk retrieval is faster because each directory is smaller (so the FS looks up the file in a smaller table).
As others mentioned, on some file systems it takes longer for the OS to open the file, because one directory with many files takes longer to read than a couple of short directories.
However, you should perform measurements on your particular file system and for your particular usage scenario. I did this for NTFS on Windows XP and was surprised to discover that the flat directory performed better than the hierarchical structure in all kinds of tests.

How to programmatically find the difference between two directories

First off; I am not necessarily looking for Delphi code, spit it out any way you want.
I've been searching around (especially here) and found a bit about people looking for ways to compare two directories (including subdirectories), though they were using byte-by-byte methods. Second off, I am not looking for a difftool, I am "just" looking for a way to find files which do not match and, just as important, files which are in one directory but not the other and vice versa.
To be more specific: I have one directory (the backup folder) which I constantly update using FindFirstChangeNotification. Though the first time I need to copy all files, and I also need to check the backup directory against the original when the application starts (in case something happened when the application wasn't running or FindFirstChangeNotification didn't catch a file change). To solve this I am thinking of creating a CRC list for the backed-up files, then running through the original directory computing the CRC for every file, and finally comparing the two CRC lists. Then I would somehow look for files which are in one directory and not the other (again, vice versa).
Here's the question: Is this the fastest way? If so, how would one (roughly) get the job done?
You don't necessarily need CRCs for each file, you can just compare the "last modified" date for every file for most normal purposes. It's WAY faster. If you need additional safety, you can also compare the lengths. You get both of these metrics for free with the find functions.
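To make the idea concrete, here is a rough sketch in Python rather than Delphi or Win32 (the question allows any language). The names index_tree and compare_trees are hypothetical, and truncating the modification time to whole seconds is an assumption meant to tolerate file systems that store timestamps with different precision.

```python
import os
from pathlib import Path


def index_tree(root: Path) -> dict:
    """Map each relative file path to (size, whole-second modification time)."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = Path(dirpath) / name
            st = full.stat()
            key = full.relative_to(root).as_posix()
            index[key] = (st.st_size, int(st.st_mtime))
    return index


def compare_trees(original: Path, backup: Path) -> tuple:
    """Return (only in original, only in backup, present in both but different)."""
    a, b = index_tree(original), index_tree(backup)
    only_original = sorted(a.keys() - b.keys())
    only_backup = sorted(b.keys() - a.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return only_original, only_backup, changed
```

The first two lists cover the "in one directory but not the other" case in both directions, and the third lists files whose size or timestamp differs. This only works if the copy step preserves modification times in the backup.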
And in your change notification, you should probably add the files to a queue and use a timer object to copy the new queued files every ~30sec or something, so you don't bog down the system with frequent updates/checks.
For additional speed, use the Win32 functions wherever possible, avoid any Delphi find/copy/getfileinfo functions. I'm not familiar with the Delphi framework but for example the C# stuff is WAY WAY WAY slower than the Win32 functions.
Regardless of your "not looking for a difftool", are you opposed to using Cygwin with its "diff" command for the shell? If you are open to this it's quite easy, particularly using diff with the -r "recursive" option.
The following generates the differences between 2 Rails installs on my machine, and greps out not only information about differences between files but also, specifically by grepping for 'Only', finds files in one directory, but not the other:
$ diff -r pgnindex pgnonrails | egrep '^Only|diff'
Only in pgnindex/app/controllers: openings_controller.rb
Only in pgnindex/app/helpers: openings_helper.rb
Only in pgnindex/app/views: openings
diff -r pgnindex/config/environment.rb pgnonrails/config/environment.rb
diff -r pgnindex/config/initializers/session_store.rb pgnonrails/config/initializers/session_store.rb
diff -r pgnindex/log/development.log pgnonrails/log/development.log
Only in pgnindex/test/functional: openings_controller_test.rb
Only in pgnindex/test/unit: helpers
The fastest way to compare one directory on the local machine to a directory on another machine thousands of miles away is exactly as you propose:
generate a CRC/checksum for every file
send the name, path, and CRC/checksum for each file over the internet to the other machine
compare
Perhaps the easiest way to do that is to use rsync with the "--dry-run" or "--list-only" option.
(Or use one of the many applications that use the rsync algorithm,
or compile the rsync algorithm into your application).
cd some_backup_directory
rsync -rv --dry-run myname@remote_host:latest_version_directory/ .
For speed, the default rsync assumes, as Blindy suggested, that two files with the same name and the same path and the same length and the same modification time are the same.
For extra safety, you can give rsync the "--checksum" option to ignore the length and modification time and force it to compare (the checksum of) the actual contents of the file.

Resources