Checksum File Comparison Tool - Windows

So I am looking for a tool that can compare files in folders based on checksums (this is common and not hard to find); however, my use case is that the files can sit in fairly deep folder paths that change over time. I am expected to compare them every few months and ONLY create a package of the files that differ. I don't care what folders the files are in: the same file can move between folders regularly, and files rarely change names, only content (so checksums are a must).
My issue is that almost all of the tools I can find do care about the folder paths when they compare folders; I don't, and I actually want the tool to ignore the folder paths. I'd rather not develop anything, or at most develop only a small part of the process, to save time.
To be clear, the order in which I am looking for things to happen is:
Program scans directory from 1/1/2020 (A).
Program scans directory from 4/1/2020 (B).
Program finds all files whose checksums in B don't exist in A and makes a new folder (C) containing the differences.
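Roughly, here is what I mean as a minimal Python sketch (the folder names A, B and C and the choice of SHA-256 are just placeholders for the two dated scans and the output package; I'd rather not have to write even this much myself):

    import hashlib
    import shutil
    from pathlib import Path

    def content_hashes(root):
        """Return the set of SHA-256 digests of every file under root."""
        digests = set()
        for path in Path(root).rglob("*"):
            if path.is_file():
                digests.add(hashlib.sha256(path.read_bytes()).hexdigest())
        return digests

    baseline = content_hashes("A")        # scan from 1/1/2020
    current = Path("B")                   # scan from 4/1/2020
    output = Path("C")                    # package of changed or new files
    output.mkdir(exist_ok=True)

    for path in current.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest not in baseline:
                shutil.copy2(path, output / path.name)   # folder paths are ignored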
Any ideas?

Related

How do I put files back into different folders that contain copies with the same names?

I have files located in different folders. With the help of one program I took files from all of these folders, but left copies of the files behind.
Having worked with the files, I now want to put them back, but there is a problem: it would take a very long time to arrange the files back into their folders by hand.
So I'm looking for a way to arrange the files quickly.
Since the copies of the files have the same names, I would like to simply replace them by matching names.
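Something like this minimal Python sketch is the effect I'm after (the "edited" and "originals" folder names are placeholders, and it assumes the file names are unique enough to match on):

    import shutil
    from pathlib import Path

    edited_dir = Path("edited")          # placeholder: folder holding the worked-on files
    original_root = Path("originals")    # placeholder: root of the original folder tree

    # Index every file in the original tree by its name.
    by_name = {}
    for path in original_root.rglob("*"):
        if path.is_file():
            by_name.setdefault(path.name, []).append(path)

    # Overwrite each copy that has the same name as an edited file.
    for edited in edited_dir.iterdir():
        if edited.is_file():
            for target in by_name.get(edited.name, []):
                shutil.copy2(edited, target)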

How should I mark a folder as processed in a script?

A script shall process files in a folder on a Windows machine and mark it as done once it is finished in order to not pick it up in the next round of processing.
My tendency is to let the script rename the folder to a different name, like adding "_done".
But on Windows, renaming a folder is not possible if some process has the folder or a file within it open. In this setup, there is a minor chance that some user may have the folder open.
Alternatively I could just write a stamp-file into that folder.
Are there better alternatives?
Is there a way to force the renaming anyway, in particular when it is on a shared drive or some NAS drive?
You have several options:
Put a token file of some sort in each processed folder and skip the folders that contain said file
Keep track of the last folder processed and only process newer ones, either by time stamp or (since they're numbered sequentially) by sequence number
Rename the folder
Since you've already stated that other users may already have the folder/files open, we can rule out #3.
In this situation, I'm in favor of option #1 even though you'll end up with extra files: if someone needs to figure out which folders have already been processed, they have a quick, easy way of discerning that with the naked eye, rather than trying to find a counter somewhere in a different file. It's also a bit less code to write, so fewer pieces to break.
Option #2 is good in this situation as well (I've used both depending on the circumstances), but I tend to favor it for things that a human wouldn't really need to care about or need to look for very often.
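A minimal sketch of option #1 in Python might look like the following; the marker name ".processed", the share path, and the process_folder() stub are all placeholders:

    from pathlib import Path

    MARKER = ".processed"
    root = Path(r"\\server\share\incoming")   # placeholder share to scan

    def process_folder(folder):
        ...  # whatever the script actually does with the files in the folder

    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        if (folder / MARKER).exists():
            continue                      # already handled in a previous round
        process_folder(folder)
        (folder / MARKER).touch()         # mark as done for the next round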

Automate directory creation in Windows 7

I have been tasked with restructuring the directory of files relating to employees. As it is now, each employee has their own folder and all the files are grouped into 3 subfolders, divided by year. I'd like to sort the files in each of the folders into 4 other subfolders that are organized by subject matter. Is there any way to automate the creation of folders and transferring of files into these folders?
If this is not sufficient information about my issue, please say so and I will attempt to provide a more accurate explanation.
You could use PowerShell or any number of scripting languages/tools (Perl, Python). The trick may be knowing which target folder each of the files should go into. If you can determine that from the name of the file or the file type, it will be trivial; if there is some other criterion, it may be harder.
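For example, if the subject could be guessed from a keyword in the file name, a minimal Python sketch might look like this (the keyword-to-subfolder map, the folder layout, and the root path are entirely hypothetical):

    import shutil
    from pathlib import Path

    SUBJECTS = {                 # hypothetical mapping from filename keyword to subject folder
        "review": "Reviews",
        "payroll": "Payroll",
        "training": "Training",
        "contract": "Contracts",
    }

    root = Path(r"C:\Employees")          # hypothetical root: one folder per employee

    for year_folder in root.glob("*/*"):  # employee\year subfolders
        if not year_folder.is_dir():
            continue
        for file in sorted(year_folder.iterdir()):
            if not file.is_file():
                continue
            for keyword, subject in SUBJECTS.items():
                if keyword in file.name.lower():
                    target = year_folder / subject
                    target.mkdir(exist_ok=True)            # create the subject folder on demand
                    shutil.move(str(file), str(target / file.name))
                    break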

Strategies for backing up packages on Mac OS X

I am writing a program that synchronizes files across file systems, much like rsync, but I'm stuck when it comes to handling packages. These are folders that are identified by the system as containing a coherent set of files; Pages and Numbers can use packages rather than monolithic files, and applications are actually packages, for example. My problem is that I want to keep the most recent version and also keep a backup copy. As far as I can see I have two options:
I can just treat the whole thing as a regular folder and handle the contents entry by entry.
I can look at all the modification dates of all the contents and keep the complete folder tree for the one that has the most recently modified contents.
I was going for (2) and then I found that the iPhoto library is actually stored as a package and that would mean I would copy the whole library (10s, or even 100s of gigabytes) even if only one photograph was altered.
My worry with (1) is that handling the content files individually might break things. I haven't really come up with a good solution that will guarantee that the package will work and won't involve unnecessarily huge backup files in some cases. If it is just iPhoto then I can probably put in a special case, or perhaps change strategy if the package is bigger than some user-specified limit.
Packages are surprisingly mysterious, and what the system treats as a package does not seem to be just a matter of setting an extended attribute on a folder.
It depends on how you treat the "backup" version. Do you keep two versions of each file (the current and first previous), or two versions of the sync snapshot (i.e. if a file hasn't changed between the last two syncs, you only store one version)?
If it's two versions of the sync, packages shouldn't be a big problem -- just provide a way to restore the "backup" version, which if necessary splices together the changed files from the "backup" with the unchanged files from the current sync. There are some things to watch out for, though: make sure you correctly handle files that are deleted or added between the two snapshots.
If you're storing two versions of each file, things are much more complicated -- you need some way to record which versions of the files within the package "go together". I think in this case I'd be tempted to only store backup versions of files within the package from the last time something within the package changed. So, for example, say you sync a package called preso.key. On the second sync, preso.key/index.apxl.gz and preso.key/splash.png are modified, so the old version of those two files get stored in the backup. On the third sync, preso.key/index.apxl.gz is modified again, so you store a new backup version of it and remove the backup version of preso.key/splash.png.
BTW, another way to save space would be hard-linking. If you want to store two "full" versions of a big package without wasting space, just store one copy of each unchanged file and hard-link it into both backups.
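A minimal Python sketch of that hard-linking idea (all paths and the date-stamped snapshot layout are placeholders; preso.key is the same hypothetical package as above):

    import filecmp
    import os
    import shutil
    from pathlib import Path

    prev_snapshot = Path("backups/2020-01-01/preso.key")   # previous sync of the package
    new_snapshot = Path("backups/2020-02-01/preso.key")    # snapshot being written now
    source = Path("current/preso.key")                      # live copy of the package

    new_snapshot.mkdir(parents=True, exist_ok=True)

    for path in source.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(source)
        dest = new_snapshot / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        old = prev_snapshot / rel
        if old.is_file() and filecmp.cmp(path, old, shallow=False):
            os.link(old, dest)        # unchanged: both snapshots share one copy on disk
        else:
            shutil.copy2(path, dest)  # changed or new: store a fresh copy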

How many files is it advisable to have in a Windows folder (NTFS)?

We have a project that constitutes a large archive of image files...
We try to split them into sub-folders within the main archive folder.
Each sub-folder contains up to 2500 files in it.
For example:
C:\Archive
C:\Archive\Animals\
C:\Archive\Animals\001 - 2500 files...
C:\Archive\Animals\002 - 2300 files..
C:\Archive\Politics\
C:\Archive\Politics\001 - 2000 files...
C:\Archive\Politics\002 - 2100 files...
Etc. What would be the best way of storing files in such a structure under Windows, and why exactly, please?
Later on, the files have their EXIF metadata extracted and indexed for keywords, to be added into a Lucene index... (this is done by a Windows service that lives on the server)
We have an application where we try to make sure we don't store more than around 1,000 files in a directory. Under Windows at least, we noticed extreme degradation in performance above this number. A folder can theoretically hold up to 4,294,967,295 files in Windows 7. Note that because the OS does a scan of the folder, lookups and directory listings degrade very quickly as you add many more files. Once we got to 100,000 files in a folder it was almost completely unusable.
I'd recommend breaking the Animals folders down even further, perhaps by the first letter of the file name. Same with the other folders. This will let you spread things out more so you won't have to worry about directory performance. The best advice I can give is to perform some stress tests on your system to see where performance starts to tail off once you have enough files in a directory; just be aware you'll need several thousand files to test this out.
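Fanning files out by first letter could be scripted with something like this minimal Python sketch (the folder path is just one of the folders from your example):

    import shutil
    from pathlib import Path

    folder = Path(r"C:\Archive\Animals\001")   # placeholder: an over-full folder

    for file in sorted(folder.iterdir()):
        if not file.is_file():
            continue
        bucket = folder / file.name[0].upper()  # e.g. "aardvark.jpg" -> subfolder "A"
        bucket.mkdir(exist_ok=True)
        shutil.move(str(file), str(bucket / file.name))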
