I have a set of configuration files for a range of devices.
They are all different, but several parts of the files are common to many (but not all) of them.
I'd like to determine all these common sections, extract each into a single file which is then "included" in the original files, so that there's only one place to edit/update.
How should I approach this? I can use diff or similar to find common sections between each pair of files, but I may miss things where, say, one pair has a 10-line common section and another file shares only the first 9 lines of it. What I really want is to find common subsequences across as many files as possible.
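A minimal sketch of one possible starting point, assuming line-oriented config files in a hypothetical configs/ directory: record every run of N consecutive lines and count how many files contain it, so the most widely shared blocks become candidates for extraction into includes.

    # Count how many config files share each run of N consecutive lines.
    # "configs/*.conf" and the block length N are assumptions for illustration.
    from collections import defaultdict
    from pathlib import Path

    N = 5  # minimum block length considered worth extracting (assumed)

    def blocks(lines, n):
        for i in range(len(lines) - n + 1):
            yield tuple(lines[i:i + n])

    shared = defaultdict(set)  # block of lines -> names of files that contain it
    for f in Path("configs").glob("*.conf"):
        lines = f.read_text().splitlines()
        for b in blocks(lines, N):
            shared[b].add(f.name)

    # Report blocks present in more than one file, most widely shared first.
    for block, owners in sorted(shared.items(), key=lambda kv: -len(kv[1])):
        if len(owners) > 1:
            print(f"{len(owners)} files share:\n" + "\n".join(block) + "\n")

Overlapping blocks would still need to be merged into maximal runs, but the counts show where the biggest shared sections are.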
There are several other questions that are somewhat related to this, but don't seem to apply to this context.
Related
I'm trying to find the 'best' way to order two lists of files so that a diff patch between them is small in general.
The way to do this without any other 'heuristics' that may fail easily (natural name order, parsing index files like cues to figure out natural sequential orders) seems to be to analyze the bytes of the files in both collections and find a sequence that minimizes the 'distance' between them.
This reminds me of Levenshtein distance applied to segments of the bytes in the files (possibly with a constraint that segments of the same file stay in order, to minimize permutations). Is there a library around that can figure this out for me? Notice that it's likely for the header or footer of files that are 'technically the same' to be different (e.g. a different dump format).
My main use case is to figure out the distance between two kinds of cd dumps. It's pretty normal for a cd dump to be segmented in different ways. I could just figure out their 'natural' order from the index files (cue, ccd etc.), but why waste an opportunity to get something that applies generally (that works with extra files in the source or destination, with files segmented in different ways, or to compare things that aren't cd dumps)?
I'd prefer a library in Python, if you know of any.
BTW, I already have something implemented, zxd3, but it pretty much uses the 'natural order' heuristic; I'd like to improve it (and make it work on more than two zips).
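For reference, a minimal sketch of the kind of pairing I have in mind (not what zxd3 does): sample the first bytes of each file in both collections, score every cross-pair with difflib.SequenceMatcher, and greedily keep the most similar pairs. The directory names and sample size are just placeholders.

    # Greedy cross-collection matching by byte similarity.
    # "dump_a", "dump_b" and SAMPLE are assumptions for illustration.
    import difflib
    from pathlib import Path

    SAMPLE = 64 * 1024  # compare only the first 64 KiB of each file (assumed)

    def sample(path):
        return path.read_bytes()[:SAMPLE]

    src = {p: sample(p) for p in Path("dump_a").iterdir() if p.is_file()}
    dst = {p: sample(p) for p in Path("dump_b").iterdir() if p.is_file()}

    scores = []
    for a, da in src.items():
        for b, db in dst.items():
            ratio = difflib.SequenceMatcher(None, da, db).quick_ratio()
            scores.append((ratio, a, b))

    # Greedy matching: best-scoring pairs first, each file used at most once.
    used_a, used_b, pairs = set(), set(), []
    for ratio, a, b in sorted(scores, reverse=True):
        if a not in used_a and b not in used_b:
            pairs.append((a, b, ratio))
            used_a.add(a)
            used_b.add(b)

    for a, b, ratio in pairs:
        print(f"{a.name} <-> {b.name} (similarity {ratio:.2f})")

A real implementation would probably want rolling-hash chunks rather than raw file prefixes, but the greedy matching shape stays the same.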
I tried to search for duplicate files on my Mac via the command line.
This process took almost half an hour for 10 GB of data files, whereas apps like Gemini and CleanMyMac take far less time to find the files.
So my question is: how is this speed achieved in these apps, what is the concept behind it, and in which language is the code written?
I tried googling for information but did not get anything related to duplicate finders.
If you have any ideas, please share them here.
First of all, Gemini locates files of equal size, then it uses its own hash-like, type-dependent algorithm to compare file contents. That algorithm is not 100% accurate but is much quicker than classical hashes.
I contacted support, asking them what algorithm they use. Their response was that they compare parts of each file to each other, rather than the whole file or doing a hash. As a result, they can only check maybe 5% (or less) of each file that's reasonably similar in size to each other, and get a reasonably accurate result. Using this method, they don't have to pay the cost of comparing the whole file OR the cost of hashing files. They could be even more accurate, if they used this method for the initial comparison, and then did full comparisons among the potential matches.
Using this method, files that are minor variants of each other may be detected as identical. For example, I've had two songs (original mix and VIP mix) that counted as the same. I also had two images, one with a watermark and one without, listed as identical. In both these cases, the algorithm just happened to pick parts of the file that were identical across the two files.
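For anyone curious, here is a minimal sketch of that general strategy (not Gemini's actual code): group files by size, compare a few small chunks of each candidate, and only fully hash the files that still look identical. The root path, chunk size and chunk offsets are assumptions for illustration.

    # Size grouping, then partial fingerprints, then full hashes for survivors.
    import hashlib
    import os
    from collections import defaultdict
    from pathlib import Path

    def partial_fingerprint(path, chunk=4096):
        """Hash a few small chunks (start, middle, end) instead of the whole file."""
        size = path.stat().st_size
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for offset in (0, size // 2, max(size - chunk, 0)):
                f.seek(offset)
                h.update(f.read(chunk))
        return h.hexdigest()

    def full_hash(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        return h.hexdigest()

    # 1. Group by size -- files with different sizes can't be identical.
    by_size = defaultdict(list)
    for root, _, names in os.walk("/some/folder"):  # hypothetical root
        for name in names:
            p = Path(root) / name
            by_size[p.stat().st_size].append(p)

    # 2. Within each size group, group by partial fingerprint, then confirm
    #    the remaining candidates with a full hash.
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        candidates = defaultdict(list)
        for p in group:
            candidates[partial_fingerprint(p)].append(p)
        for paths in candidates.values():
            if len(paths) < 2:
                continue
            confirmed = defaultdict(list)
            for p in paths:
                confirmed[full_hash(p)].append(p)
            for dupes in confirmed.values():
                if len(dupes) > 1:
                    print("duplicates:", [str(d) for d in dupes])

Skipping the final full-hash step is what makes tools like this fast, and also what produces the false positives described above.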
My application needs to keep a large number of fairly small files (10-100k) that are usually accessed with some 'locality' in the filename's string expression.
E.g. if file_5_5 is accessed, files like file_4_5 or file_5_6 may also be accessed shortly afterwards.
I've seen that web browser file caches are often sorted in a tree-like fashion resembling the lexical order of the filename, which is a kind of hash. E.g. sadisadji would reside at s/a/d/i/s/sadisadji. I guess that is optimized for fast random access to any of these files.
Would such a tree structure be useful for my case too? Or does a flat folder keeping all files in one location do equally well?
A tree structure would be better, because many filesystems have trouble listing a single directory with 100,000 files or more in it.
One approach taken by the .mbtiles file format, which stores a large number of image files for use with maps, is to store all of the files in an SQLite database, circumventing the problems caused by having thousands of files in a directory. Their reasoning and implementation is described here:
https://www.mapbox.com/developers/mbtiles/
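A minimal sketch of the same idea (not the actual MBTiles schema; the database, table and column names here are made up): store each small file as a named blob in a single SQLite database.

    # One SQLite file instead of thousands of tiny files on disk.
    import sqlite3

    db = sqlite3.connect("blobs.db")  # hypothetical database file
    db.execute("""
        CREATE TABLE IF NOT EXISTS blobs (
            name TEXT PRIMARY KEY,
            data BLOB NOT NULL
        )
    """)

    def put(name, payload: bytes):
        db.execute("INSERT OR REPLACE INTO blobs (name, data) VALUES (?, ?)",
                   (name, payload))
        db.commit()

    def get(name):
        row = db.execute("SELECT data FROM blobs WHERE name = ?",
                         (name,)).fetchone()
        return row[0] if row else None

    put("file_5_5", b"...file contents...")
    print(get("file_5_5"))

Reads and writes become single queries instead of filesystem operations, so the thousands-of-files-per-directory problem never arises.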
Basically I need to synchronize two folder/file structures when both folders and files get moved around and changed quite often. They both have a history of changes recorded, and deltas can be queried on request. I already have what I think is a reliable self-made sync algorithm, tuned on the go whenever a problem arises. I was wondering if there is a mathematical background to this problem, and perhaps some well-built theories and patterns I could reuse to improve my system.
Not sure I understand your question, but perhaps the longest common subsequence problem, which is the basis of diff programs: find out what the difference is between two states (i.e. folders/files in your case) and encode the sequence of operations that translates state A into state B (what files need to be added, modified and removed for the two locations to have the same structure). This kind of solution works if one of the locations is the 'golden' copy (or 'master') and the other one is the 'slave': the slave has to reach the state of the master. When the situation is master-master (both sides accept writes), it is significantly more difficult to resolve, and you need some sort of automated conflict resolution.
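To make the "sequence of operations" part concrete, here is a minimal sketch assuming each location can be snapshotted as a mapping of relative path to content hash (the snapshot format is an assumption, not part of any particular tool):

    # Compute the operations that turn the slave state into the master state.
    def sync_plan(master: dict, slave: dict):
        ops = []
        for path, digest in master.items():
            if path not in slave:
                ops.append(("add", path))        # missing on the slave
            elif slave[path] != digest:
                ops.append(("modify", path))     # content differs
        for path in slave:
            if path not in master:
                ops.append(("remove", path))     # no longer on the master
        return ops

    master = {"a.txt": "111", "b.txt": "222", "c.txt": "333"}
    slave  = {"a.txt": "111", "b.txt": "999", "d.txt": "444"}
    print(sync_plan(master, slave))
    # [('modify', 'b.txt'), ('add', 'c.txt'), ('remove', 'd.txt')]

Longest common subsequence matters more when order is significant (diffing lines within a file); for whole trees, set comparison like the above is usually enough, plus rename detection if you track moves.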
I'm in the process of implementing caching for my project. After looking at cache directory structures, I've seen many examples like:
cache
cache/a
cache/a/a/
cache/a/...
cache/a/z
cache/...
cache/z
...
You get the idea. Another example for storing files: let's say our file is named IMG_PARTY.JPG; a common way is to put it in a directory named:
files/i/m/IMG_PARTY.JPG
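For example, a minimal sketch of that prefix scheme (the cache root and shard depth are arbitrary):

    # Shard a file into subdirectories named after the first characters of its name.
    from pathlib import Path

    def shard_path(root: str, filename: str, depth: int = 2) -> Path:
        parts = [filename[i].lower() for i in range(min(depth, len(filename)))]
        return Path(root, *parts, filename)

    target = shard_path("files", "IMG_PARTY.JPG")
    print(target)  # files/i/m/IMG_PARTY.JPG
    # target.parent.mkdir(parents=True, exist_ok=True)  # create shard dirs on write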
Some thoughts come to mind, but I'd like to know the real reasons for this.
Filesystems doing linear lookups find files faster when there are fewer of them in a directory. Such a structure spreads files thinly.
It avoids messing up *nix utilities like rm, which take a finite number of arguments; deleting a large number of files at once tends to be hacky (having to pass them through find, etc.).
What's the real reason? What is a "good" cache directory structure and why?
Every time I've done it, it has been to avoid slow linear searches in filesystems. Luckily, at least on Linux, this is becoming a thing of the past.
However, even today, with b-tree based directories, a very large directory will be hard to deal with, since it will take forever and a day just to get a listing of all the files, never mind finding the right file.
Just use dates, since you will remove by date. :)
If you do ls -l, all the files need to be stat()ed to get details, which adds considerably to the listing time - this happens whether the FS uses hashed or linear structures.
So even if the FS has the capability of coping with incredibly large directory sizes, there are good reasons not to have large flat structures (they're also a pig to back up).
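To illustrate the stat() cost, a small sketch (the directory path is a placeholder): listing names is one pass over the directory, while an ls -l-style listing also stats every entry.

    # Compare "names only" listing with a per-entry stat(), like ls vs ls -l.
    import os
    import time

    path = "/var/cache/myapp"  # hypothetical large directory

    t0 = time.perf_counter()
    names = os.listdir(path)                                    # names only
    t1 = time.perf_counter()
    details = [os.stat(os.path.join(path, n)) for n in names]   # one stat() per file
    t2 = time.perf_counter()

    print(f"{len(names)} entries: listdir {t1 - t0:.3f}s, +stat {t2 - t1:.3f}s")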
I've benchmarked GFS2 (clustered) with 32,000 files either in a single directory or arranged in a tree structure: recursive listings of the tree were around 300 times faster than listing the flat structure, which could take up to 10 minutes to produce a directory listing.
EXT4 showed similar ratios, but since the end point was only a couple of seconds, most people wouldn't notice.