Efficient way to walk directory tree containing link cycles - algorithm

Is there a more efficient way to walk a directory tree that contains link cycles than tracking which files have already been visited?
For example consider walking a directory containing these files:
symlink "parent" -> ".."
symlink "uh_oh" -> "/"
regular file "reg"
symlink "reg2" -> "reg"

You should also track which directories have been visited, as per your first example, but otherwise there is no better solution than maintaining visited flags for every file.
Maintaining the flags would be easier if there were a portable way of getting a short unique identifier for a mounted filesystem. Even then, you need to think through the consequences of mount and umount operations occurring during the scan, particularly since such a scan might take quite a long time if the filesystem tree includes remote filesystems.
In theory, you can get a "filesystem id" from the statvfs interface, but in practice that is not totally portable. Quoting man statfs from a Linux distro:
Nobody knows what f_fsid is supposed to contain…
…The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
This latter restriction -- that f_fsid is presented as 0 to non-privileged users -- does not violate the POSIX standard cited above, because that standard includes a very general disclaimer: "It is unspecified whether all members of the statvfs structure have meaningful values on all file systems."
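For concreteness, here is a minimal Python sketch of the visited-flags approach, assuming a POSIX-like system where the (st_dev, st_ino) pair identifies a file uniquely while it exists. Because it follows symlinks, on the example above it will also wander out through "parent" and "uh_oh", but it can never loop; the function name is just illustrative.

    import os

    def walk_no_cycles(start):
        """Visited-set walk: remember (st_dev, st_ino) of everything seen."""
        seen = set()
        stack = [start]
        while stack:
            path = stack.pop()
            try:
                st = os.stat(path)                 # follows symlinks to their targets
            except OSError:
                continue                           # broken link, permission error, ...
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue                           # already visited: this breaks the cycles
            seen.add(key)
            yield path
            if os.path.isdir(path):
                stack.extend(os.path.join(path, name) for name in os.listdir(path))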

The tree walk algorithm guarantees that you'll visit every file under a directory, so instead of tracking individual files you can maintain a list of search "roots":
Add the initial directory to the list of roots
Walk the directory tree for each search root
For every symlink you find, check if it's already contained by a search root. If it's not, add it as a new search root.
This way you'll visit every file and directory, will never get stuck in a loop, but may visit files and directories more than once. That can happen only when you find a symlink to an ancestor of an existing root. To avoid doing that, you can check if a directory is a search root before entering it.
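A rough Python sketch of this strategy, leaving out for brevity the refinement of skipping directories that are themselves search roots; walk_with_roots is a hypothetical helper, not a drop-in replacement for os.walk:

    import os

    def walk_with_roots(start):
        """Never follow symlinks in place; queue their targets as new roots
        unless an existing root already covers them."""
        roots = [os.path.realpath(start)]

        def covered(path):
            return any(path == r or path.startswith(r + os.sep) for r in roots)

        i = 0
        while i < len(roots):
            for dirpath, dirnames, filenames in os.walk(roots[i], followlinks=False):
                for name in dirnames + filenames:
                    full = os.path.join(dirpath, name)
                    yield full
                    if os.path.islink(full):
                        target = os.path.realpath(full)
                        if os.path.isdir(target) and not covered(target):
                            roots.append(target)   # new search root
            i += 1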

Related

Does tree object type in git internals point to the blob only or to trees as well?

I've explored a lot about the 3-commit model of Git internals, and in the diagram it's shown that the tree object type points to other trees as well, but when I do
git cat-file -p 'sha-hash',
it tells me that the tree points to blobs only.
Please refer to the attached screenshot.
Please help me with a screenshot where a tree points to another tree, or tell me about any use case I may be missing.
If a file's name is a/b/c/f.ext it will be stored in a commit as:
at the level of the commit: a tree containing a subtree named a
at the level of the subtree named a: a tree containing a subtree named b
at the level of the subtree named b: a tree containing a subtree named c
at the level of the subtree named c: a tree containing a blob named f.ext
Hence, from the top, we simply string together the names of each higher level tree to arrive at the file's actual name, a/b/c/f.ext.
None of this really matters while using the file, since the important version of the file is the one in the index, and that one is named a/b/c/f.ext (with slashes in the name). It only matters when reading trees into the index (git read-tree), and when writing the index to a series of trees (git write-tree).
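You can also see the nesting by asking Git itself. The sketch below (Python, shelling out to the same git cat-file -p the question uses) recursively prints a tree; run against a repository whose files live in subdirectories, some entries come back with type tree rather than blob. The helper name and the HEAD^{tree} starting point are only illustrative.

    import subprocess

    def print_tree(obj, indent=""):
        """Recursively print a git tree; entries of type 'tree' point to
        further tree objects, one per subdirectory level."""
        out = subprocess.run(["git", "cat-file", "-p", obj],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            meta, name = line.split("\t", 1)
            mode, otype, sha = meta.split()
            print(f"{indent}{name} ({otype})")
            if otype == "tree":               # descend into the subtree
                print_tree(sha, indent + "    ")

    # e.g. print_tree("HEAD^{tree}") inside a repo that has subdirectories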
I think torek has explained it pretty well. I would also like to add that all of this storage is done in the form of SHA-1 hashes: the blob (binary large object) stores file contents, while the trees store, in somewhat vague terms, the file names and the subtrees inside them.
This is a pretty good guide to the internals of Git; you might want to take a look at it.
Cheers!

List all duplicate files in a filesystem given the root directory.

How would you go about designing an algorithm to list all the duplicate files in a filesystem? My first thought is to use hashing, but I'm wondering if there's a better way to do it. Are there any possible design tradeoffs to keep in mind?
Hashing all your files will take a very long time because you have to read all the file contents.
I would recommend a 3-step algorithm:
scan your directories and note down the paths & sizes of the files
Hash only the files which have the same size as other files, only if there are more than 2 files with the same size: if a file has the same size as only one other file, you don't need the hashing, just compare their contents one-to-one (saves hashing time, you won't need the hash value afterwards)
Even if the hash is the same, you still have to compare the files byte-by-byte, because hashes can collide for different files (although this is very unlikely when the sizes already match and the files are on your own filesystem).
You could also do without hashing at all, opening all files at the same time if possible, and compare contents. That would save multiple reads of big files. There are a lot of tweaks you could implement to save time depending on the type of your data (e.g., if two compressed/tar files have the same size greater than x gigabytes (and the same name), don't read the contents; given your process, the files are very likely to be duplicates).
That way, you avoid hashing files whose size is unique in the system, which saves a lot of time.
Note: I don't take names into account here, because I suppose names can be different.
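For illustration, a minimal Python sketch of that three-step idea; for brevity it hashes every same-size group, including pairs, and leaves the final byte-by-byte confirmation as a comment. find_duplicates is an illustrative name, not an existing tool.

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(root):
        """Group by size first, hash only the groups, report candidate sets."""
        by_size = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        duplicates = []
        for paths in by_size.values():
            if len(paths) < 2:
                continue                       # unique size: cannot have a duplicate
            by_hash = defaultdict(list)
            for path in paths:
                with open(path, "rb") as f:
                    by_hash[hashlib.sha1(f.read()).hexdigest()].append(path)
            # step 3: a byte-by-byte check of each group would confirm the matches
            duplicates.extend(group for group in by_hash.values() if len(group) > 1)
        return duplicates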
EDIT: I've done a bit of research (too late) and found out that fdupes seems to do exactly that if you are using Un*x-like systems:
https://linux.die.net/man/1/fdupes
seen in that question: List duplicate files in a directory in Unix

Do algorithms for performing hierarchical permission checking exist?

I have a data structure which represents a hierarchy.
folder
    folder
        folder
            file
            file
etc.
Permissions are stored in a flat table:
| pKey | type | bitperms |
When performing global operations like search, we need to check permissions recursively within the tree.
Checking permissions inline with the individual leaves of the tree structure is easy. However accounting for permission on the nodes requires one of two known approaches:
after fetching the filtered leaves, post-process each one to check its parents' perms
cost is delayed until after the fetch
there might be lots of initial leaves found, but after processing the parents nothing remains, resulting in useless work being done
pre-calculating all the roots (nodes which grant the permission) ahead of time and using that as a query filter while getting leaves
potentially a huge query if many roots exist, resulting in excessive time spent processing each leaf
Do any algorithms exist for doing this in a more efficient way? Perhaps reorganizing the permission data or adding more information to the hierarchy?
Perhaps adding some heuristics to deal with extremes?
Dunno about a complete paper about that, but here are my thoughts.
You obviously need to check at some point the whole path from the leaf to the root.
I assume no permission rule introduction from the side (i.e. you're working on a tree, not a general graph).
I assume lots of leaves on few "folder" nodes.
I also assume that you have a method for including permissions (ORing on a bitmask) or excluding them (NOTANDing on a bitmask).
Permissions are mostly granted to roles/groups, not individual users (in the latter case, you'd need to create something like an "ad-hoc role/group" for that user).
Permissions will not go up the tree, only down to the leaves.
Then I'd pre-calculate all permissions on folders from the root down and save them along with the folder nodes whenever permissions on folders change (or a role is added, etc.). When a specific file/leaf is accessed, you only have to check the file's/leaf's permissions and its folder's permissions.
You could also mark some folders as "do not inherit permissions from parent", which may shorten your calculations when the root's permissions change...
This would make it cheap for the following operations:
checking a leaf's permissions (join leaf and its parent permissions).
changing the permissions of a folder which does not contain more folders.
These operations are costly, but since they do not need to touch any leaf/file, they only affect a minor part of the whole tree:
changing/extending the permission model (e.g. by adding a role/group, which might broaden your bitmask, depending on your implementation).
changing the root's permissions.
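As a sketch of the pre-calculation itself (the node layout here is hypothetical: a dict with 'perms', an optional 'inherit' flag and 'children'), a single top-down pass stores an 'effective' bitmask on every folder:

    def propagate(node, parent_perms=0):
        """Store the effective bitmask on every folder, top-down."""
        inherited = parent_perms if node.get("inherit", True) else 0
        node["effective"] = inherited | node.get("perms", 0)
        for child in node.get("children", []):
            propagate(child, node["effective"])

    # Checking a leaf then needs only its own bits ORed with its folder's
    # stored 'effective' mask, with no walk up the tree at query time.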

Shortest sequence of operations transforming a file tree to another

Given two file trees A and B, is it possible to determine the shortest sequence of operations or a short sequence of operations that is necessary in order to transform A to B?
An operation can be:
Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder
A and B are considered identical when they have the same files with the same contents (or the same size and the same CRC) and the same names, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following, basic idea:
Compute a database:
Store file names and their CRCs
Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
Ascend the tree to make a CRC for each parent folder
Use the following loop having database A and database B:
Compute A ∩ B and remove this intersection from both databases.
Use an inner join to find matching CRCs in A and B, folders first, order by size desc
while there is a result, use the first result to make a folder or file move (possibly creating new folders if necessary), and remove the source rows of the result from both databases. If there was a move, update the CRCs of the new location's parent folders in database A.
Then remove all files and folders referenced in database A and create those referenced in database B.
However, I think this is a really suboptimal way to do it. What advice could you give me?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step is to figure out which files need to be created/renamed/deleted.
A) Create a hash map of the files of Tree B
B) Go through the files of Tree A
B.1) If there is an identical (name and contents) file in the hash map, then leave it alone
B.2) If the contents are identical but the name is different, rename the file to the name in the hash map
B.3) If the file's contents don't exist in the hash map, remove the file
B.4) (if one of 1,2,3 was true) Remove the file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
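A small Python sketch of that pass, assuming whole-file SHA-1 as the content hash and ignoring directory moves; plan_file_changes and content_hash are illustrative names:

    import hashlib
    import os

    def content_hash(path):
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()

    def plan_file_changes(tree_a, tree_b):
        """Index Tree B by content hash, walk Tree A and classify each file;
        whatever is left in the map afterwards must be created."""
        b_by_hash = {}
        for dirpath, _, files in os.walk(tree_b):
            for name in files:
                path = os.path.join(dirpath, name)
                b_by_hash.setdefault(content_hash(path), os.path.relpath(path, tree_b))

        plan = []
        for dirpath, _, files in os.walk(tree_a):
            for name in files:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, tree_a)
                h = content_hash(path)
                if h not in b_by_hash:
                    plan.append(("delete", rel, None))                        # B.3
                else:
                    target = b_by_hash.pop(h)                                 # B.4
                    plan.append(("keep" if rel == target else "rename", rel, target))  # B.1 / B.2

        plan.extend(("create", None, rel) for rel in b_by_hash.values())
        return plan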
After the file differences have been resolved, it gets rather tricky. I wouldn't be surprised if there isn't an efficient optimal solution to this problem (NP-complete/hard).
The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann
Enumerate all files in B and their associated sizes and checksums; sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums; sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
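A rough sketch of that ordered comparison in Python, assuming SHA-1 checksums and printing the planned operation instead of performing it; inventory and diff_sorted are illustrative names:

    import hashlib
    import os

    def inventory(root):
        """(checksum, size, relative path) for every file under root, sorted."""
        items = []
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                items.append((digest, os.path.getsize(path), os.path.relpath(path, root)))
        return sorted(items)

    def diff_sorted(tree_a, tree_b):
        """Merge-style walk over the two sorted inventories."""
        a, b = inventory(tree_a), inventory(tree_b)
        i = j = 0
        while i < len(a) or j < len(b):
            if j == len(b) or (i < len(a) and a[i][:2] < b[j][:2]):
                print("only in A, delete:", a[i][2])
                i += 1
            elif i == len(a) or b[j][:2] < a[i][:2]:
                print("only in B, create/copy:", b[j][2])
                j += 1
            else:
                print("same content, rename/keep:", a[i][2], "->", b[j][2])
                i += 1
                j += 1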
The only non-trivial problem is moving folders and files. Renaming, deleting, and creating are trivial and can be done in the first step (or better, in the last step, when you finish).
You can then reduce this to the problem of transforming one tree into another, where both have the same leaves but a different topology.
You decide which files will be moved out of a folder/bucket and which files will be left in it. The decision is based on the number of identical files in the source and the destination.
You apply the same strategy to move folders into the new topology.
I think you should be near-optimal or optimal if you forget about folder names and think just about files and topology.

The fastest way to implement mktree in FTP

What is generally the fastest algorithm to recursively make a directory (similar to UNIX mkdir -p) using the FTP protocol?
I have considered one approach:
1. MKDIR node
2. if error and nodes left, go to 1 with the next node
3. end
But this might have bad performance if part of the directory path most likely already exists. For example, with some amortization, the "/a/b/c/d" part of the "/a/b/c/d/e/f/g" path exists 99% of the time.
Considering that sending a command and receiving the response takes most of the time, the fastest way to create a directory path is to use as few commands as possible.
As there is no way other than trying to create or cd into a directory to check for its existence, just issuing mkdir a; mkdir a/b; ...; mkdir a/b/c/d/e/f would generally be the fastest way (do not cd into the subdirectories to create the next one, as this would prolong the process).
If you create multiple directories this way, you could of course keep track of which top-level directories you already created. Also, depending on the length of your paths and the likelihood that the upper directories already exist, you could try to start with e.g. mkdir a/b/c (for a/b/c/d/e/f) and then backtrack if it did not succeed. However if it's more likely that directories do not exist this will actually be slower in the long run.
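Sticking to that one-command-per-level idea, here is a rough sketch using Python's ftplib (the server name and login in the usage comment are placeholders, and the path is treated as relative to the current directory): each path component gets a single MKD, and the error a server most likely returns for an already-existing directory is simply ignored.

    from ftplib import FTP, error_perm

    def ftp_mkdirs(ftp: FTP, path: str):
        """'mkdir -p' over FTP: one MKD per component, no cwd round-trips."""
        prefix = ""
        for part in path.strip("/").split("/"):
            prefix = f"{prefix}/{part}" if prefix else part
            try:
                ftp.mkd(prefix)
            except error_perm:
                pass   # most likely "550 ... already exists"; keep going

    # usage (hypothetical server/credentials):
    # ftp = FTP("ftp.example.com"); ftp.login("user", "password")
    # ftp_mkdirs(ftp, "a/b/c/d/e/f")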
If the existing directory hierarchy is equally likely to end at any given depth, then binary searching for the start position will be the fastest way. But as dseifert points out, if most of the time the directories already exist down to say level k, then it will be faster to start binary searching at level k rather than level n/2.
BTW, you'd have to be creating a lot of very deep directories for this sort of optimisation to be worth your time. Are you sure you're not optimising prematurely?
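For the binary-search variant, a sketch under the assumptions that paths are absolute and that CWD into a missing directory raises error_perm; starting lo at a known-good level k instead of 0 gives the refinement dseifert describes.

    from ftplib import FTP, error_perm

    def deepest_existing(ftp: FTP, parts):
        """Binary search for the longest existing prefix, probing with CWD;
        returns how many leading components already exist."""
        lo, hi = 0, len(parts)              # invariant: the prefix of length lo exists
        while lo < hi:
            mid = (lo + hi + 1) // 2
            try:
                ftp.cwd("/" + "/".join(parts[:mid]))
                lo = mid                    # exists: search deeper
            except error_perm:
                hi = mid - 1                # missing: search shallower
        return lo

    # then create the remaining components one MKD at a time, e.g.:
    # parts = "a/b/c/d/e/f".split("/")
    # for depth in range(deepest_existing(ftp, parts) + 1, len(parts) + 1):
    #     ftp.mkd("/" + "/".join(parts[:depth]))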
