I have a data structure which represents a hierarchy.
folder
    folder
        folder
            file
            file
etc.
Permissions are stored in a flat table:
| pKey | type | bitperms |
When performing global operations like search, we need to check permissions recursively within the tree.
Checking permissions inline with the individual leaves of the tree structure is easy. However, accounting for permissions on the intermediate nodes requires one of two known approaches:
after fetching the filtered leaves, post-process each one to check its parents' perms
the cost is deferred until after the query
there might be lots of initial leaves found, but after processing the parents nothing remains, so the work was wasted
pre-calculating all the roots (nodes which grant the permission) ahead of time and using that as a query filter while fetching the leaves
potentially a huge query if many roots exist, resulting in excessive time spent processing each leaf
Do any algorithms exist for doing this in a more efficient way? Perhaps reorganizing the permission data or adding more information to the hierarchy?
Perhaps adding some heuristics to deal with extremes?
Dunno about a complete paper about that, but here are my thoughts.
You obviously need to check the whole path from the leaf to the root at some point.
I assume no permission rule introduction from the side (i.e. you're working on a tree, not a general graph).
I assume lots of leaves on few "folder" nodes.
I also assume that you have a method for including permissions (ORing on a bitmask) or excluding them (NOTANDing on a bitmask).
Permissions are mostly granted to roles/groups, not individual users (in the latter case, you'd need to create something like an "ad-hoc role/group" for that user).
Permissions will not go up the tree, only down to the leaves.
Then I'd pre-calculate all permissions on folders from the root down and save them along with the folder nodes whenever permissions on folders change (or a role is added, etc.). When a specific file/leaf is requested, you only have to check the file's/leaf's own permissions and its folder's pre-calculated permissions.
You could also mark some folders as "do not inherit permissions from parent", which may shorten your calculations when the root's permissions change...
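A minimal sketch of that pre-calculation, assuming a hypothetical Folder object that carries its own permission bitmask, an inherit flag for the "do not inherit from parent" case, and a cached effective mask:

```python
# Sketch only: hypothetical folder node with a permission bitmask.
class Folder:
    def __init__(self, perms=0, inherit=True, children=None):
        self.perms = perms            # bits granted directly on this folder
        self.inherit = inherit        # False = "do not inherit from parent"
        self.children = children or []
        self.effective = 0            # cached result of the pre-calculation

def precalculate(folder, parent_effective=0):
    """Walk the folder tree from the root down, caching effective perms."""
    inherited = parent_effective if folder.inherit else 0
    folder.effective = inherited | folder.perms
    for child in folder.children:
        precalculate(child, folder.effective)
```

Checking a leaf then only ORs the leaf's own bits with its folder's cached effective mask; when a folder's permissions change, you re-run precalculate on that folder's subtree only.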
This would make it cheap for the following operations:
checking a leaf's permissions (join the leaf's and its parent folder's permissions).
changing the permissions of a folder which does not contain more folders.
These operations are costly, but since they do not need to touch any leaf/file, they only affect a minor part of the whole tree:
changing/extending the permission model (e.g. by adding a role/group, which might broaden your bitmask, depending on your implementation).
changing the root's permissions.
Related
Is there a more efficient way to walk a directory tree that contains link cycles than tracking which files have already been visited?
For example consider walking a directory containing these files:
symlink "parent" -> ".."
symlink "uh_oh" -> "/"
regular file "reg"
symlink "reg2" -> "reg"
You should also track which directories have been visited, as per your first example, but otherwise there is no better solution than maintaining visited flags for every file.
Maintaining the flags would be easier if there were a portable way of getting a short unique identifier for a mounted filesystem. Even then, you need to think through the consequences of mount and umount operations occurring during the scan, particularly since such a scan might take quite a long time if the filesystem tree includes remote filesystems.
In theory, you can get a "filesystem id" from the statvfs interface, but in practice that is not totally portable. Quoting man statfs from a Linux distro:
Nobody knows what f_fsid is supposed to contain…
…The general idea is that f_fsid contains some random stuff such that the pair (f_fsid,ino) uniquely determines a file. Some operating systems use (a variation on) the device number, or the device number combined with the filesystem type. Several OSes restrict giving out the f_fsid field to the superuser only (and zero it for unprivileged users), because this field is used in the filehandle of the filesystem when NFS-exported, and giving it out is a security concern.
This latter restriction -- that f_fsid is presented as 0 to non-privileged users -- does not violate the Posix standard cited above, because that standard includes a very general disclaimer: "It is unspecified whether all members of the statvfs structure have meaningful values on all file systems."
The tree walk algorithm guarantees that you'll visit every file under a directory, so instead of tracking individual files you can maintain a list of search "roots":
Add the initial directory to the list of roots
Walk the directory tree for each search root
For every symlink you find, check whether its target is already contained under an existing search root. If it's not, add it as a new search root.
This way you'll visit every file and directory, will never get stuck in a loop, but may visit files and directories more than once. That can happen only when you find a symlink to an ancestor of an existing root. To avoid doing that, you can check if a directory is a search root before entering it.
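A rough Python sketch of that root-list walk (the containment test via os.path.commonpath is simplified and assumes resolvable absolute paths; it also skips the "symlink to an ancestor of an existing root" de-duplication mentioned above):

```python
import os

def walk_with_symlink_roots(start):
    """Yield every path reachable from `start`; symlinked directories are
    queued as extra search roots instead of being followed in place."""
    roots = [os.path.realpath(start)]
    i = 0
    while i < len(roots):
        for dirpath, dirnames, filenames in os.walk(roots[i], followlinks=False):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                yield path
                if os.path.islink(path) and os.path.isdir(path):
                    target = os.path.realpath(path)
                    # add as a new root only if no existing root already covers it
                    if not any(os.path.commonpath([r, target]) == r for r in roots):
                        roots.append(target)
        i += 1
```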
Given two file trees A and B, is it possible to determine the shortest sequence of operations (or at least a short sequence) necessary to transform A into B?
An operation can be:
Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder
A and B are identical when they have the same files with the same contents (or the same size and CRC) and the same names, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following, basic idea:
Compute a database:
Store file names and their CRCs
Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
Ascend the tree to make a CRC for each parent folder
Use the following loop having database A and database B:
Compute A ∩ B and remove this intersection from both databases.
Use an inner join to find matching CRCs in A and B, folders first, order by size desc
While there is a result, use the first result to make a folder or file move (possibly creating new folders if necessary), then remove the source rows of the result from both databases. If there was a move, update the CRCs of the new location's parent folders in database A.
Then remove all files and folders referenced in database A and create those referenced in database B.
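A sketch of the database-building step in Python (build_database is a hypothetical helper; zlib.crc32 is used over file contents, and the way folder CRCs are combined from sorted child entries is just one illustrative choice):

```python
import os, zlib

def file_crc(path):
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            crc = zlib.crc32(chunk, crc)
    return crc

def build_database(root):
    """Return {path: (size, crc)}; folder CRCs are built bottom-up."""
    db = {}
    # topdown=False yields the deepest folders first, so child entries exist
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        size = crc = 0
        for name in sorted(filenames):
            p = os.path.join(dirpath, name)
            db[p] = (os.path.getsize(p), file_crc(p))
            size += db[p][0]
            crc = zlib.crc32(repr(db[p]).encode(), crc)
        for name in sorted(dirnames):
            entry = db.get(os.path.join(dirpath, name))
            if entry is None:      # e.g. a symlinked dir os.walk did not enter
                continue
            size += entry[0]
            crc = zlib.crc32(repr(entry).encode(), crc)
        db[dirpath] = (size, crc)
    return db
```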
However I think that this is really a suboptimal way to do it. What advice could you give me?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step is to figure out which files need to be created/renamed/deleted.
A) Create a hash map of the files of Tree B (the target), keyed by their contents
B) Go through the files of Tree A
B.1) If there is an identical (name and contents) file in the hash map, then leave it alone
B.2) If the contents are identical but the name is different, rename the file to the name in the hash map
B.3) If the file's contents don't exist in the hash map, remove it
B.4) (if one of 1, 2, 3 was true) Remove the matched file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
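A Python sketch of those steps, keyed purely by a content hash (hashlib.sha256 stands in for whatever checksum you use; the plan() helper and its path handling are my own). It records the planned actions instead of touching the filesystem, and does not try to prefer same-name matches first:

```python
import hashlib, os

def digest(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest()

def plan(tree_a, tree_b):
    """Return (renames, deletions, creations) as relative paths."""
    target = {}                                   # hash map of tree B (the target)
    for dirpath, _, files in os.walk(tree_b):
        for name in files:
            p = os.path.join(dirpath, name)
            target.setdefault(digest(p), []).append(os.path.relpath(p, tree_b))

    renames, deletions = [], []
    for dirpath, _, files in os.walk(tree_a):     # go through the files of tree A
        for name in files:
            p = os.path.join(dirpath, name)
            rel = os.path.relpath(p, tree_a)
            matches = target.get(digest(p))
            if matches:
                dest = matches.pop()              # same contents: keep or rename
                if dest != rel:
                    renames.append((rel, dest))
            else:
                deletions.append(rel)             # contents not needed in B
    # whatever is left in the hash map must be created
    creations = [rel for rels in target.values() for rel in rels]
    return renames, deletions, creations
```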
After the file differences have been resolved, it gets rather tricky. I wouldn't be surprised if there is no efficient optimal solution to this problem (NP-complete/hard).
The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann
Enumerate all files in B and their associated sizes and checksums;
sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums;
sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
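A minimal sketch of the ordered-list comparison described above (it only pairs entries one-to-one, so the "make copies of the rest" and overwrite bookkeeping from step c are left out):

```python
def diff_sorted(a_entries, b_entries):
    """a_entries / b_entries: lists of (size, checksum, path), sorted by (size, checksum).
    Yields ('delete', path), ('create', path) or ('rename', a_path, b_path)."""
    i = j = 0
    while i < len(a_entries) or j < len(b_entries):
        if j == len(b_entries) or (i < len(a_entries)
                                   and a_entries[i][:2] < b_entries[j][:2]):
            yield ('delete', a_entries[i][2])          # in A but not in B
            i += 1
        elif i == len(a_entries) or b_entries[j][:2] < a_entries[i][:2]:
            yield ('create', b_entries[j][2])          # in B but not in A
            j += 1
        else:
            yield ('rename', a_entries[i][2], b_entries[j][2])   # same size/checksum
            i += 1
            j += 1
```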
The only non-trivial problem is moving folders and files. Renaming, deleting and creating are trivial and can be done in a first step (or better, in a last step when you finish).
You can then reduce this to the problem of transforming one tree into another with the same leaves but a different topology.
You decide which files will be moved out of a folder/bucket and which will be left in it, based on the number of identical files in the source and destination folders.
You apply the same strategy to move folders into the new topology.
I think you will be near-optimal, or optimal, if you forget about folder names and think only about files and topology.
I understand how trees are typically used to modify persistent data structures (create a new node and replace all of its ancestors).
But what if I have a tree of 10,000's of nodes and I need to modify 1000's of them? I don't want to go through and create 1000's of new roots, I only need the one new root that results from modifying everything at once.
For example:
Let's take a persistent binary tree for example. In the single update node case, it does a search until it finds the node, creates a new one with the modifications and the old children, and creates new ancestors up to the root.
In the bulk update case could we do:
Instead of just updating a single node, you're going to update 1000 of its nodes in one pass.
At the root node, the current list is the full list. You then split that list between those that match the left node and those that match the right. If none match one of the children, don't descend to it. You then descend to the left node (assuming there were matches), split its search list between its children, and continue. When you have a single node and a match, you update it and go back up, replacing and updating ancestors and other branches as appropriate.
This would result in only one new root even though it modified any number of nodes.
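Here is a minimal sketch of that list-splitting descent on a persistent binary search tree (the {key: new_value} update map and the namedtuple node are my own assumptions); it allocates new nodes only along the touched paths and returns a single new root:

```python
from collections import namedtuple

Node = namedtuple('Node', 'key value left right')     # immutable tree node

def bulk_update(node, updates):
    """Apply {key: new_value} to a persistent BST in one pass."""
    if node is None or not updates:
        return node                                    # nothing to do: share as-is
    left_updates  = {k: v for k, v in updates.items() if k < node.key}
    right_updates = {k: v for k, v in updates.items() if k > node.key}
    new_left  = bulk_update(node.left,  left_updates)
    new_right = bulk_update(node.right, right_updates)
    new_value = updates.get(node.key, node.value)
    if new_left is node.left and new_right is node.right and new_value == node.value:
        return node                                    # untouched subtree: reuse it
    return Node(node.key, new_value, new_left, new_right)
```

The identity comparison before allocating a replacement node is essentially a "did anything change" check, so untouched branches are shared rather than copied and exactly one new root comes back.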
These kind of "mass modification" operations are sometimes called bulk updates. Of course, the details will vary depending on exactly what kind of data structure you are working with and what kind of modifications you are trying to perform.
Typical kinds of operations might include "delete all values satisfying some condition" or "increment the values associated with all the keys in this list". Frequently, these operations can be performed in a single walk over the entire structure, taking O(n) time.
You seem to be concerned about the memory allocation involved in creating "1000's of new roots". Typical allocation for performing the operations one at a time would be O(k log n), where k is the number of nodes being modified. Typical allocation for performing the single walk over the entire structure would be O(n). Which is better depends on k and n.
In some cases, you can decrease the amount of allocation--at the cost of more complicated code--by paying special attention to when changes occur. For example, if you have a recursive algorithm that returns a tree, you might modify the algorithm to return a tree together with a boolean indicating whether anything has changed. The algorithm could then check those booleans before allocating a new node to see whether the old node can safely be reused. However, people don't usually bother with this extra check unless and until they have evidence that the extra memory allocation is actually a problem.
A particular implementation of what you're looking for can be found in Clojure's (and ClojureScript's) transients.
In short, given a fully-immutable, persistent data structure, a transient version of it will make changes using destructive (allocation-efficient) mutation, which you can flip back into a proper persistent data structure again when you're done with your performance-sensitive operations. It is only at the transition back to a persistent data structure that new roots are created (for example), thus amortizing the attendant cost over the number of logical operations you performed on the structure while it was in its transient form.
We are currently working on a project where the main domain objects are content nodes, and we are using an ACL-like system where each node in the hierarchy can contain rules that override or complement those of its parents. Everything is also based on roles and actions. For example:
Node 1 - {Deny All, Allow Role1 View}
  \- Node 2 - {Allow Role2 View}
      \- Node 3 - {Deny Role1 View}
In that case, rules will be read in order from top to bottom so the Node 3 can be viewed only by Role2. It's not really complicated in concept.
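A tiny Python sketch of that top-to-bottom evaluation, just to pin down the semantics of the example (the rule tuples and role names are made up for illustration):

```python
def effective_view_roles(path_rules, all_roles):
    """path_rules: list of rule lists from the root down to the node.
    Each rule is ('allow'|'deny', role_or_'All', action)."""
    allowed = set()
    for rules in path_rules:
        for effect, role, action in rules:
            if action != 'View':
                continue
            roles = all_roles if role == 'All' else {role}
            allowed = allowed | roles if effect == 'allow' else allowed - roles
    return allowed

# Node 3 from the example: viewable only by Role2
all_roles = {'Role1', 'Role2'}
node3_path = [
    [('deny', 'All', 'View'), ('allow', 'Role1', 'View')],  # Node 1
    [('allow', 'Role2', 'View')],                           # Node 2
    [('deny', 'Role1', 'View')],                            # Node 3
]
assert effective_view_roles(node3_path, all_roles) == {'Role2'}
```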
Retrieving the rules for a single node can take several queries: getting all the parents, then rebuilding the list of rules and evaluating them. This process can be cumbersome because the hierarchy can become quite deep and there may be a lot of rules on each node.
I have been thinking of preparing a table with pre-calculated rules for each node, which could be recreated whenever a permission is changed and then propagated to all the leaf nodes of the updated one.
Do you think of any other strategy to speed up the retrieval and calculation of the rules? Ideally it should be done in a single query, but trees are not the best structures for that.
I would think that an Observer Pattern might be adapted.
The idea would be that each Node maintains a precomputed list and is simply notified by its parent of any update so that it can recompute this list.
This can be done in 2 different ways:
notify that a change occurred, but don't recompute anything
recompute at each update
I would advise going with option 1 if possible, since it does not involve recomputing the whole world when the root is updated, and only recomputes when needed (lazy evaluation, in fact). You might prefer the second option if you update rarely but need blazing fast retrieval (there are more concurrency issues, though).
Let's illustrate Solution 1:
Root ___ Node1 ___ Node1A
    \        \____ Node1B
     \__ Node2 ___ Node2A
               \__ Node2B
Now, to begin with, none of them has precomputed anything (they are all in a dirty state). If I ask for Node2A's rules:
Node2A realizes it is dirty: it queries Node2 rules
Node2 realizes it is dirty: it queries Root
Root does not have any parent, so it cannot be dirty, it sends its rules to Node2
Node2 caches the answer from Root, merges its rules with those received from Root and cleans the dirty bit, it sends the result of the merge (cached now) to Node2A
Node2A caches, merges, cleans the dirty bit and returns the result
If I subsequently ask for Node2B's rules:
Node2B is dirty, it queries Node2
Node2 is clean, it replies
Node2B caches, merges, cleans the dirty bit and returns the result
Note that Node2 did not recompute anything.
In the update case:
I update Node1: I use the Root cached rules to recompute the new rules and send a beat to Node1A and Node1B to notify them their cache is outdated
Node1A and Node1B set their dirty bit, they would also have propagated this notification had they had children
Note that because I cached Root's rules I don't have to query the Root object. If querying Root is a simple enough operation (you're not playing distributed here, and it only involves a memory round-trip), you might prefer not to cache its rules at all, to save some memory and book-keeping.
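A compact Python sketch of Solution 1 (dirty flag plus lazy recomputation; "merge" is just list concatenation here, and the early exit in _invalidate is the "don't notify twice" optimisation):

```python
class Node:
    """Sketch of Solution 1: lazy recomputation with dirty flags."""
    def __init__(self, own_rules, parent=None):
        self.own_rules = own_rules       # rules defined on this node only
        self.parent = parent
        self.children = []
        self._cached = None              # merged rules; None means dirty
        if parent:
            parent.children.append(self)

    def rules(self):
        if self._cached is None:                         # dirty: recompute lazily
            inherited = self.parent.rules() if self.parent else []
            self._cached = inherited + self.own_rules    # "merge" = concatenation
        return self._cached

    def set_rules(self, own_rules):
        self.own_rules = own_rules
        self._invalidate()

    def _invalidate(self):
        if self._cached is not None:     # already dirty: children were notified earlier
            self._cached = None
            for child in self.children:
                child._invalidate()
```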
Hope it gets you going.
Your version of pre-computation appears to store all the permissions relevant to each role at each node. You can save a little time and space by traversing the tree, numbering the nodes as you reach them, and producing, for each role, an array of node numbers and permission changes covering only the nodes at which that role's permissions change. This output is only linear in the size of the input tree (including its annotations). Then, when you come to check a permission for a role at a node, use that node's number to binary-search the role's array for the most recent permission change at the point the tour visited that node.
This may be associated in some way with http://en.wikipedia.org/wiki/Range_Minimum_Query and http://en.wikipedia.org/wiki/Lowest_common_ancestor, but I don't really know if those references will help or not.
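A sketch of that tour-numbering idea, assuming hypothetical nodes that carry a name, a dict of per-role permission values, and a list of children; bisect does the "most recent change" lookup:

```python
import bisect

def build_index(root, roles, default=0):
    """Euler-tour the tree; for each role, record (time, perm) only at nodes
    where that role's permission changes, undoing the change on the way back up."""
    changes = {r: [(0, default)] for r in roles}     # sentinel entry at time 0
    entry_time = {}                                  # node name -> first-visit time
    t = 0

    def visit(node):
        nonlocal t
        t += 1
        entry_time[node.name] = t
        undo = []
        for role, perm in node.rules.items():        # permissions set on this node
            prev = changes[role][-1][1]
            if perm != prev:
                changes[role].append((t, perm))
                undo.append((role, prev))
        for child in node.children:
            visit(child)
        t += 1
        for role, prev in undo:
            changes[role].append((t, prev))          # revert when leaving the node

    visit(root)
    return entry_time, changes

def permission(entry_time, changes, role, node_name):
    times = [time for time, _ in changes[role]]
    i = bisect.bisect_right(times, entry_time[node_name]) - 1
    return changes[role][i][1]
```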
What is generally the fastest algorithm to recursively make a directory (similar to UNIX mkdir -p) using the FTP protocol?
I have considered one approach:
1. MKDIR node
2. if error and nodes left, go to 1 with the next node
3. end
But this might have bad performance if part of the directory most likely exists. For example, with some amortization, the "/a/b/c/d" part of the "/a/b/c/d/e/f/g" path exists 99% of the time.
Considering that sending a command and receiving the response is taking most of the time, the fastest way to create a directory path is using as few commands as possible.
As there is no way to check for a directory's existence other than trying to create it or cd into it, just using mkdir a; mkdir a/b; ..., mkdir a/b/c/d/e/f would generally be the fastest way (do not cd into the subdirectories to create the next one, as this would prolong the process).
If you create multiple directories this way, you could of course keep track of which top-level directories you already created. Also, depending on the length of your paths and the likelihood that the upper directories already exist, you could try to start with e.g. mkdir a/b/c (for a/b/c/d/e/f) and then backtrack if it did not succeed. However if it's more likely that directories do not exist this will actually be slower in the long run.
If the existing directory hierarchy is equally likely to end at any given depth, then binary searching for the start position will be the fastest way. But as dseifert points out, if most of the time the directories already exist down to say level k, then it will be faster to start binary searching at level k rather than level n/2.
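A sketch with Python's ftplib, assuming absolute paths and that a failed CWD (error_perm) means the directory does not exist; seed lo with level k if you know the tree usually exists down to that depth:

```python
from ftplib import FTP, error_perm

def mkdir_p(ftp, path):
    """mkdir -p over FTP: binary-search for the deepest existing prefix,
    then issue one MKD per missing component."""
    parts = [p for p in path.split('/') if p]

    def exists(depth):
        try:
            ftp.cwd('/' + '/'.join(parts[:depth]))
            return True
        except error_perm:
            return False

    lo, hi = 0, len(parts)            # prefix of length lo is known to exist
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if exists(mid):
            lo = mid
        else:
            hi = mid - 1

    for depth in range(lo + 1, len(parts) + 1):
        ftp.mkd('/' + '/'.join(parts[:depth]))
```

Each probe and each MKD is one round-trip, so creating the missing tail of a path costs roughly log2(depth) probes plus one MKD per missing component.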
BTW, you'd have to be creating a lot of very deep directories for this sort of optimisation to be worth your time. Are you sure you're not optimising prematurely?