Syncing two separate structures to the same master data

I've got multiple structures to maintain in my application. All link to the same records, and one of them could be considered the "master" in that it reflects actual relationships held in files on disk. The other structures are used to "call out" elements of the main design for purchase and work orders. I'm struggling to come up with a pattern that deals appropriately with changes to the master data.
As an example, the following trees might refer to the same data:
A
|_ B
|_ C
|_ D
|_ E
|_ B
|_ C
|_ D

A
|_ B
E
C
|_ D

A
|_ B
C
D
E
These secondary structures follow internal rules, but their overall structure is usually user-determined. In all cases (including the master), any element can be used in multiple locations and in multiple trees. When I add a child to any element in the tree, I want to either automatically build the secondary structure for each instance of the "master" element or at least advertise the situation to the user and allow them to manually generate the data required for the secondary trees.
Is there any pattern which might apply to this situation? I've been treating it as a view problem, but it turns out to be more complicated than that when you look at the initial generation of the data.

A tree implementation should be a good starting point. The master copy will be the complete tree.
The nodes in the copies can be composite objects that contain the data and a reference to the respective node in the master tree.
When a child is added, or any other modification happens on a node in one of the copies, that node can send a message to the master tree containing the reference to its corresponding master node and the details of the change.
The master would then modify itself and update the other copies.
The event handling may get tricky, because you will have to make sure the whole process does not become cyclic.
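A minimal sketch of that arrangement, assuming simple in-memory trees (names such as MasterNode, MirrorNode and on_master_child_added are illustrative, not taken from the question):

class MasterNode:
    # Node in the master tree; knows which copy nodes reference it.
    def __init__(self, data):
        self.data = data
        self.children = []
        self.mirrors = []                     # copy nodes that reference this node

    def add_child(self, data):
        child = MasterNode(data)
        self.children.append(child)
        # Notify every copy that references this node, so each secondary
        # structure can build (or at least flag) the new element.
        for mirror in self.mirrors:
            mirror.on_master_child_added(child)
        return child

class MirrorNode:
    # Node in a secondary tree: local children plus a reference to its master node.
    def __init__(self, master):
        self.master = master
        self.children = []
        master.mirrors.append(self)

    def add_child(self, data):
        # Changes made on a copy are routed through the master; the master is
        # the single writer, so updates only flow copy -> master -> copies
        # and the process cannot become cyclic.
        return self.master.add_child(data)

    def on_master_child_added(self, master_child):
        # Build the secondary node automatically; alternatively, record the
        # new element as pending and let the user generate it manually.
        self.children.append(MirrorNode(master_child))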

Related

Does the tree object type in Git internals point only to blobs, or to trees as well?

I've explored a lot about the three-object model of Git internals, and in the diagrams the tree object type is shown pointing to other trees as well. But when I run
git cat-file -p 'sha-hash',
it shows the tree pointing only to blobs.
Please refer to the screenshot attached.
Please show me a case where a tree points to another tree, or tell me what use case I may be missing.
If a file's name is a/b/c/f.ext it will be stored in a commit as:
at the level of the commit: a tree containing a subtree named a
at the level of the subtree named a: a tree containing a subtree named b
at the level of the subtree named b: a tree containing a subtree named c
at the level of the subtree named c: a tree containing a blob named f.ext
Hence, from the top, we simply string together the names of each higher level tree to arrive at the file's actual name, a/b/c/f.ext.
None of this really matters while using the file, since the important version of the file is the one in the index, and that one is named a/b/c/f.ext (with slashes in the name). It only matters when reading trees into the index (git read-tree), and when writing the index to a series of trees (git write-tree).
I think torek has explained it pretty well. I would just add that all of this storage is done in the form of SHA-1 hashes: a blob (binary large object) stores file contents, while a tree stores, roughly speaking, file names and the subtrees inside it.
This is a pretty good guide to the internals of Git; you might want to take a look at it.
Cheers!

Build hierarchy from sets of attributes

I'm having some trouble with what I believe to be some pretty basic stuff. Nevertheless, I can't seem to find anything, probably because I'm not asking the right question.
Let's say I have three (potentially redundant) sets of data: A, B, C = (a, b, c), (a, b, d), (a, e, f).
What I need is for some tool to suggest a hierarchy for me.
Like so:
(a)
  (b)
    (c)
    (d)
  (ef)
In reality there are far more sets and a lot of attributes within each set, but they are all closely related and I don't want to find and build the hierarchy manually.
If you want to build a hierarchy out of plain tuples, go build a tree (or, rather, a forest) out of them!
In your case the tree would look like:
    c
   /
  b - d
 /
a - e - f
The algorithm is trivial (a rough sketch follows the steps):
pick the first element from the tuple
find the top-level element in the forest with this value (or create one if not found)
pick the next value from the tuple
find the matching element among the children of the previously found node (or create it)
repeat until PROFIT
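Here is that sketch in Python (the build_forest function and the nested-dict representation of the forest are my own choices, not part of the answer):

def build_forest(tuples):
    # Each element of a tuple becomes a child of the element before it;
    # tuples sharing a prefix share the corresponding nodes.
    forest = {}                         # top level: value -> dict of children
    for t in tuples:
        level = forest
        for value in t:
            # Find the node with this value at the current level,
            # or create it if it is not found.
            level = level.setdefault(value, {})
    return forest

sets = [("a", "b", "c"), ("a", "b", "d"), ("a", "e", "f")]
print(build_forest(sets))
# {'a': {'b': {'c': {}, 'd': {}}, 'e': {'f': {}}}}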

Shortest sequence of operations transforming a file tree to another

Given two file trees A and B, is it possible to determine the shortest (or at least a short) sequence of operations necessary to transform A into B?
An operation can be:
Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder
A and B are considered identical when they have the same files with the same contents (or the same size and CRC) and the same names, in the same folder structure.
This question has been puzzling me for some time. For the moment I have the following basic idea:
Compute a database:
Store file names and their CRCs
Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
Ascend the tree to make a CRC for each parent folder
Use the following loop having database A and database B:
Compute A ∩ B and remove this intersection from both databases.
Use an inner join to find matching CRCs in A and B, folders first, order by size desc
While there is a result, use the first result to make a folder or file move (possibly creating new folders if necessary), and remove the source rows of the result from both databases. If there was a move, then update the CRCs of the new location's parent folders in database A.
Then remove all files and folders referenced in database A and create those referenced in database B.
However, I think this is a rather suboptimal way to do it. What advice could you give me?
Thank you!
This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.
That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.
Hope this helps! And thanks for posting an awesome question!
You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.
https://github.com/irskep/sleepytree (code and paper)
The first step is to figure out which files need to be created/renamed/deleted (a rough sketch follows the steps).
A) Create a hash map of the files of Tree B, keyed by content
B) Go through the files of Tree A
B.1) If there is an identical (name and contents) file in the hash map, leave it alone
B.2) If the contents are identical but the name is different, rename the file to the name in the hash map
B.3) If the file's contents don't exist in the hash map at all, remove the file
B.4) (if 1 or 2 matched) Remove the matched file from the hash map
The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.
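A rough sketch of that first step, assuming each file is summarized by a content hash (the function names and the dict-based representation are illustrative, not from the answer):

import hashlib

def file_digest(path):
    # Content hash used to detect identical files regardless of their names.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def plan_file_operations(tree_a_files, tree_b_files):
    # tree_x_files: dict mapping relative path -> content digest (see file_digest).
    # Returns the renames, deletions and creations needed to turn A's files into B's.
    b_by_digest = {}
    for path, digest in tree_b_files.items():
        b_by_digest.setdefault(digest, []).append(path)

    renames, deletions = [], []
    for path, digest in tree_a_files.items():
        targets = b_by_digest.get(digest)
        if targets:
            target = targets.pop()            # same contents exist somewhere in B
            if target != path:
                renames.append((path, target))
            # identical name and contents: leave the file alone
        else:
            deletions.append(path)            # contents not wanted anywhere in B
    # Whatever is left unmatched on the B side must be created from scratch.
    creations = [p for paths in b_by_digest.values() for p in paths]
    return renames, deletions, creations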
After the file differences have been resolved, it gets rather tricky. I wouldn't be surprised if there were no efficient optimal solution to this problem (NP-complete/hard).
The difficulty lies in the fact that the problem doesn't naturally subdivide itself: each step you take must consider the entire file tree. I'll think about it some more.
EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.
Helpful Links/References
An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann
Enumerate all files in B and their associated sizes and checksums;
sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums;
sort by size/checksum.
Now, doing an ordered list comparison, do the following:
a. for every file in A but not B, delete it.
b. for every file in B but not A, create it.
c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.
Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.
You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?
The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.
This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.
The only non-trivial problem is moving folders and files. Renaming, deleting and creating are trivial and can be done in the first step (or better, in the last, when you finish).
You can then reduce this to the problem of transforming one tree into another with the same leaves but a different topology.
You decide which files will be moved out of a folder/bucket and which will be left in it; the decision is based on the number of identical files in the source and the destination.
You apply the same strategy to move folders into the new topology.
I think you should be near-optimal, or optimal, if you forget about folder names and think just about files and topology.

Efficient mass modification of persistent data structures

I understand how trees are typically used to modify persistent data structures (create a new node and replace all of its ancestors).
But what if I have a tree of tens of thousands of nodes and I need to modify thousands of them? I don't want to go through and create thousands of new roots; I only need the one new root that results from modifying everything at once.
For example:
Let's take a persistent binary tree as an example. In the single-node update case, it searches until it finds the node, creates a new one with the modifications and the old children, and creates new ancestors up to the root.
In the bulk update case, could we do the following:
Instead of updating a single node, you update 1000 nodes in one pass.
At the root node, the current list is the full list. You then split that list between those that match the left node and those that match the right. If none match one of the children, don't descend to it. You then descend to the left node (assuming there were matches), split its search list between its children, and continue. When you have a single node and a match, you update it and go back up, replacing and updating ancestors and other branches as appropriate.
This would result in only one new root even though it modified any number of nodes.
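A rough sketch of that bulk-update pass over a persistent binary search tree (the Node class and bulk_update function are illustrative assumptions, not code from the question):

class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def bulk_update(node, updates):
    # updates: list of (key, new_value) pairs. Returns a single new root;
    # every untouched subtree is shared with the old tree.
    if node is None or not updates:
        return node                           # nothing to change below this point
    # Split the pending updates among the left subtree, this node itself and
    # the right subtree, so each update is routed down exactly one path.
    left_part  = [(k, v) for k, v in updates if k < node.key]
    here       = [v for k, v in updates if k == node.key]
    right_part = [(k, v) for k, v in updates if k > node.key]
    return Node(node.key,
                here[-1] if here else node.value,
                bulk_update(node.left, left_part),
                bulk_update(node.right, right_part))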
These kind of "mass modification" operations are sometimes called bulk updates. Of course, the details will vary depending on exactly what kind of data structure you are working with and what kind of modifications you are trying to perform.
Typical kinds of operations might include "delete all values satisfying some condition" or "increment the values associated with all the keys in this list". Frequently, these operations can be performed in a single walk over the entire structure, taking O(n) time.
You seem to be concerned about the memory allocation involved in creating "1000's of new roots". Typical allocation for performing the operations one at a time would be O(k log n), where k is the number of nodes being modified. Typical allocation for performing the single walk over the entire structure would be O(n). Which is better depends on k and n.
In some cases, you can decrease the amount of allocation--at the cost of more complicated code--by paying special attention to when changes occur. For example, if you have a recursive algorithm that returns a tree, you might modify the algorithm to return a tree together with a boolean indicating whether anything has changed. The algorithm could then check those booleans before allocating a new node to see whether the old node can safely be reused. However, people don't usually bother with this extra check unless and until they have evidence that the extra memory allocation is actually a problem.
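As a hedged illustration of that changed-flag idea (reusing the hypothetical Node class from the sketch above), a whole-tree update that only allocates along paths where something actually changed might look like:

def update_all(node, f):
    # Applies f to every value. Returns (tree, changed); an existing node is
    # reused whenever neither its value nor its subtrees changed.
    if node is None:
        return None, False
    new_left, left_changed = update_all(node.left, f)
    new_right, right_changed = update_all(node.right, f)
    new_value = f(node.value)
    changed = left_changed or right_changed or new_value != node.value
    if not changed:
        return node, False                    # safe to reuse the old node as-is
    return Node(node.key, new_value, new_left, new_right), True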
A particular implementation of what you're looking for can be found in Clojure's (and ClojureScript's) transients.
In short, given a fully-immutable, persistent data structure, a transient version of it will make changes using destructive (allocation-efficient) mutation, which you can flip back into a proper persistent data structure again when you're done with your performance-sensitive operations. It is only at the transition back to a persistent data structure that new roots are created (for example), thus amortizing the attendant cost over the number of logical operations you performed on the structure while it was in its transient form.

Strategies for quickly traversing an ACL

We are currently working on a project where the main domain objects are content nodes, and we are using an ACL-like system where each node in the hierarchy can contain rules that override or complement those of its parents. Everything is also based on roles and actions. For example:
Node 1 - {Deny All, Allow Role1 View}
  \- Node 2 - {Allow Role2 View}
       \- Node 3 - {Deny Role1 View}
In that case, rules are read in order from top to bottom, so Node 3 can be viewed only by Role2. It's not really complicated in concept.
Retrieving the rules for a single node takes several queries: getting all the parents, then recreating the list of rules and evaluating them. This process can be cumbersome, because the hierarchy can become quite deep and there may be a lot of rules on each node.
I have been thinking of preparing a table with precalculated rules for each node, which could be recreated whenever a permission is changed and then propagated to all the leaf nodes of the updated one.
Can you think of any other strategy to speed up the retrieval and calculation of the rules? Ideally it should be done in a single query, but trees are not the best structures for that.
I would think that an Observer Pattern might be adapted.
The idea would be that each Node maintains a precomputed list and is simply notified by its parent of any update so that it can recompute this list.
This can be done in 2 different ways:
notify that a change occurred, but don't recompute anything
recompute at each update
I would advise going with 1 if possible, since it does not involve recomputing the whole world when the root is updated, and it only recomputes when needed (lazy evaluation, in fact). You might prefer the second option if you update rarely but need blazing-fast retrieval (though there are more concurrency issues).
Let's illustrate Solution 1:
Root ___ Node1 ___ Node1A
   \          \___ Node1B
    \___ Node2 ___ Node2A
              \___ Node2B
Now, to begin with, none of them has precomputed anything (they are all in a dirty state). If I ask for Node2A's rules:
Node2A realizes it is dirty: it queries Node2 rules
Node2 realizes it is dirty: it queries Root
Root does not have any parent, so it cannot be dirty, it sends its rules to Node2
Node2 caches the answer from Root, merges its own rules with those received from Root and clears the dirty bit; it then sends the (now cached) result of the merge to Node2A
Node2A caches, merges, cleans the dirty bit and returns the result
If I subsequently ask for Node2B's rules:
Node2B is dirty, it queries Node2
Node2 is clean, it replies
Node2B caches, merges, cleans the dirty bit and returns the result
Note that Node2 did not recompute anything.
In the update case:
I update Node1: I use the cached Root rules to recompute the new rules and send a notification to Node1A and Node1B that their cache is outdated
Node1A and Node1B set their dirty bit; they would also have propagated this notification had they had children
Note that because I cached the Root rules, I don't have to query the Root object. If querying the rules is a simple enough operation, you might prefer not to cache them at all: if you're not in a distributed setting and querying Root only involves a memory round-trip, you might prefer not to duplicate its rules, saving some memory and book-keeping.
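A minimal sketch of Solution 1, with a dirty flag and lazy recomputation (the AclNode class and the merging of rules by simple list concatenation are illustrative assumptions):

class AclNode:
    def __init__(self, own_rules, parent=None):
        self.own_rules = list(own_rules)      # rules defined directly on this node
        self.parent = parent
        self.children = []
        self._effective = None                # cached merged rules; None means dirty
        if parent:
            parent.children.append(self)

    def effective_rules(self):
        if self._effective is None:           # dirty: pull from the parent on demand
            inherited = self.parent.effective_rules() if self.parent else []
            self._effective = inherited + self.own_rules
        return self._effective

    def update_rules(self, new_rules):
        self.own_rules = list(new_rules)
        self._mark_dirty()

    def _mark_dirty(self):
        # Notify only: descendants recompute lazily the next time they are asked.
        self._effective = None
        for child in self.children:
            child._mark_dirty()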
Hope it gets you going.
Your version of precomputation appears to store all the permissions relevant to each role at each node. You can save a little time and space by traversing the tree, numbering the nodes as you reach them, and producing, for each role, an array of the node numbers and permission changes for just the nodes at which the permissions relevant to that role change. This produces output only linear in the size of the input tree (including its annotations). Then, when you come to check a permission for a role at a node, use that node's number to search the array for the most recent change of permission at the time that node was visited during the tour.
This may be associated in some way with http://en.wikipedia.org/wiki/Range_Minimum_Query and http://en.wikipedia.org/wiki/Lowest_common_ancestor, but I don't really know if those references will help or not.
