Strategies for quickly traversing an ACL - performance

We are currently working on a project whose main domain objects are content nodes, using an ACL-like system where each node in the hierarchy can contain rules that override or complement those of its parents. Everything is also based on roles and actions. For example:
Node 1 - {Deny All, Allow Role1 View}
  - Node 2 - {Allow Role2 View}
    - Node 3 - {Deny Role1 View}
In that case, rules are read in order from top to bottom, so Node 3 can be viewed only by Role2. The concept is not really complicated.
Retrieving the rules for a single node can require several queries: fetching all the parents, recreating the list of rules, and evaluating them. This process can be cumbersome because the hierarchy can become quite deep and each node may carry many rules.
I have been thinking of preparing a table with precalculated rules for each node, which could be recreated whenever a permission changes and propagated to all the leaf nodes below the updated one.
Can you think of any other strategy to speed up the retrieval and calculation of the rules? Ideally it would be done in a single query, but trees are not the best structures for that.
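To make the evaluation order concrete, here is a minimal Python sketch of the top-down merge described above. The `Node`, `effective_rules` and `can` names and the `(effect, role, action)` tuple shape are illustrative assumptions, not part of any existing system:

```python
# Illustrative sketch of the top-down evaluation: walk up to the root,
# then apply the rules root-first so that deeper rules override earlier ones.

class Node:
    def __init__(self, parent=None, rules=None):
        self.parent = parent
        # ordered rules: (effect, role, action), "*" acting as a wildcard
        self.rules = rules or []

def effective_rules(node):
    chain = []
    while node is not None:
        chain.append(node.rules)
        node = node.parent
    merged = []
    for rules in reversed(chain):        # root first, leaf last
        merged.extend(rules)
    return merged

def can(node, role, action):
    allowed = False                      # default deny
    for effect, r, a in effective_rules(node):
        if r in (role, "*") and a in (action, "*"):
            allowed = (effect == "allow")
    return allowed

# The example from the question:
n1 = Node(rules=[("deny", "*", "*"), ("allow", "Role1", "View")])
n2 = Node(parent=n1, rules=[("allow", "Role2", "View")])
n3 = Node(parent=n2, rules=[("deny", "Role1", "View")])
# can(n3, "Role2", "View") -> True; can(n3, "Role1", "View") -> False
```

Note that each `can()` call walks the full ancestor chain, which is exactly the cost the pre-computation ideas below try to avoid.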

I would think that an Observer pattern might be well adapted here.
The idea would be that each Node maintains a precomputed list and is simply notified by its parent of any update so that it can recompute this list.
This can be done in two different ways:
1. notify that a change occurred, but don't recompute anything
2. recompute at each update
I would advise going with option 1 if possible, since it does not involve recomputing the whole world when the root is updated, and only recomputes when needed (lazy evaluation, in fact). You might prefer option 2 if you update rarely but need blazing-fast retrieval (there are more concurrency issues, though).
Let's illustrate Solution 1:
Root
├── Node1
│   ├── Node1A
│   └── Node1B
└── Node2
    ├── Node2A
    └── Node2B
Now, to begin with, none of them has precomputed anything (they are all in a dirty state). If I ask for Node2A's rules:
Node2A realizes it is dirty: it queries Node2 rules
Node2 realizes it is dirty: it queries Root
Root does not have any parent, so it cannot be dirty, it sends its rules to Node2
Node2 caches the answer from Root, merges its own rules with those received from Root, and clears its dirty bit; it then sends the (now cached) result of the merge to Node2A
Node2A caches, merges, clears its dirty bit and returns the result
If I subsequently ask for Node2B's rules:
Node2B is dirty, it queries Node2
Node2 is clean, it replies
Node2B caches, merges, clears its dirty bit and returns the result
Note that Node2 did not recompute anything.
In the update case:
I update Node1: I use the cached Root rules to recompute the new rules and send a beat to Node1A and Node1B to notify them that their cache is outdated
Node1A and Node1B set their dirty bit; they would also have propagated this notification had they had children
Note that because I cached the Root rules, I don't have to query the Root object. If it's a simple enough operation, you might prefer not to cache them at all: if you're not playing distributed here, and querying Root only involves a memory round-trip, you might prefer not to duplicate the rules, to save some memory and book-keeping.
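A minimal in-memory sketch of option 1 (the notify-only variant), assuming simple parent/child object references; all names here are illustrative:

```python
# Sketch of option 1: nodes only mark themselves dirty on notification and
# recompute lazily on the next read.

class AclNode:
    def __init__(self, parent=None, own_rules=None):
        self.parent = parent
        self.children = []
        self.own_rules = list(own_rules or [])
        self._cache = None        # merged rules; valid only when not dirty
        self.dirty = True
        if parent is not None:
            parent.children.append(self)

    def rules(self):
        """Merged rules, recomputed lazily along the dirty path only."""
        if self.dirty:
            inherited = self.parent.rules() if self.parent else []
            self._cache = inherited + self.own_rules   # order = override
            self.dirty = False
        return self._cache

    def update(self, new_rules):
        """Change this node's own rules; notify descendants, recompute nothing."""
        self.own_rules = list(new_rules)
        self.dirty = True
        self._notify()

    def _notify(self):
        for child in self.children:
            if not child.dirty:   # an already-dirty subtree is already marked
                child.dirty = True
                child._notify()
```

The early stop in `_notify` relies on the invariant that a dirty node's descendants were either marked with it or never cleaned since.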
Hope it gets you going.

Your version of pre-computation appears to store all the permissions relevant to each role at each node. You can save some time and space by traversing the tree, numbering the nodes as you reach them, and producing, for each role, an array of node numbers and permission changes covering only the nodes at which the permissions relevant to that role change. This output is only linear in the size of the input tree (including its annotations). Then, to check a permission for a role at a node, use that node's number to search the array for the most recent change of permission at the point the tour visited that node.
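A rough sketch of this numbering scheme, assuming an Euler-tour-style DFS that records, per role, only the positions where the effective permissions change (including reverting on the way back up), with lookups done by binary search; every name here is hypothetical:

```python
import bisect

class TNode:
    """Illustrative tree node: `rules` maps role -> permissions override."""
    def __init__(self, rules=None, children=None):
        self.rules = rules or {}
        self.children = children or []

def build_index(root):
    """DFS-number the nodes; per role, keep only the (position, perms)
    pairs where the effective permissions change during the tour."""
    numbers = {}   # node -> visit number
    changes = {}   # role -> ([positions], [effective perms from there on])
    counter = 0

    def record(role, perms):
        pos, vals = changes.setdefault(role, ([], []))
        if pos and pos[-1] == counter:
            vals[-1] = perms                 # several changes at one position
        elif not vals or vals[-1] != perms:
            pos.append(counter)
            vals.append(perms)

    def visit(node, effective):
        nonlocal counter
        numbers[node] = counter
        for role, perms in node.rules.items():
            if effective.get(role) != perms:
                record(role, perms)
        merged = {**effective, **node.rules}
        counter += 1
        for child in node.children:
            visit(child, merged)
        for role in node.rules:              # leaving the subtree: revert
            if merged[role] != effective.get(role):
                record(role, effective.get(role))

    visit(root, {})
    return numbers, changes

def perms_for(numbers, changes, node, role, default=None):
    """Binary-search the role's change list for the last change at or
    before this node's visit number."""
    if role not in changes:
        return default
    pos, vals = changes[role]
    i = bisect.bisect_right(pos, numbers[node]) - 1
    return vals[i] if i >= 0 else default
```

Each lookup is then O(log c), where c is the number of permission changes for that role, independent of the node's depth.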
This may be associated in some way with http://en.wikipedia.org/wiki/Range_Minimum_Query and http://en.wikipedia.org/wiki/Lowest_common_ancestor, but I don't really know if those references will help or not.

Related

Neo4j - slow cypher query - big graph with hierarchies

Using Neo4j 2.1.4. I have a graph with 'IS A' relationships (among other relationship types) between nodes. The graph contains several hierarchies (chains of IS A relationships), and I need to find the descendants (via IS A) of one hierarchy that have a particular known relationship with some descendant of a second hierarchy. If that relationship exists, I return the descendant(s) of the first hierarchy.
INPUTS: 'ID_parentnode_hierarchy_01', 'ID_relationship', 'ID_parentnode_hierarchy_02'.
OUTPUT: Descendants (IS A relationship) of 'ID_parentnode_hierarchy_01' that has 'ID_relationship' with some descendant of 'ID_parentnode_hierarchy_02'.
Note: The graph has 500,000 nodes and 2 million relationships.
I am using this Cypher query, but it is very slow (approx. 40 s on a 4 GB RAM, 3 GHz Pentium Dual Core 64-bit PC). Is it possible to build a faster query?
MATCH (parentnode_hierarchy_01: Node{nodeid : {ID_parentnode_hierarchy_01}})
WITH parentnode_hierarchy_01
MATCH (parentnode_hierarchy_01) <- [:REL* {reltype: {isA}}] - (descendants01: Node)
WITH descendants01
MATCH (descendants01) - [:REL {reltype: {ID_relationship}}] -> (descendants02: Node)
WITH descendants02, descendants01
MATCH (parentnode_hierarchy_02: Node {nodeid: {ID_parentnode_hierarchy_02} })
<- [:REL* {reltype: {isA}}] - (descendants02)
RETURN DISTINCT descendants01;
Thank you very much.
Well, I can slightly clean up your query; this might help us understand the issues better. I doubt this version will run faster, but with it we can discuss what's going on (it mostly eliminates unneeded uses of MATCH/WITH):
MATCH (parent:Node {nodeid: {ID_parentnode_hierarchy_01}})<-[:REL* {reltype:{isA}}]-
(descendants01:Node)-[:REL {reltype:{ID_relationship}}]->(descendants02:Node),
(parent2:Node {nodeid: {ID_parentnode_hierarchy_02}})<-[:REL* {reltype:{isA}}]-
(descendants02)
RETURN distinct descendants01;
This looks like you're searching two (probably large) trees, starting from the root, for two nodes somewhere in the tree that are linked by an {ID_relationship}.
Unless you can provide some query hints about which nodes in the tree might have an ID_relationship, at worst this looks like it could end up comparing every node of one tree with every node of the other. So it could take n * k time, where n is the number of nodes in the first tree and k the number of nodes in the second.
Here are some strategy things to think about - which you should use depends on your data:
Is there some depth in the tree where these links are likely to be found? Can you put a range on the depth of [:REL* {reltype:{isA}}]?
What other criteria can you add to descendants01 and descendants02? Is there anything that can help make the query more selective so that you're not comparing every node in one tree to every node in the other?
Another strategy you might try is this (it might be a horrible idea, but it's worth trying): look for a path from one root to the other, over any number of undirected edges of either the isA type or the other. Your data model has :REL relationships with a reltype attribute. This is probably an antipattern; instead of a reltype attribute, why isn't the relationship type just that? This prevents the query I want to write, below:
MATCH p=shortestPath((p1:Node {nodeid: {first_parent_id}})-[:isA|ID_relationship*]-(p2:Node {nodeid: {second_parent_id}}))
return p;
This would return the path from one "root" to the other, via the bridge you want. You could then use path functions to extract whatever nodes you wanted. Note that this query isn't possible currently because of your data model.

Do algorithms for performing hierarchical permission checking exist?

I have a data structure which represents a hierarchy.
folder
  folder
    folder
      file
  file
etc.
Permissions are stored in a flat table:
| pKey | type | bitperms |
When performing global operations like search, we need to check permissions recursively within the tree.
Checking permissions inline with the individual leaves of the tree structure is easy. However, accounting for permissions on the inner nodes requires one of two known approaches:
1. After fetching the filtered leaves, post-process each one to check its parents' perms.
   - The cost is delayed until afterwards.
   - Lots of initial leaves might be found, yet after processing the parents nothing remains, resulting in useless work being done.
2. Pre-calculate all the roots (nodes which grant the permission) ahead of time and use that set as a query filter while getting the leaves.
   - Potentially a huge query if many roots exist, resulting in excessive time spent processing each leaf.
Do any algorithms exist for doing this in a more efficient way? Perhaps reorganizing the permission data or adding more information to the hierarchy?
Perhaps adding some heuristics to deal with extremes?
Dunno about a complete paper about that, but here are my thoughts.
You obviously need to check the whole path from the leaf to the root at some point. My assumptions:
- No permission rule is introduced from the side (i.e., you're working on a tree, not a general graph).
- There are lots of leaves on few "folder" nodes.
- You have a method for including permissions (ORing a bitmask) or excluding them (AND-NOTing a bitmask).
- Permissions are mostly granted to roles/groups, not individual users (in the latter case, you'd need to create something like an "ad-hoc role/group" for that user).
- Permissions do not go up the tree, only down to the leaves.
Then I'd pre-calculate all permissions on folders, starting from the root and working down, and save them along with the folder nodes whenever some folder permission changes (or a role is added, etc.). When a specific file/leaf is requested, you only have to check the file's permissions and its folder's permissions.
You could also mark some folders as "do not inherit permissions from parent", which may shorten your calculations when the root's permissions change...
This would make it cheap for the following operations:
- checking a leaf's permissions (join the leaf's and its folder's permissions);
- changing the permissions of a folder which does not contain more folders.
These operations are costly, but since they do not need to touch any leaf/file, they only touch a minor part of the whole tree:
- changing/extending the permission model (e.g. by adding a role/group, which might widen your bitmask, depending on your implementation);
- changing the root's permissions.
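Under the bitmask assumptions above (OR to include, AND-NOT to exclude), the root-down pre-calculation could look roughly like this sketch; `Folder`, `precompute` and `can_access` are made-up names:

```python
# Sketch of the root-down pre-calculation, assuming a single (allow, deny)
# bitmask pair per folder.

class Folder:
    def __init__(self, allow=0, deny=0, children=None):
        self.allow, self.deny = allow, deny
        self.children = children or []
        self.effective = 0        # filled in by precompute()

def precompute(folder, inherited=0):
    """Permissions flow down only: OR in this folder's allows,
    AND-NOT out its denies, then recurse into the children."""
    folder.effective = (inherited | folder.allow) & ~folder.deny
    for child in folder.children:
        precompute(child, folder.effective)

def can_access(leaf_perms, folder, required):
    """Check a file: join the leaf's own bits with its folder's
    precomputed effective bits."""
    return (folder.effective | leaf_perms) & required == required
```

A folder marked "do not inherit" would simply pass `0` instead of `folder.effective` when recursing into its subtree.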

Why is it important to delete files in-order to remove them faster?

Some time ago I learned that rsync deletes files much faster than many other tools.
A few days ago I came across this wonderful answer on Serverfault which explains why rsync is so good at deleting files.
Quotation from that answer:
I revisited this today, because most filesystems store their directory
structures in a btree format, the order of which you delete files is
also important. One needs to avoid rebalancing the btree when you
perform the unlink. As such I added a sort before deletes occur.
Could you explain how removing files in order prevents or reduces the number of b-tree rebalancings?
I expect the answer to show how deleting in order increases deletion speed, with details of what happens at the b-tree level. People who wrote rsync and other programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding to be able to write better software.
It is not important, nor is it a b-tree issue. It is just a coincidence.
First of all, this is very much implementation dependent and very much ext3 specific. That's why I said it's not important (for general use). Otherwise, put the ext3 tag or edit the summary line.
Second of all, ext3 does not use a b-tree for the directory entry index. It uses an HTree. The HTree is similar to a b-tree but different, and does not require balancing. Search for "htree" in fs/ext3/dir.c.
Because of the HTree-based index, (a) ext3 has faster lookups compared to ext2, but (b) readdir() returns entries in hash-value order. The hash-value order is random relative to file creation time or the physical layout of the data. As we all know, random access is much slower than sequential access on rotating media.
A paper on ext3 published for OLS 2005 by Mingming Cao, et al. suggests (emphasis mine):
to sort the directory entries returned by readdir() by inode number.
Now, on to rsync. rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().
I did not test the following hypothesis because I do not have the data sets with which #MIfe got 43 seconds, but I assume that sorted-by-name order was much closer to the optimal order than the random order returned by readdir(). That is why you saw a much faster result with rsync on ext3. What if you generate 1,000,000 files with random file names and then delete them with rsync? Do you see the same result?
Let's assume that the answer you posted is correct, and that the given file system does indeed store things in a balanced tree. Balancing a tree is a very expensive operation. Keeping a tree "partially" balanced is pretty simple: when you allow a tree to be slightly imbalanced, you only worry about moving things around the point of insertion/deletion. However, with completely balanced trees, when you remove a given node you may find that suddenly the children of this node belong on the complete opposite side of the tree, or that a child node on the opposite side has become the root node and all of its children need to be rotated up the tree. This requires you to do either a long series of rotations, or to place all the items into an array and re-create the tree.
   5
 3   7
2 4 6 8
now remove the 7, easy right?
   5
 3   8
2 4 6
Now remove the 6, still easy, yes...?
   5
 3   8
2 4
Now remove the 8, uh oh
   5
 3
2 4
Getting this tree to be the proper balanced form like:
   4
 3   5
2
is quite expensive, compared at least to the other removals we have done, and gets exponentially worse as the depth of our tree increases. We could make this go much (exponentially) faster by removing the 2 and the 4 before removing the 8, particularly if our tree were more than three levels deep.
Without sorting, removal is on average O(K * log_I(N)^2), where N is the total number of elements, K the number to be removed, and I the number of children a given node is permitted; log_I(N) is then the depth, and for each level of depth we increase the number of operations quadratically.
Removal with some ordering help is on average O(K * log_I(N)), though sometimes the ordering cannot help you and you are stuck removing something that requires a rebalance. Still, minimizing this is optimal.
EDIT:
Another possible tree ordering scheme:
   8
 6   7
1 2 3 4
Accomplishing optimal removal under this scheme would be easier, because we can take advantage of knowing how things are sorted. It is possible under either scheme, and in fact both are identical; under this one the logic is just a little simpler to understand, because the ordering is more human-friendly for the given scenario. In either case, "in order" is defined as "remove the farthest leaf first". Here it happens that the farthest leaves are also the smallest numbers, a fact we could exploit to be even a little more optimal, but that fact is not necessarily true for the file-system example presented (though it may be).
I am not convinced that the number of b-tree rebalancings changes significantly if you delete the files in order. However, I do believe that the number of distinct seeks to external storage will be significantly smaller if you do this. At any time, the only nodes of the b-tree that need to be visited are on its far-right boundary, whereas with a random order each leaf block in the b-tree is visited with equal probability for each file.
Rebalancing a B-tree is cheaper than in B+-tree implementations; that's why most filesystem and database index implementations use them.
There are many approaches to deletion; depending on the approach, it can be more efficient in terms of time and the need to rebalance the tree. You'll also have to consider the node size, since the number of keys a node can store affects the need for rebalancing. A large node size will just reorder keys inside the node, but a small one will probably make the tree rebalance many times.
A great resource for understanding this is the famous CLR book (Cormen, Leiserson, Rivest), "Introduction to Algorithms".
On storage systems hosting huge directories, the buffer cache will be under stress and buffers may get recycled. So, if deletes are spaced apart in time, the number of disk reads needed to bring the b-tree back into the buffer cache between deletes may be high.
If you sort the files to be deleted, you are effectively delaying the deletes and batching them. This may have the side effect of more deletes per b-tree block paged in. If there are stats showing the buffer-cache hits of the two experiments, they may tell whether this hypothesis is wrong or not.
But if there is no stress on the buffer cache during the deletes, then the b-tree blocks could stay in core, and my hypothesis would not be valid.

What are the performance benefits of keeping a head node without data in a linked list?

What is the performance benefit of keeping a head node with useless data in a linked list?
I have read some implementations of linked-list operations that use a head node which keeps only a pointer to the first node (the data in the head node is unused).
But I am unable to figure out even a single benefit of using a head node instead of a head pointer.
Can someone please clarify with one problem and two implementations, one with a head node and another with a head pointer, and the trade-offs between performance and complexity?
Dummy header nodes are used for these reasons:
To satisfy the requirement that every node has a previous node (this makes the methods simpler and more uniform)
To avoid dealing with the special cases of inserting into and deleting from the head; the head is just like any other node
Other than that, no, they do not provide any performance/memory advantage.
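A short sketch contrasting the two styles for a singly linked list; with a dummy head every node has a predecessor, so deletion needs no special case for the first element (names are illustrative):

```python
class ListNode:
    def __init__(self, value=None, nxt=None):
        self.value, self.next = value, nxt

def delete_with_head_pointer(head, value):
    """Returns the (possibly new) head; the first node is a special case."""
    if head is not None and head.value == value:
        return head.next
    node = head
    while node is not None and node.next is not None:
        if node.next.value == value:
            node.next = node.next.next
            break
        node = node.next
    return head

def delete_with_dummy(dummy, value):
    """dummy.next is the real head; every node has a predecessor,
    so one uniform code path handles all deletions."""
    node = dummy
    while node.next is not None:
        if node.next.value == value:
            node.next = node.next.next
            break
        node = node.next
```

Note the dummy-node version never needs to return a new head, and the caller never has to reassign its head variable.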
Among other things, one more advantage is that you can keep the number of currently available nodes in the linked list: since nodes are added and removed frequently, you can keep a record of the node count in the head node.

Efficient mass modification of persistent data structures

I understand how trees are typically used to modify persistent data structures (create a new node and replace all of its ancestors).
But what if I have a tree of tens of thousands of nodes and I need to modify thousands of them? I don't want to go through and create thousands of new roots; I only need the one new root that results from modifying everything at once.
For example:
Let's take a persistent binary tree as an example. In the single-node update case, it searches until it finds the node, creates a new one with the modifications and the old children, and creates new ancestors up to the root.
In the bulk update case could we do:
Instead of updating a single node, you're going to update 1000 nodes in one pass.
At the root node, the current list is the full list. You then split that list between the entries that match the left child and those that match the right. If none match one of the children, don't descend into it. You then descend into the left child (assuming there were matches), split its search list between its children, and continue. When you reach a single node with a match, you update it and go back up, replacing and updating ancestors and other branches as appropriate.
This would result in only one new root even though it modified any number of nodes.
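The one-pass bulk update described above could look like this sketch for a persistent BST (illustrative code, assuming the key list is sorted); note the identity check that reuses an old node whenever nothing below it changed:

```python
# Sketch of a bulk update on a persistent BST: split the sorted key list at
# each node and rebuild only the touched paths, yielding a single new root.
import bisect

class PNode:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def bulk_update(node, keys, f):
    """Apply f to the values at all the given sorted keys; returns a new
    root that shares every untouched subtree with the old tree."""
    if node is None or not keys:
        return node
    i = bisect.bisect_left(keys, node.key)    # keys < node.key go left
    j = bisect.bisect_right(keys, node.key)   # keys > node.key go right
    left = bulk_update(node.left, keys[:i], f)
    right = bulk_update(node.right, keys[j:], f)
    value = f(node.value) if i != j else node.value
    if left is node.left and right is node.right and i == j:
        return node                           # nothing changed below: reuse
    return PNode(node.key, value, left, right)
```

Only the nodes on paths to modified keys are reallocated, so a single call produces exactly one new root no matter how many keys were updated.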
These kind of "mass modification" operations are sometimes called bulk updates. Of course, the details will vary depending on exactly what kind of data structure you are working with and what kind of modifications you are trying to perform.
Typical kinds of operations might include "delete all values satisfying some condition" or "increment the values associated with all the keys in this list". Frequently, these operations can be performed in a single walk over the entire structure, taking O(n) time.
You seem to be concerned about the memory allocation involved in creating "1000's of new roots". Typical allocation for performing the operations one at a time would be O(k log n), where k is the number of nodes being modified. Typical allocation for performing the single walk over the entire structure would be O(n). Which is better depends on k and n.
In some cases, you can decrease the amount of allocation--at the cost of more complicated code--by paying special attention to when changes occur. For example, if you have a recursive algorithm that returns a tree, you might modify the algorithm to return a tree together with a boolean indicating whether anything has changed. The algorithm could then check those booleans before allocating a new node to see whether the old node can safely be reused. However, people don't usually bother with this extra check unless and until they have evidence that the extra memory allocation is actually a problem.
A particular implementation of what you're looking for can be found in Clojure's (and ClojureScript's) transients.
In short, given a fully-immutable, persistent data structure, a transient version of it will make changes using destructive (allocation-efficient) mutation, which you can flip back into a proper persistent data structure again when you're done with your performance-sensitive operations. It is only at the transition back to a persistent data structure that new roots are created (for example), thus amortizing the attendant cost over the number of logical operations you performed on the structure while it was in its transient form.
