Algorithm to make a tree from a list of paths


The task is to build a tree from a sorted list of paths. Each node is a filesystem object (file or folder).
Currently I'm using this one (pseudocode):
foreach(string path in pathList)
{
    INode currentNode = rootNode;
    StringCollection pathTokens = path.split(pathSplitter);
    foreach(pathToken in pathTokens)
    {
        if (currentNode.Children.contains(pathToken))
        {
            currentNode = currentNode.Children.find(pathToken);
        }
        else
        {
            currentNode = currentNode.Children.Add(pathToken);
        }
    }
}
pathSplitter is a \ for win and / for *nix.
Is there a more efficient way to solve that task?

The key quality of your input data is that the list of paths is sorted. Hence you can work with common prefixes between the current and previous paths quite efficiently. What you can do is maintain the last trace through the tree data structure, from its root to the leaf folder node. Then for the current path you just traverse the previous trace (i.e. process the current path relative to the last path) instead of finding the right position in the tree again and again.
When comparing the last and current path, three cases may happen:
1) Same folder
\path\to\folder\file1.txt
\path\to\folder\file2.txt
The trace remains, node for file2.txt is added.
2) New path is a subpath
\path\to\folder\file1.txt
\path\to\folder\subfolder\file2.txt
Nodes for subfolder and file2.txt are added.
3) New path is different
\path\to\folder\file1.txt
\path\to\another_folder\subfolder\file2.txt
First you need to back-track the trace to represent \path\to\. Then, nodes for another_folder, subfolder and file2.txt are added. (Note that the another_folder\subfolder\ portion may be missing completely, in which case only the file node is added after back-tracking.)
Depending on the overall characteristics and volume of the data, such an algorithm may perform faster. You could play with some formal Big-O estimations, but I think it would be faster to just test it.
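The trace idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Node` class, the `build_tree` function and the `sep` parameter are hypothetical names that do not come from the question, and the input is assumed to be sorted so that paths sharing a prefix are adjacent.

```python
class Node:
    """A filesystem object; `children` maps a name to its child node."""

    def __init__(self, name):
        self.name = name
        self.children = {}


def build_tree(sorted_paths, sep="\\"):
    """Build a tree from sorted paths, reusing the trace of the previous path."""
    root = Node("")
    trace = [root]     # nodes from the root down to the last node visited
    prev_tokens = []   # tokens of the previous path, parallel to trace[1:]
    for path in sorted_paths:
        tokens = [t for t in path.split(sep) if t]
        # Length of the common prefix with the previous path.
        common = 0
        limit = min(len(tokens), len(prev_tokens))
        while common < limit and tokens[common] == prev_tokens[common]:
            common += 1
        # Back-track the trace to the deepest folder both paths share.
        del trace[common + 1:]
        # Add (or reuse) nodes for the remaining tokens below that folder.
        for token in tokens[common:]:
            child = trace[-1].children.setdefault(token, Node(token))
            trace.append(child)
        prev_tokens = tokens
    return root
```

Only the tokens after the common prefix touch the tree; the shared part is served from `trace` without any lookup. Using `setdefault` keeps the result correct even if the input turns out not to be perfectly sorted, at the cost of one dictionary lookup per new token.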

The algorithm seems optimal to me; if I am not mistaken, the sorting of paths implies that the nodes will be generated in a depth-first sequence with respect to the tree from which they originate. This means that no unnecessary backtracking in the graph is performed. Furthermore, the algorithm is linear in the number of paths in the input and every path is processed in time linear in its length, so the overall running time is linear in the size of the input. Complexity-wise, this means that the algorithm is optimal, since it is impossible to read all paths with lower runtime complexity.

Related

How to find longest accepted word by automata?

I need to write code in Java that will find the longest word that a DFA accepts. Firstly, if there is a transition to one of the previous states (or a self-transition) on a path that leads to a final state, that means there are infinitely many words and a longest one doesn't exist (that means there is a Kleene star applied to some word). I was thinking of forming a queue by BFS, where each level is separated by null, so that when I'm iterating through the queue and come across null, the length of the word would be increased by one, but it would be hard to track the set of previous states, so I'm kind of out of ideas. If you can't code in Java, I would appreciate pseudocode or an algorithm.

I don't think this is strictly necessary, but it would not hurt the performance too terribly much in practice and might be sufficient for your needs. I would suggest, as a first pass, minimizing the DFA. This can be done in O(n log n) in terms of the number of states, using e.g. Hopcroft's algorithm. This is probably conceptually similar to what Christian Sloper suggests in the comments regarding reversing the transitions to find unproductive states; indeed, there is a minimization algorithm that does this as well, but you might be able to get away with just removing unproductive states and not minimizing here (though minimizing does make the reasoning a little easier).
Doing that is nice because it will remove all unproductive loops and combine them into a single dead state, if indeed there are any unproductive prefixes. It is easy to find the one dead state, if there is one, and remove it from the directed graph formed by the DFA's states and transitions. To do this, do either DFS or BFS and check each state you come to to see if (1) all transitions are self-loops and (2) the state is not accepting.
With the one dead state removed (if any) any loops or cycles we detect in the remaining directed graph imply there are infinitely many strings in the language, since by definition any remaining states have a path to acceptance. If we find a loop or cycle, we know the language is infinite, and can respond accordingly.
If there are no loops or cycles remaining after removing the dead state from the minimal DFA, what remains is a directed acyclic graph rooted at the start state whose sinks are accepting states (states may be shared between paths, so it is not strictly a tree, but it can be walked like one). Therefore, the length of the longest string accepted is the length (in edges) of the longest path from the start state to a sink; so basically the height of the tree or something close to it (depending on how you define depth/height, whether edges or nodes). You can take any old algorithm for finding the depth and modify it so that in addition to returning the depth, it returns the string corresponding to the deepest subtree, so you can get the string without having to go back through the tree. Something like this:
GetLongestStringInTree(root)
    if root is null return ""
    result = ""
    maxlen = 0
    for each transition out of root
        child = transition.target
        symbol = transition.symbol
        str = symbol + GetLongestStringInTree(child)
        if str.length > maxlen then
            maxlen = str.length
            result = str
    return result
This could be pretty easily modified to find all words of maximum length by adding str to a collection if its length is equal to the max length so far, and emptying that collection when a new longer string is found, and returning the collection (and using the length of the first thing in the collection for checking). That can be left as an exercise; as written, this will just find some arbitrary longest string accepted by the DFA.
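A minimal Python sketch of the recursion, assuming the DFA has already been trimmed (dead state removed) and contains no cycles. The dictionary layout `{state: {symbol: target}}` and the function name `longest_accepted_string` are assumptions made for this example. Results are memoized because states can be shared between several paths.

```python
from functools import lru_cache


def longest_accepted_string(transitions, start):
    """Return a longest string spelled by a path from `start`.

    `transitions` maps each state to a dict {symbol: target_state}.
    Assumes the DFA is acyclic and every remaining state can reach
    an accepting state, so every maximal path ends in acceptance.
    """

    @lru_cache(maxsize=None)
    def longest_from(state):
        best = ""
        for symbol, target in transitions.get(state, {}).items():
            # Prepend the edge symbol to the best string below the target.
            candidate = symbol + longest_from(target)
            if len(candidate) > len(best):
                best = candidate
        return best

    return longest_from(start)
```

For very deep automata the recursion may hit Python's recursion limit; an explicit stack would avoid that.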

This problem becomes a lot simpler if you split it in two. (Sorry no java)
Step 1: Determine if there is a loop.
If there is a loop (on a path from the start state to an accepting state), arbitrarily long inputs are accepted, so there is no longest word. Detecting a loop in a directed graph can be done with DFS.
Step 2 (no loop): You now have a directed acyclic graph (DAG) and you can find the longest path using this algorithm: Longest path in Directed acyclic graph
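Both steps can be combined in a single depth-first search, sketched here in Python. The function name `longest_word`, the dictionary representation of the DFA, and the choice to return `None` for an infinite language are all assumptions of this sketch. It first discards states that are unreachable or cannot reach an accepting state, so that loops among such states are ignored.

```python
def longest_word(transitions, start, accepting):
    """Longest accepted word, or None if arbitrarily long words are accepted.

    `transitions` maps state -> {symbol: target}; `accepting` is a set of states.
    Raises ValueError if the DFA accepts nothing at all.
    """
    # States reachable from the start state.
    reachable, stack = set(), [start]
    while stack:
        state = stack.pop()
        if state not in reachable:
            reachable.add(state)
            stack.extend(transitions.get(state, {}).values())

    # States that can reach an accepting state (search on reversed edges).
    reverse = {}
    for state, edges in transitions.items():
        for target in edges.values():
            reverse.setdefault(target, set()).add(state)
    productive, stack = set(), list(accepting)
    while stack:
        state = stack.pop()
        if state not in productive:
            productive.add(state)
            stack.extend(reverse.get(state, ()))

    useful = reachable & productive
    if start not in useful:
        raise ValueError("the DFA accepts no word at all")

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {state: WHITE for state in useful}
    best = {}  # longest string from a finished state to acceptance

    def visit(state):
        colour[state] = GREY
        longest = ""
        for symbol, target in transitions.get(state, {}).items():
            if target not in useful:
                continue
            if colour[target] == GREY:
                return False  # back edge: a loop on an accepting path
            if colour[target] == WHITE and not visit(target):
                return False
            candidate = symbol + best[target]
            if len(candidate) > len(longest):
                longest = candidate
        colour[state] = BLACK
        best[state] = longest
        return True

    return best[start] if visit(start) else None
```

Because the longest-path values are filled in as each state finishes, no separate topological sort is needed.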

Search for node in unorganized binary tree?

This is a conceptual question. I have a tree where the data is stored as strings but not in alphabetical order. How do I search through the entire tree to find the node with the string I'm looking for? So far I can only search through one side of the tree.

Here is what you can do:
1. traverse the tree in any manner, say `DFS` or `BFS`
2. while traversing the nodes, keep checking whether the current node is equal to the key string you are searching for.
2.1. compare each character of your search string with each character of the current node's value.
2.2. if a match is found, process your result.
2.3. if not, continue with point 2.
3. if all the nodes are exhausted, you don't have any match. stop the algorithm.
The complexity of the above algorithm will be:
O(N) * O(M) = O(NM)
N - number of nodes in your tree.
M - length of a node's value + length of your search key's value.
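The steps above can be sketched in Python with a breadth-first traversal. `TreeNode` and `find_node` are hypothetical names used only for this illustration; the point is that both children of every node are visited, since the tree is not ordered.

```python
from collections import deque


class TreeNode:
    """A binary tree node holding a string value."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def find_node(root, key):
    """Visit every node breadth-first; return the first node whose value equals key."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            continue
        if node.value == key:
            return node
        # The tree is unordered, so both subtrees must be searched.
        queue.append(node.left)
        queue.append(node.right)
    return None  # all nodes exhausted without a match
```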

You may iterate through all tree levels and check all nodes on each level. The depth of the tree is equivalent to the number of iterations.
You may recursively go down each branch and stop all iterations when the node is found (by using an external variable or flag) or when there are no child nodes.

Data structure supporting Add and Partial-Sum

Let A[1..n] be an array of real numbers. Design an algorithm to perform any sequence of the following operations:
Add(i,y) -- Add the value y to the ith number.
Partial-sum(i) -- Return the sum of the first i numbers, i.e. A[1] + A[2] + ... + A[i].
There are no insertions or deletions; the only change is to the values of the numbers. Each operation should take O(log n) steps. You may use one additional array of size n as a work space.
How to design a data structure for above algorithm?

Construct a balanced binary tree with n leaves; stick the elements along the bottom of the tree in their original order.
Augment each node in the tree with "sum of leaves of subtree"; a tree has #leaves-1 internal nodes so this takes O(n) setup time (which we have).
Querying a partial-sum goes like this: Descend the tree towards the query (leaf) node, but whenever you descend right, add the subtree-sum of the left child you skipped; when you reach the leaf, add its own value, since all those elements are in the sum.
Modifying a value goes like this: Find the query (leaf) node. Calculate the difference you added. Travel to the root of the tree; as you travel to the root, update each node you visit by adding in the difference (you may need to visit adjacent nodes, depending on whether you're storing "sum of leaves of subtree" or "sum of left-subtree plus myself" or some variant); the main idea is that you appropriately update all the augmented branch data that needs updating, and that data will be on the root path or adjacent to it.
The two operations take O(log(n)) time (that's the height of the tree), and you do O(1) work at each node.
You can probably use any search tree (e.g. a self-balancing binary search tree might allow for insertions, others for quicker access) but I haven't thought that one through.
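One way to sketch the "sum of leaves of subtree" tree in Python is to store it implicitly in an array, with node k having children 2k and 2k+1 and the leaves occupying the last n slots. The class name `SumTree` and this bottom-up layout are choices made for this example; the answer describes a top-down descent, which visits the same kind of nodes in the opposite direction. Indices are 1-based to match the question.

```python
class SumTree:
    """Balanced tree of subtree sums stored in an array of length 2n."""

    def __init__(self, values):
        self.n = len(values)
        # Slots n..2n-1 hold the leaves; slots 1..n-1 hold internal sums.
        self.tree = [0.0] * self.n + list(values)
        for node in range(self.n - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def add(self, i, y):
        """Add y to the i-th number (1-based) and to every ancestor's sum."""
        node = self.n + i - 1
        while node >= 1:
            self.tree[node] += y
            node //= 2

    def partial_sum(self, i):
        """Return A[1] + ... + A[i] by combining O(log n) subtree sums."""
        lo, hi = self.n, self.n + i  # half-open range of leaf slots
        total = 0.0
        while lo < hi:
            if lo & 1:          # lo is a right child: take it and move right
                total += self.tree[lo]
                lo += 1
            if hi & 1:          # hi - 1 is a left child: take it
                hi -= 1
                total += self.tree[hi]
            lo //= 2
            hi //= 2
        return total
```

Each loop moves one level up the tree, so both operations take O(log n) steps.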

You may use a Fenwick tree (binary indexed tree).
See this question
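For reference, a minimal Fenwick tree sketch in Python. The class and method names are chosen for this example; indices are 1-based as in the question, and the structure needs exactly one extra array of size n (plus an unused slot 0).

```python
class FenwickTree:
    """Binary indexed tree supporting point add and prefix sum in O(log n)."""

    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)  # slot 0 is unused

    def add(self, i, y):
        """Add y to the i-th number (1-based)."""
        while i <= self.n:
            self.tree[i] += y
            i += i & -i  # jump to the next slot whose range covers i

    def partial_sum(self, i):
        """Return the sum of the first i numbers."""
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return total
```

An existing array can be loaded by calling `add(i, A[i])` for each element, which takes O(n log n) in total.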

Finding the width of a directed acyclic graph... with only the ability to find parents

I'm trying to find the width of a directed acyclic graph... as represented by an arbitrarily ordered list of nodes, without even an adjacency list.
The graph/list is for a parallel GNU Make-like workflow manager that uses files as its criteria for execution order. Each node has a list of source files and target files. We have a hash table in place so that, given a file name, the node which produces it can be determined. In this way, we can figure out a node's parents by examining the nodes which generate each of its source files using this table.
That is the ONLY ability I have at this point, without changing the code severely. The code has been in public use for a while, and the last thing we want to do is to change the structure significantly and have a bad release. And no, we don't have time to test rigorously (I am in an academic environment). Ideally we're hoping we can do this without doing anything more dangerous than adding fields to the node.
I'll be posting a community-wiki answer outlining my current approach and its flaws. If anyone wants to edit that, or use it as a starting point, feel free. If there's anything I can do to clarify things, I can answer questions or post code if needed.
Thanks!
EDIT: For anyone who cares, this will be in C. Yes, I know my pseudocode is in some horribly botched Python look-alike. I'm sort of hoping the language doesn't really matter.

I think the "width" you're considering here isn't really what you want - the width depends on how you assign levels to each node where you have some choice. You noticed this when you were deciding whether to assign all sources to level 0 or all sinks to the max level.
Instead, you just want to count the number of nodes and divide by the "critical path length", which is the longest path in the dag. This gives the average parallelism for the graph. It depends only on the graph itself, and it still gives you an indication of how wide the graph is.
To compute the critical path length, just do what you're doing - the critical path length is the maximum level you end up assigning.
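A small Python sketch of this measure, using only a "give me the parents of this node" lookup. The function name `average_parallelism` and the `parents_of` callback are hypothetical stand-ins for the hash-table lookup described in the question. The critical path is counted in nodes here (sources get level 1), which is an assumption of the sketch: a simple chain then has parallelism 1.

```python
def average_parallelism(nodes, parents_of):
    """Number of nodes divided by the critical path length (counted in nodes).

    `nodes` is a non-empty collection of hashable nodes and
    `parents_of(node)` returns an iterable of that node's parents.
    """
    level = {}

    def level_of(node):
        # Longest chain of ancestors ending at this node, memoized.
        if node not in level:
            level[node] = 1 + max(
                (level_of(parent) for parent in parents_of(node)), default=0
            )
        return level[node]

    critical_path = max(level_of(node) for node in nodes)
    return len(nodes) / critical_path
```

The recursion depth equals the critical path length, so a C version or a very deep graph would want an explicit stack instead.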

In my opinion, when you're doing this type of last-minute development, it's best to keep the new structures separate from the ones you are already using. At this point, if I were pressed for time I would go for a simpler solution.
1. Create an adjacency matrix for the graph using the parent data (should be easy).
2. Perform a topological sort using this matrix (or even use tsort if pressed for time).
3. Now that you have a topological sort, create an array level, one element for each node.
4. For each node, in topological order:
4.1. If the node has no parents, set its level to 0.
4.2. Otherwise set it to the maximum of the levels of its parents, plus 1.
5. Count the nodes on each level and find the maximum level width.
The question is as Keith Randall asked, is this the right measurement you need?

Here's what I (Platinum Azure, the original author) have so far.
Preparations/augmentations:
Add "children" field to linked list ("DAG") node
Add "level" field to "DAG" node
Add "children_left" field to "DAG" node. This is used to make sure that all children are examined before a parent is examined (in a later stage of the algorithm).
Algorithm:
Find the number of immediate children for all nodes; also, determine leaves by adding nodes with children==0 to list.
for l in L:
    l.children = 0
for l in L:
    l.level = 0
    for p in l.parents:
        ++p.children
Leaves = []
for l in L:
    l.children_left = l.children
    if l.children == 0:
        Leaves.append(l)
Assign every node a "reverse depth" level. Normally by depth, I mean topologically sort and assign depth=0 to nodes with no parents. However, I'm thinking I need to reverse this, with depth=0 corresponding to leaves. Also, we want to make sure that no node is added to the queue without all its children "looking at it" first (to determine its proper "depth level").
max_level = 0
while !Leaves.empty():
    l = Leaves.pop()
    for p in l.parents:
        --p.children_left
        if p.children_left == 0:
            /* we only want to append parents with for sure correct level */
            Leaves.append(p)
        p.level = Max(p.level, l.level + 1)
        if p.level > max_level:
            max_level = p.level
Now that every node has a level, simply create an array and then go through the list once more to count the number of nodes in each level.
level_count = new int[max_level+1]
for l in L:
    ++level_count[l.level]
width = Max(level_count)
So that's what I'm thinking so far. Is there a way to improve on it? It's linear time all the way, but it's got like five or six linear scans and there will probably be a lot of cache misses and the like. I have to wonder if there isn't a way to exploit some locality with a better data structure-- without actually changing the underlying code beyond node augmentation.
Any thoughts?
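For experimentation, the approach above can be written as runnable Python. This is a sketch, not the project's C code: `dag_width` and the `parents_of` callback are names invented here, and the per-node fields are kept in dictionaries instead of being added to the node structure.

```python
from collections import Counter


def dag_width(nodes, parents_of):
    """Width of a DAG under 'reverse depth' levelling (leaves are level 0).

    `nodes` is a non-empty collection of hashable nodes and
    `parents_of(node)` returns an iterable of that node's parents.
    """
    # Count the immediate children of every node.
    children_left = {node: 0 for node in nodes}
    for node in nodes:
        for parent in parents_of(node):
            children_left[parent] += 1

    level = {node: 0 for node in nodes}
    ready = [node for node in nodes if children_left[node] == 0]  # the leaves
    while ready:
        node = ready.pop()
        for parent in parents_of(node):
            level[parent] = max(level[parent], level[node] + 1)
            children_left[parent] -= 1
            # A parent is queued only after all its children were processed,
            # so its level is final by then.
            if children_left[parent] == 0:
                ready.append(parent)

    # Count the nodes on each level and take the largest count.
    return max(Counter(level.values()).values())
```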

A data structure based on the R-Tree: creating new child nodes when a node is full, but what if I have a lot of objects at the exact same position?

I realize my title is not very clear, but I am having trouble thinking of a better one. If anyone wants to correct it, please do.
I'm developing a data structure for my 2 dimensional game with an infinite universe. The data structure is based on a simple (!) node/leaf system, like the R-Tree.
This is the basic concept: you set the maximum number of children you want a node (a container) to have. If you want to add a leaf, but the node the leaf should be in is full, then it will create a new set of nodes within this node and move all current leaves to their new (more exact) node. This way, very populated areas will have a lot more subdivisions than a very big but rarely visited area.
This works for normal objects. The only problem arises when I have more than maxChildsPerNode objects with the exact same X,Y location: because the node is full, it will create more exact subnodes, but the old leaves will all be put in the exact same node again because they have the exact same position -- resulting in an infinite loop of creating more nodes and more nodes.
So, what should I do when I want to add more leaves than maxChildsPerNode with the exact same position to my tree?
PS. if I failed to explain my problem, please tell me, so I can try to improve the explanation.
Update: this is how I check if all leafs in a full node have identical positions:
//returns whether all leafs in the given leaf list have identical positions
private function allLeafsHaveSamePos(leafArr:Array<RTLeaf<T>>):Bool {
    if (leafArr.length > 1) {
        var lastLeafTopX:Float = leafArr[0].topX;
        var lastLeafTopY:Float = leafArr[0].topY;
        for (i in 1...leafArr.length) {
            if ((leafArr[i].topX != lastLeafTopX) || (leafArr[i].topY != lastLeafTopY)) return false;
        }
    }
    return true;
}

I would like to ask a question...
is it that important that the maxChildsPerNode constraint be respected?
I would rather think of this maximum as a hint to the structure for when to split, and simply ignore it when there is no meaningful way to perform the split.
Of course you'd better rethink the name then, otherwise it'd be odd for the next maintainer.
In pseudo code I would use something like this:
def AddToNode(node, item):
    node.items.append(item)
    if len(node.items) > node.splitHint:
        leftNode = Node(node.splitHint)
        rightNode = Node(node.splitHint)
        node.split(leftNode, rightNode)
        if len(leftNode.items) == 0 or len(rightNode.items) == 0:
            node.splitHint *= 1.5 # famous golden ratio ;)
        else:
            node.items = [leftNode, rightNode]
The important step is to modify the hint when it's detected that we can't abide by it, so that we don't perform this check at each insertion (this way we obtain a constant amortized cost).

It looks like a bit of a mismatch between your data and your structure, since you have a structure that assumes N objects within an arbitrarily large area when you're supplying it with >N objects on an infinitely small point. It might be worth using a different structure for this data?
Hack fix: apply a tiny random displacement to your newly created objects. This should allow the space to be subdivided by the existing algorithm.
Better fix: ensure that your algorithm for splitting a leaf node always generates at least 2 new leaf nodes to begin with. When reassigning objects to the new leaf nodes, or when performing normal insertions, iterate over all the candidates, and if more than one is equally suitable then you can tie-break based on how full they are. This should result in your co-located players ending up split evenly across the otherwise identical nodes.

From common sense I'd not assume having two objects in the same position ever, but if this is a part of the idea, then I would introduce one more axis, say 'spin', an integer number, and impose a restriction that all your objects are fermions (cannot have the same location and spin at the same time).

If you have a set of objects on the exact same spot, any query for a region that contains one should return all - so there's no reason to split them, as it doesn't gain you anything. Simply either count the number of distinct locations when deciding to split, or have each element on the leaf node be an object that encapsulates (coordinates, [list of objects at those coordinates]).
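The second suggestion can be sketched in Python as a leaf that groups objects by coordinates and bases its split decision on the number of distinct positions. `LeafBucket`, `max_distinct_positions` and the method names are invented for this illustration and are not part of the asker's code.

```python
from collections import defaultdict


class LeafBucket:
    """Leaf node that stores (coordinates, [objects at those coordinates])."""

    def __init__(self, max_distinct_positions):
        self.max_distinct_positions = max_distinct_positions
        self.objects_at = defaultdict(list)  # (x, y) -> list of objects

    def add(self, x, y, obj):
        """Store obj at position (x, y)."""
        self.objects_at[(x, y)].append(obj)

    def needs_split(self):
        # Only distinct positions count: co-located objects can never be
        # separated by a finer subdivision, so they must not trigger one.
        return len(self.objects_at) > self.max_distinct_positions

    def query(self, x, y):
        """Return every object stored at exactly (x, y)."""
        return list(self.objects_at.get((x, y), ()))
```

With this rule, any number of objects on the same spot occupy a single slot, so the endless subdivision described in the question cannot occur.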
