Algorithm for building a tree depending on node attributes

I am trying to solve a programming problem where I need to implement the following algorithm (roughly):
There are a number of nodes, e.g. A, B, C, etc.
Every node can have multiple items in it, e.g. a, b, c, x, y, z, etc. For example,
A [a, b, c, x, y, z]
B [a, b, c]
C [x, y, z]
There can be any number of nodes and items, and a node can have any number of items in it (but the same item won't repeat within a node).
What I have to do is create a hierarchy among the nodes depending on the common items inside the nodes. So, in the above example, A should be higher in the hierarchy than B and C. In other words, A is the master and B and C are the slaves.
So, I was thinking that if I can make a tree from the nodes depending on the common items, it will be easier for me. But I don't know which algorithm to use. Does anybody know which one would be suitable for my case? Building a tree is not mandatory; if there are other ways to achieve the same thing, that will be okay. Thanks.
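The containment relation described in the question can be sketched directly: if one node's item set strictly contains another's, the first is the master. A minimal sketch (the name `build_hierarchy` is hypothetical, not from any answer below):

```python
def build_hierarchy(nodes):
    # nodes: dict mapping node name -> set of items
    # emit an edge (master, slave) whenever one node's item set
    # strictly contains another's
    edges = []
    for master, m_items in nodes.items():
        for slave, s_items in nodes.items():
            if master != slave and s_items < m_items:  # strict subset
                edges.append((master, slave))
    return edges

nodes = {"A": set("abcxyz"), "B": set("abc"), "C": set("xyz")}
print(build_hierarchy(nodes))  # A is master of both B and C
```

This pairwise check is O(n^2) in the number of nodes; the subset edges then form a DAG from which a tree (or forest) can be extracted.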

Try using AVL trees.
Note that the worst case for AVL trees may look something like this. You can read more about the worst case here.
Most importantly, given two 'nodes', does the logic to compare them and determine which is higher exist? If not, then that needs to be built first!
Once you know how to compare, then AVL trees can be used to build and maintain the 'hierarchy'.

I have adapted the algorithm proposed in the paper "Data Mining for Path Traversal Patterns in a Web Environment" by Ming-Syan Chen, Jong Soo Park and Philip S. Yu, which is available here. Although the algorithm does not directly solve my problem, a small adaptation made it fit my situation. Now it works fine and I get the result I need.
I would like to thank everyone who took the time to read my question and propose solutions.


Data structure to represent a graph

Given a couple of cities and their locations, I want to create a data structure that would represent a graph like this. The graph represents all possible paths that can be taken in order to visit every city only once:
My question is: since this is probably a very common problem, is there an algorithm or a ready-made data structure to represent this? The programming language is not important (although I would prefer Java).
Your problem seems very close to the traveling salesman problem, a classic among the classics.
As you intuited, the graph that represents all the possible solutions is indeed a tree (the path from the root to any of its leaves represents one solution).
From there, you can ask yourself several questions:
Is the first city that I'll visit an important piece of information, or is it only the order that matters? For instance, is London-Warsaw-Berlin-Łódź equivalent to Warsaw-Berlin-Łódź-London?
Usually, we consider these solutions as being equivalent to solve a TSP, but it might not be the case for you.
Did you see the link between a potential solution to the TSP and a permutation? Actually, what you're looking for is a way (and the data structure that goes with it) to generate all the permutations of a given set (your set of cities).
With these two points in mind, we can think about a way to generate such a tree. A good strategy to work with trees is to think recursively.
We have a partial solution, meaning the first k cities. The next city can then be any of the n-k remaining cities. That gives the following pseudo-code.
class TreeNode:
    def __init__(self, city=None):
        self.city = city
        self.children = []

    def add_child(self, node):
        self.children.append(node)

def get_all_permutations(node, not_visited):
    for city in not_visited:
        new_node = TreeNode(city)
        node.add_child(new_node)
        # recurse on the cities not visited yet
        get_all_permutations(new_node, not_visited - {city})
This will build the tree recursively.
Depending on your answer to the first point I mentioned (about the importance of the first city), you might want to assign a city to the root node, or not.
Some good topics to look into, if you want to go further with this kind of problem, are enumeration algorithms and recursive algorithms. They're generally a good option when your goal is to enumerate all the elements of a set. But they're also generally an inefficient way to solve problems (for example, solving the TSP with this algorithm is a very inefficient approach; there are much, much better ones).
This tree is bad: it contains redundant data. For instance, the connection between nodes 2 and 4 occurs three times in the tree. You want a "structure" that automatically gives the solution to your problem, so that it's easier for you, but that's not how problem solving works. Input data is one set of data, output data is another; they may appear similar, but they can also be quite different.
One simple matrix with one triangle empty and the other containing data should hold all the information you need. The coordinates of the matrix are the nodes; the cells are the distances. This is your input data.
What you do with this matrix in your code is a different matter. Maybe you want to write all possible paths. Then write them. Use input data and your code to produce output data.
What you are looking for is actually a generator of all permutations. If you keep one city fixed as the first one (London, in your diagram), then you need to generate all permutations of the list of all your remaining nodes (Warsaw, Łódź, Berlin).
Such an algorithm is often written recursively, by looping over all elements, taking one out and recursing on the remaining elements. Libraries are often used to achieve this, e.g. itertools.permutations in Python.
Each permutation generated this way should then be put into the resulting graph you originally wanted. For this you can use any graph representation you like, e.g. a nested dictionary structure:
{ a: { b: { c: d,
            d: c },
       c: { b: d,
            d: b },
       d: { b: c,
            c: b } } }
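The recursive generation described above is exactly what itertools.permutations does. A sketch with London fixed as the first city, as in the question's diagram:

```python
from itertools import permutations

cities = ["Warsaw", "Lodz", "Berlin"]  # London fixed as the first city

# every ordering of the remaining cities is one root-to-leaf path
routes = [("London",) + p for p in permutations(cities)]
for route in routes:
    print(" -> ".join(route))
```

Each generated tuple corresponds to one root-to-leaf path of the tree; turning the list of routes into the nested-dictionary structure above is then a simple loop.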

Cost of building a "connected matrix"

I'm sure there is an abundance of information on how to do exactly what I'm after, but it's a matter of not knowing the technical term for it. Basically what I want to create is an adjacency matrix for a directed graph, however rather than simply storing whether or not each vertex pair has a direct adjacency, for every vertex pair in the matrix I want to store if there is ANY path connecting the two (and what those paths are).
This would give me constant time complexity for lookups which is desirable, however what's not immediately clear to me is what the expected optimal time complexity of building this matrix will be.
Also, is there a formal name for such a matrix?
Playing this out in my head, it seems like a dynamic programming problem. If I want to know if A is connected to Z, I should be able to ask each of A's neighbors, B, C and D, if they are (in some way) connected to Z, and if any of them is, then I know A is. And if B doesn't have this answer stored, it would ask the same question of its own direct neighbors, and so on. I would memoize the results along the way, so subsequent lookups would be constant.
I haven't spent time implementing this yet, because it feels like ϴ(n^n) to build a complete matrix, so my question is whether I'm going about this the right way, and whether there is a lower-cost way to build such a matrix.
The transitive closure of a graph (https://en.wikipedia.org/wiki/Transitive_closure#In_graph_theory) can indeed be computed by dynamic programming with a variation of Floyd Warshall algorithm: https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm.
Using |V| DFS (or BFS) is more efficient, though.
Using networkx connected components
import networkx as nx

G = nx.path_graph(4)
nx.add_path(G, [10, 11, 12])
d = {}
for idx, group in enumerate(nx.connected_components(G)):
    for node in group:
        d[node] = idx

def connected(node1, node2):
    return d[node1] == d[node2]
Generation should be O(N) and lookup should be O(1). Note that connected components treat the graph as undirected; for directed reachability you would need the transitive closure instead.

Is there a good data structure that performs find, union, and de-union?

I am looking for a data structure that can support union, find, and de-union fairly efficiently (everything O(log n) or better), as a standard disjoint-set structure doesn't support de-unioning. As background, I am writing a Go AI with MCTS [http://en.wikipedia.org/wiki/Monte_Carlo_tree_search], and this would be used to keep track of groups of stones as they connect and are disconnected during backtracking. I think this might make things easier, as the de-union is never on some arbitrary object in the set; it is always an "undo" of the latest union.
I have read through the following paper and, while I could implement the proposed data structure, it seems a bit overkill and would take a while to implement:
http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1773&context=cstech
While O(α(n)) would be great, of course, I'm pretty sure path compression won't work with de-union, and I'd be happy with O(log n). My gut tells me a solution might be heap-related, but I haven't been able to figure anything out.
What you're describing is sometimes called the union-find-split problem, but most modern data structures for it (or at least, the ones that I know of) usually view this problem differently. Think about every element as being a node in a forest. You then want to be able to maintain the forest under the operations
link(x, y), which adds the edge (x, y),
find(x), which returns a representative node for the tree containing x, and
cut(x, y), which cuts the edge from x to y.
These data structures are often called dynamic trees or link-cut trees. To the best of my knowledge, there are no efficient data structures that match the implementation simplicity of the standard union-find data structure. Two data structures that might be helpful for your case are the link/cut tree (also called the Sleator-Tarjan tree or ST-tree) and the Euler-tour tree (also called the ET-tree), both of which can perform all of the above operations in time O(log n) each.
Hope this helps!
The other answer is over-complicated. You can use a standard union-find data structure where, every time you set parent[x] = y, you push (x, old_parent) onto a stack. On backtracking, you just restore the old value.
You can do the same thing with path compression, but the overhead may not pay off. Also, if you make multiple modifications per union call, you need to push separators onto the stack so you know when to stop undoing.
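The stack-based undo can be sketched like this (a minimal version without path compression, so each union touches exactly one parent pointer; adding union by rank, and recording rank changes on the stack too, keeps find at O(log n)):

```python
class UndoableUnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
        self.history = []  # stack of (node, old_parent) pairs

    def find(self, x):
        # no path compression, so unions stay trivially reversible
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        self.history.append((rx, self.parent[rx]))
        if rx != ry:
            self.parent[rx] = ry

    def undo(self):
        # revert the most recent union (a no-op union is also recorded)
        node, old_parent = self.history.pop()
        self.parent[node] = old_parent
```

Since MCTS backtracking always undoes the latest union first, the stack discipline matches the use case exactly.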

Using nondeterminism to detect cliques?

I am trying to understand non-determinism with the clique-problem.
In computer science, the clique problem refers to any of the problems related to
finding particular complete subgraphs ("cliques") in a graph, i.e., sets of
elements where each pair of elements is connected.
Say I have a graph with nodes A, B, C, D, E, F, and I want to decide whether a clique of size 4 exists.
My understanding of non-determinism is to make a guess by taking four nodes (say B, C, D, F) and checking whether a connection exists between every pair of the four nodes. If it does, I conclude that a clique exists; if it doesn't, I conclude that a clique does not exist.
What I am not sure of, however, is how this helps solve the problem, as I might just have made the wrong choice.
I guess I am trying to understand the application of non-determinism in general.
Nondeterministic choices are different from random or arbitrary choices. When using nondeterminism, if any possible choice that can be made will lead to the algorithm outputting YES, then one of those choices will be selected. If no choice exists that does this, then an arbitrary choice will be made.
If this seems like cheating, in a sense it is. It's unknown how to implement nondeterminism efficiently using a deterministic computer, a randomized algorithm, or parallel computers that have lots of processors but which can only do a small amount of work on each core. These are the P = NP, BPP = NP, and NC = NP questions, respectively. Accordingly, nondeterminism is primarily a theoretical approach to problem solving.
Hope this helps!
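The part a deterministic machine can do cheaply is verifying a guessed certificate: given four candidate nodes, checking all pairwise connections is polynomial. A sketch (the graph below is a made-up example, not from the question):

```python
from itertools import combinations

def is_clique(graph, nodes):
    # graph: dict mapping each vertex to the set of its neighbours
    # check every unordered pair among the candidate nodes
    return all(v in graph[u] for u, v in combinations(nodes, 2))

graph = {
    "A": {"B"},
    "B": {"A", "C", "D", "F"},
    "C": {"B", "D", "F"},
    "D": {"B", "C", "F"},
    "E": set(),
    "F": {"B", "C", "D"},
}
print(is_clique(graph, ["B", "C", "D", "F"]))  # True
```

A deterministic simulation of the nondeterministic guess simply runs this check over all C(n, 4) subsets, which is where the exponential blow-up (for cliques of unbounded size) comes from.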

Comparison-based ranking algorithm

I would like to rank or sort a collection of items (with a size potentially greater than 100,000) where the items have no intrinsic (comparable) value; instead, all I have are comparisons between pairs of items, provided by users in a subjective manner.
Example: Consider a collection with elements [a, b, c, d] and comparisons by users b > a, a > d, d > c. The correct order of this collection would be [b, a, d, c].
This example is simple, however there could be more complicated cases:
Since the comparisons are subjective, a user could also say that c > b, which would conflict with the ordering above.
Also, you may not have comparisons that "connect" all the items, e.g. only b > a and d > c. In that case the ordering is ambiguous: it could be [b, a, d, c] or [d, c, b, a]. Here, either ordering is acceptable.
If possible it would be nice to somehow take into account multiple instances of the same comparison and give those with higher occurrences more weight. But a solution without this condition would still be acceptable.
A similar application of this algorithm was used by Zuckerberg's FaceMash application where he ranked people based on comparisons (if I understood it correctly), but I have not been able to find what that algorithm actually was.
Is there an algorithm which already exists that can solve the problem above? I would not like to spend effort trying to come up with one if that is the case. If there is no specific algorithm, is there perhaps certain types of algorithms or techniques which you can point me to?
This is a problem that has already occurred in another arena: competitive games! Here, too, the goal is to assign each player a global "rank" on the basis of a series of 1 vs. 1 comparisons. The difficulty, of course, is that the comparisons are not transitive (I take "subjective" to mean "provided by a human being" in your question). Kasparov beats Fischer beats (don't know another chess player!) Bob beats Kasparov, potentially.
This renders useless the algorithms that rely on transitivity (i.e. a > b and b > c => a > c), as you will likely end up with a highly cyclic graph.
Several rating systems have been devised to tackle this problem.
The most well-known system is probably the Elo algorithm/score for competitive chess players. Its descendants (for instance, the Glicko rating system) are more sophisticated and take into account statistical properties of the win/loss record---in other words, how reliable is a rating? This is similar to your idea of weighting more heavily records with more "games" played. Glicko also forms the basis for the TrueSkill system used on Xbox Live for multiplayer video games.
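For reference, the core of an Elo-style update fits in a few lines (the 400-point scale and K = 32 are conventional chess defaults, assumptions here rather than anything from the question):

```python
def elo_update(winner, loser, k=32):
    # expected score of the winner under the logistic Elo model
    expected = 1 / (1 + 10 ** ((loser - winner) / 400))
    delta = k * (1 - expected)
    return winner + delta, loser - delta

print(elo_update(1500, 1500))  # → (1516.0, 1484.0)
```

Beating a higher-rated opponent moves ratings more than beating a lower-rated one, which is how repeated comparisons get weighted automatically.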
You may be interested in the minimum feedback arc set problem. Essentially the problem is to find the minimum number of comparisons that "go the wrong way" if the elements are linearly ordered in some ordering. This is the same as finding the minimum number of edges that must be removed to make the graph acyclic. Unfortunately, solving the problem exactly is NP-hard.
A couple of links that discuss the problem:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.8157&rep=rep1&type=pdf
http://en.wikipedia.org/wiki/Feedback_arc_set
I googled this; look for chapter 12.3, "Topological sorting and Depth-first Search":
http://www.cs.cmu.edu/~avrim/451f09/lectures/lect1006.pdf
Your set of relations describes a directed (hopefully acyclic) graph, and so graph topological sorting is exactly what you need.
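If the comparison graph really is acyclic, a depth-first topological sort recovers an order consistent with every comparison. A sketch using the example from the question (b > a, a > d, d > c):

```python
def topo_sort(graph):
    # graph: dict mapping x -> list of y such that x > y
    order, seen = [], set()

    def visit(u):
        if u not in seen:
            seen.add(u)
            for v in graph.get(u, []):
                visit(v)
            order.append(u)  # appended only after everything u beats

    for u in graph:
        visit(u)
    return order[::-1]  # largest first

print(topo_sort({"b": ["a"], "a": ["d"], "d": ["c"]}))  # → ['b', 'a', 'd', 'c']
```

With cycles present (the conflicting c > b case above), this simple version no longer applies; that is where the feedback-arc-set and rating-system approaches from the other answers come in.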
