Given a few cities and their locations, I want to create a data structure that represents a graph like this. The graph represents all possible paths that can be taken in order to visit every city exactly once:
My question is, since this is probably a very common problem, is there an algorithm or a ready-made data structure to represent this? The programming language is not important (although I would prefer Java).
Your problem seems very close to the traveling salesman problem (TSP), a classic among the classics.
As you intuited, the graph that represents all the possible solutions is indeed a tree (each path from the root to one of its leaves represents a solution).
From there, you can ask yourself several questions:
Is the first city that I'll visit an important piece of information, or is it only the order that matters? For instance, is London-Warsaw-Berlin-Łódź equivalent to Warsaw-Berlin-Łódź-London?
Usually, we consider these solutions equivalent when solving a TSP, but it might not be the case for you.
Did you see the link between a potential solution to the TSP and a permutation? Actually, what you're looking for is a way (and the data structure that goes with it) to generate all the permutations of a given set (your set of cities).
With these two points in mind, we can think about a way to generate such a tree. A good strategy to work with trees is to think recursively.
We have a partial solution, meaning the first k cities. Then the next city can be any of the n-k remaining cities. That gives the following pseudo-code.
    get_all_permutations(TreeNode node, Set<int> not_visited) {
        for (city in not_visited) {
            new_set = copy(not_visited);
            new_set.remove(city);
            new_node = new TreeNode();
            new_node.city = city;
            node.add_child(new_node);
            get_all_permutations(new_node, new_set);
        }
    }
This will build the tree recursively.
Depending on your answer to the first point I mentioned (about the importance of the first city), you might want to assign a city to the root node, or not.
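For what it's worth, here is a minimal runnable sketch of the same recursion in Python. The `TreeNode` class, the `paths` helper and the choice of London as the root are my own assumptions for illustration; the recursion itself follows the pseudo-code above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    city: Optional[str] = None  # None would mark an unlabeled root
    children: List["TreeNode"] = field(default_factory=list)


def build_permutation_tree(node, not_visited):
    """Attach one child per unvisited city, then recurse on the rest."""
    for city in sorted(not_visited):
        child = TreeNode(city)
        node.children.append(child)
        build_permutation_tree(child, not_visited - {city})


def paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of city names."""
    if node.city is not None:
        prefix = prefix + (node.city,)
    if not node.children:
        yield prefix
    for child in node.children:
        yield from paths(child, prefix)


# Assumption: the first city matters, so it is assigned to the root.
root = TreeNode("London")
build_permutation_tree(root, frozenset({"Warsaw", "Berlin", "Łódź"}))
all_paths = list(paths(root))
```

With three remaining cities the tree has 3! = 6 leaves, one per permutation. If the first city does not matter to you, start from `TreeNode()` and pass all four cities as `not_visited`.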
Some good topics to look into, if you want to go further with this kind of problem, are enumeration algorithms and recursive algorithms. They're generally a good option when your goal is to enumerate all the elements of a set, but they're also generally an inefficient way to solve problems. For example, solving the TSP by enumerating every permutation takes time proportional to n!, and there are much better approaches.
This tree is bad. There is redundant data in it. For instance, the connection between nodes 2 and 4 occurs three times in the tree. You want a "structure" that automatically gives the solution to your problem, so that it's easier for you, but that's not how problem solving works. Input data is one set of data, output data is another set of data, and they can appear similar, but they can also be quite different.
One simple matrix with one triangle empty and the other containing data should have all the information you need. Coordinates of the matrix are nodes, cells are distances. This is your input data.
What you do with this matrix in your code is a different matter. Maybe you want to write all possible paths. Then write them. Use input data and your code to produce output data.
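A rough sketch of that split between input and output data, in Python. The city names and planar coordinates are made up for illustration, and `distance` and `path_length` are hypothetical helpers, not anything from the question.

```python
import math
from itertools import permutations

# Hypothetical planar coordinates, for illustration only.
cities = {"A": (0, 0), "B": (3, 4), "C": (6, 8), "D": (0, 8)}
names = sorted(cities)

# Input data: lower triangle only, dist[i][j] is defined for j < i.
dist = [[math.dist(cities[names[i]], cities[names[j]]) for j in range(i)]
        for i in range(len(names))]


def distance(i, j):
    """Look up a symmetric distance from the triangular matrix."""
    if i == j:
        return 0.0
    return dist[i][j] if j < i else dist[j][i]


def path_length(order):
    """Sum the distances between consecutive cities in the given order."""
    return sum(distance(a, b) for a, b in zip(order, order[1:]))


# Output data: every path that starts at city 0, mapped to its length.
routes = {(0,) + p: path_length((0,) + p)
          for p in permutations(range(1, len(names)))}
```

The matrix stores each distance once, and the list of paths is produced from it by code rather than stored alongside it.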
What you are looking for is actually a generator of all permutations. If you keep one city fixed as the first one (London, in your diagram), then you need to generate all permutations of the list of all your remaining nodes (Warsaw, Łódź, Berlin).
Often such an algorithm is written recursively: loop over all elements, take one out, and recurse on the remaining elements. Often a library is used for this, e.g. itertools.permutations in Python.
Each permutation generated this way should then be put into the resulting graph you originally wanted. For this you can use any graph representation you like, e.g. a nested dictionary structure:
    { a: { b: { c: d,
                d: c },
           c: { b: d,
                d: b },
           d: { b: c,
                c: b } } }
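As a sketch of how the permutations could be folded into such a nested dictionary, assuming Python and `itertools.permutations`. The function name `permutation_tree` is mine, and unlike the structure above, every level here is a dict (leaves are empty dicts) so that the code stays uniform.

```python
from itertools import permutations


def permutation_tree(first, others):
    """Nest dicts so that each key is a city and each value is the subtree
    of cities that can follow it; leaves are empty dicts."""
    tree = {first: {}}
    for perm in permutations(others):
        level = tree[first]
        for city in perm:
            # Reuse the existing branch if an earlier permutation made it.
            level = level.setdefault(city, {})
    return tree


tree = permutation_tree("a", ["b", "c", "d"])
```

Shared prefixes are stored once, so `tree["a"]["b"]` holds both orders that start with a, b.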
I'm sure there is an abundance of information on how to do exactly what I'm after, but it's a matter of not knowing the technical term for it. Basically what I want to create is an adjacency matrix for a directed graph, however rather than simply storing whether or not each vertex pair has a direct adjacency, for every vertex pair in the matrix I want to store if there is ANY path connecting the two (and what those paths are).
This would give me constant time complexity for lookups which is desirable, however what's not immediately clear to me is what the expected optimal time complexity of building this matrix will be.
Also, is there a formal name for such a matrix?
Playing this out in my head, it seems like a dynamic programming problem. If I want to know if A is connected to Z, I should be able to ask each of A's neighbors, B, C and D if they are (in some way) connected to Z, and if so, then I know A is. And if B doesn't have this answer stored, then he would ask the same question of his direct neighbors, and so on. I would memoize the results along the way, so subsequent lookups would be constant.
I haven't spent time to implement this yet, because it feels like ϴ(n^n) to build a complete matrix, so my question is whether or not I'm going about this the right way, and if indeed there is a lower-cost way to build such a matrix?
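What you describe is called the transitive closure of a graph (https://en.wikipedia.org/wiki/Transitive_closure#In_graph_theory). It can indeed be computed by dynamic programming with a variation of the Floyd-Warshall algorithm: https://en.wikipedia.org/wiki/Floyd%E2%80%93Warshall_algorithm.
Running a DFS (or BFS) from each of the |V| vertices is usually more efficient, though: that costs O(|V|·(|V|+|E|)) in total, versus O(|V|^3) for Floyd-Warshall, which is a clear win on sparse graphs.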
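A minimal sketch of the one-BFS-per-vertex approach in Python, assuming the directed graph is given as adjacency lists in a dict. The function name and the example graph are hypothetical. Note that it only records whether some path exists, not the paths themselves.

```python
from collections import deque


def transitive_closure(graph):
    """Map each vertex to the set of vertices reachable from it by a path
    of length >= 1, using one BFS per vertex."""
    vertices = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    closure = {}
    for source in vertices:
        seen = set()
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        closure[source] = seen
    return closure


# Hypothetical directed graph as adjacency lists.
graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["B"]}
reach = transitive_closure(graph)
```

After the build, `"Z" in reach["A"]` is an expected constant-time lookup. A vertex appears in its own set only if it lies on a cycle, as B and D do here.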
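Using networkx connected components (note that these are defined for undirected graphs):
    import networkx as nx

    G = nx.path_graph(4)
    nx.add_path(G, [10, 11, 12])  # newer networkx releases no longer have G.add_path
    d = {}
    # Label every node with the index of its connected component.
    for idx, group in enumerate(nx.connected_components(G)):
        for node in group:
            d[node] = idx

    def connected(node1, node2):
        return d[node1] == d[node2]
Generation should be O(V + E); lookup should be O(1).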
I'm trying to create a decision tree from data. I'm using the tree for a guess-the-animal kind of application. The user answers questions with yes/no and the program guesses the answer. This program is for homework.
I don't know how to create a decision tree from data. I have no way of knowing what the root node will be. The data will be different every time, so I can't do it by hand. My data is like this:
Animal1: property1, property3, property5
Animal2: property2, property3, property5, property6
Animal3: property1, property6
etc.
I searched Stack Overflow and I found the ID3 and C4.5 algorithms. But I don't know if I should use them.
Can someone tell me which algorithm I should use to build a decision tree in this situation?
"I searched Stack Overflow and I found the ID3 and C4.5 algorithms. But I don't know if I should use them."
Yes, you should. They are very commonly used decision tree algorithms, and there are some nice open source implementations of them. (Weka's J48 is an example implementation of C4.5.)
If you need to implement something from scratch, implementing a simple decision tree is fairly simple, and is done iteratively:
1. Let the set of labeled samples be S, with the set of properties P={p1,p2,...,pk}.
2. Choose a property pi.
3. Split S into two sets S1 and S2: S1 holds the samples that have pi, and S2 holds those that do not. Create two children for the current node, and move S1 and S2 to them respectively.
4. Repeat for S'=S1 and S'=S2, for each of the subsets of samples, if they are not empty.
Some pointers:
At each iteration you basically split the current data into 2 subsets: the samples that hold pi, and the samples that do not. You then create two new nodes, which are the current node's children, and repeat the process for each of them, each with the relevant subset of data.
A smart algorithm chooses the property pi (in step 2) in a way that minimizes the tree's height as much as it can (finding the best solution is NP-Hard, but there are greedy approaches, for example choosing the split that minimizes entropy).
After the tree is created, some pruning is usually done in order to avoid overfitting.
A simple extension of this algorithm is to use multiple decision trees that work separately. This is called a random forest, and empirically it usually gets pretty good results.
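The steps above can be sketched in Python roughly as follows. This is not ID3 or C4.5 proper: as a stand-in for the entropy criterion it greedily picks the property with the most balanced yes/no split, which is a reasonable proxy when every animal is its own label. The function names and the dict-based node layout are my own assumptions; the sample data is the one from the question.

```python
def build_tree(samples, properties):
    """samples: dict animal -> set of properties it has.
    Returns a leaf (sorted list of animal names) or a dict node
    {"property": p, "yes": subtree, "no": subtree}."""
    # Only properties that actually split the current samples are useful.
    useful = [p for p in sorted(properties)
              if 0 < sum(p in ps for ps in samples.values()) < len(samples)]
    if len(samples) <= 1 or not useful:
        return sorted(samples)

    def imbalance(p):
        yes_count = sum(p in ps for ps in samples.values())
        return abs(2 * yes_count - len(samples))

    best = min(useful, key=imbalance)  # greedy: most balanced split
    yes = {a: ps for a, ps in samples.items() if best in ps}
    no = {a: ps for a, ps in samples.items() if best not in ps}
    rest = properties - {best}
    return {"property": best,
            "yes": build_tree(yes, rest),
            "no": build_tree(no, rest)}


def guess(tree, has_property):
    """Walk the tree, calling has_property(p) -> bool for each question."""
    while isinstance(tree, dict):
        tree = tree["yes"] if has_property(tree["property"]) else tree["no"]
    return tree


animals = {
    "Animal1": {"property1", "property3", "property5"},
    "Animal2": {"property2", "property3", "property5", "property6"},
    "Animal3": {"property1", "property6"},
}
all_props = set().union(*animals.values())
tree = build_tree(animals, all_props)
```

In the game, `has_property` would ask the user the yes/no question. A leaf with more than one name means the data cannot tell those animals apart.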
I'm looking for leads on algorithms to deduce the timeline/chronology of a series of novels. I've split the texts into days and created a database of relationships between them, e.g.: X is a month before Y, Y and Z are consecutive, date of Z is known, X is on a Tuesday, etc. There is uncertainty ('month' really only means roughly 30 days) and also contradictions. I can mark some relationships as more reliable than others to help resolve ambiguity and contradictions.
What kind of algorithms exist to deduce a best-fit chronology from this kind of data, assigning a highest-probability date to each day? At least time is 1-dimensional, but dealing with a complex relationship graph with inconsistencies seems non-trivial. I have a CS background so I can code something up, but some idea about the names of applicable algorithms would be helpful. I guess what I have is a graph with days as nodes and relationships as edges.
A simple, crude first approximation to your problem would be to store information like "A happened before B" in a directed graph with edges like "A -> B". Test the graph to see whether it is a Directed Acyclic Graph (DAG). If it is, the information is consistent in the sense that there is a consistent chronology of what happened before what else. You can get a sample linear chronology by printing a "topological sort" (topsort) of the DAG. If events C and D happened simultaneously or there is no information to say which came before the other, they might appear in the topsort as ABCD or ABDC. You can even get the topsort algorithm to print all possibilities (so both ABCD and ABDC) for further analysis using more detailed information.
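A small sketch of the DAG test and topsort in one pass, using Kahn's algorithm in Python. The function name and the example edges are hypothetical; an edge `(X, Y)` means "X happened before Y".

```python
from collections import defaultdict, deque


def topological_order(edges):
    """Return one chronology consistent with the "before" edges, or None
    if the edges contain a cycle (a chronological contradiction)."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1
        nodes.update((before, after))

    # Start from the events that nothing is known to precede.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Any node left unprocessed sits on or behind a cycle.
    return order if len(order) == len(nodes) else None


consistent = topological_order([("A", "B"), ("B", "C"), ("B", "D")])
contradictory = topological_order([("A", "B"), ("B", "C"), ("C", "A")])
```

Here C and D are unordered relative to each other, so ABDC would be an equally valid answer for the first call.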
If the graph you obtain is not a DAG, you can use an algorithm like Tarjan's algorithm to quickly identify "strongly connected components", which are areas of the graph which contain chronological contradictions in the form of cycles. You could then analyze them more closely to determine which less reliable edges might be removed to resolve contradictions. Another way to identify edges to remove to eliminate cycles is to search for "minimum feedback arc sets". That's NP-hard in general but if your strongly connected components are small the search could be feasible.
Constraint programming is what you need. In propagation-based CP, you alternate between (a) making a decision at the current choice point in the search tree and (b) propagating the consequences of that decision as far as you can. Notionally you do this by maintaining a domain D of possible values for each problem variable x such that D(x) is the set of values for x which have not yet been ruled out along the current search path. In your problem, you might be able to reduce it to a large set of Boolean variables, x_ij, where x_ij is true iff event i precedes event j. Initially D(x) = {true, false} for all variables. A decision is simply reducing the domain of an undecided variable (for a Boolean variable this means reducing its domain to a single value, true or false, which is the same as an assignment). If at any point along a search path D(x) becomes empty for any x, you have reached a dead-end and have to backtrack.
If you're smart, you will try to learn from each failure and also retreat as far back up the search tree as required to avoid redundant search (this is called backjumping -- for example, if you identify that the dead-end you reached at level 7 was caused by the choice you made at level 3, there's no point in backtracking just to level 6 because no solution exists in this subtree given the choice you made at level 3!).
Now, given you have different degrees of confidence in your data, you actually have an optimisation problem. That is, you're not just looking for a solution that satisfies all the constraints that must be true, but one which also best satisfies the other "soft" constraints according to the degree of trust you have in them. What you need to do here is decide on an objective function assigning a score to a given set of satisfied/violated partial constraints. You then want to prune your search whenever you find the current search path cannot improve on the best previously found solution.
If you do decide to go for the Boolean approach, you could profitably look into SAT solvers, which tear through these kinds of problems. But the first place I'd look is at MiniZinc, a CP language which maps on to a whole variety of state of the art constraint solvers.
Best of luck!
Problem description
There are different categories which contain an arbitrary amount of elements.
There are three different attributes A, B and C. Each element has a different distribution of these attributes, expressed as positive integer values. For example, element 1 has the attributes A: 42, B: 1337, C: 18. The sum of these attributes is not consistent across elements; some elements have more than others.
Now the problem:
We want to choose exactly one element from each category so that
We hit a certain threshold on attributes A and B (going over it is also possible, but not necessary)
while getting a maximum amount of C.
Example: we want to hit at least 80 A and 150 B in sum over all chosen elements and want as many C as possible.
I've thought about this problem and cannot imagine an efficient solution. The sample sizes are about 15 categories, each of which contains up to ~30 elements, so brute-forcing doesn't seem very effective since there are potentially 30^15 possibilities.
My model is that I think of it as a tree with depth number of categories. Each depth level represents a category and gives us the choice of choosing an element out of this category. When passing over a node, we add the attributes of the represented element to our sum which we want to optimize.
If we hit the same attribute combination multiple times on the same level, we merge them so that we can strip away the repeated computation of already computed values. If we reach a level where one path has less value in all three attributes than another path, we don't follow it any further from there.
However, in the worst case this tree still has ~30^15 nodes in it.
Can anybody think of an algorithm that may help me solve this problem? Or could you explain why you think no such algorithm exists?
This question is very similar to a variation of the knapsack problem. I would start by looking at solutions to that problem and see how well you can apply them to your stated problem.
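To make the knapsack connection concrete, here is a sketch of a knapsack-style dynamic program in Python (the multiple-choice variant, since exactly one element is taken per category). It relies on one observation: going over a threshold brings no extra benefit, so A and B can be capped at their thresholds, and the number of distinct states per level is at most (threshold A + 1) × (threshold B + 1). The function name and the example numbers are made up.

```python
def best_selection(categories, need_a, need_b):
    """categories: list of lists of (a, b, c) tuples of non-negative ints.
    Pick exactly one tuple per category so that sum(a) >= need_a and
    sum(b) >= need_b, maximizing sum(c).
    Returns (best_c, picked_indices) or None if no selection qualifies."""
    # State: (a, b) with each coordinate capped at its threshold.
    # Value: (best c reaching that state, indices picked so far).
    states = {(0, 0): (0, [])}
    for items in categories:
        nxt = {}
        for (a, b), (c, picks) in states.items():
            for idx, (da, db, dc) in enumerate(items):
                key = (min(need_a, a + da), min(need_b, b + db))
                if key not in nxt or nxt[key][0] < c + dc:
                    nxt[key] = (c + dc, picks + [idx])
        states = nxt
    return states.get((need_a, need_b))


# Hypothetical data: two categories with three (A, B, C) elements each.
categories = [
    [(50, 10, 5), (10, 100, 7), (30, 60, 1)],
    [(40, 150, 2), (70, 50, 9), (0, 0, 20)],
]
result = best_selection(categories, 80, 150)
```

For thresholds of 80 and 150 that is at most 81 × 151 states per level, so 15 categories of 30 elements stay in the range of a few million updates rather than 30^15 combinations.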
My first inclination is to try branch-and-bound. You can do it breadth-first or depth-first, and I prefer depth-first because I think it's cleaner.
To express it simply, you have a tree-walk procedure walk that can enumerate all possibilities (maybe it just has a 5-level nested loop). It is augmented with two things:
At every step of the way, it keeps track of the cost at that point, where the cost can only increase. (If the cost can also decrease, it becomes more like a minimax game tree search.)
The procedure has an argument budget, and it does not search any branches where the cost can exceed the budget.
Then you have an outer loop:
    for (budget = 0; budget < ... ; budget++){
        walk(budget);
        // if walk finds a solution within the budget, halt
    }
The amount of time it takes is exponential in the budget, so easier cases will take less time. The fact that you are re-doing the search doesn't matter much because each level of the budget takes as much or more time than all the previous levels combined.
Combine this with some sort of heuristic about the order in which you consider branches, and it may give you a workable solution for typical problems you give it.
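Here is one way the walk-plus-outer-loop idea might look in Python. It assumes integer, non-negative costs (so that stepping the budget by 1 makes sense), and the `levels` data, the `goal` predicate and the `cheapest` wrapper are all invented for the example.

```python
def walk(levels, goal, budget, chosen=(), cost=0):
    """Depth-first search over one choice per level. Each choice is a
    (cost, value) pair with cost >= 0, so the running cost never decreases.
    Returns the first full assignment within budget that satisfies goal."""
    if cost > budget:
        return None  # bound: this branch already exceeds the budget
    if len(chosen) == len(levels):
        return chosen if goal(chosen) else None
    for option in levels[len(chosen)]:
        found = walk(levels, goal, budget, chosen + (option,), cost + option[0])
        if found is not None:
            return found
    return None


def cheapest(levels, goal, max_budget):
    """Outer loop: raise the budget until walk finds a solution."""
    for budget in range(max_budget + 1):
        found = walk(levels, goal, budget)
        if found is not None:
            return budget, found
    return None


# Hypothetical data: (cost, value) options for three levels.
levels = [[(1, 2), (3, 5)], [(2, 1), (4, 6)], [(1, 1), (2, 3)]]
result = cheapest(levels, lambda picks: sum(v for _, v in picks) >= 10, 20)
```

Because the budget grows one step at a time, the first solution found is also a cheapest one.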
If that doesn't work, you can fall back on basic heuristic programming. That is, do some cases by hand, and pay attention to how you did it. Then program it the same way.
I hope that helps.
Can someone explain the branch and bound search technique for me? I need to find a path with the smallest cost from any start node to an end node of any random graph using branch and bound search algorithm.
The basic idea of B & B is:
When solving an optimisation problem ("Find an X satisfying criteria Y so as to minimise the cost f(X)"), you build a solution piece by piece -- at any point in time, you have a partial solution, which has a cost.
If the nature of the problem is such that the cost of a partial solution can only stay the same or go up as you continue adding pieces to it, then you know that there's no point continuing to add pieces to a partial solution if there's already a full solution with lower cost. In this case, you can abandon (or "prune", or "fathom") further processing of this partial solution.
Many problems have the latter property, making B & B a widely applicable algorithm technique.
The process of searching for solutions can be represented by a search tree, where the root node represents the starting point where no decisions have been made, and each edge leading from a node represents a decision about something to be included in a partial solution. Each node is a partial solution comprising the decisions made (edges) from the root to that node.
Example: if we want to solve a Sudoku puzzle, the root node would represent the board with just the originally supplied numbers filled in; there might be 9 edges from this root, each representing the decision to assign a number 1-9 to the top-left cell. Each of those 9 partial solution nodes could have 8 branches, representing the valid assignments to the cell at position (1, 2), and so on. Usually, each edge represents a recursion step in a program.
With B & B, in the best case a good solution is found early, meaning that unpromising areas of the search tree can be pruned near the root; but in the worst case, the entire tree of valid solutions will be generated. For this reason B & B is usually only used to solve problems for which no faster algorithm is known (such as NP-hard problems).
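Applied to the path question, a depth-first B & B might look like the Python sketch below. It assumes non-negative edge costs (so a partial path's cost can only stay the same or go up) and a graph given as nested dicts; the function name and example graph are made up. For plain shortest paths a dedicated algorithm such as Dijkstra's is faster, so treat this as an illustration of the technique.

```python
import math


def branch_and_bound_path(graph, start, goal):
    """graph: dict node -> dict neighbor -> non-negative edge cost.
    Returns (cost, path) for a cheapest simple path, or (inf, None)."""
    best_cost, best_path = math.inf, None

    def extend(node, path, cost):
        nonlocal best_cost, best_path
        if cost >= best_cost:
            return  # bound: cannot beat the best full solution so far
        if node == goal:
            best_cost, best_path = cost, list(path)
            return
        # Branch: try cheaper edges first so good solutions appear early.
        for nxt, weight in sorted(graph.get(node, {}).items(),
                                  key=lambda kv: kv[1]):
            if nxt not in path:  # keep the path simple (no revisits)
                path.append(nxt)
                extend(nxt, path, cost + weight)
                path.pop()

    extend(start, [start], 0)
    return best_cost, best_path


# Hypothetical weighted directed graph.
graph = {
    "S": {"A": 1, "B": 4},
    "A": {"B": 2, "T": 6},
    "B": {"T": 1},
    "T": {},
}
result = branch_and_bound_path(graph, "S", "T")
```

In this example the first full path found costs 4, which lets the search prune both remaining branches (S-A-T and S-B) without exploring them to the end.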
This link provides a graphical representation of concepts related to B & B.
This link provides an explanation of the algorithm and sample C# code in a downloadable zip file.
Hope this helps.
There are a lot of references about branch and bound algorithms on the web.
Here you can find some theoretical explanation,
whereas the code in C# is here.
Fantastic answer, @j_random_hacker!
See pg 439 (example 18.2) in Papadimitriou and Steiglitz, Combinatorial Optimization.
This book is a classic, and it discusses your exact problem.