Clever algorithm for checking on equality - algorithm

We need to implement next feature:
Suppose we have a list of some objects.
We need to say what objects are the same and what are not.
Suppose 1 is equal to 2 and 2 is equal to 5.
So 1 is equal to 5 and we don't need to check them.
Does this algorithm exist?
Any ideas would be very great.

I think this can be thought of as having sets of elements which are equal. And IMO Disjoint Set Data Structure will be very efficient in maintaing such set of records. The basic idea is to first put every element as a seperate set, whenever you encounter a equality relation you take the set to which those elements belong and merge them. The run time for merge and lookup are sub-logarithmic if you use path compression. http://en.wikipedia.org/wiki/Disjoint-set_data_structure

You want a Disjoint Set strucutre with path compression. It's very simple to implement, and the performance is close to O(1).

If you have a graph where the vertices are the objects and there's an edge between known equal objects, then every object connected to another object is also equal. So, to test if object A equals object B, you are testing if a path exists between A and B in that graph.
You can use any graph searching algorithm to do this. Depth-first search would be a good start.
If you want to find all connected pairs then you could use Tarjan's algorithm to find all strongly connected components. Within each component, all objects are equal.

Related

Extended version of the set cover problem

I don't generally ask questions on SO, so if this question seems inappropriate for SO, just tell me (help would still be appreciated of course).
I'm still a student and I'm currently taking a class in Algorithms. We recently learned about the Branch-and-Bound paradigm and since I didn't fully understand it, I tried to do some exercises in our course book. I came across this particular instance of the Set Cover problem with a special twist:
Let U be a set of elements and S = {S1, S2, ..., Sn} a set of subsets of U, where the union of all sets Si equals U. Outline a Branch-and-Bound algorithm to find a minimal subset Q of S, so that for all elements u in U there are at least two sets in Q, which contain u. Specifically, elaborate how to split the problem up into subproblems and how to calculate upper and lower bounds.
My first thought was to sort all the sets Si in S in descending order, according to how many elements they contain which aren't yet covered at least twice by the currently chosen subsets of S, so our current instance of Q. I was then thinking of recursively solving this, where I choose the first set Si in the sorted order and make one recursive call, where I take this set Si and one where I don't (meaning from those recursive calls onwards the subset is no longer considered). If I choose it I would then go through each element in this chosen subset Si and increase a counter for all its elements (before the recursive call), so that I'll eventually know, when an element is already covered by two or more chosen subsets. Since I sort the not chosen sets Si for each recursive call, I would theoretically (in my mind at least) always be making the best possible choice for the moment. And since I basically create a binary tree of recursive calls, because I always make one call with the current best subset chosen and one where I don't I'll eventually cover all 2^n possibilities, meaning eventually I'll find the optimal solution.
My problem now is I don't know or rather understand how I would implement a heuristic for upper and lower bounds, so the algorithm can discard some of the paths in the binary tree, which will never be better than the current best Q. I would appreciate any help I could get.
Here's a simple lower bound heuristic: Find the set containing the largest number of not-yet-twice-covered elements. (It doesn't matter which set you pick if there are multiple sets with the same, largest possible number of these elements.) Suppose there are u of these elements in total, and this set contains k <= u of them. Then, you need to add at least u/k further sets before you have a solution. (Why? See if you can prove this.)
This lower bound works for the regular set cover problem too. As always with branch and bound, using it may or may not result in better overall performance on a given instance than simply using the "heuristic" that always returns 0.
First, some advice: don't re-sort S every time you recurse/loop. Sorting is an expensive operation (O(N log N)) so putting it in a loop or a recursion usually costs more than you gain from it. Generally you want to sort once at the beginning, and then leverage that sort throughout your algorithm.
The sort you've chosen, descending by the length of the S subsets is a good "greedy" ordering, so I'd say just do that upfront and don't re-sort after that. You don't get to skip over subsets that are not ideal within your recursion, but checking a redundant/non-ideal subset is still faster than re-sorting every time.
Now what upper/lower bounds can you use? Well generally, you want your bounds and bounds-checking to be as simple and efficient as possible because you are going to be checking them a lot.
With this in mind, an upper bounds is easy: use the shortest set-length solution that you've found so far. Initially set your upper-bounds as var bestQlength = int.MaxVal, some maximum value that is greater than n, the number of subsets in S. Then with every recursion you check if currentQ.length > bestQlength, if so then this branch is over the upper-bounds and you "prune" it. Obviously when you find a new solution, you also need to check if it is better (shorter) than your current bestQ and if so then update both bestQ and bestQlength at the same time.
A good lower bounds is a bit trickier, the simplest I can think of for this problem is: Before you add a new subset Si into your currentQ, check to see if Si has any elements that are not already in currentQ two or more times, if it does not, then this Si cannot contribute in any way to the currentQ solution that you are trying to build, so just skip it and move on to the next subset in S.

Data structure to represent a graph

Having a couple of cities and their locations I want to create a data structure that would represent a graph like this. This graph represent all possible paths that can be taken in order to visit every city only once:
My question is, since this is probably a very common problem, is there an algorithm or already made data structure to represent this? The programming language is not important (although I would prefer java).
Your problem seems very close to the traveling salesman problem, a classic among the classics.
As you did intuite, the graph that will represent all the possible solutions is indeed a tree (the paths from the root to any of its leaf should represent a solution).
From there, you can ask yourself several questions:
Is the first city that I'll visit an important piece of information, or is it only the order that matters ? For instance, is London-Warsaw-Berlin-Lidz equivalent to Warsaw-Berlin-Lidz-London ?
Usually, we consider these solutions as being equivalent to solve a TSP, but it might not be the case for you.
Did you see the link between a potential solution to the TSP and a permutation ? Actually, what you're looking for is a way (and the data structure that goes with it) to generate all the permutations of a given set(your set of cities).
With these two points in mind, we can think about a way to generate such a tree. A good strategy to work with trees is to think recursively.
We have a partial solution, meaning the k first cities. Then, the next possible city can be any among the n-k cities remaining. That gives the following pseudo-code.
get_all_permutations(TreeNode node, Set<int>not_visited){
for (city in not_visited){
new_set = copy(not_visited);
new_set.remove(city);
new_node = new TreeNode();
new_node.city = city;
node.add_child(new_node);
get_all_permutations(new_node, new_set);
}
}
This will build the tree recursively.
Depending on your answer to the first point I mentioned (about the importance of the first city), you might want to assign a city to the root node, or not.
Some good points to look in, if you want to go further with this kind of problem/thinking, are enumeration algorithms, and recursive algorithms. They're generally a good option when your goal is to enumerate all the elements of a set. But they're also generally an inefficient way to solve problems (for example, in the case of the TSP, solving using this algorithm results in a very inefficient approach. There are some much much better ones).
This tree is bad. There are redundant data in it. For instance connection between nodes 2 and 4 occurs three times in the tree. You want a "structure" that automatically gives the solution to your problem, so that it's easier for you, but that's not how problem solving works. Input data is one set of data, output data is another set of data, and they could appear similar, but they can also be quite different.
One simple matrix with one triangle empty and the other containing data should have all the information you need. Coordinates of the matrix are nodes, cells are distances. This is your input data.
What you do with this matrix in your code is a different matter. Maybe you want to write all possible paths. Then write them. Use input data and your code to produce output data.
What you are looking for is actually a generator of all permutations. If you keep one city fixed as the first one (London, in your diagram), then you need to generate all permutations of the list of all your remaining nodes (Warsaw, Łódź, Berlin).
Often such an algorithm is done recursively by looping over all elements, taking it out and doing this recursively for the remaining elements. Often libraries are use to achieve this, e. g. itertools.permutations in Python.
Each permutation generated this way should then be put in the resulting graph you originally wanted. For this you can use any graph-representation you would like, e. g. a nested dictionary structure:
{ a: { b: { c: d,
d: c },
c: { b: d,
d, b },
d: { b: c,
c: b } } }

create decision tree from data

I'm trying to create decision tree from data. I'm using the tree for guess-the-animal-game kind of application. User answers questions with yes/no and program guesses the answer. This program is for homework.
I don't know how to create decision tree from data. I have no way of knowing what will be the root node. Data will be different every time. I can't do it by hand. My data is like this:
Animal1: property1, property3, property5
Animal2: property2, property3, property5, property6
Animal3: property1, property6
etc.
I searched stackoverflow and i found ID3 and C4.5 algorithms. But i don't know if i should use them.
Can someone direct me, what algorithm should i use, to build decision tree in this situation?
I searched stackoverflow and i found ID3 and C4.5 algorithms. But i
don't know if i should use them.
Yes, you should. They are very commonly used decision trees, and have some nice open source implementations for them. (Weka's J48 is an example implementation of C4.5)
If you need to implement something from scratch, implementing a simple decision tree is fairly simple, and is done iteratively:
Let the set of labled samples be S, with set of properties P={p1,p2,...,pk}
Choose a property pi
Split S to two sets S1,S2 - S1 holds pi, and S2 do not. Create two children for the current node, and move S1 and S2 to them respectively
Repeat for S'=S1, S'=S2 for each of the subsets of samples, if they are not empty.
Some pointers:
At each iteration you basically split the current data to 2 subsets, the samples that hold pi, and the data that does not. You then create two new nodes, which are the current node's children, and repeat the process for each of them, each with the relevant subset of data.
A smart algorithm chooses the property pi (in step 2) in a way that minimizes the tree's height as much as it can (finding the best solution is NP-Hard, but there are greedy approaches to minimize entropy, for example).
After the tree is created, some pruning to it is done, in order to avoid overfitting.
A simple extension of this algorithm is using multiple decision trees that work seperately - this is called Random Forests, and is empirically getting pretty good results usually.

good way to check the edge exists or not in an undirected graph

I am using adjacency list representation.
basically
A:[B,C,D] means A is connected to B,C and D
now I am trying to add a method (in python) to add edge in graph.
But before I add an edge. I want to check whether two edges are connected or not.
So for example I want to add an edge between two nodes D and A ( ignorant of the fact taht A and D are connected).
So, since there is no key "D" in the hash/dictionary, it will return false.
Now, very naively, I can check for D and A and then A and D as well.. but thats very scruff.
Or whenever I connect two nodes, I can always duplicate..
I.e when connecting A and E.. A:[E] create E:[A]
but this is not very space efficient.
Basically I want to make this graph direction independent.
Is there any data structure that can help me solve this.
I am hoping that my question makes sense.
For an undirected graph you could use a simple edge list in which you store all the pairs of edges. This will save space and worsen performance but you should know that you can't have both at the same time so you always have to decide for a tradeoff.
Otherwise you could use a triangular adjacency matrix but, to avoid wasting half of the space, you will have to store it in a particular way (by developing an efficient way to retrieve edge existence without wasting space). Are you sure it is worth it and it's not just premature optimization?
Adjacency lists are mostly fine, even if you have to store every undirected edge twice, how big is your graph?
Take a look at this my answer: Graph representation benchmarking, so you can choose which one you prefer.
You have run into a classic space vs. time tradeoff.
As you said, if you don't find D->A, you can search for A->D. This will result in maximum of double your execution time. Alternatively, when inserting A->D, also create D->A, but this comes at the cost of additional space.
Worst case, for the time tradeoff, you will do 2 lookups, which is still O(N) (faster with better data structures). For the space tradeoff, you will (in the worst case) create a link between every set of nodes, which is roughly O(N^2). As such, I would just do 2 lookups.
Assuming each contains() method is VERY expansive, and you want to avoid doing these in all costs, one can use a bloom filter, and check if an edge exists - and by this, reduce the number of contains() calls.
The idea is: Each node will hold its own bloom filter, which will indicate which edges are connected to it. Checking a bloom filter is fairly easy and cheap, and also modifying it when an edge is added.
If you checked the bloom filter - and it said "no" - you can safely add the edge - it does not exist.
However, bloom filters have False Positives - so, if the bloom filter said "the edge exists" - you will have to check the list if it is indeed there.
Notes:
(1) Removing edges will be a problem if using bloom filters.
(2) Bloom filters give you nice time/space trade off - as the number of false positives decrease as the size of the filter grows.
(3) However, when an edge does exist - no matter what's the size of the filter, you will alwats have to use the contains() method.
Assuming your node names can be compared, you can simply always store the edges so that the first endpoint is less than the second endpoint. Then you only have one lookup to perform. This definitely works for strings.

Optimal selection election algorithm

Given a bunch of sets of people (similar to):
[p1,p2,p3]
[p2,p3]
[p1]
[p1]
Select 1 from each set, trying to minimize the maximum number of times any one person is selected.
For the sets above, the max number of times a given person MUST be selected is 2.
I'm struggling to get an algorithm for this. I don't think it can be done with a greedy algorithm, more thinking along the lines of a dynamic programming solution.
Any hints on how to go about this? Or do any of you know any good websites about this stuff that I could have a look at?
This is neither dynamic nor greedy. Let's look at a different problem first -- can it be done by selecting every person at most once?
You have P people and S sets. Create a graph with S+P vertices, representing sets and people. There is an edge between person pi and set si iff pi is an element of si. This is a bipartite graph and the decision version of your problem is then equivalent to testing whether the maximum cardinality matching in that graph has size S.
As detailed on that page, this problem can be solved by using a maximum flow algorithm (note: if you don't know what I'm talking about, then take your time to read it now, as you won't understand the rest otherwise): first create a super-source, add an edge linking it to all people with capacity 1 (representing that each person may only be used once), then create a super-sink and add edges linking every set to that sink with capacity 1 (representing that each set may only be used once) and run a suitable max-flow algorithm between source and sink.
Now, let's consider a slightly different problem: can it be done by selecting every person at most k times?
If you paid attention to the remarks in the last paragraph, you should know the answer: just change the capacity of the edges leaving the super-source to indicate that each person may be used more than once in this case.
Therefore, you now have an algorithm to solve the decision problem in which people are selected at most k times. It's easy to see that if you can do it with k, then you can also do it with any value greater than k, that is, it's a monotonic function. Therefore, you can run a binary search on the decision version of the problem, looking for the smallest k possible that still works.
Note: You could also get rid of the binary search by testing each value of k sequentially, and augmenting the residual network obtained in the last run instead of starting from scratch. However, I decided to explain the binary search version as it's conceptually simpler.

Resources