Find the number of possible combinations in a graph - algorithm

This is the problem I am trying to solve:
Given an undirected graph, find the number of distinct unordered combinations of nodes that can be obtained by traversing a path between any two nodes, keeping track of the nodes visited along the way (without visiting any node twice).
Say for example that the adjacency list is:
1: 2,3
2: 1,3,4
3: 1,2,4
4: 2,3,5
5: 4
One unordered combination would be [1,2,3,4], which could be obtained by following the path 1>3>4>2 or 1>3>2>4
The answer would be 17 with the following unordered sets:
[1,2] [1,3] [2,3] [2,4] [3,4] [4,5]
[1,2,3] [1,2,4] [1,3,4] [2,3,4] [2,4,5] [3,4,5]
[1,2,3,4] [1,2,4,5] [1,3,4,5] [2,3,4,5]
[1,2,3,4,5]
Currently, my function just brute-forces all of the possibilities, but I was wondering whether there is a faster way to do it when the graph has 10,000+ nodes. Brute-forcing would be far too slow.
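For reference, the brute force amounts to a DFS over every simple path, recording each set of visited nodes; a minimal Python sketch, with the example graph hard-coded:

adj = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 4], 4: [2, 3, 5], 5: [4]}

def all_path_sets(adj):
    found = set()
    def dfs(node, visited):
        for nxt in adj[node]:
            if nxt not in visited:
                extended = visited | {nxt}
                found.add(frozenset(extended))  # record the unordered combination
                dfs(nxt, extended)
    for start in adj:
        dfs(start, {start})
    return found

print(len(all_path_sets(adj)))  # 17 for the example above

Its running time grows with the number of simple paths, which is exactly why it becomes hopeless on large graphs.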

You can use this algorithm:
sort the nodes by name, for example N_0, ..., N_k
define a `result` set which is empty at the beginning
for `i` from `0` to `k` do the following:
    set node `n_i` as root
    step 1: list all neighbors `n_j` of `n_i` with `j > i`
    add all of these pairs to the `result` set
    set `n_j` as root
    go to step 1
end


perfect hash function for random integer

Here's the problem:
X is a set of positive integers (0 included) with n distinct elements that I know in advance, all of them less than or equal to m. I want a collision-free hash function, as simple as possible, that maps them to 0..n-1.
For example:
X = [31,223,121,100,123,71], so n = 6, m = 223.
I want to find a hash function to map them to [0, 1, 2, 3, 4, 5].
If mapping to 0..n-1 is too difficult, then mapping X to some other small range would also help.
Finding such a function is not too difficult, but finding one that is simple and easy to generate is hard.
It would be better if it preserved the order of X.
Any clues?
My favorite perfect hash is pretty easy.
The hash function you generate has the form:
hash = table1[h1(key)%N] + table2[h2(key)%N]
h1 and h2 are randomly generated hash functions. In your case, you can generate random constants and then use h1(key) = key*C1/m and h2(key) = key*C2/m, or something similarly simple.
To generate the perfect hash:
Generate random constants C1 and C2
Imagine the bipartite graph, with table1 slots and table2 slots as vertices and an edge for each key between table1[h1(key)%N] and table2[h2(key)%N]. Run a DFS to see if the graph is acyclic. If not, go back to step 1.
Now that you have an acyclic graph, start at any key/edge in each connected component, and set its slots in table1 and table2 however you like to give it whatever hash you like.
Traverse the tree starting at the vertices adjacent to the edge you just set. For every edge you traverse, one of its slots will already be set. Set the other one to make the hash value come out however you like.
That's it. All of steps (2), (3) and (4) can be combined into a single DFS traversal pretty easily.
The complete description and analysis is in this paper.
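To make the construction concrete, here is a minimal Python sketch of it. The function name, the table size N = 2n+1, and the multiplicative h1/h2 are my own choices rather than values from the paper, and I fold the two tables into a single array g, taking the sum mod n so the result lands exactly in 0..n-1:

import random

def build_perfect_hash(keys, N=None):
    n = len(keys)
    if N is None:
        N = 2 * n + 1  # assumption: a larger table makes an acyclic graph likely
    desired = {k: i for i, k in enumerate(sorted(keys))}  # order-preserving targets
    while True:
        C1 = random.randrange(1, 1 << 31)
        C2 = random.randrange(1, 1 << 31)
        h1 = lambda k, c=C1: (k * c) % N      # slot in table1
        h2 = lambda k, c=C2: (k * c) % N + N  # slot in table2 (offset by N)
        # Bipartite graph: one vertex per slot, one edge per key.
        adj = [[] for _ in range(2 * N)]
        for k in keys:
            adj[h1(k)].append((h2(k), k))
            adj[h2(k)].append((h1(k), k))
        g = [0] * (2 * N)          # table1 and table2, concatenated
        seen = [False] * (2 * N)
        assigned = set()           # keys whose edge has been processed
        acyclic = True
        for root in range(2 * N):
            if seen[root]:
                continue
            seen[root] = True
            stack = [root]
            while stack and acyclic:
                u = stack.pop()
                for v, k in adj[u]:
                    if k in assigned:
                        continue
                    assigned.add(k)
                    if seen[v]:    # reached a visited slot again: cycle, retry
                        acyclic = False
                        break
                    seen[v] = True
                    # g[u] is already fixed; set g[v] so the sum comes out right
                    g[v] = (desired[k] - g[u]) % n
                    stack.append(v)
            if not acyclic:
                break
        if acyclic:
            return lambda k: (g[h1(k)] + g[h2(k)]) % n

X = [31, 223, 121, 100, 123, 71]
h = build_perfect_hash(X)
print(sorted((x, h(x)) for x in X))
# [(31, 0), (71, 1), (100, 2), (121, 3), (123, 4), (223, 5)]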

Select one element from each set but the selected element should not be repeated

I have a few sets, say 5 of them:
{1,2,3}
{2,4}
{1,2}
{4,5}
{2,3,5}
Here, I need to choose at least 3 elements from any three sets (one element per set), given that once an element has been selected, it cannot be selected again.
Also check whether any solution exists.
E.g.
set {1,2,3} -> choose 1
set {2,4} -> choose 2
set {1,2} -> cannot choose, since both 1 and 2 are already chosen.
set {2,5} -> can only choose 5
Is there a way to achieve this? A simple explanation would be appreciated.
If you only need 3 elements, then the algorithm is quite simple. Just repeat the following procedure:
1. Select the set with the lowest heuristic. The heuristic is the length of the set divided by the total number of occurrences of that set. If two or more sets tie, you can choose any one of them. If the selected set has zero elements, remove it and go to step 4.
2. Pick an element from that set. This is the element you'll choose.
3. Remove this element from every set.
4. If you have picked 3 elements or there are no more sets remaining, stop. Otherwise go to step 1.
This algorithm yields at least 3 elements whenever that is possible, even in the presence of duplicates. Here's the proof.
If the heuristic for a set is <= 1, picking an element from that set is basically free: it doesn't hurt the ability to use the other sets at all.
If we are in a situation with 2 or more sets with heuristic > 1 and we have to pick at least two elements, this is easy. Just pick one from the first set; the second set will still have an element left, because its length is > 1 (since its heuristic is > 1).
If we are in a situation with 3 or more sets with heuristic > 1, we can pick from the first set. After this we are left with at least two sets, at least one of which has more than one element. We can't be left with two size-one sets, because that would imply that the 3 sets we started with contained a duplicated length-2 set, which has heuristic 1. Thus we can pick all 3 elements.
Here is Python code for this algorithm. The generator yields as many elements as it can manage; if it's possible to return at least 3 elements, it will. Beyond that, however, it doesn't always return the optimal solution.
def choose(sets):
    # Copy the input, to avoid modifying it
    s = [{*e} for e in sets]
    while True:
        # If there are no more sets remaining
        if not s:
            return
        # Select based on length and number of duplicates
        m = min(s, key=lambda x: len(x) / s.count(x))
        s.remove(m)
        # Ignore empty sets
        if m:
            # Remove an arbitrary element
            e = m.pop()
            # Yield it
            yield e
            # Remove the chosen element e from the other sets
            for t in s:
                t.discard(e)
print([*choose([{1,2,3}, {2,4}, {1,2}, {4,5}, {2,3,5}])])
print([*choose([{1}, {2,3}, {2,4}, {1,2,4}])])
print([*choose([{1,2}, {2}, {2,3,4}])])
print([*choose([{1,2}, {2}, {2,1}])])
print([*choose([{1,2}, {1,3}, {1,3}])])
print([*choose([{1}, {1,2,3}, {1,2,3}])])
print([*choose([{1,3}, {2,3,4}, {2,3,4}, {2,3,4}, {2,3,4}])])
print([*choose([{1,5}, {2,3}, {1,3}, {1,2,3}])])
Something like this. Given your sets
0: {1,2,3}
1: {2,4}
2: {1,2}
3: {4,5}
4: {2,3,5}
build an array A of sets:
A[1] = {0, 2}       // all sets containing 1
A[2] = {0, 1, 2, 4} // all sets containing 2
A[3] = {0, 4}       // all sets containing 3
A[4] = {1, 3}       // all sets containing 4
A[5] = {3, 4}       // all sets containing 5
set<int> result;
for (i = 0; i < 3; i++) {
    find k such that A[k] is not empty
    if no such k exists then "no solution"
    result.add(k)
    A[k] = empty
}
return result
I think my idea is a bit of an overkill, but it works on any collection of sets, with any number of sets of any size.
The idea is to transform the sets into a bipartite graph: on one side you have the sets, and on the other side the numbers they contain.
If a set contains a number, there is an edge between those two vertices.
Eventually you are looking for a maximum matching in this graph (maximum cardinality matching).
Gladly, it can be done with the Hopcroft–Karp algorithm in O(E√V) time, or with the simpler Ford–Fulkerson approach.
Here are some links with more on maximum matching and the algorithms:
https://en.wikipedia.org/wiki/Matching_(graph_theory)
https://en.wikipedia.org/wiki/Maximum_cardinality_matching
https://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm
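For instance, here is a minimal networkx sketch of this approach (the ('set', i) / ('elem', v) node labels are just my way of keeping the two sides of the graph distinct):

import networkx as nx
from networkx.algorithms import bipartite

sets = [{1, 2, 3}, {2, 4}, {1, 2}, {4, 5}, {2, 3, 5}]

B = nx.Graph()
set_nodes = [('set', i) for i in range(len(sets))]
for i, s in enumerate(sets):
    for v in s:
        B.add_edge(('set', i), ('elem', v))  # edge: set i contains value v

matching = bipartite.hopcroft_karp_matching(B, top_nodes=set_nodes)
# The matching dict contains both directions; keep only set -> element
chosen = {i: v for (side, i), (_, v) in matching.items() if side == 'set'}
print(chosen)            # one distinct element per matched set
print(len(chosen) >= 3)  # True: a solution with at least 3 elements exists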

Does a data structure like this exist?

I'm searching for a data structure that can be sorted as fast as a plain list and that allows removing elements in the following way. Let's say we have a list like this:
[{2,[1]},
{6,[2,1]},
{-4,[3,2,1]},
{-2,[4,3,2,1]},
{-4,[5,4,3,2,1]},
{4,[2]},
{-6,[3,2]},
{-4,[4,3,2]},
{-6,[5,4,3,2]},
{-10,[3]},
{18,[4,3]},
{-10,[5,4,3]},
{2,[4]},
{0,[5,4]},
{-2,[5]}]
i.e. a list containing tuples (this is Erlang syntax). Each tuple contains a number and a list of the members used to compute that number. What I want to do with the list is the following: first sort it, then take the head, and finally clean the list. By clean I mean removing from the tail all elements that contain any member of the head's list, or, in other words, all elements of the tail whose intersection with the head is non-empty. For example, after sorting, the head is {18,[4,3]}. The next step is to remove all elements of the list that contain 4 or 3, i.e. the resulting list should be this one:
[{6,[2,1]},
{4,[2]},
{2,[1]},
{-2,[5]}]
The process continues by taking the new head and cleaning again, until the whole list is consumed. Note that if the clean process preserves the order, there is no need to re-sort the list on each iteration.
The bottleneck here is the clean process. I would need some structure that allows me to do the cleaning faster than I currently can.
Does anyone know of a structure that allows doing this efficiently, without losing the order, or at least one that allows fast sorting?
Yes, you can get faster than this. Your problem is that you are representing the second member of each tuple as a list. Searching those lists is cumbersome and quite unnecessary: they are all contiguous substrings of 5..1, so you could simply represent them as a tuple of indices!
And in fact you don't even need a list of these index tuples. Put each value in a two-dimensional array right at the position given by its index tuple, and you'll get a triangular array:
h\l|  1   2   3   4   5
---+--------------------
 1 |  2
 2 |  6   4
 3 | -4  -6 -10
 4 | -2  -4  18   2
 5 | -4  -6 -10   0  -2
Instead of storing the data in a two-dimensional array, you might want to store them in a simple array with some index magic to account for the triangular shape (if your programming language only allows for rectangular two-dimensional arrays), but that doesn't affect complexity.
This is all the structure you need to quickly filter the "list" by simply looking things up.
Instead of sorting first and getting the head, we simply iterate once through the whole structure to find the maximum value and its indices:
max_val = 18
max = (4, 3) // the two indices
The filter is quite simple. With lists you'd check not (any (substring `contains`) selection), with sets isEmpty (intersect substring selection), but with tuples it's just sel.high < substring.low || sel.low > substring.high. And we don't even need to iterate the whole triangular array; we can simply iterate the higher and the lower triangles:
result = []
for (i from 1 until max[1])
    for (j from i until max[1])
        result.push({array[j][i], (j,i)})
for (i from max[0] until 5)
    for (j from i until 5)
        result.push({array[j+1][i+1], (j+1,i+1)})
And you've got the elements you need:
[{ 2, (1,1)},
{ 6, (2,1)},
{ 4, (2,2)},
{-2, (5,5)}]
Now you only need to sort that and you've got your result.
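In Python, the lookup-and-filter step could look like this (the dict-of-dicts layout and the names are my own, just to make the index logic concrete):

# tri[h][l] holds the value of the contiguous substring h..l (l <= h)
tri = {
    1: {1: 2},
    2: {1: 6, 2: 4},
    3: {1: -4, 2: -6, 3: -10},
    4: {1: -2, 2: -4, 3: 18, 4: 2},
    5: {1: -4, 2: -6, 3: -10, 4: 0, 5: -2},
}

def clean(tri, n, sel_high, sel_low):
    result = []
    for h in range(1, sel_low):           # substrings entirely below the selection
        for l in range(1, h + 1):
            result.append((tri[h][l], (h, l)))
    for h in range(sel_high + 1, n + 1):  # substrings entirely above the selection
        for l in range(sel_high + 1, h + 1):
            result.append((tri[h][l], (h, l)))
    return result

print(clean(tri, 5, 4, 3))
# [(2, (1, 1)), (6, (2, 1)), (4, (2, 2)), (-2, (5, 5))]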
Actually, the overall complexity doesn't get better with the triangular array. You still have O(n) from building the structure and finding the maximum. Whether you filter in O(n) by testing against every substring index tuple, or in O(|result|) by smart selection, no longer matters asymptotically; but you were specifically asking about a fast cleaning step, and this can still be beneficial in practice if the data is large or if you need to do multiple cleanings.
The only thing affecting the overall complexity is sorting only the result, not the whole input.
I wonder if your original data structure can be seen as an adjacency list for a directed graph? E.g.,
{2,[1]},
{6,[2,1]}
means you have these nodes and edges:
node 2 => node 1
node 6 => node 2
node 6 => node 1
So your question can be rewritten as:
If I find a node that links to nodes 4 and 3, what happens to the graph if I delete nodes 4 and 3?
One approach would be to build an adjacency matrix: an NxN bit matrix where every edge is a 1 bit. Your problem now becomes:
set every bit in the 4-row, and every bit in the 4-column, to zero.
That is, nothing links in or out of this deleted node.
As an optimisation, keep a bit array of length N, in which a bit is set if the corresponding node hasn't been deleted. So if nodes 1, 2, 4, and 5 are 'live' and 3 and 6 are 'deleted', the array looks like:
[1,1,0,1,1,0]
Now, to delete node 4, you just clear its bit:
[1,1,0,0,1,0]
When you're done deleting, go through the adjacency matrix, but ignore any edge whose row or column has a 0 in this array.
Full example. Let's say you have:
[ {2, [1,3]},
{3, [1]},
{4, [2,3]} ]
That's the adjacency matrix:
1 2 3 4
1 0 0 0 0 # no entry for 1
2 1 0 1 0 # 2, [1,3]
3 1 0 0 0 # 3, [1]
4 0 1 1 0 # 4, [2,3]
and the mask:
[1 1 1 1]
To delete node 2, you just alter the mask:
[1 0 1 1]
Now, to figure out the remaining structure, in Python (index 0 is padding so that nodes are 1-based):
matrix = [[0]*5,
          [0, 0, 0, 0, 0],  # no entry for 1
          [0, 1, 0, 1, 0],  # {2, [1,3]}
          [0, 1, 0, 0, 0],  # {3, [1]}
          [0, 0, 1, 1, 0]]  # {4, [2,3]}
mask = [0, 1, 0, 1, 1]      # node 2 deleted

rows = []
for r in range(1, 5):
    if not mask[r]:
        # this row was deleted
        continue
    # keep c only if node c wasn't deleted and the edge was there before
    targets = [c for c in range(1, 5) if mask[c] and matrix[r][c]]
    if targets:
        rows.append((r, targets))
print(rows)  # [(3, [1]), (4, [3])]
Adjacency matrices can get large (NxN bits, after all), so this will only be better on small, dense graphs, not large, sparse ones.
If this isn't great, you might find that it's easier to google for graph algorithms than to invent them yourself :)

VF2 algorithm - implementation

I have a problem with my VF2 algorithm implementation. Everything seems to work perfectly in many cases, but there is one problem I cannot solve.
The algorithm does not work on the example below, in which we compare two identical graphs (see image below). The starting vertex is 0.
The set P, calculated inside s0, stores all candidate pairs of vertices.
Below is the pseudocode from the publications on VF2 on which I based my implementation.
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=B51AD0DAEDF60D6C8AB589A39A570257?doi=10.1.1.101.5342&rep=rep1&type=pdf
http://www.icst.pku.edu.cn/intro/leizou/teaching/2012-autumn/papers/part2/VF2%20A%20%28sub%29Graph%20Isomorphism%20Algorithm%20For%20Matching%20Large%20Graphs.pdf
Comments to the right of /* describe the way I understand the code. I'm not sure if computing the P() set as described is valid: the candidate pairs are iterated in lexicographical order, by the first and then the second value of each pair.
PROCEDURE Match(s)
  INPUT: an intermediate state s; the initial state s0 has M(s0) = empty
  OUTPUT: the mappings between the two graphs

  IF M(s) covers all the nodes of G2 THEN
    OUTPUT M(s)
  ELSE
    Compute the set P(s) of the pairs candidate for inclusion in M(s)
      /* by all successors of the already matched M(s) if not empty, or
      /* predecessors of the already matched M(s) if not empty,
      /* or all possible vertices not yet included in M(s)
    FOREACH (n, m) ∈ P(s)
      IF F(s, n, m) THEN
        Compute the state s′ obtained by adding (n, m) to M(s)
          /* add n to M1(s), exclude it from T1(s)
          /* add m to M2(s), exclude it from T2(s)
          /* M1(s) is now M1(s′); the other structures belong to s′ too
        CALL Match(s′)
      END IF
    END FOREACH
    Restore data structures
      /* restore all structures to their state from before the FOREACH
  END IF
END PROCEDURE
When the algorithm reaches s4 and returns from the function, it loses the information about the good vertex matches.
As a result it finds only the subgraph isomorphism {(0,0),(1,1),(2,2),(5,3),(6,4)}, even though the graphs are isomorphic.
What am I doing wrong here?
I think that to answer your question "what am I doing wrong here", it is necessary for you to include some of your code. Did you re-implement the algorithm yourself, based on the pseudocode presented in the paper, or did you do the matching with the help of some graph-processing package?
I didn't have time to dig into the details, but I work with graphs as well, so I tried networkx (a Python package) and the Boost 1.55.0 library (a very extensive C++ graph library). Your example, and another example with 1000 nodes and 1500 edges, both return the correct matching (the trivial case of matching a graph with itself).
import networkx as nx
from networkx.algorithms import isomorphism

G1 = nx.Graph()
G2 = nx.Graph()
G1.add_nodes_from(range(0, 7))
G2.add_nodes_from(range(0, 7))
G1.add_edges_from([(0,1), (1,2), (2,3), (3,4), (2,5), (5,6)])
G2.add_edges_from([(0,1), (1,2), (2,3), (3,4), (2,5), (5,6)])

GM = isomorphism.GraphMatcher(G2, G1)
print(GM.is_isomorphic())  # True
print(GM.mapping)          # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}

retrieve closest element from a set of elements

I'm experimenting with an idea where I have the following subproblem:
I have a list of size m containing tuples of fixed length n.
[(e11, e12, .., e1n), (e21, e22, .., e2n), ..., (em1, em2, .., emn)]
Now, given some random tuple (t1, t2, .., tn) that does not belong to the list, I want to find the closest tuple(s) belonging to the list.
I use the following distance function (Hamming distance):
def distance(A, B):
    total = 0
    for e1, e2 in zip(A, B):
        total += e1 != e2  # count the positions that differ
    return total
One option is an exhaustive search, but this is not sufficient for my problem, as the lists are quite large. Another idea I have come up with is to first use k-medoids to cluster the list and retrieve K medoids (cluster centers). For a query, I can determine the closest cluster with K calls to the distance function, and then search for the closest tuple within that cluster alone. I think it should work, but I am not completely sure it is correct in cases where the query tuple lies on the edge between clusters.
I was wondering if you have a better idea for solving the problem, as my mind is completely blank at the moment; I have a strong feeling that there is a clever way to do it.
Solutions that require precomputing something are fine as long as they bring down the complexity of the query.
You can store a hash table (dictionary/map) that maps an element (of a tuple) to the tuples it appears in: hash: element -> list<tuple>.
Now, when you have a new query, you iterate over hash(element) for each element of the query, and find the tuple with the maximal number of hits.
pseudo code:
findMax(tuple):
    histogram <- empty map
    for each element in tuple:
        # assuming hash_table is the DS described above
        for each x in hash_table[element]:
            histogram[x]++  # assuming lazy initialization to 0
    return the key with the highest value in histogram
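A minimal Python sketch of this (note that, as described, it counts shared elements regardless of their position, so it only approximates the positional metric from the question):

from collections import defaultdict

tuples = [(1, 2, 3), (2, 3, 1), (5, 3, 2), (1, 2, 1)]

# hash_table: element -> indices of the tuples it appears in
hash_table = defaultdict(list)
for idx, t in enumerate(tuples):
    for element in set(t):
        hash_table[element].append(idx)

def find_max(query):
    histogram = defaultdict(int)
    for element in set(query):
        for idx in hash_table[element]:
            histogram[idx] += 1
    return max(histogram, key=histogram.get)  # tuple sharing the most elements

print(tuples[find_max((5, 3, 2))])  # (5, 3, 2)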
An alternative, which does not exactly follow the metric you want, is a k-d tree. The difference is that a k-d tree also takes into consideration the "distance" between elements, not only their equality/inequality.
k-d trees also require the elements to be comparable.
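For example, with SciPy (this uses the Euclidean metric, not Hamming, so it is only a sketch of the idea):

import numpy as np
from scipy.spatial import cKDTree

data = np.array([(1, 2, 3), (2, 3, 1), (5, 3, 2), (1, 2, 1)])
tree = cKDTree(data)
dist, idx = tree.query((1, 2, 0), k=1)  # nearest stored tuple to the query
print(data[idx])  # [1 2 1]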
If your data is big enough, you may want to create inverted indexes over it.
Say the data consists of m vectors of n elements each.
Data:
0: 1, 2, 3, 4, 5, ...
1: 2, 3, 1, 5, 3, ...
2: 5, 3, 2, 1, 3, ...
3: 1, 2, 1, 5, 3, ...
...
m: m0, ... mn
Then you want to get n indexes like this:
Index0
1: 0, 3
2: 1
5: 2
Index1
2: 0, 3
3: 1, 2
Index2
3: 0
1: 1, 3
2: 2
...
Then you search only your indexes to get the tuples that contain any of the query tuple's values, and find the closest tuple among those candidates.
def search(query, indexes, data):
    # Gather candidate row ids from the n positional indexes
    candidate_ids = set()
    for i, value in enumerate(query):
        candidate_ids.update(indexes[i].get(value, []))
    # Among the candidates, return the tuple with the minimal distance
    return min((data[c] for c in candidate_ids),
               key=lambda t: distance(t, query))
The heavy part of the process is creating the indexes; once you have built them, the search will be really fast.
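A sketch of that index-building step, using the search function above and the distance function from the question (the names are mine):

from collections import defaultdict

def build_indexes(data):
    # indexes[i][value] -> ids of the rows whose i-th element equals value
    indexes = [defaultdict(list) for _ in range(len(data[0]))]
    for row_id, row in enumerate(data):
        for i, value in enumerate(row):
            indexes[i][value].append(row_id)
    return indexes

data = [(1, 2, 3), (2, 3, 1), (5, 3, 2), (1, 2, 1)]
indexes = build_indexes(data)
print(search((5, 3, 3), indexes, data))  # (5, 3, 2), the closest stored tuple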
