finding largest region with small difference more efficiently - algorithm

There is a grid consisting of h * w (h, w <= 200) pixels. Every pixel is represented by a value, and we want to find the largest continuous region.
A continuous region is defined in this way:
Given a point P(x, y), the connected region must include this point.
There exists a reference point R(x, y) of value u, and every point in the connected region must be connected to this point. Also, there is a value g_critical (g_critical <= 100000). Let the value of a point in the connected region be v; the difference between u and v must be smaller than or equal to g_critical.
The question is to find the size of the largest connected region.
For example, consider the grid below, with h = 5, w = 5, g_critical = 3 and P(x, y) = (2, 4):
1 3 7 9 2
2 5 6 6 8
3 5 9 3 6
2 7 3 2 9
In this case, the bold region is the largest connected region. Notice that R(x, y) is chosen at (2, 3) or (2, 2) in this case. The size of the region is 14.
I have rephrased the question a bit so it is shorter. So if there is any ambiguity, please point it out in the comment. This question is also in our private judge so I am unable to share the problem source here.
My attempt
I have tried to loop through every cell, treat it as the R point and use BFS to find the connected region attached to it, then check whether P is contained in the region.
The complexity is O(h * h * w * w), which is too large. Is there any way to optimize it?
I am guessing that starting with P might help, but I am not sure how I should do it. Maybe there is some kind of flood fill algorithm that allows me to do it?
Thanks in advance.

There's an O(h w √(g_critical) α(h w))-time algorithm (where α is the inverse Ackermann function, constant for practical purposes) that uses a disjoint set data structure with an "undo" operation and a variant of Mo's trick. The idea is, decompose the interval [v − g_critical, v] into about √g_critical subintervals of length about √g_critical. For each subinterval [a, b], prepare a disjoint set data structure representing the components of the matrix where the allowed values are [b, a + 2 g_critical]. Then for each c in [a, b], extend the disjoint set with points whose values lie in [c, b) and (a + 2 g_critical, c + 2 g_critical] and report the number of nodes in the component of P(x,y), then undo these operations (keep a stack of the writes made to the structure, with original values; then pop each one, writing the original values).
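The "undo" operation mentioned above can be sketched as follows. This is a minimal illustration, not the answer's exact structure: the class name RollbackDSU and its methods are hypothetical, and the sketch deliberately drops path compression (so find costs O(log n) rather than the inverse Ackermann bound) because that keeps every write trivially reversible.

```python
class RollbackDSU:
    """Disjoint set union with union by size and an undo stack.

    Assumption of this sketch: no path compression, so each union makes
    exactly one write to `parent` and one to `size`, both easy to revert.
    """

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.history = []  # one entry per union call; None marks a no-op

    def find(self, x):
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            self.history.append(None)
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        # Attach the smaller tree (rb) below the larger one (ra).
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.history.append((rb, ra))
        return True

    def snapshot(self):
        """Return a marker that rollback() can later return to."""
        return len(self.history)

    def rollback(self, snap):
        """Undo every union performed after the given snapshot."""
        while len(self.history) > snap:
            entry = self.history.pop()
            if entry is not None:
                child, root = entry
                self.parent[child] = child
                self.size[root] -= self.size[child]

    def component_size(self, x):
        return self.size[self.find(x)]
```

In the scheme described above, one would take a snapshot after building the structure for a subinterval [a, b], add the extra cells for a particular c, read component_size for P, and then roll back to the snapshot before trying the next c.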
There's also an O(h w log(h w))-time algorithm that you're not going to like because it uses dynamic trees. (Sleator–Tarjan's 1985 construction based on splay trees is the simplest and works OK here.) Posting it mainly in case it inspires a more practical approach.
The high-level idea is a kinetic algorithm that "slides" the interval of allowed values over the at most g_critical + 1 possibilities, repeatedly reporting and taking the max over the size of the connected component containing P.
To do this, we need to maintain the maximum spanning forest on a derived graph. Given a node-weighted undirected graph, construct a new, edge-weighted graph by subdividing each edge and setting the weight of each new edge to the weight of the old node to which it is incident. Deleting the lightest nodes in the graph is straightforward – all paths avoid these nodes if possible, so the new maximum spanning forest won't have any more edges. To add the lightest nodes not in the graph, try to link their incident edges one at a time. If a cycle would form, evert one endpoint, query the path minimum from the other endpoint, and unlink that minimum.
To report the size of the component containing P we need another decoration that captures the size of the concrete subtree (as opposed to the represented subtree) of each node. The details get a bit gnarly.

Here are some heuristics which might help:
First some pre-processing in O(h*w*log(h*w)):
Store matrix values in an array, sort it
Now every value in array is a possible candidate for point R
Also, the maximum component will consist of values in the range [R - g_critical, R + g_critical]
So we can estimate size of component (in best case) with simple binary search
Now some heuristics:
Sort array again this time by estimated component size in descending order
Now try BFS in this order if estimated size is bigger than currently found best size
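The heuristic above could be sketched roughly like this. Several things here are assumptions rather than part of the original answer: cells are 4-connected, P is given as a 0-indexed (row, column) pair, the BFS starts from P (since P must be in the region), and a candidate only counts if a cell with the reference value is actually inside the region found. The function names are made up for illustration.

```python
from bisect import bisect_left, bisect_right
from collections import deque


def bfs_from_p(grid, p, u, g):
    """Size of the region around p using values in [u - g, u + g].

    Returns 0 if p is outside the value window or if no cell with the
    reference value u ends up inside the region.
    """
    h, w = len(grid), len(grid[0])
    lo, hi = u - g, u + g
    if not lo <= grid[p[0]][p[1]] <= hi:
        return 0
    seen = {p}
    queue = deque([p])
    has_reference = False
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == u:
            has_reference = True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < h and 0 <= nc < w and (nr, nc) not in seen
                    and lo <= grid[nr][nc] <= hi):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) if has_reference else 0


def largest_region(grid, p, g):
    values = sorted(v for row in grid for v in row)
    # Optimistic estimate per candidate reference value: every cell whose
    # value falls in the window, whether or not it is connected to p.
    candidates = []
    for u in set(values):
        estimate = bisect_right(values, u + g) - bisect_left(values, u - g)
        candidates.append((estimate, u))
    candidates.sort(reverse=True)

    best = 0
    for estimate, u in candidates:
        if estimate <= best:
            break  # estimates are descending, so nothing later can win
        best = max(best, bfs_from_p(grid, p, u, g))
    return best
```

The worst case is still one BFS per distinct value, so this only helps when the estimates prune most candidates.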

Related

Modifying Dijkstra's algorithm to work with edges having more than one possible cost

We have a directed weighted graph where an edge between two nodes can have more than one possible cost value (more precisely, at most 2 costs). I need to use a time-dependent variant of Dijkstra's algorithm that can handle two possible ways of getting from one node to another, the cost between the nodes (edge cost) being dependent on the time at which we arrive at the source node and the type of edge we are about to use. When traversing from one node to the other, only one of these edges is picked and its cost is added to the same total cost.
I currently model the two possible costs for an edge as two separate edges between the same nodes.
There is a similar problem I found here and it was suggested to augment the graph by duplicating the nodes. However, this does not allow returning to the original graph and implies the overhead of, well, duplicating all the nodes and possibly edges between them and original nodes.
Do you have any suggestions as to how to tackle this problem with as little overhead as possible? (The original graph is expected to be huge)
Thanks
Edit:
I provided more details about the problem in the first paragraph
You can safely ignore the largest of the two costs for algorithm purposes.
Assume there is a shortest path that uses the largest cost between two vertices. You can change it to use the smallest cost and the path will cost less, which contradicts the assumption.
I think you can hack step 3 of Dijkstra's algorithm:
For the current node, consider all of its unvisited neighbors and calculate their tentative distances. Compare the newly calculated tentative distance to the current assigned value and assign the smaller one. For example, if the current node A is marked with a distance of 6, and the edge connecting it with a neighbor B has length 2, then the distance to B (through A) will be 6 + 2 = 8. If B was previously marked with a distance greater than 8 then change it to 8. Otherwise, keep the current value.
In your setup, you have two distances from A to B, depending on how late it is. You use the second one if your current distance to A is above your time threshold.
This step becomes :
if current distance to A above threshold:
    current distance to B = min(current distance to B, current distance to A + d2(A, B))
else:
    current distance to B = min(current distance to B, current distance to A + d1(A, B))
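The modified relaxation step above could be dropped into a heap-based Dijkstra along these lines. This is only a sketch under assumptions that the answer does not state: the graph is a dict mapping each node to a list of (neighbor, d1, d2) triples, there is a single global threshold, and arriving at a node later never lets you reach its neighbor earlier (the usual non-overtaking condition that time-dependent Dijkstra relies on). The function name is hypothetical.

```python
import heapq


def dijkstra_two_costs(graph, source, threshold):
    """Shortest distances where each edge has an early cost d1 and a late cost d2.

    graph[u] is a list of (v, d1, d2). The late cost d2 is used when the
    distance to u is above `threshold`, otherwise d1 is used.
    """
    dist = {source: 0}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        for v, d1, d2 in graph.get(u, ()):
            cost = d2 if d > threshold else d1
            candidate = d + cost
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist
```

Both costs live on one edge record, so no nodes or edges are duplicated; the choice is made at relaxation time.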

Algorithm that finds the connectivity distance of a graph on uniform points on the unit square

Situation
Suppose we are given n points on the unit square [0, 1]x[0, 1] and a positive real number r. We define the graph G(point 1, point 2, ..., point n, r) as the graph on vertices {1, 2, ..., n} such that there is an edge connecting two given vertices if and only if the distance between the corresponding points is less than or equal to r. (You can think of the points as transmitters, which can communicate with each other as long as they are within range r.)
Given n points on the unit square [0, 1]x[0, 1], we define the connectivity distance as the smallest possible r for which G(point 1, point 2, ..., point n, r) is connected.
Problem 1) find an algorithm that determines if G(point 1, point 2, ..., point n, r) is connected
Problem 2) find an algorithm that finds the connectivity distance for any n given points
My partial solution
I have an algorithm (Algorithm 1) in mind for problem 1. I haven't implemented it yet, but I'm convinced it works. (Roughly, the idea is to start from vertex 1, and try to reach all other vertices through the edges. I think it would be somewhat similar to this.)
All that remains is problem 2. I also have an algorithm in mind for this one. However, I think it is not efficient time wise. I'll try to explain how it works:
You must first convince yourself that the connectivity distance rmin is necessarily the distance between two of the given points, say p and q. Hence, there are at most n(n-1)/2 possible values for rmin.
So, first, my algorithm would measure all n(n-1)/2 distances and store them (in an array in C, for instance) in increasing order. Then it would use Algorithm 1 to test each stored value (in increasing order) to see if the graph is connected with such range. The first value that does the job is the answer, rmin.
My question is: is there a better (time wise) algorithm for problem 2?
Remarks: the points will be randomly generated (something like 10000 of them), so that's the type of thing the algorithm is supposed to solve "quickly". Furthermore, I'll implement this in C. (If that makes any difference.)
Here is an algorithm which requires O(n^2) time and O(n) space.
It's based on the observation that if you partition the points into two sets, then the connectivity distance cannot be less than the distance of the closest pair of points one from each set in the partition. In other words, if we build up the connected graph by always adding the closest point, then the largest distance we add will be the connectivity distance.
Create two sets, A and B. Put a random point into A and all the remaining points into B.
Initialize r (the connectivity distance) to 0.
Initialize a map M with the distance to every point in B of the point in A.
While there are still points in B:
    Select the point b in B whose distance M[b] is the smallest.
    If M[b] is greater than r, set r to M[b].
    Remove b from B and add it to A.
    For each point p in M:
        If p is b, remove it from M.
        Otherwise, if the distance from b to p is less than M[p], set M[p] to that distance.
When all the points are in A, r will be the connectivity distance.
Each iteration of the while loop takes O(|B|) time, first to find the minimum value in M (whose size is equal to the size of B); second, to update the values in M. Since a point is moved from B to A in each iteration, there will be exactly n iterations, and thus the total execution time is O(n^2).
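The steps above translate fairly directly into code. The sketch below is one possible rendering, with two simplifications that are assumptions of this example: the first point is used as the starting point instead of a random one, and the map M is kept as a plain list indexed by point number. It uses math.dist, which needs Python 3.8 or later.

```python
import math


def connectivity_distance(points):
    """Smallest r for which the points form a connected graph.

    Runs in O(n^2) time and O(n) extra space, growing set A one point
    at a time and always adding the point of B closest to A.
    """
    n = len(points)
    if n < 2:
        return 0.0
    # best[i] plays the role of M[i]: distance from point i to the set A.
    best = [math.dist(points[0], q) for q in points]
    in_a = [False] * n
    in_a[0] = True
    r = 0.0
    for _ in range(n - 1):
        # Select the point of B closest to A.
        b = min((i for i in range(n) if not in_a[i]), key=best.__getitem__)
        r = max(r, best[b])
        in_a[b] = True
        # Update the distances of the remaining points of B.
        for i in range(n):
            if not in_a[i]:
                d = math.dist(points[b], points[i])
                if d < best[i]:
                    best[i] = d
    return r
```

For about 10000 points this performs on the order of 10^8 distance evaluations and never stores the full list of pairwise distances.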
The algorithm presented above is an improvement to a previous algorithm, which used an (unspecified) solution to the bichromatic closest pair (BCP) problem to recompute the closest neighbour to A in every cycle. Since there is an O(n log n) solution to BCP, this implied a solution to the original problem in O(n^2 log n). However, maintaining and updating the list of closest points is actually much simpler, and only requires O(n). Thanks to @LajosArpad for a question which triggered this line of thought.
I think your ideas are reasonably good, however, I have an improvement for you.
In fact you build up an array based on measurement and you sort your array. Very nice. At least with not too many points.
The number of n(n-1)/2 is a logical consequence of your pairing requirement. So, for 10000 elements, you will have 49995000 elements. You will need to increase significantly the speed! Also, this number of elements would eat a lot of your memory storage.
How can you achieve greater speed?
First of all, don't build arrays. You already have an array. Secondly, you can easily solve your problem by traversing. Let's suppose you have a function which determines whether a given distance is enough to connect all the nodes; let's call this function "valid". It is not enough, because you need to find the minimal possible value. So, if you don't have more information about the nodes prior to the execution of the algorithm, then my suggestion is this solution:
lowerBound <- 0
upperBound <- infinite
i <- 0
while i < numberOfElements do
    j <- i + 1
    while j < numberOfElements do
        distance <- d(elements[i], elements[j])
        if distance < upperBound and distance > lowerBound then
            if valid(distance) then
                upperBound <- distance
            else
                lowerBound <- distance
            end if
        end if
        j <- j + 1
    end while
    i <- i + 1
end while
After traversing all the elements the value of upperBound will hold the smallest distance which still connects the network. You didn't store all the distances, as they were far too many and you have solved your problem in a single cycle. I hope you find my answer helpful.
If some distance makes graph connected, any larger distance would make it connected too. To find minimal connecting distance just sort all distances and use binary search.
Time complexity is O(n^2*log n), space complexity is O(n^2).
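A small sketch of the sort-and-binary-search idea follows. The connectivity test is a plain graph search that recomputes distances on the fly, which is an assumption of this example (the answer does not say how connectivity is checked); function names are hypothetical, and math.dist needs Python 3.8 or later.

```python
import math
from itertools import combinations


def is_connected(points, r):
    """True if every point can be reached from point 0 using hops of length <= r."""
    n = len(points)
    if n == 0:
        return True
    seen = [False] * n
    seen[0] = True
    stack = [0]
    reached = 1
    while stack:
        i = stack.pop()
        for j in range(n):
            if not seen[j] and math.dist(points[i], points[j]) <= r:
                seen[j] = True
                reached += 1
                stack.append(j)
    return reached == n


def connectivity_distance_bsearch(points):
    """Binary search over the sorted pairwise distances (O(n^2) memory)."""
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    if not dists:
        return 0.0
    lo, hi = 0, len(dists) - 1  # the largest distance always connects
    while lo < hi:
        mid = (lo + hi) // 2
        if is_connected(points, dists[mid]):
            hi = mid
        else:
            lo = mid + 1
    return dists[lo]
```

The search relies on the monotonicity stated above: once some distance connects the graph, every larger one does too.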
You can start with some small distance d then check for connectivity. If the Graph is connected, you're done, if not, increment d by a small distance then check again for connectivity.
You also need a clever algorithm to avoid O(N^2) in case N is big.

Find sub-array of objects with maximum distance between elements

Let there be an array of objects [a, b, c, d, ...] and a function distance(x, y) that gives a numeric value showing how 'different' objects x and y are.
I'm looking for an algorithm to find the subset of the array of length n that maximizes the minimum difference between the elements of that subset.
Of course, I can't simply sort the array by the minimum of the differences with other elements and take the n highest entries, since removing an element can very well change the distances. For instance, if a=b, then removing a means the minimum distance of b with another element will change dramatically.
So far, the only solution I could find was either to iteratively remove the element with the lowest minimum distance and re-calculate the distances at each iteration, until only n elements are left, or, vice versa, to iteratively pick new elements, recalculate the distances, and add the new pick or replace an existing one based on the distance minimums.
Does anybody know how I could get the same results without those iterations?
PS: here is an example, the matrix shows the 'distance' between each element...
a b c d
a - 1 3 2
b 1 - 4 2
c 3 4 - 5
d 2 2 5 -
If we'd keep only 2 elements, that would be c and d; if we'd keep 3, that would be a or b, c and d.
This problem is NP-hard, so no-one knows an efficient (polynomial time) algorithm for solving it.
Here's a quick sketch that it is NP-hard, by reduction from CLIQUE.
Suppose we have an instance of CLIQUE in the form of a graph G and a number n and we want to know whether there is a clique of size n in G. Construct a distance matrix d such that d(i, j) = 1 if vertices i and j are connected in G, or 0 if they are not. Then find a subset of the vertices of G of size n that maximizes the minimum distance between elements (your problem). If the minimum distance between vertices in this subset is 1, then G has a clique of size n; otherwise it does not.
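Given the hardness result, an exact answer is only realistic for small arrays. The following exhaustive search is a sketch for that case, not an efficient algorithm: it tries every subset of size n (assumed to be at least 2) and so takes exponential time in general. The function name and the dictionary-based distance used in the usage note are made up for illustration.

```python
from itertools import combinations


def best_subset(items, distance, n):
    """Return (subset, score) where score is the largest achievable
    minimum pairwise distance among subsets of size n (n >= 2)."""

    def min_pairwise(subset):
        return min(distance(x, y) for x, y in combinations(subset, 2))

    best = max(combinations(items, n), key=min_pairwise)
    return best, min_pairwise(best)
```

With the distance matrix from the question stored in a dict keyed by unordered pairs, best_subset("abcd", dist, 2) picks c and d, and for n = 3 it returns one of the two tied subsets containing c and d.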
As Gareth said, this is an NP-hard problem. However, there has been a lot of research into solving these kinds of problems, and as such better methods than brute force have been found. Unfortunately this is such a large area that you could spend forever looking at the possible implementations of a solution.
However if you are interested in a heuristic way of solving this I would suggest looking into Ant Colony Optimization (ACO) which has proven fairly effective at finding optimum paths within graphs.

What is the algorithm for generating a random Deterministic Finite Automaton?

The DFA must have the following four properties:
The DFA has N nodes
Each node has 2 outgoing transitions.
Each node is reachable from every other node.
The DFA is chosen with perfectly uniform randomness from all possibilities
This is what I have so far:
1. Start with a collection of N nodes.
2. Choose a node that has not already been chosen.
3. Connect its output to 2 other randomly selected nodes.
4. Label one transition 1 and the other transition 0.
5. Go to 2, unless all nodes have been chosen.
6. Determine if there is a node with no incoming connections.
7. If so, steal an incoming connection from a node with more than 1 incoming connection.
8. Go to 6, unless there are no nodes with no incoming connections.
However, this algorithm is not correct. Consider the graph where node 1 has its two connections going to node 2 (and vice versa), while node 3 has its two connections going to node 4 (and vice versa). That is something like:
1 <==> 2
3 <==> 4
Where, by <==> I mean two outgoing connections both ways (so a total of 4 connections). This seems to form 2 cliques, which means that not every state is reachable from every other state.
Does anyone know how to complete the algorithm? Or, does anyone know another algorithm? I seem to vaguely recall that a binary tree can be used to construct this, but I am not sure about that.
Strong connectivity is a difficult constraint. Let's generate uniform random surjective transition functions and then test them with e.g. Tarjan's linear-time SCC algorithm until we get one that's strongly connected. This process has the right distribution, but it's not clear that it's efficient; my researcher's intuition is that the limiting probability of strong connectivity is less than 1 but greater than 0, which would imply only O(1) iterations are necessary in expectation.
Generating surjective transition functions is itself nontrivial. Unfortunately, without that constraint it is exponentially unlikely that every state has an incoming transition. Use the algorithm described in the answers to this question to sample a uniform random partition of {(1, a), (1, b), (2, a), (2, b), …, (N, a), (N, b)} with N parts. Permute the nodes randomly and assign them to parts.
For example, let N = 3 and suppose that the random partition is
{{(1, a), (2, a), (3, b)}, {(2, b)}, {(1, b), (3, a)}}.
We choose a random permutation 2, 3, 1 and derive a transition function
(1, a) |-> 2
(1, b) |-> 1
(2, a) |-> 2
(2, b) |-> 3
(3, a) |-> 1
(3, b) |-> 2
In what follows I'll use the basic terminology of graph theory.
You could:
Start with a directed graph with N vertices and no arcs.
Generate a random permutation of the N vertices to produce a random Hamiltonian cycle, and add it to the graph.
For each vertex add one outgoing arc to a randomly chosen vertex.
The result will satisfy all three requirements.
There is an algorithm with expected running time O(n^{3/2}).
If you generate a uniform random digraph with m vertices such that each vertex has k labelled out-arcs (a k-out digraph), then with high probability the largest SCC (strongly connected component) in this digraph is of size around c_k m, where c_k is a constant depending on k. Actually, there is about 1/\sqrt{m} probability that the size of this SCC is exactly c_k m (rounded to an integer).
So you can generate a uniform random 2-out digraph of size n/c_k, and check the size of the largest SCC. If its size is not exactly n, just try again until success. The expected number of trials needed is \sqrt{n}. And generating each digraph should be done in O(n) time. So in total the algorithm has expected running time O(n^{3/2}). See this paper for more details.
Just keep growing a set of nodes which are all reachable. Once they're all reachable, fill in the blanks.
Start with a set of N nodes called A.
Choose a node from A and put it in set B.
While there are nodes left in set A
    Choose a node x from set A
    Choose a node y from set B with less than two outgoing transitions.
    Choose a node z from set B
    Add a transition from y to x.
    Add a transition from x to z
    Move x to set B
For each node n in B
    While n has less than two outgoing transitions
        Choose a node m in B
        Add a transition from n to m
Choose a node to be the start node.
Choose some number of nodes to be accepting nodes.
Every node in set B can reach every node in set B. As long as a node can be reached from a node in set B and that node can reach a node in set B, it can be added to the set.
The simplest way that I can think of is to (uniformly) generate a random DFA with N nodes and two outgoing edges per node, ignoring the other constraints, and then throw away any that are not strongly connected (which is easy to test using a strongly connected components algorithm). Generating uniform DFAs should be straightforward without the reachability constraint. The one thing that could be problematic performance-wise is how many DFAs you would need to skip before you found one with the reachability property. You should try this algorithm first, though, and see how long it ends up taking to generate an acceptable DFA.
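The generate-and-reject approach above can be sketched in a few lines. Assumptions of this example: states are numbered 0 to n-1, delta[q] is a pair (target on input 0, target on input 1), and strong connectivity is tested with two reachability searches from state 0 (forward and on the reversed graph) rather than a full SCC algorithm. Start and accepting states are left out. Because every transition function is equally likely before rejection, the accepted ones are uniform among the strongly connected ones; how many tries are needed is not bounded here.

```python
import random


def reachable(start, successors):
    """Set of states reachable from `start` in the given successor lists."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in successors[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_strongly_connected(delta):
    n = len(delta)
    reverse = [[] for _ in range(n)]
    for u, targets in enumerate(delta):
        for v in targets:
            reverse[v].append(u)
    # Strongly connected iff state 0 reaches everything and everything reaches state 0.
    return len(reachable(0, delta)) == n and len(reachable(0, reverse)) == n


def random_strongly_connected_dfa(n, rng=random):
    """Rejection sampling: draw uniform transition tables until one passes."""
    while True:
        delta = [(rng.randrange(n), rng.randrange(n)) for _ in range(n)]
        if is_strongly_connected(delta):
            return delta
```

Passing a seeded random.Random instance as rng makes runs reproducible, which is handy when measuring how many candidates get thrown away.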
We can start with a random number of states N1 between N and 2N.
Take state number 1 as the initial state.
For each state, for each character in the input alphabet we generate a random transition (between 1 and N1).
We take the part of the automaton that is reachable from the initial state and check its number of states; after a few tries we get one with N states.
If we want a minimal automaton as well, only the assignment of final states remains; there is a good chance that a random assignment yields a minimal automaton.
The following references seem to be relevant to your question:
F. Bassino, J. David and C. Nicaud, Enumeration and random generation of possibly incomplete deterministic automata, Pure Mathematics and Applications 19 (2-3) (2009) 1-16.
F. Bassino and C. Nicaud. Enumeration and Random Generation of Accessible Automata. Theor. Comp. Sc.. 381 (2007) 86-104.

Algorithm/approximation for combined independent set/hamming distance

Input: Graph G
Output: several independent sets, so that the membership of a node to all independent sets is unique. A node therefore has no connections to any node in its own set. Here is an example path.
Since clarification was called for, here is another rephrasing:
Divide a given graph into sets so that
I can tell a node from all others by its membership in sets, e.g. if node i is present only in set A, no other node should be present in set A only
if node j is present in set A and B, then no other node should be present in set A and B only. If the membership of nodes is coded by a bit pattern, then these bit patterns have Hamming distance at least one
if two nodes are adjacent in the graph, they should not be present in the same set, hence each set is an independent set
Example:
B has no adjacent nodes
D=>A, A=>D
Solution:
A B /
/ B D
A has bit pattern 10 and no adjacent node in its set. B has bit pattern 11 and no adjacent node, D has 01
therefore all nodes have Hamming distance at least 1 and no adjacent nodes => correct
Wrong, because D and A are connected:
A D B
/ D B
A has bit pattern 10 and D in its set, they are adjacent. B has bit pattern 11 and no adjacent node, D has 11 as has B, so there are two errors in this solution and therefore it is not accepted.
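The two conditions illustrated by these examples can be checked mechanically. The helper below is a hypothetical sketch: it assumes each node's set memberships are encoded as an integer bitmask (one bit per set) and that edges are given as unordered pairs. It only validates a candidate solution; it does not construct one.

```python
def check_assignment(patterns, edges):
    """Return a list of problems found in a candidate solution.

    patterns: dict mapping node -> int bitmask of set memberships.
    edges: iterable of (u, v) pairs of adjacent nodes.
    An empty list means the assignment satisfies both conditions.
    """
    problems = []
    first_owner = {}
    for node, bits in patterns.items():
        if bits in first_owner:
            # Hamming distance 0 between two nodes: memberships not unique.
            problems.append(f"{node} and {first_owner[bits]} share pattern {bits:b}")
        else:
            first_owner[bits] = node
    for u, v in edges:
        if patterns[u] & patterns[v]:
            # Adjacent nodes appear together in at least one set.
            problems.append(f"adjacent nodes {u} and {v} share a set")
    return problems
```

Run on the two examples above, the accepted solution yields no problems, and the rejected one yields exactly the two errors described.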
Of course this should be extended to more Sets as the number of Nodes in the Graph increases, since you need at least log(n) sets.
I already wrote a transformation into MAX-SAT, to use a SAT solver for this, but the number of clauses is just too big. A more direct approach would be nice. So far I have an approximation, but I would like an exact solution or at least a better approximation.
I have tried an approach where I used a particle swarm to optimize from an arbitrary solution towards a better one. However, the running time is pretty awful and the results are far from great. I am looking for a dynamic algorithm or something, but I cannot fathom how to divide and conquer this problem.
Not a complete answer, and I don't know how useful it will be to you. But here goes:
The hamming distance strikes me as a red herring. Your problem statement says it must be at least 1 but it could be 1000. It suffices to say the bit encoding for each node's set memberships is unique.
Your problem statement doesn't spell it out, but your solution above suggests every node must be a member of at least 1 set. ie. a bit encoding of all 0's is not allowed for any node's set memberships.
Ignoring connected nodes for a moment, disjoint nodes are easy: Simply number them sequentially with an unused bit encoding. Save those for last.
Your example above uses directed edges, but again, that strikes me as a red herring. If A cannot be in the same set as D because A=>D, D cannot be in the same set as A regardless whether D=>A.
You mention needing at least log(N) sets. You will also have at most N sets. A fully connected graph (with (N^2-N)/2 undirected edges) will require N sets each containing a single node.
In fact, if your graph contains a fully connected simplex of M dimensions (M in 1..N-1) with M+1 vertices and (M^2+M)/2 undirected edges, you will require at least M+1 sets.
In your example above, you have one such simplex (M=1) with 2 vertices {A,D} and 1 (undirected) edge {(A,D)}.
It would seem that your problem boils down to finding the largest fully connected simplexes in your graph. Stated differently, you have a routing problem: How many dimensions do you need to route your edges so none cross? It doesn't sound like a very scalable problem.
The first large simplex found is easy. Every vertex node gets a new set with its own bit.
The disjoint nodes are easy. Once the connected nodes are dealt with, simply number the disjoint nodes sequentially skipping any previously used bit patterns. From your example above, since A and D take 01 and 10, the next available bit pattern for B is 11.
The tricky part then becomes how to fold any remaining simplexes as much as possible into the existing range before creating any new sets with new bits. When folding, one must use 2 or more bits (sets) for each node, and the bits (sets) must not intersect with the bits (sets) for any adjacent node.
Consider what happens to your example above when one adds another node, C, to the example:
If C connects directly to both A and D, then the initial problem becomes finding the 2-simplex with 3 vertices {A,C,D} and 3 edges {(A,C),(A,D),(C,D)}. Once A, C and D take the bit patterns 001, 010 and 100, the lowest available bit pattern for disjoint B is 011.
If, on the other hand, C connects directly to A or D but not both, the graph has two 1-simplexes. Supposing we find the 1-simplex with vertices {A,D} first, giving them the bit patterns 01 and 10, the problem then becomes how to fold C into that range. The only bit pattern with at least 2 bits is 11, but that intersects with whichever node C connects to, so we have to create a new set and put C in it. At this point, the solution is similar to the one above.
If C is disjoint, either B or C will get the bit pattern 11 and the remaining one will need a new set and get the bit pattern 100.
Suppose C connects to B but not to A or D. Again, the graph has two 1-simplexes but this time disjoint. Suppose {A,D} is found first as above giving A and D the bit patterns 10 and 01. We can fold B or C into the existing range. The only available bit pattern in the range is 11 and either B or C could get that pattern as neither is adjacent to A or D. Once 11 is used, no bit patterns with 2 or more bits set remain, and we will have to create a new set for the remaining node giving it the bit pattern 100.
Suppose C connects to all 3 A, B and D. In this case, the graph has a 2-simplex with 3 vertexes {A,C,D} and a 1-simplex with 2 vertexes {B, C}. Proceeding as above, after processing the largest simplex, A, C and D will have bit patterns 001, 010, 100. For folding B into this range, the available bit patterns with 2 or more bits set are: 011, 101, 110 and 111. All of these except 101 intersect with C so B would get the bit pattern 101.
The question then becomes: How efficiently can you find the largest fully-connected simplexes?
If finding the largest fully connected simplex is too expensive, one could put an approximate upper bound on potential fully connected simplexes by finding maximal minimums in terms of connections:
1. Sweep through the edges, updating the vertices with a count of the connecting edges.
2. For each connected node, create an array of Cn counts, initially zero, where Cn is the count of edges connected to the node n.
3. Sweep through the edges again; for the connected nodes n1 and n2, increment the count in n1 corresponding to Cn2 and vice versa. If Cn2 > Cn1, update the last count in the n1 array and vice versa.
4. Sweep through the connected nodes again, calculating an upper bound on the largest simplex each node could be a part of. You could build a pigeon-hole array with a list of vertices for each upper bound as you sweep through the nodes.
5. Work through the pigeon-hole array from largest to smallest, extracting and folding nodes into unique sets.
If your nodes are in a set N and your edges in a set E, the complexity will be:
O(|N|+|E|+O(Step 5))
If the above approximation suffices, the question becomes: How efficiently can you fold nodes into existing ranges given the requirements?
This may not be the answer you expect, but I can't find a place to add a comment, so I am typing it directly here. I can't fully understand your question. Does it need specific knowledge to understand? What is this independent set? As far as I know, a node in an independent set of a directed graph has a two-way path to any other node in this set. Is your notion the same?
If this problem is like what I assume, independent sets can be found by this algorithm:
1. Do a depth-first search on the directed graph, recording the time at which the traversal of the tree rooted at each node finishes.
2. Then reverse all the edges in the graph.
3. Do a depth-first search again on the modified graph.
The algorithm is explained precisely in the book "Introduction to Algorithms".
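The three steps above are what is usually called Kosaraju's algorithm, and what it computes are the strongly connected components of a directed graph (groups of nodes with two-way paths between them), which is a different notion from an independent set in the sense of the question. A sketch, assuming the graph is a dict mapping each node to a list of successors, and using iterative searches to avoid recursion limits:

```python
def strongly_connected_components(graph):
    """Return the strongly connected components as a list of node lists."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Step 1: depth-first search, recording nodes in order of finishing time.
    order, visited = [], set()
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: this node is finished.
                order.append(node)
                stack.pop()

    # Step 2: reverse all the edges.
    reverse = {u: [] for u in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    # Step 3: search the reversed graph in decreasing finishing time;
    # each search tree is one strongly connected component.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components
```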

Resources