Minimize a DFA with don't care transitions - algorithm

I have a DFA (Q, Σ, δ, q0, F) with some “don't care transitions.” These transitions model symbols which are known not to appear in the input in some situations. If any such transition is taken, it doesn't matter whether the resulting string is accepted or not.
Is there an algorithm to compute an equivalent DFA with a minimal number of states? Normal DFA minimization algorithms cannot be used, as they don't know about "don't care" transitions, and there doesn't seem to be an obvious way to extend them.

I think this problem is NP-hard (more on that in a bit). This is what I'd try.
(Optional) Preprocess the input via the usual minimization algorithm with accept/reject/don't care as separate outcomes. (Since don't care is not equivalent to accept or reject, we get the Myhill–Nerode equivalence relation back, allowing a variant of the usual algorithm.)
Generate a conflict graph as follows. Start with an edge between every accepting state and every rejecting state. Take the closure where we iteratively add an edge q1—q2 whenever there exists a symbol s such that the edge δ(q1, s)—δ(q2, s) is already present.
Color this graph with as few colors as possible. (Or approximate.) Lots and lots of coloring algorithms out there. PartialCol is a good starting point.
Merge each color class into a single node. This potentially makes the new transition function multi-valued, but we can choose arbitrarily.
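Here's a rough Python sketch of the conflict-graph construction, coloring, and merging steps above. Plain greedy coloring stands in for PartialCol, and the dict-based encoding of δ (with None marking a don't-care transition) is my own assumption, not something from the question:

from itertools import combinations

def merge_with_dont_cares(states, alphabet, delta, accepting, rejecting):
    # delta[q][s] is the successor of q on s, or None for a don't-care
    # transition (this encoding is an assumption, not from the question).
    # Conflict graph: accepting/rejecting pairs conflict; then close under
    # predecessors: if delta(q1, s) conflicts with delta(q2, s), so do q1, q2.
    conflicts = {(a, r) for a in accepting for r in rejecting}
    conflicts |= {(b, a) for a, b in conflicts}
    changed = True
    while changed:
        changed = False
        for q1, q2 in combinations(states, 2):
            if (q1, q2) in conflicts:
                continue
            for s in alphabet:
                t1, t2 = delta[q1].get(s), delta[q2].get(s)
                if t1 is not None and t2 is not None and (t1, t2) in conflicts:
                    conflicts |= {(q1, q2), (q2, q1)}
                    changed = True
                    break
    # Greedy coloring (a real implementation would use PartialCol or similar).
    color = {}
    for q in states:
        used = {color[p] for p in color if (q, p) in conflicts}
        color[q] = next(c for c in range(len(states)) if c not in used)
    # Merge each color class; where the merged transition is multi-valued we
    # just keep whichever target we saw last (an arbitrary choice).
    new_delta = {}
    for q in states:
        for s, t in delta[q].items():
            if t is not None:
                new_delta.setdefault(color[q], {})[s] = color[t]
    return color, new_delta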
With access to an alphabet of arbitrary size, it seems easy enough to make this reduction to coloring run the other way, proving NP-hardness. The open question for me is whether having a fixed-size alphabet constrains the conflict graph in such a way as to make the resulting coloring instance easier somehow. Alas, I don't have time to research this.

I believe a slight variation of the normal Moore algorithm works. Here's a statement of the algorithm.
Let S be the set of states.
Let P be the set of all unordered pairs drawn from S.
Let M be a subset of P of "marked" pairs.
Initially set M to all pairs where one state is accepting and the other isn't.
Let T(x, c) be the transition function from state x on character c.
Do
    For each pair z = <a, b> in P - M
        For each character c in the alphabet
            If <T(a, c), T(b, c)> is in M
                Add z to M and continue with the next z
Until no new additions to M
The final set P - M is a pairwise description of an equivalence relation on states. From it you can create a minimum DFA by merging states and transitions of the original.
I believe don't care transitions can be handled by never marking (adding to M) pairs based on them. That is, we change one line:
If T(a, c) != DC and T(b, c) != DC and <T(a, c), T(b, c)> is in M
(Actually in an implementation, no real algorithm change is needed if DC is a reserved value of type state that's not a state in the original FA.)
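For concreteness, here is a minimal Python sketch of the marking loop with that change. The DC sentinel and the dict-of-dicts encoding of T are my own assumptions:

from itertools import combinations

DC = object()  # reserved "don't care" pseudo-state, never a real state

def unmarked_pairs(states, alphabet, T, accepting):
    # T[x][c] is the successor of state x on character c, or DC.
    # Initially mark pairs where one state is accepting and the other isn't.
    marked = {frozenset(p) for p in combinations(states, 2)
              if (p[0] in accepting) != (p[1] in accepting)}
    changed = True
    while changed:
        changed = False
        for a, b in combinations(states, 2):
            if frozenset((a, b)) in marked:
                continue
            for c in alphabet:
                ta, tb = T[a][c], T[b][c]
                # The one changed line: don't-care transitions never split.
                # (The explicit DC tests are in fact redundant: a pair that
                # contains DC can never be in `marked`.)
                if ta is not DC and tb is not DC and frozenset((ta, tb)) in marked:
                    marked.add(frozenset((a, b)))
                    changed = True
                    break
    # P - M: the pairs of (candidate-)equivalent states.
    return [(a, b) for a, b in combinations(states, 2)
            if frozenset((a, b)) not in marked]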
I don't have time to think about a formal proof right now, but this makes intuitive sense to me. We skip splitting equivalence classes of states based on transitions that we know will never occur.
The thing I still need to prove to myself is whether the set P - M is still a pairwise description of an equivalence relation. I.e., can we end up with <a,b> and <b,c> but not <a,c>? If so, is there a fixup?

Related

Select N items such that their properties are balanced

Let's say I have N objects and each of them has associated values A and B. This could be represented as a list of tuples like:
[(3,10), (8,4), (0,0), (20,7),...]
where each tuple is an object and the two values are A and B.
What I want to do is select M of these objects (where M < N) such that the sums of A and B in the selected subset are as balanced as possible. M here is a parameter of the problem; I don't want to find the optimal M. I want to be able to say "give me 100 objects, and make them as balanced as possible".
Any idea if there is an efficient algorithm which can solve this problem (not necessarily completely optimally)? I think this might be related to bin packing, but I'm not really sure.
This is a disguised variant of subset-sum. Replace each (A,B) by A-B, and then the absolute value of the sum of all selected A-B values is the "unbalancedness" of the sums. So you really have to select M of those scalars and try to have a sum as close to 0 as possible.
The "variant" bit is because you have to select exactly M items. (I think this is why your mind went to bin-packing rather than subset-sum.) If you have a black-box subset-sum solver you can account for this too: if the maximum single-pair absolute difference is D, replace each (A,B) by (A-B+D) and have the target sum be M*D. (Don't do that if you're doing a dynamic programming approach, of course, since it increases the magnitude of the numbers you're working with.)
Presuming that you're fine with an approximation (and if you're not, you're gonna have a real bad day), I would tend to use Simulated Annealing or Late Acceptance Hill Climbing as a basic approach: start with a greedy initial solution (iteratively add whichever object results in the minimal difference), then in each round consider randomly replacing one selected object with a not-currently-selected one.
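A rough Python sketch of that loop, with plain hill climbing on non-worsening swaps standing in for full simulated annealing (the function name and iteration budget are made up):

import random

def balance_select(items, M, iterations=100000):
    diffs = [a - b for a, b in items]
    # Greedy start: repeatedly add the item keeping |sum of diffs| smallest.
    chosen, rest = set(), set(range(len(items)))
    total = 0
    for _ in range(M):
        i = min(rest, key=lambda j: abs(total + diffs[j]))
        chosen.add(i); rest.remove(i); total += diffs[i]
    # Local search: swap one chosen item for an unchosen one, keeping
    # non-worsening moves (simulated annealing would also sometimes
    # accept worsening ones).
    for _ in range(iterations if rest else 0):
        i = random.choice(tuple(chosen))
        j = random.choice(tuple(rest))
        new_total = total - diffs[i] + diffs[j]
        if abs(new_total) <= abs(total):
            chosen.remove(i); chosen.add(j)
            rest.remove(j); rest.add(i)
            total = new_total
    return [items[i] for i in sorted(chosen)]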

fastest implementation to cut n×n board into n connected n-minos

I am attempting to create a puzzle that lets players piece together an n×n grid using n connected n-minos (definition: a connected piece of n 1×1 blocks; e.g., each of the Tetris pieces is a 4-mino). However, generating a way to cut the grid first proves to be a challenge, despite seeming easy enough for a human.
[image: example board]
For a human, generating such a solution is a relatively easy task by recursively following this logic/pseudo-code:
:start_of_recursion:
    Start with a random "least connected" cell (an end, corner, or edge cell with the fewest neighbors) as the starting block of the mino
    :start_of_recursion:
        "Grow" in a random available direction from a random cell of the current mino
        If the "grow" separates the remaining board into regions whose sizes aren't multiples of n, try another location and direction
        If all locations and directions have been attempted, revert to the previous board configuration (shouldn't really occur?)
        If size n has been reached, exit recursion
    :end_of_recursion:
    If the board has been filled, exit recursion
:end_of_recursion:
Performing this routine seems to generate a solution in O(n^2) steps; however, the condition checks prove to be really expensive for computers. To determine whether the board is connected, a human simply checks for any "gap" inside the remaining region, in almost O(1) fashion for a simple non-overlapping shape, whereas my implementation needs to "spread" from a point on the board into its neighboring territory and then check whether any point lies outside of reach (O(n) at best). Since this check is performed every time in the innermost iteration, it degrades the whole routine to O(n^(3+)) and becomes really inefficient.
Is there a method to check for "gaps" in a manner similar to that of human cognition? Or can the problem be fundamentally rethought and simplified into a problem that is easier for a computer to solve?
Your problem sounds like a variant of the bin packing problem. I would approach this with a constraint satisfaction method. Below I'll use MiniZinc pseudo-code.
A board consists of cells, and each cell can be colored with one of several colors. We can represent it as follows:
int: rows;
int: cols;
int: colors_num;
set of int: colors = 1..colors_num;
array [1..rows, 1..cols] of var colors: board;
Next, we add constraints. For example, if a cell has color A, then at least one adjacent cell must have the same color A:
constraint forall (i in 2..rows-1, j in 2..cols-1, c in colors) (
    if board[i, j] == c then
        at_least(1, [board[i-1, j], board[i+1, j], board[i, j-1], board[i, j+1]], c)
    else
        true
    endif
); % border cells need the same constraint with their reduced neighbor lists
You can describe all allowed/prohibited shapes as constraints or use some other smart approach introducing possible cuts.
Constraint satisfaction should be much more efficient than your recursive approach. However, it's not very scalable: if you try to generate a game for a gigantic board (hundreds or thousands of cells/colors), it will take quite some time and memory to generate the minos.

Logical Constraints : How to express symmetry in a graph?

I've got a logical problem. I found some expression of what I want, but for now I'm overconstrained and I don't know how to relax it.
The Context:
Let's assume we've got a directed graph which can have cycles. There are no weights on the edges. So, something really simple as follows (don't bother with what is inside each node, it is not important):
One important thing: w1 never changes its name.
The Problem:
I'm formalizing a logical problem with graphs. To say that there is an edge between node wi and node wj, I have a propositional variable (which can be true or false) named Rij. So for N nodes, I've got N^2 propositional variables.
These graphs have isomorphisms, and I would like to formalize the constraint
that if there is no edge between i and j, then there is no edge between i and k (for k > j).
In other words, I want to put constraints in place that break the symmetry in the graphs.
I tried something naïve, as follows:
forall i, forall j (>i), forall k (>j), ( not(Rij) --> not(Rik) )
This constraint works quite often; unfortunately, it is overconstraining. It sometimes happens that we can't label the nodes in any way that respects these constraints.
[image: example of the problem]
With this example, it is impossible to name A, B, C such that if there is no edge between wi and wj, then there is no edge between wi and wk (with k > j).
Sum-up of the problem
Is there a way to adapt these logical constraints (or to formulate totally new ones) so that we can always rename A, B, C... in one and only one way, no matter the original graph? My problem: how do I formulate these constraints?
Thanks in advance for your help !
Best Regards.

Tricky algorithm for sorting symbols in an array while preserving relationships via order

The problem
I have multiple groups which specify the relationships of symbols. For example:
[A B C]
[A D E]
[X Y Z]
What these groups mean is that (for the first group) the symbols A, B, and C are related to each other; (for the second group) the symbols A, D, and E are related to each other; and so forth.
Given all this data, I need to put all the unique symbols into a one-dimensional array wherein the symbols which are somehow related to each other are placed closer to each other. Given the example above, the result should be something like:
[B C A D E X Y Z]
or
[X Y Z D E A B C]
In this resulting array, since the symbol A has multiple relationships (namely with B and C in one group and with D and E in another) it's now located between those symbols, somewhat preserving the relationship.
Note that the order is not important. In the result, X Y Z can be placed first or last since those symbols are not related to any other symbols. However, the closeness of the related symbols is what's important.
What I need help in
I need help in determining an algorithm that takes groups of symbol relationships and outputs the one-dimensional array using the logic above. I'm pulling my hair out on how to do this, since with real data the number of symbols in a relationship group can vary, there is no limit to the number of relationship groups, and a symbol can have relationships with any other symbol.
Further example
To further illustrate the trickiness of my dilemma, suppose we add another relationship group to the example above, say:
[C Z]
The result now should be something like:
[X Y Z C B A D E]
Notice that the symbols Z and C are now closer together since their relationship was reinforced by the additional data. All previous relationships are still retained in the result also.
The first thing you need to do is to precisely define the result you want.
You do this by defining how good a result is, so that you know which is the best one. Mathematically you do this by a cost function. In this case one would typically choose the sum of the distances between related elements, the sum of the squares of these distances, or the maximal distance. Then a list with a small value of the cost function is the desired result.
It is not clear whether in this case it is feasible to compute the best solution by some special method (maybe if you choose the maximal distance or the sum of the distances as the cost function).
In any case it should be easy to find a good approximation by standard methods.
A simple greedy approach would be to insert each element in the position where the resulting cost function for the whole list is minimal.
Once you have a good starting point you can try to improve it further by modifying the list towards better solutions, for example by swapping elements or rotating parts of the list (local search, hill climbing, simulated annealing, other).
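A small Python sketch of that recipe, using the sum of distances between related pairs as the cost function, greedy insertion for the start, and non-worsening random swaps as the local search (all names here are made up):

import itertools, random

def order_symbols(groups, iterations=50000):
    # Cost: sum of distances between every related pair in the ordering.
    pairs = set()
    for g in groups:
        pairs |= {frozenset(p) for p in itertools.combinations(set(g), 2)}

    def cost(order):
        pos = {s: i for i, s in enumerate(order)}
        return sum(abs(pos[a] - pos[b]) for a, b in map(tuple, pairs)
                   if a in pos and b in pos)

    # Greedy start: insert each symbol where the partial cost is minimal.
    order = []
    for s in dict.fromkeys(s for g in groups for s in g):  # unique, in order seen
        order = min((order[:i] + [s] + order[i:] for i in range(len(order) + 1)),
                    key=cost)
    # Local search: keep random swaps that don't increase the cost.
    best = cost(order)
    for _ in range(iterations if len(order) > 1 else 0):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        c = cost(order)
        if c <= best:
            best = c
        else:
            order[i], order[j] = order[j], order[i]  # undo
    return order

print(order_symbols([list("ABC"), list("ADE"), list("XYZ"), list("CZ")]))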
I think that with large amounts of data and a lack of additional criteria, it's going to be very, very difficult to make something that finds the best option. Have you considered doing a greedy algorithm (constructing your solution incrementally in a way that gives you something close to the ideal solution)? Here's my idea:
Sort your sets of related symbols by size, and start with the largest one. Keep those all together, because without any other criteria, we might as well say their proximity is the most important since it's the biggest set. Consider every symbol in that first set an "endpoint", an endpoint being a symbol you can rearrange and put at either end of your array without damaging your proximity rule (everything in the first set is an endpoint initially because they can be rearranged in any way). Then go through your list and as soon as one set has one or more symbols in common with the first set, connect them appropriately. The symbols that you connected to each other are no longer considered endpoints, but everything else still is. Even if a bigger set only has one symbol in common, I'm going to guess that's better than smaller sets with more symbols in common, because this way, at least the bigger set stays together as opposed to possibly being split up if it was put in the array later than smaller sets.
I would go on like this, updating the list of endpoints as I went so that I could continue making matches through the rest of the sets. I would keep track of whether I stopped making matches, and in that case I'd just go back to the top of the list and tack on the next biggest unmatched set (it doesn't matter if there are no more matches to be made, so go with the most valuable/biggest association). Ditch the old endpoints, since they have no matches, and then all the symbols of the set you just tacked on are the new endpoints.
This may not have a good enough runtime, I'm not sure. But hopefully it gives you some ideas.
Edit: Obviously, as part of the algorithm, ditch duplicates (trivial).
The problem as described is essentially the problem of drawing a graph in one dimension.
Using the relationships, construct a graph. Treat the unique symbols as the vertices of the graph. Place an edge between any two vertices that co-occur in a relationship; more sophisticated would be to construct a weight based on the number of relationships in which the pair of symbols co-occur.
Algorithms for drawing graphs place well-connected vertices closer to one another, which is equivalent to placing related symbols near one another. Since only an ordering is needed, the symbols can just be ranked based on their positions in the drawing.
There are a lot of algorithms for drawing graphs. In this case, I'd go with Fiedler ordering, which orders the vertices using a particular eigenvector (the Fiedler vector) of the graph Laplacian. Fiedler ordering is straightforward, effective, and optimal in a well-defined mathematical sense.
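Here's a minimal numpy sketch of Fiedler ordering with the co-occurrence weighting described above. It assumes the graph is connected; for a disconnected graph you'd order each component separately:

import numpy as np
from itertools import combinations

def fiedler_order(groups):
    # Build the co-occurrence graph: vertices are symbols, edge weights count
    # how many groups each pair shares.
    symbols = sorted({s for g in groups for s in g})
    index = {s: i for i, s in enumerate(symbols)}
    W = np.zeros((len(symbols), len(symbols)))
    for g in groups:
        for a, b in combinations(set(g), 2):
            W[index[a], index[b]] += 1
            W[index[b], index[a]] += 1
    L = np.diag(W.sum(axis=1)) - W  # graph Laplacian L = D - W
    # Fiedler vector: eigenvector of the second-smallest eigenvalue
    # (eigh returns eigenvalues in ascending order).
    eigenvalues, eigenvectors = np.linalg.eigh(L)
    fiedler = eigenvectors[:, 1]
    return [symbols[i] for i in np.argsort(fiedler)]

print(fiedler_order([list("ABC"), list("ADE"), list("XYZ"), list("CZ")]))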
It sounds like you want to do topological sorting: http://en.wikipedia.org/wiki/Topological_sorting
Regarding the initial ordering, it seems like you are trying to enforce some kind of stability condition, but from your question it is not really clear to me what this should be. Could you try to be a bit more precise in your description?

Guaranteeing Unique Surrogate Key Assignment - Maximum Matching for Non-Bipartite Graph

I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.
In other words, if source system A has natural key ABC representing the same entity as source system B's natural key DEF, we would assign the same surrogate key to both. The table would look like this:
SURROGATE_KEY   SOURCE_A_NATURAL_KEY   SOURCE_B_NATURAL_KEY
1               ABC                    DEF
That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A's natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:
SURROGATE_KEY   SOURCE_A_NATURAL_KEY   SOURCE_B_NATURAL_KEY
1               ABC                    NULL
2               ABC                    DEF
So the warehouse is a mess. There are much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate-key-to-natural-key mappings.
A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:
Wikipedia - Matching
MIT 18.433 Combinatorial Optimization - Lecture Notes on Non-Bipartite Matching
I need an easy-to-understand implementation (not an optimally performing one) of Edmonds' paths, trees, and flowers algorithm. I don't have a formal math or CS background, what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well-written explanation that guides me to an implementation would be deeply appreciated.
EDIT:
A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C...) paints you into a local-maximum corner.
In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyways, so my backside is covered.
Not using surrogate keys doesn't change the matching problem. There's still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.
I get the impression you're going about this the wrong way; as cdonner says, there are other ways to just rebuild the key structure without going through this mess. In particular, you need to guarantee that natural keys are always unique for a given record (violating this condition is what got you into this mess!). Having both ABC and DEF identify the same record is disastrous, but ultimately repairable. I'm not even sure why you need surrogate keys at all; while they do have many advantages, I'd give some consideration to going pure-relational and just gutting them from your schema, a la Celko; it might just get you out of this mess. But that's a decision that would have to be made after looking at your whole schema.
To address your potential solution, I've pulled out my copy of D. B. West's Introduction to Graph Theory, second edition, which describes the blossom algorithm on page 144. You'll need some mathematical background, with both mathematical notation and graph theory, to follow the algorithm, but it's sufficiently concise that I think it can help (if you decide to go this route). If you need explanation, first consult a resource on graph theory (Wikipedia, your local library, Google, wherever), or ask if you're not finding what you need.
3.3.17. Algorithm. (Edmonds' Blossom Algorithm [1965a]---sketch).
Input. A graph G, a matching M in G, an M-unsaturated vertex u.
Idea. Explore M-alternating paths from u, recording for each vertex the vertex from which it was reached, and contracting blossoms when found. Maintain sets S and T analogous to those in Algorithm 3.2.1, with S consisting of u and the vertices reached along saturated edges. Reaching an unsaturated vertex yields an augmentation.
Initialization. S = {u} and T = {} (empty set).
Iteration. If S has no unmarked vertex, stop; there is no M-augmenting path from u. Otherwise, select an unmarked v in S. To explore from v, successively consider each y in N(v) such that y is not in T.
If y is unsaturated by M, then trace back from y (expanding blossoms as needed) to report an M-augmenting (u, y)-path.
If y is in S, then a blossom has been found. Suspend the exploration of v and contract the blossom, replacing its vertices in S and T by a single new vertex in S. Continue the search from this vertex in the smaller graph.
Otherwise, y is matched to some w by M. Include y in T (reached from v), and include w in S (reached from y).
After exploring all such neighbors of v, mark v and iterate.
The algorithm as described here runs in time O(n^4), where n is the number of vertices. West gives references to versions that run as fast as O(n^(5/2)) or O(n^(1/2) m) (m being the number of edges). If you want these references, or citations to Edmonds' original paper, just ask and I'll dig them out of the index (which kind of sucks in this book).
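If you want to sanity-check your results while working through the sketch, the blossom algorithm also ships in networkx: with unit edge weights and maxcardinality=True, max_weight_matching returns a maximum-cardinality matching. A tiny example with invented record names:

import networkx as nx

# Nodes are source-system records; an edge means "these two natural keys
# plausibly denote the same entity". The pairs below are invented examples.
G = nx.Graph()
G.add_edges_from([("A:ABC", "B:DEF"), ("A:GHI", "B:DEF"), ("A:GHI", "B:JKL")])

# Unit weights + maxcardinality=True gives a maximum-cardinality matching.
# Recent networkx versions return a set of matched pairs.
matching = nx.max_weight_matching(G, maxcardinality=True)
for u, v in matching:
    print("assign one surrogate key to", u, "and", v)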
I think you would be better off by establishing a set of rules and attacking your key mapping table with a set of simple queries that enforce each rule, in an iterative fashion. Maybe I am oversimplifying because your example is simple.
The following are examples of rules - only you can decide which ones apply:
if there are duplicates, use the lowest (oldest) surrogate key
use the natural keys from the row with the highest (latest) surrogate key
use the natural keys from the most complete mapping row
use the most recent occurrence of every natural key
... ?
Writing queries that rebuild your key mapping is trivial, once you have established the rules. I am not sure how this could be a math problem?
If you are looking for an implementation, Eppstein's PADS library has a matching algorithm; it should be fast enough for your purposes. The general matching algorithm is in CardinalityMatching.py. The comments in the implementation explain what is going on. The library is easy to use: to supply a graph in Python, you represent it as a dictionary G such that G[v] gives a list (or set) of the neighbors of vertex v.
Example:
G = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
gives a line graph with 4 vertices.
