Tricky algorithm for sorting symbols in an array while preserving relationships via order

The problem
I have multiple groups which specify the relationships between symbols. For example:
[A B C]
[A D E]
[X Y Z]
What these groups mean is that, for the first group, the symbols A, B, and C are related to each other; for the second group, the symbols A, D, and E are related to each other; and so forth.
Given all this data, I need to put all the unique symbols into a one-dimensional array in which symbols that are somehow related to each other are placed closer to each other. Given the example above, the result should be something like:
[B C A D E X Y Z]
or
[X Y Z D E A B C]
In this resulting array, since the symbol A has multiple relationships (namely with B and C in one group and with D and E in another) it's now located between those symbols, somewhat preserving the relationship.
Note that the order is not important. In the result, X Y Z can be placed first or last since those symbols are not related to any other symbols. However, the closeness of the related symbols is what's important.
What I need help in
I need help determining an algorithm that takes groups of symbol relationships and outputs the one-dimensional array using the logic above. I'm pulling my hair out over this because, with real data, the number of symbols in a relationship group can vary, there is no limit to the number of relationship groups, and a symbol can have relationships with any other symbol.
Further example
To further illustrate the trickiness of my dilemma, suppose you add another relationship group to the example above, say:
[C Z]
The result now should be something like:
[X Y Z C B A D E]
Notice that the symbols Z and C are now closer together since their relationship was reinforced by the additional data. All previous relationships are still retained in the result also.

The first thing you need to do is to precisely define the result you want.
You do this by defining how good a result is, so that you know which one is best. Mathematically, you do this with a cost function. In this case one would typically choose the sum of the distances between related elements, the sum of the squares of these distances, or the maximal distance. A list with a small value of the cost function is then the desired result.
It is not clear whether in this case it is feasible to compute the best solution by some special method (maybe if you choose the maximal distance or the sum of the distances as the cost function).
In any case it should be easy to find a good approximation by standard methods.
A simple greedy approach would be to insert each element in the position where the resulting cost function for the whole list is minimal.
Once you have a good starting point you can try to improve it further by modifying the list towards better solutions, for example by swapping elements or rotating parts of the list (local search, hill climbing, simulated annealing, other).
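For concreteness, here is a minimal sketch of that greedy insertion in Python, assuming the sum of squared distances between related pairs as the cost function (the names are illustrative, and the full rescan per insertion is only fine for small inputs):

def greedy_order(pairs, symbols):
    # Insert each symbol at the position that minimizes the total cost.
    order = []
    for s in symbols:
        best_pos, best_cost = 0, float("inf")
        for pos in range(len(order) + 1):
            candidate = order[:pos] + [s] + order[pos:]
            idx = {sym: i for i, sym in enumerate(candidate)}
            # Sum of squared distances over pairs already placed.
            cost = sum((idx[a] - idx[b]) ** 2
                       for a, b in pairs if a in idx and b in idx)
            if cost < best_cost:
                best_pos, best_cost = pos, cost
        order.insert(best_pos, s)
    return order

groups = [list("ABC"), list("ADE"), list("XYZ"), list("CZ")]
pairs = {(a, b) for g in groups for a in g for b in g if a < b}
symbols = sorted({s for g in groups for s in g})
print(greedy_order(pairs, symbols))   # e.g. ['Y', 'X', 'Z', 'B', 'C', 'A', 'E', 'D']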

I think that, with large amounts of data and no additional criteria, it's going to be very difficult to make something that finds the best option. Have you considered a greedy algorithm (construct your solution incrementally in a way that gives you something close to the ideal solution)? Here's my idea:
Sort your sets of related symbols by size, and start with the largest one. Keep those all together, because without any other criteria, we might as well say their proximity is the most important since it's the biggest set. Consider every symbol in that first set an "endpoint", an endpoint being a symbol you can rearrange and put at either end of your array without damaging your proximity rule (everything in the first set is an endpoint initially because they can be rearranged in any way). Then go through your list and as soon as one set has one or more symbols in common with the first set, connect them appropriately. The symbols that you connected to each other are no longer considered endpoints, but everything else still is. Even if a bigger set only has one symbol in common, I'm going to guess that's better than smaller sets with more symbols in common, because this way, at least the bigger set stays together as opposed to possibly being split up if it was put in the array later than smaller sets.
I would go on like this, updating the list of endpoints as you went through your sets so that you could continue making matches. I would keep track of whether I stopped making matches, and in that case, I'd just go to the top of the list and tack on the next biggest, unmatched set (it doesn't matter that there are no more matches to be made, so go with the most valuable/biggest association). Ditch the old endpoints, since they have no matches, and then all the symbols of the set you just tacked on are the new endpoints.
This may not have a good enough runtime, I'm not sure. But hopefully it gives you some ideas.
Edit: Obviously, as part of the algorithm, ditch duplicates (trivial).
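A loose Python sketch of this greedy merge (purely illustrative: it skips the exact endpoint bookkeeping and simply pulls shared symbols to the tail, while they are still free to move, so each overlapping set's new symbols land beside them):

def greedy_merge(groups):
    # Largest sets first; duplicates within a set dropped.
    groups = sorted((list(dict.fromkeys(g)) for g in groups),
                    key=len, reverse=True)
    order = list(groups[0])
    remaining = groups[1:]
    while remaining:
        # Prefer a set sharing symbols with the result; else the biggest left.
        pick = next((g for g in remaining if set(g) & set(order)),
                    remaining[0])
        remaining.remove(pick)
        shared = [s for s in pick if s in order]
        new = [s for s in pick if s not in order]
        for s in shared:          # move shared symbols to the tail...
            order.remove(s)
        order += shared + new     # ...and append the new ones beside them
    return order

print(greedy_merge([list("ABC"), list("ADE"), list("XYZ"), list("CZ")]))
# e.g. ['B', 'A', 'D', 'E', 'C', 'Z', 'X', 'Y']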

The problem as described is essentially the problem of drawing a graph in one dimension.
Using the relationships, construct a graph. Treat the unique symbols as the vertices of the graph. Place an edge between any two vertices that co-occur in a relationship; more sophisticated would be to construct a weight based on the number of relationships in which the pair of symbols co-occur.
Algorithms for drawing graphs place well-connected vertices closer to one another, which is equivalent to placing related symbols near one another. Since only an ordering is needed, the symbols can just be ranked based on their positions in the drawing.
There are a lot of algorithms for drawing graphs. In this case, I'd go with Fiedler ordering, which orders the vertices using a particular eigenvector (the Fiedler vector) of the graph Laplacian. Fiedler ordering is straightforward, effective, and optimal in a well-defined mathematical sense.
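A minimal sketch of Fiedler ordering with NumPy, using the co-occurrence-weighted graph described above (dense linear algebra, fine for small symbol sets; note that for a disconnected graph the Fiedler vector is not unique, so order each connected component separately):

import numpy as np

def fiedler_order(groups):
    symbols = sorted({s for g in groups for s in g})
    index = {s: i for i, s in enumerate(symbols)}
    n = len(symbols)
    # Weight = number of groups in which a pair of symbols co-occurs.
    W = np.zeros((n, n))
    for g in groups:
        for a in g:
            for b in g:
                if a != b:
                    W[index[a], index[b]] += 1.0
    L = np.diag(W.sum(axis=1)) - W    # graph Laplacian
    vals, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    fiedler = vecs[:, 1]              # eigenvector of the 2nd-smallest eigenvalue
    return [symbols[i] for i in np.argsort(fiedler)]

print(fiedler_order([list("ABC"), list("ADE"), list("XYZ"), list("CZ")]))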

It sounds like you want to do topological sorting: http://en.wikipedia.org/wiki/Topological_sorting
Regarding the initial ordering, it seems like you are trying to enforce some kind of stability condition, but it is not really clear to me what this should be from your question. Could you try to be a bit more precise in your description?

Related

Can anyone suggest a quick triangle detection algorithm?

I have some data in the form of Line objects (e.g. Line1(start, end), where start and end are coordinates in the form of point objects). Is there a quick way to go through all the lines to see if any of them form a triangle? By quick I mean anything better than going through all nC3 possibilities.
Edit: Just realised I may not understand all the replies (I'm no Adrian Lamo). Please try and explain wrt Python.
1) geometric step: enter all line segments in a dictionary, with the first endpoint as the key and the second endpoint as the value. There will be duplicate keys, so you will keep a list of values rather than single values. In principle there will be no duplicates in the lists (unless you enter the same edge twice).
2) topological step: for all entries P in the dictionary, consider all the pairs of elements from its list, say (Q, R). Look up Q and check whether R belongs to the list of Q. If yes, you have found the triangle (P, Q, R).
By symmetry, all six permutations of every triangle will be reported. You can avoid that by enforcing that P<Q<R in the lexicographical sense.
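A sketch of those two steps in Python (it stores each edge under both endpoints and enforces P < Q < R, so each triangle is reported once; endpoints are assumed to be hashable, comparable tuples):

from itertools import combinations

def find_triangles(lines):
    # lines: iterable of ((x1, y1), (x2, y2)) endpoint pairs
    adj = {}
    for p, q in lines:
        adj.setdefault(p, set()).add(q)
        adj.setdefault(q, set()).add(p)
    triangles = []
    for p in adj:
        for q, r in combinations(sorted(adj[p]), 2):
            # p < q < r guarantees each triangle is found exactly once.
            if p < q and r in adj[q]:
                triangles.append((p, q, r))
    return triangles

lines = [((0, 0), (1, 0)), ((1, 0), (0, 1)), ((0, 1), (0, 0)), ((2, 2), (3, 3))]
print(find_triangles(lines))   # [((0, 0), (0, 1), (1, 0))]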
Is there a quick way to go through all the lines to see if any of them form a triangle?
Yes. Assuming that your Points are integers (or can be easily converted to such, because they have fixed significant digits or similar):
Being creative for you here:
Make two quickly searchable storage structures (e.g. std::multimap<int>), one for each of the x and y coordinates of the endpoints, associating each coordinate with a pointer to the respective line.
In the first structure, search for elements with the same x coordinate. Those are the only "candidates" for being a corner of a triangle. Searching for duplicate entries is fast because you are using an appropriate data structure.
For each of these candidates, verify that the endpoints actually coincide (i.e., also share the y coordinate), forming a corner. If they do not, discard.
For each of the remaining corners, verify that both "opposite" line ends are part of the same edge. Discard the others. Done.

Sorting a list when the comparison between any two elements may be ambiguous?

I'm trying to optimize a sort for an isometric renderer. What is proving tricky is that the comparator can return "A > B", "A < B", or "the relationship between A and B is ambiguous." All of the standard sorting algorithms expect that you can compare the values of any two objects, but I cannot do that here. Are there known algorithms that are made for this situation?
For example, there can be a situation where the relationship between A and C is ambiguous, but we know A > B, and B > C. The final list should be A,B,C.
Note also that there may be multiple solutions. If A > B, but C is ambiguous to both A and B, then the answer may be C,A,B or A,B,C.
Edit: I did say it was for sorting in an isometric renderer, but several people asked for more info, so here goes. I've got an isometric game and I'm trying to sort the objects. Each object has a rectangular, axis-aligned footprint in the world, which means it appears as a sort of diamond shape from the perspective of the camera. The height of the objects is not known, so an object in front of another object is assumed to be capable of occluding the one in the back, and therefore must be drawn after (sorted later in the list than) the one in the back.
I also left off an important consideration, which is that a small number of objects (the avatars) move around.
Edit #2: I finally have enough rep to post pictures! A and B...well, they aren't strictly ambiguous because in each case they have a definite relationship compared to each other. However, that relationship cannot be known by looking at the variables of A and B themselves. Only when we look at C as well can we know what order they go in.
I definitely think topological sort is the way to go, so I'm considering the question answered, but I'm happy to make clarifications to the question for posterity.
You may want to look at sorts for partial orders.
https://math.stackexchange.com/questions/55891/algorithm-to-sort-based-on-a-partial-order links to the right papers on this topic (specifically topological sorting).
Edit:
Given the definition of the nodes:
You need to make sure objects can never cause mutual occlusion. Take a look at the following grid, where the camera is in the bottom left corner.
______
____X_
_YYYX_
_YXXX_
_Y____
As you can see, parts of Y are hidden by X and parts of X are hidden by Y. Any drawing order will cause weird rendering. This can be solved in many ways, the simplest being to only allow convex, hole-free shapes as renderable primitives. Anything concave needs to be broken into chunks.
If you do this, you can then turn your partial order into a total order. Here's an example:
def compare(a, b):
    if a.max_x < b.min_x:
        return -1
    elif a.max_y < b.min_y:
        return -1
    elif b.max_x < a.min_x:
        return 1
    elif b.max_y < a.min_y:
        return 1
    else:
        # The bounding boxes intersect. If the objects are both
        # rectangular, this should be impossible. If you allow
        # non-rectangular convex shapes, like circles, you may
        # need to do something fancier here.
        raise NotImplementedError("I only handle non-intersecting bounding boxes")
And use any old sorting algorithm to give you your drawing order.
You should first build a directed graph; using that graph, you will be able to find the relationships by DFS-ing from each node.
Once you have relationships, some pairs might still be ambiguous. In that case look for partial sorting.
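For the unambiguous part, here is a minimal sketch of the topological-sort step (Kahn's algorithm), where graph[u] lists the objects that must be drawn after u (names illustrative):

from collections import deque

def topological_sort(graph):
    # Count how many predecessors each node has.
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] = indegree.get(v, 0) + 1
    queue = deque(u for u, d in indegree.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(indegree):
        raise ValueError("cycle detected: no consistent draw order")
    return order

# A must be drawn before B, and B before C; A vs. C is resolved transitively.
print(topological_sort({"A": ["B"], "B": ["C"], "C": []}))   # ['A', 'B', 'C']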

Find mutually compatible options from list of list of options

For the purposes of this question, let us call a list of mutually incompatible options an "OptionS". I have a list of such OptionS, where each Option, apart from disqualifying all other Options in its own OptionS list, also disqualifies some Options from the other OptionS lists. These rules are symmetrical, so if A forbids B, B forbids A.
I want to pick exactly one Option from each list, such that no Options disqualify each other. There are too many Options (and OptionS) and too few disqualifications in each step to brute-force a backtracking solution.
It reminds me a bit of Sudoku, but it is not an exact analog. From certain external factors, I have a rough likelihood for the different Options, or at least an ordering.
Is there a known better solution to this problem? Is it in NP?
Currently, I plan to just take random "paths" through the solution space, weighted by likelihood. A sort of simulated annealing.
EDIT - Clarification
I have a number, let's say between 5 and 500, of vectors.
Each vector contains a number, between 10 and 10000, of elements
Each element rules out a number of elements in the other vectors
This relation is symmetric
I want to pick exactly one element from each vector in a way that no elements disqualify each other
If there is no way to choose one from each vector, I want to at least choose as many as possible. The nature of the data is such that there will always be at least one (and at most a few) solution (or almost-solution - with just a few misses).
I cannot share the real data, but an example would be that the elements are integers between 1 and 10e9 and that only elements whose pairwise sum has more than P prime factors are allowed. Some numbers are more likely than others to "fit" other numbers, since larger numbers tend to have more factors, which makes some choices more likely, just as in the real data.
Pick P and the sizes and number of vectors as needed to make it suitably challenging :).
My naive solution:
I order the elements by how many other elements they rule out and try those that rule out few first (because that gives a larger chance of being able to pick one from each).
Then I order the vectors by how many elements their "best" element rules out. Vectors that rule out many other elements come first. So the most constrained vector is tried first, even though the least constrained elements of that vector are tried first.
I then search depth-first.
The problem with this approach is that if the first choice is wrong, the depth-first search will never have time to reach the next choice.
A better way, which I try to explain in a comment below, would be to score each partial choice of elements (each node in the search tree) according to how many elements you have chosen and how many are left. Then I could expand the highest-scoring node at each step, so the first choice is less rigid.
A similar way, which I might try first because it is slightly easier, is to do simulated annealing and take random paths, weighted by how many possibilities they keep, down the tree.
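A rough sketch of that best-first scoring idea (all names are illustrative, and the score weighting in particular would need tuning to the data):

import heapq

def best_first(vectors, compatible):
    # State: chosen elements so far, plus the surviving candidates
    # of every vector after filtering against the chosen elements.
    counter = 0                     # tie-breaker so states are never compared
    heap = [(0, counter, [], [set(v) for v in vectors])]
    while heap:
        _, _, chosen, remaining = heapq.heappop(heap)
        depth = len(chosen)
        if depth == len(vectors):
            return chosen           # one mutually compatible pick per vector
        for opt in remaining[depth]:
            pruned = [r if i <= depth else {o for o in r if compatible(opt, o)}
                      for i, r in enumerate(remaining)]
            if any(not r for r in pruned[depth + 1:]):
                continue            # some later vector was emptied: dead end
            survivors = sum(len(r) for r in pruned[depth + 1:])
            score = (depth + 1) * 10_000 + survivors  # deeper, roomier = better
            counter += 1
            heapq.heappush(heap, (-score, counter, chosen + [opt], pruned))
    return None                     # no complete compatible selection exists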
Depending on what constraints are allowed, I think you can reduce SAT to this.
Take a SAT expression e.g. (A|B|C)(~A|C|~D)...
Replace ~A by a and make a vector out of each term giving you {A,B,C} {a,C,d}...
You now have the problem of choosing one element from each vector, subject to the constraint that you cannot choose both versions of a variable - the constraints say that A is incompatible with a, B is incompatible with b, and so on.
If you can solve this instance of your problem, you can solve SAT: set to true the variables chosen in your problem as A, B, C, ...; set to false the variables chosen as a, b, c, ...; and make an arbitrary choice for anything not chosen. Therefore your problem is at least as hard as SAT. (Except if you don't encounter these sorts of constraints, in which case I have not proved that your problem is this hard.)
Conversely, given an instance of your problem, associate a variable with each element and write the constraints as boolean expressions (typically with only 2 variables). This gives something that looks like 2-SAT, except that you also need an expression of the form (A|B|C|D|...) for each vector, to say that you must choose at least one element from it. So the exact-solution version of your problem, at least, might code up quite nicely as input for a SAT solver; it is therefore in NP, and since we have already shown it is NP-hard, it is NP-complete.
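For concreteness, the clause-to-vector direction of this reduction is tiny in code (a sketch using DIMACS-style signed integers, where -1 plays the role of a):

def sat_to_vectors(clauses):
    # Each clause becomes a vector of literals; two chosen literals
    # conflict exactly when one is the negation of the other.
    vectors = [list(clause) for clause in clauses]
    conflict = lambda a, b: a == -b
    return vectors, conflict

# (A|B|C)(~A|C|~D) with A=1, B=2, C=3, D=4:
vectors, conflict = sat_to_vectors([[1, 2, 3], [-1, 3, -4]])
print(vectors, conflict(1, -1))    # [[1, 2, 3], [-1, 3, -4]] True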
My first recommendation would be to find an off-the-shelf constraint solver and try that (request a maximum-weight solution with the log-likelihoods as weights), but if you're determined to implement a solver from scratch, then I would suggest that you start with something like WalkSAT. To summarize the link in the language of your question: at all times, keep a list of option choices (one from each option list, not necessarily compatible) and a list of conflicts (i.e., a set of pairs of indexes into the list of option lists). Repeatedly choose a conflict at random and resolve it by choosing differently for one half of the conflict or the other, most of the time so as to decrease the number of conflicts afterward as much as possible, and some of the time randomly, perhaps according to the likelihoods. Good data structures will be essential in making this run fast.
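A minimal sketch of such a repair loop, assuming a symmetric predicate incompatible(i, a, j, b) between option index a of list i and option index b of list j (illustrative names; a real implementation would add restarts, incremental conflict counts, and the likelihood weighting):

import random

def repair_loop(option_lists, incompatible, iters=100_000, p_greedy=0.9):
    n = len(option_lists)
    choice = [random.randrange(len(lst)) for lst in option_lists]

    def conflicts(i, a):
        # How many current picks option a of list i clashes with.
        return sum(incompatible(i, a, j, choice[j])
                   for j in range(n) if j != i)

    for _ in range(iters):
        bad = [i for i in range(n) if conflicts(i, choice[i]) > 0]
        if not bad:
            return choice             # all picks are mutually compatible
        i = random.choice(bad)        # a list involved in some conflict
        if random.random() < p_greedy:
            # Greedy move: re-pick the option of list i with fewest conflicts.
            choice[i] = min(range(len(option_lists[i])),
                            key=lambda a: conflicts(i, a))
        else:
            # Random move to escape local minima.
            choice[i] = random.randrange(len(option_lists[i]))
    return None                       # iteration budget exhausted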

Computing overlaps of grids

Say I have two maps, each represented as a 2D array. Each map contains several distinct features (rocks, grass, plants, trees, etc.). I know the two maps are of the same general region but I would like to find out: 1.) if they overlap and 2.) if so, where does this overlap occur. Does anyone know of any algorithms which would help me do this?
[EDIT]
Each feature is contained entirely inside an array index. Although it is possible to discern (for example) a rock from a patch of grass, it is not possible to discern one rock from another (or one patch of grass from another).
When doing this in 1D, I would, for each index in the first collection (a string, really), try to find the largest match in the second collection. If the match goes to the end, I have an overlap (like in action and ionbeam).
match(A on B):
    for each i in length(A):
        see if A[i..] matches B[0..]

if no match found: do the same for B on A.
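In runnable form, the 1D check might look like this (a sketch; it returns the offset at which A's tail matches B's head, or None, and you would call it both ways as described):

def overlap_1d(a, b):
    for i in range(len(a)):
        # Does the tail of a starting at i match the head of b?
        if a[i:] == b[:len(a) - i]:
            return i
    return None

print(overlap_1d("action", "ionbeam"))   # 3  ("ion" is the shared part)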
For 2D, you do the same thing, basically: find an 'edge' of A that overlaps with the opposite edge of B. Only the edges aren't 1D, but 2D:
for each point xa, ya in A:
    find a row yb in B that has a match(A[ya] on B[yb])
    see if A[ya..] matches B[yb..]
You need to do this for the two diagonals, in each direction.
For one map, go through each feature and find the nearest other feature to it. Record these in a list, storing the type of each of the two features and the (dx, dy) between them. Store it in a hash table or sorted list. These entries are now location invariant, since they record only relative distances.
Now for your second map, start doing the same: pick any feature, find its closest neighbor, find the delta. Look for the same correspondence in the original map list. If the features are shared between the maps, you'll find it in the list and you now know one correspondence between the maps. Repeat for many features if necessary. The results will give you a decent answer of if the maps overlap, and if so, at what offset.
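A brute-force sketch of this relative-delta table (illustrative names; for large maps a k-d tree would speed up the nearest-neighbour step, and each map needs at least two features):

from collections import defaultdict

def delta_table(features):
    # features: list of (x, y, kind) tuples
    table = defaultdict(list)
    for x, y, kind in features:
        others = [f for f in features if f != (x, y, kind)]
        nx, ny, nkind = min(others,
                            key=lambda f: (f[0] - x) ** 2 + (f[1] - y) ** 2)
        # The key is location invariant: the two types plus their offset.
        table[(kind, nkind, nx - x, ny - y)].append((x, y))
    return table

def candidate_offsets(map_a, map_b):
    # Offsets (dx, dy) that would align map_b's features onto map_a's,
    # ranked by how many nearest-neighbour pairs vote for them.
    table_a = delta_table(map_a)
    votes = defaultdict(int)
    for key, positions_b in delta_table(map_b).items():
        for ax, ay in table_a.get(key, []):
            for bx, by in positions_b:
                votes[(ax - bx, ay - by)] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])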
Sounds like image registration (wikipedia), finding a transformation (translation only, in your case) which can align two images. There's a pile of software that does this sort of thing linked off the wikipedia page.

Guaranteeing Unique Surrogate Key Assignment - Maximum Matching for Non-Bipartite Graph

I am maintaining a data warehouse with multiple sources of data about a class of entities that have to be merged. Each source has a natural key, and what is supposed to happen is that one and only one surrogate key is created for each natural key for all time. If one record from one source system with a particular natural key represents the same entity as another record from another source system with a different natural key, the same surrogate key will be assigned to both.
In other words, if source system A has natural key ABC representing the same entity as source system B's natural key DEF, we would assign the same surrogate key to both. The table would look like this:
SURROGATE_KEY   SOURCE_A_NATURAL_KEY   SOURCE_B_NATURAL_KEY
1               ABC                    DEF
That was the plan. However, this system has been in production for a while, and the surrogate key assignment is a mess. Source system A would give natural key ABC on one day, before source system B knew about it. The DW assigned surrogate key 1 to it. Then source system B started giving natural key DEF, which represents the same thing as source system A's natural key ABC. The DW incorrectly gave this combo surrogate key 2. The table would look like this:
SURROGATE_KEY   SOURCE_A_NATURAL_KEY   SOURCE_B_NATURAL_KEY
1               ABC                    NULL
2               ABC                    DEF
So the warehouse is a mess. There are much more complex situations than this. I have a short timeline for a cleanup that requires figuring out a clean set of surrogate-key-to-natural-key mappings.
A little Googling reveals that this can be modeled as a matching problem in a non-bipartite graph:
Wikipedia - Matching
MIT 18.433 Combinatorial Optimization - Lecture Notes on Non-Bipartite Matching
I need an easy-to-understand (not optimally performing) implementation of Edmonds' paths, trees, and flowers algorithm. I don't have a formal math or CS background, and what I do have is self-taught, and I'm not in a math-y headspace tonight. Can anyone help? A well-written explanation that guides me to an implementation would be deeply appreciated.
EDIT:
A math approach is optimal because we want to maximize global fitness. A greedy approach (first take all instances of A, then B, then C...) paints you into a local-maximum corner.
In any case, I got this pushed back to the business analysts to do manually (all 20 million of them). I'm helping them with functions to assess global match quality. This is ideal since they're the ones signing off anyways, so my backside is covered.
Not using surrogate keys doesn't change the matching problem. There's still a 1:1 natural key mapping that has to be discovered and maintained. The surrogate key is a convenient anchor for that, and nothing more.
I get the impression you're going about this the wrong way; as cdonner says, there are other ways to just rebuild the key structure without going through this mess. In particular, you need to guarantee that natural keys are always unique for a given record (violating this condition is what got you into this mess!). Having both ABC and DEF identify the same record is disastrous, but ultimately repairable. I'm not even sure why you need surrogate keys at all; while they do have many advantages, I'd give some consideration to going pure-relational and just gutting them from your schema, a la Celko; it might just get you out of this mess. But that's a decision that would have to be made after looking at your whole schema.
To address your potential solution, I've pulled out my copy of D. B. West's Introduction to Graph Theory, second edition, which describes the blossom algorithm on page 144. You'll need some mathematical background, with both mathematical notation and graph theory, to follow the algorithm, but it's sufficiently concise that I think it can help (if you decide to go this route). If you need explanation, first consult a resource on graph theory (Wikipedia, your local library, Google, wherever), or ask if you're not finding what you need.
3.3.17. Algorithm. (Edmonds' Blossom Algorithm [1965a]---sketch).
Input. A graph G, a matching M in G, an M-unsaturated vertex u.
Idea. Explore M-alternating paths from u, recording for each vertex the vertex from which it was reached, and contracting blossoms when found. Maintain sets S and T analogous to those in Algorithm 3.2.1, with S consisting of u and the vertices reached along saturated edges. Reaching an unsaturated vertex yields an augmentation.
Initialization. S = {u} and T = {} (empty set).
Iteration. If S has no unmarked vertex, stop; there is no M-augmenting path from u. Otherwise, select an unmarked v in S. To explore from v, successively consider each y in N(v) such that y is not in T.
If y is unsaturated by M, then trace back from y (expanding blossoms as needed) to report an M-augmenting (u, y)-path.
If y is in S, then a blossom has been found. Suspend the exploration of v and contract the blossom, replacing its vertices in S and T by a single new vertex in S. Continue the search from this vertex in the smaller graph.
Otherwise, y is matched to some w by M. Include y in T (reached from v), and include w in S (reached from y).
After exploring all such neighbors of v, mark v and iterate.
The algorithm as described here runs in time O(n^4), where n is the number of vertices. West gives references to versions that run as fast as O(n^(5/2)) or O(n^(1/2) m) (m being the number of edges). If you want these references, or citations to Edmonds' original paper, just ask and I'll dig them out of the index (which kind of sucks in this book).
I think you would be better off by establishing a set of rules and attacking your key mapping table with a set of simple queries that enforce each rule, in an iterative fashion. Maybe I am oversimplifying because your example is simple.
The following are examples of rules - only you can decide which ones apply:
if there are duplicates, use the lowest (oldest) surrogate key
use the natural keys from the row with the highest (latest) surrogate key
use the natural keys from the most complete mapping row
use the most recent occurrence of every natural key
... ?
Writing queries that rebuild your key mapping is trivial, once you have established the rules. I am not sure how this could be a math problem?
If you are looking for an implementation, Eppstein's PADS library has a matching algorithm; it should be fast enough for your purposes. The general matching algorithm is in CardinalityMatching.py, and the comments in the implementation explain what is going on. The library is easy to use: to supply a graph in Python, represent it as a dictionary G such that G[v] gives a list (or set) of the neighbors of vertex v.
Example:
G = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
gives a line graph with 4 vertices.
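Hypothetical usage, assuming CardinalityMatching.py is importable and exposes a matching(G) function that returns a dict mapping each matched vertex to its partner (check the module itself for the exact API):

# All of this assumes PADS' CardinalityMatching module is on the path.
from CardinalityMatching import matching

G = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # the path graph above
M = matching(G)
print(M)   # e.g. {1: 2, 2: 1, 3: 4, 4: 3} -- a maximum matching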
