In search of an algorithm for sorting collection in nodes to satisfy a layout constraint

Firstly, I apologize for the poor title; I cannot think of a good name for this algorithm.
I have an ordered list of stages. Each stage has a cast of characters, unordered. Characters can occur in multiple stages.
A crossing occurs when the casts of two consecutive stages cannot be concatenated into a single line-up, where overlap is allowed wherever it unifies the same character in both casts, without leaving some character duplicated in the concatenation. Or, informally, a crossing is when a character would need to be at two different spots at once in a line-up of the combined casts. In code:
uncrossed = [D, F], [N, V, S]
overlap = [D, F, V], [V, N, S]
crossed = [D, V, F], [N, V, S]
In the first example, V isn't with D and F, so there aren't any crossings. In the second example, V is with D and F and then with N and S, but this isn't a problem because the ordering permits (with overlap) a crossing-less concatenation. In the third example, though, the ordering forces a crossing.
For my purposes, crossings can occur on non-consecutive stages as if characters did not actually stray from their previous order in the cast when they are not "on-stage."
I would like to order each stage's cast such that there are as few crossings as possible, understanding that it is definitely possible to have situations where crossings are inevitable. An example series which requires crossings:
required = [A, B], [B, C], [A, C], [A, B]
This all sounds very abstract and silly, so I'll provide a concrete example of a human solving this problem for a purpose similar to mine: http://xkcd.com/657/ In this case, the constraint is deliberately ignored for aesthetic purposes, but it's still possible to get a visual idea of what I'm talking about.
I already have some crude ideas for how to solve this, but nothing affordable, and I'm wondering if this is isomorphic to some problem already covered in the literature. It sounds vaguely topological as well.
Since people asked, this algorithm appears to be key to automatically generating pretty timelines for storyboards of characters in stories, and that's what I'm intending to use it for.

This isn't an answer, but I think it may be a more precise, or even more correct, formulation of what you are looking for:
There is a set, call it C, of characters, and there is a finite ordered sequence S_1, ... S_n of scenes, where a scene is a set consisting of some of the characters. Characters may (and typically do) appear in multiple scenes.
I'd like to phrase your desired outcome in a slightly different way from how you phrased it, because I think it makes it clearer how one may search for a solution (or at least, it makes it totally clear how to brute-force a solution):
The output of our algorithm is a sequence of arrangements of the characters. An arrangement of the characters is just a permutation of the ordered tuple [c_1, ... c_m], where the c_i are the characters, and there are m of them in total, so C = {c_1, ..., c_m}. We want n arrangements in total, call them A_1, ..., A_n, one per scene.
Arrangement A_i corresponds to the top-to-bottom ordering of the characters in your storyboard during scene i, in the following sense: draw a vertical line through your storyboard passing through scene i. This line should hit the characters' life-lines in the order specified by A_i.
We require the following property of our arrangements: given scene S_i, the arrangement A_i needs to put the characters contained in S_i into a contiguous chunk. For example, suppose that S_i = {c_2, c_3, c_5}. Then A_i may yield [c_1, c_4, c_2, c_3, c_5], but may not yield [c_2, c_1, c_3, c_4, c_5]. This is because you don't want an errant character "cutting through" the scene in the storyboard.
We hope to minimize the number of "crossings." Here, crossings are easy to define: the number of crossings between A_i and A_(i+1) is exactly equal to the number of transpositions of adjacent characters required to go from permutation A_i to permutation A_(i+1).
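To make the two definitions above concrete, here is a minimal Python sketch. The function names `is_contiguous` and `crossings` are my own, not from any library. It relies on the standard fact that the minimum number of adjacent transpositions between two permutations equals the number of pairs whose relative order differs (the inversion count).

```python
from itertools import combinations

def is_contiguous(arrangement, scene):
    """True if the characters of `scene` occupy one unbroken block of `arrangement`."""
    positions = sorted(arrangement.index(c) for c in scene)
    return not positions or positions[-1] - positions[0] == len(positions) - 1

def crossings(before, after):
    """Number of adjacent transpositions needed to turn `before` into `after`.

    Both arguments must be orderings of the same set of characters.
    """
    rank = {c: i for i, c in enumerate(after)}
    # Each pair that appears in opposite order in the two arrangements costs one swap.
    return sum(1 for x, y in combinations(before, 2) if rank[x] > rank[y])
```

For instance, `crossings(["A", "B", "C"], ["C", "A", "B"])` counts the pairs (A, C) and (B, C), which are the two life-lines that C has to cross.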
I haven't given you an answer, but I think that given the above setup, a brute-force approach isn't too hard to code up, and will give you an answer overnight without a problem, if the storyboard isn't too big.
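One possible shape for such a brute-force search, sketched in Python under the formulation above: enumerate every arrangement that keeps a scene contiguous, then pick the cheapest chain of arrangements scene by scene. The name `min_total_crossings` is hypothetical, and the cost grows with the square of m! per scene, so this is only usable for a handful of characters.

```python
from itertools import combinations, permutations

def contiguous(arr, scene):
    pos = [i for i, c in enumerate(arr) if c in scene]
    return not pos or pos[-1] - pos[0] == len(pos) - 1

def inversions(a, b):
    # Adjacent transpositions needed to turn ordering a into ordering b.
    rank = {c: i for i, c in enumerate(b)}
    return sum(rank[x] > rank[y] for x, y in combinations(a, 2))

def min_total_crossings(characters, scenes):
    """Return (total crossings, list of arrangements), one arrangement per scene."""
    if not scenes:
        return 0, []
    best = None  # maps arrangement -> (cheapest cost so far, path reaching it)
    for scene in scenes:
        layer = [p for p in permutations(characters) if contiguous(p, scene)]
        if best is None:
            best = {p: (0, [p]) for p in layer}
            continue
        new_best = {}
        for p in layer:
            # Cheapest way to reach arrangement p from any arrangement of the previous scene.
            cost, path = min(
                ((c + inversions(q, p), path_q) for q, (c, path_q) in best.items()),
                key=lambda t: t[0],
            )
            new_best[p] = (cost, path + [p])
        best = new_best
    return min(best.values(), key=lambda t: t[0])
```

On the question's `required` series, `min_total_crossings(["A", "B", "C"], [{"A", "B"}, {"B", "C"}, {"A", "C"}, {"A", "B"}])` reports a cost of 1, which agrees with the claim that at least one crossing is unavoidable there.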
I think that if you posted this problem on MathOverflow, you could possibly get someone interested in it. Or maybe it has been solved, who knows?

Related

Algorithm for random sampling under multiple no-repeat conditions

I ran into the following issue:
So, I got an array of 100-1000 objects (size varies), e.g. something like
[{one:1,two:'A',three:'a'}, {one:1,two:'A',three:'b'}, {one:1,two:'A',three:'c'}, {one:1,two:'A',three:'d'},
{one:1,two:'B',three:'a'}, {one:2,two:'B',three:'b'}, {one:1,two:'B',three:'c'}, {one:1,two:'B',three:'d'},
{one:1,two:'C',three:'a'}, {one:1,two:'C',three:'b'}, {one:1,two:'C',three:'c'}, {one:2,two:'C',three:'d'},
{one:1,two:'C',three:'a'}, {one:1,two:'C',three:'b'}, {one:2,two:'C',three:'c'}, {one:1,two:'C',three:'d'},...]
The value for 'one' is pretty much arbitrary. 'two' and 'three' have to be balanced in a certain way: basically, in the above, there is some n (here n=4) such that 'A', 'B', 'C', 'D', 'a', 'b', 'c' and 'd' each occur n times, and such an n exists in any variant of this problem. It is just not clear what the n is, and the combinations themselves can also vary (e.g. if we only had As and Bs, [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] as well as [{1,A,a},{1,A,b},{1,B,a},{1,B,b}] would both be possible arrays with n=2).
What I am trying to do now, is randomise the original array with the condition that there cannot be repeats in close order for some keys, i.e. the value of 'two' and 'three' for an object at index i-1 cannot be the same as the value of same attribute for the object at index i (and that should be true for all or as many objects as possible), i.e. [{1,B,a},{1,A,a},{1,C,b}] would not be allowed, [{1,B,a},{1,C,b},{1,A,a}] would be allowed.
I tried some brute-force method (randomise all, then push wrong indexes to the back) that works rarely, but it mostly just loops infinitely over the whole array, because it never ends up without repeats. Not sure, if this is because it is generally mathematically impossible for some original arrays, or if it is just because my solution sucks.
By now, I've been looking for over a week, and I am not even sure how to approach this.
Would be great, if someone knew a solution for this problem, or at least a reason why it isn't possible. Any help is greatly appreciated!
First, let us dissect the problem.
Forget for now about one, separate two and three into two independent sequences (assuming they are indeed independent, and not tied to each other).
The underlying problem is then as follows.
Given is a collection of c1 As, c2 Bs, c3 Cs, and so on. Place them randomly in such a way that no two consecutive letters are the same.
The trivial approach is as follows.
Suppose we already placed some letters, and are left with d1 As, d2 Bs, d3 Cs, and so on.
What is the condition when it is impossible to place the remaining letters?
It is when the count for one of the letters, say dk, is greater than one plus the sum of all other counts, 1 + d1 + d2 + ... excluding dk.
Otherwise, we can place them as K . K . K . K ..., where K is the k-th letter, and dots correspond to any letter except the k-th.
We can proceed at least as long as dk is still the greatest of the remaining quantities of letters.
So, on each step, if there is a dk equal to 1 + d1 + d2 + ... excluding dk, we should place the k-th letter right now.
Otherwise, we can place any other letter and still be able to place all others.
If there is no immediate danger of not being able to continue, adjust the probabilities to your liking, for example, weigh placing k-th letter as dk (instead of uniform probabilities for all remaining letters).
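A minimal Python sketch of this idea for a single sequence of letters. The names `feasible` and `shuffle_no_repeats` are my own. One detail is made explicit here as an assumption: the feasibility test also looks at the letter placed last, because that letter cannot start the remainder and therefore gets one slot fewer. Candidates are weighted by their remaining count, as suggested above.

```python
import random
from collections import Counter

def feasible(counts, last):
    """True if the remaining letters can be laid out with no adjacent repeats,
    given that `last` (or None) was placed immediately before them."""
    total = sum(counts.values())
    for letter, d in counts.items():
        others = total - d
        # The previously placed letter cannot come first, so it has one slot fewer.
        limit = others if letter == last else others + 1
        if d > limit:
            return False
    return True

def shuffle_no_repeats(letters, rng=random):
    """Random ordering of `letters` without equal neighbours, or None if impossible."""
    counts = Counter(letters)
    last = None
    if not feasible(counts, last):
        return None
    result = []
    while len(result) < len(letters):
        candidates = []
        for letter in counts:
            if counts[letter] == 0 or letter == last:
                continue
            counts[letter] -= 1          # tentatively place the letter
            if feasible(counts, letter):
                candidates.append(letter)
            counts[letter] += 1          # undo the tentative placement
        weights = [counts[c] for c in candidates]
        last = rng.choices(candidates, weights=weights)[0]
        counts[last] -= 1
        result.append(last)
    return result
```

The result is random but not uniformly distributed over all valid orderings. With two keys that must both avoid repeats, this single-sequence sketch would have to be applied to combined values or extended.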
This problem smells of NP complete and lots of hard combinatorial optimization problems.
Just to find a solution, I'd always place as the next element the placeable remaining element that has the fewest remaining elements it can sit next to. In other words, try to get the hardest elements out of the way first: if they run into a problem, then you're stuck. If that works, then you're golden. (There are data structures like a heap which can be used to find those fairly efficiently.)
Now armed with a "good enough" solver, I'd suggest picking the first element randomly until the solver can solve the rest. Repeat. If at any time you find it takes too many guesses, just go with what the solver did last time. That way you always know that there IS a solution, even though you are trying to do things randomly at every step.
Graph
As I understand it, one does not play a role in constraints, so I'll label {one:1,two:'A',three: 'a'} with Aa. Thinking of objects as vertices, place them on a graph. Place edges whenever two respective vertices can be beside each other. For [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] it would be,
and for [{1,A,a},{1,A,b},{1,B,a},{1,B,b}],
The problem becomes: select a random Hamiltonian path, (if possible.) For the loop, it would be any path on the circuit [Aa, Bb, Aa, Bb] or the reverse. For the disconnected lines, it is not possible.
Possible algorithm
I think, to be uniformly random, we would have to enumerate all the possibilities and choose one at random. This is probably infeasible, even at 100 vertices.
A naive algorithm that relaxes the uniform criterion, I think, would be to select (a) random point that does not split the graph in two. Then select (b) random neighbour of (a) that does not split the graph in two. Remove (a) to the solution. (a) = (b). Keep going until the end or backtrack when there are no moves, (if possible.) There may be further heuristics that could cut down the branching factor.
Example
There are no vertices that would disconnect the graph, so choosing Ab uniformly at random.
The neighbours of Ab are {Ca, Bc, Ba, Cc} of which Ca is chosen randomly.
Ab splits the graph, so we must choose Bc.
The only choice left is which of Cc and Ba comes first. We might end up with: [Ab, Ca, Bc, Ab, Ba, Cc].
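The graph view can be sketched in Python as a randomized backtracking search for a Hamiltonian path. This is a simplified variant: it shuffles the candidate neighbours and backtracks on dead ends, but it does not implement the "does not split the graph" pruning described above. The names `compatible` and `random_hamiltonian_path` are hypothetical, the worst case is exponential, and the recursion depth equals the number of objects, so very large arrays may need an iterative version.

```python
import random

def compatible(u, v):
    """Two objects may be neighbours only if both 'two' and 'three' differ."""
    return u["two"] != v["two"] and u["three"] != v["three"]

def random_hamiltonian_path(objects, rng=random):
    """Return the objects in a random order without repeats, or None if none exists."""
    n = len(objects)
    neighbours = [
        [j for j in range(n) if j != i and compatible(objects[i], objects[j])]
        for i in range(n)
    ]
    used = [False] * n
    path = []

    def extend():
        if len(path) == n:
            return True
        pool = range(n) if not path else neighbours[path[-1]]
        candidates = [j for j in pool if not used[j]]
        rng.shuffle(candidates)  # random order gives a random (non-uniform) result
        for j in candidates:
            used[j] = True
            path.append(j)
            if extend():
                return True
            path.pop()           # dead end: undo and try the next candidate
            used[j] = False
        return False

    return [objects[i] for i in path] if extend() else None
```

For [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] this finds an alternating order, and for [{1,A,a},{1,A,b},{1,B,a},{1,B,b}] it returns None, matching the two graphs discussed above.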

Sorting sequences where the binary sorting function return is undefined for some pairs

I'm doing some comp. mathematics work where I'm trying to sort a sequence with a complex mathematical sorting predicate, which isn't always defined between two elements in the sequence. I'm trying to learn more about sorting algorithms that gracefully handle element-wise comparisons that cannot be made, as I've only managed a very rudimentary approach so far.
My apologies if this question is some classical problem and it takes me some time to define it, algorithmic design isn't my strong suit.
Defining the problem
Suppose I have a sequence A = {a, b, c, d, e}. Let's define f(x,y) to be a binary function which returns 0 if x < y and 1 if y <= x, by applying some complex sorting criteria.
Under normal conditions, this would provide enough detail for us to sort A. However, f can also return -1 if the sorting criteria are not well-defined for that particular pair of inputs. The undefined-ness of a pair of inputs is symmetric, i.e. f(q,r) is undefined if and only if f(r,q) is undefined.
I want to try to sort the sequence A, if possible, with the sorting criteria that are well defined.
For instance let's suppose that
f(a,d) = f(d,a) is undefined.
All other input pairs to f are well defined.
Then despite not knowing the inequality relation between a and d, we will be able to sort A based on the well-defined sorting criteria as long as a and d are not adjacent to one another in the resulting "sorted" sequence.
For instance, suppose we first determined the relative sorting of A - {d} to be {c, a, b, e}, as all of those pairs to f are well-defined. This could invoke any sorting algorithm, really.
Then we might call f(d,c), and
if d < c we are done - the sorted sequence is indeed {d, c, a, b, e}.
Else, we move to the next element in the sequence, and try to call f(a, d). This is undefined, so we cannot establish d's position from this angle.
We then call f(d, e), and move from right to left element-wise.
If we find some element x where d > x, we are done.
If we end up back at comparing f(a, d) once again, we have established that we cannot sort our sequence based on the well-defined sorting criterion we have.
The question
Is there a classification for these kinds of sorting algorithms, which handle undefined comparison pairs?
Better yet although not expected, is there a well-known "efficient" approach? I have defined my own extremely rudimentary brute-force algorithm which solves this problem, but I am certain it is not ideal.
It effectively just throws out all sequence elements which cannot be compared when encountered, and sorts the remaining subsequence if any elements remain, before exhaustively attempting to place all of the sequence elements which are not comparable to all other elements into the sorted subsequence.
Simply a path on which to do further research into this topic would be great - I lack experience with algorithms and consequently have struggled to find out where I should be looking for some more background on these sorts of problems.
This is very close to topological sorting, with your binary relation being edges. In particular, this is just extending a partial order into a total order. Naively if you consider all pairs using toposort (which is O(V+E)) you have a worst case O(n^2) algorithm (actually O(n+p) with n being the number of elements and p the number of comparable pairs).
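A minimal Python sketch of that approach, assuming `f` follows the question's convention (0 if x < y, 1 if y <= x, -1 if undefined). It compares every pair once, adds an edge for each defined comparison, and runs Kahn's algorithm. The name `sort_with_undefined` is my own. Note that the result is one valid linear extension, so incomparable elements such as a and d may still end up next to each other.

```python
from collections import deque

def sort_with_undefined(items, f):
    """Order `items` consistently with every defined comparison, or return None
    if the defined comparisons contradict each other (contain a cycle)."""
    n = len(items)
    successors = [[] for _ in range(n)]
    indegree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            r = f(items[i], items[j])
            if r == 0:        # items[i] < items[j]
                successors[i].append(j)
                indegree[j] += 1
            elif r == 1:      # items[j] <= items[i]
                successors[j].append(i)
                indegree[i] += 1
            # r == -1: comparison undefined, so no edge is added

    queue = deque(i for i in range(n) if indegree[i] == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(items[i])
        for j in successors[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)
    return order if len(order) == n else None
```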

Can this be solved with a line sweep algorithm?

Edit: Now I think this is a sweep line problem. (see update2 at the bottom)
In this problem we are given N objects and M constraints. (N can be 200k, M can be 100k). Each object is either black, or white. Each constraint is in the form (x, y) and means that in the range of objects x..y, there is exactly one white object; the rest are black. We would like to determine the maximum number of white objects that can exist, or if it isn't possible to satisfy the constraints.
I observe that if a constraint is fully contained in another, the inner constraint will dictate where a white object can be placed. Also, if there are several non-intersecting constraints contained within another, it should be impossible since it violates the fact that there can only be one white object per constraint. The algorithm should be fast enough to run under 2-3 seconds.
Update: One of the answers mentions the exact cover problem; is this a specialized instance that isn't NP-complete?
Update2: If we change each constraint into a begin and end event, and sort these events, could we just systematically sweep across these events and assign white objects?
Your problem can be expressed as an exact cover problem: the constraint intervals form the set to be covered, and each white object covers those constraint intervals which it falls inside of. Your problem, then, is to find a subset of the white objects which covers each constraint interval exactly once.
Exact cover problems in general are NP-complete, although that obviously doesn't necessarily mean that any specific subset of them is. However, there nonetheless exist algorithms, such as Knuth's Algorithm X (as implemented by dancing links) that can solve most such problems quite efficiently.
It's possible that the one-dimensional structure of your problem might also allow more straightforward specialized solution methods. However, Algorithm X is a very good general tool for attacking such problems. (For example, the fastest sudoku solvers typically use something like it.)
Yes, there's a (point)-sweep algorithm. This one is sort of inelegant, but I think it works.
First, sweep for nested intervals. Process begin and end events in sorted order (tiebreakers left to you) and keep a list of active intervals not known to contain another interval. To handle a begin event, append the corresponding interval. To handle an end event, check whether the corresponding interval I has been removed. If not, remove I and all of the remaining intervals J before I from the list. For each such J, append two intervals whose union is the set difference J \ I to a list of blacked out intervals.
Second, sweep to contract the blacked out intervals. In other words, delete the objects known to be black, renumber, and adjust the constraints accordingly. If an entire constraint is blacked out, then there is no solution.
Third, sweep to solve the problem on what are now non-nested intervals. The greedy solution is provably optimal.
Example: suppose I have half-open constraints [0, 4), [1, 3), [2, 5). The first sweep creates blackouts [0, 1) and [3, 4). The second sweep leaves constraints [a, c), [a, c), [b, d).* The greedy sweep places white objects at new locations a, c, d (old locations 1, 4, 5).
Illustration of the second sweep:
0 1 2 3 4 5   old coordinates
[       )
  [   )
    [     )
**    **      blackouts
  a b   c d   new coordinates
  [     )
  [     )
    [     )
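To check a sweep implementation on small inputs, a brute-force reference can help. The Python sketch below is not the sweep algorithm itself. It simply tries every colouring, assuming 0-based object indices and inclusive constraint ranges (x, y), and the name `max_white_bruteforce` is hypothetical. It is exponential in the number of objects, so it is only meant for tiny test cases.

```python
from itertools import product

def max_white_bruteforce(n, constraints):
    """Maximum number of white objects among n objects such that every inclusive
    range (x, y) in `constraints` contains exactly one white object.
    Returns None if no colouring satisfies all constraints."""
    best = None
    for colouring in product((0, 1), repeat=n):  # 1 = white, 0 = black
        if all(sum(colouring[x:y + 1]) == 1 for x, y in constraints):
            whites = sum(colouring)
            if best is None or whites > best:
                best = whites
    return best
```

The half-open example above, with objects 0 to 5, becomes `max_white_bruteforce(6, [(0, 3), (1, 2), (2, 4)])`, which gives 3, in agreement with the white objects at old locations 1, 4 and 5.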

Hierarchical undirected graph representation

I need to represent the graph like this:
Graph = graph([Object1,Object2,Object3,Object4],
[arc(Object1,Object2,connected),
arc(Object2,Object4,connected),
arc(Object3,Object4,connected),
arc(Object1,Object3,connected),
arc(Object2,Object3,parallel),
arc(Object1,Object4,parallel),
arc(Object2,Object3,similar_size),
arc(Object1,Object4,similar_size)])
I have no restriction for code, however I'd stick to this representation as it fits all the other structures I've already coded.
What I mean is the undirected graph in which vertices are some objects and edges representing undirected relations between them. To give you more background in this particular example I'm trying to represent a rectangle, so objects are its four edges(segments). Those segments are represented in the same way with use of vertices and so on. The point is to build the hierarchy of graphs which would represent constraints between objects on the same level.
The problem lies in the representation of edges. The most obvious way to represent an arc (a,b) would be to put both (a,b) and (b,a) in the program. This, however, floods my program with redundant data exponentially. For example, if I have vertices a,b,c,d, I can build segments (a,b),(a,c),(a,d),(b,c),(b,d),(c,d). But I also get (b,a),(c,a), and so on. At this point it's not a problem. But later I build a rectangle. It can be built of segments (a,b),(b,c),(c,d),(a,d). And I'd like to get the answer: there's one rectangle. You can calculate, however, how many combinations of this one rectangle I get. It also takes too much time to calculate, and obviously I don't want to finish at the rectangle level.
I thought about sorting the elements. I can sort vertices in a segment. But if I want to sort segments in a rectangle the constraints are no longer valid. The graph becomes directed. For example taking into consideration the first two relations let's say we have arcs (a,b) and (a,c). If arcs are not sorted the program answers as I want it to: arc(b,a,connected),arc(a,c,connected) with match: Object1=b,Object2=a,Object4=c. If I sort elements it's no longer valid as I cannot have arc(b,a,connected) and arc(a,b,connected) tried out. Only the second one. I'd stick with the sorting but I have no idea how to solve this last issue.
Hopefully I stated all of this quite clearly. I'd prefer to stay as close to the representation and ideas I already have. But completely new ones are also very welcome. I don't expect any exact answer, rather pointing me in the right direction or suggesting something specific to read, as I'm quite new to Prolog and maybe this problem is not as uncommon as I think.
I'm trying to solve this since yesterday and couldn't come up with any easy answer. I looked at some discrete math and common undirected graphs representation like adjacency list. Let me know if anything is unclear - I'll try to provide more details.
Interesting question although a bit broad since it is not stated what you actually want to do with the arcs, rectangles etc; a representation may be efficient (time/space/elegance) only with certain uses. In any case, here are some ideas:
Sorting
the obvious issue is the one you mentioned; you can solve it by introducing a clause that succeeds if the sorted pair exists:
arc(X,Y):-
arc_data(X,Y)
; arc_data(Y,X).
note that you should not do something like:
arc(a,b).
arc(b,c).
arc(X,Y):-
    arc(Y,X).
since this will result in an infinite loop if the arc does not exist.
you could however only check if the first arg is larger than the second:
arc(a,b).
arc(b,c).
arc(X,Y):-
    compare(>,X,Y),
    arc(Y,X).
This approach will not resolve the multiple solutions that may arise due to having an arc represented in two ways.
The easy fix would be to only check for one solution where only one solution is expected using once/1:
3 ?- arc(X,Y).
X = a,
Y = b ;
X = b,
Y = a.
4 ?- once(arc(X,Y)).
X = a,
Y = b.
Of course you cannot do this when there could be multiple solutions.
Another approach would be to enforce further abstraction: at the moment, when you have two points (a, b) you can create the arc (arc(a,b) or arc(b,a)) after checking if those points are connected. Instead of that, you should create the arc through a predicate (that could also check if the points are connected). The benefit is that you no longer get involved in the representation of the arc directly and can thus enforce sorting (yes, it's basically object orientation):
cv_arc(X,Y,Arc):-
( arc(X,Y),
Arc = arc(X,Y))
; ( arc(Y,X),
Arc = arc(Y,X)).
(assuming as a database arc(a,b)):
6 ?- cv_arc(a,b,A).
A = arc(a, b).
7 ?- cv_arc(b,a,A).
A = arc(a, b).
8 ?- cv_arc(b,c,A).
false.
Of course you would need to follow a similar principle for the rest of the objects; I assume that you are doing something like this to find a rectangle:
rectangle(A,B,C,D):-
arc(A,B),
arc(B,C),
arc(C,D),
arc(D,A).
besides the duplicates due to the arc (which are resolved) this would recognise ABCD, DABC etc as different rectangles:
28 ?- rectangle(A,B,C,D).
A = a,
B = b,
C = c,
D = d ;
A = b,
B = c,
C = d,
D = a ;
A = c,
B = d,
C = a,
D = b ;
A = d,
B = a,
C = b,
D = c.
We will do the same again:
rectangle(rectangle(A,B,C,D)):-
cv_arc(A,B,AB),
cv_arc(B,C,BC),
compare(<,AB,BC),
cv_arc(C,D,CD),
compare(<,BC,CD),
cv_arc(D,A,DA),
compare(<,CD,DA).
and running with arc(a,b). arc(b,c). arc(c,d). arc(a,d).:
27 ?- rectangle(R).
R = rectangle(a, b, c, d) ;
false.
Note that we did not re-order the rectangle if the arcs were in the wrong order; we simply failed it. This way we avoided duplicate solutions (if we ordered them and accepted it as a valid rectangle we would have the same rectangle four times) but the time spent to find the rectangle increases. We reduced the overhead by stopping the search at the first arc that is out of order instead of creating the whole rectangle. Also, the overhead would also be reduced if the arcs are ordered (since the first match would be ordered). On the other hand, if we consider the complexity of searching for all rectangles this way, the overhead is not that significant. Also, it only applies if we want just the first rectangle; should we want to get more solutions or ensure that there are no other solutions, prolog will search the whole tree, whether it reports the solutions or not.
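The underlying idea, giving every undirected object one canonical form so that each one is counted once, can also be illustrated outside Prolog. The Python sketch below is only an illustration of that principle, not a translation of the predicates above. The names `canonical_arc` and `find_rectangles` are my own. An arc is stored as a sorted pair and a rectangle as the set of its four arcs, so rotations and reflections collapse into a single value.

```python
from itertools import permutations

def canonical_arc(x, y):
    """Represent the undirected arc {x, y} as a sorted pair."""
    return (x, y) if x <= y else (y, x)

def find_rectangles(arcs):
    """Return each 4-cycle once, as a frozenset of its four canonical arcs."""
    arc_set = {canonical_arc(x, y) for x, y in arcs}
    vertices = sorted({v for arc in arc_set for v in arc})
    found = set()
    for a, b, c, d in permutations(vertices, 4):
        cycle = [canonical_arc(a, b), canonical_arc(b, c),
                 canonical_arc(c, d), canonical_arc(d, a)]
        if all(arc in arc_set for arc in cycle):
            # The same rectangle is reached from 8 vertex orders; the set keeps one copy.
            found.add(frozenset(cycle))
    return found
```

With the arcs (a,b), (b,c), (c,d), (a,d) this yields exactly one rectangle. The brute-force loop over vertex orders is only reasonable for small graphs.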

How to find the best possible answer to a really large seeming problem?

First off, this is NOT a homework problem. I haven't had to do homework since 1988!
I have a list of words of length N
I have a max of 13 characters to choose from.
There can be multiples of the same letter
Given the list of words, which 13 characters would spell the most possible words. I can throw out words that make the problem harder to solve, for example:
speedometer has 4 e's in it, something MOST words don't have,
so I could toss that word due to a poor fit characteristic, or it might just
go away based on the algorithm
I've looked at letter distributions, I've built a graph of the words (letter by letter). There is something I'm missing, or this problem is a lot harder than I thought. I'd rather not totally brute force it if that is possible, but I'm down to about that point right now.
Genetic algorithms come to mind, but I've never tried them before....
Seems like I need a way to score each letter based upon its association with other letters in the words it is in....
It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possibly with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-complete optimization problem in general (assuming the alphabet, i.e. the set of letters, may contain more than 26 items) by reduction of SETCOVER to it, but I'm leaving the actual reduction as an exercise to the reader :)
Assuming it's hard, you have the usual routes:
branch and bound
stochastic search
approximation algorithms
Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of
Letters you've already used (with multiplicity)
Number of characters you still get to use
Letters still available
Words still in your list
Number of words still in your list (count of the previous set)
Number of words that are not possible in this state
Number of words that are already covered by your choice of letters
You'd start with
Empty set
13
{A, B, ..., Z}
Your whole list
N
0
0
Put that data structure into a queue.
At each step
Pop an item from the queue
Split into possible next states (branch)
Bound & delete extraneous possibilities
From a state, I'd generate possible next states as follows:
For each letter L in the set of letters left
Generate a new state where:
you've added L to the list of chosen letters
the least letter is L
so you remove anything less than L from the allowed letters
So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
Do all the various accounting to figure out the new states.
Each state has at minimum "Number of words that are already covered by your choice of letters" words, and at maximum that number plus "Number of words still in your list." Of all the states, find the highest minimum, and delete any states whose maximum is lower than that.
No special handling for speedometer required.
I can't imagine this would be fast, but it'd work.
There are probably some optimizations (e.g., store each word in your list as an array of A-Z of number of occurrences, and combine words with the same structure: 2 occurrences of AB.....T => BAT and TAB). How you sort and keep track of minimum and maximum can also probably help things somewhat. Probably not enough to make an asymptotic difference, but for a problem this big it may be enough to make it run in a reasonable time instead of an extreme time.
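The word-signature optimization could look roughly like this in Python. The names `signature`, `group_by_signature` and `covered` are my own. Each word is reduced to its letter counts, words with identical counts are merged, and a candidate set of letters is scored by how many words it can spell.

```python
from collections import Counter

def signature(word):
    """Letter counts of a word as a hashable, order-independent key."""
    return tuple(sorted(Counter(word.lower()).items()))

def group_by_signature(words):
    """Map each signature to the number of words sharing it (e.g. BAT and TAB)."""
    return Counter(signature(w) for w in words)

def covered(letters, groups):
    """Number of words that can be spelled from the multiset `letters`."""
    have = Counter(letters)
    return sum(
        count
        for sig, count in groups.items()
        if all(have[ch] >= k for ch, k in sig)
    )
```

For example, with the words bat, tab, tabby and speedometer, the letters "abt" cover 2 words and "abbty" cover 3.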
Total brute forcing should work, although the implementation would become quite confusing.
Instead of throwing words like speedometer out, can't you generate the association graphs considering only if the character appears in the word or not (irrespective of the no. of times it appears as it should not have any bearing on the final best-choice of 13 characters). And this would also make it fractionally simpler than total brute force.
Comments welcome. :)
Removing the bounds on each parameter including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632 . It's fixed-parameter tractable in the alphabet size by brute force.
I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is only about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
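The size of that search space, and the cap on useful multiplicities, can be computed with a short Python sketch. The function names are my own, and `math.comb` needs Python 3.8 or later.

```python
import math
from collections import Counter

def multiset_count(slots, alphabet_size):
    """Number of multisets of `slots` letters drawn from `alphabet_size` letters."""
    return math.comb(slots + alphabet_size - 1, alphabet_size - 1)

def useful_multiplicity(words):
    """For each letter, the largest count any single word needs; more copies never help."""
    cap = Counter()
    for word in words:
        for ch, k in Counter(word.lower()).items():
            cap[ch] = max(cap[ch], k)
    return cap
```

`multiset_count(13, 26)` evaluates to 5414950296, the "about 5 billion" mentioned above. Restricting each letter to its cap from `useful_multiplicity` shrinks the space further.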
I did not understand this completely "I have a max of 13 characters to choose from.". If you have a list of 1000 words, then did you mean you have to reduce that to just 13 chars?!
Some thoughts based on my (mis)understanding:
If you are only handling English lang words, then you can skip vowels because consonants are just as descriptive. Our brains can sort of fill in the vowels - a.k.a SMS/Twitter language :)
Perhaps for 1-3 letter words, stripping off vowels would lose too much info. But still:
spdmtr hs 4 's n t, smthng
MST wrds dn't hv, s cld
tss tht wrd d t pr ft
chrctrstc, r t mght jst g
wy bsd n th lgrthm
Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....
