Algorithm for random sampling under multiple no-repeat conditions

I ran into the following issue:
I have an array of 100-1000 objects (the size varies), e.g. something like
[{one:1,two:'A',three: 'a'}, {one:1,two:'A',three: 'b'}, {one:1,two:'A',three: 'c'}, {one:1,two:'A',three: 'd'},
{one:1,two:'B',three: 'a'},{one:2,two:'B',three: 'b'},{one:1,two:'B',three: 'c'}, {one:1,two:'B',three: 'd'},
{one:1,two:'C',three: 'a'},{one:1,two:'C',three: 'b'},{one:1,two:'C',three: 'c'}, {one:2,two:'C',three: 'd'},
{one:1,two:'D',three: 'a'},{one:1,two:'D',three: 'b'},{one:2,two:'D',three: 'c'}, {one:1,two:'D',three: 'd'},...]
The value for 'one' is pretty much arbitrary. 'two' and 'three' have to be balanced in a certain way: in the above there is some n (here n=4) such that each of 'A', 'B', 'C', 'D' and each of 'a', 'b', 'c', 'd' occurs exactly n times, and such an n exists in every variant of this problem. It is just not known in advance what n is, and the combinations themselves can also vary (e.g. if we only had As and Bs, [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] as well as [{1,A,a},{1,A,b},{1,B,a},{1,B,b}] would both be possible arrays with n=2).
What I am trying to do now is randomise the original array under the condition that there are no close-order repeats for certain keys: the values of 'two' and 'three' for the object at index i-1 must each differ from the value of the same attribute for the object at index i (and that should hold for all objects, or for as many as possible). For example, [{1,B,a},{1,A,a},{1,C,b}] would not be allowed (the 'a' repeats), while [{1,B,a},{1,C,b},{1,A,a}] would be allowed.
I tried a brute-force method (randomise everything, then push offending indexes to the back) that works occasionally, but mostly it just loops endlessly over the whole array because it never becomes repeat-free. I am not sure whether that is because it is mathematically impossible for some input arrays, or simply because my solution is bad.
By now I have been looking for over a week, and I am not even sure how to approach this.
It would be great if someone knew a solution to this problem, or at least a reason why it is impossible. Any help is greatly appreciated!

First, let us dissect the problem.
Forget about one for now, and separate two and three into two independent sequences (assuming they are indeed independent, and not tied to each other).
The underlying problem is then as follows.
Given is a collection of c1 As, c2 Bs, c3 Cs, and so on. Place them randomly in such a way that no two consecutive letters are the same.
The trivial approach is as follows.
Suppose we already placed some letters, and are left with d1 As, d2 Bs, d3 Cs, and so on.
What is the condition when it is impossible to place the remaining letters?
It is when the count for one of the letters, say dk, is greater than one plus the sum of all other counts, 1 + d1 + d2 + ... excluding dk.
Otherwise, we can place them as K . K . K . K ..., where K is the k-th letter, and dots correspond to any letter except the k-th.
We can proceed at least as long as dk is still the greatest of the remaining quantities of letters.
So, on each step, if there is a dk equal to 1 + d1 + d2 + ... excluding dk, we should place the k-th letter right now.
Otherwise, we can place any other letter and still be able to place all others.
If there is no immediate danger of not being able to continue, adjust the probabilities to your liking, for example, weigh placing k-th letter as dk (instead of uniform probabilities for all remaining letters).
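To make this concrete, here is a minimal sketch in JavaScript, assuming the remaining counts are kept in a plain object (the function name and representation are mine, not from the question). It solves the single-sequence subproblem described above; coupling 'two' and 'three' jointly is the harder part the other answers discuss.

// Random arrangement with no two equal adjacent letters.
// counts maps each letter to how many copies remain, e.g. {A: 4, B: 4, C: 4}.
function noRepeatShuffle(counts) {
  const result = [];
  let total = Object.values(counts).reduce((a, b) => a + b, 0);
  let prev = null;
  while (total > 0) {
    // A letter k is "forced" when dk >= 1 + (sum of all the other counts),
    // i.e. counts[k] > total - counts[k]: it must be placed right now.
    const forced = Object.keys(counts).find(
      k => k !== prev && counts[k] > total - counts[k]
    );
    let pick = forced;
    if (!pick) {
      // Weighted random pick among allowed letters, weight = remaining count.
      const allowed = Object.keys(counts).filter(
        k => k !== prev && counts[k] > 0
      );
      let r = Math.random() * allowed.reduce((a, k) => a + counts[k], 0);
      pick = allowed.find(k => (r -= counts[k]) < 0);
    }
    if (!pick) return null; // no legal letter left: the input was infeasible
    result.push(pick);
    counts[pick] -= 1;
    total -= 1;
    prev = pick;
  }
  return result;
}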

This problem smells of NP-completeness and of lots of hard combinatorial optimization problems.
Just to find a solution, I'd always place next the remaining element beside which the fewest of the other remaining elements can still be placed. In other words, try to get the hardest elements out of the way first: if they run into a problem, you're stuck; if that works, you're golden. (There are data structures, such as a heap, that can find those fairly efficiently.)
Now, armed with a "good enough" solver, I'd suggest picking the first element randomly until the solver can solve the rest, then repeating. If at any point you find it takes too many guesses, just go with what the solver did last time. That way you know at every step that there IS a solution, even though you are trying to proceed randomly; a sketch follows.
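A hedged sketch of that idea, with backtracking added as a safety net, and assuming (as in the question) that two objects may be adjacent exactly when they differ in both 'two' and 'three':

// Two objects may sit next to each other iff they differ in both keys.
function compatible(x, y) {
  return x.two !== y.two && x.three !== y.three;
}

// Greedy "hardest first": among the items that may follow prev, try first
// the one with the fewest compatible partners left among the rest.
function solve(items, prev = null) {
  if (items.length === 0) return [];
  const deg = new Map(items.map(it =>
    [it, items.filter(o => o !== it && compatible(it, o)).length]));
  const candidates = items
    .filter(it => prev === null || compatible(prev, it))
    .sort((a, b) => deg.get(a) - deg.get(b));
  for (const c of candidates) {
    const rest = solve(items.filter(it => it !== c), c);
    if (rest) return [c, ...rest];
  }
  return null; // stuck: no valid ordering extends this prefix
}

To inject randomness as suggested, pick the first element at random, check that solve can finish the rest, and repeat.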

Graph
As I understand it, one plays no role in the constraints, so I'll label {one:1,two:'A',three: 'a'} as Aa. Think of the objects as vertices and place them in a graph, adding an edge whenever the two respective vertices may sit beside each other. For [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] the graph is the 4-cycle Aa-Bb-Aa-Bb,
and for [{1,A,a},{1,A,b},{1,B,a},{1,B,b}] it is the two disjoint edges Aa-Bb and Ab-Ba.
The problem becomes: select a random Hamiltonian path, if possible. For the cycle, any path along the circuit [Aa, Bb, Aa, Bb] or its reverse will do. For the disjoint edges, it is not possible.
Possible algorithm
I think that, to be uniformly random, we would have to enumerate all the possibilities and choose one at random. This is probably infeasible, even at 100 vertices.
A naïve algorithm that relaxes the uniformity criterion would be: select a random vertex (a) whose removal does not split the graph in two, then select a random neighbour (b) of (a) whose removal does not split the graph either. Move (a) to the solution, set (a) = (b), and keep going until the end, backtracking when there are no moves left (if possible). There may be further heuristics that could cut down the branching factor; a sketch follows the example below.
Example
There are no vertices whose removal would disconnect the graph, so we choose Ab uniformly at random.
The neighbours of Ab are {Ca, Bc, Ba, Cc} of which Ca is chosen randomly.
Picking Ab next would disconnect the remaining graph, so we must choose Bc.
The only choice left is which of Cc and Ba comes first. We might end up with: [Ab, Ca, Bc, Ab, Ba, Cc].
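A minimal sketch of this walk, assuming the graph is given as an adjacency list (adj[i] lists the neighbours of vertex i); for brevity, the does-it-split-the-graph test is replaced by plain backtracking, which explores the same search space:

// Fisher-Yates shuffle of a copy of arr.
function shuffle(arr) {
  const a = arr.slice();
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Random Hamiltonian path by randomised backtracking over vertices 0..n-1.
function randomHamiltonianPath(adj) {
  const n = adj.length;
  function extend(path, used) {
    if (path.length === n) return path.slice();
    for (const next of shuffle(adj[path[path.length - 1]])) {
      if (used.has(next)) continue;
      used.add(next);
      path.push(next);
      const done = extend(path, used);
      if (done) return done;
      path.pop(); // dead end: backtrack
      used.delete(next);
    }
    return null;
  }
  for (const start of shuffle([...adj.keys()])) {
    const found = extend([start], new Set([start]));
    if (found) return found;
  }
  return null; // no Hamiltonian path exists (the disjoint-edges case)
}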

Related

Conditional Randomization

Imagine there is a list of elements as follow:
1a, 2a, 3a, 4a, 5b, 6b, 7b, 8b
Now we need to randomize it such that no more than two "a"s or two "b"s end up next to each other. For instance, the following list is not allowed, because of its second, third and fourth elements:
3a, 7b, 8b, 5b, 2a, 1a, 6b, 4a
How can we write efficient code for this without generating many random sequences and doing many triad comparisons?
Create two bins, one for the a's and one for the b's. Pick from a random bin and record which bin it was. Pick a second element from a random bin; if the bin is not the same as before, just record the bin. If it is the same as before, force the next pick to come from the other bin. Carry on like this, forcing a bin only when you have two picks in succession from the same bin.
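A sketch of this in JavaScript (names are mine). Note that blindly forcing bins can still strand elements when one bin runs dry early; the counting approach in the next answer avoids both that and the sampling bias:

// Two-bin picking with a forced switch after two same-bin picks in a row.
function twoBinShuffle(as, bs) {
  const bins = { a: as.slice(), b: bs.slice() };
  const out = [];
  let prev = null;
  let run = 0; // length of the current same-bin streak
  while (bins.a.length + bins.b.length > 0) {
    let bin = run === 2 ? (prev === 'a' ? 'b' : 'a')        // forced switch
                        : (Math.random() < 0.5 ? 'a' : 'b');
    if (bins[bin].length === 0) bin = bin === 'a' ? 'b' : 'a'; // bin empty
    run = bin === prev ? run + 1 : 1;
    if (run > 2) return null; // stuck: forced into a third repeat
    const i = Math.floor(Math.random() * bins[bin].length);
    out.push(bins[bin].splice(i, 1)[0]);
    prev = bin;
  }
  return out;
}

// twoBinShuffle(['1a','2a','3a','4a'], ['5b','6b','7b','8b'])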
I'm going to assume that:
There are only two kinds of element, a and b, and
There aren't "too many" of either kind (say, less than 30) or that you're willing to use a bignum package.
The basic idea is to (conceptually) first construct a valid sequence of as and bs, and then randomly assign the actual elements to the as and bs in the sequence. In practice, you could do both of these steps in parallel; every time you add an a to the sequence, you select a random a element from the set of such elements not yet assigned, and similarly with b elements.
The (slightly) complicated part is constructing the valid sequence without bias, and that's what I'm going to focus on.
As is often the case, the key is to be able to count the number of possible sequences, in a way which leads to an enumeration. We don't actually enumerate the possibilities -- that would take a really long time for even moderately long sequences -- but we do need to know, for every prefix, how to enumerate the sequences starting with that prefix.
Rather than produce the sequence element by element, we'll produce it in chunks of one or two elements of the same kind. Since we don't allow more than two consecutive elements of the same kind, the final sequence must be a series of alternating chunks.
In effect, at every point except the very beginning, the choice is whether to select one or two of the "other" kind. At the beginning, we must select one or two of either kind, so we must first choose the starting kind, after which all the kinds are fixed; we merely need a sequence of 1's and 2's -- representing one element or two elements of the same kind -- with the kind alternating at each step. The sequence of 1s and 2s is constrained by the fact that we know how many elements there are of each kind, which corresponds to the sums of the numbers in the even and the odd positions of the {1,2}-sequence.
Now, let's define f(m,n) as the count of sequences whose even and odd sums are m and n. (Using CS rather than maths rules, we'll assume that the first position is 0 (even) but it actually makes absolutely no difference.) Suppose that we have 6 as and 4 bs. There are then f(6,4) sequences which start with an a, and f(4,6) sequences which start with a b, so that the total count of valid sequences is f(6,4)+f(4,6).
Now, suppose we need to compute f(m,n). Assuming m is large enough, we have exactly two options: choose one of the m elements of the even kind or choose two of the m elements of the even kind. After that, we will swap even and odd because the next choice applies to the other kind.
That rather directly leads to the recursion
f(m, n) = f(n, m-1) + f(n, m-2)
which we might think of as a kind of two-dimensional Fibonacci recursion. (Recall that fib(m) = fib(m-1) + fib(m-2); the difference here is the second argument, and the fact that the argument order flip-flops at each recursion.)
As with Fibonacci numbers, computing the values naively without memoization leads to exponential blow-up of recursive calls, and a more efficient strategy is to compute the entire table starting from f(0,0) (which has the value 1, obviously); in essence, a dynamic programming approach. We could also just do the recursive computation with memoization, which is slightly less efficient but possibly easier to read.
For now, let's just assume that we've arranged for the computation of f(m,n) to be suitably fast, either because we've prebuilt the entire array of possibilities up to the largest values of m and n we will need, or because we're using a memoizing recursive solution so that we only need to do the slow computation once for any given m,n. Now let's construct the random sequence.
Suppose there are na a-elements and nb b-elements. Since we don't know whether the random sequence will start with an a or a b, we need to make that decision first. We know there are f(na,nb) valid sequences which start with an a and f(nb,na) valid sequences which start with a b, so we start by generating a random non-negative integer less than f(na,nb) + f(nb,na). If that number is less than f(na,nb), we'll start with a-elements; otherwise we'll start with b-elements.
Having made that decision, we'll proceed as follows. We know what the next element kind is and how many elements remain of each kind, so we only need to know whether to select one or two elements of the correct kind. To make that choice, we generate a non-negative random integer less than f(m, n); if it is less than f(n, m-1) then we select one element; otherwise we select two elements. Then we swap the element sets, fix the counts, and continue until m and n are both 0.
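Putting it together, a sketch in JavaScript: a memoized f plus the sequence builder (switch to BigInt if the counts grow enough that f exceeds 2^53):

const memo = new Map();
function f(m, n) {
  if (m < 0 || n < 0) return 0;
  if (m === 0) return n === 0 ? 1 : 0;
  const key = m + ',' + n;
  if (!memo.has(key)) memo.set(key, f(n, m - 1) + f(n, m - 2));
  return memo.get(key);
}

const randInt = k => Math.floor(Math.random() * k); // uniform in [0, k)

// Returns a random kind-sequence such as ['a','a','b','a','b','b'],
// uniform over all valid sequences with na a's and nb b's.
function randomValidSequence(na, nb) {
  const total = f(na, nb) + f(nb, na);
  if (total === 0) return null; // no valid arrangement exists
  // Choose the starting kind, weighted by how many sequences start each way.
  let kinds = randInt(total) < f(na, nb) ? ['a', 'b'] : ['b', 'a'];
  let [m, n] = kinds[0] === 'a' ? [na, nb] : [nb, na];
  const out = [];
  while (m > 0 || n > 0) {
    // Take one element of the current kind with probability f(n, m-1)/f(m, n),
    // otherwise take two; this is exactly the enumeration-weighted choice.
    const take = randInt(f(m, n)) < f(n, m - 1) ? 1 : 2;
    for (let i = 0; i < take; i++) out.push(kinds[0]);
    [m, n] = [n, m - take];
    kinds = [kinds[1], kinds[0]];
  }
  return out;
}

Assigning the actual elements to the emitted kinds is then the easy part described at the top of this answer.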

anagram string edit distance algorithm/code?

There are two anagram strings S and P. There are two basic operations:
Swap two letters that are in neighborhood, e.g, swap "A" and "C" in BCCAB, cost is 1.
Swap the first letter and the last letter in the string, cost is 1.
Question: design an efficient algorithm that minimizes the cost of changing S into P.
I tried a greedy algorithm, but I found counterexamples, so I think it is incorrect. I know the famous DP problem edit distance, but I could not work out the formula for this one.
Can anyone help? An idea and pseudocode would be great.
I wonder if http://en.wikipedia.org/wiki/A*_search_algorithm would count as efficient? For a heuristic, look for the smallest distance each character has to go, treating the string as a circle, and divide the sum of these distances by two. On the circle, each character needs to participate in enough swaps to move it, one step at a time, to its destination, and each swap affects only two characters, so this heuristic should be a lower bound to the number of swaps required.
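For illustration, a sketch of that heuristic; matching equal letters greedily from left to right is my simplification, since a tight bound would minimise over all matchings of identical letters:

// Lower bound: each adjacent swap moves two characters one step each around
// the circle, so at least (sum of circular distances) / 2 swaps are needed.
function heuristic(s, p) {
  const n = s.length;
  const pos = {}; // positions of each letter in p, consumed left to right
  [...p].forEach((ch, i) => (pos[ch] = pos[ch] || []).push(i));
  let sum = 0;
  [...s].forEach((ch, i) => {
    const j = pos[ch].shift(); // match to the first unused occurrence
    const d = Math.abs(i - j);
    sum += Math.min(d, n - d); // distance around the circle
  });
  return Math.ceil(sum / 2);
}

// e.g. heuristic('BCCAB', 'BCACB') === 1: one adjacent swap suffices.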
Without the ends-swap the answer is simple: you have to get the first and last letter right, and there's no way to "save" by doing it later; hence for the word a[i], 0 <= i < n, you'd "bubble" the correct a[0] and a[n-1] into place, then repeat for the subword a[i], 1 <= i < n-1, until you're left with 0 or 1 letters.
With the ends-swap option you're left with a much harder problem, since there are two directions in which each letter can arrive at its correct place. You'd basically have a bipartite graph between the source and target word, and you'd want to find a matching that minimizes the sum of distances. Even that is not really an algorithm, since each swap moves two letters, not just one.
Bottom line is, you may have to do a search, but at least you can bound the search with the no-ends-swap distance.

Generating a set of permutations given a set of numbers and some conditions on the relative positions of the elements

I am looking for an algorithm which, given a set of numbers {0, 1, 2, 3, 4, 5, ...} and a set of conditions on the relative positions of the elements, would check whether a valid permutation exists. The conditions are always of the type "the element in position i in the original array must be next to (adjacent to) the element in position j or the element in position z".
The last and first element in a permutation are considered adjacent.
Here's a simple example:
Let the numbers be {0, 1, 2, 3}
and a set of conditions: a0 must be next to a1, a0 must be next to a2, a3 must be next to a1
A valid solution to this example would be {0,1,3,2}.
Notice that any rotation/symmetry of this solution is also a valid solution. I just need to prove that such a solution exists.
Another example using the same set:
a0 must be next to a1, a0 must be next to a3, a0 must be next to a2.
There is no valid solution for this example, since a number can only be adjacent to 2 numbers.
The only idea I can come up with right now would be to use some kind of backtracking.
If a solution exists, this should converge quite fast. If no solution exists, I can't imagine any way to avoid checking all possible permutations.
As I already stated, a rotation or symmetry doesn't affect the result for a given permutation; therefore it should be possible to reduce the number of possibilities.
Formulate this as a graph problem. Connect every pair of numbers that need to be next to each other. You are going to end up with a bunch of connected components. Each component has a number of permutations (let's call them mini-permutations), and you can have a permutation of the components.
When you create the graph, make sure each component follows a few rules: no cycles, no vertices with more than two neighbours, etc.
Basically you want to know whether you can create chains of numbers. Put each number into a chain which keeps track of the number and up to two neighbours. Use the rules to join chains together. When you join two chains you'll end up with a chain that again has two loose ends (neighbours). If you can get through all the rules without running out of loose ends, then it works; a sketch follows.
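A sketch of that feasibility check using union-find (names are mine, and each condition is assumed to have been resolved to a single required pair; the "j or z" alternatives from the question would sit on top as backtracking choices):

// Necessary conditions: no element needs more than two neighbours, and no
// cycle closes early. A cycle over all n elements is fine, because the
// arrangement itself is circular (first and last count as adjacent).
function chainsFeasible(n, pairs) {
  const degree = new Array(n).fill(0);
  const parent = [...Array(n).keys()];
  const size = new Array(n).fill(1);
  const find = x => (parent[x] === x ? x : (parent[x] = find(parent[x])));
  let closed = false; // set once a full circular chain has formed
  for (const [i, j] of pairs) {
    if (closed) return false; // nothing may be added after the full cycle
    if (++degree[i] > 2 || ++degree[j] > 2) return false; // > 2 neighbours
    const ri = find(i), rj = find(j);
    if (ri === rj) {
      if (size[ri] !== n) return false; // premature cycle
      closed = true;
    } else {
      parent[ri] = rj;
      size[rj] += size[ri];
    }
  }
  return true; // all chains can still be joined into a circle
}

// chainsFeasible(4, [[0, 1], [0, 2], [3, 1]]) === true  (first example)
// chainsFeasible(4, [[0, 1], [0, 3], [0, 2]]) === false (second example)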
I've implemented the graph solution with a slight modification:
if a node has too many neighbours, the algorithm drops one edge and checks the graph again.
I then use backtracking to roll back and check whether it is possible to drop the next edge instead...
This method gives the same results as the brute-force method I wrote.
In terms of complexity, this solution seems better than brute force, although I can't run it on more than 20 numbers (only 8 for the brute force). In a sense this is logical, since such a graph can generate a whole subset of valid permutations at once; plus, in the worst case, it is equivalent to finding certain compositions over the set of edges. It is backtracking, after all.
Given that rotation doesn't have any effect on the validity of permutations, I was thinking about fixing a0 in the first position (this can be achieved by simply rotating a valid permutation until a0 is in the first position) and then trying to build the solution from there.
Using DP, I might get something better than exponential complexity. But I must say that I'm still not sure where to begin :)

How to find the best possible answer to a really large seeming problem?

First off, this is NOT a homework problem. I haven't had to do homework since 1988!
I have a list of words of length N
I have a max of 13 characters to choose from.
There can be multiples of the same letter
Given the list of words, which 13 characters would spell the most possible words? I can throw out words that make the problem harder to solve; for example, speedometer has 4 e's in it, something MOST words don't have, so I could toss that word for its poor-fit characteristic, or it might just fall away on its own based on the algorithm.
I've looked at letter distributions, and I've built a graph of the words (letter by letter). There is something I'm missing, or this problem is a lot harder than I thought. I'd rather not totally brute-force it if that is possible, but I'm down to about that point right now.
Genetic algorithms come to mind, but I've never tried them before....
Seems like I need a way to score each letter based upon its association with other letters in the words it is in....
It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possibly with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-complete optimization problem in general (assuming an alphabet, i.e. a set of letters, that may contain more than 26 items) by reduction of SETCOVER to it, but I'm leaving the actual reduction as an exercise to the reader :)
Assuming it's hard, you have the usual routes:
branch and bound
stochastic search
approximation algorithms
Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of
Letters you've already used (with multiplicity)
Number of characters you still get to use
Letters still available
Words still in your list
Number of words still in your list (count of the previous set)
Number of words that are not possible in this state
Number of words that are already covered by your choice of letters
You'd start with
Empty set
13
{A, B, ..., Z}
Your whole list
N
0
0
Put that data structure into a queue.
At each step
Pop an item from the queue
Split into possible next states (branch)
Bound & delete extraneous possibilities
From a state, I'd generate possible next states as follows:
For each letter L in the set of letters left
Generate a new state where:
you've added L to the list of chosen letters
the least letter is L
so you remove anything less than L from the allowed letters
So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
Do all the various accounting to figure out the new states.
Each state covers at minimum "Number of words that are already covered by your choice of letters" words, and at maximum that number plus "Number of words still in your list." Of all the states, find the highest minimum, and delete any states whose maximum is lower than that highest minimum.
No special handling for speedometer required.
I can't imagine this would be fast, but it'd work.
There are probably some optimizations (e.g., store each word in your list as an array of per-letter counts A-Z, and combine words with the same structure: the structure "one A, one B, one T" occurs twice, as BAT and TAB). How you sort and keep track of minimum and maximum can probably also help things somewhat. Probably not enough to make an asymptotic difference, but maybe enough, for a problem this big, to make it run in a reasonable time instead of an extreme time.
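A sketch of just the branching step, following the "remove anything less than L" rule above; the scoring and bounding fields are elided since they depend on how the word list is stored:

// Expand a state by choosing the next letter from the still-allowed set,
// keeping chosen letters in nondecreasing order so each multiset of
// letters is generated exactly once.
function branch(state) {
  // state = { chosen: [...letters], left: howManyMore, allowed: [...letters] }
  const children = [];
  for (let i = 0; i < state.allowed.length; i++) {
    const L = state.allowed[i];
    children.push({
      chosen: [...state.chosen, L],
      left: state.left - 1,
      allowed: state.allowed.slice(i), // keep L (repeats allowed), drop less
    });
  }
  return children;
}

// branch({chosen: [], left: 13, allowed: ['W','X','Y','Z']}) yields the four
// child states described in the {W, X, Y, Z} example above.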
Total brute forcing should work, although the implementation would become quite confusing.
Instead of throwing words like speedometer out, couldn't you generate the association graphs considering only whether a character appears in a word or not (irrespective of the number of times it appears, as that should not have any bearing on the final best choice of 13 characters)? This would also make it fractionally simpler than total brute force.
Comments welcome. :)
Removing the bounds on each parameter, including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632. It's fixed-parameter tractable in the alphabet size by brute force.
I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is only about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
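If you want the literal brute force as a baseline (impractical at the full ~5 billion multisets without the pruning just mentioned, but fine for small alphabets), a sketch assuming uppercase A-Z words:

// Count how many words can be spelled from a multiset of letters, given as
// an array of 26 per-letter counts.
function countWords(words, counts) {
  return words.filter(w => {
    const c = counts.slice();
    return [...w].every(ch => c[ch.charCodeAt(0) - 65]-- > 0);
  }).length;
}

// Enumerate all multisets of k letters in nondecreasing order and keep the
// best-scoring one.
function bestLetters(words, k = 13) {
  let best = { score: -1, counts: null };
  const counts = new Array(26).fill(0);
  (function pick(remaining, minLetter) {
    if (remaining === 0) {
      const score = countWords(words, counts);
      if (score > best.score) best = { score, counts: counts.slice() };
      return;
    }
    for (let ch = minLetter; ch < 26; ch++) {
      counts[ch]++;
      pick(remaining - 1, ch); // nondecreasing: no multiset repeats
      counts[ch]--;
    }
  })(k, 0);
  return best;
}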
I did not completely understand this: "I have a max of 13 characters to choose from." If you have a list of 1000 words, do you mean you have to reduce that to just 13 chars?!
Some thoughts based on my (mis)understanding:
If you are only handling English-language words, then you can skip vowels, because consonants are just as descriptive. Our brains can sort of fill in the vowels, a.k.a. SMS/Twitter language :)
Perhaps for 1-3 letter words, stripping off vowels would lose too much info. But still:
spdmtr hs 4 's n t, smthng
MST wrds dn't hv, s cld
tss tht wrd d t pr ft
chrctrstc, r t mght jst g
wy bsd n th lgrthm
Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....

Ordering a dictionary to maximize common letters between adjacent words

This is intended to be a more concrete, easily expressable form of my earlier question.
Take a list of words from a dictionary with common letter length.
How to reorder this list to keep as many letters as possible common between adjacent words?
Example 1:
AGNI, CIVA, DEVA, DEWA, KAMA, RAMA, SIVA, VAYU
reorders to:
AGNI, CIVA, SIVA, DEVA, DEWA, KAMA, RAMA, VAYU
Example 2:
DEVI, KALI, SHRI, VACH
reorders to:
DEVI, SHRI, KALI, VACH
The simplest algorithm seems to be: pick anything, then search for the word at the shortest distance?
However, DEVI->KALI (1 common) is equivalent to DEVI->SHRI (1 common),
and choosing the first match would result in fewer common pairs over the entire list (4 versus 5).
This seems like it should be simpler than full TSP?
What you're trying to do is calculate the shortest Hamiltonian path in a complete weighted graph, where each word is a vertex and the weight of each edge is the number of letters that are different between those two words.
For your example, the graph would have edges weighted as so:
      DEVI  KALI  SHRI  VACH
DEVI   X     3     3     4
KALI   3     X     3     3
SHRI   3     3     X     4
VACH   4     3     4     X
Then it's just a simple matter of picking your favorite TSP solving algorithm, and you're good to go.
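For completeness, a sketch that builds this weight matrix, assuming (per the question) that all words have the same length:

// weight[i][j] = number of positions where words i and j differ;
// Infinity on the diagonal stands for the X entries in the table above.
function weightMatrix(words) {
  return words.map((a, i) =>
    words.map((b, j) =>
      i === j ? Infinity : [...a].filter((ch, k) => ch !== b[k]).length
    )
  );
}

// weightMatrix(['DEVI', 'KALI', 'SHRI', 'VACH']) reproduces the table above.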
My pseudo code:
Create a graph of nodes where each node represents a word
Create connections between all the nodes (every node connects to every other node). Each connection has a "value" which is the number of common characters.
Drop connections where the "value" is 0.
Walk the graph, preferring connections with the highest values. If you have two connections with the same value, try both recursively.
Store the output of a walk in a list, along with the sum of the distances between the words in this particular result. I'm not 100% sure at the moment whether you can simply sum the connections you used; see for yourself.
From all outputs, choose the one with the highest value.
This problem is probably NP-complete, which means that the runtime of the algorithm will become unbearable as the dictionaries grow. Right now, I see only one way to optimize it: cut the graph into several smaller graphs, run the code on each, and then join the lists. The result won't be as perfect as when you try every permutation, but the runtime will be much better, and the final result might be "good enough".
[EDIT] Since this algorithm doesn't try every possible combination, it's quite possible to miss the perfect result. It's even possible to get caught in a local maximum. Say you have a pair with a value of 7, but if you choose this pair, all other values drop to 1; if you don't take this pair, most other values stay at 2, giving a much better overall final result.
This algorithm trades perfection for speed. When trying every possible combination would take years, even with the fastest computer in the world, you must find some way to bound the runtime.
If the dictionaries are small, you can simply create every permutation and then select the best result. If they grow beyond a certain bound, you're doomed.
Another solution is to mix the two: use the greedy algorithm to find "islands" that are probably pretty good, and then use the "complete search" to sort the small islands.
This can be done with a recursive approach. Pseudo-code:
Start with one of the words, call it w
FindNext(w, l)                        // l = list of remaining words, without w
  Get the sublist m of words in l that have the most letters in common with w
  If there is only one word in m
    Return that word
  Else
    For every word w' in m, do FindNext(w', l')      // l' = l without w'
You can add some score to count common pairs and to prefer "better" lists.
You may want to take a look at BK-Trees, which make it efficient to find words within a given distance of each other. Not a total solution, but possibly a component of one.
This problem has a name: n-ary Gray code. Since you're using English letters, n = 26. The Wikipedia article on Gray code describes the problem and includes some sample code.
