I ran into the following issue:
So, I have an array of 100-1000 objects (size varies), e.g. something like
[{one:1,two:'A',three: 'a'}, {one:1,two:'A',three: 'b'}, {one:1,two:'A',three: 'c'}, {one:1,two:'A',three: 'd'},
{one:1,two:'B',three: 'a'},{one:2,two:'B',three: 'b'},{one:1,two:'B',three: 'c'}, {one:1,two:'B',three: 'd'},
{one:1,two:'C',three: 'a'},{one:1,two:'C',three: 'b'},{one:1,two:'C',three: 'c'}, {one:2,two:'C',three: 'd'},
{one:1,two:'C',three: 'a'},{one:1,two:'C',three: 'b'},{one:2,two:'C',three: 'c'}, {one:1,two:'C',three: 'd'},...]
The value for 'one' is pretty much arbitrary. 'two' and 'three' have to be balanced in a certain way: basically, in the above there is some n such that each of 'A', 'B', 'C', 'D', 'a', 'b', 'c' and 'd' occurs exactly n times (here n = 4) - and such an n exists in every variant of this problem. It is just not clear what the n is, and the combinations themselves can also vary (e.g. if we only had As and Bs, [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] as well as [{1,A,a},{1,A,b},{1,B,a},{1,B,b}] would both be possible arrays with n=2).
What I am trying to do now is randomise the original array with the condition that, for some keys, there cannot be repeats in close order: the values of 'two' and 'three' for the object at index i-1 must not equal the values of the same attributes for the object at index i (and that should hold for all objects, or as many as possible). For example, [{1,B,a},{1,A,a},{1,C,b}] would not be allowed, but [{1,B,a},{1,C,b},{1,A,a}] would be.
I tried a brute-force method (randomise everything, then push wrong indexes to the back) that works rarely, but mostly it just loops infinitely over the whole array, because it never ends up without repeats. I am not sure whether that is because it is mathematically impossible for some input arrays, or just because my solution is bad.
By now I've been looking for over a week, and I am not even sure how to approach this.
It would be great if someone knew a solution to this problem, or at least a reason why it isn't possible. Any help is greatly appreciated!
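For reference, the adjacency condition can be written down as a small checker. This is just a restatement of the constraint above, not a solution (attribute names 'two' and 'three' as in the question):

```python
def is_valid(arr):
    """True if no object repeats the 'two' or 'three' value
    of its immediate predecessor."""
    return all(a['two'] != b['two'] and a['three'] != b['three']
               for a, b in zip(arr, arr[1:]))
```

On the examples from the question, the disallowed ordering fails the check and the allowed one passes.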
First, let us dissect the problem.
Forget about one for now, and separate two and three into two independent sequences (assuming they are indeed independent and not tied to each other).
The underlying problem is then as follows.
Given is a collection of c1 As, c2 Bs, c3 Cs, and so on. Place them randomly in such a way that no two consecutive letters are the same.
The trivial approach is as follows.
Suppose we already placed some letters, and are left with d1 As, d2 Bs, d3 Cs, and so on.
What is the condition when it is impossible to place the remaining letters?
It is when the count for one of the letters, say dk, is greater than one plus the sum of all other counts, 1 + d1 + d2 + ... excluding dk.
Otherwise, we can place them as K . K . K . K ..., where K is the k-th letter, and dots correspond to any letter except the k-th.
We can proceed at least as long as dk is still the greatest of the remaining quantities of letters.
So, on each step, if there is a dk equal to 1 + d1 + d2 + ... excluding dk, we should place the k-th letter right now.
Otherwise, we can place any other letter and still be able to place all others.
If there is no immediate danger of not being able to continue, adjust the probabilities to your liking, for example, weigh placing k-th letter as dk (instead of uniform probabilities for all remaining letters).
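A minimal sketch of this strategy in Python (my own illustration; the function name and details are not from the answer). It applies the forced-move rule, and otherwise picks randomly, weighted by the remaining counts:

```python
import random
from collections import Counter

def arrange(items):
    """Random arrangement with no two equal adjacent items,
    or None if the counts make that impossible."""
    remaining = Counter(items)
    out, prev = [], None
    total = sum(remaining.values())
    while total > 0:
        worst = max(remaining, key=remaining.get)
        if remaining[worst] > total - remaining[worst] + 1:
            return None  # one letter outnumbers 1 + all the others
        if remaining[worst] == total - remaining[worst] + 1:
            # forced move: this letter must be placed right now
            if worst == prev:
                return None
            pick = worst
        else:
            # any letter other than the previous one keeps us feasible
            choices = [c for c, d in remaining.items() if d > 0 and c != prev]
            pick = random.choices(choices,
                                  weights=[remaining[c] for c in choices])[0]
        out.append(pick)
        remaining[pick] -= 1
        total -= 1
        prev = pick
    return out
```

For example, `arrange(list("AAABBBCC"))` always succeeds, while `arrange(list("AAAB"))` correctly reports that no arrangement exists.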
This problem smells like NP-completeness and lots of other hard combinatorial optimization problems.
Just to find a solution, I'd always place next the remaining element that has the fewest remaining elements that can legally be placed beside it. In other words, try to get the hardest elements out of the way first: if they run into a problem, then you're stuck; if that works, then you're golden. (There are data structures, such as a heap, that can find those fairly efficiently.)
Now, armed with a "good enough" solver, I'd suggest picking the first element randomly until the solver can solve the rest; then repeat. If at any point you find it takes too many guesses, just go with what the solver did last time. That way you know all along that there IS a solution, even though you are trying to do things randomly at every step.
Graph
As I understand it, one plays no role in the constraints, so I'll label {one:1,two:'A',three: 'a'} as Aa. Think of the objects as vertices, and place them on a graph. Add an edge whenever the two respective vertices can be beside each other. For [{1,A,a},{1,A,a},{1,B,b},{1,B,b}] it would be,
and for [{1,A,a},{1,A,b},{1,B,a},{1,B,b}],
The problem becomes: select a random Hamiltonian path, if one exists. For the loop, it would be any path on the circuit [Aa, Bb, Aa, Bb] or its reverse. For the disconnected lines, it is not possible.
Possible algorithm
I think, to be uniformly random, we would have to enumerate all the possibilities and choose one at random. This is probably infeasible, even at 100 vertices.
A naïve algorithm that relaxes the uniformity criterion, I think, would be: select (a), a random vertex that does not split the graph in two. Then select (b), a random neighbour of (a) that does not split the graph in two. Move (a) to the solution, set (a) = (b), and keep going until the end, backtracking when there are no moves (if possible). There may be further heuristics that could cut down the branching factor.
Example
There are no vertices that would disconnect the graph, so we choose Ab uniformly at random.
The neighbours of Ab are {Ca, Bc, Ba, Cc} of which Ca is chosen randomly.
Ab splits the graph, so we must choose Bc.
The only choice left is which of Cc and Ba comes first. We might end up with: [Ab, Ca, Bc, Ab, Ba, Cc].
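A backtracking sketch of the idea (my own illustration; it is exponential in the worst case and checks only adjacency, not the graph-splitting heuristic above). Labels are two-character strings like 'Aa', and two objects may sit side by side only when both letters differ:

```python
import random

def random_hamiltonian_path(labels):
    """Random ordering of the labels such that neighbours differ in
    both letter positions; None if no such ordering exists."""
    n = len(labels)

    def compatible(a, b):
        # adjacency requires 'two' AND 'three' to differ
        return a[0] != b[0] and a[1] != b[1]

    def extend(path, used):
        if len(path) == n:
            return path
        last = labels[path[-1]]
        candidates = [i for i in range(n)
                      if i not in used and compatible(last, labels[i])]
        random.shuffle(candidates)  # randomise the walk
        for i in candidates:
            found = extend(path + [i], used | {i})
            if found:
                return found
        return None  # dead end: backtrack

    starts = list(range(n))
    random.shuffle(starts)
    for s in starts:
        found = extend([s], {s})
        if found:
            return [labels[i] for i in found]
    return None
```

On the two small examples above, `["Aa", "Aa", "Bb", "Bb"]` yields an alternating path, while `["Aa", "Ab", "Ba", "Bb"]` (the disconnected lines) correctly returns None.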
I've been challenging myself to look at algorithms and try to change them in order to make them as fast as I can. Recently I tried an algorithm which searches for the longest run of any letter in a string. The naive answer looks at all letters, and whenever the current run is longer than the longest found so far, the current run becomes the new longest. Example:
With C for the current run and M for the maximum run, the order of letters checked and the variable updates go like this:
AAAACCDDD -> A(C=1,M=1) -> A(C=2,M=2) -> A(C=3,M=3) -> A(C=4,M=4) -> C(C=1,M=4) -> C(C=2,M=4) -> D(C=1,M=4) -> D(C=2,M=4) -> D(C=3,M=4). Answer: 4.
It can be made faster by stopping when there is no way to beat M, given your position in the string and the string's length.
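The naive scan, as a baseline (a minimal sketch):

```python
def longest_run_naive(s):
    """Longest run of a repeated character, scanning every position."""
    best = cur = 0
    prev = None
    for ch in s:
        cur = cur + 1 if ch == prev else 1  # extend or restart the run
        best = max(best, cur)
        prev = ch
    return best
```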
I've tried, and came up with an algorithm which usually accesses fewer elements of the string. I think it will be easier to explain like this:
Instead of moving 1 by 1, you jump as far as would be necessary to have a new longest run if all letters across the jump were the same. So, for example, after you read AAAB, you would jump 3 spots, because you assume all 3 next letters are B (AAABBBB). Of course they might not be, and that is why you now go backwards, counting consecutive B's right behind your position. Your next jump will be shorter, depending on how many B's you've found. So, for instance:
In AAABCBBBBD, after the jump you are at the third B. You go backwards and find one B; going backwards again you find a C, so you stop. Now you already know you have a run of 2, so your next jump can't be 3 - you might miss a run of 4 B's. So you jump 2 and land on a B. Go back one and find a B. The next backwards position is where you started, so you know that you found a run of 4.
In that example it didn't make much of a difference, but if you use a string like AAABBBBCDDECEE instead, you can see that after you jump from the first C to the last C you only need to backtrack once, because after seeing that the letter behind you is an E you no longer care about what was across that jump.
I've coded both methods, and the second one has been 2 to 3 times faster. Now I'm really curious: is there an even faster way to find it?
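A possible implementation of the jump-and-scan-back idea (my own reconstruction, not the asker's code; it completes a promising run by also scanning forward, which stands in for the "reduced next jump" bookkeeping described above):

```python
def longest_run_skip(s):
    """Longest run of a repeated character, probing only positions
    that could belong to a run longer than the best found so far."""
    best = 0
    i = 0
    n = len(s)
    while i < n:
        c = s[i]
        run = 1
        # scan backwards: consecutive c's ending at position i
        j = i - 1
        while j >= 0 and s[j] == c:
            run += 1
            j -= 1
        # scan forwards to complete the run containing i
        j = i + 1
        while j < n and s[j] == c:
            run += 1
            j += 1
        best = max(best, run)
        # any run lying entirely inside the next `best` positions is no
        # longer than `best`, so the next probe can jump past them
        i = j + best
    return best
```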
A few days ago I came across such a problem at the contest my uni was holding:
Given the history of guesses in a mastermind game using digits instead
of colors in a form of pairs (x, y) where x is the guess and y is how
many digits were placed correctly, guess the correct number. Each
input is guaranteed to have a solution.
Example for a 5-place game:
(90342, 2)
(70794, 0)
(39458, 2)
(34109, 1)
(51545, 2)
(12531, 1)
Should yield:
39542
Create an algorithm to correctly guess the result in an n-place
mastermind given the history.
So the only idea I had was to keep the probability of each digit being correct, based on the correct shots in each guess, and then try to generate the most probable number, then the next one, and so on. For example, 9 would be 40% probable for the first place (because the first guess has 2/5 = 40% correct), 7 would be impossible, and so on. Then we do the same for the other places in the number and finally generate the number with the highest probability to test against all the guesses.
The problem with this approach, though, is that generating the next most probable number, and the next, and so on (as we probably won't score a home run on the first try) is really non-trivial (or at least I don't see an easy way of implementing it). And since this contest had something like a 90-minute timeframe and this wasn't the only problem, I don't think something so elaborate was the anticipated approach.
So how could one do it easier?
An approach that comes to mind is to write a routine that can generally filter an enumeration of combinations based on a particular try and its score.
So for your example, you would initially pick one of the most constrained tries (one of the ones with a score of 2) as a filter and then enumerate all combinations that satisfy it.
The output from that enumeration is then used as input to a filter run for the next unprocessed try, and so on, until the list of tries is exhausted.
The candidate try that comes out of the final enumeration is the solution.
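A sketch of this successive-filter idea in Python (my own illustration; it assumes a 5-digit game and a positional-match scoring function, and seeds the candidate set from the first, most constrained try):

```python
def matches(a, b):
    """Count positions where the zero-padded 5-digit forms of a and b agree."""
    return sum(x == y for x, y in zip('%05d' % a, '%05d' % b))

history = [(90342, 2), (70794, 0), (39458, 2),
           (34109, 1), (51545, 2), (12531, 1)]

# seed the candidate set from the first try, then let every further
# try act as a filter on the survivors
guess, score = history[0]
candidates = [n for n in range(100000) if matches(n, guess) == score]
for guess, score in history[1:]:
    candidates = [n for n in candidates if matches(n, guess) == score]

print(candidates)  # the secret 39542 survives every filter
```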
Probability does not apply here. In this case a number is either right or wrong. There is no "partially right".
For 5 digits you can just test all 100,000 possible numbers against the given history and throw out the ones whose match counts are wrong. You will be left with a list of numbers that meet the criteria; if there is exactly one in the list, then you have solved it. This approach becomes impractical for larger numbers at some point.
Python code, where matches counts the matching digits of its two parameters:
for k in range(0, 100000):
    if (matches(k, 90342) == 2 and matches(k, 70794) == 0
            and matches(k, 39458) == 2 and matches(k, 34109) == 1
            and matches(k, 51545) == 2 and matches(k, 12531) == 1):
        print(k)
prints:
39542
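The matches helper is left undefined above; a minimal version, assuming both arguments are compared as zero-padded 5-digit numbers, could be:

```python
def matches(a, b):
    """Count positions where the 5-digit representations of a and b agree."""
    return sum(x == y for x, y in zip('%05d' % a, '%05d' % b))
```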
First off, this is NOT a homework problem. I haven't had to do homework since 1988!
I have a list of words of length N
I have a max of 13 characters to choose from.
There can be multiples of the same letter
Given the list of words, which 13 characters would spell the most possible words? I can throw out words that make the problem harder to solve. For example, speedometer has 4 e's in it, something MOST words don't have, so I could toss that word due to its poor-fit characteristic, or it might just fall away on its own in the algorithm.
I've looked at letter distributions, and I've built a graph of the words (letter by letter). There is something I'm missing, or this problem is a lot harder than I thought. I'd rather not totally brute-force it if that is possible, but I'm down to about that point right now.
Genetic algorithms come to mind, but I've never tried them before....
Seems like I need a way to score each letter based upon its association with other letters in the words it is in....
It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possibly with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-complete optimization problem in general (assuming, possibly, an alphabet, i.e. a set of letters, that contains more than 26 items) by reduction of SETCOVER to it, but I'm leaving the actual reduction as an exercise to the reader :)
Assuming it's hard, you have the usual routes:
branch and bound
stochastic search
approximation algorithms
Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of
Letters you've already used (with multiplicity)
Number of characters you still get to use
Letters still available
Words still in your list
Number of words still in your list (count of the previous set)
Number of words that are not possible in this state
Number of words that are already covered by your choice of letters
You'd start with
Empty set
13
{A, B, ..., Z}
Your whole list
N
0
0
Put that data structure into a queue.
At each step
Pop an item from the queue
Split into possible next states (branch)
Bound & delete extraneous possibilities
From a state, I'd generate possible next states as follows:
For each letter L in the set of letters left
Generate a new state where:
you've added L to the list of chosen letters
the least letter is L
so you remove anything less than L from the allowed letters
So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
Do all the various accounting to figure out the new states.
Each state has at minimum "Number of words that are already covered by your choice of letters" words, and at maximum that number plus "Number of words still in your list." Of all the states, find the highest minimum, and delete any states with maximum higher than that.
No special handling for speedometer required.
I can't imagine this would be fast, but it'd work.
There are probably some optimizations (e.g., store each word in your list as an array of A-Z occurrence counts, and combine words with the same structure: two occurrences of the structure A,B,T => BAT and TAB). How you sort and keep track of the minimum and maximum can probably also help things somewhat. Probably not enough to make an asymptotic difference, but for a problem this big maybe enough to make it run in a reasonable time instead of an extreme time.
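The grouping of words with identical letter counts might look like this (a sketch; the frozen set of letter counts serves as the dictionary key):

```python
from collections import Counter

def group_by_structure(words):
    """Map each letter-count structure to the words sharing it;
    a chosen pool of letters covers all of them or none of them."""
    groups = {}
    for w in words:
        key = frozenset(Counter(w).items())  # hashable letter multiset
        groups.setdefault(key, []).append(w)
    return groups
```

For example, `group_by_structure(["bat", "tab", "tub"])` puts BAT and TAB into the same group.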
Total brute forcing should work, although the implementation would become quite confusing.
Instead of throwing words like speedometer out, can't you generate the association graphs considering only whether a character appears in the word or not (irrespective of the number of times it appears, as that should not have any bearing on the final best choice of 13 characters)? This would also make it fractionally simpler than total brute force.
Comments welcome. :)
Removing the bounds on each parameter including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632 . It's fixed-parameter tractable in the alphabet size by brute force.
I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is only about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
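Brute force over multisets, sketched here for a small alphabet (my own illustration; with the full 26 letters and 13 slots this enumerates the ~5 billion multisets mentioned above, so treat it as a correctness reference rather than a practical solution):

```python
from collections import Counter
from itertools import combinations_with_replacement

def best_pool(words, k, alphabet):
    """Exhaustively pick the k-letter multiset covering the most words."""
    counts = [Counter(w) for w in words]
    best, best_covered = None, -1
    for combo in combinations_with_replacement(alphabet, k):
        pool = Counter(combo)
        # a word is covered if the pool has enough of each of its letters
        covered = sum(all(pool[c] >= n for c, n in wc.items())
                      for wc in counts)
        if covered > best_covered:
            best, best_covered = pool, covered
    return best, best_covered
```

On a toy instance, `best_pool(["bat", "tab", "at", "tub"], 4, "abtu")` covers all four words, while only three can be covered with k=3.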
I did not completely understand "I have a max of 13 characters to choose from." If you have a list of 1000 words, do you mean you have to reduce that to just 13 chars?!
Some thoughts based on my (mis)understanding:
If you are only handling English lang words, then you can skip vowels because consonants are just as descriptive. Our brains can sort of fill in the vowels - a.k.a SMS/Twitter language :)
Perhaps for 1-3 letter words, stripping off vowels would lose too much info. But still:
spdmtr hs 4 's n t, smthng
MST wrds dn't hv, s cld
tss tht wrd d t pr ft
chrctrstc, r t mght jst g
wy bsd n th lgrthm
Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....
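The vowel-stripping step itself is a one-liner (a minimal sketch):

```python
def strip_vowels(text):
    """Drop vowels, keeping everything else (consonants, spaces, case)."""
    return ''.join(c for c in text if c.lower() not in 'aeiou')
```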
I don't even know if a solution exists or not. Here is the problem in detail. You are a program that is accepting an infinitely long stream of characters (for simplicity, you can assume characters are either 1 or 0). At any point, I can stop the stream (let's say after N characters have passed through) and ask you if the string received so far is a palindrome or not. How can you do this using sub-linear space and/or time?
Yes. The answer is about two-thirds of the way down http://rjlipton.wordpress.com/2011/01/12/stringology-the-real-string-theory/
EDIT: Some people have asked me to summarize the result, in case the link dies. The link gives some details about a proof of the following theorem: There is a multi-tape Turing machine that can recognize initial non-trivial palindromes in real-time. (A summary, also provided by the article linked: Suppose the machine has read x1, x2, ..., xk of the input. Then it has only constant time to decide if x1, x2, ..., xk is a palindrome.)
A multitape Turing machine is just one with several side-by-side tapes that it can read and write to; in a very specific sense it is exactly equivalent to a standard Turing machine.
A real-time computation is one in which the Turing machine must read a character from input at least once every M steps (for some bounded constant M). It follows readily that any real-time algorithm is also linear-time.
There is a roughly 10-page paper on the proof, available behind an institutional paywall here, which I will not repost elsewhere. You can contact the author for a more detailed explanation if you'd like; I just happened to have read this recently and realized it was more or less what you were looking for.
You could use a rolling hash, or more rolling hashes for accuracy. Incrementally compute the hash of the characters read so far, in the order they were read, and in reverse order of reading.
If your hash function is x1*3^(k-1) + x2*3^(k-2) + ... + xk*3^0, for example, where xi is the i-th character read, this is how you'd do it:
hLeftRight = 0
hRightLeft = 0
k = 0
repeat while there are characters in the stream
    x = stream.Get()
    hLeftRight = 3*hLeftRight + x.Value
    hRightLeft = hRightLeft + 3^k * x.Value
    if (x.QueryPalindrome == true)
        yield hLeftRight == hRightLeft
    k = k + 1
Obviously you'd have to calculate the hashes modulo something, probably a prime or a power of two. And of course, this could lead to false positives.
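A runnable version of the pseudocode above (my own sketch; the base and the Mersenne-prime modulus are arbitrary choices, and as noted, equal hashes can still be false positives):

```python
class PalindromeStream:
    """Incremental palindrome check with two rolling hashes.
    hLR hashes the prefix read left-to-right, hRL as if read
    right-to-left; they agree (up to hash collisions) exactly
    when the prefix read so far is a palindrome."""
    MOD = (1 << 61) - 1  # a large prime modulus
    BASE = 3

    def __init__(self):
        self.hLR = 0
        self.hRL = 0
        self.power = 1  # BASE^k mod MOD, k = characters read so far

    def push(self, x):
        # x is a small integer (e.g. 0 or 1 for the binary stream)
        self.hLR = (self.BASE * self.hLR + x) % self.MOD
        self.hRL = (self.hRL + self.power * x) % self.MOD
        self.power = self.power * self.BASE % self.MOD

    def is_palindrome(self):
        return self.hLR == self.hRL
```

This uses constant space and constant time per character, at the price of a small collision probability.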
Round 2
As I see it, with each new character, there are three cases:
Character breaks potential symmetry, for example, aab -> aabc
Character extends the middle, for example aab -> aabb
Character continues symmetry, for example aab->aaba
Assume you have a pointer that tracks down the string and points to the last character that continued a potential palindrome.
(I am going to use parentheses to indicate the pointed-at character.)
Let's say you are starting with aa(b) and get:
an 'a' (case 3): you move the pointer to the left and check if it's an 'a' (it is). You now have a(a)b.
a 'c' (case 1): you are not expecting a 'c'; in this case you start back at the beginning, and you now have aab(c).
The really tricky case is 2, because somehow you have to know that the character you just got isn't affecting symmetry, it is just extending the middle. For this, you have to hold an additional pointer that tracks where the plateau's (middle's) edge lies. For example, you have (b)baabb and you just got another 'b', in this case you have to know to reset the pointer to the base of the middle plateau here: bbaa(b)bb. Since we are going for constant time, you have to hold a pointer here to begin with (you can't afford the time to search for the plateau's edge). Now if you get another 'b', you know that you are still on the edge of that plateau and you keep the pointer where it is, so bbaa(b)bb -> bbaa(b)bbb. Now, if you get an 'a', you know that the 'b's are not part of the extended middle and you reset both pointers (The tracking pointer and the edge pointer) so you now have bbaabbbb((a)).
With these three cases, I think all bases are covered. If you ever want to check if the current string is a palindrome, check if the first pointer (not the plateau's edge pointer) is at index 0.
This might help you:
http://arxiv.org/pdf/1308.3466v1.pdf
If you store the last k input symbols, you can easily find palindromes up to a length of k.
If you use the algorithms from the paper, you can find the midpoints of palindromes and an estimate of their lengths.