I'm working on a genetic algorithm as a sort of having-fun-with-programming exercise.
The problem is to map two texts in different languages onto each other sentence by sentence. Since different translations can start and end sentences in different places, the correspondence between the texts is large but not perfect.
So, take for example two strings:
a X. a a Y. a aaa a. X a a. a aa. aaa a Y aa.
X b bb. b Y bb bbb b X bb. bb. b. bb Y. bbb b.
where X and Y are anchors and the a's and b's are language-specific words. The gene I'm working with treats every existing sentence break (each full stop except the final one) as a binary value; the example above has six sentences in each text, so five internal breaks per text and ten bits in total. Working manually, without understanding the text itself, I would assume the best answer is:
10011 11010
(1 = keep the break, 0 = full stop removed and sentences merged), which would result in:
a X. a a Y a aaa a X a a. a aa. aaa a Y aa.
X b bb. b Y bb bbb b X bb. bb b. bb Y bbb b.
My fitness function evaluates four conditions, as follows:
The number of sentences should be as high as possible.
There must be an equal number of sentences in both texts.
The lengths of sentences should be as close as possible.
As many anchors as possible should be found in the same sentence in both texts (anchors are for example character names in a sample novel I'm using).
However, I'm not getting the result I'm hoping for, because condition 2, which is very important, takes over no matter what I do and evolution stalls. Unless I heavily tweak things, I end up either merging the whole thing into one huge sentence (gene 00000 00000 above) or getting stuck at the first gene that happens to match the number of sentences.
In general, how can this problem be overcome? How can I keep my fitness function from getting stuck when it has to balance conditions that work against each other? Should the process be divided into two stages? If so, how would I reintroduce condition 2 after some evolution? Should I evaluate it only every N generations?
Rather than splitting the problem into two stages, I think you'd be better off improving your fitness function. Having conditions that work against each other doesn't have to be a problem; the key is how the fitness score is calculated. If you give each condition a weight, each weight can be tweaked for better results.
If you could give more detail about how the fitness score is calculated it would be easier to work out what is going on, but it sounds as though certain conditions carry an inherently greater weight than others, especially since condition 1 should give a low score if the result is one huge sentence, and vice versa. By tweaking the relative weights of the conditions you should be able to aim for a gene where both conditions do well, rather than one or the other.
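To make the weighting idea concrete, here is a minimal sketch (not the asker's actual function): the gene is assumed to have already been decoded into two sentence lists, and all four weights are made-up starting points. Note how condition 2 is scored smoothly instead of as a hard pass/fail, which is one way to keep it from dominating.

```python
# A sketch of a weighted fitness score, assuming the gene has already been
# decoded into two lists of sentences (one per text).  The weights are
# hypothetical and would need tuning against real data.

def fitness(sents_a, sents_b, anchors, max_sents,
            w_count=1.0, w_equal=2.0, w_length=1.0, w_anchor=3.0):
    # Condition 1: reward keeping many sentences (normalised to 0..1).
    count_score = (len(sents_a) + len(sents_b)) / (2.0 * max_sents)

    # Condition 2: penalise unequal sentence counts smoothly rather than
    # all-or-nothing, so this term cannot drown out the others.
    diff = abs(len(sents_a) - len(sents_b))
    equal_score = 1.0 / (1.0 + diff)

    # Condition 3: lengths of aligned sentences should be close.
    pairs = list(zip(sents_a, sents_b))
    length_score = sum(min(len(a), len(b)) / max(len(a), len(b), 1)
                       for a, b in pairs) / max(len(pairs), 1)

    # Condition 4: anchors found in the same sentence position in both texts.
    hits = {x for a, b in pairs for x in anchors if x in a and x in b}
    anchor_score = len(hits) / max(len(anchors), 1)

    return (w_count * count_score + w_equal * equal_score +
            w_length * length_score + w_anchor * anchor_score)

# Example with the texts from the question after applying gene 10011 11010:
# fitness("a X. a a Y a aaa a X a a. a aa. aaa a Y aa.".split(". "),
#         "X b bb. b Y bb bbb b X bb. bb b. bb Y bbb b.".split(". "),
#         anchors=["X", "Y"], max_sents=6)
```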
I have two arrays of strings, of lengths m and n respectively, where every string has the same length x, and I want to find the matching pairs that share the largest possible number of letters in the same positions:
In a simple case, just consider these two strings
Sm = [AAAA, BBBB]
Sn = [ABBA, AAAA, AAAA, CCCC]
Expected results (2 pairs matched, 2 strings left alone):
Pair 1: AAAA -> AAAA because of score 4
Pair 2: BBBB -> ABBA because of score 2
Strings in Sn that are left alone:
AAAA because the same string in Sm has been matched already
CCCC because unable to match any
Score matrix (rows = Sm, columns = Sn):

         ABBA  AAAA  AAAA  CCCC
AAAA      2     4     4     0
BBBB      2     0     0     0
My current method (Slow):
Get the string length x, which is the max score (the case where all letters are identical) - in this case it is 4
Brute-force compare m×n times to generate the score matrix above - in this case 2*4 = 8 comparisons
Loop from x to 1: (In this case it is looping from 4 to 1)
Walk through the score matrix and pop the string pairs whose score equals the current loop value
Mark remaining unpaired strings or strings with 0 score as alone
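For reference, a minimal Python sketch of the method as described above (function names are illustrative only):

```python
from itertools import product

def positional_score(a, b):
    """Number of positions where two equal-length strings agree."""
    return sum(1 for ca, cb in zip(a, b) if ca == cb)

def greedy_pairs(sm, sn):
    x = len(sm[0])                                  # max possible score
    scores = {(i, j): positional_score(a, b)
              for (i, a), (j, b) in product(enumerate(sm), enumerate(sn))}

    pairs, used_m, used_n = [], set(), set()
    for target in range(x, 0, -1):                  # loop from x down to 1
        for (i, j), s in scores.items():
            if s == target and i not in used_m and j not in used_n:
                pairs.append((sm[i], sn[j], s))
                used_m.add(i)
                used_n.add(j)
    alone = [sn[j] for j in range(len(sn)) if j not in used_n]
    return pairs, alone

# greedy_pairs(["AAAA", "BBBB"], ["ABBA", "AAAA", "AAAA", "CCCC"])
# -> ([("AAAA", "AAAA", 4), ("BBBB", "ABBA", 2)], ["AAAA", "CCCC"])
```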
Question:
My current method is slow: producing the score matrix takes O(mn) time (x will not be large, so I treat it as a constant here).
Is there any algorithm that can perform better than O(mn) complexity?
Sorry, I don't yet have enough rep to simply leave a comment, but in a project I wrote a long time ago I leveraged the Levenshtein distance algorithm. Specifically, see this project for some helpful insight.
As far as I can tell you are doing the most efficient thing. To be completely thorough you need to compare every string in Sm with every string in Sn, so at best the algorithm will be O(mn). Anything less would not be comparing every element to every element.
One optimization could be to remove duplicates first, but in most circumstances that extra pass would cost more than it saves.
I have a file consisting of a lot of lines like:
John is running at night
John is not walking at night
Jack is running at night
Jack is waiting for someone
John is waiting for someone
and I need to write a program that will group similar sentences and print them to a file.
Similar sentences are sentences in which only a single word differs.
For example, the output file should look like:
John is running at night
Jack is running at night
The changing word was: John, Jack
Jack is waiting for someone
John is waiting for someone
The changing word was: John, Jack
I thought of implementing it by parsing the file and grouping the strings by the number of words in each one (all strings with 6 words in one group, all strings with 5 words in another group, and so on).
After arranging the groups I can split each string into a list of words and compare each string against the others in its group to check for a match.
I think my solution is not efficient.
Does anyone have a better solution?
Let us assume there are M sentences with an average of N words each. For every sentence we wish to produce a list of indices of other sentences (up to M - 1) that differ by exactly one word. Thus, the input size is O(MN) words and the output size is O(M²) numbers. Here is an algorithm that runs in O(MN + M²) and is therefore optimal.
First, read all the sentences, split them into words and index the words in a hash table. Thus, we can think of sentences as arrays of word indices. To help our thought process, we can further think of sentences as lowercase Latin strings by replacing each distinct word with a letter (this works for up to 26 distinct words).
Now we wish to be able to query each pair of strings (A, B) in O(1) and ask "do A and B differ by exactly one letter?" To answer,
let l be the common length of A and B;
let p be the length of the common prefix between A and B;
let s be the length of the common suffix between A and B;
then notice that A and B differ by exactly one letter if l = p + s + 1.
Therefore, our algorithm boils down to determining, in constant time, the length of the common prefix and common suffix for every pair of strings. We show how to do this for prefixes. The same approach works for suffixes, e.g. by reversing the strings.
First, sort the strings and measure the common prefixes between each consecutive pair. For example:
banana
> common prefix 3 ("ban")
band
> common prefix 4
bandit
> common prefix 1
brother
> common prefix 7
brotherly
> common prefix 0
car
Now, suppose you want to query the common prefix between "band" and "brotherly". This is the minimum of the values recorded between "band" and "brotherly", i.e. min(4, 1, 7) = 1. This can be achieved with range minimum queries in O(M) preprocessing time and O(1) per query, although simpler implementations are available with O(M log M) preprocessing time.
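A small Python sketch of the prefix half of this (the naive min over a slice stands in for a real range-minimum structure, so queries here are not actually O(1); suffixes are handled the same way on the reversed sentences):

```python
def common_prefix_len(a, b):
    """Length of the common prefix of two sequences (strings or word tuples)."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def build_prefix_index(items):
    """Sort the items and record common-prefix lengths of adjacent pairs."""
    order = sorted(range(len(items)), key=lambda i: items[i])
    adj = [common_prefix_len(items[order[k]], items[order[k + 1]])
           for k in range(len(order) - 1)]
    rank = {i: k for k, i in enumerate(order)}   # item index -> sorted position
    return rank, adj

def prefix_query(rank, adj, i, j):
    """Common prefix length of items i and j (assumes i != j)."""
    lo, hi = sorted((rank[i], rank[j]))
    return min(adj[lo:hi])   # a range-minimum structure would make this O(1)

# With ["banana", "band", "bandit", "brother", "brotherly", "car"], querying
# "band" (index 1) against "brotherly" (index 4) returns min(4, 1, 7) = 1.
```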
An autogram is a sentence which describes the characters it contains, usually enumerating each letter of the alphabet, but possibly also the punctuation it contains. Here is the example given in the wiki page.
This sentence employs two a’s, two c’s, two d’s, twenty-eight e’s, five f’s, three g’s, eight h’s, eleven i’s, three l’s, two m’s, thirteen n’s, nine o’s, two p’s, five r’s, twenty-five s’s, twenty-three t’s, six v’s, ten w’s, two x’s, five y’s, and one z.
Coming up with one is hard, because you don't know how many letters it contains until you finish the sentence, which is what prompts me to ask: is it possible to write an algorithm which could create an autogram? For example, a given parameter would be the start of the sentence, e.g. "This sentence employs", assuming that it uses the same format as the above ("x a's, ..., y z's").
I'm not asking for you to actually write an algorithm, although by all means I'd love to see if you know one to exist or want to try and write one; rather I'm curious as to whether the problem is computable in the first place.
You are asking two different questions.
"is it possible to write an algorithm which could create an autogram?"
There are algorithms to find autograms. As far as I know, they use randomization, which means that such an algorithm might find a solution for a given start text, but if it doesn't find one, then this doesn't mean that there isn't one. This takes us to the second question.
"I'm curious as to whether the problem is computable in the first place."
Computable would mean that there is an algorithm which for a given start text either outputs a solution, or states that there isn't one. The above-mentioned algorithms can't do that, and an exhaustive search is not workable. Therefore I'd say that this problem is not computable. However, this is rather of academic interest. In practice, the randomized algorithms work well enough.
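For illustration, one common shape for such a randomized search is a fixed-point iteration: guess a count for every letter, render the sentence those counts would produce, re-count the letters, and feed the result back in with some random perturbation to escape cycles. A rough sketch (every detail here is illustrative, and a real search needs much better perturbation strategies to have a realistic chance of terminating quickly):

```python
import random
from collections import Counter

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def to_words(n):
    # Only needs to cover 1..99 for this sketch.
    if n < 20:
        return UNITS[n]
    return TENS[n // 10] + ("-" + UNITS[n % 10] if n % 10 else "")

def render(prefix, counts):
    parts = [f"{to_words(c)} {ch}'s" for ch, c in sorted(counts.items())]
    return prefix + " " + ", ".join(parts[:-1]) + ", and " + parts[-1] + "."

def letter_counts(sentence):
    return Counter(c for c in sentence.lower() if c.isalpha())

def search(prefix="This sentence employs",
           letters="abcdefghijklmnopqrstuvwxyz", max_iter=200000):
    counts = {ch: random.randint(1, 30) for ch in letters}
    for _ in range(max_iter):
        actual = letter_counts(render(prefix, counts))
        actual = {ch: actual.get(ch, 0) for ch in letters}
        if actual == counts:
            return render(prefix, counts)   # fixed point: a true autogram
        # Move one randomly chosen letter towards its actual count.
        ch = random.choice(list(letters))
        counts[ch] = actual[ch] if random.random() < 0.9 else random.randint(1, 30)
    return None                             # no fixed point found this run
```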
Let's assume for the moment that all counts are less than or equal to some maximum M, with M < 100. As mentioned in the OP's link, this means that we only need to decide counts for the 16 letters that appear in these number words, as counts for the other 10 letters are already determined by the specified prefix text and can't change.
One property that I think is worth exploiting is the fact that, if we take some (possibly incorrect) solution and rearrange the number-words in it, then the total letter counts don't change. IOW, if we ignore the letters spent "naming themselves" (e.g. the c in two c's) then the total letter counts only depend on the multiset of number-words that are actually present in the sentence. What that means is that instead of having to consider all possible ways of assigning one of M number-words to each of the 16 letters, we can enumerate just the (much smaller) set of all multisets of number-words of size 16 or less, having elements taken from the ground set of number-words of size M, and for each multiset, look to see whether we can fit the 16 letters to its elements in a way that uses each multiset element exactly once.
Note that a multiset of numbers can be uniquely represented as a nondecreasing list of numbers, and this makes them easy to enumerate.
What does it mean for a letter to "fit" a multiset? Suppose we have a multiset W of number-words; this determines total letter counts for each of the 16 letters (for each letter, just sum the counts of that letter across all the number-words in W; also add a count of 1 for the letter "S" for each number-word besides "one", to account for the pluralisation). Call these letter counts f["A"] for the frequency of "A", etc. Pretend we have a function etoi() that operates like C's atoi(), but returns the numeric value of a number-word. (This is just conceptual; of course in practice we would always generate the number-word from the integer value (which we would keep around), and never the other way around.) Then a letter x fits a particular number-word w in W if and only if f[x] + 1 = etoi(w), since writing the letter x itself into the sentence will increase its frequency by 1, thereby making the two sides of the equation equal.
This does not yet address the fact that if more than one letter fits a number-word, only one of them can be assigned it. But it turns out that it is easy to determine whether a given multiset W of number-words, represented as a nondecreasing list of integers, simultaneously fits any set of letters:
Calculate the total letter frequencies f[] that W implies.
Sort these frequencies.
Skip past any zero-frequency letters. Suppose there were k of these.
For each remaining letter, check whether its frequency is equal to one less than the numeric value of the number-word in the corresponding position. I.e. check that f[k] + 1 == etoi(W[0]), f[k+1] + 1 == etoi(W[1]), etc.
If and only if all these frequencies agree, we have a winner!
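A sketch of this fit check in Python. The `base` counter stands for the fixed letter contributions of the prefix text, the letter names, and the number-words of the ten letters whose counts are already determined; that bookkeeping is an assumption here rather than something spelled out above.

```python
from collections import Counter

# Illustrative ground set; a real run would extend this up to the bound M.
NUMBER_WORDS = ["one", "two", "three", "four", "five", "six", "seven", "eight",
                "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
                "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]
ETOI = {w: n for n, w in enumerate(NUMBER_WORDS, start=1)}

LETTERS = "efghilnorstuvwxy"          # the 16 letters that occur in number-words

def implied_counts(multiset, base):
    """Letter frequencies implied by a multiset of number-words, on top of
    the fixed counts in `base`."""
    f = Counter(base)
    for w in multiset:
        f.update(w)
        if w != "one":
            f["s"] += 1               # the pluralising "s" in e.g. "five f's"
    return f

def fits(multiset, base, letters=LETTERS):
    """Can the non-zero-frequency letters be matched one-to-one to the
    number-words so that f[x] + 1 == etoi(w) for every pair?"""
    f = implied_counts(multiset, base)
    freqs = sorted(f[x] for x in letters if f[x] > 0)   # skip zero-frequency letters
    words = sorted(multiset, key=ETOI.get)              # nondecreasing numeric value
    return (len(freqs) == len(words) and
            all(fx + 1 == ETOI[w] for fx, w in zip(freqs, words)))
```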
The above approach is naive in that it assumes that we choose words to put in the multiset from a size M ground set. For M > 20 there is a lot of structure in this set that can be exploited, at the cost of slightly complicating the algorithm. In particular, instead of enumerating straight multisets of this ground set of all allowed numbers, it would be much better to enumerate multisets of {"one", "two", ..., "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}, and then allow the "fit detection" step to combine the number-words for multiples of 10 with the single-digit number-words.
I am trying to create a DFA that recognizes strings over the alphabet {a,b,c} in which a and c appear an even number of times and b appears an odd number of times.
I am wondering whether this can only be expressed with other methods, such as a Turing machine or a context-free grammar.
You might find it fun to think of the solution.
The way I would go about constructing such a machine is as follows.
Make eight states. Each state represents one possible parity combination, i.e. a 3-tuple of even/odd. The start state is the state where all three counts are even. If a is the first character in the input, then you would go to the state that represents an odd number of a's and an even number of b's and c's. The accept state is where a and c are even, and b is odd.
This is possible using a DFA: simply have a state for each combination of parities (even/odd) of the counts of a's, b's and c's. So if you're in the state with an even # of a's, an odd # of b's and an even # of c's then you accept. You can also define simple transitions for all of the other cases. So naively this can be done with 8 states.
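A quick sketch of that machine, with the state encoded as a parity triple (a mod 2, b mod 2, c mod 2), so the 8 states are exactly the 8 triples:

```python
def accepts(s):
    """DFA over {a, b, c}: accept iff a and c occur an even number of times
    and b occurs an odd number of times."""
    state = (0, 0, 0)                         # start state: all counts even
    for ch in s:
        if ch not in "abc":
            return False                      # symbol outside the alphabet
        i = "abc".index(ch)
        state = tuple(p ^ (1 if k == i else 0) for k, p in enumerate(state))
    return state == (0, 1, 0)                 # accept: a even, b odd, c even

# accepts("b")        -> True   (a: 0, b: 1, c: 0)
# accepts("abcb")     -> False  (b occurs twice)
# accepts("abcabcb")  -> True   (a: 2, b: 3, c: 2)
```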
First off, this is NOT a homework problem. I haven't had to do homework since 1988!
I have a list of words of length N
I have a max of 13 characters to choose from.
There can be multiples of the same letter
Given the list of words, which 13 characters would spell the largest number of words? I can throw out words that make the problem harder to solve, for example:
speedometer has 4 e's in it, something MOST words don't have,
so I could toss that word due to a poor fit characteristic, or it might just
go away based on the algorithm
I've looked at letter distributions, and I've built a graph of the words (letter by letter). There is something I'm missing, or this problem is a lot harder than I thought. I'd rather not totally brute-force it if that is possible, but I'm down to about that point right now.
Genetic algorithms come to mind, but I've never tried them before....
Seems like I need a way to score each letter based upon its association with other letters in the words it is in....
It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possibly with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-hard optimization problem in general (assuming a possibly larger alphabet, i.e. a set of letters containing more than 26 items) by a reduction from SET COVER, but I'm leaving the actual reduction as an exercise to the reader :)
Assuming it's hard, you have the usual routes:
branch and bound
stochastic search
approximation algorithms
Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of
Letters you've already used (with multiplicity)
Number of characters you still get to use
Letters still available
Words still in your list
Number of words still in your list (count of the previous set)
Number of words that are not possible in this state
Number of words that are already covered by your choice of letters
You'd start with
Empty set
13
{A, B, ..., Z}
Your whole list
N
0
0
Put that data structure into a queue.
At each step
Pop an item from the queue
Split into possible next states (branch)
Bound & delete extraneous possibilities
From a state, I'd generate possible next states as follows:
For each letter L in the set of letters left
Generate a new state where:
you've added L to the list of chosen letters
the least letter is L
so you remove anything less than L from the allowed letters
So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
Do all the various accounting to figure out the new states.
Each state covers at minimum "Number of words that are already covered by your choice of letters" words, and at maximum that number plus "Number of words still in your list." Of all the states, find the highest minimum, and delete any states whose maximum is lower than that.
No special handling for speedometer required.
I can't imagine this would be fast, but it'd work.
There are probably some optimizations (e.g., store each word in your list as an array of A-Z occurrence counts, and combine words with the same structure: the signature with one A, one B and one T covers both BAT and TAB). How you sort and keep track of minimum and maximum can also probably help things somewhat. Probably not enough to make an asymptotic difference, but maybe enough, for a problem this big, to make it run in a reasonable time instead of an extreme one.
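A sketch of that signature idea, plus the coverage check it speeds up (all names here are illustrative):

```python
from collections import Counter

def signature(word):
    """26-slot tuple of letter counts, so anagrams like BAT and TAB share one entry."""
    c = Counter(word.lower())
    return tuple(c[chr(ord('a') + i)] for i in range(26))

def group_by_signature(words):
    groups = {}
    for w in words:
        groups.setdefault(signature(w), []).append(w)
    return groups

def coverage(chosen_letters, groups):
    """How many words the chosen multiset of letters can spell."""
    chosen = Counter(chosen_letters.lower())
    return sum(len(ws) for sig, ws in groups.items()
               if all(sig[i] <= chosen[chr(ord('a') + i)] for i in range(26)))

# coverage("aaabcdefghilt", group_by_signature(["bat", "tab", "cab", "speedometer"]))
# -> 3   (speedometer cannot be spelled from these 13 letters)
```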
Total brute forcing should work, although the implementation would become quite confusing.
Instead of throwing out words like speedometer, couldn't you build the association graphs considering only whether a character appears in a word or not (irrespective of the number of times it appears, since that should not have any bearing on the final best choice of 13 characters)? This would also make it fractionally simpler than total brute force.
Comments welcome. :)
Removing the bounds on each parameter, including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632. It's fixed-parameter tractable in the alphabet size by brute force.
I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is only about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
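As a shape-of-the-search illustration only (the full 5-billion enumeration would still take a very long time without the multiplicity pruning mentioned above):

```python
from collections import Counter
from itertools import combinations_with_replacement
from string import ascii_lowercase

def best_letters(words, k=13):
    """Exhaustively score every multiset of k letters; purely illustrative,
    since C(38, 25) candidates is only borderline feasible."""
    sigs = [Counter(w.lower()) for w in words]
    best, best_count = None, -1
    for combo in combinations_with_replacement(ascii_lowercase, k):
        chosen = Counter(combo)
        count = sum(1 for s in sigs
                    if all(chosen[c] >= n for c, n in s.items()))
        if count > best_count:
            best, best_count = combo, count
    return best, best_count
```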
I did not understand this part completely: "I have a max of 13 characters to choose from." If you have a list of 1000 words, did you mean you have to reduce that to just 13 chars?!
Some thoughts based on my (mis)understanding:
If you are only handling English lang words, then you can skip vowels because consonants are just as descriptive. Our brains can sort of fill in the vowels - a.k.a SMS/Twitter language :)
Perhaps for 1-3 letter words, stripping off vowels would lose too much info. But still:
spdmtr hs 4 's n t, smthng
MST wrds dn't hv, s cld
tss tht wrd d t pr ft
chrctrstc, r t mght jst g
wy bsd n th lgrthm
Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....