Minimum distance metric on encoded sequence - algorithm

I'm looking for a minimum distance metric that preserves subsequence substitution. By this I mean that any subsequence of the second sequence can have a different representation but still be considered the same as the corresponding subsequence of the first. The two sequences will always have the same length. I'm familiar with Hamming and Levenshtein distance, but they are probably useless in this case.
Consider these examples:
AABBAA
CCDDCC
has distance 0, because A = C and B = D (or AA = CC and BB = DD).
AABBBBBB
CCDDEEEE
has distance 2, because A = C and B = E (or AA = CC or BB = EE or BBBB = EEEE), but B =/= D (or BB =/= DD).
However, the function does not have to behave exactly like that. I just need to know how similar the unencoded sequence is, in terms of repetition, to the encoded one. You could assume that the second sequence is encoded with something like a Caesar cipher (although I'm not sure whether, e.g., the shift could vary over time).
Note:
I also thought about compressing the two sequences with the LZW algorithm and comparing their compression ratios. Any other ideas?

You can relabel the elements of each sequence with consecutive numbers, in order of first appearance, and then use Levenshtein distance or something like that.
AACCAABB --> 11221133 (A->1, C->2, B->3)
CCXXCCYY --> 11221133 (C->1, X->2, Y->3)
d(AACCAABB, CCXXCCYY) = d(11221133, 11221133) = 0
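The relabeling above can be sketched in a few lines of Python. This is only an illustration: the names canonical and pattern_distance are made up here, and Hamming distance is used as the comparison because the sequences are stated to have equal length (Levenshtein would work on the relabeled sequences as well).

```python
def canonical(seq):
    """Relabel symbols as 1, 2, 3, ... in order of first appearance."""
    labels = {}
    # setdefault returns the existing label, or stores and returns the next free one
    return [labels.setdefault(ch, len(labels) + 1) for ch in seq]


def pattern_distance(a, b):
    """Hamming distance between the canonical forms of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(canonical(a), canonical(b)))
```

For AABBAA and CCDDCC this gives 0. For the question's second pair, AABBBBBB and CCDDEEEE, it gives 4 rather than the 2 asked for, because the labels are fixed by first appearance (B and D both become 2, so all four E positions count as mismatches).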


algorithm to find unique, non equivalent configurations given the height, the width, and the number of states each element can be

So recently I have been attempting to solve a code challenge and cannot find the answer. The issue is not the implementation, but rather what to implement. The prompt can be found here: http://pastebin.com/DxQssyKd
The main useful information from the prompt is as follows:
"Write a function answer(w, h, s) that takes 3 integers and returns the number of unique, non-equivalent configurations that can be found on a star grid w blocks wide and h blocks tall where each celestial body has s possible states. Equivalency is defined as above: any two star grids with each celestial body in the same state where the actual order of the rows and columns do not matter (and can thus be freely swapped around). Star grid standardization means that the width and height of the grid will always be between 1 and 12, inclusive. And while there are a variety of celestial bodies in each grid, the number of states of those bodies is between 2 and 20, inclusive. The answer can be over 20 digits long, so return it as a decimal string."
Equivalency works in such a way that, for example,
00
01
is equivalent to
01
00
and so on.
The problem is, what algorithm(s) should I use? I know this is somewhat related to permutations, combinations, and group theory, but I cannot find anything specific.
The key weapon is Burnside's lemma, which equates the number of orbits of the symmetry group G = S_w × S_h acting on the set of configurations X = ([w] × [h] → [s]) (i.e., the answer) to the sum 1/|G| ∑_{g∈G} |X^g|, where X^g = {x | g.x = x} is the set of elements fixed by g.
Given g, it's straightforward to compute |X^g|: use g to construct a graph on vertices [w] × [h] where there is an edge between (i, j) and g(i, j) for all (i, j). Count c, the number of connected components, and return s^c. The reasoning is that every vertex in a connected component must have the same state, but vertices in different components are unrelated.
Now, for 12 × 12 grids, there are far too many values of g to do this calculation on. Fortunately, when g and g' are conjugate (i.e., there exists some h such that h.g.h^-1 = g') we find that |X^g'| = |{x | g'.x = x}| = |{x | h.g.h^-1.x = x}| = |{x | g.h^-1.x = h^-1.x}| = |{h.y | g.y = y}| = |{y | g.y = y}| = |X^g|. We can thus sum over conjugacy classes and multiply each term by the number of group elements in the class.
The last piece is the conjugacy class structure of G = S_w × S_h. The conjugacy classes of this direct product are just the products of the conjugacy classes of S_w and S_h. The conjugacy classes of S_n are in one-to-one correspondence with integer partitions of n, enumerable by standard recursive methods. To compute the size of a class, you divide n! by the product of the partition terms (because circular permutations of the cycles are equivalent) and also by the product of the number of symmetries between cycles of the same size (product of the factorials of the multiplicities). See https://groupprops.subwiki.org/wiki/Conjugacy_class_size_formula_in_symmetric_group.
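A minimal Python sketch of the whole computation follows. It keeps the function name answer(w, h, s) from the prompt; the helpers partitions and class_size are names made up for this illustration. Instead of building the graph explicitly, it uses a standard shortcut that is swapped in here: a row cycle of length a combined with a column cycle of length b splits its a × b block of cells into gcd(a, b) components, so c is the sum of gcd(a, b) over all pairs of cycles.

```python
from collections import Counter
from math import factorial, gcd


def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing lists."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield [k] + rest


def class_size(cycle_type):
    """Number of permutations in S_n with the given cycle type."""
    size = factorial(sum(cycle_type))
    for length, mult in Counter(cycle_type).items():
        # divide by length^mult (rotations) and mult! (swapping equal cycles)
        size //= length ** mult * factorial(mult)
    return size


def answer(w, h, s):
    total = 0
    for pw in partitions(w):
        for ph in partitions(h):
            # number of connected components for this pair of cycle types
            cycles = sum(gcd(a, b) for a in pw for b in ph)
            total += class_size(pw) * class_size(ph) * s ** cycles
    # Burnside: average the fixed-point counts over the whole group
    return str(total // (factorial(w) * factorial(h)))
```

For a 2 × 2 grid with 2 states this returns "7", which can be checked by hand from the four conjugacy classes of S_2 × S_2.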

Need efficient algorithm in combinatorics

I am trying to find the best (realistic) algorithm for solving a cryptography challenge, in which:
the given cipher text C is made of about 6000 characters taken in the set S={A,B,C,...,Y,a,b,c,...y}. So |S| = 50.
the encryption scheme does not allow two identical adjacent characters in C
25 letters in S are called Nulls, and are unknown
these Nulls must be removed from C to obtain the actual cipher text C' which can then be attacked.
the list of Nulls in C is named N and |N| is close to |C|/2 = 3000
so: |N| + |C'| = |C|
My aim is to identify the 25 Nulls, satisfying these two conditions:
there may not be two identical adjacent characters in C'
there may not be two identical adjacent Nulls in N
Obviously by brute force there are 50!/(25! 25!) = 126410606437752 combinations of 25 Nulls in S, so this is not a realistic approach.
I have tried to recursively explore the tree of sets of Nulls and 'cut branches' as much and as soon as possible.
For example, when adding a letter of S to the subset of Nulls, if the sequence "x n1n2 x" appears in C where x is not yet a Null and n1n2 are Nulls, then x should be a Null too.
However this is not enough for a run-time lower than a few centuries...
Can you think of a cleverer algorithm for identifying these 25 Nulls?
Note: there might be more than one set of Nulls satisfying the two conditions
Let's try something like this:
Create a list of sets, each containing one char from S. Each set is a candidate group of null chars.
while you have more than two sets:
    for each set:
        search the cipher text for X[<set-chars>]+X
        if found, union the set with the set that has X in it.
    if no sets were united, start recursing with two sets united.
You can speed things up if you keep a separate copy of the cipher text for each set, with the chars of that set removed. If you do so, the search is easier: you are searching for XX, which has constant length. Every time you union two sets you need to remove all the chars of the united set from its copy of the cipher text.
The time this will take depends on the string C you are given.
An explanation about the sets: each set is a candidate for a subset of C' or of N. If you find that A and X are in the same group, then {A, X} is either a subset of N or of C'. If you later find the same about Y and B, then {Y, B} is such a subset too. Later, finding a substring YAXAXY means that Y is in the same group as A and X, and so is B, because it is with Y. At the end you will be left with two groups, one for C' and one for N, which you can't tell apart.
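The forced-merge part of this procedure can be sketched in Python with a small union-find over the symbols. This is a sketch under assumptions: the name group_symbols is made up, the input is assumed to contain no two identical adjacent characters (as the question states), and the recursion step for the case where no merge is forced is left out.

```python
def group_symbols(cipher):
    """Merge symbols that must be on the same side (all Nulls or all non-Nulls)."""
    parent = {c: c for c in set(cipher)}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    changed = True
    while changed:
        changed = False
        for root in [c for c in parent if parent[c] == c]:
            if parent[root] != root:
                continue  # absorbed by another group earlier in this pass
            prev = None
            for ch in cipher:
                if find(ch) == root:
                    continue  # pretend this group's symbols were removed
                if ch == prev:
                    # pattern X [group]+ X: X is forced into the group
                    parent[find(ch)] = root
                    changed = True
                    prev = None
                else:
                    prev = ch

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())
```

The outer loop repeats until a full pass makes no merge, so the groups returned are consistent with both adjacency conditions. If more than two groups remain, the remaining choices still have to be explored recursively, as described above.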
elyashiv's method is the good one.
It is very fast.
I have produced the two sets C' and N, which are equivalent.
The subsets S1 and S2 of S that produce C' and N satisfy S = S1 U S2, as required.
Thank you.

Implementing cartesian product, such that it can skip iterations

I want to implement a function which returns the Cartesian product of a set with itself, repeated a given number of times. For example
input: {a, b}, 2
output:
aa
ab
bb
ba
input: {a, b}, 3
output:
aaa
aab
aba
abb
baa
bab
bba
bbb
However, the only way I can implement it is by first computing the Cartesian product of 2 sets ("ab", "ab"), and then combining the output with the same set again. Here it is in Python-style pseudo-code:
def product(A, B):
    # combine every element of A with every element of B
    result = []
    for i in A:
        for j in B:
            result.append(i + j)
    return result

def product1(chars, count):
    # build chars^count by extending the previous product one step at a time
    result = product(chars, chars)
    for _ in range(2, count):
        result = product(result, chars)
    return result
What I want is to start computing the last set directly, without computing all of the sets before it. Is this possible? A solution which gives a similar result but isn't a Cartesian product is also acceptable.
I have no problem reading most general-purpose programming languages, so if you need to post code you can do it in any language you feel comfortable with.
Here's a recursive algorithm that builds S^n without building S^(n-1) "first". Imagine an infinite k-ary tree where |S| = k. Label with the elements of S each of the edges connecting any parent to its k children. An element of S^m can be thought of as any path of length m from the root. The set S^m, in that way of thinking, is the set of all such paths. Now the problem of finding S^n is a problem of enumerating all paths of length n - and we can name a path by considering the sequence of edge labels from beginning to end. We want to directly generate S^n without first enumerating all of S^(n-1), so a depth-first search modified to find all nodes at depth n seems appropriate. This is essentially how the below algorithm works:
// collection to hold generated output
members = []

// recursive function to explore product space
Products(set[1...n], length, current[1...m])
    // if the product we're working on is of the
    // desired length then record a copy of it and return
    if m = length then
        members.append(copy of current)
        return
    // otherwise we add each possible value to the end
    // and generate all products of the desired length
    // with the new vector as a prefix
    for i = 1 to n do
        current.addLast(set[i])
        Products(set, length, current)
        current.removeLast()

// reset the result collection and request the set be generated
members = []
Products([a, b], 3, [])
Now, a breadth-first approach is no less efficient than a depth-first one, and if you think about it, it would be no different from exactly what you're already doing. Indeed, any approach that generates S^n must necessarily generate S^(n-1) at least once, since that can be found in a solution to S^n.
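For comparison, the depth-first enumeration described above can be written as a Python generator. This is a sketch, not the answerer's code: the name products and the tuple-based prefix are choices made here, and the symbols are assumed to be strings so that they can be joined.

```python
def products(symbols, length, prefix=()):
    """Yield every string of the given length over symbols, depth-first."""
    if len(prefix) == length:
        yield "".join(prefix)
        return
    for s in symbols:
        # extend the current prefix by one symbol and recurse
        yield from products(symbols, length, prefix + (s,))


for word in products("ab", 2):
    print(word)
```

Because it is a generator, no earlier S^k is ever stored as a whole, and the caller can stop consuming results at any point. Python's standard library offers the same enumeration as itertools.product(symbols, repeat=length).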

How to remove OCR artifacts from text?

OCR generated texts sometimes come with artifacts, such as this one:
Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint
While it is not unusual for letter spacing to be used for emphasis (probably due to early printing press limitations), it is unfavorable for retrieval tasks.
How can one turn the above text into a more, say, canonical form, like:
Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint
Can this be done efficiently for large amounts of text?
One idea would be to concatenate the whole string (to avoid guessing where the word boundaries are) and then run a text segmentation algorithm on it, maybe something similar to this: http://norvig.com/ngrams/
If you have a dictionary for the target language, and all spaced-out words consist of just a single word, then it's easy: Just scan through the text, looking for maximal-length runs of spaced-out single letters, and replace them with the single corresponding dictionary word if it exists (and otherwise leave them unchanged).
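The single-word case can be sketched with a regular expression. The sketch rests on assumptions: the name collapse_spaced_words is made up, a "run" is taken to be at least three single letters separated by single spaces, the dictionary is a set of lowercase words, and \w also matches digits and underscores, which may or may not be acceptable for real OCR output.

```python
import re

# three or more single word characters separated by single spaces
SPACED_RUN = re.compile(r"\b(?:\w ){2,}\w\b")


def collapse_spaced_words(text, dictionary):
    """Join spaced-out runs that form a dictionary word; leave other runs unchanged."""
    def fix(match):
        joined = match.group(0).replace(" ", "")
        return joined if joined.lower() in dictionary else match.group(0)

    return SPACED_RUN.sub(fix, text)
```

A run such as n u r is joined when nur is in the dictionary, while a run that spans several words stays untouched and has to be handled separately.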
The only real difficulty is with strings like m i t d e m that correspond to two or more separate words. A simple way would be to greedily "nibble off" prefixes that appear in the dictionary, but this might lead to suboptimal results, and in particular to a suffix that doesn't correspond to any dictionary string even though a different choice of breakpoints would have worked (e.g. b e i m A r z t won't work if you greedily grab bei instead of beim from the front). Fortunately there's a simple DP approach that will do a better job, and it can even incorporate weights on words, which helps to pick the most likely decomposition when there is more than one. It runs in linear time if candidate word lengths are bounded by a constant, and needs O(n^2) dictionary lookups otherwise. Given a string S[1 .. n] (with spaces removed), we will compute f(i), the score of the best decomposition of the length-i prefix of S, for all 1 <= i <= n:
f(0) = 0
f(i) = max over all 0 <= j < i of f(j) + dictScore(S[j+1 .. i])
f(n) will then be the score of the best possible decomposition of the entire string. If you set dictScore(T) to 1 for words that exist in the dictionary and 0 for words that don't, you will get a decomposition into as many words as possible; if you set dictScore(T) to, e.g., -1 for words that exist in the dictionary and -2 for words that don't, you'll get a decomposition into as few words as possible. You can also choose to award higher scores for more "likely" words.
After computing these scores, you can walk back through the DP table, remembering for each i the j that achieved the maximum, to reconstruct a decomposition that corresponds to the maximal score.
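Here is a minimal Python sketch of the recurrence and the walk back. The scoring is an assumption chosen for this illustration: -1 for a dictionary word and -2 per character for an unknown chunk, a variant of the "as few words as possible" scoring that avoids ties with the undivided string. The name segment and the max_len cutoff are also made up here.

```python
def segment(text, dictionary, max_len=30):
    """Split text into words, maximizing the summed score of the pieces."""
    n = len(text)
    best = [0] + [float("-inf")] * n  # best[i]: score of the best split of text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last piece
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            score = -1 if piece in dictionary else -2 * len(piece)
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    # walk back from the end to recover the pieces
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

With a dictionary containing bei, beim and arzt, the input beimarzt is split into beim and arzt, which is the case where the greedy prefix approach fails.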

Algorithm to find

The logic behind this was that (n-2)3^(n-3) has lots of repetitions, like (abc)***(abc) when abc is at the start and at the end, and the repeated strings total 3^4. It is similar as abc moves ahead and the number of sets of (abc) increases.
You can use dynamic programming to compute the number of forbidden strings.
The algorithm follows from the observation below:
"A legal string of size n is a legal string of size n - 1 extended with one letter, so that the last three letters of the resulting string are not all distinct."
So if we had all the legal strings of size n-1, we could try extending them to obtain the legal strings of size n.
To check whether the extended string is legal, we just need to know the last two letters of the previous string (of size n-1).
In the algorithm we will compute two arrays, where
different[i] # number of legal strings of length i in which last two letters are different
same[i] # number of legal strings of length i in which last two letters are the same
It can be easily proved that:
different[i+1] = different[i] + 2*same[i]
same[i+1] = different[i] + same[i]
It is the consequence of the following facts:
Any 'same' string of size i+1 can be obtained either from 'same' string of size i (think BB -> BBB) or from 'different' string (think AB -> ABB) and these are the only options.
Any 'different' string of size i+1 can be obtained either from 'different' string of size i (think AB-> ABA ) or from the 'same' string in two ways (AA -> AAB or AA -> AAC)
Having observed all this it is easy to write an algorithm that computes the result in O(n) time.
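The recurrences translate directly into a short loop. The sketch below assumes what the answer's examples suggest, since the question text is incomplete: a three-letter alphabet, with a string being forbidden when it contains three consecutive letters that are all distinct. The name count_forbidden is made up here.

```python
def count_forbidden(n):
    """Count length-n strings over 3 letters containing 3 consecutive distinct letters."""
    if n < 3:
        return 0
    # length-2 strings: 3 end in a doubled letter, 6 end in two different letters
    same, different = 3, 6
    for _ in range(n - 2):
        same, different = same + different, different + 2 * same
    legal = same + different
    return 3 ** n - legal
```

For n = 3 this gives 6, the six orderings of the three letters, and for n = 4 it gives 30.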
I suggest you use recursion, and look at two numbers:
F(n), the number of legal strings of length n whose last two symbols are the same.
G(n), the number of legal strings of length n whose last two symbols are different.
Is that enough to go on?
Get the ASCII values of the last three letters and add up the squares of these values. If the sum equals a certain value, then the three letters are all different and the string is forbidden. For the letters A, B and C this works.
To do this:
1) find out how to get characters from your string.
2) find out how to get ASCII value of a character.
3) Multiply these ASCII values with themselves.
4) Do that for the three letters each time and add their values.
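As a small Python illustration of these steps, assuming the alphabet is exactly A, B and C (the name tail_is_forbidden is made up here): for these three letters the sum of squares 65^2 + 66^2 + 67^2 = 13070 is reached only when the three letters are all different.

```python
# sum of squared ASCII codes of A, B and C
FORBIDDEN_SUM = ord("A") ** 2 + ord("B") ** 2 + ord("C") ** 2


def tail_is_forbidden(s):
    """True if the last three letters of s are A, B and C in some order."""
    return len(s) >= 3 and sum(ord(c) ** 2 for c in s[-3:]) == FORBIDDEN_SUM
```

The check only looks at the final three letters, so it has to be applied each time a letter is appended.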
