Turing Machine Element Distinctness Problem - algorithm

So the language is as follows:
E = {#x1#x2...#xi where alphabet is {0,1}* and no string can be a duplicate of another string }
I am trying to create the state diagram for this, but even before that I was coming up with the algorithm to solve it, but the issue I was encountering is whenever I compare the first two strings, I have to mark each character with an 'x' so how would I restore the first string? Like first I compare x1 and x2, by the time I'm done, in x2 and all characters in x1 would be marked with 'x', so when I move on to x3, x1 has nothing to compare.

Instead of marking considered symbols with an x, mark them with special symbols corresponding to the symbols being marked. So, instead of writing x for 0 and x for 1, write a for 0 and b for 1. In fact, go ahead and use symbols c and d also to replace values in "the earliest thing I need to check" so you can check all pairs. A high-level description of a Turing machine using this strategy is the following:
begin reading the first input, replacing 0 with c and 1 with d
go to the second input and if the second input is a match so far, write a for 0 and b for 1, then continue. If it's not a match, we know that these inputs don't match and we can begin comparing other pairs. Change the input you're checking to a and b only and reset the first input to 0 and 1 only.
repeat this process skipping over all a and b already there to check all pairs involving the first term.
once you've checked all pairs involving the first term, cross it out (using x maybe) and then repeat the whole process on the remaining input
This will check all pairs and work as expected. The key is, as you correctly surmised, being able to reconstruct parts of the input, meaning you need extra symbols in your tape alphabet. Never hesitate to introduce tape symbols - they're free and can never hurt.

Related

Could anyone explain the Hidden Sequence problem in CodeChef?

I'm trying to solve the Hidden Sequence problem on Code Chef, but I don't fully understand the explanation. I especially don't understand what's the use of Y in the triplets.
We know that there is a hidden sequence A1,A2,…,AN which contains only integers between 1 and K inclusive. We have acquired M triplets (X1,Y1,Z1),(X2,Y2,Z2),…,(XM,YM,ZM). A very reliable source has given us intel that for each valid i, the Yi-th occurrence of the integer Xi in the sequence A is AZi, i.e. AZi=Xi and there are Yi−1 indices j<Zi such that Aj=Xi
Find any sequence A consistent with this information or determine that no such sequence exists.
Could anyone explain it?
The hidden array may have duplicate values, like [1,2,3,1,2,3,1,1].
One of the triplets could be X=3, Y=2, Z=6.
This tells us that the second 3 is at position 6. So:
X = the value
Y = the occurrence of that value X (whether it is the first, second, third, ... occurrence)
Z = the position of the Y-th occurrence of that value X
This may be the only info you will get about the value 3. So you might not get a triplet saying X=3, Y=1, Z=3, which would tell you where the first 3 is positioned. Instead your algorithm must derive that from the other triplets.
The algorithm should somehow stay aware that in the first 5 positions there must be a 3. From other triplets it will know similar things. At a certain point this bulk of information will not allow that 3 to occur at just any of those 5 positions, as some positions will be needed for other values. This will narrow down the possibilities, until maybe only a few are left over, or maybe none.
Hope this explains what Y is about.

{ w | at every odd position of w is a 1}

The task is to construct a DFA for this language over the alphabet {0,1}.
I have constructed a DFA that consists of 4 states and that does not accept an empty word. However, in the answers they give a 3 state DFA that accepts it.
Why should my DFA accept an empty word if in the empty word there is no 1 at the odd position which means that it is not in the language?
The only requirement is that any symbol at an odd position must be 1. There is no requirement for a particular number of symbols, and specifically not that there be at least one.
Therefore, a DFA with an initial state where 0 leads to a rejection state and where 1 leads to a second state which accepts either symbol and returns to the start would be an acceptable answer, and would accept the empty string. This would be a three-state machine:
I think you are confused why should an empty string be a part of a mentioned set.
Let's take a look at another example. Consider you have a set of all possible strings having every character equal to 0. Such strings would be 0, 00, 000, 00000, etc. What about an empty string *? It actually pertain to this set as well. Empty string does not violate the definition of the set.
Compare this example with yours. You should check every odd position of the string and if you'll find anything other than 1 you should say that it is not an element of you set. It is not said anything about whether a string should have an odd position to be checked.

Is there an algorithm for choosing a few strings from a list so that the number of strings equal the number of different letters in them?

Edit2: I think the solution of David Eisenstat works but I will check it before I call the question solved.
Example list of strings:
1.) "a"
2.) "ab"
3.) "bc"
4.) "dc"
5.) "efa"
6.) "ef"
7.) "gh"
8.) "hi"
You can choose number 1.) there's 1 string and 1 letter in it: "a"
You can also choose 1.) and 2.) these are 2 strings with only two different letters in them "a" and "b"
other valid string combinations:
1.) 2.) 3.)
1.) 5.) 6.)
there's no valid combination with "h" (it would be ideal if cases like this could be proven however you can assume the program only needs to work when there's a valid answer)
There could be an extra condition that the strings you choose must include one specified letter, however simply finding all the possible combinations would solve the problem just as well. eg. specified letter "c" the only solution in this case would be: 1.) 2.) 3.)
[optional information] The purpose of this: I want to make a program which can choose from a big list of equations (probably around 100) which ones can be used to solve for a variable. Each equation gets one string, each letter in the string representing one unknown. The list of equations are all different eg. cannot be derived from each other, so you need as many equations as many unknowns there are in them. Solving for the unknowns will be done in a CAS, so you don't need to worry about it. However I believe the CAS (Maxima) might have a limit on how many equations it can solve simultaneously and it might be too slow if you give it too many unnecessary equations at a time.
As a start I would use an algorithm to reduce the number of strings just to make it faster. First all strings containing specified letter are in the reduced list, then all strings containing the letters from the strings in the reduced list are part of the reduced list until none is added. eg reduced list of "g" would be 7.) "gh" and 8.) "hi" This would only remove some unnecessary strings, but the task would remain the same with the rest.
I think this can be solved by taking away unnecessary strings from the reduced list until all the remaining are needed, however I don't know how to explicitly define which strings would be unnecessary (except for those mentioned in the previous paragraph).
If you work with the extra condition: This is an optimization task. I don't need a perfect solution, only an optimal solution. The program doesn't need to find the absolute minimum number of strings that give a solution. Having a few extra strings in the solution would probably only slow the computer down, but it would be acceptable.
Edit: Optional clarification about the meaning of the strings: Each letter in a string represent an unknown in an equation so the equation a=2 would be represented by "a" because that's the only unknown. The equation a+b=0 would be represented by "ab" and b^2-c=0 by "bc"
I'm not sure what to call this problem. It seems NP-hard, so I'm going to suggest an integer programming formulation, which can be attacked by an off-the-shelf solver.
Let x_i be a 0-1 variable indicating whether equation i is included in the output. Let y_j be a 0-1 variable indicating whether variable j is included in the output. We have constraints
for all equations i, for all variables j in equation i, y_j - x_i >= 0.
We need as many equations as variables in the output.
(sum over all equations i of x_i) - (sum over all variables j of y_j) = 0
As you point out, the empty set needs specifically to be disallowed. Let k be a variable that must appear in the output.
sum over all equations i containing variable k of x_i >= 1
Naturally, the objective is
minimize sum over all equations i of x_i.

Weighted unordered string edit distance

I need an efficient way of calculating the minimum edit distance between two unordered collections of symbols. Like in the Levenshtein distance, which only works for sequences, I require insertions, deletions, and substitutions with different per-symbol costs. I'm also interested in recovering the edit script.
Since what I'm trying to accomplish is very similar to calculating string edit distance, I figured it might be called unordered string edit distance or maybe just set edit distance. However, Google doesn't turn up anything with those search terms, so I'm interested to learn if the problem is known by another name?
To clarify, the problem would be solved by
def unordered_edit_distance(target, source):
return min(edit_distance(target, source_perm)
for source_perm in permuations(source))
So for instance, the unordered_edit_distance('abc', 'cba') would be 0, whereas edit_distance('abc', 'cba') is 2. Unfortunately, the number of permutations grows large very quickly and is not practical even for moderately sized inputs.
EDIT Make it clearer that operations are associated with different costs.
Sort them (not necessary), then remove items which are same (and in equal numbers!) in both sets.
Then if the sets are equal in size, you need that numer of substitutions; if one is greater, then you also need some insertions or deletions. Anyway you need the number of operations equal the size of the greater set remaining after the first phase.
Although your observation is kind of correct, but you are actually make a simple problem more complex.
Since source can be any permutation of the original source, you first need check the difference in character level.
Have two map each map count the number of individual characters in your target and source string:
for example:
a: 2
c: 1
d: 100
Now compare two map, if you missing any character of course you need to insert it, and if you have extra character you delete it. Thats it.
Let's ignore substitutions for a moment.
Now it becomes a fairly trivial problem of determining the elements only in the first set (which would count as deletions) and those only in the second set (which would count as insertions). This can easily be done by either:
Sorting the sets and iterating through both at the same time, or
Inserting each element from the first set into a hash table, then removing each element from the second set from the hash table, with each element not found being an insertion and each element remaining in the hash table after we're done being a deletion
Now, to include substitutions, all that remains is finding the optimal pairing of inserted elements to deleted elements. This is actually the stable marriage problem:
The stable marriage problem (SMP) is the problem of finding a stable matching between two sets of elements given a set of preferences for each element. A matching is a mapping from the elements of one set to the elements of the other set. A matching is stable whenever it is not the case that both:
Some given element A of the first matched set prefers some given element B of the second matched set over the element to which A is already matched, and
B also prefers A over the element to which B is already matched
Which can be solved with the Gale-Shapley algorithm:
The Gale–Shapley algorithm involves a number of "rounds" (or "iterations"). In the first round, first a) each unengaged man proposes to the woman he prefers most, and then b) each woman replies "maybe" to her suitor she most prefers and "no" to all other suitors. She is then provisionally "engaged" to the suitor she most prefers so far, and that suitor is likewise provisionally engaged to her. In each subsequent round, first a) each unengaged man proposes to the most-preferred woman to whom he has not yet proposed (regardless of whether the woman is already engaged), and then b) each woman replies "maybe" to her suitor she most prefers (whether her existing provisional partner or someone else) and rejects the rest (again, perhaps including her current provisional partner). The provisional nature of engagements preserves the right of an already-engaged woman to "trade up" (and, in the process, to "jilt" her until-then partner).
We just need to get the cost correct. To pair an insertion and deletion, making it a substitution, we'll lose both the cost of the insertion and the deletion, and gain the cost of the substitution, so the net cost of the pairing would be substitutionCost - insertionCost - deletionCost.
Now the above algorithm guarantees that all insertion or deletions gets paired - we don't necessarily want this, but there's an easy fix - just create a bunch of "stay-as-is" elements (on both the insertion and deletion side) - any insertion or deletion paired with a "stay-as-is" element would have a cost of 0 and would result in it remaining an insertion or deletion and nothing would happen for two "stay-as-is" elements ending up paired.
Key observation: you are only concerned with how many 'a's, 'b's, ..., 'z's or other alphabet characters are in your strings, since you can reorder all the characters in each string.
So, the problem boils down to the following: having s['a'] characters 'a', s['b'] characters 'b', ..., s['z'] characters 'z', transform them into t['a'] characters 'a', t['b'] characters 'b', ..., t['z'] characters 'z'. If your alphabet is short, s[] and t[] can be arrays; generally, they are mappings from the alphabet to integers, like map <char, int> in C++, dict in Python, etc.
Now, for each character c, you know s[c] and t[c]. If s[c] > t[c], you must remove s[c] - t[c] characters c from the first unordered string (s). If s[c] < t[c], you must add t[c] - s[c] characters c to the second unordered string (t).
Take X, the sum of s[c] - t[c] for all c such that s[c] > t[c], and you will get the number of characters you have to remove from s in total. Take Y, the sum of t[c] - s[c] for all c such that s[c] < t[c], and you will get the number of characters you have to remove from t in total.
Now, let Z = min (X, Y). We can have Z substitutions, and what's left is X - Z insertions and Y - Z deletions. Thus the total number of operations is Z + (X - Z) + (Y - Z), or X + Y - min (X, Y).

How to compute palindrome from a stream of characters in sub-linear space/time?

I don't even know if a solution exists or not. Here is the problem in detail. You are a program that is accepting an infinitely long stream of characters (for simplicity you can assume characters are either 1 or 0). At any point, I can stop the stream (let's say after N characters were passed through) and ask you if the string received so far is a palindrome or not. How can you do this using less sub-linear space and/or time.
Yes. The answer is about two-thirds of the way down http://rjlipton.wordpress.com/2011/01/12/stringology-the-real-string-theory/
EDIT: Some people have asked me to summarize the result, in case the link dies. The link gives some details about a proof of the following theorem: There is a multi-tape Turing machine that can recognize initial non-trivial palindromes in real-time. (A summary, also provided by the article linked: Suppose the machine has read x1, x2, ..., xk of the input. Then it has only constant time to decide if x1, x2, ..., xk is a palindrome.)
A multitape Turing machine is just one with several side-by-side tapes that it can read and write to; in a very specific sense it is exactly equivalent to a standard Turing machine.
A real-time computation is one in which a Turing machine must read a character from input at least once every M steps (for some bounded constant M). It is readily seen that any real-time algorithm should be linear-time, then.
There is a paper on the proof which is around 10 pages which is available behind an institutional paywall here which I will not repost elsewhere. You can contact the author for a more detailed explanation if you'd like; I just had read this recently and realized it was more or less what you were looking for.
You could use a rolling hash, or more rolling hashes for accuracy. Incrementally compute the hash of the characters read so far, in the order they were read, and in reverse order of reading.
If your hash function is x*3^(k-1)+x*3^(k-2)+...+x*3^0 for example, where x is a character you read, this is how you'd do it:
hLeftRight = 0
hRightLeft = 0
k = 0
repeat until there are numbers in the stream
x = stream.Get()
hLeftRight = 3*hLeftRight + x.Value
hRightLeft = hRightLeft + 3^k*x.Value
if (x.QueryPalindrome = true)
yield hLeftRight == hRightLeft
k = k + 1
Obviously you'd have to calculate the hashes modulo something, probably a prime or a power of two. And of course, this could lead to false positives.
Round 2
As I see it, with each new character, there are three cases:
Character breaks potential symmetry, for example, aab -> aabc
Character extends the middle, for example aab -> aabb
Character continues symmetry, for example aab->aaba
Assume you have a pointer that tracks down the string and points to the last character that continued a potential palindrome.
(I am going to use parenthesis to indicate a pointed at character)
Lets say you are starting with aa(b) and get an:
'a' (case 3), you move the pointer to
the left and check if it's an 'a' (it
is). You now have a(a)b.
'c' (case 1), you are not expecting a 'c', in this case you start back at the beginning and you now have aab(c).
The really tricky case is 2, because somehow you have to know that the character you just got isn't affecting symmetry, it is just extending the middle. For this, you have to hold an additional pointer that tracks where the plateau's (middle's) edge lies. For example, you have (b)baabb and you just got another 'b', in this case you have to know to reset the pointer to the base of the middle plateau here: bbaa(b)bb. Since we are going for constant time, you have to hold a pointer here to begin with (you can't afford the time to search for the plateau's edge). Now if you get another 'b', you know that you are still on the edge of that plateau and you keep the pointer where it is, so bbaa(b)bb -> bbaa(b)bbb. Now, if you get an 'a', you know that the 'b's are not part of the extended middle and you reset both pointers (The tracking pointer and the edge pointer) so you now have bbaabbbb((a)).
With these three cases, I think all bases are covered. If you ever want to check if the current string is a palindrome, check if the first pointer (not the plateau's edge pointer) is at index 0.
This might help you:
http://arxiv.org/pdf/1308.3466v1.pdf
If you store the last $k$ many input symbols you can easily find palindromes up to a length of $k$.
If you use the algorithms of the paper you can find the midpoints of palindromes and an length estimate of its length.

Resources