Recursive language - computation-theory

"if a language is recursive, then there exists a method by which the strings in language can be written in some sequence"
I am also told that "if a language can be enumerated in a lexicographic order by some Turing machine, then such a language is called recursive"
First: are the two statements different?
Second: should it be lexicographic order only?

Think back to the reason why lexicographical order is necessary for the definition of a recursive language:
A language is recursive if it can be decided. That is to say, for a given word W and a given language L, it is possible to determine whether or not W is a member of L in finitely many steps.
A language is recursively enumerable if it can be accepted. That is to say, for a given word W and a given language L, membership of W in L can be confirmed in finitely many steps, but non-membership cannot, in general, ever be confirmed.
So if a machine just enumerated the words of L in any order, you can check to see if your word W is in that list. If it is, you stop. If it isn't, you have to wait forever to see if your word is eventually output by the machine. The language is recursively enumerable.
If you knew the order, though, you could evaluate whether the machine should have output W by now. If the machine has output a word X, and according to the ordering you know the machine is using, W is before X, you know that the machine will not ever emit W, so you know that W is not a member of L.
Lexicographical order is one of many total orderings of words that satisfy the property that you can tell when your word W should have been output, so if you don't see it by then, you can stop.
Other orders:
https://en.wikipedia.org/wiki/Lexicographical_order#Colexicographic_order
https://en.wikipedia.org/wiki/Kleene%E2%80%93Brouwer_order
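For concreteness, here is a minimal sketch in Python of how a known total order turns an enumerator into a decider. The names are illustrative: enumerate_L is a hypothetical generator yielding the words of L in increasing shortlex (length, then lexicographic) order.

def decide_membership(w, enumerate_L):
    key = lambda s: (len(s), s)   # shortlex: compare by length, then lexicographically
    for x in enumerate_L():
        if x == w:
            return True           # the machine emitted W, so W is in L
        if key(x) > key(w):
            return False          # the enumeration has passed W, so W is not in L
    return False                  # enumeration halted (finite L) without emitting W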
So to answer your specific questions:
Are the two statements different?
Yes.
The first statement states "in some sequence", which does not specify that the sequence must be a total order over L's alphabet. Therefore the first statement is incorrect. The first statement defines a recursively enumerable language.
The second statement is correct, but is more restrictive than it needs to be. Lexicographical order is only one total order over an alphabet. Others can be used.
Should it be a lexicographical order only?
No.
As above, as long as the machine guarantees output in any total order over the alphabet, the language is recursive.

Is it possible to create an algorithm which generates an autogram?

An autogram is a sentence which describes the characters it contains, usually enumerating each letter of the alphabet, but possibly also the punctuation it contains. Here is the example given in the wiki page.
This sentence employs two a’s, two c’s, two d’s, twenty-eight e’s, five f’s, three g’s, eight h’s, eleven i’s, three l’s, two m’s, thirteen n’s, nine o’s, two p’s, five r’s, twenty-five s’s, twenty-three t’s, six v’s, ten w’s, two x’s, five y’s, and one z.
Coming up with one is hard, because you don't know how many letters it contains until you finish the sentence. Which is what prompts me to ask: is it possible to write an algorithm which could create an autogram? For example, a given parameter would be the start of the sentence as an input e.g. "This sentence employs", and assuming that it uses the same format as the above "x a's, ... y z's".
I'm not asking for you to actually write an algorithm, although by all means I'd love to see if you know one to exist or want to try and write one; rather I'm curious as to whether the problem is computable in the first place.
You are asking two different questions.
"is it possible to write an algorithm which could create an autogram?"
There are algorithms to find autograms. As far as I know, they use randomization, which means that such an algorithm might find a solution for a given start text, but if it doesn't find one, then this doesn't mean that there isn't one. This takes us to the second question.
"I'm curious as to whether the problem is computable in the first place."
Computable would mean that there is an algorithm which for a given start text either outputs a solution, or states that there isn't one. The above-mentioned algorithms can't do that, and an exhaustive search is not workable. Therefore I'd say that this problem is not computable. However, this is rather of academic interest. In practice, the randomized algorithms work well enough.
Let's assume for the moment that all counts are less than or equal to some maximum M, with M < 100. As mentioned in the OP's link, this means that we only need to decide counts for the 16 letters that appear in these number words, as counts for the other 10 letters are already determined by the specified prefix text and can't change.
One property that I think is worth exploiting is the fact that, if we take some (possibly incorrect) solution and rearrange the number-words in it, then the total letter counts don't change. IOW, if we ignore the letters spent "naming themselves" (e.g. the c in two c's) then the total letter counts only depend on the multiset of number-words that are actually present in the sentence. What that means is that instead of having to consider all possible ways of assigning one of M number-words to each of the 16 letters, we can enumerate just the (much smaller) set of all multisets of number-words of size 16 or less, having elements taken from the ground set of number-words of size M, and for each multiset, look to see whether we can fit the 16 letters to its elements in a way that uses each multiset element exactly once.
Note that a multiset of numbers can be uniquely represented as a nondecreasing list of numbers, and this makes them easy to enumerate.
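As an aside, here is a minimal sketch of such an enumeration in Python (names illustrative): every multiset with elements drawn from {1, ..., max_value} and at most max_size elements is produced exactly once, as a nondecreasing tuple.

def multisets(max_value, max_size, smallest=1, prefix=()):
    # yield each multiset exactly once, represented as a nondecreasing tuple
    yield prefix
    if len(prefix) < max_size:
        for v in range(smallest, max_value + 1):
            yield from multisets(max_value, max_size, v, prefix + (v,))

For example, list(multisets(3, 2)) produces (), (1,), (1, 1), (1, 2), (1, 3), (2,), (2, 2), (2, 3), (3,), (3, 3).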
What does it mean for a letter to "fit" a multiset? Suppose we have a multiset W of number-words; this determines total letter counts for each of the 16 letters (for each letter, just sum the counts of that letter across all the number-words in W; also add a count of 1 for the letter "S" for each number-word besides "one", to account for the pluralisation). Call these letter counts f["A"] for the frequency of "A", etc. Pretend we have a function etoi() that operates like C's atoi(), but returns the numeric value of a number-word. (This is just conceptual; of course in practice we would always generate the number-word from the integer value (which we would keep around), and never the other way around.) Then a letter x fits a particular number-word w in W if and only if f[x] + 1 = etoi(w), since writing the letter x itself into the sentence will increase its frequency by 1, thereby making the two sides of the equation equal.
This does not yet address the fact that if more than one letter fits a number-word, only one of them can be assigned it. But it turns out that it is easy to determine whether a given multiset W of number-words, represented as a nondecreasing list of integers, simultaneously fits any set of letters:
Calculate the total letter frequencies f[] that W implies.
Sort these frequencies.
Skip past any zero-frequency letters. Suppose there were k of these.
For each remaining letter, check whether its frequency is equal to one less than the numeric value of the number-word in the corresponding position. I.e. check that f[k] + 1 == etoi(W[0]), f[k+1] + 1 == etoi(W[1]), etc.
If and only if all these frequencies agree, we have a winner!
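Here is a sketch of that check in Python. The helpers are assumptions, not given above: letter_counts(n) returns a Counter of the letters in the English number-word for n, LETTERS is the set of 16 letters that occur in number-words, and W is the multiset as a nondecreasing list of integers.

from collections import Counter

def fits(W, letter_counts, LETTERS):
    f = Counter({c: 0 for c in LETTERS})
    for n in W:
        f.update(letter_counts(n))         # letters spent writing the number-word
        if n != 1:
            f["s"] += 1                    # plural "s" for every word except "one"
    freqs = sorted(f[c] for c in LETTERS)  # sort the letter frequencies
    k = sum(1 for v in freqs if v == 0)    # skip past the zero-frequency letters
    rest = freqs[k:]
    # each remaining letter must pair with exactly one multiset element
    return len(rest) == len(W) and all(v + 1 == n for v, n in zip(rest, W))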
The above approach is naive in that it assumes that we choose words to put in the multiset from a size M ground set. For M > 20 there is a lot of structure in this set that can be exploited, at the cost of slightly complicating the algorithm. In particular, instead of enumerating straight multisets of this ground set of all allowed numbers, it would be much better to enumerate multisets of {"one", "two", ..., "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"}, and then allow the "fit detection" step to combine the number-words for multiples of 10 with the single-digit number-words.

Anagram generation - Isn't it kind of subset sum?

Anagram:
An anagram is a type of word play, the result of rearranging the
letters of a word or phrase to produce a new word or phrase, using
all the original letters exactly once;
Subset Sum problem:
The problem is this: given a set of integers, is there a non-empty
subset whose sum is zero?
For example, given the set { −7, −3, −2, 5, 8}, the answer is yes
because the subset { −3, −2, 5} sums to zero. The problem is
NP-complete.
Now say we have a dictionary of n words. The Anagram Generation problem can then be stated as: find a set of words in the dictionary (of n words) which use up all letters of the input. So doesn't it become a kind of subset sum problem?
Am I wrong?
The two problems are similar but are not isomorphic.
In an anagram the order of the letters matters. In a subset sum, the order does not matter.
In an anagram, all the letters must be used. In a subset sum, any subset will do.
In an anagram, the subgroups must form words taken from a comparatively small dictionary of allowable words. In a subset sum, the groups are unrestricted (no dictionary of allowable groupings).
If you'd prove that solving anagram finding (not more than polynomial number of times) solves subset sum problem - it would be a revolution in computer science (you'd prove P=NP).
Clearly, finding anagrams is a polynomial-time problem:
Checking whether two records are anagrams of each other is as simple as sorting the letters and comparing the resulting strings (that is C*s*log(s) time, where s is the number of letters in a record). You'll have at most n such checks, where n is the number of records in the dictionary. So the running time ~ C*s*log(s)*n is obviously bounded by a polynomial of the input size - your input record and dictionary combined.
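In Python, that check is essentially a one-liner (a sketch):

def are_anagrams(a, b):
    # sort the letters and compare: O(s log s) per check
    return sorted(a) == sorted(b)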
EDIT:
All the above is valid only if the anagram finding problem is defined as finding anagram of the input phrase in a dictionary of possible complete phrases.
While the wording of the anagram finding problem in the original question above...
Now say we have a dictionary of n words. The Anagram Generation problem can then be stated as: find a set of words in the dictionary (of n words) which use up all letters of the input.
...seems to imply something different - e.g. a possibility that some sort of composition of more than one entry in a dictionary is also a valid choice for a possible anagram of the input.
This however seems immediately problematic and unclear, because (1) usually a phrase is not just a sequence of random words (it should make sense as a whole phrase), and (2) words in a phrase usually require separators, which are also symbols - so it is not clear whether separators (whitespace characters) are required in the input to allow separate dictionary entries, and whether separators are allowed within a single dictionary entry.
So in my initial answer above I applied a "semantic razor" by interpreting the problem definition the only way it is unambiguous and makes sense as an "anagram finding".
But also we might interpret the authors definition like this:
Given a dictionary of n letter sequences (separate dictionary entries may contain the same sequence) and one target letter sequence - find any subset of the dictionary entries that, concatenated together, would be an exact rearrangement of the target letter sequence, OR determine that no such subset exists.
Even though this problem no longer really makes perfect sense as an "anagram finding problem", it is still interesting. It is a very different problem from the one I considered above.
One more thing remains unclear - the alphabet flexibility. To be specific, the problem definition must also specify whether the set of letters is fixed, OR whether it may be redefined for each new instance of the problem when specifying the dictionary and the target sequence. That's important - the capabilities and the complexity depend on it.
The variant of this problem with the ability to define the alphabet (the available set of letters) for each instance individually is actually equivalent to the subset sum problem. That makes it NP-complete.
I can prove the equivalence of our problem to a natural-number variant of the subset sum problem, defined as:
Given a collection (multiset) of natural numbers (repeated numbers allowed) and a target natural number - find any sub-collection that sums exactly to the target number, OR determine that no such sub-collection exists.
It is not hard to see that a roughly linear number of steps is enough to translate one problem's input to the other's and vice versa, so a solution of one problem translates to exactly one solution of the other, plus roughly linear overhead.
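One direction of that translation can be sketched with a unary encoding (an illustration, not necessarily the construction the author has in mind): encode each number n as n copies of a single letter, so a sub-collection concatenates to a rearrangement of the target string exactly when it sums to the target.

def subset_sum_to_anagram(numbers, target):
    # the dictionary entry for n is the unary string "a" * n; a chosen set of
    # entries concatenates to an anagram of "a" * target iff it sums to target
    dictionary = ["a" * n for n in numbers]
    return dictionary, "a" * target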
This positive-only variant of subset sum is equivalent to the zero-sum variant given by the author (see e.g. the Subset Sum Wikipedia article).
I think you are wrong.
Anagram Generation must be simpler than Subset Sum, because I can devise a trivial O(n) algorithm to solve it (as defined):
def find_anagrams(input_word, dictionary):
    anagrams = []                      # initialize the list of anagrams to an empty list
    target = sorted(input_word)
    for word in dictionary:            # iterate the dictionary word by word
        if sorted(word) == target:     # all the input letters are used in this word
            anagrams.append(word)
    return anagrams                    # return the list of anagrams
Also, anagrams consist of valid words that are permutations of the input word (i.e. rearrangements), whereas subsets have no concept of order. Subsets may actually contain fewer elements than the input set (hence sub-set), but an anagram must always be the same length as the input word.
It isn't NP-Complete because given a single set of letters, the set of anagrams remains identical regardless.
There is always a single mapping that transforms the letters of the input L to a set of anagrams A, so we can say that f(L) = A for any execution of f. I believe, if I understand correctly, that this makes the function deterministic. The order of a set is irrelevant, so considering a differently ordered solution non-deterministic is invalid; it is also invalid because all entries in a dictionary are unique, and thus can be deterministically ordered.

How to compute palindrome from a stream of characters in sub-linear space/time?

I don't even know if a solution exists or not. Here is the problem in detail. You are a program that is accepting an infinitely long stream of characters (for simplicity you can assume characters are either 1 or 0). At any point, I can stop the stream (let's say after N characters have passed through) and ask you if the string received so far is a palindrome or not. How can you do this using sub-linear space and/or time?
Yes. The answer is about two-thirds of the way down http://rjlipton.wordpress.com/2011/01/12/stringology-the-real-string-theory/
EDIT: Some people have asked me to summarize the result, in case the link dies. The link gives some details about a proof of the following theorem: There is a multi-tape Turing machine that can recognize initial non-trivial palindromes in real-time. (A summary, also provided by the article linked: Suppose the machine has read x1, x2, ..., xk of the input. Then it has only constant time to decide if x1, x2, ..., xk is a palindrome.)
A multitape Turing machine is just one with several side-by-side tapes that it can read and write to; in a very specific sense it is exactly equivalent to a standard Turing machine.
A real-time computation is one in which a Turing machine must read a character from input at least once every M steps (for some bounded constant M). It is readily seen, then, that any real-time algorithm is also a linear-time algorithm.
There is a paper on the proof, around 10 pages, which is available behind an institutional paywall here and which I will not repost elsewhere. You can contact the author for a more detailed explanation if you'd like; I had just read it recently and realized it was more or less what you were looking for.
You could use a rolling hash, or more rolling hashes for accuracy. Incrementally compute the hash of the characters read so far, in the order they were read, and in reverse order of reading.
If your hash function is, for example, x1*3^(k-1) + x2*3^(k-2) + ... + xk*3^0, where x1, ..., xk are the characters read so far, this is how you'd do it:
hLeftRight = 0
hRightLeft = 0
k = 0
repeat while there are characters in the stream
    x = stream.Get()
    hLeftRight = 3*hLeftRight + x.Value
    hRightLeft = hRightLeft + 3^k * x.Value
    if (x.QueryPalindrome == true)
        yield hLeftRight == hRightLeft
    k = k + 1
Obviously you'd have to calculate the hashes modulo something, probably a prime or a power of two. And of course, this could lead to false positives.
Round 2
As I see it, with each new character, there are three cases:
Character breaks potential symmetry, for example, aab -> aabc
Character extends the middle, for example aab -> aabb
Character continues symmetry, for example aab->aaba
Assume you have a pointer that tracks down the string and points to the last character that continued a potential palindrome.
(I am going to use parentheses to indicate the pointed-at character.)
Let's say you are starting with aa(b) and get:
an 'a' (case 3): you move the pointer to the left and check if it's an 'a' (it is). You now have a(a)b.
a 'c' (case 1): you are not expecting a 'c'; in this case you start back at the beginning, and you now have aab(c).
The really tricky case is 2, because somehow you have to know that the character you just got isn't affecting symmetry, it is just extending the middle. For this, you have to hold an additional pointer that tracks where the plateau's (middle's) edge lies. For example, you have (b)baabb and you just got another 'b', in this case you have to know to reset the pointer to the base of the middle plateau here: bbaa(b)bb. Since we are going for constant time, you have to hold a pointer here to begin with (you can't afford the time to search for the plateau's edge). Now if you get another 'b', you know that you are still on the edge of that plateau and you keep the pointer where it is, so bbaa(b)bb -> bbaa(b)bbb. Now, if you get an 'a', you know that the 'b's are not part of the extended middle and you reset both pointers (The tracking pointer and the edge pointer) so you now have bbaabbbb((a)).
With these three cases, I think all bases are covered. If you ever want to check if the current string is a palindrome, check if the first pointer (not the plateau's edge pointer) is at index 0.
This might help you:
http://arxiv.org/pdf/1308.3466v1.pdf
If you store the last k input symbols, you can easily find palindromes up to a length of k.
If you use the algorithms of the paper, you can find the midpoints of palindromes and an estimate of their lengths.
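A minimal sketch of the first observation in Python: as long as the input read so far still fits in a window of the last k symbols, the palindrome question can be answered exactly.

from collections import deque

def palindrome_prefixes(stream, k):
    window = deque(maxlen=k)              # the last k input symbols
    for n, ch in enumerate(stream, start=1):
        window.append(ch)
        if n <= k:                        # the whole input still fits the window
            w = list(window)
            yield w == w[::-1]            # exact answer for the prefix of length n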

Tokenize valid words from a long string

Suppose you have a dictionary that contains valid words.
Given an input string with all spaces removed, determine whether the string is composed of valid words or not.
You can assume the dictionary is a hashtable that provides O(1) lookup.
Some examples:
helloworld-> hello world (valid)
isitniceinhere-> is it nice in here (valid)
zxyy-> invalid
If a string has multiple possible parsings, just returning true is sufficient.
The string can be very long, so think of an algorithm that is both space- and time-efficient.
I think the set of all strings that occur as concatenations of valid words (words taken from a finite dictionary) forms a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; the computation time is O(n).
For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input sign, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the hint for construction above.
For another example, a similar automaton can be constructed for the words {bag, bat, bun, but}.
Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string.
More formally, our algorithm is as follows:
Let S be a set of states, initially containing the starting state.
Read the next input character, let us denote it by a.
For each element s in S, determine the state that we move into from s on reading a; that is, the state r such that with the notation above (s,r,a) is an edge. Let us denote the set of these states by R. That is, R = {r | s in S, (s,r,a) is an edge}.
(If R is empty, the string is not accepted and the algorithm halts.)
If there are no more input symbols, check whether any of the accepting states is in R. (In our case, there is only one accepting state, the starting state.) If so, the string is accepted, if not, the string is not accepted.
Otherwise, take S := R and go to 2.
Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the size of S and R is not greater than the number of states in the automaton, which is constant and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose multiple 'parsings', but that was not a requirement either.)
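Here is a sketch of this state-set simulation in Python, with the automaton realized as a trie over the dictionary words; node 0 plays the role of both the starting and the accepting state, as in the example above (names are illustrative):

def accepts(text, words):
    # build a trie over the dictionary; children[u] maps a character to a state
    children = [{}]
    ends_word = [False]
    for w in words:
        u = 0
        for ch in w:
            if ch not in children[u]:
                children[u][ch] = len(children)
                children.append({})
                ends_word.append(False)
            u = children[u][ch]
        ends_word[u] = True
    S = {0}                                                     # step 1
    for ch in text:                                             # step 2
        R = {children[u][ch] for u in S if ch in children[u]}   # step 3
        if not R:
            return False                                        # step 4
        if any(ends_word[u] for u in R):
            R.add(0)                       # a word just ended: a new one may start
        S = R                                                   # step 6
    return 0 in S                                               # step 5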
I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.
I'd go for a recursive algorithm with implicit backtracking. Function signature: f: input -> result, with input being the string and result either true or false, depending on whether the entire string can be tokenized correctly.
Works like this:
If input is the empty string, return true.
Look at the length-one prefix of input (i.e., the first character). If it is in the dictionary, run f on the suffix of input. If that returns true, return true as well.
If the length-one prefix from the previous step is not in the dictionary, or the invocation of f in the previous step returned false, make the prefix longer by one and repeat at step 2. If the prefix cannot be made any longer (already at the end of the string), return false.
Rinse and repeat.
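A direct transcription of these steps into Python (a sketch; dictionary is assumed to be a set or another structure with O(1) lookup):

def f(s, dictionary):
    if s == "":                     # step 1: the empty string tokenizes trivially
        return True
    for i in range(1, len(s) + 1):  # steps 2-3: grow the prefix one character at a time
        if s[:i] in dictionary and f(s[i:], dictionary):
            return True
    return False                    # no prefix worked: backtrack / reject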
For dictionaries with low to moderate amount of ambiguous prefixes, this should fetch a pretty good running time in practice (O(n) in the average case, I'd say), though in theory, pathological cases with O(2^n) complexity can probably be constructed. However, I doubt we can do any better since we need backtracking anyways, so the "instinctive" O(n) approach using a conventional pre-computed lexer is out of the question. ...I think.
EDIT: the estimate for the average-case complexity is likely incorrect, see my comment.
Space complexity would be only stack space, so O(n) even in the worst-case.

The perverse hangman problem

Perverse Hangman is a game played much like regular Hangman with one important difference: The winning word is determined dynamically by the house depending on what letters have been guessed.
For example, say you have the board _ A I L and 12 remaining guesses. Because there are 13 different words ending in AIL (bail, fail, hail, jail, kail, mail, nail, pail, rail, sail, tail, vail, wail) the house is guaranteed to win because no matter what 12 letters you guess, the house will claim the chosen word was the one you didn't guess. However, if the board was _ I L M, you have cornered the house as FILM is the only word that ends in ILM.
The challenge is: Given a dictionary, a word length & the number of allowed guesses, come up with an algorithm that either:
a) proves that the player always wins by outputting a decision tree for the player that corners the house no matter what
b) proves the house always wins by outputting a decision tree for the house that allows the house to escape no matter what.
As a toy example, consider the dictionary:
bat
bar
car
If you are allowed 3 wrong guesses, the player wins with the following tree:
Guess B
  NO -> Guess C, Guess A, Guess R, WIN
  YES-> Guess T
    NO -> Guess A, Guess R, WIN
    YES-> Guess A, WIN
This is almost identical to the "how do I find the odd coin by repeated weighings?" problem. The fundamental insight is that you are trying to maximise the amount of information you gain from your guess.
The greedy algorithm to build the decision tree is as follows:
- for each guess, choose the guess for which the split between "true" answers and "false" answers is as close to 50-50 as possible, as information-theoretically this gives the most information.
Let N be the size of the set, A be the size of the alphabet, and L be the number of letters in the word.
So put all your words in a set. For each letter position, and for each letter in your alphabet count how many words have that letter in that position (this can be optimised with an additional hash table). Choose the count which is closest in size to half the set. This is O(L*A).
Divide the set in two by taking the subset which has this letter in this position, and make the two parts the two branches of the tree. Repeat for each subset until you have the whole tree. In the worst case this will require O(N) steps, but if you have a nice dictionary it will lead to O(log N) steps.
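A sketch of the greedy selection step in Python, following the description above (for each position and letter, count the words with that letter there, then pick the count closest to half the set); the names are illustrative, and the words are assumed to share one length:

def best_split(words):
    half = len(words) / 2
    best = None                                   # (distance, position, letter)
    for pos in range(len(words[0])):              # O(L) positions
        counts = {}
        for w in words:                           # tally letters at this position
            counts[w[pos]] = counts.get(w[pos], 0) + 1
        for letter, c in counts.items():          # O(A) letters per position
            if best is None or abs(c - half) < best[0]:
                best = (abs(c - half), pos, letter)
    _, pos, letter = best
    yes = [w for w in words if w[pos] == letter]  # one branch of the tree
    no = [w for w in words if w[pos] != letter]   # the other branch
    return pos, letter, yes, no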
This isn't strictly an answer, since it doesn't give you a decision tree, but I did something very similar when writing my hangman solver. Basically, it looks at the set of words in its dictionary that match the pattern and picks the most common letter. If it guesses wrong, it eliminates the largest number of candidates. Since there's no penalty to guessing right in hangman, I think this is the optimal strategy given the constraints.
So with the dictionary you gave, it would first guess a correctly. Then it would guess r, also correctly, then b (incorrect), then c.
The problem with perverse hangman is that you always guess wrong if you can guess wrong, but that's perfect for this algorithm since it eliminates the largest set first. As a slightly more meaningful example:
Dictionary:
mar
bar
car
fir
wit
In this case it guesses r incorrectly first and is left with just wit. If wit were replaced in the dictionary with sir, then it would guess r correctly then a incorrectly, eliminating the larger set, then w or f at random incorrectly, followed by the other for the final word with only 1 incorrect guess.
So this algorithm will win if it's possible to win, though you have to actually run through it to see if it does win.
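The letter-picking heuristic described above can be sketched in a few lines of Python (names illustrative):

from collections import Counter

def next_guess(candidates, guessed):
    # count each unguessed letter once per candidate word, then pick the most
    # common one: if the guess is wrong, it eliminates the largest set of words
    counts = Counter(ch for w in candidates for ch in set(w) if ch not in guessed)
    return counts.most_common(1)[0][0]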
