Recursively enumerable (computably enumerable) languages closed under permutation? - computability

Let L be any language. The language perms(L) is the language of all permutations of words from L.
True or False: If L is recursively enumerable (computably enumerable), then perms(L) is also recursively enumerable.
This was on a previous final along with the question: if L is decidable then so is perms(L), which I found to be true.
I suppose I would say false, but I have no proof to back this claim.

Think about what "recursively enumerable" means. It means that you can define a TM that will write each string in the language down. If you give the TM enough time, it will eventually write any given string down.
For any given string in the language, it has a finite number of permutations. A Turing machine could certainly write down all permutations of a string, given the string.
Imagine putting these two Turing machines together: the first to enumerate all strings in your language, and the second one which emits all permutations of each of these strings. The result is an enumeration of all permutations of strings in the first language.
The combination of Turing Machines described above results in a new Turing Machine. We thus have a Turing Machine enumerating all strings in the desired language. By definition, this language is recursively enumerable.
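The composition described above can be sketched with Python generators standing in for the two machines. This is only an illustration: `example_language` is a hypothetical enumerator for the made-up language { a^n b : n >= 0 }, and any other generator of words could be passed in its place.

```python
from itertools import islice, permutations
from typing import Callable, Iterator


def enumerate_perms(enumerate_l: Callable[[], Iterator[str]]) -> Iterator[str]:
    """Yield every permutation of every word produced by enumerate_l."""
    for word in enumerate_l():  # first machine: enumerates L
        # Second machine: a word has finitely many permutations, so this
        # inner loop always finishes before the next word is requested.
        for perm in sorted(set(permutations(word))):
            yield "".join(perm)


def example_language() -> Iterator[str]:
    """Hypothetical enumerator for L = { a^n b : n >= 0 } (an assumption)."""
    n = 0
    while True:
        yield "a" * n + "b"
        n += 1


# The enumeration never ends, so only look at a finite prefix of it.
first = list(islice(enumerate_perms(example_language), 6))
print(first)
```

Any fixed permutation of any word of L shows up after finitely many steps, which is all that recursive enumerability asks for. Repeats in the output are harmless.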

Related

Complement of non-deterministic context-free language

The complement of a context-free language is not always context-free. It is not allowed to just swap the final and non-final states of an NPDA and assume that it produces the complement of the language. Could someone give an example where it goes wrong?
And why does the above described procedure work for regular languages given a DFA? Maybe because DFA and NFA are equivalent and DPDA and NPDA are not?
Well, swapping the final vs non-final states of an NFA doesn't even guarantee you'll get the complement of the language. Consider this rather curious NFA:
----->q0--a-->[q1]
      |
      a
      |
      V
      q2
This NFA accepts the language {a}. Swapping the final and non-final states, the accepted language becomes {e, a}. These languages are not complementary since they have an intersection.
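To check this claim mechanically, here is a small sketch that simulates the NFA above by tracking the set of reachable states. The function name `accepts` and the dictionary encoding of the transitions are my own choices, not part of the question.

```python
def accepts(delta, start, finals, word):
    """Simulate an NFA without epsilon moves by tracking all reachable states."""
    current = {start}
    for symbol in word:
        current = {nxt for state in current
                   for nxt in delta.get((state, symbol), ())}
    return bool(current & finals)


states = {"q0", "q1", "q2"}
delta = {("q0", "a"): {"q1", "q2"}}  # q0 goes to both q1 and q2 on 'a'
finals = {"q1"}
swapped = states - finals            # toggle the finality of every state

words = ["", "a", "aa"]
original_lang = [w for w in words if accepts(delta, "q0", finals, w)]
swapped_lang = [w for w in words if accepts(delta, "q0", swapped, w)]
print(original_lang, swapped_lang)
```

The word "a" is accepted both before and after the swap, so the two languages cannot be complements of each other.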
In exactly the same way, swapping the states of an NPDA is not guaranteed to work either. The difference, as you point out, is that for any NFA, there is some equivalent DFA (indeed, there are lots), and toggling the finality of states will work for those, so the regular languages are guaranteed to be closed under complementation.
For NPDAs, though, we do not necessarily have equivalent DPDAs (where swapping finality would work fine). Thus, it is possible that the complement of some languages accepted only by NPDAs is not context-free.
Indeed, the context-free language {a^i b^j c^k | i != j or j != k} is accepted only by NPDAs, and its complement {strings not of the form a^i b^j c^k, or strings of that form with i = j = k} is not context-free.
A grammar is nondeterministic if it does not specify a unique move for at least one element of sigma.
From a given state and a given input we cannot determine which state we will reach next, so a grammar that gives rise to this kind of situation is called a nondeterministic grammar.
For a particular input the computer will give different output on different execution.
Can’t solve the problem in polynomial time.
Cannot determine the next step of execution due to more than one path the algorithm can take.

Pairwise independent hash functions for strings?

Many randomized algorithms and data structures (such as the Count-Min Sketch) require hash functions with the pairwise independence property. Intuitively, this means that the probability of a hash collision with a specific element is small, even if the output of the hash function for that element is known.
I have found many descriptions of pairwise independent hash functions for fixed-length bitvectors based on random linear functions. However, I have not yet seen any examples of pairwise independent hash functions for strings.
Are there any families of pairwise independent hash functions for strings?
I'm pretty sure they exist, but there's a bit of measure-theoretic subtlety to your question. You might be better off asking on mathoverflow. I'm very rusty with this stuff, but I think I can show that, even if they do exist, you don't actually want one.
To begin with, you need a probability measure on the strings, and any such measure will necessarily look very different from any notion of "uniform." (It's a countable set and all the sigma-algebras over countable sets just clump together sets of elements and assign a probability to each of those sets. You'll want all of the clumps to be singletons.)
Now, if you only give finitely many strings positive probability, you're back in the finite case. So let's ignore that for now and assume that, for any epsilon > 0, you can find a string whose probability is strictly between 0 and epsilon.
Suppose we restrict to the case where the hash functions map strings to {0,1}.
Your family of hash functions will need to be infinite as well and you'll want to talk about it as a probability space of hash functions. If you have a set H of hash functions that has positive probability, then every string is mapped to both 0 and 1 by (different) elements of H. In particular, no single element of H has positive probability. So H has to be uncountable and you've suddenly run into difficult representability issues.
I'd be very happy if someone who hasn't forgotten measure theory would chime in here.
Not with a seed of bounded length and an output of nonzero bounded length.
A fairly crude argument to this effect is, for a finite family of hash functions H, consider a map f from an element x to a tuple giving h(x) for every h in H. Since the codomains of each h and thus f are finite, there exist two strings mapped the same way by all h in H, which, given that there are at least two possible hash values, contradicts pairwise independence.
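The pigeonhole step of this argument can be made concrete. The sketch below builds a small finite family of one-bit hash functions (the SHA-256 based construction and the names `make_family`, `strings` and `find_indistinguishable_pair` are assumptions for illustration only) and then searches for two distinct strings that every function in the family maps to the same value.

```python
import hashlib
from itertools import count, product


def make_family(num_functions):
    """A finite family of hash functions from strings to {0, 1} (illustrative)."""
    def make(seed):
        def h(s):
            digest = hashlib.sha256(f"{seed}:{s}".encode()).digest()
            return digest[0] & 1
        return h
    return [make(seed) for seed in range(num_functions)]


def strings():
    """Enumerate all strings over {a, b} in order of increasing length."""
    for length in count(0):
        for letters in product("ab", repeat=length):
            yield "".join(letters)


def find_indistinguishable_pair(family):
    """Return two distinct strings on which every function in the family agrees."""
    seen = {}
    for s in strings():
        signature = tuple(h(s) for h in family)  # the map f from the argument
        if signature in seen:
            return seen[signature], s
        seen[signature] = s


family = make_family(8)
x, y = find_indistinguishable_pair(family)
print(repr(x), repr(y))
```

With 8 functions there are at most 2^8 signatures, so the search must stop within the first 257 strings. The pair it returns collides under every member of the family, which is exactly what pairwise independence forbids.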

Quickly checking if set is superset of stored sets

The problem
I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.
Building the datastructure can take as long as necessary, but the lookup should be as efficient as possible, and the datastructure can't take too much space.
Some context
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)
What I've come up with so far
For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.
I will try out both this new idea and the BDD method, and see which of the 2 work out best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
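The prefix-trie lookup from the edit above can be sketched as follows. This is a minimal illustration using nested dictionaries, not a tuned implementation; `build_trie` and `has_subset` are names I made up, and the traversal uses an explicit stack because C = 1000 levels of recursion would hit Python's default recursion limit.

```python
def build_trie(stored):
    """Binary prefix trie: level i branches on bit i of a stored array."""
    root = {}
    for bits in stored:
        node = root
        for bit in bits:
            node = node.setdefault(int(bit), {})
    return root


def has_subset(root, query):
    """True if some stored array B satisfies B[i] <= query[i] for every i."""
    stack = [(root, 0)]
    while stack:
        node, i = stack.pop()
        if i == len(query):
            return True  # walked a full stored array without a conflict
        if 0 in node:
            # A stored 0 is compatible with either query bit.
            stack.append((node[0], i + 1))
        if query[i] and 1 in node:
            # A stored 1 requires the query bit to be 1 as well.
            stack.append((node[1], i + 1))
    return False


trie = build_trie([(1, 0, 1, 0), (0, 1, 1, 0)])
print(has_subset(trie, (1, 0, 1, 1)))  # covers the first stored array
print(has_subset(trie, (1, 1, 0, 1)))  # bit 2 is missing for both
```

The sketch assumes all arrays have the same nonzero length. It says nothing about the conjectured O(sqrt(N)C) bound, which would still need a proof or measurements.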
Just to add some background information to the prefix trie solution, recently I found the following paper:
I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.
For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.
I think prefix trie is a great start.
Since your arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, then both B1 and B2 are included in A. So the idea is to OR-pack the arrays in pairs, and to repeat this until only one "root" array remains (this takes only about twice as much space). This lets you answer 'Yes' to your question earlier, which is mainly useful if you don't need to know which array is actually contained.
Independently, you can apply for each array a hash function preserving ordering.
Ie : B ⊂ A ⇒ h(B) ≺ h(A)
ORing bits together is such a function, but you can also count each 1-bit in adequate partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).
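One possible order-preserving signature of the kind described here is the tuple of 1-bit counts per block. The sketch below assumes an arbitrary block size, and the names `block_counts` and `may_be_subset` are illustrative. The filter can only rule candidates out: a "maybe" still needs the full bitwise check.

```python
def block_counts(bits, block_size):
    """Signature: number of 1-bits in each consecutive block of the array."""
    return tuple(sum(bits[i:i + block_size])
                 for i in range(0, len(bits), block_size))


def may_be_subset(sig_b, sig_a):
    """If B is a subset of A, every block count of B is <= that of A."""
    return all(b <= a for b, a in zip(sig_b, sig_a))


A = (1, 1, 0, 0, 1, 1)
B = (1, 0, 0, 0, 1, 1)   # really a subset of A
C = (0, 0, 1, 1, 1, 1)   # not a subset: second block has too many ones
D = (0, 0, 1, 0, 0, 0)   # not a subset, but the counts cannot tell

sig_a = block_counts(A, 3)
print(may_be_subset(block_counts(B, 3), sig_a))
print(may_be_subset(block_counts(C, 3), sig_a))
print(may_be_subset(block_counts(D, 3), sig_a))
```

D shows the false positive case: its counts fit under those of A although D is not contained in A.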
You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
From there I would use some kind of ID3 or C4.5 algorithm.
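The reduction to "minimal" sets could look roughly like this. It is a quadratic sketch meant for the one-off build phase, and `minimal_sets` is a name chosen for illustration.

```python
def minimal_sets(sets):
    """Keep only the sets that are not supersets of another stored set."""
    # Process smaller sets first: any subset of a candidate is no larger
    # than the candidate, so it has already been kept or is itself covered.
    unique = sorted(set(map(frozenset, sets)), key=len)
    minimal = []
    for candidate in unique:
        if not any(kept <= candidate for kept in minimal):
            minimal.append(candidate)
    return minimal


stored = [{1, 2}, {1}, {2, 3}, {1, 2, 3}, {3, 4}]
reduced = minimal_sets(stored)
print(reduced)
```

A query set is a superset of some stored set exactly when it is a superset of some set in the reduced list, so lookups can run against the smaller collection.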
Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method to find subsets by using already existing efficient trie implementations for python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.
from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):
    if trieState is None:
        trieState = BaseState(trie)

    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2):
                return True
            trieState.copy_to(trieState2)
    return False
The trie can be used like a dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)
for subset in sets:
    trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter
Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
A cython implementation that also covers the conversion of elements can be found here.

Given an integer, find the smallest function that gives it

I have a very large positive integer (a million digits). I need to represent it with the smallest possible function. The number varies, which means I need an algorithm that generates the smallest possible function that yields the given number.
Example: For the number 29512665430652752148753480226197736314359272517043832886063884637676943433478020332709411004889 the algorithm must return "9^99". It must be able to analyze numbers and always return a math function that represent the number. Example the number 21847450052839212624230656502990235142567050104912751880812823948662932355202 must return "9^5^16+1".
Heard of Kolmogorov complexity?
To answer your question: unless you restrict yourself to some specific set of functions, it's impossible.
EDIT: Even in your example, how do you know that the shortest representation of 21847450052839212624230656502990235142567050104912751880812823948662932355202 is actually 9^5^16+1? Isn't it quite hard to prove even in this specific case?
If you restrict yourself to some set of functions then you can use the following algorithm:
For i = 1 to n
    enumerate all strings s of length i
        if s represents a valid expression according to rules chosen a priori,
           and evaluates to the number in the input,
            return s
It is guaranteed to halt because on the last iteration of the outer loop (i = n) you will eventually reach the string that contains the input verbatim.
Of course, this is not very efficient: it enumerates O(b^n) strings, where n is the length of the input and b is the size of the alphabet.
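As a rough illustration of the enumeration above, here is a sketch for one particular rule set chosen a priori: decimal literals combined with +, * and a right-associative ^. The alphabet, the function names and the decision to reject any expression whose intermediate values exceed the target are all assumptions made for this example. It is only usable for tiny inputs.

```python
from itertools import product

ALPHABET = "0123456789+*^"  # assumed expression alphabet


def evaluate(expr, limit):
    """Evaluate expr; return None if it is malformed or exceeds limit midway."""
    pos = 0

    def check(value):
        if value > limit:
            raise ValueError("intermediate value too large")
        return value

    def literal():
        nonlocal pos
        start = pos
        while pos < len(expr) and expr[pos].isdigit():
            pos += 1
        if pos == start or (pos - start > 1 and expr[start] == "0"):
            raise ValueError("expected a number without leading zeros")
        return check(int(expr[start:pos]))

    def power():  # right-associative: a^b^c means a^(b^c)
        nonlocal pos
        base = literal()
        if pos < len(expr) and expr[pos] == "^":
            pos += 1
            exponent = power()
            if base > 1 and exponent > limit.bit_length():
                raise ValueError("power would exceed the limit")
            return check(base ** exponent)
        return base

    def term():
        nonlocal pos
        value = power()
        while pos < len(expr) and expr[pos] == "*":
            pos += 1
            value = check(value * power())
        return value

    def total():
        nonlocal pos
        value = term()
        while pos < len(expr) and expr[pos] == "+":
            pos += 1
            value = check(value + term())
        return value

    try:
        value = total()
    except ValueError:
        return None
    return value if pos == len(expr) else None


def shortest_expression(n):
    """Try all strings of length 1, 2, ... up to the decimal length of n."""
    for length in range(1, len(str(n)) + 1):
        for chars in product(ALPHABET, repeat=length):
            candidate = "".join(chars)
            if evaluate(candidate, n) == n:
                return candidate


result = shortest_expression(512)
print(result)
```

The guard inside `power` matters in practice: without it, a candidate such as 9^9^9 would be evaluated in full before being discarded.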
Expanding on @ybungalobill's terse answer, your function is equivalent to a function that computes the Kolmogorov complexity of an arbitrary string. (The equivalence is obvious if you treat each digit of your very large numbers as characters, and the numbers as sequences of characters.)
According to the Wikipedia page on Kolmogorov complexity, the K(s) function that gives the complexity of a string s is not a computable function. (The page includes a proof.)
In other words, the algorithm you want simply does not exist.
@BlueRaja - Danny Pflughoeft: yes, it is. I'm trying to create a compression scheme that uses this algorithm, but apparently this is impossible.
That's because it's technically impossible to compress arbitrary data, for the same reason, but that doesn't stop us from doing it :)
There are much better ways of compressing data, however. Take a look at, for instance, LZ. It is so ubiquitous that you can almost certainly find a library to do the compression for you, regardless of what language you're writing in. DEFLATE is another popular one.
Hope that helps!
If you're not looking for optimality, just a reasonably good job, then there are a bunch of heuristics you can use. For example, try to decompose n using all of the following
n = a^k + b
for k = 2, 3, ..., log n, and pick the one with the smallest a + b, say. You can compute a and b using a = floor(n^(1/k)) and b = n-a^k. Then recurse on a and b.
Of course, this uses only exponentiation and addition to find a good compression. If you allow subtraction as well, use a=round(n^(1/k)) instead and let b be negative.
Allowing multiplication as well makes it quite a bit harder because you would probably need to factor n.
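A minimal sketch of the a^k + b heuristic with exponentiation and addition only might look like this. The integer root is computed by binary search to avoid floating-point error on huge numbers. The single-digit cutoff, the fallback to the plain decimal string and the names `integer_root` and `compress` are my assumptions, not part of the answer.

```python
def integer_root(n, k):
    """Largest integer a with a**k <= n, found by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // k + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** k <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def compress(n):
    """Heuristically write n as a^k + b, recursing on a and b."""
    plain = str(n)
    if n < 10:  # assumed cutoff: single digits are left alone
        return plain
    best = None
    for k in range(2, n.bit_length() + 1):
        a = integer_root(n, k)
        if a < 2:
            break
        b = n - a ** k
        if best is None or a + b < best[0] + best[2]:
            best = (a, k, b)  # keep the decomposition with the smallest a + b
    a, k, b = best
    base = compress(a)
    if not base.isdigit():
        base = f"({base})"  # parenthesize a compound base
    candidate = f"{base}^{k}"
    if b:
        candidate += f"+{compress(b)}"
    # Fall back to the decimal digits when the expression is not shorter.
    return candidate if len(candidate) < len(plain) else plain


print(compress(9 ** 99))
print(compress(9 ** 80 + 1))
```

Because it minimizes a + b rather than the length of the output, the sketch can return an expression that is correct but not the shortest one.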

Using finite automata as keys to a container

I have a problem where I really need to be able to use finite automata as the keys to an associative container. Each key should actually represent an equivalence class of automata, so that when I search, I will find an equivalent automaton (if such a key exists), even if that automaton isn't structurally identical.
An obvious last-resort approach is of course to use linear search with an equivalence test for each key checked. I'm hoping it's possible to do a lot better than this.
I've been thinking in terms of trying to impose an arbitrary but consistent ordering, and deriving an ordered comparison algorithm. First principles involve the sets of strings that the automata represent. Evaluate the set of possible first tokens for each automaton, and apply an ordering based on those two sets. If necessary, continue to the sets of possible second tokens, third tokens etc. The obvious problem with doing this naively is that there's an infinite number of token-sets to check before you can prove equivalence.
I've been considering a few vague ideas - minimising the input automata first and using some kind of closure algorithm, or converting back to a regular grammar, some ideas involving spanning trees. I've come to the conclusion that I need to abandon the set-of-tokens lexical ordering, but the most significant conclusion I've reached so far is that this isn't trivial, and I'm probably better off reading up on someone else's solution.
I've downloaded a paper from CiteSeerX - Total Ordering on Subgroups and Cosets - but my abstract algebra isn't even good enough to know if this is relevant yet.
It also occurred to me that there might be some way to derive a hash from an automaton, but I haven't given this much thought yet.
Can anyone suggest a good paper to read? - or at least let me know if the one I've downloaded is a red herring or not?
I believe that you can obtain a canonical form from minimized automata. For any two equivalent automata, their minimized forms are isomorphic (I believe this follows from the Myhill-Nerode theorem). This isomorphism respects edge labels and of course node classes (start, accepting, non-accepting). This makes it easier than unlabeled graph isomorphism.
I think that if you build a spanning tree of the minimized automaton starting from the start state and ordering output edges by their labels, then you'll get a canonical form for the automaton which can then be hashed.
Edit: Non-tree edges should be taken into account too, but they can also be ordered canonically by their labels.
Here is a thesis from 1992 where they produce canonical minimized automata: Minimization of Nondeterministic Finite Automata
Once you have the canonical form, you can easily hash it, for example by performing a depth-first enumeration of the states and transitions, and hashing a string obtained by encoding state numbers (count them in the order of their first appearance) for states and transitions as tuples
<from_state, symbol, to_state, is_accepting_final_state>
This should solve the problem.
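To make the idea concrete, here is a sketch that derives a hashable key from a DFA that is assumed to be already minimized, deterministic and free of unreachable states. It numbers states in order of first appearance, following outgoing edges in sorted label order. Note that it uses a breadth-first traversal instead of the depth-first one mentioned above (any fixed deterministic traversal serves the purpose), and that the function name and the dictionary encoding of the transitions are illustrative assumptions.

```python
import hashlib
from collections import deque


def canonical_key(start, accepting, delta, alphabet):
    """Hash of a minimal DFA that does not depend on how states are named.

    delta maps (state, symbol) -> state; missing entries mean "no transition".
    """
    order = {start: 0}  # state -> number, in order of first appearance
    queue = deque([start])
    transitions = []
    while queue:
        state = queue.popleft()
        for symbol in sorted(alphabet):  # fixed edge order by label
            target = delta.get((state, symbol))
            if target is None:
                continue
            if target not in order:
                order[target] = len(order)
                queue.append(target)
            transitions.append((order[state], symbol, order[target]))
    accepting_ids = sorted(order[s] for s in accepting if s in order)
    encoded = repr((transitions, accepting_ids)).encode()
    return hashlib.sha256(encoded).hexdigest()


# Two differently named DFAs for "an even number of a's", plus one for "odd".
key1 = canonical_key("even", {"even"},
                     {("even", "a"): "odd", ("odd", "a"): "even"}, "a")
key2 = canonical_key(0, {0}, {(0, "a"): 1, (1, "a"): 0}, "a")
key3 = canonical_key(0, {1}, {(0, "a"): 1, (1, "a"): 0}, "a")
print(key1 == key2, key1 == key3)
```

The key can then be used directly in a hash-based container. Minimizing each automaton before computing the key is left out of the sketch.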
When a problem seems insurmountable, the solution is often to publicly announce how difficult you think the problem is. Then, you will immediately realise that the problem is trivial and that you've just made yourself look an idiot - and that's basically where I am now ;-)
As suggested in the question, to lexically order the two automata, I need to consider two things. The two sets of possible first tokens, and the two sets of possible everything-else tails. The tails can be represented as finite automata, and can be derived from the original automata.
So the comparison algorithm is recursive - compare the head, if different you have your result, if the same then recursively compare the tail.
The problem is the infinite sequence needed to prove equivalence for regular grammars in general. If, during a comparison, a pair of automata recur, equivalent to a pair that you checked previously, you have proven equivalence and you can stop checking. It is in the nature of finite automata that this must happen in a finite number of steps.
The problem is that I still have a problem in the same form. To spot my termination criteria, I need to compare my pair of current automata with all the past automata pairs that occurred during the comparison so far. That's what has been giving me a headache.
It also turns out that that paper is relevant, but probably only takes me this far. Regular languages can form a group using the concatenation operator, and the left coset is related to the head:tail things I've been considering.
The reason I'm an idiot is because I've been imposing a far too strict termination condition, and I should have known it, because it's not that unusual an issue WRT automata algorithms.
I don't need to stop at the first recurrence of an automata pair. I can continue until I find a more easily detected recurrence - one that has some structural equivalence as well as logical equivalence. So long as my derive-a-tail-automaton algorithm is sane (and especially if I minimise and do other cleanups at each step) I will not generate an infinite sequence of equivalent-but-different-looking automata pairs during the comparison. The only sources of variation in structure are the original two automata and the tail automaton algorithm, both of which are finite.
The point is that it doesn't matter that much if I compare too many lexical terms - I will still get the correct result, and while I will terminate a little later, I will still terminate in finite time.
This should mean that I can use an unreliable recurrence detection (allowing some false negatives) using a hash or ordered comparison that is sensitive to the structure of the automata. That's a simpler problem than the structure-insensitive comparison, and I think it's the key that I need.
Of course there's still the issue of performance. A linear search using a standard equivalence algorithm might be a faster approach, based on the issues involved here. Certainly I would expect this comparison to be a less efficient equivalence test than existing algorithms, as it is doing more work - lexical ordering of the non-equivalent cases. The real issue is the overall efficiency of a key-based search, and that is likely to need some headache-inducing analysis. I'm hoping that the fact that non-equivalent automata will tend to compare quickly (detecting a difference in the first few steps, like traditional string comparisons) will make this a practical approach.
Also, if I reach a point where I suspect equivalence, I could use a standard equivalence algorithm to check. If that check fails, I just continue comparing for the ordering where I left off, without needing to check for the tail language recurring - I know that I will find a difference in a finite number of steps.
If all you can do is == or !=, then I think you have to check every set member before adding another one. This is slow. (Edit: I guess you already know this, given the title of your question, even though you go on about comparison functions to directly compare two finite automata.)
I tried to do that with phylogenetic trees, and it quickly runs into performance problems. If you want to build large sets without duplicates, you need a way to transform to a canonical form. Then you can check a hash, or insert into a binary tree with the string representation as a key.
Another researcher who did come up with a way to transform a tree to a canonical rep used Patricia trees to store unique trees for duplicate-checking.

Resources