I'm searching for a function/algorithm like a hash algorithm. It takes an arbitrary number of bytes as input and gives a fixed-sized output. But the main difference is I want it to be reversible.
This is impossible because of the Pigeonhole Principle. The Pigeonhole Principle says that if you have n pigeonholes and m pigeons, and m > n, then there will be at least one pigeonhole with at least two pigeons in it.
This Principle has been mathematically proven; there is no way around it. In fact, the proof is simple enough that it can be done by high school students and is often assigned as homework. You can try it yourself: take two drawers and three socks and try to place the socks so that each drawer holds at most one sock. You will quickly see that it is impossible.
Therefore, a total function from a domain D to a codomain C, where the cardinality |D| of D is greater than the cardinality |C| of C (i.e. |D| > |C|), can never be injective: at least two inputs must map to the same output, so the function cannot be reversed.
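If you want to see the principle in action, here is a tiny Python illustration (the one-byte "hash" and the 257-input loop are just made up for the demo): any function whose output has only 256 possible values must collide once you feed it 257 distinct inputs.

import hashlib

def one_byte_hash(data: bytes) -> int:
    """A toy 'hash' whose codomain has only 256 values."""
    return hashlib.sha256(data).digest()[0]

seen = {}
for i in range(257):                      # 257 pigeons, 256 pigeonholes
    data = i.to_bytes(2, "big")
    h = one_byte_hash(data)
    if h in seen:
        print(f"collision: {seen[h]!r} and {data!r} both hash to {h}")
        break
    seen[h] = data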
I'm searching for a function/algorithm like a hash algorithm. It takes an arbitrary number of bytes as input and gives a fixed-sized output.
By the way, that is just the definition of a hash algorithm. So, your "function like a hash algorithm" is just a hash algorithm.
I need to use a hash function which belongs to a family of k-wise independent hash functions. Any pointers on any library or toolkit in C, C++ or python which can generate a set of k-wise independent hash functions from which I can pick a function.
Background: I am trying to implement this algorithm here: http://researcher.watson.ibm.com/researcher/files/us-dpwoodru/knw10b.pdf for the Distinct Elements problem.
I have looked at this thread: Generating k pairwise independent hash functions, which mentions using Murmur hash to generate a pairwise independent hash function. I was wondering if there is anything similar for k-wise independent hash functions. If there is none available, would it be possible for me to construct such a set of k-wise independent hash functions?
Thanks in advance.
The simplest k-wise independent hash function (mapping a positive integer x < p to one of m buckets) is just
h(x) = (a_0 + a_1*x + a_2*x^2 + ... + a_{k-1}*x^(k-1)) % p % m
where p is some big random prime (2^61 - 1 will work)
and the a_i are random integers less than p, with a_0 > 0.
2-wise independent hash:
h(x) = (ax + b) % p % m
again, p is prime, and a, b are random values with 0 < a < p and 0 <= b < p (i.e. a can't be zero but b can be)
These formulas define families of hash functions. They work (in theory) if you select a hash function randomly from corresponding family (i.e. if you generate random a's and b) each time you run your algorithm.
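For illustration, here is a rough Python sketch of drawing one member of the degree-(k-1) polynomial family above; the helper name random_kwise_hash and the parameter choices are mine, not from any particular library.

import random

P = 2**61 - 1   # a Mersenne prime comfortably larger than typical 32-bit keys

def random_kwise_hash(k: int, m: int):
    """Return one function drawn at random from a k-wise independent family."""
    # a_0 in [1, P), remaining coefficients in [0, P), as described above
    coeffs = [random.randrange(1, P)] + [random.randrange(P) for _ in range(k - 1)]
    def h(x: int) -> int:
        acc = 0
        for a in reversed(coeffs):      # Horner's rule over Z_P
            acc = (acc * x + a) % P
        return acc % m
    return h

h = random_kwise_hash(k=4, m=1024)      # one function from a 4-wise independent family
print(h(12345), h(67890))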
There is no such thing as "a k-wise independent hash function". However, there are k-wise independent families of functions.
As a reminder, a family of functions is k-wise independent when, if h is picked randomly from the family, x_1 .. x_k are distinct, and y_1 .. y_k are picked arbitrarily, the probability that h(x_i) = y_i for all i is Y^{-k}, where Y is the size of the co-domain from which the y_i were selected.
There are a few families of functions that are known to be k-wise independent for small k like 2, 3, 4, and 5. For arbitrary k, you will likely need to use polynomial hashing. Note that there are two variants of this, one of which is not even 2-independent, so be careful when implementing it.
The polynomial hash family can hash from a field F to itself using k constants a_0 through a_{k-1} and is defined by the sum of a_i x^i, where x is the key you are hashing. Field arithmetic can be implemented on your computer by letting F be the integers modulo a prime p. That's probably not convenient, as it is often better to have the domain and range be uint32_t or the like. In that case you can use the field F_{2^32}, where multiplication is polynomial multiplication over Z_2 followed by reduction modulo an irreducible polynomial of degree 32. Otherwise, you can operate in Z_p where p is larger than 2^32 (or 2^64) and take the result of the polynomial mod 2^32, I think. That will only be almost k-wise independent, but sometimes that's sufficient for the analysis to go through. It will not be easy to re-analyze the KNW algorithm to change its hash families.
To generate a member of a k-wise independent family, use your favorite random number generator to pick the function randomly. In the case of polynomial hashing, that means picking the a_i coefficients referenced above. /dev/random should suffice.
The paper you point to, "An Optimal Algorithm for the Distinct Elements Problem", is a nice one and has been cited many times. However, it is not easy to implement, and it may be slower or even take more space than HyperLogLog, due to hidden constants in the big-O notations. A number of papers have noted the complexity of this algorithm and even called it infeasible compared to HyperLogLog. If you want to implement an estimator for the number of distinct elements, you might start with an earlier algorithm. There is plenty of complexity there if your goal is education. If your goal is practicality, you also want to stay away from KNW, because it could be a lot of work just to end up with something less practical than HyperLogLog.
As another piece of advice, you should probably ignore the suggestions to "just use Murmur hash" or "pick k values from xxhash" if you want to learn about and understand this algorithm or other randomized algorithms that use hashing. Murmur/xx might be fine in practice, but they are not k-wise independent families, and some of the advice on this page is not even semantically well-formed. For instance, "if you need k different hash values, just re-use the same algorithm k times, with k different seeds" isn't relevant to k-wise independent families. For the algorithm you want to implement, you'll end up applying the hash functions an arbitrary number of times. You don't need "k different hash values"; you need n different hash values, generated by first picking a function at random from a k-wise independent family and then applying that chosen function to the streaming keys that are the input to algorithms like this.
This is one of many solutions, but you could use for example the following open-source hash algorithm:
https://github.com/Cyan4973/xxHash
Then, to generate different hashes, you just have to provide different seeds.
Consider the main function declaration:
unsigned int XXH32 (const void* input, int len, unsigned int seed);
So if you need k different hash values, just re-use the same algorithm k times, with k different seeds.
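For example, assuming the Python bindings for xxHash are installed (pip install xxhash), the same seeding idea looks like the sketch below. Note, as pointed out above, that this does not give a k-wise independent family, just k seeded hashes.

import xxhash

def k_hashes(data: bytes, k: int, m: int):
    """k hash values of the same input, one per seed, each reduced to m buckets."""
    return [xxhash.xxh32(data, seed=s).intdigest() % m for s in range(k)]

print(k_hashes(b"example key", k=5, m=1024))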
Just use a good non-cryptographic hash function. This advice perhaps will make me unpopular with my colleagues in theoretical computer science, but consider your adversary.
Nature. Yeah, maybe it'll hit the minuscule fraction of inputs that cause your hash function to behave badly, but there are plenty of other ways for things to go wrong that a k-wise independent hash family won't fix (e.g., the random number generator that chose the hash function didn't do a good job, bugs, etc.), so you need to test end-to-end anyway.
Oblivious adversary. This is what the theory assumes. Oblivious adversaries cannot look at your random bits. If only they were so nice in real life!
Non-oblivious adversary. Randomness is pointless. Use a binary tree.
I'm not 100% sure what you mean by "k-wise independent hash functions", but you can get k distinct hash functions by coming up with two hash functions, and then using linear combinations of them.
I have an example in my bloom filter module: http://stromberg.dnsalias.org/svn/bloom-filter/trunk/bloom_filter_mod.py Ignore the get_bitno_seed_rnd function, look at hash1, hash2 and get_bitno_lin_comb
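This is not the module's actual code, just a rough sketch of the linear-combination idea it uses, with two base hashes built from hashlib: g_i(x) = (h1(x) + i*h2(x)) mod m for i = 0..k-1.

import hashlib

def h1(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def h2(data: bytes) -> int:
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big") | 1  # force odd stride

def k_distinct_hashes(data: bytes, k: int, m: int):
    a, b = h1(data), h2(data)
    return [(a + i * b) % m for i in range(k)]

print(k_distinct_hashes(b"example", k=4, m=100))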
The problem
I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.
Building the datastructure can take as long as needed, but the lookup should be as efficient as possible, and the datastructure can't take too much space.
Some context
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)
What I've come up with so far
For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
I hope that in between this O(N*C) and O(C) solution, there's maybe an O(log(N)*C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)*C) lookup: Store the arrays as a prefix trie. When looking up an array A, descend only into the 0-subtree when A[i]=0, but visit both subtrees when A[i]=1.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)*C), if you assume that the stored arrays are random. But: 1. they're not random, the stored arrays are sparse. And 2. it's only intuition, I can't prove it.
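To make the idea concrete, here is a rough Python sketch of that branching lookup using a plain dict as the trie; the helper names are just for illustration.

def build_trie(arrays):
    root = {}
    for arr in arrays:
        node = root
        for bit in arr:
            node = node.setdefault(bit, {})
        node["end"] = True                  # marks a complete stored array
    return root

def contains_subset_of(trie, A, i=0):
    """True if some stored array B satisfies B[j] <= A[j] for all j."""
    if trie.get("end"):
        return True
    if i == len(A):
        return False
    if 0 in trie and contains_subset_of(trie[0], A, i + 1):
        return True
    if A[i] == 1 and 1 in trie and contains_subset_of(trie[1], A, i + 1):
        return True
    return False

stored = [[0, 1, 0, 0], [1, 0, 0, 1]]
trie = build_trie(stored)
print(contains_subset_of(trie, [1, 1, 1, 0]))   # True: superset of [0,1,0,0]
print(contains_subset_of(trie, [0, 0, 1, 0]))   # False: superset of neither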
I will try out both this new idea and the BDD method, and see which of the two works out best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
Just to add some background information to the prefix trie solution, recently I found the following paper:
I. Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure, a trie-based container for sets of sets that supports efficient storage and querying, including operations such as finding all supersets or subsets of a given set within the collection.
For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.
I think prefix trie is a great start.
Since your arrays are sparse, I would additionally test them in bulk: if (B1 ∪ B2) ⊆ A, then both are included. So the idea is to OR-pack the arrays in pairs, and to iterate until there is only one "root" array (this takes only twice as much space). It allows you to answer 'Yes' earlier, which is mainly useful if you don't need to know which array is actually contained.
Independently, you can apply to each array a hash function that preserves ordering.
i.e.: B ⊆ A ⇒ h(B) ≼ h(A)
ORing bits together is such a function, but you can also count the 1-bits in suitable partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).
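As a rough Python sketch of the bit-counting filter (block size and helper names chosen arbitrarily): reject a stored array as soon as one of its blocks has more ones than the corresponding block of the query, and only run the full element-wise check on survivors.

def signature(bits, block=64):
    return tuple(sum(bits[i:i + block]) for i in range(0, len(bits), block))

def maybe_subset(sig_b, sig_a):
    """Necessary (not sufficient) condition for B ⊆ A."""
    return all(cb <= ca for cb, ca in zip(sig_b, sig_a))

def is_subset(B, A):
    return all(a >= b for a, b in zip(A, B))

stored = [[0, 1] + [0] * 126, [1] * 64 + [0] * 64]
sigs = [signature(b) for b in stored]

def query(A):
    sig_a = signature(A)
    return any(maybe_subset(s, sig_a) and is_subset(B, A)
               for B, s in zip(stored, sigs))

print(query([1] * 128))            # True: superset of both stored arrays
print(query([0] * 64 + [1] * 64))  # False: both stored arrays rejected by the signature alone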
You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
From there I would use some kind of ID3 or C4.5 algorithm.
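A naive sketch of the reduction to minimal sets described above (quadratic in N, which is fine as preprocessing; assumes there are no duplicate sets):

def is_subset(B, A):
    return all(a >= b for a, b in zip(A, B))

def minimal_sets(arrays):
    kept = []
    for cand in arrays:
        # drop cand if it is a superset of some other stored set
        if not any(is_subset(other, cand) and other != cand for other in arrays):
            kept.append(cand)
    return kept

stored = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
print(minimal_sets(stored))   # [[1, 0, 0], [0, 1, 1]]: [1,1,0] is a superset of [1,0,0]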
Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method to find subsets by using already existing efficient trie implementations for python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.
from datrie import BaseTrie, BaseState
def existsSubset(trie, setarr, trieState=None):
    # trieState marks how far we have already walked; trieState2 is a working copy
    if trieState is None:
        trieState = BaseState(trie)
    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            # either a stored set ends here, or we can keep matching the remaining elements
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2):
                return True
        trieState.copy_to(trieState2)  # reset the working state for the next element
    return False
The trie can be used like a dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)
for subset in sets:
    trie["".join(chr(i) for i in subset)] = 0  # the assigned value does not matter
Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
A cython implementation that also covers the conversion of elements can be found here.
I have a very large positive integer (a million digits). I need to represent it with the smallest possible function. The number is variable, meaning I need an algorithm that generates the smallest possible function for any given number.
Example: for the number 29512665430652752148753480226197736314359272517043832886063884637676943433478020332709411004889 the algorithm must return "9^99". It must be able to analyze numbers and always return a math function that represents the number. For example, the number 21847450052839212624230656502990235142567050104912751880812823948662932355202 must return "9^5^16+1".
Heard of Kolmogorov complexity?
To answer your question: unless you restrict yourself to some specific set of functions, it's impossible.
EDIT: Even in your example, how do you know that the shortest representation of 21847450052839212624230656502990235142567050104912751880812823948662932355202 is actually 9^5^16+1? Isn't that quite hard to prove, even in this specific case?
If you restrict yourself to some set of functions then you can use the following algorithm:
For i = 1 to n
    enumerate all strings s of length i
        if s represents a valid expression according to rules chosen a priori,
           and evaluates to the number in the input,
              return s
It is guaranteed to halt because on the last iteration of the outer loop (i = n) you will eventually reach the string that spells out the input verbatim, which is itself a valid expression.
Of course, this is not very efficient. Specifically, it is O(b^n), where n is the length of the input and b is the size of the alphabet.
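To make the enumeration concrete, here is a toy Python sketch with the rules chosen a priori to be Python-evaluable expressions over the alphabet 0-9, +, * and ^ (treated as exponentiation). It is exponential in the output length, so it only works for tiny examples, and the small max_len keeps it from evaluating towers like 9^9^9.

from itertools import product

ALPHABET = "0123456789+*^"

def shortest_expression(n, max_len=4):
    for length in range(1, max_len + 1):
        for chars in product(ALPHABET, repeat=length):
            expr = "".join(chars)
            try:
                value = eval(expr.replace("^", "**"), {"__builtins__": {}})
            except (SyntaxError, ValueError, TypeError, ZeroDivisionError):
                continue            # skip strings that are not valid expressions
            if value == n:
                return expr
    return None

print(shortest_expression(4782969))   # prints "9^7"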
Expanding on @ybungalobill's terse answer, your function is equivalent to a function that computes the Kolmogorov complexity of an arbitrary string. (The equivalence is obvious if you treat each digit of your very large numbers as a character, and the numbers as sequences of characters.)
According to the Wikipedia page on Kolmogorov complexity, the K(s) function that gives the complexity of a string s is not a computable function. (The page includes a proof.)
In other words, the algorithm you want simply does not exist.
@BlueRaja - Danny Pflughoeft: yes, it is. I'm trying to create some compression that uses this algorithm, but as noted, this is impossible.
That's because it's technically impossible to compress arbitrary data, for the same reason, but that doesn't stop us from doing it :)
There are much better ways of compressing data, however. Take a look at, for instance, LZ. It is so ubiquitous that you can almost certainly find a library to do the compression for you, regardless of what language you're writing in. DEFLATE is another popular one.
Hope that helps!
If you're not looking for optimality, just a reasonably good job, then there are a bunch of heuristics you can use. For example, try to decompose n using all of the following
n = a^k + b
for k = 2, 3, ..., log n, and pick the one with the smallest a + b, say. You can compute a and b using a = floor(n^(1/k)) and b = n-a^k. Then recurse on a and b.
Of course, this uses only exponentiation and addition to find a good compression. If you allow subtraction as well, use a=round(n^(1/k)) instead and let b be negative.
Allowing multiplication as well makes it quite a bit harder because you would probably need to factor n.
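Here is a rough Python sketch of the exponentiation-plus-addition heuristic; the integer k-th root helper and the cut-off for small numbers are arbitrary choices of mine.

def kth_root(n, k):
    """Integer floor of n**(1/k), computed without floating point."""
    lo, hi = 1, 1 << (n.bit_length() // k + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** k <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def compress(n):
    if n < 100:                      # small numbers: just write them out
        return str(n)
    best = None
    for k in range(2, n.bit_length() + 1):
        a = kth_root(n, k)
        if a < 2:
            break
        b = n - a ** k
        if best is None or a + b < best[0] + best[2]:
            best = (a, k, b)
    a, k, b = best
    expr = f"{compress(a)}^{k}"
    return expr if b == 0 else f"{expr}+{compress(b)}"

print(compress(9 ** 99))   # prints "3^198" (equals 9^99): the heuristic minimises a+b, not string length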
Just as background, I'm aware of the Fisher-Yates perfect shuffle. It is a great shuffle with its O(n) complexity and its guaranteed uniformity and I'd be a fool not to use it ... in an environment that permits in-place updates of arrays (so in most, if not all, imperative programming environments).
Sadly the functional programming world doesn't give you access to mutable state.
Because of Fisher-Yates, however, there's not a lot of literature I can find on how to design a shuffling algorithm. The few places that address it at all do so briefly before saying, in effect, "so here's Fisher-Yates which is all the shuffling you need to know". I had to, in the end, come up with my own solution.
The solution I came up with works like this to shuffle any list of data:
If the list is empty, return the empty list.
If the list has a single item, return that single-item list.
If the list is non-empty, partition the list with a random number generator and apply the algorithm recursively to each partition, assembling the results.
In Erlang code it looks something like this:
shuffle([]) -> [];
shuffle([L]) -> [L];
shuffle(L) ->
    {Left, Right} = lists:partition(fun(_) ->
        random:uniform() < 0.5
    end, L),
    shuffle(Left) ++ shuffle(Right).
(If this looks like a deranged quick sort to you, well, that's what it is, basically.)
So here's my problem: the same situation that makes finding shuffling algorithms that aren't Fisher-Yates difficult makes finding tools to analyse a shuffling algorithm equally difficult. There's lots of literature I can find on analysing PRNGs for uniformity, periodicity, etc. but not a lot of information out there on how to analyse a shuffle. (Indeed some of the information I found on analysing shuffles was just plain wrong -- easily deceived through simple techniques.)
So my question is this: how do I analyse my shuffling algorithm (assuming that the random:uniform() call up there is up to the task of generating appropriate random numbers with good characteristics)? What mathematical tools are there at my disposal to judge whether or not, say, 100,000 runs of the shuffler over a list of integers ranging 1..100 has given me plausibly good shuffling results? I've done a few tests of my own (comparing increments to decrements in the shuffles, for example), but I'd like to know a few more.
And if there's any insight into that shuffle algorithm itself that would be appreciated too.
General remark
My personal approach about correctness of probability-using algorithms: if you know how to prove it's correct, then it's probably correct; if you don't, it's certainly wrong.
Said differently, it's generally hopeless to try to analyse every algorithm you could come up with: you have to keep looking for an algorithm until you find one that you can prove correct.
Analysing a random algorithm by computing the distribution
I know of one way to "automatically" analyse a shuffle (or more generally a random-using algorithm) that is stronger than the simple "throw lots of tests and check for uniformity". You can mechanically compute the distribution associated to each input of your algorithm.
The general idea is that a random-using algorithm explores a part of a world of possibilities. Each time your algorithm asks for a random element in a set ({true, false} when flipping a coin), there are two possible outcomes for your algorithm, and one of them is chosen. You can change your algorithm so that, instead of returning one of the possible outcomes, it explores all solutions in parallel and returns all possible outcomes with the associated distributions.
In general, that would require rewriting your algorithm in depth. If your language supports delimited continuations, you don't have to; you can implement the "exploration of all possible outcomes" inside the function asking for a random element (the idea is that the random generator, instead of returning a result, captures the continuation associated with your program and runs it with all the different results). For an example of this approach, see oleg's HANSEI.
An intermediary, and probably less arcane, solution is to represent this "world of possible outcomes" as a monad, and use a language such as Haskell with facilities for monadic programming. Here is an example implementation of a variant¹ of your algorithm, in Haskell, using the probability monad of the probability package :
import Numeric.Probability.Distribution
shuffleM :: (Num prob, Fractional prob) => [a] -> T prob [a]
shuffleM [] = return []
shuffleM [x] = return [x]
shuffleM (pivot:li) = do
    (left, right) <- partition li
    sleft <- shuffleM left
    sright <- shuffleM right
    return (sleft ++ [pivot] ++ sright)
  where partition [] = return ([], [])
        partition (x:xs) = do
            (left, right) <- partition xs
            uniform [(x:left, right), (left, x:right)]
You can run it for a given input, and get the output distribution :
*Main> shuffleM [1,2]
fromFreqs [([1,2],0.5),([2,1],0.5)]
*Main> shuffleM [1,2,3]
fromFreqs
[([2,1,3],0.25),([3,1,2],0.25),([1,2,3],0.125),
([1,3,2],0.125),([2,3,1],0.125),([3,2,1],0.125)]
You can see that this algorithm is uniform with inputs of size 2, but non-uniform on inputs of size 3.
The difference from the test-based approach is that we can gain absolute certainty in a finite number of steps: that number can be quite big, as it amounts to an exhaustive exploration of the world of possibilities (but generally smaller than 2^N, as similar outcomes are factored together), but if it returns a non-uniform distribution we know for sure that the algorithm is wrong. Of course, if it returns a uniform distribution for [1..N] with 1 <= N <= 100, you only know that your algorithm is uniform up to lists of size 100; it may still be wrong.
¹: this algorithm is a variant of your Erlang implementation, because of the specific pivot handling. If I use no pivot, as in your case, the input size doesn't decrease at each step anymore: the algorithm also considers the case where all inputs are in the left list (or right list), and gets lost in an infinite loop. This is a weakness of the probability monad implementation (if an algorithm has a probability 0 of non-termination, the distribution computation may still diverge) that I don't yet know how to fix.
Sort-based shuffles
Here is a simple algorithm that I feel confident I could prove correct:
Pick a random key for each element in your collection.
If the keys are not all distinct, restart from step 1.
Sort the collection by these random keys.
You can omit step 2 if you know the probability of a collision (two random numbers picked are equal) is sufficiently low, but without it the shuffle is not perfectly uniform.
If you pick your keys in [1..N] where N is the length of your collection, you'll have lots of collisions (Birthday problem). If you pick your key as a 32-bit integer, the probability of conflict is low in practice, but still subject to the birthday problem.
If you use infinite (lazily evaluated) bitstrings as keys, rather than finite-length keys, the probability of a collision becomes 0, and checking for distinctness is no longer necessary.
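For reference, here is a minimal Python sketch of steps 1-3 with finite keys and a restart on collision (the 64-bit key size is an arbitrary choice); the OCaml version below replaces the finite keys with lazy infinite bitstrings so the restart is never needed.

import random

def sort_shuffle(xs):
    while True:
        keys = [random.getrandbits(64) for _ in xs]        # step 1: one key per element
        if len(set(keys)) == len(keys):                    # step 2: restart unless all keys distinct
            return [x for _, x in sorted(zip(keys, xs), key=lambda kv: kv[0])]  # step 3: sort by key

print(sort_shuffle(list(range(10))))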
Here is a shuffle implementation in OCaml, using lazy real numbers as infinite bitstrings:
type 'a stream = Cons of 'a * 'a stream lazy_t

let rec real_number () =
  Cons (Random.bool (), lazy (real_number ()))

let rec compare_real a b = match a, b with
  | Cons (true, _), Cons (false, _) -> 1
  | Cons (false, _), Cons (true, _) -> -1
  | Cons (_, lazy a'), Cons (_, lazy b') ->
    compare_real a' b'

let shuffle list =
  List.map snd
    (List.sort (fun (ra, _) (rb, _) -> compare_real ra rb)
       (List.map (fun x -> real_number (), x) list))
There are other approaches to "pure shuffling". A nice one is apfelmus's mergesort-based solution.
Algorithmic considerations: the complexity of the previous algorithm depends on the probability that all keys are distinct. If you pick them as 32-bit integers, you have a one in ~4 billion probability that a particular key collides with another key. Sorting by these keys is O(n log n), assuming picking a random number is O(1).
If you use infinite bitstrings, you never have to restart picking, but the complexity is then related to "how many elements of the streams are evaluated on average". I conjecture it is O(log n) on average (hence still O(n log n) in total), but have no proof.
... and I think your algorithm works
After more reflection, I think (like douplep) that your implementation is correct. Here is an informal explanation.
Each element in your list is tested by several random:uniform() < 0.5 tests. To an element, you can associate the list of outcomes of those tests, as a list of booleans or {0, 1}. At the beginning of the algorithm, you don't know the list associated with any of those elements. After the first partition call, you know the first element of each list, etc. When your algorithm returns, the lists of test outcomes are completely known and the elements are sorted according to those lists (sorted in lexicographic order, or considered as binary representations of real numbers).
So, your algorithm is equivalent to sorting by infinite bitstring keys. The action of partitioning the list, reminiscent of quicksort's partition over a pivot element, is actually a way of separating, for a given position in the bitstring, the elements with valuation 0 from the elements with valuation 1.
The sort is uniform because the bitstrings are all different. Indeed, two elements with real numbers equal up to the n-th bit are on the same side of a partition occurring during a recursive shuffle call of depth n. The algorithm only terminates when all the lists resulting from partitions are empty or singletons: all elements have been separated by at least one test, and therefore have distinct binary expansions.
Probabilistic termination
A subtle point about your algorithm (or my equivalent sort-based method) is that the termination condition is probabilistic. Fisher-Yates always terminates after a known number of steps (the number of elements in the array). With your algorithm, the termination depends on the output of the random number generator.
There are possible outputs of the random number generator that would make your algorithm diverge rather than terminate. For example, if the random number generator always outputs 0, each partition call will return the input list unchanged, on which you recursively call the shuffle: you will loop indefinitely.
However, this is not an issue if you're confident that your random number generator is fair: it does not cheat and always returns independent, uniformly distributed results. In that case, the probability that the test random:uniform() < 0.5 always returns true (or false) is exactly 0:
the probability that the first N calls return true is 2^{-N}
the probability that all calls return true is the probability of the infinite intersection, over all N, of the events that the first N calls return true; it is the infimum limit¹ of the 2^{-N}, which is 0
¹: for the mathematical details, see http://en.wikipedia.org/wiki/Measure_(mathematics)#Measures_of_infinite_intersections_of_measurable_sets
More generally, the algorithm does not terminate if and only if some of the elements get associated to the same boolean stream. This means that at least two elements have the same boolean stream. But the probability that two random boolean streams are equal is again 0 : the probability that the digits at position K are equal is 1/2, so the probability that the N first digits are equal is 2^{-N}, and the same analysis applies.
Therefore, you know that your algorithm terminates with probability 1. This is a slightly weaker guarantee than the Fisher-Yates algorithm, which always terminates. In particular, you're vulnerable to an attack by an evil adversary that controls your random number generator.
With more probability theory, you could also compute the distribution of running times of your algorithm for a given input length. This is beyond my technical abilities, but I assume it's good: I suppose that you only need to look at the first O(log N) digits on average to check that all N lazy streams are different, and that the probability of much higher running times decreases exponentially.
Your algorithm is a sort-based shuffle, as discussed in the Wikipedia article.
Generally speaking, the computational complexity of sort-based shuffles is the same as that of the underlying sort algorithm (e.g. O(n log n) average, O(n²) worst case for a quicksort-based shuffle), and while the distribution is not perfectly uniform, it should approach uniformity closely enough for most practical purposes.
Oleg Kiselyov provides the following article / discussion:
Provably perfect random shuffling and its pure functional implementations
which covers the limitations of sort-based shuffles in more detail, and also offers two adaptations of the Fisher–Yates strategy: a naive O(n²) one, and a binary-tree-based O(n log n) one.
Sadly the functional programming world doesn't give you access to mutable state.
This is not true: while purely functional programming avoids side effects, it supports access to mutable state with first-class effects, without requiring side effects.
In this case, you can use Haskell's mutable arrays to implement the mutating Fisher–Yates algorithm as described in this tutorial:
Haskell Shuffling (Brett Hall)
Addendum
The specific foundation of your shuffle sort is actually an infinite-key radix sort: as gasche points out, each partition corresponds to a digit grouping.
The main disadvantage of this is the same as any other infinite-key sorting shuffle: there is no termination guarantee. Although the likelihood of termination increases as the comparison proceeds, there is never an upper bound: the worst-case complexity is O(∞).
I was doing some stuff similar to this a while ago, and in particular you might be interested in Clojure's vectors, which are functional and immutable but still with O(1) random access/update characteristics. These two gists have several implementations of a "take N elements at random from this M-sized list"; at least one of them turns into a functional implementation of Fisher-Yates if you let N=M.
https://gist.github.com/805546
https://gist.github.com/805747
Based on How to test randomness (case in point - Shuffling) , I propose:
Shuffle (medium-sized) arrays composed of equal numbers of zeroes and ones. Repeat and concatenate until bored. Use these as input to the diehard tests. If you have a good shuffle, then you should be generating random sequences of zeroes and ones (with the caveat that the cumulative excess of zeroes (or ones) is zero at the boundaries of the medium-sized arrays, which you would hope the tests detect, but the larger "medium" is, the less likely they are to do so).
Note that a test can reject your shuffle for three reasons:
the shuffle algorithm is bad,
the random number generator used by the shuffler or during initialization is bad, or
the test implementation is bad.
You'll have to resolve which is the case if any test rejects.
Various adaptations of the diehard tests (to resolve certain numbers, I used the source from the diehard page). The principal mechanism of adaptation is to make the shuffle algorithm act as a source of uniformly distributed random bits.
Birthday spacings: In an array of n zeroes, insert log n ones. Shuffle. Repeat until bored. Construct the distribution of inter-one distances, compare with the exponential distribution. You should perform this experiment with different initialization strategies -- the ones at the front, the ones at the end, the ones together in the middle, the ones scattered at random. (The latter has the greatest hazard of a bad initialization randomization (with respect to the shuffling randomization) yielding rejection of the shuffling.) This can actually be done with blocks of identical values, but has the problem that it introduces correlation in the distributions (a one and a two can't be at the same location in a single shuffle).
Overlapping permutations: shuffle five values a bunch of times. Verify that the 120 outcomes are about equally likely; a sketch of this check appears after this list. (Chi-squared test, 119 degrees of freedom -- the diehard test (cdoperm5.c) uses 99 degrees of freedom, but this is (mostly) an artifact of sequential correlation caused by using overlapping subsequences of the input sequence.)
Ranks of matrices: from 2*(6*8)^2 = 4608 bits from shuffling equal numbers of zeroes and ones, select 6 non-overlapping 8-bit substrings. Treat these as a 6-by-8 binary matrix and compute its rank. Repeat for 100,000 matrices. (Pool together ranks of 0-4. Ranks are then either 6, 5, or 0-4.) The expected fraction of ranks is 0.773118, 0.217439, 0.009443. Chi-squared compare with observed fractions with two degrees of freedom. The 31-by-31 and 32-by-32 tests are similar. Ranks of 0-28 and 0-29 are pooled, respectively. Expected fractions are 0.2887880952, 0.5775761902, 0.1283502644, 0.0052854502. Chi-squared test has three degrees of freedom.
and so on...
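As promised above, here is a rough Python sketch of the overlapping-permutations check, using random.shuffle as a stand-in for the shuffler under test; swap in your own shuffle, as long as it shuffles the list in place.

import random
from itertools import permutations
from collections import Counter

def chi_square_permutation_test(shuffle_fn, trials=120_000):
    # shuffle_fn must shuffle the list in place (random.shuffle does)
    counts = Counter()
    for _ in range(trials):
        xs = [1, 2, 3, 4, 5]
        shuffle_fn(xs)
        counts[tuple(xs)] += 1
    expected = trials / 120
    chi2 = sum((counts[p] - expected) ** 2 / expected
               for p in permutations([1, 2, 3, 4, 5]))
    return chi2          # compare against the chi-square distribution with 119 degrees of freedom

print(chi_square_permutation_test(random.shuffle))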
You may also wish to leverage dieharder and/or ent to make similar adapted tests.