In-place shuffle of a contiguous ragged array - algorithm

I have a ragged array represented as a contiguous block of memory along with its "shape", corresponding to the length of each row, and its "offsets", corresponding to the index of the first element in each row. To illustrate, I have an array conceptually like this:
[[0, 0, 0],
[1, 1, 1, 1],
[2],
[3, 3],
[4, 4]]
Represented in-memory as:
values: [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]
shape: [3, 4, 1, 2, 2]
offsets: [0, 3, 7, 8, 10]
I may have on the order of hundreds of millions of rows, with typically, say, 3-20 four-byte floats per row, though with no hard upper bound on the row length.
I wish to shuffle the rows of this array randomly. Since the array is ragged, I can't see how the Fisher-Yates algorithm can be applied in a straightforward manner. I can see how to carry out a shuffle by randomly permuting the shape, pre-allocating a new array, and then copying rows over according to the permutation that generated the new shape, with some book-keeping on the indexes. However, I do not necessarily have the RAM required to duplicate the array for the purposes of this operation.
Therefore, my question is whether there is a good way to perform this shuffle in-place, or using only limited extra memory? Run-time is also a concern, but shuffling is unlikely to be the main bottleneck.
For illustration purposes, I wrote a quick toy-version in Rust here, which attempts to implement the shuffle sketched above with allocation of a new array.

shape is redundant since shape[i] is offset[i+1]-offset[i] (if you extend offset by one element containing the length of the values array). But since your data structure has both these fields, you could shuffle your array by just in-place shuffling the two descriptor vectors (in parallel), using F-Y. This would be slightly easier if shape and offset were combined into an array of pairs (offset, length), which also might improve locality of reference, but it's certainly not critical if you have some need for the separate arrays.
That doesn't preserve the contiguity of the rows in the values list, but if all your array accesses are through offset, it will not require any other code modification.
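For instance, a minimal sketch in Python of that descriptor-only shuffle, using the arrays from the question (only shape and offsets are touched; values stays where it is):

    import random

    def shuffle_descriptors(shape, offsets):
        # Fisher-Yates over the rows: swap the same positions in both
        # descriptor arrays so each (offset, length) pair stays together.
        for i in range(len(shape) - 1, 0, -1):
            j = random.randint(0, i)
            shape[i], shape[j] = shape[j], shape[i]
            offsets[i], offsets[j] = offsets[j], offsets[i]

    values = [0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4]
    shape = [3, 4, 1, 2, 2]
    offsets = [0, 3, 7, 8, 10]
    shuffle_descriptors(shape, offsets)
    # row i is now values[offsets[i] : offsets[i] + shape[i]]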
It is possible to implement an in-place swap of two variable-length subsequences using a variant of the classic rotate-with-three-reversals algorithm. Given P V Q, a sequence conceptually divided into three variable length parts, we first reverse P, V, and Q in-place independently producing PR VR QR. Then we reverse the entire sequence in place, yielding Q V P. (Afterwards, you'd need to fixup the offsets array.)
That's linear time in the length of the span from P to Q, but as a shuffle algorithm it will add up to quadratic time, which is impractical for "hundreds of millions" of rows.
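For completeness, a sketch of that reversal-based swap in Python, assuming P lies before Q within the same list (the function name swap_blocks is mine):

    def swap_blocks(values, p_start, p_len, q_start, q_len):
        # Swap the spans P = values[p_start : p_start+p_len] and
        # Q = values[q_start : q_start+q_len] in place, with P before Q.
        # Reverse P, V (the gap between them) and Q individually, then reverse
        # the whole span; the result is Q V P, so P and Q have traded places.
        def rev(lo, hi):  # reverse values[lo:hi] in place, O(1) extra space
            hi -= 1
            while lo < hi:
                values[lo], values[hi] = values[hi], values[lo]
                lo += 1
                hi -= 1
        end = q_start + q_len
        rev(p_start, p_start + p_len)   # P -> P reversed
        rev(p_start + p_len, q_start)   # V -> V reversed
        rev(q_start, end)               # Q -> Q reversed
        rev(p_start, end)               # whole span -> Q V P

    values = list("pqrXYabcd")
    swap_blocks(values, 0, 3, 5, 4)
    print("".join(values))   # abcdXYpqr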

As often happens, I started with a complex idea and then simplified it. Here is the simple version, with the complex one below.
What we're going to do is quicksort it into a random arrangement. The key operation is partitioning. That is, we want to take a section of m blocks and randomly partition it into m_l blocks on the left and m_r blocks on the right.
The idea is this. We keep a queue of temporarily copied blocks on the left, and a queue of temporarily copied blocks on the right. Each is a queue of blocks, but its size is measured in the number of values it holds. The partitioning logic looks like this:
while m_l + m_r > 0:
    pick the larger queue, breaking ties randomly
    if the queue is empty:
        read a block into it
    get block from queue
    if random() < m_l / (m_l + m_r):
        # put the block on the left
        until we have enough room to write the block:
            copy blocks into the left queue
        write block to the left
        m_l--
    else:
        # put the block on the right
        until we have enough room to write the block:
            copy blocks into the right queue
        write block to the right
        m_r--
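To illustrate just the probabilistic rule in that pseudocode (ignoring the in-place block movement), here is a small Python sketch; the helper name random_split is mine:

    import random

    def random_split(blocks, m_l):
        # Assign exactly m_l of the blocks to the left side, uniformly at
        # random, using the "remaining slots" rule from the pseudocode above.
        left, right = [], []
        m_r = len(blocks) - m_l
        for b in blocks:
            if random.random() < m_l / (m_l + m_r):
                left.append(b)
                m_l -= 1
            else:
                right.append(b)
                m_r -= 1
        return left, right

    print(random_split(list(range(10)), 4))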
And now we need to recursively repeat until we've quicksorted it into a random order.
Note that, unlike with a regular quicksort, we are constructing our partitions to be exactly the size we expect. So we don't have to recurse. Instead we can:
# pass 1
partition whole array
# pass 2
partition 0..(m//2-1)
partition (m//2)..(m-1)
# pass 3
partition 0..(m//4-1)
partition (m//4)..(m//2-1)
partition (m//2)..((3*m)//4-1)
partition ((3*m)//4)..(m-1)
etc.
The result is time O(n * log(m)). And the queues will never hold more than 5k values, where k is the largest block size.
And here is the more complex version I started from: an approach that runs in time O(n log(n)), with maximum extra space O(k), where k is the maximum block size.
First, note that shape and offsets are largely redundant, because shape[i] = offsets[i+1] - offsets[i] for all i but the last. So with O(1) extra data (which we already have in values.len()) we can make shape redundant, (ab)use it however we want, and then recalculate it at the end.
So let's start by picking a random permutation of 0..(shape.len()-1) and placing it in shape. This will be where each element will go, and can be found in time O(n) using Fisher-Yates.
Our idea now is to use quicksort to actually get them to the right places.
First, our pivot. For O(n) work in a single pass we can add up the lengths of all blocks which will come before the median block, and also find the length of said median block.
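As a sketch of that single pass (assuming the destination permutation has already been written into shape, called dest here, and lengths holds the original row lengths):

    def pivot_info(dest, lengths):
        # dest[i]    : final row position chosen for block i (the permutation)
        # lengths[i] : number of values in block i
        median_pos = len(dest) // 2
        before = 0          # total values in blocks that land before the median block
        median_len = None   # length of the block that lands at the median position
        for d, length in zip(dest, lengths):
            if d < median_pos:
                before += length
            elif d == median_pos:
                median_len = length
        return before, median_len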
Now quicksort is dependent upon being able to swap things. But we can't do that directly (your whole problem). However the idea is that we'll partition from the middle out. And so the values, shape and offsets arrays will have beginning and ending sections that we haven't gotten to, and a piece in the middle that we've rewritten. Where those sections meet we'll also need to have queues of blocks copied off of the left and right and not yet written back. And, of course, we'll need to have a record of where the boundaries are.
So the idea is this.
set up the data structures.
copy a few blocks in the middle to one of the queues - enough to have a place for the median block to go.
record where the median will go
while have not finished partitioning:
    pick the larger queue (break ties by farthest from its end, then randomly)
    if it is empty:
        read a block into it
    figure out where its next block needs to be written
    copy blocks in its way to the appropriate queue
    write the block out
Here, writing the block out means writing its elements to the right place, setting its offset, and setting its shape entry to the final destination recorded for that block.
This operation will partition around the median block. Recursively repeat to sort each side until every block is in its final position.
And, finally, fix the shape array back to what it was supposed to be.
The time complexity of this is O(n log(n)) for the same reason that quicksort is. As for space complexity, if k is the largest block size, any time the queues get past size 4k then the next block you extract must be able to be written, so they cannot grow any farther. This makes the maximum space used O(k).

Related

Algorithm for seeing if many different arrays are subsets of another one?

Let's say I have an array of ~20-100 integers, for example [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] (actually numbers more like [106511349, 173316561, ...], all nonnegative 64-bit integers under 2^63, but for demonstration purposes let's use these).
And many (~50,000) smaller arrays of usually 1-20 terms to match or not match:
1=[2, 3, 8, 20]
2=[2, 3, NOT 8]
3=[2, 8, NOT 16]
4=[2, 8, NOT 16] (there will be duplicates with different list IDs)
I need to find which of these are subsets of the array being tested. A matching list must have all of the positive matches, and none of the negative ones. So for this small example, I would need to get back something like [3, 4]. List 1 fails to match because it requires 20, and list 2 fails to match because it has NOT 8. The NOT can easily be represented by using the high bit/making the number negative in those cases.
I need to do this quickly up to 10,000 times per second. The small arrays are "fixed" (they change infrequently, like once every few seconds), while the large array is done per data item to be scanned (so 10,000 different large arrays per second).
This has become a bit of a bottleneck, so I'm looking into ways to optimize it.
I'm not sure the best data structures or ways to represent this. One solution would be to turn it around and see what small lists we even need to consider:
2=[1, 2, 3, 4]
3=[1, 2]
8=[1, 2, 3, 4]
16=[3, 4]
20=[1]
Then we'd build up a list of lists to check, and do the full subset matching on these. However, certain terms (often the more frequent ones) are going to end up in many of the lists, so there's not much of an actual win here.
I was wondering if anyone is aware of a better algorithm for solving this sort of problem?
You could try to make a tree with the smaller arrays, since they change less frequently, such that each subtree tries to halve the number of small arrays left.
For example, do frequency analysis on numbers in the smaller arrays. Find which number is found in closest to half of the smaller arrays. Make that the first check in the tree. In your example that would be '3' since it occurs in half the small arrays. Now that's the head node in the tree. Now put all the small lists that contain 3 to the left subtree and all the other lists to the right subtree. Now repeat this process recursively on each subtree. Then when a large array comes in, reverse index it, and then traverse the subtree to get the lists.
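A sketch of picking that first split in Python (the representation of the small arrays as sets of positive terms, and the helper name best_split, are assumptions for illustration):

    from collections import Counter

    def best_split(filters):
        # filters: dict of list-id -> set of (positive) terms in that small array
        counts = Counter(t for terms in filters.values() for t in terms)
        half = len(filters) / 2
        # the term that occurs in a number of small arrays closest to half of them
        term = min(counts, key=lambda t: abs(counts[t] - half))
        left = {fid for fid, terms in filters.items() if term in terms}
        right = set(filters) - left
        return term, left, right

    filters = {1: {2, 3, 8, 20}, 2: {2, 3}, 3: {2, 8}, 4: {2, 8}}
    print(best_split(filters))   # splits on 3: lists {1, 2} vs {3, 4}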
You did not state which of your arrays are sorted - if any.
Since your data is not that big, I would use a hash-map to store the entries of the source set (the one with ~20-100 integers). That would basically let you test whether an integer is present in O(1).
Then, given that 50,000 (arrays) * 20 (terms each) * 8 (bytes per term) = 8 megabytes plus hash-map overhead does not seem large for most systems either, I would use another hash-map to store the tested arrays. This way you don't have to re-test duplicates.
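A minimal sketch of that membership test, assuming each small array is stored as a pair of (required, forbidden) sets:

    def matches(large_values, required, forbidden):
        # large_values: the ~20-100 integers of the item being scanned, as a set
        # required / forbidden: the positive and NOT terms of one small array
        return required <= large_values and not (forbidden & large_values)

    large = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
    print(matches(large, {2, 3, 8, 20}, set()))   # list 1: False (20 missing)
    print(matches(large, {2, 3}, {8}))            # list 2: False (8 is present)
    print(matches(large, {2, 8}, {16}))           # lists 3 and 4: True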
I realize this may be less satisfying from a CS point of view, but if you're doing a huge number of tiny tasks that don't affect each other, you might want to consider parallelizing them (multithreading). 10,000 tasks per second, comparing a different array in each task, should fit the bill; you don't give any details about what else you're doing (e.g., where all these arrays are coming from), but it's conceivable that multithreading could improve your throughput by a large factor.
First, do what you were suggesting; make a hashmap from input integer to the IDs of the filter arrays it exists in. That lets you say "input #27 is in these 400 filters", and toss those 400 into a sorted set. You then have to do an intersection of the sorted sets for each one.
Optional: make a second hashmap from each input integer to its frequency in the set of filters. When an input comes in, sort it using the second hashmap. Then take the least common input integer and start with it, so you have less overall work to do on each step. Also compute the frequencies for the "not" cases, so you basically get the most bang for your buck on each step.
Finally: this could be pretty easily made into a parallel programming problem; if it's not fast enough on one machine, it seems you could put more machines on it pretty easily, if whatever it's returning is useful enough.
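One way to turn the inverted-index idea into code: gather candidate filter IDs from the index, then verify each candidate (the names below are illustrative, and the verification step stands in for the set intersections described above):

    from collections import defaultdict

    def build_index(filters):
        # filters: dict of list-id -> (required set, forbidden set)
        index = defaultdict(set)
        for fid, (required, _forbidden) in filters.items():
            for term in required:
                index[term].add(fid)
        return index

    def matching_ids(large_values, filters, index):
        candidates = set()
        for v in large_values:                 # "input #27 is in these 400 filters"
            candidates |= index.get(v, set())
        return sorted(fid for fid in candidates
                      if filters[fid][0] <= large_values
                      and not (filters[fid][1] & large_values))

    filters = {1: ({2, 3, 8, 20}, set()), 2: ({2, 3}, {8}),
               3: ({2, 8}, {16}), 4: ({2, 8}, {16})}
    index = build_index(filters)
    print(matching_ids({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, filters, index))   # [3, 4]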

Can you do a parallel counting sort in O(n/p) time?

Is it possible to do a counting sort in parallel and achieve O(n/p) runtime?
Take an example where we have an array with millions of elements that range from 1-10. A merge sort will run in no better than O(n log n) time. A counting sort applied to this problem will run in O(n) time. Parallelizing a counting sort could be interesting. If we assign a subarray with n/p elements to each processor and each processor has its own count array of size 10, the initial step where element counts are accumulated should take O(n/p) time. Consolidating all the count arrays into a single array should take O(p) time, as you are only iterating over p count arrays, each of a constant size.
I haven't been able to fully think through the last step in the counting sort where the elements are placed in order. If the elements of the count array are atomic, you could assign n/p sections of the original array to individual processors and achieve some parallelization, but there would be contention at individual elements of the count array, potentially substantially reducing parallelization. If the input array is all 10's, all processors would serialize on the 9th element of the count array, reducing algorithmic efficiency to O(n).
You could assign subarrays of the count array to each of p processors and you are back to O(n/p) runtime, but only if the elements are distributed fairly evenly. And, in our example, you would be limited to 10 processors. If the elements are not distributed evenly, one or more processors could be doing a larger proportion of the work. For instance, if half of the elements in the input array are 10, one processor would have to step through half the array. Worst case, the array is all 10's and a single processor would have to step through the entire array devolving runtime to O(n).
Maybe you could divide individual elements of the count array among multiple processors. For instance, if there are 50 10's in the input array, element 9 of the count array would reflect this. You could have 5 processors write 10 10's each to the proper positions in the output array. This again devolves to O(n) runtime if there are fewer than p elements at each index location of the count array, but it avoids the problem where distribution of element values is uneven.
Is it possible to do a counting sort in O(n/p) time?
Yes, it is possible. Divide your array into p parts of equal length. Then create a counting array c for each process. Let each process count the number of elements in its part and store the counts in its c. This takes O(n/p). Now add all the counting arrays c together and make the combined array shared to all processes. This takes O(p*b), where b is the number of possible values. So far this is exactly your approach. Now you can recreate the array in p processes, since you can calculate the first and last index of each value from c. For each value i, its first index is the sum of the counts of all previous values in c, and its last index is its first index plus c[i]. This calculation can be done in O(i), where i is smaller than b, so it is at most O(b). Each process can now repopulate its own part. This again takes O(n/p). Summing it all up, you have n/p + p*b + b + n/p. If p*b << n, this results in O(2*n/p). (Since 2/p is a constant factor, you are still in the class O(n), but the parallelisation will significantly speed up your algorithm.)
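A sequential Python sketch of that scheme; each phase-1 and phase-3 loop below stands in for what would run on its own processor, and values are assumed to lie in 1..b:

    def parallel_counting_sort(a, p, b):
        n = len(a)
        chunks = [a[i * n // p:(i + 1) * n // p] for i in range(p)]

        # Phase 1 (one loop per process): count each chunk.            O(n/p)
        local_counts = []
        for chunk in chunks:
            c = [0] * (b + 1)
            for x in chunk:
                c[x] += 1
            local_counts.append(c)

        # Phase 2 (shared): add the p count arrays together.           O(p*b)
        total = [0] * (b + 1)
        for c in local_counts:
            for v in range(1, b + 1):
                total[v] += c[v]

        # First index of each value = sum of the counts of all smaller values.
        first = [0] * (b + 2)
        for v in range(1, b + 2):
            first[v] = first[v - 1] + total[v - 1]

        # Phase 3 (one loop per process): each process repopulates the output
        # positions belonging to its own range of values.              O(n/p) if balanced
        out = [0] * n
        values_per_proc = (b + p - 1) // p
        for proc in range(p):
            lo = 1 + proc * values_per_proc
            hi = min(b, lo + values_per_proc - 1)
            for v in range(lo, hi + 1):
                for pos in range(first[v], first[v] + total[v]):
                    out[pos] = v
        return out

    print(parallel_counting_sort([3, 1, 10, 2, 3, 10, 1, 7], p=2, b=10))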

Looking for a limited shuffle algorithm

I have a shuffling problem. There are lots of pages and discussions about shuffling an array of values completely, like a stack of cards.
What I need is a shuffle that will uniformly displace the array elements at most N places away from its starting position.
That is, if N is 2 then element i will be shuffled at most to a position from i-2 to i+2 (within the bounds of the array).
This has proven to be tricky, with some simple solutions resulting in a directional bias in the element movement, or in movement by a non-uniform amount.
You're right, this is tricky! First, we need to establish some more rules, to ensure we don't create artificially non-random results:
Elements can be left in the position they started in. This is a necessary part of any fair shuffle, and also ensures our shuffle will work for N=0.
When N is larger than an element's distance from the start or end of the array, it's allowed to be moved to the other side. We could tweak the algorithm to forbid this, but it would violate the "uniformly" requirement - elements near either end would be more likely to stay put than elements near the middle.
Now we can actually solve the problem.
Generate an array of random values in the range i + [-N, N], where i is the current index in the array. Normalize values outside the array bounds (e.g. -1 should become length-1 and length should become 0).
Look for pairs of duplicate values (collisions) in the array, and recompute them. You have a few options:
Recompute both values until they don't collide with each other; they could both still collide with other values.
Recompute just one until it doesn't collide with the other; the first value could still collide, but the second should now be unique, which might mean fewer calls to the RNG.
Identify the set of available indices for each collision (e.g. in [3, 1, 1, 0] index 2 is available), pick a random value from that set, and set one of the colliding array values to the selected result. This avoids needing to loop until the collision is resolved, but is more complex to code and risks running into a case where the set is empty.
However you address individual collisions, repeat the process until every value in the array is unique.
Now move each element in the original array to the index specified in the array we generated.
I'm not sure how best to implement step 2; I'd suggest you benchmark it. If you don't want to take the time to benchmark, I'd go with the first option. The others are optimizations that might be faster, but might actually end up being slower.
This solution has an unbounded runtime in theory, but should terminate reasonably quickly in practice. Again, benchmark and test it before using it anywhere critical.
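A minimal Python sketch of those steps, using the simple "re-roll one of each colliding pair" strategy (the function name limited_shuffle is mine):

    import random

    def limited_shuffle(a, n):
        length = len(a)
        # Step 1: each index i picks a target in [i - n, i + n], wrapping
        # around the ends of the array as described above.
        targets = [(i + random.randint(-n, n)) % length for i in range(length)]

        # Step 2: re-roll one member of each colliding pair, and repeat the
        # whole pass until every target is unique.
        while len(set(targets)) != length:
            seen = set()
            for i, t in enumerate(targets):
                if t in seen:
                    targets[i] = (i + random.randint(-n, n)) % length
                else:
                    seen.add(t)

        # Step 3: move each element to the index chosen for it.
        out = [None] * length
        for i, t in enumerate(targets):
            out[t] = a[i]
        return out

    print(limited_shuffle(list(range(10)), 2))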
Here is one possible solution I have come up with, though I am not certain how 'naive' it is, especially at the edges (the far edge in particular).
Create an array of N boolean flags (representing elements ahead of the current position that have already been swapped).
At each index, check whether it has already been swapped (according to the first element of the flags array); if so, move on to the next index (see below).
Rotate the flags array, deleting the first element (which represents this element) and appending a new 'not swapped' element at the end. Aside: this may be done using a modulus array lookup, to avoid having to actually move the array contents, especially for large N.
Loop:
    pick a number from 0 to N (or less than N, if N plus the current index runs past the end of the array being shuffled).
    If 0, the element swaps with itself; move on to the next index.
    Otherwise, if that element is marked as swapped, loop and try again.
    Note there are always 2 elements in the flags array that can be picked: the element itself and the last one (unless we are close to the end of the array being shuffled).
Swap the current element with the selected unswapped element, mark the selected element as swapped in the flags array, and loop on to the next element.

Quicksort: pivot position after one partition

I am reading about quicksort, looking at different implementations and I am trying to wrap my head around something.
In this implementation (which of course works), the pivot is chosen as the middle element, and then the left and right pointers move to the right and left accordingly, swapping elements to partition around the pivot.
I was trying the array [4, 3, 2, 6, 8, 1, 0].
On the first partition, pivot is 6 and all the left elements are already smaller than 6, so the left pointer will stop at the pivot. On the right side, we will swap 0 with 6, and then 1 and 8, so at the end of the first iteration, the array will look like:
[4, 3, 2, 0, 1, 8, 6].
However, I was under the impression that after each iteration in quicksort, the pivot ends up in its rightful place, so here it should end up in position 5 of the array.
So, is it possible (and OK) that the pivot doesn't end up in its correct place after its iteration, or is it something obvious I am missing?
There are many possible variations of the quicksort algorithm. In this one it is OK for the pivot not to end up in its correct place after its iteration.
The defining feature of every variation of the quicksort algorithm is that after the partition step, we have a part in the beginning of the array, where all the elements are less or equal to pivot, and a non-overlapping part in the end of the array where all the elements are greater or equal to pivot. There may also be a part between them, where every element is equal to pivot. This layout ensures, that after we sort the left part and the right part with recursive calls, and leave the middle part intact, the whole array will be sorted.
Notice, that in general elements equal to pivot may go to any part of the array. A good implementation of quicksort, that avoids quadratic time for the most obvious case, i.e. all equal elements, must spread elements equal to pivot between parts rationally.
Possible variants include:
The middle part includes only 1 element: the pivot. In that case pivot takes its final place in the array after the partition and won't be used in the recursive calls. That's what you meant by pivot taking its place in its iteration. For this approach the good implementation must move about half the elements equal to pivot to the left part and the other half to the right part, otherwise we would have quadratic time for an array with all equal elements.
There is no middle part. Pivot and all elements equal to it are spread between the left and the right part. That's what the implementation you linked does. Once again, in this approach about half of the elements equal to pivot should go to the left part, and the other half to the right part. This can also be mixed with the first variation, depending on whether we are sorting an array with an odd or an even number of elements.
Every element equal to pivot goes to the middle part. There are no elements equal to pivot in either left or right part. That's quite efficient and that's the example Wikipedia gives for solving the all-elements-equal problem. Arrays with all elements equal to each other are sorted in linear time in that case.
Thus, the correct and efficient implementation of quicksort is quite tricky (there is also a problem of choosing a good pivot, for which several approaches with different tradeoffs exist as well; or an optimisation of switching to another non-recursive sorting algorithm for smaller sub-array sizes).
Also, it seems that the implementation you linked to may do recursive calls on overlapping subarrays:
if (i <= j) {
    exchange(i, j);
    i++;
    j--;
}
For example, when i is equal to j, those elements will be swapped, and i will become greater than j by 2. After that 3 elements will overlap between the ranges of the following recursive calls. The code still seems to work correctly though.
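For reference, here is a generic Hoare-style partition in Python (a sketch in the spirit of the code discussed, not the linked implementation itself), run on the array from the question; the pivot 6 ends up at index 6, not at its final sorted position 5:

    def partition(a, lo, hi):
        # Partition a[lo..hi] around the middle element, moving a left and a
        # right pointer towards each other and swapping out-of-place elements.
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        return i, j   # recurse on a[lo..j] and a[i..hi]

    a = [4, 3, 2, 6, 8, 1, 0]
    print(partition(a, 0, len(a) - 1))   # (5, 4)
    print(a)                             # [4, 3, 2, 0, 1, 8, 6]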

How to compare two elements in a list in constant time

Suppose I have a list of elements like [5, 3, 1, 2, 4], and I want to compare two elements by position. Whichever comes first in the list is greater, or true. So:
compare(5, 3) # true
compare(2, 1) # false
compare(3, 4) # true
How can I do that in constant time? One way I thought of doing this was using maps, where the key is the element and the value is the position in the list:
order = {5: 0, 3: 1, 1: 2, 2: 3, 4: 4}
Then we have amortized O(1) time, but this will be O(N) space. Does anyone have a more elegant solution?
Your map idea looks pretty good. The fact that a map is O(N) for memory shouldn't be a problem, because you can't get less than O(N) unless you use compression techniques (a list is O(N) as well).
Also, since the map stores the indices of each element, you could forget about the original list, and just use the map. That is, unless you need the list for some reason. Even if you need to insert an element into the middle (say at position 3), you can update the map in linear time just by iterating over the elements and incrementing the necessary indices.
So, the map looks to be just as efficient a solution as the list for basic operations, with the added awesomeness of an O(1) compare function. As for elegance, the map is pretty hard to beat since it doesn't require any extra work beyond what's described here.
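A short Python sketch of that map-based approach, using the list from the question:

    def build_order(ls):
        # map each element to its position; assumes elements are unique,
        # as in the example
        return {x: i for i, x in enumerate(ls)}

    order = build_order([5, 3, 1, 2, 4])

    def compare(a, b):
        # "whichever comes first in the list is greater", i.e. has the smaller index
        return order[a] < order[b]

    print(compare(5, 3), compare(2, 1), compare(3, 4))   # True False True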
If you have to start from a list and you are only doing the operation a few times, use:
def compare(ls, n1, n2):
    return ls.index(n1) < ls.index(n2)
If you can choose the representation beforehand, or if you will need to do this many times with the same list, do what you did with the dictionary.
Remember also that the list uses O(N) space, so the addition of O(N) space of the dictionary is no big deal.
Correction: the compare function above actually takes O(N) time, not O(1), because ls.index does a linear scan; dictionary lookups, by contrast, are O(1) on average. See http://wiki.python.org/moin/TimeComplexity
