Unbiased Shuffling with Large Number of Duplicates

The Fisher–Yates algorithm generates unbiased random permutations of a finite sequence. The running time is proportional to the number of elements being shuffled.
I want to shuffle a few non-zero elements with a large number of zero elements.
Implementing the Fisher–Yates algorithm over the full list would make the shuffle take too long and require too much storage: most steps would simply swap one duplicate zero element with another.
Does there exist a random shuffle (or alternative) algorithm that:
Leads to unbiased permutations
Does not require the shuffling and storing of all duplicate elements

Since a Fisher-Yates shuffle produces a random permutation, its inverse is also a random permutation:
For i = 1 to n-1:
    choose a random number j in [0, i]
    swap elements i and j
In this algorithm, though, if you have m non-zero elements, and you start with all of them at the end, then the first n-m iterations are guaranteed to be swapping zeros, so you can just skip those.
Use a hash map instead of an array if you want to avoid storing all the zero elements.
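Here is a minimal Python sketch of that idea, assuming the m non-zero values conceptually occupy the last m slots of the array; the names sparse_shuffle, nonzero, and n are illustrative, and a dict stands in for the array, tracking only the indices the loop has touched:

    import random

    def sparse_shuffle(nonzero, n):
        # Uniformly place the m values of `nonzero` into a length-n sequence
        # of zeros without materializing the zeros: run the ascending
        # Fisher-Yates loop above, skipping the first n-m iterations.
        m = len(nonzero)
        slots = {}  # index -> current value, only for indices touched so far

        def value_at(idx):
            # Untouched indices still hold their initial value: a non-zero
            # value if the index falls in the last m slots, otherwise zero.
            if idx in slots:
                return slots[idx]
            return nonzero[idx - (n - m)] if idx >= n - m else 0

        for i in range(n - m, n):
            j = random.randint(0, i)  # swap slot i with a random j in [0, i]
            slots[i], slots[j] = value_at(j), value_at(i)
        # Return the result sparsely: only positions holding non-zero values.
        return {idx: v for idx, v in slots.items() if v != 0}

Each iteration touches at most two indices, so both the time and the extra storage are O(m) rather than O(n).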

Related

QuickSelect - Print k smallest elements of array A of size n in O(n) time and O(5k) auxiliary space - read only once

I've been trying to solve this problem:
The Select algorithm allows us to find, in a given array A, the i-th smallest element in linear time (O(n)), but it requires us to keep A in memory throughout the entire algorithm.
Suggest an algorithm which receives an array of size n, named A, which contains natural numbers, and prints the k smallest elements of A with the following restrictions:
For each i=1...n, you are only allowed to read the value A[i] once. You are not allowed to write into, or exchange between elements of A.
You are allowed to use a second array of size 5k, named B, which can be written and read without restrictions.
Run-time must be linear, by size of A. Assume k<n, and that 5k<n.
I realized that I need to use a Median of Medians approach, but I'm having a hard time thinking about how the pivot would be calculated, as I can only store 5k elements.
This means I cannot compute the Median of Medians that would guarantee the 70%/30% pivot split, and so won't reach a linear run-time.
I would appreciate any input in the matter.
Thanks!
Start copying from A into the auxiliary array. Whenever you have collected 2k elements, use quickselect to keep the smallest k elements and discard the rest.
Finally, call quickselect once more to discard all but the smallest k elements in the remainder.
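A minimal Python sketch of this buffer-and-halve strategy, using randomized quickselect (expected linear time) in place of the worst-case-linear median-of-medians select; each element of the stream is read exactly once, and the buffer never exceeds 2k elements, well within the 5k budget:

    import random

    def partition(a, lo, hi, pivot_idx):
        # Standard Lomuto partition around a[pivot_idx]; returns the pivot's
        # final slot.
        a[pivot_idx], a[hi] = a[hi], a[pivot_idx]
        pivot, store = a[hi], lo
        for i in range(lo, hi):
            if a[i] < pivot:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[hi] = a[hi], a[store]
        return store

    def smallest_k(a, k):
        # Rearrange a so its first k slots hold the k smallest values
        # (randomized quickselect, expected O(len(a)) time).
        lo, hi = 0, len(a) - 1
        while lo < hi:
            p = partition(a, lo, hi, random.randint(lo, hi))
            if p == k - 1:
                break
            lo, hi = (p + 1, hi) if p < k - 1 else (lo, p - 1)
        return a[:k]

    def k_smallest_stream(stream, k):
        buf = []
        for x in stream:  # each A[i] is read exactly once
            buf.append(x)
            if len(buf) == 2 * k:
                buf = smallest_k(buf, k)  # keep k, discard k
        return sorted(smallest_k(buf, min(k, len(buf))))

Each flush costs expected O(k) and consumes k fresh elements of the stream, so there are about n/k flushes and the whole pass is expected O(n).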

Algorithm to calculate permutations

I'm aware of Heap's algorithm for calculating the permutations of a given sequence, but what if I wanted to calculate the permutations of every k-element subset of a given sequence of N elements?
The solution I'm thinking of this time is a backtracking one, but it would need to generate a new sequence of sub-elements each time, deleting one and recursively calling the permutation function. That sounds expensive, and I would like to know if there's a better solution.
1. Use an algorithm to generate combinations of size K from the set of N (pick any from the SO question: Algorithm to return all combinations of k elements from n).
2. Using the result, apply Heap's Algorithm to create all permutations of this k-element subset (or another algorithm that generates all possible permutations of a list).
3. Generate the next subset of size K and repeat steps 1 and 2 until all subsets of size K have been enumerated.
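A short Python sketch of these steps; itertools.permutations(seq, k) collapses both stages into a single call, but the two-stage version below follows the recipe directly:

    from itertools import combinations

    def heaps_permutations(seq):
        # Heap's algorithm (iterative form): each successor differs from the
        # previous permutation by a single swap.
        a, n = list(seq), len(seq)
        c = [0] * n
        yield tuple(a)
        i = 0
        while i < n:
            if c[i] < i:
                k = 0 if i % 2 == 0 else c[i]  # even i swaps slot 0, odd i slot c[i]
                a[k], a[i] = a[i], a[k]
                yield tuple(a)
                c[i] += 1
                i = 0
            else:
                c[i] = 0
                i += 1

    def k_subset_permutations(seq, k):
        # Step 1: each k-element combination; step 2: all of its permutations.
        for subset in combinations(seq, k):
            yield from heaps_permutations(subset)

This avoids the repeated sequence-rebuilding of the backtracking approach: each new permutation costs a single swap, and the combination generator only advances after the current subset is exhausted.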

How to generate random permutations fast

I read a question in an algorithm book:
"Given a positive integer n, choose 100 random permutations of [1,2,...,n],..."
I know how to generate random permutations with Knuth's algorithm, but do there exist any fast algorithms for generating a large number of permutations?
Knuth shuffles require you to do n random swaps for a permutation of n elements (see http://en.wikipedia.org/wiki/Random_permutation#Knuth_shuffles), so the complexity is O(n), which is about the best you can expect if you need a full permutation of n elements.
Is this causing you a practical problem? If so, perhaps you could look at what you are doing with all those permutations in practice. Apart from simply getting by on fewer, you could think about deferring the generation of a permutation until you are sure you need it. If you need a permutation of n objects but only look at k of those n objects, perhaps you need a scheme for generating only those k elements. For small k, you could simply generate k random numbers in the range [0, n), redrawing whenever a number has already come up. For small k, such repeats would be unlikely.
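A tiny sketch of that rejection idea in Python (the standard library's random.sample(range(n), k) already does this job, so the function below is purely illustrative):

    import random

    def k_distinct(n, k):
        # Pick k distinct indices from [0, n) by redrawing on repeats;
        # assumes k <= n, and is fast when k << n since collisions are rare.
        seen = set()
        while len(seen) < k:
            seen.add(random.randrange(n))  # duplicates are simply absorbed
        return list(seen)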
There exist N! permutations of the numbers from 1 to N. If you sort them lexicographically, as in a dictionary, it's possible to construct a permutation knowing only its position in the sorted list.
For example, let N=3; the lexicographically sorted list of permutations is {123,132,213,231,312,321}. You generate a number between 1 and 3!, for example 5. The 5-th permutation is 312. How do you construct it?
Let's find the first number of the 5-th permutation. Divide the permutations into blocks by their first number, i.e. the groups {123,132}, {213,231}, {312,321}. Each group contains (n-1)! elements, and the block number gives the first number of the permutation. The 5-th permutation is in block ceil(5/(3-1)!) = 3. So, we've just found the first number of the 5-th permutation: it's 3.
Now I'm looking not for the 5-th but for the (5-(n-1)!*(ceil(5/2)-1)) = 5-2*2 = 1-st permutation within
{3,1,2},{3,2,1}. The 3 is determined and the same for all group members, so I'm actually searching for the 1-st permutation in {1,2},{2,1}, and N is now 2. Again, next_num = ceil(1/(new_N-1)!) = 1.
Continue this N times.
Hope you got the idea. Complexity is O(N), because you construct the permutation's elements one by one with arithmetic tricks.
UPDATE
When you obtain the next number by these arithmetic operations, you should also keep an array of used values, and instead of X take the X-th unused value. The complexity becomes O(N log N), because O(log N) is needed to find the X-th unused element (for example, with a balanced BST or a Fenwick tree over the unused values).
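Here is a compact Python sketch of this construction. It uses a plain list for the unused values, so removal is O(N) per step and the whole thing is O(N^2); swapping the list for a balanced BST or Fenwick tree over the unused values gives the O(N log N) bound mentioned in the update:

    from math import factorial

    def unrank_permutation(n, r):
        # Return the r-th (1-based) permutation of [1..n] in lexicographic
        # order, choosing one element per step as described above.
        r -= 1                            # work with a 0-based rank
        unused = list(range(1, n + 1))
        perm = []
        for i in range(n, 0, -1):
            block = factorial(i - 1)      # permutations per leading-element block
            idx, r = divmod(r, block)     # block number, and rank inside it
            perm.append(unused.pop(idx))  # take the idx-th unused value
        return perm

    print(unrank_permutation(3, 5))  # -> [3, 1, 2], matching the example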

Efficiently generating all possible permutations of a linked list?

There are many algorithms for generating all possible permutations of a given set of values. Typically, those values are represented as an array, which has O(1) random access.
Suppose, however, that the elements to permute are represented as a doubly-linked list. In this case, you cannot randomly access elements in the list in O(1) time, so many permutation algorithms will experience an unnecessary slowdown.
Is there an algorithm for generating all possible permutations of a linked list with as little time and space overhead as possible?
Try to think of how you generate all permutations on a piece of paper.
You start from the rightmost number and go one position to the left until you see a number that is smaller than its right neighbour. Then you place there the next number in value from those after it, and order all the remaining numbers in increasing order after it. Do this until there is nothing more to do. Put a little thought into it and you can produce each successor in time linear in the number of elements.
This in fact is the typical algorithm used for next permutation, as far as I know. I see no reason why this would be faster on an array than on a linked list.
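A sketch of that procedure in Python; every step is a sequential scan, a single swap, or a reversal of an adjacent range, which is why it translates directly to a doubly linked list:

    def next_permutation(a):
        # Transform list a into its lexicographic successor in place;
        # return False if a was already the last (descending) permutation.
        i = len(a) - 2
        while i >= 0 and a[i] >= a[i + 1]:  # find rightmost ascent a[i] < a[i+1]
            i -= 1
        if i < 0:
            return False
        j = len(a) - 1
        while a[j] <= a[i]:                 # rightmost value exceeding a[i]
            j -= 1
        a[i], a[j] = a[j], a[i]
        a[i + 1:] = reversed(a[i + 1:])     # smallest possible tail
        return True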
You might want to look into the Steinhaus–Johnson–Trotter algorithm. It generates all permutations of a sequence only by swapping adjacent elements; something which you can do in O(1) in a doubly linked list.
You could read the linked list's data into an array, which takes O(n), and then use Heap's permutation algorithm (http://www.geekviewpoint.com/java/numbers/permutation) to find all the permutations.
You can use the Factoradic Permutation Algorithm, and rearrange the node pointers accordingly to generate the resulting permutation in place without recursion.
A pseudo description:
element_to_permute = list
temp_list = new empty list
for i = 1 to n!:
    indexes[] = factoradic(i)
    for j in indexes[]:
        rearrange pointers of node `indexes[j]` of `element_to_permute` in `temp_list`
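A rough Python sketch of the factoradic step, with a plain list standing in for the node pointers (the linked-list version would unlink the chosen node rather than pop it):

    def factoradic(r, n):
        # Digits of rank r in the factorial number system, most significant
        # first, padded to n digits; digit k indexes the next node to take.
        digits = []
        for base in range(1, n + 1):
            r, d = divmod(r, base)
            digits.append(d)
        return digits[::-1]

    def permutation_by_rank(nodes, r):
        # Rebuild the r-th (0-based) permutation by repeatedly removing the
        # digit-th remaining element, as the pseudocode above describes.
        remaining = list(nodes)
        return [remaining.pop(d) for d in factoradic(r, len(remaining))]

    # permutation_by_rank(['a', 'b', 'c'], 5) -> ['c', 'b', 'a']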

Find a number with even number of occurrences

Given an array in which the number of occurrences of each number is odd, except for one number whose number of occurrences is even, find the number with the even number of occurrences.
e.g.
1, 1, 2, 3, 1, 2, 5, 3, 3
Output should be:
2
These are the constraints:
Numbers are not restricted to any fixed range.
Do it in-place.
Required time complexity is O(N).
Array may contain negative numbers.
Array is not sorted.
With the above constraints, all my ideas failed: comparison-based sorting, counting sort, BSTs, hashing, brute force.
I am curious to know: Will XORing work here? If yes, how?
This problem has been occupying my subway rides for several days. Here are my thoughts.
If A. Webb is right and this problem comes from an interview or is some sort of academic problem, we should think about the (wrong) assumptions we are making, and maybe try to explore some simple cases.
The two extreme subproblems that come to mind are the following:
The array contains two values: one of them is repeated an even number of times, and the other is repeated an odd number of times.
The array contains n-1 different values: all values are present once, except one value that is present twice.
Maybe we should split into cases by the number of different values.
If we suppose that the number of different values is O(1), each array would have m different values, with m independent of n. In this case, we could loop through the original array, counting and erasing the occurrences of each value. In the example this would give
1, 1, 2, 3, 1, 2, 5, 3, 3 -> First value is 1 so count and erase all 1
2, 3, 2, 5, 3, 3 -> Second value is 2, count and erase
-> Stop because 2 was found an even number of times.
This would solve the first extreme example with a complexity of O(mn), which evaluates to O(n).
We can do better: if the number of different values is O(1), we could count value appearances in a hash map, go through the map after reading the whole array, and return the value that appears an even number of times. This would still count as O(1) memory.
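A sketch of that hash-map count in Python; with O(1) distinct values, the map itself stays O(1) in size:

    from collections import Counter

    def even_occurrence(arr):
        # Count every value, then return the one with an even count.
        for value, count in Counter(arr).items():
            if count % 2 == 0:
                return value

    print(even_occurrence([1, 1, 2, 3, 1, 2, 5, 3, 3]))  # -> 2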
The second extreme case consists of finding the only repeated value in an array.
This seems impossible in O(n), but there are special cases where we can do it: if the array has n elements and the values are {1, ..., n-1} plus one repeated value (or some variant, like all numbers between x and y). In this case we sum all the values, subtract n(n-1)/2 from the sum, and obtain the repeated value.
Solving the second extreme case with arbitrary values in the array, or the general case where m is not constant in n, in constant memory and O(n) time seems impossible to me.
Extra note: here, XORing doesn't work because the number we want appears an even number of times and others appear an odd number of times. If the problem was "give the number that appears an odd number of times, all other numbers appear an even number of times" we could XOR all the values and find the odd one at the end.
We could try to look for a method using this logic: we would need something like a function that, applied an odd number of times to a number, yields 0, and applied an even number of times is the identity. I don't think this is possible.
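For contrast, here is a sketch of the XOR trick on the complementary problem (one value with an odd count, all others even), the case where the cancellation actually works:

    from functools import reduce
    from operator import xor

    def odd_occurrence(arr):
        # Pairs of equal values cancel under XOR, leaving the odd-count one.
        return reduce(xor, arr)

    print(odd_occurrence([4, 7, 4, 7, 4]))  # -> 4 (three 4s, two 7s)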
Introduction
Here is a possible solution. It is rather contrived and not practical, but then, so is the problem. I would appreciate any comments if I have holes in my analysis. If this was a homework or challenge problem with an “official” solution, I’d also love to see that if the original poster is still about, given that more than a month has passed since it was asked.
First, we need to flesh out a few ill-specified details of the problem. Time complexity required is O(N), but what is N? Most commentators appear to be assuming N is the number of elements in the array. This would be okay if the numbers in the array were of fixed maximum size, in which case Michael G’s solution of radix sort would solve the problem. But, I interpret constraint #1, in absence of clarification by the original poster, as saying the maximum number of digits need not be fixed. Therefore, if n (lowercase) is the number of elements in the array, and m the average length of the elements, then the total input size to contend with is mn. A lower bound on the solution time is O(mn) because this is the read-through time of the input needed to verify a solution. So, we want a solution that is linear with respect to total input size N = nm.
For example, we might have n = m, that is sqrt(N) elements of sqrt(N) average length. A comparison sort would take O( log(N) sqrt(N) ) < O(N) operations, but this is not a victory, because the operations themselves on average take O(m) = O(sqrt(N)) time, so we are back to O( N log(N) ).
Also, a radix sort would take O(mn) = O(N) if m were the maximum length instead of average length. The maximum and average length would be on the same order if the numbers were assumed to fall in some bounded range, but if not we might have a small percentage with a large and variable number of digits and a large percentage with a small number of digits. For example, 10% of the numbers could be of length m^1.1 and 90% of length m*(1-10%*m^0.1)/90%. The average length would be m, but the maximum length m^1.1, so the radix sort would be O(m^1.1 n) > O(N).
Lest there be any concern that I have changed the problem definition too dramatically, my goal is still to describe an algorithm with time complexity linear to the number of elements, that is O(n). But, I will also need to perform operations of linear time complexity on the length of each element, so that on average over all the elements these operations will be O(m). Those operations will be multiplication and addition needed to compute hash functions on the elements and comparison. And if indeed this solution solves the problem in O(N) = O(nm), this should be optimal complexity as it takes the same time to verify an answer.
One other detail omitted from the problem definition is whether we are allowed to destroy the data as we process it. I am going to do so for the sake of simplicity, but I think with extra care it could be avoided.
Possible Solution
First, the constraint that there may be negative numbers is an empty one. With one pass through the data, we will record the minimum element, z, and the number of elements, n. On a second pass, we will add (3-z) to each element, so the smallest element is now 3. (Note that a constant number of numbers might overflow as a result, so we should do a constant number of additional passes through the data first to test these for solutions.) Once we have our solution, we simply subtract (3-z) to return it to its original form. Now we have available three special marker values 0, 1, and 2, which are not themselves elements.
Step 1
Use the median-of-medians selection algorithm to determine the 90th-percentile element, p, of the array A and partition the array into two sets S and T, where S has the 10% of the n elements greater than p and T has the elements less than p. This takes O(n) steps (with each step taking O(m) on average), for O(N) total time. Elements matching p could be placed into either S or T, but for the sake of simplicity, run through the array once, test for p, and eliminate it by replacing it with 0. Set S initially spans indexes 0..s, where s is about 10% of n, and set T spans the remaining 90% of indexes, s+1..n.
Step 2
Now we are going to loop through i in 0..s and for each element e_i we are going to compute a hash function h(e_i) into s+1..n. We’ll use universal hashing to get uniform distribution. So, our hashing function will do multiplication and addition and take linear time on each element with respect to its length.
We’ll use a modified linear probing strategy for collisions:
1. h(e_i) is occupied by a member of T (meaning A[h(e_i)] < p but is not a marker 1 or 2) or is 0. This is a hash table miss. Insert e_i by swapping the elements in slots i and h(e_i).
2. h(e_i) is occupied by a member of S (meaning A[h(e_i)] > p) or by a marker 1 or 2. This is a hash table collision. Do linear probing until encountering either a duplicate of e_i, a member of T, or a 0.
If we find a member of T or a 0, this is again a hash table miss, so insert e_i as in (1.) by swapping it into that slot.
If we find a duplicate of e_i, this is a hash table hit. Examine the next element. If that element is 1 or 2, we've seen e_i more than once already; change 1s into 2s and vice versa to track the change in parity. If the next element is not 1 or 2, then we've only seen e_i once before, and we want to store a 2 in the next element to indicate we've now seen e_i an even number of times. We look for the next "empty" slot, that is, one occupied by a member of T (which we'll move to slot i) or a 0, and shift the elements between h(e_i)+1 and that slot down by one, so that we have room next to h(e_i) to store our parity information. Note that we do not need to store e_i itself again, so we've used up no extra space.
So basically we have a functional hash table with 9-fold as many slots as elements we wish to hash. Once we start getting hits, we begin storing parity information as well, so we may end up with only a 4.5-fold number of slots, which is still a very low load factor. Several collision strategies could work here, but since our load factor is low, the average number of collisions should also be low, and linear probing should resolve them with suitable time complexity on average.
Step 3
Once we finished hashing elements of 0..s into s+1..n, we traverse s+1..n. If we find an element of S followed by a 2, that is our goal element and we are done. Any element e of S followed by another element of S indicates e was encountered only once and can be zeroed out. Likewise e followed by a 1 means we saw e an odd number of times, and we can zero out the e and the marker 1.
Rinse and Repeat as Desired
If we have not found our goal element, we repeat the process. Our 90th percentile partition will move the 10% of n remaining largest elements to the beginning of A and the remaining elements, including the empty 0-marker slots to the end. We continue as before with the hashing. We have to do this at most 10 times as we process 10% of n each time.
Concluding Analysis
Partitioning via the median-of-medians algorithm has time complexity O(N), which we do 10 times, still O(N). Each hash operation takes O(1) on average since the hash table load is low, and there are O(n) hash operations in total (about 10% of n for each of the 10 repetitions). Each of the n elements has a hash function computed for it, with time complexity linear in its length, so on average over all the elements this is O(m). Thus, the hashing operations in aggregate are O(mn) = O(N). So, if I have analyzed this properly, the whole algorithm is O(N)+O(N) = O(N). (It is also O(n) if the operations of addition, multiplication, comparison, and swapping are assumed to be constant time with respect to the input.)
Note that this algorithm does not utilize the special nature of the problem definition that only one element has an even number of occurrences. That we did not utilize this special nature of the problem definition leaves open the possibility that a better (more clever) algorithm exists, but it would ultimately also have to be O(N).
See the following article: Sorting algorithm that runs in time O(n) and also sorts in place.
Assuming that the maximum number of digits is constant, we can sort the array in place in O(n) time.
After that, it is a matter of counting each number's appearances, which takes on average n/2 steps to find the one number whose number of occurrences is even.
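Once the array is sorted, a single pass over the runs of equal values finds the even one; a sketch of that final counting step (the in-place O(n) sort itself is what the linked question covers):

    def even_run(sorted_arr):
        # Scan maximal runs of equal values; return the value whose run
        # length is even.
        i, n = 0, len(sorted_arr)
        while i < n:
            j = i
            while j < n and sorted_arr[j] == sorted_arr[i]:
                j += 1
            if (j - i) % 2 == 0:
                return sorted_arr[i]
            i = j

    print(even_run(sorted([1, 1, 2, 3, 1, 2, 5, 3, 3])))  # -> 2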
