Is it possible to rearrange an array in place in O(N)? - algorithm

If I have a size N array of objects, and I have an array of unique numbers in the range 1...N, is there any algorithm to rearrange the object array in-place in the order specified by the list of numbers, and yet do this in O(N) time?
Context: I am doing a quick-sort-ish algorithm on objects that are fairly large in size, so it would be faster to do the swaps on indices than on the objects themselves, and only move the objects in one final pass. I'd just like to know if I could do this last pass without allocating memory for a separate array.
Edit: I am not asking how to do a sort in O(N) time, but rather how to do the post-sort rearranging in O(N) time with O(1) space. Sorry for not making this clear.

I think this should do:
static <T> void arrange(T[] data, int[] p) {
boolean[] done = new boolean[p.length];
for (int i = 0; i < p.length; i++) {
if (!done[i]) {
T t = data[i];
for (int j = i;;) {
done[j] = true;
if (p[j] != i) {
data[j] = data[p[j]];
j = p[j];
} else {
data[j] = t;
break;
}
}
}
}
}
Note: This is Java. If you do this in a language without garbage collection, be sure to delete done.
If you care about space, you can use a BitSet for done. I assume you can afford an additional bit per element because you seem willing to work with a permutation array, which is several times that size.
This algorithm copies instances of T n + k times, where k is the number of cycles in the permutation. You can reduce this to the optimal number of copies by skipping those i where p[i] = i.

The approach is to follow the "permutation cycles" of the permutation, rather than indexing the array left-to-right. But since you do have to begin somewhere, everytime a new permutation cycle is needed, the search for unpermuted elements is left-to-right:
// Pseudo-code
N : integer, N > 0 // N is the number of elements
swaps : integer [0..N]
data[N] : array of object
permute[N] : array of integer [-1..N] denoting permutation (used element is -1)
next_scan_start : integer;
next_scan_start = 0;
while (swaps < N )
{
// Search for the next index that is not-yet-permtued.
for (idx_cycle_search = next_scan_start;
idx_cycle_search < N;
++ idx_cycle_search)
if (permute[idx_cycle_search] >= 0)
break;
next_scan_start = idx_cycle_search + 1;
// This is a provable invariant. In short, number of non-negative
// elements in permute[] equals (N - swaps)
assert( idx_cycle_search < N );
// Completely permute one permutation cycle, 'following the
// permutation cycle's trail' This is O(N)
while (permute[idx_cycle_search] >= 0)
{
swap( data[idx_cycle_search], data[permute[idx_cycle_search] )
swaps ++;
old_idx = idx_cycle_search;
idx_cycle_search = permute[idx_cycle_search];
permute[old_idx] = -1;
// Also '= -idx_cycle_search -1' could be used rather than '-1'
// and would allow reversal of these changes to permute[] array
}
}

Do you mean that you have an array of objects O[1..N] and then you have an array P[1..N] that contains a permutation of numbers 1..N and in the end you want to get an array O1 of objects such that O1[k] = O[P[k]] for all k=1..N ?
As an example, if your objects are letters A,B,C...,Y,Z and your array P is [26,25,24,..,2,1] is your desired output Z,Y,...C,B,A ?
If yes, I believe you can do it in linear time using only O(1) additional memory. Reversing elements of an array is a special case of this scenario. In general, I think you would need to consider decomposition of your permutation P into cycles and then use it to move around the elements of your original array O[].
If that's what you are looking for, I can elaborate more.
EDIT: Others already presented excellent solutions while I was sleeping, so no need to repeat it here. ^_^
EDIT: My O(1) additional space is indeed not entirely correct. I was thinking only about "data" elements, but in fact you also need to store one bit per permutation element, so if we are precise, we need O(log n) extra bits for that. But most of the time using a sign bit (as suggested by J.F. Sebastian) is fine, so in practice we may not need anything more than we already have.

If you didn't mind allocating memory for an extra hash of indexes, you could keep a mapping of original location to current location to get a time complexity of near O(n). Here's an example in Ruby, since it's readable and pseudocode-ish. (This could be shorter or more idiomatically Ruby-ish, but I've written it out for clarity.)
#!/usr/bin/ruby
objects = ['d', 'e', 'a', 'c', 'b']
order = [2, 4, 3, 0, 1]
cur_locations = {}
order.each_with_index do |orig_location, ordinality|
# Find the current location of the item.
cur_location = orig_location
while not cur_locations[cur_location].nil? do
cur_location = cur_locations[cur_location]
end
# Swap the items and keep track of whatever we swapped forward.
objects[ordinality], objects[cur_location] = objects[cur_location], objects[ordinality]
cur_locations[ordinality] = orig_location
end
puts objects.join(' ')
That obviously does involve some extra memory for the hash, but since it's just for indexes and not your "fairly large" objects, hopefully that's acceptable. Since hash lookups are O(1), even though there is a slight bump to the complexity due to the case where an item has been swapped forward more than once and you have to rewrite cur_location multiple times, the algorithm as a whole should be reasonably close to O(n).
If you wanted you could build a full hash of original to current positions ahead of time, or keep a reverse hash of current to original, and modify the algorithm a bit to get it down to strictly O(n). It'd be a little more complicated and take a little more space, so this is the version I wrote out, but the modifications shouldn't be difficult.
EDIT: Actually, I'm fairly certain the time complexity is just O(n), since each ordinality can have at most one hop associated, and thus the maximum number of lookups is limited to n.

#!/usr/bin/env python
def rearrange(objects, permutation):
"""Rearrange `objects` inplace according to `permutation`.
``result = [objects[p] for p in permutation]``
"""
seen = [False] * len(permutation)
for i, already_seen in enumerate(seen):
if not already_seen: # start permutation cycle
first_obj, j = objects[i], i
while True:
seen[j] = True
p = permutation[j]
if p == i: # end permutation cycle
objects[j] = first_obj # [old] p -> j
break
objects[j], j = objects[p], p # p -> j
The algorithm (as I've noticed after I wrote it) is the same as the one from #meriton's answer in Java.
Here's a test function for the code:
def test():
import itertools
N = 9
for perm in itertools.permutations(range(N)):
L = range(N)
LL = L[:]
rearrange(L, perm)
assert L == [LL[i] for i in perm] == list(perm), (L, list(perm), LL)
# test whether assertions are enabled
try:
assert 0
except AssertionError:
pass
else:
raise RuntimeError("assertions must be enabled for the test")
if __name__ == "__main__":
test()

There's a histogram sort, though the running time is given as a bit higher than O(N) (N log log n).

I can do it given O(N) scratch space -- copy to new array and copy back.
EDIT: I am aware of the existance of an algorithm that will proceed through. The idea is to perform the swaps on the array of integers 1..N while at the same time mirroring the swaps on your array of large objects. I just cannot find the algorithm right now.

The problem is one of applying a permutation in place with minimal O(1) extra storage: "in-situ permutation".
It is solvable, but an algorithm is not obvious beforehand.
It is described briefly as an exercise in Knuth, and for work I had to decipher it and figure out how it worked. Look at 5.2 #13.
For some more modern work on this problem, with pseudocode:
http://www.fernuni-hagen.de/imperia/md/content/fakultaetfuermathematikundinformatik/forschung/berichte/bericht_273.pdf

I ended up writing a different algorithm for this, which first generates a list of swaps to apply an order and then runs through the swaps to apply it. The advantage is that if you're applying the ordering to multiple lists, you can reuse the swap list, since the swap algorithm is extremely simple.
void make_swaps(vector<int> order, vector<pair<int,int>> &swaps)
{
// order[0] is the index in the old list of the new list's first value.
// Invert the mapping: inverse[0] is the index in the new list of the
// old list's first value.
vector<int> inverse(order.size());
for(int i = 0; i < order.size(); ++i)
inverse[order[i]] = i;
swaps.resize(0);
for(int idx1 = 0; idx1 < order.size(); ++idx1)
{
// Swap list[idx] with list[order[idx]], and record this swap.
int idx2 = order[idx1];
if(idx1 == idx2)
continue;
swaps.push_back(make_pair(idx1, idx2));
// list[idx1] is now in the correct place, but whoever wanted the value we moved out
// of idx2 now needs to look in its new position.
int idx1_dep = inverse[idx1];
order[idx1_dep] = idx2;
inverse[idx2] = idx1_dep;
}
}
template<typename T>
void run_swaps(T data, const vector<pair<int,int>> &swaps)
{
for(const auto &s: swaps)
{
int src = s.first;
int dst = s.second;
swap(data[src], data[dst]);
}
}
void test()
{
vector<int> order = { 2, 3, 1, 4, 0 };
vector<pair<int,int>> swaps;
make_swaps(order, swaps);
vector<string> data = { "a", "b", "c", "d", "e" };
run_swaps(data, swaps);
}

Related

Code Complexity in 3 array case

You are given with three sorted arrays ( in ascending order), you are required to find a triplet ( one element from each array) such that distance is minimum.
Distance is defined like this :
If a[i], b[j] and c[k] are three elements then
distance = max{abs(a[i]-b[j]),abs(a[i]-c[k]),abs(b[j]-c[k])}
Please give a solution in O(n) time complexity
Linear time algorithm:
double MinimalDistance(double[] A, double[] B, double[] C)
{
int i,j,k = 0;
double min_value = infinity;
double current_val;
int opt_indexes[3] = {0, 0, 0);
while(i < A.size || j < B.size || k < C.size)
{
current_val = calculate_distance(A[i],B[j],C[k]);
if(current_val < min_value)
{
min_value = current_val;
opt_indexes[1] = i;
opt_indexes[2] = j;
opt_indexes[3] = k;
}
if(A[i] < B[j] && A[i] < C[k] && i < A.size)
i++;
else if (B[j] < C[k] && j < B.size)
j++;
else
k++;
}
return min_value;
}
In each step you check the current distance, then increment the index of the array currently pointing to the minimal value. each array is iterated through exactly once, which mean the running time is O(A.size + B.size + C.size).
if you want the optimal indexes instead of the minimal values, you can return opt_indexes instead of min_value.
Suppose we have just one sorted array, then 3 consecutive elements which have less possible distances are the desired solution. Now when we have three arrays, just merge them all and make a big sorted array ABC (this can be done in O(n) by merge operation in merge-sort), just keep a flag to determine which element belongs in which original array. Now you have to find three consecutive elements in array like this:
a1,a2,b1,b2,b3,c1,b4,c2,c3,c4,b5,b6,a3,a4,a5,....
and here consecutive means they belong to the 3 different group in consecutive order, e.g: a2,b3,c1 or c4,b6,a3.
Now finding this tree elements is not hard, sure smallest and greatest one should be last and first of a elements of first and last group in some triple, e.g in the group: [c2,c3,c4],[b5,b6],[a3,a4,a5], we don't need to check a4,a5,c2,c3 is clear that possible solution in this case is among c4,[b5,b6],a5, also we don't need to compare c4 with b5,b6, or a5 with b5,b6, sure distance is made by a5-c4 (in this group). So we can start from left and keep track of last element and update best possible solution in each iteration by just keeping the last visited value of each group.
Example (first I should say that I didn't wrote the code because I think this is OP's task not me):
Suppose we have this sequences after sorted array:
a1,a2,b1,b2,b3,c1,b4,c2,c3,c4,b5,b6,a3,a4,a5,....
let iterate step by step:
We need to just keep track of last item for each item from our arrays, a is for keeping track of current best a_i, b for b_i, and c for c_i. suppose at first a_i=b_i=c_i=-1,
in the first step a will be a1, in the next step
a=a2,b=-1,c=-1
a=a2,b=b1,c=-1
a=a2,b=b2,c=-1
a=a2,b=b3,c=-1,
a=a2,b=b3,c=c1,
At this point we save current pointers (a2,b3,c1) as a best value for difference,
In the next step:
a=a2,c=c1,b=b4
Now we compare the difference of b4-a2 with previously best option, if is better, we save this pointers as a solution upto now and we proceed:
a=a2,b=b4,c=c2 (again compare and if needed update the best solution),
a=a2,b=b4,c=c3 (again ....)
a=a2,b=b4,c=c4 (again ....)
a=a2, b=b5,c=c4, ....
Ok if is not clear from the text, after merge we have (I'll suppose all of array have at least one element):
solution = infinite;
a=b=c=-1,
bestA=bestB=bestC=1;
for (int i=0;i<ABC.Length;i++)
{
if(ABC[i].type == "a") // type is a flag determines
// who is the owner of this element
{
a=ABC[i].Value;
if (b!=-1&&c!=-1)
{
if (max(|a-b|,|b-c|,|a-c|) < solution)
{
solution = max(|a-b|,|b-c|,|a-c|);
bestA= a,bestB = b,bestC = c;
}
}
}
// and two more if for type "b" and "c"
}
Sure there is more elegant algorithm than this, but I see you had problem with your link, so I guess this trivial way of looking at problem makes it easier, afterward you can understand your own link.

Algorithm to find duplicate in an array

I have an assignment to create an algorithm to find duplicates in an array which includes number values. but it has not said which kind of numbers, integers or floats. I have written the following pseudocode:
FindingDuplicateAlgorithm(A) // A is the array
mergeSort(A);
for int i <- 0 to i<A.length
if A[i] == A[i+1]
i++
return A[i]
else
i++
have I created an efficient algorithm?
I think there is a problem in my algorithm, it returns duplicate numbers several time. for example if array include 2 in two for two indexes i will have ...2, 2,... in the output. how can i change it to return each duplicat only one time?
I think it is a good algorithm for integers, but does it work good for float numbers too?
To handle duplicates, you can do the following:
if A[i] == A[i+1]:
result.append(A[i]) # collect found duplicates in a list
while A[i] == A[i+1]: # skip the entire range of duplicates
i++ # until a new value is found
Do you want to find Duplicates in Java?
You may use a HashSet.
HashSet h = new HashSet();
for(Object a:A){
boolean b = h.add(a);
boolean duplicate = !b;
if(duplicate)
// do something with a;
}
The return-Value of add() is defined as:
true if the set did not already
contain the specified element.
EDIT:
I know HashSet is optimized for inserts and contains operations. But I'm not sure if its fast enough for your concerns.
EDIT2:
I've seen you recently added the homework-tag. I would not prefer my answer if itf homework, because it may be to "high-level" for an allgorithm-lesson
http://download.oracle.com/javase/1.4.2/docs/api/java/util/HashSet.html#add%28java.lang.Object%29
Your answer seems pretty good. First sorting and them simply checking neighboring values gives you O(n log(n)) complexity which is quite efficient.
Merge sort is O(n log(n)) while checking neighboring values is simply O(n).
One thing though (as mentioned in one of the comments) you are going to get a stack overflow (lol) with your pseudocode. The inner loop should be (in Java):
for (int i = 0; i < array.length - 1; i++) {
...
}
Then also, if you actually want to display which numbers (and or indexes) are the duplicates, you will need to store them in a separate list.
I'm not sure what language you need to write the algorithm in, but there are some really good C++ solutions in response to my question here. Should be of use to you.
O(n) algorithm: traverse the array and try to input each element in a hashtable/set with number as the hash key. if you cannot enter, than that's a duplicate.
Your algorithm contains a buffer overrun. i starts with 0, so I assume the indexes into array A are zero-based, i.e. the first element is A[0], the last is A[A.length-1]. Now i counts up to A.length-1, and in the loop body accesses A[i+1], which is out of the array for the last iteration. Or, simply put: If you're comparing each element with the next element, you can only do length-1 comparisons.
If you only want to report duplicates once, I'd use a bool variable firstDuplicate, that's set to false when you find a duplicate and true when the number is different from the next. Then you'd only report the first duplicate by only reporting the duplicate numbers if firstDuplicate is true.
public void printDuplicates(int[] inputArray) {
if (inputArray == null) {
throw new IllegalArgumentException("Input array can not be null");
}
int length = inputArray.length;
if (length == 1) {
System.out.print(inputArray[0] + " ");
return;
}
for (int i = 0; i < length; i++) {
if (inputArray[Math.abs(inputArray[i])] >= 0) {
inputArray[Math.abs(inputArray[i])] = -inputArray[Math.abs(inputArray[i])];
} else {
System.out.print(Math.abs(inputArray[i]) + " ");
}
}
}

Find unique common element from 3 arrays

Original Problem:
I have 3 boxes each containing 200 coins, given that there is only one person who has made calls from all of the three boxes and thus there is one coin in each box which has same fingerprints and rest of all coins have different fingerprints. You have to find the coin which contains same fingerprint from all of the 3 boxes. So that we can find the fingerprint of the person who has made call from all of the 3 boxes.
Converted problem:
You have 3 arrays containing 200 integers each. Given that there is one and only one common element in these 3 arrays. Find the common element.
Please consider solving this for other than trivial O(1) space and O(n^3) time.
Some improvement in Pelkonen's answer:
From converted problem in OP:
"Given that there is one and only one common element in these 3 arrays."
We need to sort only 2 arrays and find common element.
If you sort all the arrays first O(n log n) then it will be pretty easy to find the common element in less than O(n^3) time. You can for example use binary search after sorting them.
Let N = 200, k = 3,
Create a hash table H with capacity ≥ Nk.
For each element X in array 1, set H[X] to 1.
For each element Y in array 2, if Y is in H and H[Y] == 1, set H[Y] = 2.
For each element Z in array 3, if Z is in H and H[Z] == 2, return Z.
throw new InvalidDataGivenByInterviewerException();
O(Nk) time, O(Nk) space complexity.
Use a hash table for each integer and encode the entries such that you know which array it's coming from - then check for the slot which has entries from all 3 arrays. O(n)
Use a hashtable mapping objects to frequency counts. Iterate through all three lists, incrementing occurrence counts in the hashtable, until you encounter one with an occurrence count of 3. This is O(n), since no sorting is required. Example in Python:
def find_duplicates(*lists):
num_lists = len(lists)
counts = {}
for l in lists:
for i in l:
counts[i] = counts.get(i, 0) + 1
if counts[i] == num_lists:
return i
Or an equivalent, using sets:
def find_duplicates(*lists):
intersection = set(lists[0])
for l in lists[1:]:
intersection = intersection.intersect(set(l))
return intersection.pop()
O(N) solution: use a hash table. H[i] = list of all integers in the three arrays that map to i.
For all H[i] > 1 check if three of its values are the same. If yes, you have your solution. You can do this check with the naive solution even, it should still be very fast, or you can sort those H[i] and then it becomes trivial.
If your numbers are relatively small, you can use H[i] = k if i appears k times in the three arrays, then the solution is the i for which H[i] = 3. If your numbers are huge, use a hash table though.
You can extend this to work even if you can have elements that can be common to only two arrays and also if you can have elements repeating elements in one of the arrays. It just becomes a bit more complicated, but you should be able to figure it out on your own.
If you want the fastest* answer:
Sort one array--time is N log N.
For each element in the second array, search the first. If you find it, add 1 to a companion array; otherwise add 0--time is N log N, using N space.
For each non-zero count, copy the corresponding entry into the temporary array, compacting it so it's still sorted--time is N.
For each element in the third array, search the temporary array; when you find a hit, stop. Time is less than N log N.
Here's code in Scala that illustrates this:
import java.util.Arrays
val a = Array(1,5,2,3,14,1,7)
val b = Array(3,9,14,4,2,2,4)
val c = Array(1,9,11,6,8,3,1)
Arrays.sort(a)
val count = new Array[Int](a.length)
for (i <- 0 until b.length) {
val j =Arrays.binarySearch(a,b(i))
if (j >= 0) count(j) += 1
}
var n = 0
for (i <- 0 until count.length) if (count(i)>0) { count(n) = a(i); n+= 1 }
for (i <- 0 until c.length) {
if (Arrays.binarySearch(count,0,n,c(i))>=0) println(c(i))
}
With slightly more complexity, you can either use no extra space at the cost of being even more destructive of your original arrays, or you can avoid touching your original arrays at all at the cost of another N space.
Edit: * as the comments have pointed out, hash tables are faster for non-perverse inputs. This is "fastest worst case". The worst case may not be so unlikely unless you use a really good hashing algorithm, which may well eat up more time than your sort. For example, if you multiply all your values by 2^16, the trivial hashing (i.e. just use the bitmasked integer as an index) will collide every time on lists shorter than 64k....
//Begineers Code using Binary Search that's pretty Easy
// bool BS(int arr[],int low,int high,int target)
// {
// if(low>high)
// return false;
// int mid=low+(high-low)/2;
// if(target==arr[mid])
// return 1;
// else if(target<arr[mid])
// BS(arr,low,mid-1,target);
// else
// BS(arr,mid+1,high,target);
// }
// vector <int> commonElements (int A[], int B[], int C[], int n1, int n2, int n3)
// {
// vector<int> ans;
// for(int i=0;i<n2;i++)
// {
// if(i>0)
// {
// if(B[i-1]==B[i])
// continue;
// }
// //The above if block is to remove duplicates
// //In the below code we are searching an element form array B in both the arrays A and B;
// if(BS(A,0,n1-1,B[i]) && BS(C,0,n3-1,B[i]))
// {
// ans.push_back(B[i]);
// }
// }
// return ans;
// }

Algorithm to select a single, random combination of values?

Say I have y distinct values and I want to select x of them at random. What's an efficient algorithm for doing this? I could just call rand() x times, but the performance would be poor if x, y were large.
Note that combinations are needed here: each value should have the same probability to be selected but their order in the result is not important. Sure, any algorithm generating permutations would qualify, but I wonder if it's possible to do this more efficiently without the random order requirement.
How do you efficiently generate a list of K non-repeating integers between 0 and an upper bound N covers this case for permutations.
Robert Floyd invented a sampling algorithm for just such situations. It's generally superior to shuffling then grabbing the first x elements since it doesn't require O(y) storage. As originally written it assumes values from 1..N, but it's trivial to produce 0..N and/or use non-contiguous values by simply treating the values it produces as subscripts into a vector/array/whatever.
In pseuocode, the algorithm runs like this (stealing from Jon Bentley's Programming Pearls column "A sample of Brilliance").
initialize set S to empty
for J := N-M + 1 to N do
T := RandInt(1, J)
if T is not in S then
insert T in S
else
insert J in S
That last bit (inserting J if T is already in S) is the tricky part. The bottom line is that it assures the correct mathematical probability of inserting J so that it produces unbiased results.
It's O(x)1 and O(1) with regard to y, O(x) storage.
Note that, in accordance with the combinations tag in the question, the algorithm only guarantees equal probability of each element occuring in the result, not of their relative order in it.
1O(x2) in the worst case for the hash map involved which can be neglected since it's a virtually nonexistent pathological case where all the values have the same hash
Assuming that you want the order to be random too (or don't mind it being random), I would just use a truncated Fisher-Yates shuffle. Start the shuffle algorithm, but stop once you have selected the first x values, instead of "randomly selecting" all y of them.
Fisher-Yates works as follows:
select an element at random, and swap it with the element at the end of the array.
Recurse (or more likely iterate) on the remainder of the array, excluding the last element.
Steps after the first do not modify the last element of the array. Steps after the first two don't affect the last two elements. Steps after the first x don't affect the last x elements. So at that point you can stop - the top of the array contains uniformly randomly selected data. The bottom of the array contains somewhat randomized elements, but the permutation you get of them is not uniformly distributed.
Of course this means you've trashed the input array - if this means you'd need to take a copy of it before starting, and x is small compared with y, then copying the whole array is not very efficient. Do note though that if all you're going to use it for in future is further selections, then the fact that it's in somewhat-random order doesn't matter, you can just use it again. If you're doing the selection multiple times, therefore, you may be able to do only one copy at the start, and amortise the cost.
If you really only need to generate combinations - where the order of elements does not matter - you may use combinadics as they are implemented e.g. here by James McCaffrey.
Contrast this with k-permutations, where the order of elements does matter.
In the first case (1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1) are considered the same - in the latter, they are considered distinct, though they contain the same elements.
In case you need combinations, you may really only need to generate one random number (albeit it can be a bit large) - that can be used directly to find the m th combination.
Since this random number represents the index of a particular combination, it follows that your random number should be between 0 and C(n,k).
Calculating combinadics might take some time as well.
It might just not worth the trouble - besides Jerry's and Federico's answer is certainly simpler than implementing combinadics.
However if you really only need a combination and you are bugged about generating the exact number of random bits that are needed and none more... ;-)
While it is not clear whether you want combinations or k-permutations, here is a C# code for the latter (yes, we could generate only a complement if x > y/2, but then we would have been left with a combination that must be shuffled to get a real k-permutation):
static class TakeHelper
{
public static IEnumerable<T> TakeRandom<T>(
this IEnumerable<T> source, Random rng, int count)
{
T[] items = source.ToArray();
count = count < items.Length ? count : items.Length;
for (int i = items.Length - 1 ; count-- > 0; i--)
{
int p = rng.Next(i + 1);
yield return items[p];
items[p] = items[i];
}
}
}
class Program
{
static void Main(string[] args)
{
Random rnd = new Random(Environment.TickCount);
int[] numbers = new int[] { 1, 2, 3, 4, 5, 6, 7 };
foreach (int number in numbers.TakeRandom(rnd, 3))
{
Console.WriteLine(number);
}
}
}
Another, more elaborate implementation that generates k-permutations, that I had lying around and I believe is in a way an improvement over existing algorithms if you only need to iterate over the results. While it also needs to generate x random numbers, it only uses O(min(y/2, x)) memory in the process:
/// <summary>
/// Generates unique random numbers
/// <remarks>
/// Worst case memory usage is O(min((emax-imin)/2, num))
/// </remarks>
/// </summary>
/// <param name="random">Random source</param>
/// <param name="imin">Inclusive lower bound</param>
/// <param name="emax">Exclusive upper bound</param>
/// <param name="num">Number of integers to generate</param>
/// <returns>Sequence of unique random numbers</returns>
public static IEnumerable<int> UniqueRandoms(
Random random, int imin, int emax, int num)
{
int dictsize = num;
long half = (emax - (long)imin + 1) / 2;
if (half < dictsize)
dictsize = (int)half;
Dictionary<int, int> trans = new Dictionary<int, int>(dictsize);
for (int i = 0; i < num; i++)
{
int current = imin + i;
int r = random.Next(current, emax);
int right;
if (!trans.TryGetValue(r, out right))
{
right = r;
}
int left;
if (trans.TryGetValue(current, out left))
{
trans.Remove(current);
}
else
{
left = current;
}
if (r > current)
{
trans[r] = left;
}
yield return right;
}
}
The general idea is to do a Fisher-Yates shuffle and memorize the transpositions in the permutation.
It was not published anywhere nor has it received any peer-review whatsoever. I believe it is a curiosity rather than having some practical value. Nonetheless I am very open to criticism and would generally like to know if you find anything wrong with it - please consider this (and adding a comment before downvoting).
A little suggestion: if x >> y/2, it's probably better to select at random y - x elements, then choose the complementary set.
The trick is to use a variation of shuffle or in other words a partial shuffle.
function random_pick( a, n )
{
N = len(a);
n = min(n, N);
picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
// partially shuffle the array, and generate unbiased selection simultaneously
// this is a variation on fisher-yates-knuth shuffle
for (i=0; i<n; i++) // O(n) times
{
selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
value = a[ selected ];
a[ selected ] = a[ N ];
a[ N ] = value;
backup[ i ] = selected;
picked[ i ] = value;
}
// restore partially shuffled input array from backup
// optional step, if needed it can be ignored
for (i=n-1; i>=0; i--) // O(n) times
{
selected = backup[ i ];
value = a[ N ];
a[ N ] = a[ selected ];
a[ selected ] = value;
N++;
}
return picked;
}
NOTE the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and non-destructive on the input array (as a partial shuffle would be) but this is optional
adapted from here
update
another approach using only a single call to PRNG (pseudo-random number generator) in [0,1] by IVAN STOJMENOVIC, "ON RANDOM AND ADAPTIVE PARALLEL GENERATION OF COMBINATORIAL OBJECTS" (section 3), of O(N) (worst-case) complexity
Here is a simple way to do it which is only inefficient if Y is much larger than X.
void randomly_select_subset(
int X, int Y,
const int * inputs, int X, int * outputs
) {
int i, r;
for( i = 0; i < X; ++i ) outputs[i] = inputs[i];
for( i = X; i < Y; ++i ) {
r = rand_inclusive( 0, i+1 );
if( r < i ) outputs[r] = inputs[i];
}
}
Basically, copy the first X of your distinct values to your output array, and then for each remaining value, randomly decide whether or not to include that value.
The random number is further used to choose an element of our (mutable) output array to replace.
If, for example, you have 2^64 distinct values, you can use a symmetric key algorithm (with a 64 bits block) to quickly reshuffle all combinations. (for example Blowfish).
for(i=0; i<x; i++)
e[i] = encrypt(key, i)
This is not random in the pure sense but can be useful for your purpose.
If you want to work with arbitrary # of distinct values following cryptographic techniques you can but it's more complex.

Most common substring of length X

I have a string s and I want to search for the substring of length X that occurs most often in s. Overlapping substrings are allowed.
For example, if s="aoaoa" and X=3, the algorithm should find "aoa" (which appears 2 times in s).
Does an algorithm exist that does this in O(n) time?
You can do this using a rolling hash in O(n) time (assuming good hash distribution). A simple rolling hash would be the xor of the characters in the string, you can compute it incrementally from the previous substring hash using just 2 xors. (See the Wikipedia entry for better rolling hashes than xor.) Compute the hash of your n-x+1 substrings using the rolling hash in O(n) time. If there were no collisions, the answer is clear - if collisions happen, you'll need to do more work. My brain hurts trying to figure out if that can all be resolved in O(n) time.
Update:
Here's a randomized O(n) algorithm. You can find the top hash in O(n) time by scanning the hashtable (keeping it simple, assume no ties). Find one X-length string with that hash (keep a record in the hashtable, or just redo the rolling hash). Then use an O(n) string searching algorithm to find all occurrences of that string in s. If you find the same number of occurrences as you recorded in the hashtable, you're done.
If not, that means you have a hash collision. Pick a new random hash function and try again. If your hash function has log(n)+1 bits and is pairwise independent [Prob(h(s) == h(t)) < 1/2^{n+1} if s != t], then the probability that the most frequent x-length substring in s hash a collision with the <=n other length x substrings of s is at most 1/2. So if there is a collision, pick a new random hash function and retry, you will need only a constant number of tries before you succeed.
Now we only need a randomized pairwise independent rolling hash algorithm.
Update2:
Actually, you need 2log(n) bits of hash to avoid all (n choose 2) collisions because any collision may hide the right answer. Still doable, and it looks like hashing by general polynomial division should do the trick.
I don't see an easy way to do this in strictly O(n) time, unless X is fixed and can be considered a constant. If X is a parameter to the algorithm, then most simple ways of doing this will actually be O(n*X), as you will need to do comparison operations, string copies, hashes, etc., on a substring of length X at every iteration.
(I'm imagining, for a minute, that s is a multi-gigabyte string, and that X is some number over a million, and not seeing any simple ways of doing string comparison, or hashing substrings of length X, that are O(1), and not dependent on the size of X)
It might be possible to avoid string copies during scanning, by leaving everything in place, and to avoid re-hashing the entire substring -- perhaps by using an incremental hash algorithm where you can add a byte at a time, and remove the oldest byte -- but I don't know of any such algorithms that wouldn't result in huge numbers of collisions that would need to be filtered out with an expensive post-processing step.
Update
Keith Randall points out that this kind of hash is known as a rolling hash. It still remains, though, that you would have to store the starting string position for each match in your hash table, and then verify after scanning the string that all of your matches were true. You would need to sort the hashtable, which could contain n-X entries, based on the number of matches found for each hash key, and verify each result -- probably not doable in O(n).
It should be O(n*m) where m is the average length of a string in the list. For very small values of m then the algorithm will approach O(n)
Build a hashtable of counts for each string length
Iterate over your collection of strings, updating the hashtable accordingly, storing the current most prevelant number as an integer variable separate from the hashtable
done.
Naive solution in Python
from collections import defaultdict
from operator import itemgetter
def naive(s, X):
freq = defaultdict(int)
for i in range(len(s) - X + 1):
freq[s[i:i+X]] += 1
return max(freq.iteritems(), key=itemgetter(1))
print naive("aoaoa", 3)
# -> ('aoa', 2)
In plain English
Create mapping: substring of length X -> how many times it occurs in the s string
for i in range(len(s) - X + 1):
freq[s[i:i+X]] += 1
Find a pair in the mapping with the largest second item (frequency)
max(freq.iteritems(), key=itemgetter(1))
Here is a version I did in C. Hope that it helps.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
char *string = NULL, *maxstring = NULL, *tmpstr = NULL, *tmpstr2 = NULL;
unsigned int n = 0, i = 0, j = 0, matchcount = 0, maxcount = 0;
string = "aoaoa";
n = 3;
for (i = 0; i <= (strlen(string) - n); i++) {
tmpstr = (char *)malloc(n + 1);
strncpy(tmpstr, string + i, n);
*(tmpstr + (n + 1)) = '\0';
for (j = 0; j <= (strlen(string) - n); j++) {
tmpstr2 = (char *)malloc(n + 1);
strncpy(tmpstr2, string + j, n);
*(tmpstr2 + (n + 1)) = '\0';
if (!strcmp(tmpstr, tmpstr2))
matchcount++;
}
if (matchcount > maxcount) {
maxstring = tmpstr;
maxcount = matchcount;
}
matchcount = 0;
}
printf("max string: \"%s\", count: %d\n", maxstring, maxcount);
free(tmpstr);
free(tmpstr2);
return 0;
}
You can build a tree of sub-strings. The idea is to organise your sub-strings like a telephone book. You then look up the sub-string and increase its count by one.
In your example above, the tree will have sections (nodes) starting with the letters: 'a' and 'o'. 'a' appears three times and 'o' appears twice. So those nodes will have a count of 3 and 2 respectively.
Next, under the 'a' node a sub-node of 'o' will appear corresponding to the sub-string 'ao'. This appears twice. Under the 'o' node 'a' also appears twice.
We carry on in this fashion until we reach the end of the string.
A representation of the tree for 'abac' might be (nodes on the same level are separated by a comma, sub-nodes are in brackets, counts appear after the colon).
a:2(b:1(a:1(c:1())),c:1()),b:1(a:1(c:1())),c:1()
If the tree is drawn out it will be a lot more obvious! What this all says for example is that the string 'aba' appears once, or the string 'a' appears twice etc. But, storage is greatly reduced and more importantly retrieval is greatly speeded up (compare this to keeping a list of sub-strings).
To find out which sub-string is most repeated, do a depth first search of the tree, every time a leaf node is reached, note the count, and keep a track of the highest one.
The running time is probably something like O(log(n)) not sure, but certainly better than O(n^2).
Python-3 Solution:
from collections import Counter
list = []
list.append([string[i: j] for i in range(len(string)) for j in range(i + 1, len(string) + 1) if len(string[i:j]) == K]) # Where K is length
# now find the most common value in this list
# you can do this natively, but I prefer using collections
most_frequent = Counter(list).most_common(1)[0][0]
print(most_freqent)
Here is the native way to get the most common (for those that are interested):
most_occurences = 0
current_most = ""
for i in list:
frequency = list.count(i)
if frequency > most_occurences:
most_occurences = frequency
current_most = list[i]
print(f"{current_most}, Occurences: {most_occurences}")
[Extract K length substrings (geeks for geeks)][1]
[1]: https://www.geeksforgeeks.org/python-extract-k-length-substrings/
LZW algorithm does this
This is exactly what Lempel-Ziv-Welch (LZW used in GIF image format) compression algorithm does. It finds prevalent repeated bytes and changes them for something short.
LZW on Wikipedia
There's no way to do this in O(n).
Feel free to downvote me if you can prove me wrong on this one, but I've got nothing.

Resources