Best O(n) algorithm to find often appearing numbers?

Consider an array A of n numbers. Each number is a value
in the range [0..k]. A number x is said to appear often in A if at least 1/3 of the numbers
in the array are equal to x.
What would be an O(n) algorithm finding the often appearing numbers for the
case when k is orders of magnitude larger than n?

Why not use a hash map, i.e. a hash-based mapping (dictionary) from integers to integers? Then just iterate over your input array and compute the counters. In imperative pseudo-code:
const int often = ceiling(n/3);
hashmap m;
for int i = 1 to n do {
    if m.contains(A[i])
        m[A[i]] += 1;
    else
        m[A[i]] = 1;
    if m[A[i]] == often
        // A[i] has just reached the threshold, so it appears often
        // (testing with == reports each such value exactly once)
        // print it or store it in the result set, etc.
}
This is O(n) in terms of time (expected) and space.
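For readers who prefer something runnable, here is a minimal Python sketch of the same counting idea. The function name `often_appearing` is made up for illustration, and the test `3 * c >= n` is just an integer-only way of writing "at least n/3 occurrences".

```python
from collections import Counter
from typing import Iterable, List


def often_appearing(values: Iterable[int]) -> List[int]:
    """Return the values that make up at least one third of the input."""
    items = list(values)
    n = len(items)
    # One pass over the input: expected O(n) time, O(distinct values) space.
    counts = Counter(items)
    # 3 * c >= n is equivalent to c >= n / 3 without floating point.
    return [x for x, c in counts.items() if 3 * c >= n]


if __name__ == "__main__":
    # The size of the values (k) does not matter, only the array length n.
    print(often_appearing([7, 10**12, 7, 3, 10**12, 7]))
```

Since the counts add up to n, at most three distinct values can pass the test, so the result list is always tiny.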

Related

Find an algorithm for sorting integers with time complexity O(n + k*log(k))

Design an algorithm that sorts n integers where there are duplicates. The total number of different numbers is k. Your algorithm should have time complexity O(n + k*log(k)). The expected time is enough. For which values of k does the algorithm become linear?
I am not able to come up with a sorting algorithm for integers that runs in O(n + k*log(k)). I am not a very advanced programmer. In the problem before this one I was supposed to come up with an algorithm for a list of numbers xi with 0 ≤ xi ≤ m that runs in O(n+m), where n is the number of elements in the list and m is the value of the biggest integer in the list. I solved that problem easily by using counting sort, but I struggle with this one. The condition that makes it most difficult for me is the term k*log(k) in the big-O bound; if that were n*log(n) instead, I would be able to use merge sort, right? But that's not possible now, so any ideas would be very helpful.
Thanks in advance!
Here is a possible solution:
Using a hash table, count the number of occurrences of each distinct value. This has an expected complexity of O(n).
Enumerate the hash table, storing the unique values into a temporary array. Complexity is O(k).
Sort this array with a standard algorithm such as mergesort. Complexity is O(k*log(k)).
Create the resulting array by replicating each element of the sorted array of unique values the number of times stored in the hash table. Complexity is O(n) + O(k).
Combined complexity is O(n + k*log(k)).
The algorithm is linear whenever k*log(k) = O(n), that is, for k = O(n/log(n)). For example, if k is a small constant, sorting an array of n values converges toward linear time as n becomes larger and larger.
If during the first phase, where k is computed incrementally, it appears that k is not significantly smaller than n, drop the hash table and just sort the original array with a standard algorithm.
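The four phases can be sketched in a few lines of Python. This is only an illustration under the assumption that the values are hashable integers; `sort_with_duplicates` is a hypothetical name, and Python's built-in `sorted` stands in for "a standard algorithm such as mergesort".

```python
from collections import Counter
from typing import List


def sort_with_duplicates(values: List[int]) -> List[int]:
    """Sort n integers with k distinct values in expected O(n + k*log(k))."""
    # Phase 1: count occurrences of each distinct value, expected O(n).
    counts = Counter(values)
    result: List[int] = []
    # Phases 2 and 3: enumerate and sort the k distinct keys, O(k*log(k)).
    for v in sorted(counts):
        # Phase 4: replicate each key by its count, O(n) in total.
        result.extend([v] * counts[v])
    return result


if __name__ == "__main__":
    print(sort_with_duplicates([3, 1, 3, 2, 1, 3]))
```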
The runtime of O(n + k*log(k)) indicates (as addition in runtimes often does) that you have two subroutines, one which runs in O(n) and the other which runs in O(k*log(k)).
You can first count the frequency of the elements in O(n) (for example in a hash map; look this up if you're not familiar with it, it's very useful).
Then you just sort the unique elements, of which there are k. This sorting runs in O(k*log(k)); use any sorting algorithm you want.
At the end, replace each unique element by as many copies as it actually appeared, by looking this up in the map you created in step 1.
A possible Java solution can be like this:
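import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public List<Integer> sortArrayWithDuplicates(List<Integer> arr) {
    // Count the occurrences of each value => O(n) expected
    Map<Integer, Integer> freqMap = new HashMap<>();
    for (Integer i : arr) {
        freqMap.put(i, freqMap.getOrDefault(i, 0) + 1);
    }
    // The keys of the map are the k different elements
    List<Integer> withoutDups = new ArrayList<>(freqMap.keySet());
    // Sorting => O(k*log(k)), as there are k different elements
    // (Collections.sort is used because withoutDups is a List, not an array)
    Collections.sort(withoutDups);
    // Emit each distinct value as many times as it occurred => O(n)
    List<Integer> result = new ArrayList<>();
    for (Integer i : withoutDups) {
        int c = freqMap.get(i);
        for (int j = 0; j < c; j++) {
            result.add(i);
        }
    }
    // return the result
    return result;
}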
The time complexity of the above code is O(n + k*log(k)), and the solution follows the same approach as the answers above.

Big O - is n always the size of the input?

I made up my own interview-style problem, and have a question on the big O of my solution. I will state the problem and my solution below, but first let me say that the obvious solution involves a nested loop and is O(n^2). I believe I found an O(n) solution, but then I realized it depends not only on the size of the input, but also on the largest value of the input. It seems like my running time of O(n) is only a technicality, and that it could easily run in O(n^2) time or worse in real life.
The problem is:
For each item in a given array of positive integers, print all the other items in the array that are multiples of the current item.
Example Input:
[2 9 6 8 3]
Example Output:
2: 6 8
9:
6:
8:
3: 9 6
My solution (in C#):
private static void PrintAllDivisibleBy(int[] arr)
{
    Dictionary<int, bool> dic = new Dictionary<int, bool>();
    if (arr == null || arr.Length < 2)
        return;
    int max = arr[0];
    for(int i=0; i<arr.Length; i++)
    {
        if (arr[i] > max)
            max = arr[i];
        dic[arr[i]] = true;
    }
    for(int i=0; i<arr.Length; i++)
    {
        Console.Write("{0}: ", arr[i]);
        int multiplier = 2;
        while(true)
        {
            int product = multiplier * arr[i];
            if (dic.ContainsKey(product))
                Console.Write("{0} ", product);
            if (product >= max)
                break;
            multiplier++;
        }
        Console.WriteLine();
    }
}
So, if 2 of the array items are 1 and n, where n is the array length, the inner while loop will run n times, making this equivalent to O(n^2). But, since the performance is dependent on the size of the input values, not the length of the list, that makes it O(n), right?
Would you consider this a true O(n) solution? Is it only O(n) due to technicalities, but slower in real life?
Good question! The answer is that, no, n is not always the size of the input: You can't really talk about O(n) without defining what the n means, but often people use imprecise language and imply that n is "the most obvious thing that scales here". Technically we should usually say things like "This sort algorithm performs a number of comparisons that is O(n) in the number of elements in the list": being specific about both what n is, and what quantity we are measuring (comparisons).
If you have an algorithm that depends on the product of two different things (here, the length of the list and the largest element in it), the proper way to express that is in the form O(m*n), and then define what m and n are for your context. So, we could say that your algorithm performs O(m*n) multiplications, where m is the length of the list and n is the largest item in the list.
An algorithm is O(n) when you have to iterate over n elements and perform some constant-time operation in each iteration. The inner while loop of your algorithm is not constant time, as it depends on the magnitude of the biggest number in your array.
Your algorithm's best-case run-time is O(n). This is the case when all the n numbers are the same.
Your algorithm's worst-case run-time is O(k*n), where k is the max value of int possible on your machine, if you really insist on putting an upper bound on k's value. For a 32-bit int the max value is 2,147,483,647. You can argue that this k is a constant, but this constant is clearly
not fixed for every case of input array; and,
not negligible.
Would you consider this a true O(n) solution?
The runtime actually is O(nm), where m is the maximum element from arr. If the elements in your array are bounded by a constant, you can consider the algorithm to be O(n).
Can you improve the runtime? Here's what else you can do. First notice that you can ensure that the elements are distinct (you compress the array into a hash map which stores how many times each element is found in the array). Then your runtime would be max/a[0] + max/a[1] + max/a[2] + ... <= max + max/2 + ... + max/max = O(max*log(max)) (assuming your array arr is sorted, so that the distinct positive elements satisfy a[i] >= i+1). If you combine this with the obvious O(n^2) algorithm, you get an O(min(n^2, max*log(max))) algorithm.
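A rough Python sketch of the deduplicated variant, assuming all inputs are positive integers as the question states. The name `multiples_in_array` and the choice to return a dictionary instead of printing are illustrative only.

```python
from typing import Dict, List


def multiples_in_array(arr: List[int]) -> Dict[int, List[int]]:
    """Map each distinct value to the other array values that are multiples of it.

    Assumes every element of arr is a positive integer.
    """
    present = set(arr)            # deduplicate: expected O(n)
    if not present:
        return {}
    biggest = max(present)
    result: Dict[int, List[int]] = {}
    for v in present:
        # Walks 2v, 3v, ... up to biggest: about biggest / v steps.
        # Over distinct v this sums to at most biggest * (1 + 1/2 + ... + 1/biggest),
        # which is O(biggest * log(biggest)).
        result[v] = [m for m in range(2 * v, biggest + 1, v) if m in present]
    return result


if __name__ == "__main__":
    for value, multiples in multiples_in_array([2, 9, 6, 8, 3]).items():
        print(value, multiples)
```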

How to get dot product of two sparse vectors in O(m+n), where m and n are the number of elements in both vectors

I have two sparse vectors X and Y and want to get the dot product in O(m+n), where m and n are the numbers of non-zero elements in X and Y. The only way I can think of is picking each element in vector X and traversing vector Y to find whether there is an element with the same index. But that would take O(m * n). I am implementing the vector as a linked list and each node has an element.
You can do it if your vectors are stored as linked lists of tuples, with each tuple containing the index and the value of a non-zero element, sorted by index.
You iterate through both vectors, always advancing in the vector where you are at the lower index. If the indexes are the same, you multiply the elements, add the product to the result, and advance in both vectors.
Repeat until one list reaches the end.
Since you have one step per non-zero element in each list, the complexity is O(m+n) as required.
Footnote: The data structure doesn't have to be a linked list, but it must provide an O(1) way to access the next non-zero element and its index.
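As a sketch of that walk, here is a Python version that uses plain lists of (index, value) tuples in place of linked lists, which is allowed by the footnote since stepping to the next entry is O(1). The type alias `SparseVector` and the function name are assumptions made for this example.

```python
from typing import List, Tuple

# (index, value) pairs for the non-zero entries, sorted by index.
SparseVector = List[Tuple[int, float]]


def sparse_dot_sorted(x: SparseVector, y: SparseVector) -> float:
    """Dot product of two index-sorted sparse vectors in O(m + n)."""
    i = j = 0
    total = 0.0
    while i < len(x) and j < len(y):
        xi, xv = x[i]
        yi, yv = y[j]
        if xi == yi:
            # Same coordinate in both vectors: it contributes to the product.
            total += xv * yv
            i += 1
            j += 1
        elif xi < yi:
            i += 1      # advance the vector that is behind
        else:
            j += 1
    return total


if __name__ == "__main__":
    a = [(0, 1.0), (3, 2.0), (7, 4.0)]
    b = [(3, 5.0), (4, 1.0), (7, 0.5)]
    print(sparse_dot_sorted(a, b))
```

Every loop iteration advances at least one of the two positions, so there are at most m + n iterations.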
Sorted lists
Given that your nonzero elements are sorted by coordinate index in both vectors, the dot product can be computed with the merge algorithm. That is a standard algorithm in computer science, which merges two sorted sequences into one sorted sequence, and it works in O(M + N).
There are two ways to do it. The first one is to check for equal elements inside merge. And it is indeed the best way.
The second way is to merge first, then check for equals (they must be consecutive then):
std::pair<int, double> vecA[n], vecB[m], vecBoth[n+m];
std::merge(vecA, vecA+n, vecB, vecB+m, vecBoth);
double dotP = 0.0;
for (int i = 0; i+1 < n+m; i++)
    if (vecBoth[i].first == vecBoth[i+1].first)
        dotP += vecBoth[i].second * vecBoth[i+1].second;
Complexity of std::merge is O(M + N).
Example above assumes that the data is stored in arrays (which is the best choice for sparse vectors and matrices). If you want to use linked lists, you can also perform merge in O(M + N) time, see this question.
Unsorted lists
Even if your lists are unsorted, you can still perform the dot product in O(M + N) time. The idea is to put all the elements of A into a hash table first, then iterate through the elements of B and see if there is an element in the hash table with the same index.
If indices are very large (e.g. more than million), then perhaps you should really use a nontrivial hash function. However, if your indices are rather small, then you can avoid using hash function. Simply use array of size greater than dimension of your vectors. In order to clear this array fast, you can use the trick with "generations".
#include <utility>

//global data! must be thread-local in case of concurrent access
double elemsTable[1<<20];
int whenUsed[1<<20] = {0};
int usedGeneration = 0;

// n and m are the numbers of non-zero entries in vecA and vecB;
// all indices must be smaller than the table size (1<<20)
double CalcDotProduct(const std::pair<int, double> *vecA, int n,
                      const std::pair<int, double> *vecB, int m) {
    usedGeneration++; //clear used array in O(1)
    for (int i = 0; i < n; i++) {
        elemsTable[vecA[i].first] = vecA[i].second;
        whenUsed[vecA[i].first] = usedGeneration;
    }
    double dotP = 0.0;
    for (int i = 0; i < m; i++)
        if (whenUsed[vecB[i].first] == usedGeneration)
            dotP += elemsTable[vecB[i].first] * vecB[i].second;
    return dotP;
}
Note that you might need to clear whenUsed once per billion dot products.
Use a Map to store each vector.
Each entry of the map has the index as key and the vector value at that index as value. Insert only the non-zero values.
Iterate over one map and, for each entry, check whether the key is present in the other map. If yes, add the product of the two values to the result; otherwise ignore the current key.
Time Complexity : n -> vector size
O(n) - for map construction
O(n) - for iteration
Space Complexity : O(n) - for maps
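A small Python sketch of the map-based approach, assuming the vectors arrive as unsorted lists of (index, value) pairs with no repeated index. The function name is invented for the example, and only the shorter vector is put in a dictionary, which is an optional tweak rather than part of the answer above.

```python
from typing import List, Tuple


def sparse_dot_hashed(x: List[Tuple[int, float]],
                      y: List[Tuple[int, float]]) -> float:
    """Dot product of two unsorted sparse vectors in expected O(m + n)."""
    # Hash the shorter vector to keep the extra space at O(min(m, n)).
    if len(x) > len(y):
        x, y = y, x
    lookup = dict(x)                      # index -> value
    # Only indices present in both vectors contribute.
    return sum((lookup[idx] * val for idx, val in y if idx in lookup), 0.0)


if __name__ == "__main__":
    a = [(7, 4.0), (0, 1.0), (3, 2.0)]
    b = [(4, 1.0), (7, 0.5), (3, 5.0)]
    print(sparse_dot_hashed(a, b))
```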

Choosing k out of n

I want to choose k elements uniformly at random out of a possible n without choosing the same number twice. There are two trivial approaches to this.
1. Make a list of all n possibilities. Shuffle them (you don't need to shuffle all n numbers, just k of them, by performing the first k steps of Fisher-Yates). Choose the first k. This approach takes O(k) time (assuming allocating an array of size n takes O(1) time) and O(n) space. This is a problem if k is very small relative to n.
2. Store a set of seen elements. Choose a number at random from [0, n-1]. While the element is in the set, choose a new number. This approach takes O(k) space. The run-time is a little more complicated to analyze. If k = theta(n) then the run-time is O(k*lg(k)) = O(n*lg(n)) because it is the coupon collector's problem. If k is small relative to n then it takes slightly more than O(k) because of the probability (albeit low) of choosing the same number twice. This is better than the above solution in terms of space but worse in terms of run-time.
My question:
is there an O(k) time, O(k) space algorithm for all k and n?
With an O(1) hash table, the partial Fisher-Yates method can be made to run in O(k) time and space. The trick is simply to store only the changed elements of the array in the hash table.
Here's a simple example in Java:
public static int[] getRandomSelection (int k, int n, Random rng) {
    if (k > n) throw new IllegalArgumentException(
        "Cannot choose " + k + " elements out of " + n + "."
    );
    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(2*k);
    int[] output = new int[k];
    for (int i = 0; i < k; i++) {
        int j = i + rng.nextInt(n - i);
        output[i] = (hash.containsKey(j) ? hash.remove(j) : j);
        if (j > i) hash.put(j, (hash.containsKey(i) ? hash.remove(i) : i));
    }
    return output;
}
This code allocates a HashMap of 2×k buckets to store the modified elements (which should be enough to ensure that the hash table is never rehashed), and just runs a partial Fisher-Yates shuffle on it.
Here's a quick test on Ideone; it picks two elements out of three 30,000 times, and counts the number of times each pair of elements gets chosen. For an unbiased shuffle, each ordered pair should appear approximately 5,000 (±100 or so) times, except for the impossible cases where both elements would be equal.
Your second approach does not take Theta(k log k) time on average, it takes about n/(n-k+1) + n/(n-k+2) + ... + n/n operations, which is less than k(n/(n-k)) since you have k terms which are each smaller than n/(n-k). For k <= n/2, it takes under 2*k operations on average. For k>n/2, you can choose a random subset of size n-k, and take the complement. So, this is already an O(k) average time and space algorithm.
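The rejection approach with the complement trick might look like this in Python. This is a sketch under the assumption that the elements are the integers 0..n-1; the function name and the optional `rng` parameter are made up here.

```python
import random
from typing import Optional, Set


def sample_by_rejection(k: int, n: int,
                        rng: Optional[random.Random] = None) -> Set[int]:
    """Choose k distinct integers from range(n), expected O(k) time and space."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if rng is None:
        rng = random.Random()
    # For k > n/2, draw the n - k elements to leave out instead.
    take_complement = k > n // 2
    target = n - k if take_complement else k
    chosen: Set[int] = set()
    while len(chosen) < target:
        # A repeated draw leaves the set unchanged, which acts as "pick again".
        chosen.add(rng.randrange(n))
    if take_complement:
        # O(n) work, but n < 2k in this branch, so it is still O(k).
        return set(range(n)) - chosen
    return chosen


if __name__ == "__main__":
    print(sample_by_rejection(3, 1_000_000))
    print(sample_by_rejection(9, 10))
```

Because the number of draws is at most about half of n in either branch, each draw succeeds with probability at least 1/2, which is where the "under 2*k operations on average" bound comes from.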
What you could use is the following algorithm (using javascript instead of pseudocode):
var k = 3;
var n = [1,2,3,4,5,6];
// O(k) iterations
for(var i = 0, tmp; i < k; ++i) {
    // Random index O(1)
    var index = Math.floor(Math.random() * (n.length - i));
    // Output O(1)
    console.log(n[index]);
    // Swap and lookup O(1)
    tmp = n[index];
    n[index] = n[n.length - i - 1];
    n[n.length - i - 1] = tmp;
}
In short, you swap the selected value with the last item and in the next iteration sample from the reduced subset. This assumes your original set is wholly unique.
The storage is O(n). If you wish to retrieve the numbers as a set, just refer to the last k entries of n.

Most efficient way of randomly choosing a set of distinct integers

I'm looking for the most efficient algorithm to randomly choose a set of n distinct integers, where all the integers are in some range [0..maxValue].
Constraints:
maxValue is larger than n, and possibly much larger
I don't care if the output list is sorted or not
all integers must be chosen with equal probability
My initial idea was to construct a list of the integers [0..maxValue] then extract n elements at random without replacement. But that seems quite inefficient, especially if maxValue is large.
Any better solutions?
Here is an optimal algorithm, assuming that we are allowed to use hashmaps. It runs in O(n) time and space (and not O(maxValue) time, which is too expensive).
It is based on Floyd's random sample algorithm. See my blog post about it for details.
The code is in Java:
private static Random rnd = new Random();
public static Set<Integer> randomSample(int max, int n) {
    HashSet<Integer> res = new HashSet<Integer>(n);
    int count = max + 1;
    for (int i = count - n; i < count; i++) {
        Integer item = rnd.nextInt(i + 1);
        if (res.contains(item))
            res.add(i);
        else
            res.add(item);
    }
    return res;
}
For small values of maxValue such that it is reasonable to generate an array of all the integers in memory then you can use a variation of the Fisher-Yates shuffle except only performing the first n steps.
If n is much smaller than maxValue and you don't wish to generate the entire array then you can use this algorithm:
Keep a sorted list l of numbers picked so far, initially empty.
Pick a random number x between 0 and maxValue - (number of elements in l).
For each number in l, in increasing order, if it is smaller than or equal to x, add 1 to x.
Add the adjusted value of x to the sorted list and repeat.
If n is very close to maxValue then you can randomly pick the elements that aren't in the result and then find the complement of that set.
Here is another algorithm that is simpler but has potentially unbounded execution time:
1. Keep a set s of elements picked so far, initially empty.
2. Pick a number at random between 0 and maxValue.
3. If the number is not in s, add it to s.
4. Go back to step 2 until s has n elements.
In practice if n is small and maxValue is large this will be good enough for most purposes.
One way to do it without generating the full array.
Say I want a randomly selected subset of m items from a set {x1, ..., xn} where m <= n.
Consider element x1. I add x1 to my subset with probability m/n.
If I do add x1 to my subset then I reduce my problem to selecting (m - 1) items from {x2, ..., xn}.
If I don't add x1 to my subset then I reduce my problem to selecting m items from {x2, ..., xn}.
Lather, rinse, and repeat until m = 0.
This algorithm is O(n) where n is the number of items I have to consider.
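In Python, the element-by-element decision could be sketched as follows. The function name `selection_sample` is hypothetical, and the probability "remaining needed / remaining items" is tested with an integer draw so that no floating point is involved.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def selection_sample(items: Sequence[T], m: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick m of the n items uniformly at random in one O(n) pass."""
    n = len(items)
    if not 0 <= m <= n:
        raise ValueError("need 0 <= m <= len(items)")
    if rng is None:
        rng = random.Random()
    chosen: List[T] = []
    for pos, item in enumerate(items):
        needed = m - len(chosen)       # how many we still have to pick
        if needed == 0:
            break                      # m has dropped to 0: done
        remaining = n - pos            # items left, including this one
        # Take this item with probability needed / remaining.
        if rng.randrange(remaining) < needed:
            chosen.append(item)
    return chosen


if __name__ == "__main__":
    print(selection_sample(list(range(100)), 5))
```

When the number still needed equals the number of items left, the test always succeeds, so the function is guaranteed to return exactly m items. The result keeps the original order of the items.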
I rather imagine there is an O(m) algorithm where at each step you consider how many elements to remove from the "front" of the set of possibilities, but I haven't convinced myself of a good solution and I have to do some work now!
If you are selecting M elements out of N, the strategy changes depending on whether M is of the same order as N or much less (i.e. less than about N/log N).
If they are similar in size, then you go through each item from 1 to N. You keep track of how many items you've got so far (let's call that m items picked out of n that you've gone through), and then you take the next number with probability (M-m)/(N-n) and discard it otherwise. You then update m and n appropriately and continue. This is a O(N) algorithm with low constant cost.
If, on the other hand, M is significantly less than N, then a resampling strategy is a good one. Here you will want to sort M so you can find them quickly (and that will cost you O(M log M) time--stick them into a tree, for example). Now you pick numbers uniformly from 1 to N and insert them into your list. If you find a collision, pick again. You will collide about M/N of the time (actually, you're integrating from 1/N to M/N), which will require you to pick again (recursively), so you'll expect to take M/(1-M/N) selections to complete the process. Thus, your cost for this algorithm is approximately O(M*(N/(N-M))*log(M)).
These are both such simple methods that you can just implement both--assuming you have access to a sorted tree--and pick the one that is appropriate given the fraction of numbers that will be picked.
(Note that picking numbers is symmetric with not picking them, so if M is almost equal to N, then you can use the resampling strategy, but pick those numbers to not include; this can be a win, even if you have to push all almost-N numbers around, if your random number generation is expensive.)
My solution is the same as Mark Byers'. It takes O(n^2) time, hence it's useful when n is much smaller than maxValue. Here's the implementation in python:
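import bisect
import random

def pick(n, maxValue):
    chosen = []  # kept sorted at all times
    for i in range(n):
        # i values are already taken, so maxValue + 1 - i candidates remain
        r = random.randint(0, maxValue - i)
        # shift r past every already chosen value that is <= r
        for e in chosen:
            if e <= r:
                r += 1
            else:
                break
        bisect.insort(chosen, r)
    return chosen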
The trick is to use a variation of shuffle or in other words a partial shuffle.
function random_pick( a, n )
{
    N = len(a);
    n = min(n, N);
    picked = array_fill(0, n, 0); backup = array_fill(0, n, 0);
    // partially shuffle the array, and generate unbiased selection simultaneously
    // this is a variation on fisher-yates-knuth shuffle
    for (i=0; i<n; i++) // O(n) times
    {
        selected = rand( 0, --N ); // unbiased sampling N * N-1 * N-2 * .. * N-n+1
        value = a[ selected ];
        a[ selected ] = a[ N ];
        a[ N ] = value;
        backup[ i ] = selected;
        picked[ i ] = value;
    }
    // restore partially shuffled input array from backup
    // optional step, if needed it can be ignored
    for (i=n-1; i>=0; i--) // O(n) times
    {
        selected = backup[ i ];
        value = a[ N ];
        a[ N ] = a[ selected ];
        a[ selected ] = value;
        N++;
    }
    return picked;
}
NOTE: the algorithm is strictly O(n) in both time and space, produces unbiased selections (it is a partial unbiased shuffling) and does not need hash maps (which may not be available and/or usually hide complexity behind their implementation, e.g. fetch time is not always O(1); it might even be O(n) in the worst case).
adapted from here
Linear congruential generator modulo maxValue+1. I'm sure I've written this answer before, but I can't find it...
UPDATE: I am wrong. The output of this is not uniformly distributed. Details on why are here.
I think the algorithm below is optimal, i.e. you cannot get better asymptotic performance than this when the original numbers are already stored in an array.
For choosing n numbers out of m numbers, the best offered algorithm so far is presented below. Its worst run time complexity is O(n), and needs only a single array to store the original numbers. It partially shuffles the first n elements from the original array, and then you pick those first n shuffled numbers as your solution.
This is also a fully working C program. What you find is:
Function getrand: This is just a PRNG that returns a number from 0 up to upto.
Function randselect: This is the function that randomly chooses n unique numbers out of m many numbers. This is what this question is about.
Function main: This is only to demonstrate a use for other functions, so that you could compile it into a program and have fun.
#include <stdio.h>
#include <stdlib.h>

/* Return a uniformly distributed number in [0, upto).
   Rejection sampling avoids modulo bias; assumes 0 < upto <= RAND_MAX. */
int getrand(int upto) {
    int limit = RAND_MAX - (RAND_MAX % upto);
    int r;
    do {
        r = rand();
    } while (r >= limit);
    return r % upto;
}

/* Partial Fisher-Yates shuffle; assumes select <= end. */
void randselect(int *all, int end, int select) {
    int c;
    for (c = 0; c < select; c++) {
        /* randomly choose a position in [c, end); drawing from the whole
           array instead would make the selection biased */
        int bin = c + getrand(end - c);
        /* swap c with bin */
        int tmp = all[c];
        all[c] = all[bin];
        all[bin] = tmp;
    }
}

int main() {
    int end = 1000;
    int select = 5;
    /* initialize all numbers up to end */
    int *all = malloc(end * sizeof(int));
    int c;
    if (all == NULL) return 1;
    for (c = 0; c < end; c++) {
        all[c] = c;
    }
    /* select select unique numbers randomly */
    srand(0);
    randselect(all, end, select);
    for (c = 0; c < select; c++) printf("%d ", all[c]);
    putchar('\n');
    free(all);
    return 0;
}
Here is the output of an example run in which I randomly drew 4-element permutations out of a pool of 8 numbers 100,000,000 times. I then used those draws to compute the empirical probability of each distinct permutation, and sorted the permutations by this probability. You notice that the numbers are fairly close, which I think means that the selection is uniformly distributed. The theoretical probability should be 1/1680 = 0.000595238095238095. Note how the empirical test is close to the theoretical one.
