You will be given a stream of integers - algorithm

You are given a stream of integers and an integer k for the window size. You receive the stream's integers one by one, and whenever you receive an integer, you have to return the maximum among the last k integers, including the current entry.
The interviewer was expecting a solution with O(N) time and O(1) average-case space complexity for N tasks. The integers are not given in an array; only one integer is passed to your method at a time.
I tried solving it but couldn't come up with an O(N) solution. Can anybody tell me how this can be done?

Assuming k is small and not part of the scaling parameter N (the question says N is the number of tasks, but it's not quite clear what that means):
Implement a FIFO with insertion and deletion costing O(1) in time, and O(k) in memory.
Also keep a max variable.
Pop the oldest value and check whether it equals max. If it does (which is unlikely for most inputs), run over the k elements in the FIFO and recalculate max; otherwise don't. On average this is O(1), although a strictly descending input forces a rescan at every step.
Compare the new value with max and update max if necessary.
Push the new value into the FIFO.
Then the time per integer is O(1) and memory is O(k+1). I don't see how you can avoid storage requirements of at least O(k). The time to process N integers is then O(N).
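The steps above can be sketched in Python as follows. This is only an illustration under the same assumption that k is small; the class name FifoWindowMax and its push method are made up for this example and do not come from the answer.

```python
from collections import deque


class FifoWindowMax:
    """Keeps the last k values in a FIFO together with a cached maximum."""

    def __init__(self, k):
        self.k = k
        self.fifo = deque()
        self.max = None

    def push(self, value):
        # Store the new value and update the cached maximum if it is beaten.
        self.fifo.append(value)
        if self.max is None or value > self.max:
            self.max = value
        # Once the window is over-full, drop the oldest value.
        if len(self.fifo) > self.k:
            oldest = self.fifo.popleft()
            if oldest == self.max:
                # O(k) rescan, needed only when the maximum leaves the window.
                self.max = max(self.fifo)
        return self.max
```

Calling push once per incoming integer returns the current window maximum; the rescan is the only step that is not constant time.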

O(N) time is easy, but O(1) average space is impossible.
Here's what we absolutely need to store. For any number x we've seen in the last k inputs, if we've seen a bigger number since x, we can forget about x, since we'll never need to return it or compare it to anything again. If we haven't seen a bigger number since x, we need to store x, since we might have to return it at some point. Thus, we need to store the biggest number in the last k items, and the biggest after that, and the biggest after that, all the way up to the current input. In the worst case, the input is descending, and we always need to store all of the last k inputs. In the average case, at any time, we'll need to keep track of O(log(k)) items; however, the peak memory usage will be greater than this.
The algorithm we use is to simply keep track of a deque of all the numbers we just said we need to store, in their natural, descending order, along with when we saw them*. When we receive an input, we pop everything lower than it from the right of the deque and push the input on the right of the deque. We peek-left, and if we see that the item on the left is older than the window size, we pop-left. Finally, we peek-left, and the number we see is the sliding window maximum.
This algorithm processes each input in amortized constant time. The only part of processing an input that isn't constant time is the part where we pop-right potentially all of the deque, but since each input is only popped once over the course of the algorithm, that's still amortized constant. Thus, the algorithm takes O(N) time to process all input.
*If N is ridiculously huge, we can keep track of the indices at which we saw things mod k to avoid overflow problems.
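A minimal Python sketch of the deque approach described above, assuming the inputs arrive one call at a time. The class name SlidingWindowMax and the method push are hypothetical names chosen for this example, and equal values are popped together with lower ones, which does not change the result.

```python
from collections import deque


class SlidingWindowMax:
    """Sliding-window maximum using a deque of (index, value) candidates."""

    def __init__(self, k):
        self.k = k
        self.count = 0             # index of the next input
        self.candidates = deque()  # values are strictly decreasing left to right

    def push(self, value):
        # Anything not larger than the new value can never be returned again.
        while self.candidates and self.candidates[-1][1] <= value:
            self.candidates.pop()
        self.candidates.append((self.count, value))
        # The leftmost candidate expires once it is k or more positions old.
        if self.candidates[0][0] <= self.count - self.k:
            self.candidates.popleft()
        self.count += 1
        return self.candidates[0][1]
```

Each input is appended once and popped at most once, which is where the amortized constant cost per input comes from.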

First of all, most interviewers will consider memory O(k) to be O(1).
Given that little detail, you can just implement a ring buffer:
int numbers[k]; // O(1) because k is constant
int i = 0, next_number;
// new_number() and yield() stand in for reading the stream and emitting a result
while (next_number = new_number()) {
    numbers[i % k] = next_number;
    i++;
    if (i >= k) {
        int max = INT_MIN;
        for (int j = 0; j < k; j++) { // O(1) because k is constant
            if (numbers[j] > max) max = numbers[j];
        }
        yield(max);
    }
}
Of course, if they don't consider O(k) to be O(1), there is no solution to the problem and they either screwed up their question or were hoping for you to say that the question is wrong.
A fifo/deque is usually faster here (for k above some number), I'm just demonstrating the simplest answer to a dumb question.

This is the answer that comes to mind (C++)
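#include <deque>
#include <iostream>
#include <utility>
using namespace std;
int main()
{
    cout << "Enter K: " << endl;
    int K;
    cin >> K;
    cout << "Enter stream." << endl;
    // Candidates for the maximum as (index, value) pairs, values decreasing.
    deque<pair<long long, int>> window;
    long long counter = 0;
    int n;
    while (cin >> n) {
        // Values no larger than n can never be the window maximum again.
        while (!window.empty() && window.back().second <= n)
            window.pop_back();
        window.emplace_back(counter, n);
        // Drop the front once it falls outside the last K inputs.
        if (window.front().first <= counter - K)
            window.pop_front();
        cout << window.front().second << endl;
        ++counter;
    }
}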

Related

Best O(n) algorithm to find often appearing numbers?

This is an example. Each number in an array A of n numbers is a value in the range [0..k]. A number x is said to appear often in A if at least 1/3 of the numbers in the array are equal to x.
What would be an O(n) algorithm for finding the often-appearing numbers in the case where k is orders of magnitude larger than n?
Why not use a hash map, i.e. a hash-based mapping (dictionary) from integers to integers? Then just iterate over your input array and compute the counters. In imperative pseudo-code:
const int often = ceiling(n/3);
hashmap m;
for int i = 1 to n do {
    if m.contains(A[i])
        m[A[i]] += 1;
    else
        m[A[i]] = 1;
    if m[A[i]] >= often
        // A[i] is appearing often
        // print it or store it in the result set, etc.
}
This is O(n) in terms of time (expected) and space.
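For reference, the same counting idea could look like this in Python. It is a sketch only: the function name often_appearing is made up here, and the threshold test 3 * count >= n is just an integer-only way of writing "at least 1/3 of the numbers".

```python
from collections import Counter


def often_appearing(a):
    """Return the values that make up at least one third of the list a."""
    n = len(a)
    counts = Counter(a)  # hash-based counting, expected O(n)
    # 3 * c >= n avoids floating-point rounding in the 1/3 threshold.
    return sorted(x for x, c in counts.items() if 3 * c >= n)
```

The memory used depends on the number of distinct values in the array, not on k, which is the point when k is much larger than n.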

I'm confused about space complexity

I'm a little confused about the space complexity.
int fn_sum(int a[], int n){
    int result =0;
    for(int i=0; i<n ; i++){
        result += a[i];
    }
    return result;
}
In this case, is the space complexity O(n) or O(1)?
I think it uses only result,i variables so it is O(1). What's the answer?
(1) Space complexity: how much memory does your algorithm allocate as a function of the input size?
int fn_sum(int a[], int n){
    int result = 0; //here you have 1 variable allocated
    for(int i=0; i<n ; i++){
        result += a[i];
    }
    return result;
}
Since the variable you created (result) is a single value (not a list, an array, etc.), the space complexity is O(1): the space usage is constant, meaning it doesn't change with the size of the input.
(2) Time complexity: how does the number of operations of your algorithm relate to the size of the input?
int fn_sum(int a[], int n){ //the input is an array of size n
    int result = 0; //1 variable definition operation = O(1)
    for(int i=0; i<n ; i++){ //loop that will run n times whatever it has inside
        result += a[i]; //1 sum operation = O(1) that runs n times = n * O(1) = O(n)
    }
    return result; //1 return operation = O(1)
}
All the operations together take O(1) + O(n) + O(1) = O(n + 2) = O(n) time, following the rule of dropping multiplicative and additive constants.
I'd answer a bit differently:
Since the memory consumed by int fn_sum(int a[], int n) doesn't correlate with the number of input items, its space complexity is O(1).
However, the runtime complexity is O(N) since it iterates over N items.
And yes, there are algorithms that consume more memory in order to run faster. A classic example is caching the results of operations.
https://en.wikipedia.org/wiki/Space_complexity
If int means the 32-bit signed integer type, the space complexity is O(1) since you always allocate, use and return the same number of bits.
If this is just pseudocode and int means integers represented in their binary representations with no leading zeroes and maybe an extra sign bit (imagine doing this algorithm by hand), the analysis is more complicated.
If negatives are allowed, the best case is alternating positive and negative numbers so that the result never grows beyond a constant size - O(1) space.
If zero is allowed, an equally good case is to put zero in the whole array. This is also O(1).
If only positive numbers are allowed, the best case is more complicated. I expect the best case will see some number repeated n times. For the best case, we'll want the smallest representable number for the number of bits involved; so, I expect the number to be a power of 2. We can work out the sum in terms of n and the repeated number:
result = n * val
result size = log(result) = log(n * val) = log(n) + log(val)
input size = n*log(val) + log(n)
As val grows without bound, the log(val) term dominates in result size, and the n*log(val) term dominates in the input size; the best-case is thus like the multiplicative inverse of the input size, so also O(1).
The worst case should be had by choosing val to be as small as possible (we choose val = 1) and letting n grow without bound. In that case:
result = n
result size = log(n)
input size = 2 * log(n)
This time, the result size grows like half the input size as n grows. The worst-case space complexity is linear.
Another way to calculate space complexity is to analyze whether the memory required by your code scales/increases according to the input given.
Your input is int a[] with size being n. The only variable you have declared is result.
No matter what the size of n is, result is declared only once. It does not depend on the size of your input n.
Hence you can conclude your space complexity to be O(1).
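To make the contrast concrete, here is a small Python sketch. The function total mirrors fn_sum from the question; prefix_sums is a hypothetical counterexample added only for illustration, since it allocates a list that grows with the input.

```python
def total(values):
    """O(1) extra space: one accumulator, whatever the input length."""
    result = 0
    for v in values:
        result += v
    return result


def prefix_sums(values):
    """O(n) extra space: the output list grows with the input length."""
    sums = []
    running = 0
    for v in values:
        running += v
        sums.append(running)
    return sums
```

Both functions take O(n) time; only the second one needs memory proportional to n beyond the input itself.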

Big O - is n always the size of the input?

I made up my own interview-style problem, and have a question on the big O of my solution. I will state the problem and my solution below, but first let me say that the obvious solution involves a nested loop and is O(n^2). I believe I found an O(n) solution, but then I realized it depends not only on the size of the input, but also on the largest value in the input. It seems like my running time of O(n) is only a technicality, and that it could easily run in O(n^2) time or worse in real life.
The problem is:
For each item in a given array of positive integers, print all the other items in the array that are multiples of the current item.
Example Input:
[2 9 6 8 3]
Example Output:
2: 6 8
9:
6:
8:
3: 9 6
My solution (in C#):
private static void PrintAllDivisibleBy(int[] arr)
{
    Dictionary<int, bool> dic = new Dictionary<int, bool>();
    if (arr == null || arr.Length < 2)
        return;
    int max = arr[0];
    for(int i=0; i<arr.Length; i++)
    {
        if (arr[i] > max)
            max = arr[i];
        dic[arr[i]] = true;
    }
    for(int i=0; i<arr.Length; i++)
    {
        Console.Write("{0}: ", arr[i]);
        int multiplier = 2;
        while(true)
        {
            int product = multiplier * arr[i];
            if (dic.ContainsKey(product))
                Console.Write("{0} ", product);
            if (product >= max)
                break;
            multiplier++;
        }
        Console.WriteLine();
    }
}
So, if 2 of the array items are 1 and n, where n is the array length, the inner while loop will run n times, making this equivalent to O(n^2). But, since the performance is dependent on the size of the input values, not the length of the list, that makes it O(n), right?
Would you consider this a true O(n) solution? Is it only O(n) due to technicalities, but slower in real life?
Good question! The answer is that, no, n is not always the size of the input: you can't really talk about O(n) without defining what n means, but people often use imprecise language and imply that n is "the most obvious thing that scales here". Technically we should usually say things like "This sort algorithm performs a number of comparisons that is O(n log n) in the number of elements in the list", being specific about both what n is and what quantity we are measuring (comparisons).
If you have an algorithm that depends on the product of two different things (here, the length of the list and the largest element in it), the proper way to express that is in the form O(m*n), and then define what m and n are for your context. So, we could say that your algorithm performs O(m*n) multiplications, where m is the length of the list and n is the largest item in the list.
An algorithm is O(n) when you have to iterate over n elements and perform some constant-time operation in each iteration. The inner while loop of your algorithm is not constant time, as it depends on the size of the biggest number in your array.
Your algorithm's best-case run time is O(n). This is the case when all the n numbers are the same.
Your algorithm's worst-case run time is O(k*n), where k is the maximum int value possible on your machine, if you really insist on putting an upper bound on k's value. For a 32-bit int the maximum value is 2,147,483,647. You can argue that this k is a constant, but this constant is clearly
not fixed for every case of input array; and,
not negligible.
Would you consider this a true O(n) solution?
The runtime actually is O(nm) where m is the maximum element from arr. If the elements in your array are bounded by a constant you can consider the algorithm to be O(n)
Can you improve the runtime? Here's what else you can do. First, notice that you can ensure that the elements are distinct (compress the array into a hash map that stores how many times each element occurs). Then the runtime becomes max/a[0] + max/a[1] + max/a[2] + ... <= max + max/2 + ... + max/max = O(max*log(max)) (assuming your array arr is sorted). If you combine this with the obvious O(n^2) algorithm, you get an O(min(n^2, max*log(max))) algorithm.
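The deduplicated version described above might be sketched like this in Python, assuming all inputs are positive integers. The function name multiples_in_array is made up for this example, and it returns a dictionary instead of printing.

```python
def multiples_in_array(arr):
    """Map each distinct value to the other array values that are its multiples."""
    present = set(arr)  # duplicates collapse into a single entry
    if not present:
        return {}
    largest = max(present)
    result = {}
    for v in present:
        # Visits about largest / v candidates: 2v, 3v, ... up to largest.
        result[v] = [m for m in range(2 * v, largest + 1, v) if m in present]
    return result
```

Summed over distinct values v, the inner loops perform at most max + max/2 + ... + max/max steps, which is the harmonic bound quoted above.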

Median Algorithm in O(log n)

How can we remove the median of a set with time complexity O(log n)? Any ideas?
If the set is sorted, finding the median requires O(1) item retrievals. If the items are in arbitrary sequence, it will not be possible to identify the median with certainty without examining the majority of the items. If one has examined most, but not all, of the items, that will allow one to guarantee that the median will be within some range [if the list contains duplicates, the upper and lower bounds may match], but examining the majority of the items in a list implies O(n) item retrievals.
If one has the information in a collection which is not fully ordered, but where certain ordering relationships are known, then the time required may require anywhere between O(1) and O(n) item retrievals, depending upon the nature of the known ordering relation.
For unsorted lists, repeatedly do O(n) partial sort until the element located at the median position is known. This is at least O(n), though.
Is there any information about the elements being sorted?
For a general, unsorted set, it is impossible to reliably find the median in better than O(n) time. You can find the median of a sorted set in O(1), or you can trivially sort the set yourself in O(n log n) time and then find the median in O(1), giving an O(n log n) algorithm. Or, finally, there are more clever median selection algorithms that work by partitioning instead of sorting and yield O(n) performance.
But if the set has no special properties and you are not allowed any pre-processing step, you will never get below O(n) by the simple fact that you will need to examine all of the elements at least once to ensure that your median is correct.
Here's a solution in Java, based on TreeSet:
import java.util.SortedSet;
import java.util.TreeSet;

public class SetWithMedian {
    private SortedSet<Integer> s = new TreeSet<Integer>();
    private Integer m = null;
    public boolean contains(int e) {
        return s.contains(e);
    }
    public Integer getMedian() {
        return m;
    }
    public void add(int e) {
        s.add(e);
        updateMedian();
    }
    public void remove(int e) {
        s.remove(e);
        updateMedian();
    }
    private void updateMedian() {
        if (s.size() == 0) {
            m = null;
        } else if (s.size() == 1) {
            m = s.first();
        } else {
            SortedSet<Integer> h = s.headSet(m);
            SortedSet<Integer> t = s.tailSet(m + 1);
            int x = 1 - s.size() % 2;
            if (h.size() < t.size() + x)
                m = t.first();
            else if (h.size() > t.size() + x)
                m = h.last();
        }
    }
}
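Removing the median (i.e. "s.remove(s.getMedian())") takes O(log n) time for the tree operations themselves. Note that java.util.TreeSet computes size() on a headSet or tailSet view by iterating over the view, so updateMedian() as written is O(n); keeping explicit counts of the elements on each side of the median restores the O(log n) bound.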
Edit: To help understand the code, here's the invariant condition of the class attributes:
private boolean isGood() {
    if (s.isEmpty()) {
        return m == null;
    } else {
        return s.contains(m) && s.headSet(m).size() + s.size() % 2 == s.tailSet(m).size();
    }
}
In human-readable form:
If the set "s" is empty, then "m" must be null.
If the set "s" is not empty, then it must contain "m".
Let x be the number of elements strictly less than "m", and let y be the number of elements greater than or equal to "m". Then, if the total number of elements is even, x must be equal to y; otherwise, x+1 must be equal to y.
Try a red-black tree. It should work quite well, and with a binary search you get your log(n). It also has remove and insert times of log(n), and rebalancing is done in log(n) as well.
As mentioned in previous answers, there is no way to find the median without touching every element of the data structure. If the algorithm you look for must be executed sequentially, then the best you can do is O(n). The deterministic selection algorithm (median-of-medians) or BFPRT algorithm will solve the problem with a worst case of O(n). You can find more about that here: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
However, the median of medians algorithm can be made to run faster than O(n) by making it parallel. Due to its divide-and-conquer nature, the algorithm can be "easily" parallelized. For instance, when dividing the input array into groups of 5 elements, you could potentially launch a thread for each sub-array, sort it and find the median within that thread. When this step has finished, the threads are joined and the algorithm is run again on the newly formed array of medians.
Note that such design would only be beneficial in really large data sets. The additional overhead that spawning threads has and merging them makes it unfeasible for smaller sets. This has a bit of insight: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html
Note that you can find asymptotically faster algorithms out there, however they are not practical enough for daily use. Your best bet is the already mentioned sequential median-of-medians algorithm.
Master Yoda's randomized algorithm has, of course, a minimum complexity of n like any other, an expected complexity of n (not log n) and a maximum complexity of n squared like Quicksort. It's still very good.
In practice, the "random" pivot choice might sometimes be a fixed location (without involving a RNG) because the initial array elements are known to be random enough (e.g. a random permutation of distinct values, or independent and identically distributed) or deduced from an approximate or exactly known distribution of input values.
I know one randomized algorithm with time complexity O(n) in expectation.
Here is the algorithm:
Input: array of n distinct numbers A[1...n] [without loss of generality we can assume n is even]
Output: the n/2-th element in the sorted array.
Algorithm (A[1..n], k = n/2):
Pick a pivot index p uniformly at random from 1...n
Divide the array into 2 parts:
L - the elements < A[p]
R - the elements > A[p]
if (k == |L| + 1) A[p] is the answer; stop
if (k <= |L|) recurse on (L, k)
else recurse on (R, k - (|L| + 1))
Complexity:
O(n) in expectation
The proof is all mathematical and about one page long. If you are interested, ping me.
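A runnable Python sketch of this randomized selection, written iteratively rather than recursively. The names quickselect and median are illustrative, k is 1-based, and the sketch also counts values equal to the pivot so that duplicates are handled, which the description above does not require.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (1-based) of a non-empty sequence."""
    items = list(values)
    while True:
        pivot = random.choice(items)  # uniformly random pivot
        smaller = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k <= len(smaller):
            items = smaller           # answer lies among the smaller elements
        elif k <= len(smaller) + equal_count:
            return pivot              # the pivot itself is the k-th smallest
        else:
            k -= len(smaller) + equal_count
            items = [x for x in items if x > pivot]


def median(values):
    """Lower median: element number ceil(n / 2) in sorted order."""
    return quickselect(values, (len(values) + 1) // 2)
```

The expected running time is linear, while an unlucky sequence of pivots can make it quadratic, just as with Quicksort.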
To expand on rwong's answer, here is some example code:
// partial_sort example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
int main () {
    int myints[] = {9,8,7,6,5,4,3,2,1};
    vector<int> myvector (myints, myints+9);
    vector<int>::iterator it;
    partial_sort (myvector.begin(), myvector.begin()+5, myvector.end());
    // print out content:
    cout << "myvector contains:";
    for (it=myvector.begin(); it!=myvector.end(); ++it)
        cout << " " << *it;
    cout << endl;
    return 0;
}
Output:
myvector contains: 1 2 3 4 5 9 8 7 6
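The element in the middle (the fifth one here) is the median. The order of the elements after the sorted prefix is unspecified, so the last four values may differ between implementations.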

Find the most common entry in an array

You are given a 32-bit unsigned integer array with length up to 2^32, with the property that more than half of the entries in the array are equal to N, for some 32-bit unsigned integer N. Find N looking at each number in the array only once and using at most 2 kB of memory.
Your solution must be deterministic, and guaranteed to find N.
Keep one integer for each bit, and increment this collection appropriately for each integer in the array.
At the end, some of the bits will have a count higher than half the length of the array - those bits determine N. Of course, the count will be higher than the number of times N occurred, but that doesn't matter. The important thing is that any bit which isn't part of N cannot occur more than half the times (because N has over half the entries) and any bit which is part of N must occur more than half the times (because it will occur every time N occurs, and any extras).
(No code at the moment - about to lose net access. Hopefully the above is clear enough though.)
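The per-bit counting idea can be sketched in Python as below. This is an illustration only: the function name majority_by_bits and the bits parameter are assumptions made for this example, and the result is meaningful only when a majority value really exists.

```python
def majority_by_bits(numbers, bits=32):
    """Rebuild the majority value from per-bit counts in a single pass."""
    counts = [0] * bits  # one counter per bit position
    total = 0
    for x in numbers:
        total += 1
        for b in range(bits):
            counts[b] += (x >> b) & 1
    result = 0
    for b in range(bits):
        # A bit belongs to the majority value iff it is set in more than half the entries.
        if 2 * counts[b] > total:
            result |= 1 << b
    return result
```

Because the input is consumed through a plain for loop, the function also works on an iterator, so each number is looked at only once.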
Boyer and Moore's "Linear Time Majority Vote Algorithm" - go down the array maintaining your current guess at the answer.
You can do this with only two variables.
public uint MostCommon(UInt32[] numberList)
{
    uint suspect = 0;
    long suspicionStrength = 0;
    foreach (uint number in numberList)
    {
        if (suspicionStrength == 0)
        {
            // No current suspect: adopt this number.
            suspect = number;
            suspicionStrength = 1;
        }
        else if (number == suspect)
        {
            suspicionStrength++;
        }
        else
        {
            suspicionStrength--;
        }
    }
    return suspect;
}
Make the first number the suspect number, and continue looping through the list. If the number matches, increase the suspicion strength by one; if it doesn't match, lower the suspicion strength by one. If the suspicion strength is 0, the next number becomes the suspect number with a strength of 1. This will not work to find the most common number in general, only a number that makes up more than 50% of the group. Resist the urge to add a check if suspicionStrength is greater than half the list length - it will always result in more total comparisons.
P.S. I have not tested this code - use it at your own peril.
Pseudo code (notepad C++ :-)) for Jon's algorithm:
int lNumbers = sizeof(arrNumbers) / sizeof(arrNumbers[0]);
int arrBits[32] = {0}; // one counter per bit position
for (int i = 0; i < lNumbers; i++)
    for (int bi = 0; bi < 32; bi++)
        arrBits[bi] += (arrNumbers[i] >> bi) & 1u;
unsigned int N = 0;
for (int bc = 0; bc < 32; bc++)
    if (arrBits[bc] > lNumbers / 2)
        N |= 1u << bc;
Notice that if the sequence a_0, a_1, ..., a_{n-1} contains a leader, then after removing a pair of elements of different values, the remaining sequence still has the same leader. Indeed, if we remove two different elements then only one of them could be the leader. The leader in the new sequence occurs more than n/2 - 1 = (n-2)/2 times. Consequently, it is still the leader of the new sequence of n - 2 elements.
Here is a Python implementation, with O(n) time complexity:
def goldenLeader(A):
    n = len(A)
    size = 0
    for k in range(n):
        if (size == 0):
            size += 1
            value = A[k]
        else:
            if (value != A[k]):
                size -= 1
            else:
                size += 1
    candidate = -1
    if (size > 0):
        candidate = value
    leader = -1
    count = 0
    for k in range(n):
        if (A[k] == candidate):
            count += 1
    if (count > n // 2):
        leader = candidate
    return leader
This is a standard problem in streaming algorithms, where you have a huge (potentially infinite) stream of data and have to calculate some statistics from it in a single pass.
Clearly you can approach it with hashing or sorting, but with a potentially infinite stream you can run out of memory. So you have to do something smart here.
The majority element is the element that occurs in more than half of the array's positions. This means that the majority element occurs more often than all other elements combined: if you count the number of times the majority element appears and subtract the number of all other elements, you get a positive number.
So if you count the occurrences of some element, subtract the number of all other elements, and get 0, then that element can't be a majority element. This is the basis for a correct algorithm:
Have two variables, a counter and a possible element. Iterate over the stream: if the counter is 0, overwrite the possible element and set the counter to 1; if the number is the same as the possible element, increase the counter; otherwise decrease it. Python code:
def majority_element(arr):
    counter, possible_element = 0, None
    for i in arr:
        if counter == 0:
            possible_element, counter = i, 1
        elif i == possible_element:
            counter += 1
        else:
            counter -= 1
    return possible_element
It is clear that the algorithm is O(n) with a very small constant (about 3). Also it looks like the space complexity is O(1), because we have only three variables. The problem is that one of these variables is a counter which can potentially grow up to n (when the array consists of the same number repeated). And to store the number n you need O(log(n)) space. So from a theoretical point of view it is O(n) time and O(log(n)) space. In practice, a 128-bit integer can count up to 2^128, and an array with that many elements is unimaginably huge.
Also note that the algorithm works only if there is a majority element. If no such element exists, it will still return some number, which will be wrong. (It is easy to modify the algorithm to tell whether a majority element exists.)
History channel: this algorithm was published in 1981 by Boyer and Moore and is called the Boyer–Moore majority vote algorithm.
I have recollections of this algorithm, which might or might not follow the 2 kB rule. It might need to be rewritten with stacks and the like to avoid breaking the memory limits due to function calls, but this might be unneeded since it only ever has a logarithmic number of such calls. Anyhow, I have vague recollections from college of a recursive solution to this which involved divide and conquer, the secret being that when you divide the groups in half, at least one of the halves still has more than half of its values equal to the top value. The basic rule when dividing is that you return two candidate top values, one of which is the top value and one of which is some other value (that may or may not be in 2nd place). I forget the algorithm itself.
Proof of correctness for buti-oxa / Jason Hernandez's answer, assuming Jason's answer is the same as buti-oxa's answer and both work the way the algorithm described should work:
We define adjusted suspicion strength as being equal to suspicion strength if the top value is selected, or -suspicion strength if the top value is not selected. Every time you pick the right number, the current adjusted suspicion strength increases by 1. Each time you pick a wrong number, it either drops by 1 or increases by 1, depending on whether the wrong number is currently selected. So, the minimum possible ending adjusted suspicion strength is equal to number-of[top values] - number-of[other values]. Since the top value makes up more than half of the entries, this quantity is positive, so the top value must be the one selected at the end.

Resources