Can this be properly modeled with segment trees? - algorithm

The problem I'm working on requires processing several queries on an array (the size of the array is less than 10k, the largest element is certainly less than 10^9).
A query consists of two integers, and one must find the total count of subarrays that have an equal count of these integers. There may be up to 5 * 10^5 queries.
For instance, given the array [1, 2, 1], and the query 1 2 we find that there are two subarrays with equal counts of 1 and 2, namely [1, 2] and [2, 1].
My initial approach was using dynamic programming in order to construct a map, such that memo[i][j] = the number of times the number i appears in the array, until index j. I would use this in a similar way one would use prefix sums, but instead frequencies would accumulate.
Constructing this map took me O(n^2). For each query, I'd do an O(1) processing for each interval and increment the answer. This leads to a complexity of O((q + 1)n * (n - 1) / 2)) [q is the number of queries], which is to say O(n^2), but I also wanted to emphasize that daunting constant factor.
After some rearrangement, I'm trying to find out if there's a way to determine for every subarray the frequency count of each element. I strongly feel this problem is about segment trees and I've struggled with coming up with a proper model and this was the only thing I could think of.
However my approach doesn't seem to be too useful in this case, considering the complexity of combining nodes holding such a great amount of information, not to mention the memory overhead.
How can this be solved efficiently?

Idea 1
You can reduce the time for each query from O(n^2) to O(n) by computing the frequency count of the cumulative count difference:
from collections import defaultdict
def query(A,a,b):
t = 0
freq = defaultdict(int)
freq[0] = 1
for x in A:
if x==a:
t+=1
elif x==b:
t-=1
freq[t] += 1
return sum(count*(count-1)/2 for count in freq.values())
print query([1,2,1],1,2)
The idea is that t represents the total discrepancy between the count of the two elements.
If we find two positions in the array with the same total discrepancy we can conclude that the subarray between these positions must have an equal number.
The expression count*(count-1)/2 simply counts the number of ways of choosing two positions from the count which have the same discrepancy.
Example
For example, suppose we have the array [1,1,1,2,2,2]. The values for the cumulative discrepancy (number of 1's take away number of 2's) will be:
0,1,2,3,2,1,0
Each pair with the same number, corresponds to a subarray with equal count. e.g. looking at the pair of 2s we find that the range from position 2 to position 4 has equal count.
Idea 2
If this is still not fast enough, you could optimize the query function to quickly skip over all elements that are not equal to a or b. For example, you could prepare a list for each element value that contains all the locations of that element.
Once you have this list, you can then instantly jump to the next location of either a or b. For all intermediate values we know the discrepancy will not change, so you can update the frequency by the number of skipped elements (instead of always adding just 1 to the count).

Related

How to find 2 special elements in the array in O(n)

Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exists 2 elements in the sequence, x,y, such that |x-y|<=(M-m)/n.
Now, is there a way to find an algorithm that finds such 2 elements in time complexity of O(n)?
I thought about sorting the sequence, but since I dont know anything about M I cannot use radix/bucket or any other linear time algorithm that I'm familier with.
I'd appreciate any idea.
Thanks in advance.
First find out n, M, m. If not already given they can be determined in O(n).
Then create a memory storage of n+1 elements; we will use the storage for n+1 buckets with width w=(M-m)/n.
The buckets cover the range of values equally: Bucket 1 goes from [m; m+w[, Bucket 2 from [m+w; m+2*w[, Bucket n from [m+(n-1)*w; m+n*w[ = [M-w; M[, and the (n+1)th bucket from [M; M+w[.
Now we go once through all the values and sort them into the buckets according to the assigned intervals. There should be at a maximum 1 element per bucket. If the bucket is already filled, it means that the elements are closer together than the boundaries of the half-open interval, e.g. we found elements x, y with |x-y| < w = (M-m)/n.
If no such two elements are found, afterwards n buckets of n+1 total buckets are filled with one element. And all those elements are sorted.
We once more go through all the buckets and compare the distance of the content of neighbouring buckets only, whether there are two elements, which fulfil the condition.
Due to the width of the buckets, the condition cannot be true for buckets, which are not adjoining: For those the distance is always |x-y| > w.
(The fulfilment of the last inequality in 4. is also the reason, why the interval is half-open and cannot be closed, and why we need n+1 buckets instead of n. An alternative would be, to use n buckets and make the now last bucket a special case with [M; M+w]. But O(n+1)=O(n) and using n+1 steps is preferable to special casing the last bucket.)
The running time is O(n) for step 1, 0 for step 2 - we actually do not do anything there, O(n) for step 3 and O(n) for step 4, as there is only 1 element per bucket. Altogether O(n).
This task shows, that either sorting of elements, which are not close together or coarse sorting without considering fine distances can be done in O(n) instead of O(n*log(n)). It has useful applications. Numbers on computers are discrete, they have a finite precision. I have sucessfuly used this sorting method for signal-processing / fast sorting in real-time production code.
About #Damien 's remark: The real threshold of (M-m)/(n-1) is provably true for every such sequence. I assumed in the answer so far the sequence we are looking at is a special kind, where the stronger condition is true, or at least, for all sequences, if the stronger condition was true, we would find such elements in O(n).
If this was a small mistake of the OP instead (who said to have proven the stronger condition) and we should find two elements x, y with |x-y| <= (M-m)/(n-1) instead, we can simplify:
-- 3. We would do steps 1 to 3 like above, but with n buckets and the bucket width set to w = (M-m)/(n-1). The bucket n now goes from [M; M+w[.
For step 4 we would do the following alternative:
4./alternative: n buckets are filled with one element each. The element at bucket n has to be M and is at the left boundary of the bucket interval. The distance of this element y = M to the element x in the n-1th bucket for every such possible element x in the n-1thbucket is: |M-x| <= w = (M-m)/(n-1), so we found x and y, which fulfil the condition, q.e.d.
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M elements, in O(N).
You calculate the mid = (m + M)/2value.
You concentrate the value less than mid at the beginning, and more than mid at the end of he array.
You select the part with the largest number of elements and you iterate until very few numbers are kept.
If both parts have the same number of elements, you can select any of them. If the remaining part has much more elements than n/2, then in order to maintain a O(n) complexity, you can keep onlyn/2 + 1 of them, as the goal is not to find the smallest difference, but one difference small enough only.
As indicated in a comment by #btilly, this solution could fail in some cases, for example with an input [0, 2.1, 2.9, 5]. For that, it is needed to calculate the max value of the left hand, and the min value of the right hand, and to test if the answer is not right_min - left_max. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct results is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.

find 4th smallest element in linear time

So i had an exercise given to me about 2 months ago, that says the following:
Given n (n>=4) distinct elements, design a divide & conquer algorithm to compute the 4th smallest element. Your algorithm should run in linear time in the worst case.
I had an extremely hard time with this problem, and could only find relevant algorithms that runs in the worst case O(n*k). After several weeks of trying, we managed, with the help of our teacher, "solve" this problem. The final algorithm is as follows:
Rules: The input size can only be of size 2^k
(1): Divide input into n/2. One left array, one right array.
(2): If input size == 4, sort the arrays using merge sort.
(2.1) Merge left array with right array into a new result array with length 4.
(2.2) Return element at index [4-1]
(3): Repeat step 1
This is solved recursively and our base case is at step 2. Step 2.2 means that for all
of our recursive calls that we did, we will get a final result array of length 4, and at that
point, we can justr return the element at index [4-1].
With this algorithm, my teacher claims that this runs in linear time. My problem with that statement is that we are diving the input until we reach sub-arrays with an input size of 4, and then that is sorted. So for an input size of 8, we would sort 2 sub-arrays with length 4, since 8/4 = 2. How is this in any case linear time? We are still sorting the whole input size but in blocks aren't we? This really does not make sense to me. It doesn't matter if we sort the whole input size at it is, or divide it into sub-arrays with size of 4,and sort them like that? It will still be a worst time of O(n*log(n))?
Would appreciate some explanations on this !
To make proving that algorithm runs in linear time, let's modify it a bit (we will only change an order of dividing and merging blocks, nothing more):
(1): Divide input into n/4 blocks, each has size 4.
(2): Until there is more than one block, repeat:
Merge each pair of adjacent blocks into one block of size 4.
(For example, if we have 4 blocks, we will split them in 2 pairs -
first pair contains first and second blocks,
second pair contains third and fourth blocks.
After merging we will have 2 blocks -
the first one contains 4 least elements from blocks 1 and 2,
the second one contains 4 least elements from blocks 3 and 4).
(3): The answer is the last element of that one block left.
Proof: It's a fact that array of constant length (in your case, 4) can be sorted in constant time. Let k = log(n). Loop (2) runs k-2 iterations (on each iteration the count of elements left is divided by 2, until 4 elements are left).
Before i-th iteration (0 <= i < k-2) there are (2^(k-i)) elements left, so there are 2^(k-i-2) blocks and we will merge 2^(k-i-3) pairs of blocks. Let's find how many pairs we will merge in all iterations. Count of merges equals
mergeOperationsCount = 2^(k-3) + 2^(k-4) + .... + 2^(k-(k-3)) =
= 2^(k-3) * (1 + 1/2 + 1/4 + 1/8 + .....) < 2^(k-2) = O(2^k) = O(n)
Since we can merge each pair in constant time (because thay have constant size), and the only operation we make is merging pairs, the algorithm runs in O(n).
And after this proof, I want to notice that there is another linear algorithm which is trivial, but it is not divide-and-conquer.

Find most frequent combination of numbers in a set

This question is related to the following questions:
How to find most frequent combinations of numbers in a list
Most frequently occurring combinations
My problem is:
Scenario:
I have a set of numbers, EACH COMBINATION IS UNIQUE in this set and each number in the combination appears only once:
Goal:
Find frequency of appears of combination (size of 2) in this set.
Example:
The frequency threshold is 2.
Set = {1,12,13,134,135,235,2345,12345}
The frequency of degree of 2 combination is(show all combinations that appears more than 2 times):
13 - appear 4 times
14 - appear 3 times
23 - appear 3 times
12 - appear 2 times
...
The time complexity of exhaustive searching for all possible combinations grow exponentially.
Can any one help me to think a algorithm that can solve this problem faster? (hash table, XOR, tree search....)
Thank you
PS.
Don't worry about the space complexity
Solution and conclusion:
templatetypedef's answer is good for substring' length more than 3
If substring's length is 2, btilly's answer is straight forward and easy to implement (also have a good performance on time)
Here is pseudo-code whose running time should be O(n * m * m) where n is the size of the set, and m is the size of the things in that set:
let counts be a hash mapping a pair of characters to a count
foreach number N in list:
foreach pair P of characters in N:
if exists counts[P]:
counts[P] = counts[P] + 1
else:
counts[P] = 1
let final be an array of (pair, count)
foreach P in keys of counts:
if 1 < counts[P]:
add (P, counts[P]) to final
sort final according to the counts
output final
#templatetypedef's answer is going to eventually be more efficient if you're looking for combinations of 3, 4, etc characters. But this should be fine for the stated problem.
You can view this problem as a string problem: given a collection of strings, return all substrings of the collection that appear at least k times. Fortunately, there's a polynomial-time algorithm for this problem That uses generalized suffix trees.
Start by constructing a generalized suffix tree for the string representations of your numbers, which takes time linear in the number of digits across all numbers. Then, do a DFS and annotate each node with the number of leaf nodes in its subtree (equivalently, the number of times the string represented by the node appears in the input set), and in the course of doing so output each string discovered this way to appear at least k times. The runtime for this operation is O(d + z), where d is the number of total digits in the input and z is the total number of digits produced as output.
Hope this helps!

Data structure that supports range based most frequently occuring element query

I'm looking for a data structure with which I can find the most frequently occuring number (among an array of numbers) in a given, variable range.
Let's consider the following 1 based array:
1 2 3 1 1 3 3 3 3 1 1 1 1
If I query the range (1,4), the data structure must retun 1, which occurs twice.
Several other examples:
(1,13) = 1
(4,9) = 3
(2,2) = 2
(1,3) = 1 (all of 1,2,3 occur once, so return the first/smallest one. not so important at the moment)
I have searched, but could not find anything similar. I'm looking (ideally) a data structure with minimal space requirement, fast preprocessing, and/or query complexities.
Thanks in advance!
Let N be the size of the array and M the number of different values in that array.
I'm considering two complexities : pre-processing and querying an interval of size n, each must be spacial and temporal.
Solution 1 :
Spacial : O(1) and O(M)
Temporal : O(1) and O(n + M)
No pre-processing, we look at all values of the interval and find the most frequent one.
Solution 2 :
Spacial : O(M*N) and O(1)
Temporal : O(M*N) and O(min(n,M))
For each position of the array, we have an accumulative array that gives us for each value x, how many times x is in the array before that position.
Given an interval we just need for each x to subtract 2 values to find the number of x in that interval. We iterate over each x and find the maximum value. If n < M we iterate over each value of the interval, otherwise we iterate over all possible values for x.
Solution 3 :
Spacial : O(N) and O(1)
Temporal : O(N) and O(min(n,M)*log(n))
For each value x build a binary heap of all the position in the array where x is present. The key in your heap is the position but you also store the total number of x between this position and the begin of the array.
Given an interval we just need for each x to subtract 2 values to find the number of x in that interval : in O(log(N)) we can ask the x's heap to find the two positions just before the start/end of the interval and substract the numbers. Basically it needs less space than a histogram but the query in now in O(log(N)).
You could create a binary partition tree where each node represents a histogram map of {value -> frequency} for a given range, and has two child nodes which represent the upper half and lower half of the range.
Querying is then just a case of recursively adding together a small number of these histograms to cover the range required, and scanning the resulting histogram once to find the highest occurrence count.
Useful optimizations include:
Using a histogram with mutable frequency counts as an "accumulator" while you add histograms together
Stop using precomputed histograms once you get down to a certain size (maybe a range less than the total number of possible values M) and just counting the numbers directly. It's a time/space trade-off that I think will pay off a lot of the time.
If you have a fixed small number of possible values, use an array rather than a map to store the frequency counts at each node
UPDATE: my thinking on algorithmic complexity assuming a bounded small number of possible values M and a total of N values in the complete range:
Preprocessing is O(N log N) - basically you need to traverse the complete list and build a binary tree, building one node for every M elements in order to amortise the overhead of each node
Querying is O(M log N) - basically adding up O(log N) histograms each of size M, plus counting O(M) values on either side of the range
Space requirement is O(N) - approx. 2N/M histograms each of size M. The 2 factor is the sum from having N/M histograms at the bottom level, 0.5N/M histograms at the next level, 0.25N/M at the third level etc...

Greatest GCD between some numbers

We've got some nonnegative numbers. We want to find the pair with maximum gcd. actually this maximum is more important than the pair!
For example if we have:
2 4 5 15
gcd(2,4)=2
gcd(2,5)=1
gcd(2,15)=1
gcd(4,5)=1
gcd(4,15)=1
gcd(5,15)=5
The answer is 5.
You can use the Euclidean Algorithm to find the GCD of two numbers.
while (b != 0)
{
int m = a % b;
a = b;
b = m;
}
return a;
If you want an alternative to the obvious algorithm, then assuming your numbers are in a bounded range, and you have plenty of memory, you can beat O(N^2) time, N being the number of values:
Create an array of a small integer type, indexes 1 to the max input. O(1)
For each value, increment the count of every element of the index which is a factor of the number (make sure you don't wraparound). O(N).
Starting at the end of the array, scan back until you find a value >= 2. O(1)
That tells you the max gcd, but doesn't tell you which pair produced it. For your example input, the computed array looks like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
4 2 1 1 2 0 0 0 0 0 0 0 0 0 1
I don't know whether this is actually any faster for the inputs you have to handle. The constant factors involved are large: the bound on your values and the time to factorise a value within that bound.
You don't have to factorise each value - you could use memoisation and/or a pregenerated list of primes. Which gives me the idea that if you are memoising the factorisation, you don't need the array:
Create an empty set of int, and a best-so-far value 1.
For each input integer:
if it's less than or equal to best-so-far, continue.
check whether it's in the set. If so, best-so-far = max(best-so-far, this-value), continue. If not:
add it to the set
repeat for all of its factors (larger than best-so-far).
Add/lookup in a set could be O(log N), although it depends what data structure you use. Each value has O(f(k)) factors, where k is the max value and I can't remember what the function f is...
The reason that you're finished with a value as soon as you encounter it in the set is that you've found a number which is a common factor of two input values. If you keep factorising, you'll only find smaller such numbers, which are not interesting.
I'm not quite sure what the best way is to repeat for the larger factors. I think in practice you might have to strike a balance: you don't want to do them quite in decreasing order because it's awkward to generate ordered factors, but you also don't want to actually find all the factors.
Even in the realms of O(N^2), you might be able to beat the use of the Euclidean algorithm:
Fully factorise each number, storing it as a sequence of exponents of primes (so for example 2 is {1}, 4 is {2}, 5 is {0, 0, 1}, 15 is {0, 1, 1}). Then you can calculate gcd(a,b) by taking the min value at each index and multiplying them back out. No idea whether this is faster than Euclid on average, but it might be. Obviously it uses a load more memory.
The optimisations I can think of is
1) start with the two biggest numbers since they are likely to have most prime factors and thus likely to have the most shared prime factors (and thus the highest GCD).
2) When calculating the GCDs of other pairs you can stop your Euclidean algorithm loop if you get below your current greatest GCD.
Off the top of my head I can't think of a way that you can work out the greatest GCD of a pair without trying to work out each pair individually (and optimise a bit as above).
Disclaimer: I've never looked at this problem before and the above is off the top of my head. There may be better ways and I may be wrong. I'm happy to discuss my thoughts in more length if anybody wants. :)
There is no O(n log n) solution to this problem in general. In fact, the worst case is O(n^2) in the number of items in the list. Consider the following set of numbers:
2^20 3^13 5^9 7^2*11^4 7^4*11^3
Only the GCD of the last two is greater than 1, but the only way to know that from looking at the GCDs is to try out every pair and notice that one of them is greater than 1.
So you're stuck with the boring brute-force try-every-pair approach, perhaps with a couple of clever optimizations to avoid doing needless work when you've already found a large GCD (while making sure that you don't miss anything).
With some constraints, e.g the numbers in the array are within a given range, say 1-1e7, it is doable in O(NlogN) / O(MAX * logMAX), where MAX is the maximum possible value in A.
Inspired from the sieve algorithm, and came across it in a Hackerrank Challenge -- there it is done for two arrays. Check their editorial.
find min(A) and max(A) - O(N)
create a binary mask, to mark which elements of A appear in the given range, for O(1) lookup; O(N) to build; O(MAX_RANGE) storage.
for every number a in the range (min(A), max(A)):
for aa = a; aa < max(A); aa += a:
if aa in A, increment a counter for aa, and compare it to current max_gcd, if counter >= 2 (i.e, you have two numbers divisible by aa);
store top two candidates for each GCD candidate.
could also ignore elements which are less than current max_gcd;
Previous answer:
Still O(N^2) -- sort the array; should eliminate some of the unnecessary comparisons;
max_gcd = 1
# assuming you want pairs of distinct elements.
sort(a) # assume in place
for ii = n - 1: -1 : 0 do
if a[ii] <= max_gcd
break
for jj = ii - 1 : -1 :0 do
if a[jj] <= max_gcd
break
current_gcd = GCD(a[ii], a[jj])
if current_gcd > max_gcd:
max_gcd = current_gcd
This should save some unnecessary computation.
There is a solution that would take O(n):
Let our numbers be a_i. First, calculate m=a_0*a_1*a_2*.... For each number a_i, calculate gcd(m/a_i, a_i). The number you are looking for is the maximum of these values.
I haven't proved that this is always true, but in your example, it works:
m=2*4*5*15=600,
max(gcd(m/2,2), gcd(m/4,4), gcd(m/5,5), gcd(m/15,15))=max(2, 2, 5, 5)=5
NOTE: This is not correct. If the number a_i has a factor p_j repeated twice, and if two other numbers also contain this factor, p_j, then you get the incorrect result p_j^2 insted of p_j. For example, for the set 3, 5, 15, 25, you get 25 as the answer instead of 5.
However, you can still use this to quickly filter out numbers. For example, in the above case, once you determine the 25, you can first do the exhaustive search for a_3=25 with gcd(a_3, a_i) to find the real maximum, 5, then filter out gcd(m/a_i, a_i), i!=3 which are less than or equal to 5 (in the example above, this filters out all others).
Added for clarification and justification:
To see why this should work, note that gcd(a_i, a_j) divides gcd(m/a_i, a_i) for all j!=i.
Let's call gcd(m/a_i, a_i) as g_i, and max(gcd(a_i, a_j),j=1..n, j!=i) as r_i. What I say above is g_i=x_i*r_i, and x_i is an integer. It is obvious that r_i <= g_i, so in n gcd operations, we get an upper bound for r_i for all i.
The above claim is not very obvious. Let's examine it a bit deeper to see why it is true: the gcd of a_i and a_j is the product of all prime factors that appear in both a_i and a_j (by definition). Now, multiply a_j with another number, b. The gcd of a_i and b*a_j is either equal to gcd(a_i, a_j), or is a multiple of it, because b*a_j contains all prime factors of a_j, and some more prime factors contributed by b, which may also be included in the factorization of a_i. In fact, gcd(a_i, b*a_j)=gcd(a_i/gcd(a_i, a_j), b)*gcd(a_i, a_j), I think. But I can't see a way to make use of this. :)
Anyhow, in our construction, m/a_i is simply a shortcut to calculate the product of all a_j, where j=1..1, j!=i. As a result, gcd(m/a_i, a_i) contains all gcd(a_i, a_j) as a factor. So, obviously, the maximum of these individual gcd results will divide g_i.
Now, the largest g_i is of particular interest to us: it is either the maximum gcd itself (if x_i is 1), or a good candidate for being one. To do that, we do another n-1 gcd operations, and calculate r_i explicitly. Then, we drop all g_j less than or equal to r_i as candidates. If we don't have any other candidate left, we are done. If not, we pick up the next largest g_k, and calculate r_k. If r_k <= r_i, we drop g_k, and repeat with another g_k'. If r_k > r_i, we filter out remaining g_j <= r_k, and repeat.
I think it is possible to construct a number set that will make this algorithm run in O(n^2) (if we fail to filter out anything), but on random number sets, I think it will quickly get rid of large chunks of candidates.
pseudocode
function getGcdMax(array[])
arrayUB=upperbound(array)
if (arrayUB<1)
error
pointerA=0
pointerB=1
gcdMax=0
do
gcdMax=MAX(gcdMax,gcd(array[pointera],array[pointerb]))
pointerB++
if (pointerB>arrayUB)
pointerA++
pointerB=pointerA+1
until (pointerB>arrayUB)
return gcdMax

Resources