A data structure for counting integers within some range?

Question:
Given n integers in the range [1, k], design an algorithm that preprocesses its input and then
answers any query about how many of the n integers have values between a and b, where 1 ≤ a, b ≤ k
are two given parameters. Your algorithm should use O(n + k) preprocessing time.

Your algorithm is reasonably good, but it can be made much faster. Specifically, your algorithm has O(1) preprocessing time, but then spends O(n) time per query because of the linear-time partitioning step.
Let's consider an alternative approach. Suppose that all of your values were in sorted order. In this case, you could find the number of elements in a range very quickly by just doing two binary searches - a first binary search to find the index of the lower bound, and a second search to find the upper bound - and could just subtract the indices. This would take time O(log n). If you can preprocess the input array to sort it in time O(n + k), then this approach will result in exponentially faster lookup times.
To do this sorting, as #minitech has pointed out, you can use the counting sort algorithm, which sorts in time O(n + k) for integers between 1 and k. Consequently, using both counting sort and the binary search together gives O(n + k) setup time and O(log n) query time.
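For concreteness, here is a minimal Python sketch of that combination; the helper names counting_sort and count_in_range are made up for illustration, and the two binary searches come from Python's bisect module:

    from bisect import bisect_left, bisect_right

    def counting_sort(values, k):
        # Histogram of how many times each value 1..k occurs: O(n + k).
        counts = [0] * (k + 1)
        for v in values:
            counts[v] += 1
        # Emit the values back in sorted order.
        out = []
        for value in range(1, k + 1):
            out.extend([value] * counts[value])
        return out

    def count_in_range(sorted_values, a, b):
        # Two binary searches: index of the first value >= a and of the first
        # value > b; their difference is the number of values in [a, b].
        return bisect_right(sorted_values, b) - bisect_left(sorted_values, a)

    data = [3, 7, 1, 3, 9, 2]
    sorted_data = counting_sort(data, k=10)   # O(n + k) preprocessing
    print(count_in_range(sorted_data, 2, 7))  # -> 4 (the values 2, 3, 3 and 7)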
If you are willing to trade memory for efficiency, though, you can speed this up even further. Let's suppose that k is a reasonably small number (say, not more than 100). Then if you are okay using O(k) space, you can answer these queries in O(1) time. The idea is as follows: build up a table of k entries where entry i records how many elements of the original array are less than or equal to i. If you have this table, you can find the total number of elements in some subrange [a, b] by looking up how many elements are at most b and how many elements are at most a - 1 (each in O(1) time), then subtracting them.
Of course, to do this, you have to actually build up this table in time O(n + k). This can be done as follows. First, create an array of k entries, then iterate across the original n-element array and, for each element, increment the entry of the table corresponding to that value. When you're done (in time O(n + k)), you will have filled in this table with the number of times each value in the range 1 to k appears in the original array (this is, incidentally, how counting sort works). Next, create a second table of k entries that will hold the cumulative frequencies. Iterate across the histogram you built in the first step, and fill in the cumulative table with the running total of elements encountered so far as you walk across the histogram. This last step takes time O(k), for a grand total of O(n + k) for setup. You can now answer queries in O(1) time.
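A rough sketch of those two passes and the resulting O(1) query might look like this (the function names are invented for illustration):

    def build_cumulative_table(values, k):
        # Pass 1: histogram of the values 1..k, O(n + k).
        freq = [0] * (k + 1)
        for v in values:
            freq[v] += 1
        # Pass 2: cumulative[i] = number of elements <= i, O(k).
        cumulative = [0] * (k + 1)
        running = 0
        for i in range(1, k + 1):
            running += freq[i]
            cumulative[i] = running
        return cumulative

    def count_in_range(cumulative, a, b):
        # Elements <= b minus elements <= a - 1, each an O(1) lookup.
        return cumulative[b] - cumulative[a - 1]

    table = build_cumulative_table([3, 7, 1, 3, 9, 2], k=10)
    print(count_in_range(table, 2, 7))  # -> 4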
Hope this helps!

Here is another simple algorithm:
First allocate an array A of size k, then iterate over the n elements and for each integer x increment A[x] by one. This will take O(n) time.
Then compute the prefix sums of array A and store them as array B. This will take O(k) time.
Now, for any query (a, b), you can simply return: B[b] - B[a] + A[a]
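A compact sketch of this answer, with A and B allocated with k + 1 slots so that values can be used as indices directly:

    def preprocess(values, k):
        A = [0] * (k + 1)              # A[x] = number of occurrences of x, O(n)
        for x in values:
            A[x] += 1
        B = [0] * (k + 1)              # B[x] = prefix sum = count of elements <= x, O(k)
        for x in range(1, k + 1):
            B[x] = B[x - 1] + A[x]
        return A, B

    def query(A, B, a, b):
        # Count of elements in (a, b] plus the occurrences of a itself.
        return B[b] - B[a] + A[a]

    A, B = preprocess([3, 7, 1, 3, 9, 2], k=10)
    print(query(A, B, 2, 7))  # -> 4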

Related

Repetition detection in O(log n) in a sorted array

Given a sorted integer array A of size n, where n is a multiple of 4, could someone help me find an algorithm that decides whether or not there exists an element that repeats at least n/4 times in the array, in O(log n) time?
If there is an element that repeats at least n/4 times, it must also occupy one of the following (1-based) indices: n/4, 2n/4, 3n/4, n.
For each of these elements, do two binary searches to find the first index it occupies and the last one.
This totals 4 * 2 = 8 binary searches, each taking O(log n) time, which gives a total run time of O(8 log n) = O(log n).
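A short sketch of this check; it uses 0-based indexing, so the candidate positions n/4, 2n/4, 3n/4, n become n/4 - 1, n/2 - 1, 3n/4 - 1 and n - 1:

    from bisect import bisect_left, bisect_right

    def has_element_repeating_quarter(a):
        # a is sorted and len(a) is a multiple of 4.
        n = len(a)
        for idx in (n // 4 - 1, n // 2 - 1, 3 * n // 4 - 1, n - 1):
            candidate = a[idx]
            first = bisect_left(a, candidate)    # first occurrence, O(log n)
            last = bisect_right(a, candidate)    # one past last occurrence, O(log n)
            if last - first >= n // 4:
                return True
        return False

    print(has_element_repeating_quarter([1, 2, 2, 2, 3, 4, 5, 6]))
    # -> True (the value 2 appears 3 >= n/4 = 2 times)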

Radix sort explanation

Based on this radix sort article http://www.geeksforgeeks.org/radix-sort/ I'm struggling to understand what is being explained in terms of the time complexity of certain methods in the sort.
From the link:
Let there be d digits in the input integers. Radix Sort takes O(d * (n + b)) time, where b is the base for representing numbers; for example, for the decimal system, b is 10. What is the value of d? If k is the maximum possible value, then d would be O(log_b(k)). So the overall time complexity is O((n + b) * log_b(k)), which looks like more than the time complexity of comparison-based sorting algorithms for a large k. Let us first limit k. Let k ≤ n^c where c is a constant. In that case, the complexity becomes O(n log_b(n)).
So I do understand that the sort takes O(d*n) since there are d digits therefore d passes, and you have to process all n elements, but I lost it from there. A simple explanation would be really helpful.
Assuming we use bucket sort for the sorting on each digit: for each digit (d), we process all numbers (n), placing them in buckets for all possible values a digit may have (b).
We then need to process all the buckets, recreating the original list. Placing all items in the buckets takes O(n) time, recreating the list from all the buckets takes O(n + b) time (we have to iterate over all buckets and all elements inside them), and we do this for all digits, giving a running time of O(d * (n + b)).
This is only linear if d is a constant and b is not asymptotically larger than n. So indeed, if you have numbers of log n bits, it will take O(n log n) time.
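For reference, here is a small LSD radix sort sketch with a configurable base b (written for non-negative integers); each pass distributes the numbers into b buckets and collects them back, mirroring the O(d * (n + b)) analysis:

    def radix_sort(nums, base=10):
        # Sort non-negative integers with one stable bucket pass per digit.
        if not nums:
            return nums
        max_val = max(nums)
        exp = 1                        # current digit's place value: 1, b, b^2, ...
        while max_val // exp > 0:      # d = O(log_b(max_val)) passes
            buckets = [[] for _ in range(base)]
            for num in nums:                                # distribute: O(n)
                buckets[(num // exp) % base].append(num)
            nums = [num for bucket in buckets for num in bucket]   # collect: O(n + b)
            exp *= base
        return nums

    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
    # -> [2, 24, 45, 66, 75, 90, 170, 802]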

Sorting m sets of total O(n) elements in O(n)

Suppose we have m sets S1, S2, ..., Sm of elements from {1, ..., n}.
Given that m = O(n) and |S1| + |S2| + ... + |Sm| = O(n),
sort all the sets in O(n) time and O(n) space.
I was thinking of using the counting sort algorithm on each set.
Counting sort on each set will be O(|S1|) + O(|S2|) + ... + O(|Sm|) < O(n),
and because of that, even in the worst case where one set consists of n elements, it will still take O(n).
But will it solve the problem, and does it still hold that it uses only O(n) space?
Your approach won't necessarily work in O(n) time. Imagine you have n sets of one element each, where each set just holds the value n. Then each iteration of counting sort will take time Θ(n) to complete, so the total runtime will be Θ(n^2).
However, you can use a modified counting sort to solve this by effectively doing counting sort on all sets at the same time. Create an array of length n that stores lists of numbers. Then, iterate over all the sets and for each element, if the value is k and the set number is r, append the number r to array k. This process essentially builds up a histogram of the distribution of the elements in the sets, where each element is annotated with the set that it came from. Then, iterate over the arrays and reconstruct the sets in sorted order using logic similar to counting sort.
Overall, this algorithm takes time Θ(n), since it takes time Θ(n) to initialize the array, O(n) total time to distribute the elements, and O(n) time to write them back. It also uses only Θ(n) space, since there are n total arrays and across all the arrays there are a total of n elements distributed.
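A rough sketch of this simultaneous counting sort, with the sets given as lists and identified by their position in the input (the function name is made up):

    def sort_all_sets(sets, n):
        # buckets[value] holds the set numbers that contain that value: Theta(n) space.
        buckets = [[] for _ in range(n + 1)]
        for r, s in enumerate(sets):            # distribute all elements: O(n) total
            for value in s:
                buckets[value].append(r)
        result = [[] for _ in sets]
        for value in range(1, n + 1):           # write back in sorted order: O(n) total
            for r in buckets[value]:
                result[r].append(value)
        return result

    print(sort_all_sets([[5, 2, 9], [3, 1], [7, 4]], n=9))
    # -> [[2, 5, 9], [1, 3], [4, 7]]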
Hope this helps!

Prepare array in linear time to find k smallest elements in O(k)

This is an interesting question I have found on the web. Given an array containing n numbers (with no information about them), we should pre-process the array in linear time so that we can return the k smallest elements in O(k) time, when we are given a number 1 <= k <= n
I have been discussing this problem with some friends but no one could find a solution; any help would be appreciated!
For the pre-processing step, we will use the partition-based selection several times on the same data set.
Find the n/2-th smallest number with the selection algorithm; now the dataset is partitioned into two halves, lower and upper. On the lower half, find the midpoint again. On its lower partition, do the same thing, and so on. Overall this is O(n) + O(n/2) + O(n/4) + ... = O(n).
Now when you have to return the k smallest elements, search for the nearest partition boundary x ≤ k. Everything below it can be returned, and from the next partition you have to return k - x numbers. Since the next partition's size is O(k), running another selection algorithm for the (k - x)-th number there will return the rest.
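One possible sketch of that scheme is below. It uses a random pivot for the selection step, so the bounds hold in expectation; swapping in a median-of-medians pivot would make them worst-case, as the answer assumes. All names are made up for illustration:

    import random

    def select_prefix(a, lo, hi, rank):
        # Rearrange a[lo:hi] so that its `rank` smallest values occupy a[lo:lo+rank].
        # A random pivot gives expected linear time in (hi - lo); a median-of-medians
        # pivot would make this worst-case linear.
        if rank <= 0 or rank >= hi - lo:
            return
        pivot = a[random.randrange(lo, hi)]
        left = [x for x in a[lo:hi] if x < pivot]
        mid = [x for x in a[lo:hi] if x == pivot]
        right = [x for x in a[lo:hi] if x > pivot]
        a[lo:hi] = left + mid + right
        if rank < len(left):
            select_prefix(a, lo, lo + len(left), rank)
        elif rank > len(left) + len(mid):
            select_prefix(a, lo + len(left) + len(mid), hi, rank - len(left) - len(mid))

    def preprocess(a):
        # Partition at n/2, then n/4, then n/8, ...; total O(n) + O(n/2) + ... = O(n).
        boundaries = []
        hi = len(a)
        while hi > 1:
            mid = hi // 2
            select_prefix(a, 0, hi, mid)
            boundaries.append(mid)
            hi = mid
        return boundaries               # decreasing, e.g. [n/2, n/4, ..., 1]

    def k_smallest(a, boundaries, k):
        if k >= len(a):
            return list(a)
        upper = len(a)
        for x in boundaries:            # find the nearest boundary x <= k;
            if x <= k:                  # `upper` ends up as the boundary just above it
                break
            upper = x
        select_prefix(a, x, upper, k - x)   # the block a[x:upper] has size O(k)
        return a[:k]

    a = [9, 1, 8, 4, 7, 3, 6, 2, 5, 0]
    bounds = preprocess(a)
    print(sorted(k_smallest(a, bounds, 4)))   # -> [0, 1, 2, 3]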
We can find the median of a list and partition around it in linear time.
Then we can use the following algorithm: maintain a buffer of size 2k.
Every time the buffer gets full, we find the median and partition around it, keeping only the lowest k elements.
This requires n/k find-median-and-partition steps, each of which takes O(k) time with a traditional quickselect, so this approach requires only O(n) time.
If you additionally need the output sorted, sort the final buffer of k elements, which adds an additional O(k log k) time. In total, this approach requires only O(n + k log k) time and O(k) space.
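A rough sketch of the buffer idea; here a random-pivot selection stands in for the linear-time median-and-partition step, so each flush is expected O(k) rather than worst-case:

    import random

    def nsmallest(values, k):
        # The k smallest values (unordered) via random-pivot partitioning,
        # expected linear time in len(values).
        if k <= 0:
            return []
        if k >= len(values):
            return list(values)
        pivot = random.choice(values)
        less = [x for x in values if x < pivot]
        equal = [x for x in values if x == pivot]
        greater = [x for x in values if x > pivot]
        if k <= len(less):
            return nsmallest(less, k)
        if k <= len(less) + len(equal):
            return less + equal[:k - len(less)]
        return less + equal + nsmallest(greater, k - len(less) - len(equal))

    def k_smallest_stream(data, k):
        # Buffer of at most 2k candidates; whenever it fills up, shrink it back
        # to the k smallest seen so far: about n/k shrink steps of expected O(k)
        # each, so expected O(n) overall, using only O(k) extra space.
        buffer = []
        for x in data:
            buffer.append(x)
            if len(buffer) >= 2 * k:
                buffer = nsmallest(buffer, k)
        return sorted(nsmallest(buffer, k))   # final sort adds O(k log k)

    print(k_smallest_stream([9, 1, 8, 4, 7, 3, 6, 2, 5, 0], 3))  # -> [0, 1, 2]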

Sorting problem - n/k intervals of size k each

Given is an array of size n which was divided to n/k intervals of size k each. The values in each interval are bigger than the ones in the interval to its left and smaller than the ones in the interval to its right. I want to sort those values in the minimum time that I can.
The naive solution that I thought of is just to sort all the values in each interval which will "cost" O(k log k), for a total cost for all the n/k intervals of O(n log k). I wonder if there's something more efficient.
Now, given that each interval has no more than log log k different values, I need to come up with a quicker algorithm. I'd love your help with this.
Thanks!
Here's an extremely ugly answer:
1. Take the first interval;
2. Since log log k should be small, we allocate log log k binary tree nodes, and we place the first element in the middle;
3. For each of the remaining elements, we use a method similar to binary search to check whether its value is already in the tree, and if not, we add it;
4. Produce a sorted list with all the values in the interval;
5. Use Counting Sort with this list on the interval;
6. Do this for all the intervals.
The time used for steps 2 and 3 is O(k * log log log k), since each search takes at most log log log k time (a binary search over the log log k values) and is repeated k times. Step 4 uses at most O(log log k) time to walk through all the nodes with values. Step 5 takes O(k) time, similar to counting sort. So the total time should be O(n * log log log k).
Any questions are welcome, since I am really sleepy and cannot guarantee that I am thinking straight.
You could use counting sort or bucket sort on each interval, costing O(k) for each one, for a total cost of O(n/k * k) = O(n).
Then merge the intervals back together, costing O(n) in total. Your algorithm would then be an O(n) + O(n) = O(n) algorithm.
Note: if you could take advantage of parallelism, you could sort all of the intervals in parallel for a total cost of O(k). Although your algorithm would still be O(n) (because of the merge), it will have smaller constant factors.
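A rough (sequential) sketch of the per-interval idea, exploiting the fact that each block has only a handful of distinct values; the function name is made up:

    def sort_interval_array(a, k):
        # The array is n/k blocks of size k; values only need sorting within
        # each block. Each block has few distinct values, so count them with a
        # dict, sort the small distinct set, and rewrite the block in place.
        n = len(a)
        for start in range(0, n, k):
            counts = {}
            for x in a[start:start + k]:
                counts[x] = counts.get(x, 0) + 1
            i = start
            for value in sorted(counts):        # only the distinct values get sorted
                for _ in range(counts[value]):
                    a[i] = value
                    i += 1
        return a

    print(sort_interval_array([2, 1, 2, 1, 5, 4, 5, 4, 9, 8, 9, 8], k=4))
    # -> [1, 1, 2, 2, 4, 4, 5, 5, 8, 8, 9, 9]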
