Question about the number of zeros in the interval - algorithm

There is a task where a sequence of numbers is given first, followed by several queries. Each query specifies an interval, and we must report how many zeros it contains. There is a simple and efficient algorithm that uses prefix sums, but is there something better? Sorry if this is a stupid question from a newbie. (The improvement could be, for example, in speed, in memory, or in how easily the original sequence can be modified.)

If you can afford linear space, you can create a second array of the same size where the i'th entry has the number of zeros up to (and including) i in the original array. Then the query to get zeroes from j to k inclusive is arr[k] - arr[j-1] if j > 0, or simply arr[k] if j is zero.
This may be the simple and efficient approach you allude to. Each query is O(1), and the up-front work is O(n) time & O(n) space.
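For example, a minimal sketch of this prefix-count approach (array and method names are my own):

static int[] buildZeroCounts(int[] a) {
    // zerosUpTo[i] = number of zeros in a[0..i], built in one O(n) pass
    int[] zerosUpTo = new int[a.length];
    for (int i = 0; i < a.length; i++) {
        zerosUpTo[i] = (i > 0 ? zerosUpTo[i - 1] : 0) + (a[i] == 0 ? 1 : 0);
    }
    return zerosUpTo;
}

// Number of zeros in a[j..k], inclusive, answered in O(1)
static int zerosInRange(int[] zerosUpTo, int j, int k) {
    return zerosUpTo[k] - (j > 0 ? zerosUpTo[j - 1] : 0);
}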
Potential improvements:
If you don't care about preserving the input array, you could save space by overwriting it with the count of zeros array.
If the size of the input is large relative to (average query size) * (number of queries), your total time will be less if you don't precompute anything and instead just scan the portion of the input being queried on each query. Total time will be improved, but per-query time will be slower.
If the zeros are distributed according to some known formula, e.g., every even index has a zero, you can calculate the answer to a query in O(1) with no precomputation.

Related

How can a HashSet offer constant time add operation?

I was reading the javadocs on HashSet when I came across the interesting statement:
This class offers constant time performance for the basic operations (add, remove, contains and size)
This confuses me greatly, as I don't understand how one could possibly get constant time, O(1), performance for a comparison operation. Here are my thoughts:
If this is true, then no matter how much data I'm dumping into my HashSet, I will be able to access any element in constant time. That is, if I put 1 element in my HashSet, it will take the same amount of time to find it as if I had a googolplex of elements.
However, this wouldn't be possible if I had a constant number of buckets, or a consistent hash function, since for any fixed number of buckets, the number of elements in that bucket will grow linearly (albeit slowly, if the number is big enough) with the number of elements in the set.
Then, the only way for this to work is to have a changing hash function every time you insert an element (or every few times). A simple hash function that never has any collisions would satisfy this need. One toy example for strings could be: take the ASCII values of the string's characters and concatenate them together (because adding them could result in a conflict).
However, this hash function, and any other hash function of this sort will likely fail for large enough strings or numbers etc. The number of buckets that you can form is immediately limited by the amount of stack/heap space you have, etc. Thus, skipping locations in memory can't be allowed indefinitely, so you'll eventually have to fill in the gaps.
But if at some point there's a recalculation of the hash function, this can only be as fast as finding a polynomial which passes through N points, or O(n log n).
Thus arrives my confusion. While I will believe that the HashSet can access elements in O(n/B) time, where B is the number of buckets it has decided to use, I don't see how a HashSet could possibly perform add or get functions in O(1) time.
Note: This post and this post both don't address the concerns I listed.
The number of buckets is dynamic, and is approximately 2n, where n is the number of elements in the set.
Note that HashSet gives amortized and average time performance of O(1), not worst case. This means we may suffer an O(n) operation from time to time.
So, when the bins are too packed up, we just create a new, bigger array and copy the elements to it.
This costs n operations, and is done when the number of elements in the set exceeds half the number of buckets, i.e. 2n/2 = n, so the average cost of this operation is bounded by n/n = 1, which is a constant.
Additionally, the number of collisions a HashMap incurs is also constant on average.
Assume you are adding an element x. The probability that the bucket h(x) is already occupied by one element is ~n/2n = 1/2. The probability that it is already occupied by two elements is ~(n/2n)^2 = 1/4 (for large values of n), and so on.
This gives you an average running time of 1 + 1/2 + 1/4 + 1/8 + .... Since this sum converges to 2, it means this operation takes constant time on average.
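To make the doubling argument concrete, here is a toy, hedged sketch (my own class, not java.util's actual implementation) of a chained hash set that doubles its table when the load factor passes 1/2; the O(n) rehash happens only after roughly n cheap adds, so the amortized cost per add stays constant:

import java.util.LinkedList;

class ToyHashSet {
    private LinkedList<Integer>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    ToyHashSet() { buckets = new LinkedList[8]; }

    boolean contains(int x) {
        int i = indexFor(x, buckets.length);
        return buckets[i] != null && buckets[i].contains(x);   // expected O(1): chains stay short
    }

    boolean add(int x) {
        if (contains(x)) return false;
        if (size + 1 > buckets.length / 2) resize();           // keep load factor <= 1/2
        insert(x, buckets);
        size++;
        return true;
    }

    private static int indexFor(int x, int capacity) {
        return Math.floorMod(Integer.hashCode(x), capacity);
    }

    private static void insert(int x, LinkedList<Integer>[] table) {
        int i = indexFor(x, table.length);
        if (table[i] == null) table[i] = new LinkedList<>();
        table[i].add(x);
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<Integer>[] bigger = new LinkedList[buckets.length * 2];
        for (LinkedList<Integer> chain : buckets) {            // O(n) rehash, performed rarely
            if (chain == null) continue;
            for (int x : chain) insert(x, bigger);
        }
        buckets = bigger;
    }
}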
What I know about hashed structures is that to keep O(1) complexity for insertion and removal, you need a good hash function to avoid collisions, and the structure should not be full (if the structure is full you will have collisions).
Normally hashed structures define a kind of fill limit, for example 70%.
When the number of objects fills the structure beyond this limit, you should extend its size to stay below the limit and guarantee performance. Generally you double the size of the structure when reaching the limit, so that the structure size grows faster than the number of elements and the number of resize/maintenance operations to perform is reduced.
This is a kind of maintenance operation that consists of rehashing all elements contained in the structure to redistribute them in the resized structure. This certainly has a cost, whose complexity is O(n) with n the number of elements stored in the structure, but this cost is not attributed to the single add call that happens to trigger the maintenance operation.
I think this is what disturbs you.
I also learned that the hash function generally depends on the size of the structure, which is used as a parameter (there was something like choosing the capacity as a prime number to reduce the probability of collisions, or something like that), meaning that you don't change the hash function itself, you just change one of its parameters.
To answer your comment: there is no guarantee that, if buckets 0 or 1 were filled, then after resizing to 4 buckets the elements will go into buckets 2 and 3. Perhaps resizing to 4 makes elements A and B end up in buckets 0 and 3.
Of course, all of the above is theoretical, and in real life you don't have infinite memory, you can have collisions, maintenance has a cost, etc. That's why you need to have an idea of the number of objects you will store and trade that off against the available memory, trying to choose an initial size for the hashed structure that limits the need for maintenance operations and lets you stay at O(1) performance.
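As a small, hedged illustration of that resizing point (the hash codes here are arbitrary, made-up values): the bucket index depends on the current capacity, so doubling the table redistributes elements in a way the old layout does not predict.

public class ResizeDemo {
    static int bucket(int hash, int capacity) {
        return Math.floorMod(hash, capacity);   // typical index calculation: hash mod table size
    }

    public static void main(String[] args) {
        int hashA = 4, hashB = 3;               // hypothetical hash codes for elements A and B
        System.out.println(bucket(hashA, 2) + " " + bucket(hashB, 2)); // with 2 buckets: 0 1
        System.out.println(bucket(hashA, 4) + " " + bucket(hashB, 4)); // with 4 buckets: 0 3
    }
}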

The space complexity of this algorithm that loops over an array

I am asked to provide the asymptotic space and time complexity of the below algorithm in appropriate terms for an initial input number of arbitrary length (as distinct from a constant 12 digits).
for i = 2 to 12
    if d[i] == d[i-1]
        d[i] = (d[i] + 3) mod 10
This algorithm is applied to a number that has 12 digits; each digit is stored in the array d[] (so we have d[1], d[2], ..., d[12]).
I figured out the time complexity is O(n) but how do I determine the space complexity?
Typically, to measure space complexity, you want to look at how much extra space is required to hold all the information necessary to execute the code. The challenge is that you can often reuse space across the lifetime of the code execution.
In this case, the code needs extra space to store
The values of temporary calculations in the loop,
The value of i, and
Miscellaneous data like the program counter, etc.
The last two of these take up O(1) space, since there's only one i and constant space for auxiliary data like the stack pointer, etc. So what about the first? Well, each iteration of the loop will need O(1) space for temporary variables, but notice that this space can get reused because after each loop iteration finishes, the space for those temporaries isn't needed anymore and can be reused in the next iteration. Therefore, the total space needed is just O(1).
(A note... are you sure the time complexity is O(n)? Notice that the number of iterations is a constant regardless of how large the array is.)
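(For what it's worth, a minimal sketch of the loop generalized to an input of arbitrary length, using 0-based indexing and a method name of my own, makes both bounds visible: the running time is O(n) in the number of digits, and the only extra storage is the loop index, so auxiliary space is O(1).)

static void scramble(int[] d) {
    // d holds the digits of the input number; n = d.length is arbitrary
    for (int i = 1; i < d.length; i++) {
        if (d[i] == d[i - 1]) {
            d[i] = (d[i] + 3) % 10;   // same rule as the original 12-digit loop
        }
    }
}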
Hope this helps!

Range query for a semigroup operator (union)

I'm looking to implement an algorithm, which is given an array of integers and a list of ranges (intervals) in that array, returns the number of distinct elements in each interval. That is, given the array A and a range [i,j] returns the size of the set {A[i],A[i+1],...,A[j]}.
Obviously, the naive approach (iterate from i to j and count, ignoring duplicates) is too slow. Range-sum seems inapplicable, since (A ∪ B) − B isn't always equal to A.
I've looked up Range Queries in Wikipedia, and it hints that Yao (in '82) showed an algorithm that does this for semigroup operators (which union seems to be) with linear preprocessing time and space and almost constant query time. The article, unfortunately, is not available freely.
Edit: it appears this exact problem is available at http://www.spoj.com/problems/DQUERY/
There's a rather simple algorithm which uses O(N log N) time and space for preprocessing and O(log N) time per query. First, create a persistent segment tree for answering range sum queries (initially, it should contain zeroes at all positions). Then iterate through the elements of the given array and store the latest position of each number. At each iteration create a new version of the persistent segment tree, putting 1 at the latest position of each element (at each iteration the position of only one element is updated, so only one position's value in the segment tree changes and the update can be done in O(log N)). To answer a query (l, r), you just need to find the sum on the (l, r) segment in the version of the tree that was created when iterating through the r-th element of the initial array.
Hope this algorithm is fast enough.
Upd. There's a little mistake in my explanation: at each step, at most two positions' values in the segment tree might change (because it's also necessary to put 0 at the previous latest position of a number when it's updated). However, this doesn't change the complexity.
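A hedged sketch of that idea (class name, field names, and node-count sizing are my own guesses, not taken from the DQUERY problem; it assumes a non-empty array and 1-based, inclusive query indices):

import java.util.HashMap;
import java.util.Map;

class DistinctQueries {
    private final int n;
    private final int[] lc, rc, sum;   // children and subtree sums; node 0 is the empty sentinel
    private int cnt = 0;
    private final int[] roots;         // roots[i] = tree version after the first i elements

    DistinctQueries(int[] a) {         // assumes a.length >= 1
        n = a.length;
        int depth = 32 - Integer.numberOfLeadingZeros(n);
        int maxNodes = 2 * n + 2 * n * (depth + 2) + 4;   // full build + two O(log n) updates per element
        lc = new int[maxNodes]; rc = new int[maxNodes]; sum = new int[maxNodes];
        roots = new int[n + 1];
        roots[0] = build(1, n);
        Map<Integer, Integer> lastPos = new HashMap<>();
        for (int i = 1; i <= n; i++) {
            int root = roots[i - 1];
            Integer prev = lastPos.get(a[i - 1]);
            if (prev != null) root = update(root, 1, n, prev, -1);   // clear the old latest position
            root = update(root, 1, n, i, +1);                        // mark the new latest position
            lastPos.put(a[i - 1], i);
            roots[i] = root;
        }
    }

    // Number of distinct values in a[l..r], 1-based and inclusive.
    int distinct(int l, int r) { return query(roots[r], 1, n, l, r); }

    private int node(int l, int r, int s) { ++cnt; lc[cnt] = l; rc[cnt] = r; sum[cnt] = s; return cnt; }

    private int build(int s, int e) {
        if (s == e) return node(0, 0, 0);
        int m = (s + e) >>> 1;
        return node(build(s, m), build(m + 1, e), 0);
    }

    private int update(int prev, int s, int e, int pos, int delta) {
        if (s == e) return node(0, 0, sum[prev] + delta);
        int m = (s + e) >>> 1;
        if (pos <= m) { int l = update(lc[prev], s, m, pos, delta); return node(l, rc[prev], sum[l] + sum[rc[prev]]); }
        int r = update(rc[prev], m + 1, e, pos, delta);
        return node(lc[prev], r, sum[lc[prev]] + sum[r]);
    }

    private int query(int nd, int s, int e, int l, int r) {
        if (nd == 0 || r < s || e < l) return 0;
        if (l <= s && e <= r) return sum[nd];
        int m = (s + e) >>> 1;
        return query(lc[nd], s, m, l, r) + query(rc[nd], m + 1, e, l, r);
    }
}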
You can answer any of your queries in constant time by performing a quadratic-time precomputation:
For every i from 0 to n-1
    S <- new empty set backed by hashtable;
    C <- 0;
    For every j from i to n-1
        If A[j] does not belong to S, increment C and add A[j] to S.
        Store C as the answer for the query associated with interval i..j.
This algorithm takes quadratic time since for each interval we perform a bounded number of operations, each one taking constant time (note that the set S is backed by a hashtable), and there's a quadratic number of intervals.
If you don't have additional information about the queries (total number of queries, distribution of intervals), you cannot do essentially better, since the total number of intervals is already quadratic.
You can trade the quadratic precomputation for n linear on-the-fly computations: after receiving a query of the form A[i..j], precompute (in O(n) time) the answer for all intervals A[i..k], k >= i. This guarantees that the amortized complexity remains quadratic, and you are not forced to perform the complete quadratic precomputation at the beginning.
Note that the obvious algorithm (the one you call obvious in the statement) is cubic, since you scan every interval completely.
Here is another approach which might be quite closely related to the segment tree. Think of the elements of the array as leaves of a full binary tree. If there are 2^n elements in the array there are n levels of that full tree. At each internal node of the tree store the union of the points that lie in the leaves beneath it. Each number in the array needs to appear once in each level (less if there are duplicates). So the cost in space is a factor of log n.
Consider a range A..B of length K. You can work out the union of points in this range by forming the union of the sets associated with leaves and nodes, picking nodes as high up the tree as possible, as long as the subtree beneath those nodes is entirely contained in the range. If you step along the range picking subtrees that are as big as possible, you will find that the size of the subtrees first increases and then decreases, and that the number of subtrees required grows only with the logarithm of the size of the range: at the start, if you can only take a subtree of size 2^k, it will end on a boundary divisible by 2^(k+1), and you will have the chance of taking a subtree of size at least 2^(k+1) as the next step if your range is big enough.
So the number of semigroup operations required to answer a query is O(log n) - but note that the semigroup operations may be expensive as you may be forming the union of two large sets.
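A hedged sketch of this tree-of-unions idea (class and field names are mine; it stores a HashSet at every node, so memory is on the order of n log n set entries, and the unions at query time can dominate the cost, as noted above):

import java.util.HashSet;
import java.util.Set;

class DistinctSegTree {
    private final Set<Integer>[] node;   // node[i] = distinct values under node i
    private final int n;

    @SuppressWarnings("unchecked")
    DistinctSegTree(int[] a) {
        n = a.length;
        node = new Set[2 * n];
        for (int i = 0; i < n; i++) {                 // leaves
            node[n + i] = new HashSet<>();
            node[n + i].add(a[i]);
        }
        for (int i = n - 1; i >= 1; i--) {            // internal nodes, built bottom-up
            node[i] = new HashSet<>(node[2 * i]);
            node[i].addAll(node[2 * i + 1]);
        }
    }

    // Number of distinct values in a[l..r], 0-based and inclusive:
    // union the O(log n) canonical node sets covering the range.
    int distinct(int l, int r) {
        Set<Integer> acc = new HashSet<>();
        for (l += n, r += n + 1; l < r; l >>= 1, r >>= 1) {
            if ((l & 1) == 1) acc.addAll(node[l++]);
            if ((r & 1) == 1) acc.addAll(node[--r]);
        }
        return acc.size();
    }
}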

Find the 10,000 largest out of 1,000,000 total values

I have a file that has 1,000,000 float values in it. I need to find the 10,000 largest values.
I was thinking of:
Reading the file
Converting the strings to floats
Placing the floats into a max-heap (a heap where the largest value is the root)
After all values are in the heap, removing the root 10,000 times and adding those values to a list/arraylist.
I know I will have
1,000,000 inserts into the heap
10,000 removals from the heap
10,000 inserts into the return list
Would this be a good solution? This is for a homework assignment.
Your solution is mostly good. It's basically a heapsort that stops after getting K elements, which improves the running time from O(N log N) (for a full sort) to O(N + K log N). Here N = 1000000 and K = 10000.
However, you should not do N inserts to the heap initially, as this would take O(NlogN) - instead, use a heapify operation which turns an array to a heap in linear time.
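A hedged sketch of that heapify-then-pop route (plain array max-heap; the class and helper names are mine):

import java.util.Arrays;

public class TopK {
    // Build a max-heap in place in O(N) (bottom-up heapify), then pop the root K times.
    static float[] topK(float[] a, int k) {
        int size = a.length;
        for (int i = size / 2 - 1; i >= 0; i--) siftDown(a, i, size);   // O(N) heapify
        float[] result = new float[k];
        for (int j = 0; j < k; j++) {
            result[j] = a[0];            // current maximum
            a[0] = a[--size];            // move the last element to the root
            siftDown(a, 0, size);        // restore the heap property in O(log N)
        }
        return result;                   // the K largest values, in descending order
    }

    static void siftDown(float[] a, int i, int size) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, largest = i;
            if (left < size && a[left] > a[largest]) largest = left;
            if (right < size && a[right] > a[largest]) largest = right;
            if (largest == i) return;
            float tmp = a[i]; a[i] = a[largest]; a[largest] = tmp;
            i = largest;
        }
    }

    public static void main(String[] args) {
        float[] data = {3.1f, 7.4f, 1.0f, 9.9f, 5.5f, 2.2f};
        System.out.println(Arrays.toString(topK(data, 3)));   // [9.9, 7.4, 5.5]
    }
}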
If the K numbers don't need to be sorted, you can find the Kth largest number in linear time using a selection algorithm, and then output all numbers larger than it. This gives an O(n) solution.
How about using mergesort (O(n log n) comparisons in the worst case) to sort the 1,000,000 values into an array and then take the last 10,000 directly?
Sorting is expensive, and your input set is not small. Fortunately, you don't care about order. All you need is to know that you have the top X numbers. So, don't sort.
How would you do this problem if, instead of looking for the top 10,000 out of 1,000,000, you were looking for the top 1 (i.e. the single largest value) out of 100? You'd only need to keep track of the largest value you'd seen so far, and compare it to the next number and the next one until you found a larger one or you ran out of input. Could you expand that idea back out to the input size you're looking at? What would be the big-O (hint: you'd only be looking at each input number one time)?
Final note since you said this was homework: if you've just been learning about heaps in class, and you think your teacher/professor is looking for a heap solution, then yes, your idea is good.
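For reference, one way to expand that hint into code is a bounded min-heap (a hedged sketch; the method name is mine): scan each value once, keep only the best K seen so far, for O(N log K) time and O(K) memory.

import java.util.PriorityQueue;

static PriorityQueue<Float> topKStreaming(Iterable<Float> values, int k) {
    PriorityQueue<Float> best = new PriorityQueue<>();   // min-heap: smallest kept value on top
    for (float v : values) {
        if (best.size() < k) {
            best.add(v);
        } else if (v > best.peek()) {
            best.poll();                                  // drop the smallest kept value
            best.add(v);
        }
    }
    return best;                                          // the K largest values, in no particular order
}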
Could you merge sort the values in the array after you have read them all in? This is a fast way to sort the values. Then you could take the last 10,000 entries of the sorted array (or sort in descending order and take the first 10,000). Merge sort sounds like what you want. Also, if you really need speed, you could look into formatting your values for a radix sort; that would take a bit of work, but it sounds like it would be the absolute fastest way to solve this problem.

Find a number with even number of occurrences

Given an array where the number of occurrences of each number is odd, except for one number whose number of occurrences is even, find the number with even occurrences.
e.g.
1, 1, 2, 3, 1, 2, 5, 3, 3
Output should be:
2
The below are the constraints:
Numbers are not in range.
Do it in-place.
Required time complexity is O(N).
Array may contain negative numbers.
Array is not sorted.
With the above constraints, all my thoughts failed: comparison based sorting, counting sort, BST's, hashing, brute-force.
I am curious to know: Will XORing work here? If yes, how?
This problem has been occupying my subway rides for several days. Here are my thoughts.
If A. Webb is right and this problem comes from an interview or is some sort of academic problem, we should think about the (wrong) assumptions we are making, and maybe try to explore some simple cases.
The two extreme subproblems that come to mind are the following:
The array contains two values: one of them is repeated an even number of times, and the other is repeated an odd number of times.
The array contains n-1 different values: all values are present once, except one value that is present twice.
Maybe we should split into cases based on the number of different values.
If we suppose that the number of different values is O(1), each array would have m different values, with m independent of n. In this case, we could loop through the original array erasing and counting occurrences of each value. In the example it would give
1, 1, 2, 3, 1, 2, 5, 3, 3 -> First value is 1 so count and erase all 1
2, 3, 2, 5, 3, 3 -> Second value is 2, count and erase
-> Stop because 2 was found an even number of times.
This would solve the first extreme example with a complexity of O(mn), which evaluates to O(n).
Better still: if the number of different values is O(1), we could count value appearances inside a hash map, go through them after reading the whole array, and return the one that appears an even number of times. This would still be considered O(1) memory.
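A hedged sketch of that hash-map counting pass (the method name is mine):

import java.util.HashMap;
import java.util.Map;

static int findEvenOccurrence(int[] a) {
    Map<Integer, Integer> counts = new HashMap<>();          // O(m) entries, O(1) if m is constant
    for (int x : a) counts.merge(x, 1, Integer::sum);
    for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
        if (e.getValue() % 2 == 0) return e.getKey();        // the single even-count value
    }
    throw new IllegalArgumentException("no value occurs an even number of times");
}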
The second extreme case would consist in finding the only repeated value inside an array.
This seems impossible in O(n), but there are special cases where we can do it: if the array has n elements and the values inside are {1, ..., n-1} plus one repeated value (or some variant, like all numbers between x and y). In this case, we sum all the values, subtract n(n-1)/2 from the sum, and retrieve the repeated value.
Solving the second extreme case with arbitrary values inside the array, or the general case where m is not constant in n, in constant memory and O(n) time seems impossible to me.
Extra note: here, XORing doesn't work because the number we want appears an even number of times and others appear an odd number of times. If the problem was "give the number that appears an odd number of times, all other numbers appear an even number of times" we could XOR all the values and find the odd one at the end.
We could try to look for a method using this logic: we would need something like a function that, applied an odd number of times to a number, would yield 0, and applied an even number of times would be the identity. I don't think this is possible.
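For completeness, a quick sketch of the XOR trick for the reversed-parity variant mentioned above (the method name is mine; it does not solve the posted problem):

static int findOddOccurrence(int[] a) {
    int acc = 0;
    for (int x : a) acc ^= x;   // values appearing an even number of times cancel out
    return acc;                 // the single value that appears an odd number of times
}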
Introduction
Here is a possible solution. It is rather contrived and not practical, but then, so is the problem. I would appreciate any comments if I have holes in my analysis. If this was a homework or challenge problem with an “official” solution, I’d also love to see that if the original poster is still about, given that more than a month has passed since it was asked.
First, we need to flesh out a few ill-specified details of the problem. Time complexity required is O(N), but what is N? Most commentators appear to be assuming N is the number of elements in the array. This would be okay if the numbers in the array were of fixed maximum size, in which case Michael G’s solution of radix sort would solve the problem. But, I interpret constraint #1, in absence of clarification by the original poster, as saying the maximum number of digits need not be fixed. Therefore, if n (lowercase) is the number of elements in the array, and m the average length of the elements, then the total input size to contend with is mn. A lower bound on the solution time is O(mn) because this is the read-through time of the input needed to verify a solution. So, we want a solution that is linear with respect to total input size N = nm.
For example, we might have n = m, that is sqrt(N) elements of sqrt(N) average length. A comparison sort would take O( log(N) sqrt(N) ) < O(N) operations, but this is not a victory, because the operations themselves on average take O(m) = O(sqrt(N)) time, so we are back to O( N log(N) ).
Also, a radix sort would take O(mn) = O(N) if m were the maximum length instead of average length. The maximum and average length would be on the same order if the numbers were assumed to fall in some bounded range, but if not we might have a small percentage with a large and variable number of digits and a large percentage with a small number of digits. For example, 10% of the numbers could be of length m^1.1 and 90% of length m*(1-10%*m^0.1)/90%. The average length would be m, but the maximum length m^1.1, so the radix sort would be O(m^1.1 n) > O(N).
Lest there be any concern that I have changed the problem definition too dramatically, my goal is still to describe an algorithm with time complexity linear to the number of elements, that is O(n). But, I will also need to perform operations of linear time complexity on the length of each element, so that on average over all the elements these operations will be O(m). Those operations will be multiplication and addition needed to compute hash functions on the elements and comparison. And if indeed this solution solves the problem in O(N) = O(nm), this should be optimal complexity as it takes the same time to verify an answer.
One other detail omitted from the problem definition is whether we are allowed to destroy the data as we process it. I am going to do so for the sake of simplicity, but I think with extra care it could be avoided.
Possible Solution
First, the constraint that there may be negative numbers is an empty one. With one pass through the data, we will record the minimum element, z, and the number of elements, n. On a second pass, we will add (3-z) to each element, so the smallest element is now 3. (Note that a constant number of numbers might overflow as a result, so we should do a constant number of additional passes through the data first to test these for solutions.) Once we have our solution, we simply subtract (3-z) to return it to its original form. Now we have available three special marker values 0, 1, and 2, which are not themselves elements.
Step 1
Use the median-of-medians selection algorithm to determine the 90th percentile element, p, of the array A and partition the array into two sets S and T, where S has the 10% of n elements greater than p and T has the elements less than p. This takes O(n) steps (with steps taking O(m) on average, for O(N) total time). Elements matching p could be placed either into S or T, but for the sake of simplicity, run through the array once, test for p, and eliminate it by replacing it with 0. Set S originally spans indexes 0..s, where s is about 10% of n, and set T spans the remaining 90% of indexes, s+1..n.
Step 2
Now we are going to loop through i in 0..s and for each element e_i we are going to compute a hash function h(e_i) into s+1..n. We’ll use universal hashing to get uniform distribution. So, our hashing function will do multiplication and addition and take linear time on each element with respect to its length.
We’ll use a modified linear probing strategy for collisions:
h(e_i) is occupied by a member of T (meaning A[ h(e_i) ] < p but is not a marker 1 or 2) or is 0. This is a hash table miss. Insert e_i by swapping elements from slots i and h(e_i).
h(e_i) is occupied by a member of S (meaning A[ h(e_i) ] > p) or markers 1 or 2. This is a hash table collision. Do linear probing until either encountering a duplicate of e_i or a member of T or 0.
If a member of T, this is again a hash table miss, so insert e_i as in (1.) by swapping to slot i.
If a duplicate of e_i, this is a hash table hit. Examine the next element. If that element is 1 or 2, we've seen e_i more than once already; change 1s into 2s and vice versa to track its change in parity. If the next element is not 1 or 2, then we've only seen e_i once before. We want to store a 2 into the next element to indicate we've now seen e_i an even number of times. We look for the next "empty" slot, that is, one occupied by a member of T (which we'll move to slot i) or a 0, and shift the elements from h(e_i)+1 up to that slot down by one so we have room next to h(e_i) to store our parity information. Note we do not need to store e_i itself again, so we've used up no extra space.
So basically we have a functional hash table with 9 times as many slots as elements we wish to hash. Once we start getting hits, we begin storing parity information as well, so we may end up with only 4.5 times as many slots, still a very low load factor. There are several collision strategies that could work here, but since our load factor is low, the average number of collisions should also be low, and linear probing should resolve them with suitable time complexity on average.
Step 3
Once we finished hashing elements of 0..s into s+1..n, we traverse s+1..n. If we find an element of S followed by a 2, that is our goal element and we are done. Any element e of S followed by another element of S indicates e was encountered only once and can be zeroed out. Likewise e followed by a 1 means we saw e an odd number of times, and we can zero out the e and the marker 1.
Rinse and Repeat as Desired
If we have not found our goal element, we repeat the process. Our 90th percentile partition will move the 10% of n remaining largest elements to the beginning of A and the remaining elements, including the empty 0-marker slots to the end. We continue as before with the hashing. We have to do this at most 10 times as we process 10% of n each time.
Concluding Analysis
Partitioning via the median-of-medians algorithm has time complexity of O(N), which we do 10 times, still O(N). Each hash operation takes O(1) on average since the hash table load is low and there are O(n) hash operations in total performed (about 10% of n for each of the 10 repetitions). Each of the n elements have a hash function computed for them, with time complexity linear to their length, so on average over all the elements O(m). Thus, the hashing operations in aggregate are O(mn) = O(N). So, if I have analyzed this properly, then on whole this algorithm is O(N)+O(N)=O(N). (It is also O(n) if operations of addition, multiplication, comparison, and swapping are assumed to be constant time with respect to input.)
Note that this algorithm does not utilize the special nature of the problem definition that only one element has an even number of occurrences. That we did not utilize this special nature of the problem definition leaves open the possibility that a better (more clever) algorithm exists, but it would ultimately also have to be O(N).
See the following article: Sorting algorithm that runs in time O(n) and also sorts in place. Assuming that the maximum number of digits is constant, we can sort the array in place in O(n) time.
After that it is a matter of counting each number's appearances, which will take on average n/2 time to find the one number whose number of occurrences is even.

Resources