Efficiently counting matching inversions between multiple lists at once - algorithm

This is a question about taking multiple rankings of the same elements and computing a statistic that depends on whether each possible inversion matches between one or more lists or not.
Given
L equal-length lists of length N comprising the first N integers in various orders
an arbitrary set of L weights, one corresponding to each list
some arbitrary non-linear scalar function f
I'd like to compute in O(L N log N) time (it's trivial to do in O(L N^2) time) the following quantity:
sum over all pairs i < j of f( sum over l of w_l * x_{l,i,j} )
where w_l is the weight of list l and x_{l,i,j} is 1 if the integer i appears before the integer j in list l, and 0 otherwise.
Is there a known algorithm for doing this? Or is it known to be impossible in O(L N log N) time?
So far I've found a binary indexed tree (Fenwick tree) inversion-counting algorithm, and I'm wondering if I could get what I need by building upon this idea.
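For concreteness, here is a minimal Python sketch (my own illustration, not the asker's code) of the standard binary indexed tree inversion count referred to above; it counts in O(N log N) the pairs that appear out of order in a single list of the integers 0..N-1:

def count_inversions(order):
    n = len(order)
    tree = [0] * (n + 1)                 # 1-based Fenwick tree over values

    def update(pos):                     # record one occurrence of value `pos`
        while pos <= n:
            tree[pos] += 1
            pos += pos & -pos

    def query(pos):                      # how many values <= pos were seen so far
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    inversions = 0
    for seen, value in enumerate(order):
        inversions += seen - query(value + 1)   # earlier values greater than `value`
        update(value + 1)
    return inversions

print(count_inversions([2, 0, 3, 1]))    # -> 3

The open question above is whether the per-pair indicators that f needs can be accumulated with the same one-pass Fenwick trick, since the non-linear f is applied to each pair's weighted count rather than to a single global total.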

Related

number of possible arrays with certain conditions using dp

An array of increasing natural numbers between 1 and n is called beautiful if each number in the array is divisible by its previous number. Using dynamic programming, the question is to find the number of beautiful arrays of size k within the given time complexity:
O(n*k*root(n))
O(n*k*log(n))
What I could think of for the first one is that the number of divisors of a number can be found in O(√n) time. I want to design a recursive algorithm that calculates the number of possible arrays for each i < k, but I can't figure out how.
This problem can be broken into two parts:
Find the divisor DAG (nodes 1…n, arcs a → b iff a divides b). Trial division will do this in Θ(n √n); enumerating multiples, in Θ(n log n). The graph has Θ(n log n) arcs.
Count the number of paths of length k in a DAG. This is a basic Θ((m + n) k)-time dynamic program. There is one path of length 0 from each node. The number of paths of length ℓ from each node is the sum of the number of paths of length ℓ−1 from its successors.
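To make the two parts concrete, here is a short Python sketch of this recipe (my own code; it assumes a beautiful array of size k corresponds to a path with k nodes, i.e. k−1 arcs, in the divisor DAG):

def count_beautiful_arrays(n, k):
    # arcs a -> b for every proper multiple b of a, built in Theta(n log n)
    successors = [[] for _ in range(n + 1)]
    for a in range(1, n + 1):
        for b in range(2 * a, n + 1, a):
            successors[a].append(b)

    # paths[v] = number of paths with the current number of arcs starting at v
    paths = [1] * (n + 1)                # zero arcs: just the node itself
    for _ in range(k - 1):               # extend k - 1 times, Theta((m + n) k) in total
        paths = [sum(paths[b] for b in successors[a]) for a in range(n + 1)]
    return sum(paths[1:])

print(count_beautiful_arrays(6, 2))      # -> 8 pairs (a, b) with a | b and a < b <= 6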

Radix sort explanation

Based on this radix sort article http://www.geeksforgeeks.org/radix-sort/ I'm struggling to understand what is being explained in terms of the time complexity of certain methods in the sort.
From the link:
Let there be d digits in input integers. Radix Sort takes O(d*(n+b)) time where b is the base for representing numbers; for example, for the decimal system, b is 10. What is the value of d? If k is the maximum possible value, then d would be O(log_b(k)). So the overall time complexity is O((n+b) * log_b(k)), which looks like more than the time complexity of comparison-based sorting algorithms for a large k. Let us first limit k. Let k ≤ n^c where c is a constant. In that case, the complexity becomes O(n log_b(n)).
So I do understand that the sort takes O(d*n) since there are d digits therefore d passes, and you have to process all n elements, but I lost it from there. A simple explanation would be really helpful.
Assuming we use bucket sort for the sorting on each digit: for each digit (d), we process all numbers (n), placing them in buckets for all possible values a digit may have (b).
We then need to process all the buckets, recreating the original list. Placing all items in the buckets takes O(n) time, recreating the list from all the buckets takes O(n + b) time (we have to iterate over all buckets and all elements inside them), and we do this for all digits, giving a running time of O(d * (n + b)).
This is only linear if d is a constant and b is not asymptotically larger than n. So indeed, if you have numbers of log n bits, it will take O(n log n) time.
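To illustrate the O(d * (n + b)) bound, here is a small LSD radix sort sketch in Python (my own, not taken from the linked article; it assumes non-negative integers):

def radix_sort(numbers, base=10):
    if not numbers:
        return numbers
    digit = 1                                   # current digit weight: 1, base, base^2, ...
    while max(numbers) // digit > 0:            # d = O(log_base(max value)) passes
        buckets = [[] for _ in range(base)]     # b buckets
        for x in numbers:                       # O(n): distribute by the current digit
            buckets[(x // digit) % base].append(x)
        numbers = [x for bucket in buckets for x in bucket]   # O(n + b): collect
        digit *= base
    return numbers

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))   # -> [2, 24, 45, 66, 75, 90, 170, 802]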

Find triplets in an array whose sum is some integer X

Given a sorted array[1..n] where each element ranges from 1 to 2n, is there a way to find a triplet whose sum is a given integer x? I know an O(n^2) solution. Is there any algorithm better than the O(n^2) one?
It is possible to achieve O(n log n) time complexity using the fact that the maximum value of each element is O(n).
For each 1 <= y <= 4 * n, let's find the number of pairs of elements that sum up to y. We can create a polynomial of degree 2 * n, where the i-th coefficient of this polynomial is the number of occurrences of the number i in the given array. Now we can find the square (I'll call it s) of this polynomial in O(n log n) time using the Fast Fourier Transform. The i-th coefficient of s is exactly the number of pairs of elements that sum up to i.
Now we can iterate over the given array. Let's assume that the current element is a. Then we just need to check the number of pairs that sum up to X - a. We have already computed it in step 1.
If all triplets must consist of different elements, we need to subtract the number of such triplets that sum up to X but contain duplicates. We can do it in O(n log n) time, too (for triplets that consist of three equal elements, we just need to subtract the number of occurrences of X / 3 in the given array; for triplets with one duplicate, we can just iterate over the element that is repeated twice (call it a) and subtract the number of occurrences of X - 2 * a).
If we need to find a triplet itself, not just count them, we can do the following:
Count the number of triplets and pairs as suggested above.
Find such an element that there is a pair that sums up to X with it.
Find two elements that sum up to the desired sum of this pair.
All these steps can be accomplished in linear time (using the fact that all sums are O(n)).
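A rough Python sketch of the counting step above, using numpy's FFT (the names and the rounding of the real-valued inverse transform are my own choices; note that it counts ordered pairs, including an element paired with itself, which is exactly why the duplicate triplets are subtracted afterwards):

import numpy as np

def ordered_pair_sums(arr, max_value):
    counts = np.zeros(max_value + 1)             # i-th coefficient = occurrences of i
    for x in arr:
        counts[x] += 1
    size = 2 * max_value + 1                     # the square has degree 2 * max_value
    spectrum = np.fft.rfft(counts, size)
    pair_sums = np.rint(np.fft.irfft(spectrum * spectrum, size)).astype(int)
    return pair_sums                             # pair_sums[y] = ordered pairs summing to y

s = ordered_pair_sums([1, 2, 3, 5], max_value=10)
print(s[8])                                      # -> 2: the ordered pairs (3, 5) and (5, 3)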
Your problem is apparently the non-zero sum variant of the 3SUM problem.
Because you know the possible range of the integers beforehand, you can achieve a better running time than in the general case; according to the second paragraph of the article:
When the elements are integers in the range [-N, ..., N], 3SUM can be solved in O(n + N log N) time by representing the input set S as a bit vector, computing the set S + S of all pairwise sums as a discrete convolution using the Fast Fourier transform, and finally comparing this set to -S.
In your case, you would have to pre-process the array by subtracting n + X/3 before running the algorithm.
One thing to note is that the algorithm assumes you're working with a set of numbers, and I'm not sure what (if any) implications there may be on running time if your array may include duplicates.

Is it possible to compute the minimum of a set of numbers modulo a given number in amortized sublinear time?

Is there a data structure representing a large set S of (64-bit) integers, that starts out empty and supports the following two operations:
insert(s) inserts the number s into S;
minmod(m) returns the number s in S such that s mod m is minimal.
An example:
insert(11)
insert(15)
minmod(7) -> the answer is 15 (which mod 7 = 1)
insert(14)
minmod(7) -> the answer is 14 (which mod 7 = 0)
minmod(10) -> the answer is 11 (which mod 10 = 1)
I am interested in minimizing the maximal total time spent on a sequence of n such operations. It is obviously possible to just maintain a list of elements for S and iterate through them for every minmod operation; then insert is O(1) and minmod is O(|S|), which would take O(n^2) time for n operations (e.g., n/2 insert operations followed by n/2 minmod operations would take roughly n^2/4 operations).
So: is it possible to do better than O(n^2) for a sequence of n operations? Maybe O(n sqrt(n)) or O(n log(n))? If this is possible, then I would also be interested to know if there are data structures that additionally admit removing single elements from S, or removing all numbers within an interval.
Another idea based on a balanced binary search tree, as in Keith's answer.
Suppose all inserted elements so far are stored in a balanced BST, and we need to compute minmod(m). Consider our set S as a union of subsets of numbers lying in the intervals [0, m-1], [m, 2m-1], [2m, 3m-1], etc. The answer will obviously be among the minimal numbers we have in each of those intervals. So, we can successively look up the tree to find the minimal number in each of those intervals. This is easy to do: for example, if we need to find the minimal number in [a, b], we move left if the current value is greater than a and right otherwise, keeping track of the minimal value in [a, b] we've met so far.
Now if we suppose that m is uniformly distributed in [1, 2^64], let's calculate the mathematical expectation of the number of queries we'll need.
For all m in [2^63, 2^64-1] we'll need 2 queries. The probability of this is 1/2.
For all m in [2^62, 2^63-1] we'll need 4 queries. The probability of this is 1/4.
...
The mathematical expectation will be sum[ 1/(2^k) * 2^k ], for k in [1,64], which is 64 queries.
So, to sum up, the average minmod(m) query complexity will be O(64 * log n). In general, if m has an unknown upper bound, this will be O(log m * log n). The BST update is, as is well known, O(log n), so the overall complexity in the case of n queries will be O(n log m log n).
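Here is a sketch of that interval walk in Python (my own illustration; the balanced BST is replaced by a plain sorted list with bisect, so insert here is O(|S|) rather than the O(log n) assumed above, but the minmod logic is the same):

import bisect

class MinModSet:
    def __init__(self):
        self.sorted_values = []

    def insert(self, s):
        bisect.insort(self.sorted_values, s)     # O(|S|) here; O(log n) with a real BST

    def minmod(self, m):
        best = None
        k = 0                                    # left end of the current interval [k, k + m - 1]
        while True:
            i = bisect.bisect_left(self.sorted_values, k)
            if i == len(self.sorted_values):
                break                            # no element >= k is left
            candidate = self.sorted_values[i]    # the minimal element of its interval
            if best is None or candidate % m < best % m:
                best = candidate
            if candidate % m == 0:
                break                            # cannot do better than 0
            k = (candidate // m + 1) * m         # jump past the interval containing candidate
        return best

S = MinModSet()
S.insert(11); S.insert(15)
print(S.minmod(7))     # -> 15 (15 mod 7 = 1)
S.insert(14)
print(S.minmod(7))     # -> 14 (14 mod 7 = 0)
print(S.minmod(10))    # -> 11 (11 mod 10 = 1)

Note that the walk skips empty intervals by jumping directly past the interval that contains the element it just found, so the number of iterations is bounded by both |S| and the number of intervals analysed above.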
Partial answer too big for a comment.
Suppose you implement S as a balanced binary search tree.
When you seek S.minmod(m), naively you walk the whole tree, so the cost is O(n) per query and O(n^2) over n operations.
However, at a given time during the walk, you have the best (lowest) result so far. You can use this to avoid checking whole sub-trees when:
bestSoFar < leftChild mod m
and
rightChild - leftChild < m - leftChild mod m
This will only help much if the common spacing between the numbers in the set is smaller than common values of m.
Update the next morning...
Grigor has articulated my idea better and more fully, and shown how it works well for "large" m. He also shows how a "random" m is typically "large", so the approach works well.
Grigor's algorithm is so efficient for large m that one needs to think about the risk for much smaller m.
So it is clear that you need to think about the distribution of m and optimise for different cases if need be.
For example, it might be worth simply keeping track of the minimal modulus for very small m.
But suppose m ~ 2^32? Then the search algorithm (certainly as given but also otherwise) needs to check 2^32 intervals, which may amount to searching the whole set anyway.

Average number of intervals from an input in 0..N

The question sprang up when examining the "Find the K missing numbers in this set supposed to cover [0..N]" question.
The author of the question asked for CS answers instead of equation-based answers, and his proposal was to sort the input and then iterate over it to list the K missing numbers.
While this seems fine to me, it also seems wasteful. Let's take an example:
N = 200
K = 2 (we will consider K << N)
missing elements: 53, 75
The "sorted" set can be represented as: [0, 52] U [54, 74] U [76, 200], which is way more compact than enumerating all values of the set (and allows to retrieve the missing numbers in O(K) operations, to be compared with O(N) if the set is sorted).
However, this is only the final result; during construction the list of intervals might be much larger, as we feed the elements one at a time.
Let us, therefore, introduce another variable: let I be the number of elements of the set that we fed to the structure so far. Then, we may at worst have: min((N-K)/2, I) intervals (I think...)
From this we deduce that the number of intervals reached during the construction is the maximum encountered over I in [0..N], the worst case being (N-K)/2, thus O(N).
I have however a gut feeling that if the input is random, instead of being specially crafted, we might get a much lower bound... and thus the always so tricky question:
How many intervals... on average?
Your approach vs. the proposed one with sorting seems to be a classical trade-off of which operation is cheap and which one is expensive.
I find your notation a bit confusing, so please allow me to use my own:
Let S be the set. Let n be the number of items in the set: n = |S|. Let max be the biggest number in the set: max = max(S). Let k be the number of elements not in the set: k = |{0,...,max} \ S|.
For the sorting solution, we could very cheaply insert all n elements into S using hashing. That would take expected O(n). Then for finding the k missing elements, we sort the set in O(n log n), and then determine the missing elements in O(n).
That is, the overall cost for adding n elements and then finding the missing k elements takes expected O(n) + O(n log n) + O(n) = O(n log n).
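A minimal sketch of this first solution (my own code, assuming a non-empty input; hash-insert, sort, then scan neighbouring values for gaps):

def missing_by_sorting(elements):
    values = sorted(set(elements))               # expected O(n) inserts + O(n log n) sort
    missing = list(range(values[0]))             # anything below the smallest element
    for a, b in zip(values, values[1:]):         # O(n) scan for gaps between neighbours
        missing.extend(range(a + 1, b))
    return missing

print(missing_by_sorting([0, 1, 2, 4, 7, 5, 6]))  # -> [3]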
You suggest a different approach in which we represent the set as a list of dense subsets of S. How would you implement such a data structure? I suggest a sorted tree (instead of a list) so that an insert becomes efficient. Because what do you have to do for an insert of a new element e? I think you have to:
1. Find the potential candidate subset(s) in the tree where e could be added.
2. If a subset already contains e, nothing has to be done.
3. If a subset contains e+1 and another subset contains e-1, merge the subsets together and add e to the result.
4. If a subset already contains e+1, but e-1 is not contained in S, add e to that subset.
5. If a subset already contains e-1, but e+1 is not contained in S, add e to that subset.
6. Otherwise, create a new subset holding only the element e and insert it into the tree.
We can expect that finding the subsets needed for the above operations takes O(log n). Operations 4 and 5 take constant time if we represent the subsets as pairs of integers (we just have to decrement the lower or increment the upper boundary). Operations 3 and 6 potentially require changing the tree structure, but we expect that to take at most O(log n), so the whole "insert" will not take more than O(log n).
Now with such a data structure in place, we can easily determine the k missing numbers by traversing the tree in order and collecting the numbers not covered by any of the subsets. The costs are linear in the number of nodes in the tree, which is <= n/2, so the total cost is O(n) for that.
However, if we consider again the complete sequence of operations, we get O(n log n) for the n inserts plus O(n) for finding the k missing numbers, so the overall costs are again O(n log n).
This is not better than the expected costs of the first algorithm.
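For illustration, here is a compact Python sketch of this interval representation and the insert cases 1-6 above (my own code; the disjoint intervals are kept in a sorted Python list located via bisect, so the structural updates are not the O(log n) tree operations assumed in the analysis, but the case handling is the same):

import bisect

class IntervalSet:
    def __init__(self):
        self.intervals = []                        # disjoint, sorted [lo, hi] pairs

    def insert(self, e):
        i = bisect.bisect_right(self.intervals, [e, float('inf')])
        left = self.intervals[i - 1] if i > 0 else None
        right = self.intervals[i] if i < len(self.intervals) else None
        if left and left[0] <= e <= left[1]:
            return                                 # case 2: already contained
        if left and left[1] == e - 1 and right and right[0] == e + 1:
            left[1] = right[1]                     # case 3: bridge the two neighbours
            del self.intervals[i]
        elif right and right[0] == e + 1:
            right[0] = e                           # case 4: extend the right neighbour down
        elif left and left[1] == e - 1:
            left[1] = e                            # case 5: extend the left neighbour up
        else:
            self.intervals.insert(i, [e, e])       # case 6: new singleton interval

    def missing(self, n):
        # collect the numbers in [0, n] not covered by any interval
        result, expected = [], 0
        for lo, hi in self.intervals:
            result.extend(range(expected, lo))
            expected = hi + 1
        result.extend(range(expected, n + 1))
        return result

s = IntervalSet()
for x in range(201):
    if x not in (53, 75):
        s.insert(x)
print(s.intervals)      # -> [[0, 52], [54, 74], [76, 200]]
print(s.missing(200))   # -> [53, 75]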
A third solution is to use a boolean array to represent the set and a single integer max for the biggest element in the set.
If an element e is added to the set, you set array[e] = true. You can implement the variable size of the set using table expansion, so the cost for inserting an element into the array is amortized constant.
To retrieve the missing elements, you just collect those elements f where array[f] == false. This will take O(max).
The overall costs for inserting n elements and finding the k missing ones is thus: O(n) + O(max). However, max = n + k, and so we get as the overall costs O(n + k).
A fourth method, which is a cross-over between the third one and the one using hashing, also uses hashing but doesn't require sorting.
Store your set S in a hash set, and also store the maximum element in S in a variable max. To find the k missing ones, first generate a result set R containing all numbers {0,...,max}. Then iterate over S and delete every element in S from R.
The costs for that are also O(n + k).
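A minimal sketch of this fourth method (my own code; instead of building R and deleting from it, it does the equivalent single O(max) pass that checks membership in the hash set):

def missing_numbers(elements):
    seen = set()                                  # n expected-O(1) hash inserts
    maximum = -1
    for e in elements:
        seen.add(e)
        maximum = max(maximum, e)
    return [x for x in range(maximum + 1) if x not in seen]   # O(max) = O(n + k)

print(missing_numbers([0, 1, 2, 4, 7, 5, 6]))     # -> [3]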
