Fastest algorithm to determine if matrix has maximum rank - algorithm

What is the fastest algorithm to determine if an N x N matrix has rank N?
In my case, I need an algorithm that is optimal for N around the order of 30. Is there any better way than computing the determinant, and check if it's not zero? I have the feeling that somehow the determinant of a matrix adds some information with respect to the rank.. and I don't even want to know the rank, I just would like to see if it's maximum.

Related

Finding the weighted median in an unsorted array in linear time

This is from the practice problem in one of coursera's Algorithms courses; I've been stuck for a couple of weeks.
The problem is this:
Given an array of n distinct unsorted elements x1, x2, ..., xn ε X with positive weights w1, w2, ..., wn ε W, a weighted median is an element xk for which the total weight of all elements with values less than xk is at most (total weight)/2 and also the total weight of elements with values larger than xk is at most (total weight)/2. Observe that there are at most two weighted. Show how to compute all weighted medians in O(n) worst time
The course mostly covered divide and conquer algorithms, so I think the key to get started on this would be to identify which of the algorithms covered can be used for this problem.
One of the algorithms covered was the RSelect algorithm in the form RSelect(array X, length n, order statistic i) which for a weighted median could be written as RSelect(array X, weights W, length n, order statistic i). My issue with this approach is that it assumes I know the median value ahead of time, which seems unlikely. There's also the issue that the pivot is chosen uniformly at random, which I don't imagine is likely to work with weights without computing every weight for every entry.
Next is the DSelect algorithms, where using a median of medians approach a pivot may be computed without randomization so we can compute a proper median. This seems like the approach that could work, where I have trouble is that it also assumes that I know ahead of time the value I'm looking for.
DSelect(array A, length n, order statistic i) for an unweighted array
DSelect(array A, weights W, length n, order statistic i) for a weighted array
Am I overthinking this? Should I use DSelect assuming that I know the value of (total weight) / 2 ahead of time? I guess even if I compute it it would add only linear time to the running time. But then it would be no different from precomputing a weighted array (combine A, W into Q where qi = xi*wi) and transforming this back to an unweighted array problem where I can use RSelect (plus some accounting for cases where there are two medians)
I've found https://archive.org/details/lineartimealgori00blei/page/n3 and https://blog.nelsonliu.me/2016/07/05/gsoc-week-6-efficient-calculation-of-weighted-medians/ which describe this problem, but their approach doesn't seem to be something covered in the course (and I'm not familiar with heaps/heapsort)
This problem can be solved with a simple variant of quickselect:
Calculate the sum of all weights and divide by 2 to get the target sum
Choose a pivot and partition the array into larger and smaller elements
Sum the weights in the smaller partition, and subtract from the total to get the sum in the other partition
go back to 2 to process the appropriate partition with the appropriate target sum
Just like normal quickselect, this becomes linear in the worst case if you use the (normal, unweighted) median-of-medians approach to choose a pivot.
This average performance can be achieved with Quickselect.
The randomly chosen pivot can be chosen - with weighting - with the Reservoir Sampling Algorithm. You are correct that it is O(n) to find the first pivot, but the size of the lists that you're working with will follow a geometric series, so the total cost of finding pivots will still work out to be only O(n).

speed-up computation of sum over several subsets

Let's say I have a huge array of doubles w[] indexed from 0 to n-1.
I also have a list of m subsets of [0;n-1]. For each subset S, I am trying to compute the sums of w[i] over S.
Obviously I can compute this separately for each subset, which is going to be in O(m * n).
However is there any faster way to do this? I'm talking from a practical standpoint, as I think you can't have a lower asymptotic bound. Is it possible to pre-process all the subsets and store them in such a way that computing all the sums is faster?
Thanks!
edit :
to give some order of magnitude, my n would be around 20 millions, and m around ~200.
For subsets that are dense (or nearly dense) you may be able to speed up the computation by computing a running sum of the elements. That is, create another array in parallel with w, where each element in the parallel array contains the sum of the elements of w up to that point.
To compute the sum for a dense subset, you that the starting and ending positions in the parallel array, and subtract the running sum at the start from the running sum at the end. The difference between the two is (ignoring rounding errors) the sum for that subset.
For a nearly dense subset, you start by doing the same, then subtract off the values of the (relatively few) items in that range that aren't part of the set.
These may not produce exactly the same result as you'd get by naively summing the subset though. If you need better accuracy, you'd probably want to use Kahan summation for your array of running sums, and possibly preserve its error residual at each point, to be taken into account when doing the subtraction.

Distribution of pairwise distances between many integers

We have M unique integers between 1 and N. In real life, N is a few millions, and M is between N/10 and N/3. I need to compute a distribution of pairwise distances between the M integers.
The brute-force complexity of the problem is M^2, but the output is just N numbers. So the natural question is whether there is a faster algorithm. Even an algorithm as fast as N * sqrt(M) should be sufficient for our purposes.
The problem appeared as a subset of the following problem. We have a large virtual square symmetric matrix, few million by few million elements. Some rows and columns of the matrix are masked out. We need to find how many masked-out elements are in each diagonal of the matrix. One can easily calculate how many masked-out bins intersect each diagonal. But often a masked-out row and column would intersect right on the diagonal, thus masking out only one bin. To avoid double-counting these, we need pairwise distribution of distances between masked-out columns.
You can do this in O(NlogN) using the Fourier transform.
The idea is that you first compute a histogram H(x) of your M integers where H(x) is the number of times the value x appears in your input (which will be either 0 or 1 if all M are distinct - but this is not essential).
Then what you want to compute is A(d), where A(d) defined as the number of pairs of integers that are exactly d apart.
This can be computed as A(d) = sum(H(x)*H(x+d) for all x)
This type of function is called a convolution and can be efficiently computed by taking the Fourier transform, multiplying the output by itself, and then computing the inverse transform. Care needs to be taken to pad appropriately, for example see this question.
If you use Python, this is particularly easy as you can call scipy.signal.fftconvolve to do this operation.

Maximum sum/area submatrix

Input: nxn matrix of postitive/negative numbers and k.
Output: submatrix with maximum sum of its elements divided by its number of elements that has at least k elements.
Is there any algorithm better than O(n^4) for this problem?
An FFT-based divide-and-conquer approach to this problem:
https://github.com/thearn/maximum-submatrix-sum
It's not as efficient as Kadane's, (O(N^3) vs. O(N^3 log N)) but does give a different take on solution construction.
There is a O(n^3) 2-d kadane algorithm for finding the maximum sum submatrix (i.e. subrectangle) in an nxn matrix. (You can find posts on SO about it, or read online). Once you understand how the algorithm works, it is clear that you can get an O(n^3) time solution for your problem if you can solve the problem of finding a maximum average subinterval of length at least m in a 1-d array of n numbers in O(n) time. This is indeed possible, see the paper cs.slu.edu/~goldwasser/publications/DensityPreprint.pdf
Thus there is an O(n^3) time solution for your problem.

Algorithm for generating a size k error-correcting code on n bits

I want to generate a code on n bits for k different inputs that I want to classify. The main requirement of this code is the error-correcting criteria: that the minimum pairwise distance between any two encodings of different inputs is maximized. I don't need it to be exact - approximate will do, and ease of use and speed of computational implementation is a priority too.
In general, n will be in the hundreds, k in the dozens.
Also, is there a reasonably tight bound on the minimum hamming distance between k different n-bit binary encodings?
The problem of finding the exact best error-correcting code for given parameters is very hard, even approximately best codes are hard. On top of that, some codes don't have any decent decoding algorithms, while for others the decoding problem is quite tricky.
However, you're asking about a particular range of parameters where n ≫ k, where if I understand correctly you want a k-dimensional code of length n. (So that k bits are encoded in n bits.) In this range, first, a random code is likely to have very good minimum distance. The only problem is that decoding is anywhere from impractical to intractible, and actually calculating the minimum distance is not that easy either.
Second, if you want an explicit code for the case n ≫ k, then you can do reasonably well with a BCH code with q=2. As the Wikipedia page explains, there is a good decoding algorithm for BCH codes.
Concerning upper bounds for the minimum Hamming distance, in the range n ≫ k you should start with the Hamming bound, also known as the volume bound or the sphere packing bound. The idea of the bound is simple and beautiful: If the minimum distance is t, then the code can correct errors up to distance floor((t-1)/2). If you can correct errors out to some radius, it means that the Hamming balls of that radius don't overlap. On the other hand, the total number of possible words is 2n, so if you divide that by the number of points in one Hamming ball (which in the binary case is a sum of binomial coefficients), you get an upper bound on the number of error-free code words. It is possible to beat this bound, but for large minimum distance it's not easy. In this regime it's a very good bound.

Resources