Compare two lists of distances (floats) in python - algorithm

I'm looking to compare two lists of distances (floats) in Python. The distances represent how far away my robot is from a wall at different angles. One list is my "best guess" distance list and the other is the list of actual distances. I need to return a number in [0, 1] that represents the similarity between these two lists of floats. The distances match up one to one; that is, the distance at index 0 in one list should be compared to the distance at index 0 in the other.

Right now, for each index, I divide the smaller number by the larger number to get a per-index similarity ratio, then take the average of these ratios (their total divided by the number of entries in the list) to get a number between 0 and 1. However, this approach does not seem to be accurate enough. Is there a better algorithm for comparing two ordered lists of floats?

It looks like you need a normalized Euclidean distance between two vectors.
It is simple to calculate, and you can read more about it here.
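For instance, a minimal Python sketch (the function name and the 1 / (1 + distance) squashing into [0, 1] are my own choices, not something the answer prescribes):

    import math

    def similarity(guess, actual):
        # Euclidean distance between the two equal-length lists
        dist = math.sqrt(sum((g - a) ** 2 for g, a in zip(guess, actual)))
        # Squash into [0, 1]: 1.0 means identical lists, values approach 0.0
        # as the lists diverge. Other normalizations are possible.
        return 1.0 / (1.0 + dist)

    best_guess = [1.2, 3.4, 2.2]   # made-up readings
    measured   = [1.0, 3.6, 2.1]
    print(similarity(best_guess, measured))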

Related

Number of subsets whose XOR contains less than two set bits

I have an array A (size <= 10^5) of numbers (each <= 10^8), and I need to answer up to 50,000 queries. For each query (L, R), I must count how many subsets of the elements in the range [L, R] have an XOR with 0 or 1 bits set (i.e. zero or a power of 2). Point modifications to the array happen in between the queries, so I can't do offline processing or use techniques like square-root decomposition.
I have an approach where I use DP to calculate the count for a given range, something along the lines of this:
https://www.geeksforgeeks.org/count-number-of-subsets-having-a-particular-xor-value/
But this is clearly too slow. This feels like a classical segment tree problem, but I can't figure out what data to store at each node so that the left and right children can be combined to compute the answer for a given range.
Yeah, that DP won't be fast enough.
What will be fast enough is applying some linear algebra over GF(2), the Galois field with two elements. Each number can be interpreted as a bit-vector; adding/subtracting vectors is XOR; scalar multiplication isn't really relevant.
The data you need for each segment is (1) how many numbers there are in the segment, and (2) a basis for the subspace generated by the numbers in the segment, which will consist of at most 27 numbers because all the values are less than 2^27. The basis for a one-element segment is just that number if it's nonzero, else the empty set. To merge two segments, find a basis for the span of the union of their bases: run Gaussian elimination and discard the zero vectors.
Given the length of an interval and a basis for it, you can count the good subsets using the rank-nullity theorem. For each target number (here, 0 and each power of two below 2^27), use your Gaussian elimination routine to test whether the target belongs to the subspace. If it does, there are 2^(length of interval minus size of basis) subsets whose XOR equals that target; if not, there are none. Sum this over the targets.
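A rough Python sketch of the per-node data and the routines described above (the function names are mine, and the segment-tree plumbing and point updates are left out):

    MAXB = 27  # all values are below 2^27, since they are at most 10^8

    def basis_insert(basis, x):
        # basis is a list of MAXB ints; basis[i] is 0 or has highest set bit i.
        for i in range(MAXB - 1, -1, -1):
            if not (x >> i) & 1:
                continue
            if basis[i] == 0:
                basis[i] = x
                return True        # x enlarged the span
            x ^= basis[i]
        return False               # x was already in the span

    def leaf_basis(value):
        b = [0] * MAXB
        basis_insert(b, value)     # stays empty if value == 0
        return b

    def merge(a, b):
        # Basis for the span of the union of two bases (Gaussian elimination).
        out = a[:]
        for v in b:
            if v:
                basis_insert(out, v)
        return out

    def in_span(basis, x):
        for i in range(MAXB - 1, -1, -1):
            if (x >> i) & 1:
                x ^= basis[i]
        return x == 0

    def count_good_subsets(length, basis):
        # Rank-nullity: if a target t lies in the span, exactly
        # 2^(length - rank) subsets XOR to t (for t == 0 this count
        # includes the empty subset).
        rank = sum(1 for v in basis if v)
        targets = [0] + [1 << k for k in range(MAXB)]
        hits = sum(1 for t in targets if in_span(basis, t))
        return hits * (1 << (length - rank))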

Partitioning an ordered list of weights into N sub-lists of approximately equal weight

Suppose I have an ordered list of weights, having length M. I want to divide this list into N ordered non-empty sublists, where the sum of the weights in each sublist are as close to each other as possible. Finally, the length of the list will always be greater than or equal to the number of partitions.
For example:
A reader of epic fantasy wants to read the entire Wheel of Time series in N = 90 days. She wants to read approximately the same number of words each day, but she doesn't want to break a single chapter across two days. Obviously, she also doesn't want to read it out of order. The series has a total of M chapters, and she has a list of the word counts of each.
What algorithm could she use to calculate the optimum reading schedule?
In this example, the weights probably won't vary much, but the algorithm I'm seeking should be general enough to handle weights that vary widely.
As for what I consider optimum: given the choice, having two or three partitions deviate from the average by a small amount would be better than having one partition deviate a lot. In other words, she would rather have several days where she reads a few hundred words more or fewer than the average if it means she can avoid having to read a thousand words more or fewer than the average, even once. My thinking is to use something like this to compute the score of any given solution:
Let w_1, w_2, ..., w_N be the weights of each partition (calculated by simply summing the weights of its elements).
Let x be the total weight of the list divided by the number of partitions N (i.e. the ideal weight of each partition).
Then the score would be the sum, as i goes from 1 to N, of (x - w_i)^2.
So, I think I know a way to score each solution. The question is, what's the best way to minimize the score, other than brute force?
Any help or pointers in the right direction would be much appreciated!
You are probably looking for a "minimum raggedness word wrap" algorithm.
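A sketch of one way to do this with dynamic programming in Python, using the sum-of-squared-deviations score from the question (the function and variable names are mine; the runtime is O(N * M^2)):

    def schedule(word_counts, days):
        # Split word_counts into `days` contiguous, non-empty groups,
        # minimizing the sum of squared deviations from the ideal daily load.
        m = len(word_counts)
        prefix = [0]
        for w in word_counts:
            prefix.append(prefix[-1] + w)
        ideal = prefix[-1] / days

        def cost(i, j):                  # chapters i..j-1 read in one day
            return (prefix[j] - prefix[i] - ideal) ** 2

        INF = float("inf")
        # best[d][j] = minimal score splitting the first j chapters into d days
        best = [[INF] * (m + 1) for _ in range(days + 1)]
        back = [[0] * (m + 1) for _ in range(days + 1)]
        best[0][0] = 0.0
        for d in range(1, days + 1):
            for j in range(d, m + 1):
                for i in range(d - 1, j):
                    c = best[d - 1][i] + cost(i, j)
                    if c < best[d][j]:
                        best[d][j], back[d][j] = c, i
        # Walk the back-pointers to recover the day boundaries.
        cuts, j = [], m
        for d in range(days, 0, -1):
            i = back[d][j]
            cuts.append((i, j))
            j = i
        return list(reversed(cuts))      # (start, end) chapter ranges per day

    chapters = [4200, 3900, 5100, 2800, 4700, 3600]   # made-up word counts
    print(schedule(chapters, days=3))                 # [(0, 2), (2, 4), (4, 6)]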

How to compute a distance between two byte arrays?

I need to calculate the distance between two byte arrays of the same length. In particular, I am looking for an approach that yields a distance with the following features:
if the two arrays are very similar to each other, then the distance should be very small;
otherwise, the distance should be very large.
Basically, I'm looking for a way to measure the difference between two arrays.
UPDATE: As suggested, here is some additional information about the content of a byte array. The bytes encode features of an image: the image is divided into small regions, and some color information is measured for each region (each byte encodes the information for a single region). When a bit is set within a byte, it means that a given feature is present within that region.
Therefore, given two such sequences of bytes, I would like to compare them using a suitable distance measure. I read about the Bhattacharyya distance, but I do not know how to apply it in this case, so I was wondering whether there are other distance measures for comparing two byte arrays.
You can use the Euclidean distance for this. Basically, you add up the squares of the differences between each pair of elements in your arrays and take the square root of that sum.
See http://en.wikipedia.org/wiki/Euclidean_distance
However, there are other distance metrics that could suit your data better, for example Pearson correlation, cosine similarity, Hamming distance, etc.
In order of increasing complexity:
L1 = sum_i |x_i - y_i|
L2 = sum_i |x_i - y_i|^2 (the squared Euclidean distance)
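For concreteness, a small Python sketch of two of these metrics applied directly to byte arrays (the names and sample data are made up). Given the "one bit per feature" encoding described in the question, the bit-level Hamming distance is probably the most natural fit:

    def euclidean(a, b):
        # Treat each byte as an integer 0..255.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def hamming_bits(a, b):
        # Number of feature bits that differ between the two arrays.
        return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

    img1 = bytes([0b10110010, 0b00001111])
    img2 = bytes([0b10100010, 0b00011111])
    print(euclidean(img1, img2), hamming_bits(img1, img2))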

Select a number not present in a list

Is there an elegant method to create a number that does not exist in a given list of floating point numbers? It would be nice if this number were not close to the existing values in the array.
For example, in the list [-1.5, 1e+38, -1e38, 1e-12] it might be nice to pick a number like 20 that's "far" away from the existing numbers as opposed to 0.0 which is not in the list, but very close to 1e-12.
The only algorithm I've been able to come up with involves creating a random number and testing to see if it is not in the array. If so, regenerate. Is there a better deterministic approach?
Here's a way to select a random number not in the list, where the probability is higher the further away from an existing point you get.
Create a probability distribution function f as follows:
f(x) = <the absolute distance to the point closest to x>
Such a function gives higher probability the further away from a given point you are. (Note that it should be normalized so that the area below the function is 1.)
Create the primitive function F of f (i.e. the accumulated area below f up to a given point).
Generate a uniformly random number, x, between 0 and 1 (that's easy! :)
Get the final result by applying the inverse of F to that value: F^-1(x).
Here's a picture describing a situation with 1.5, 2.2 and 2.9 given as the existing numbers (the blue line is f, the red line is F):
Here's the intuition of why it works:
The higher the probability (the higher the blue line is), the steeper the red line is.
The steeper the red line is at a point, the more probable it is that x hits the red line there.
For example: at the given points, the blue line is 0, so the red line is horizontal. Where the red line is horizontal, the probability that x hits that point is zero.
(If you want the full range of doubles, you could set min / max to -Double.MAX_VALUE and Double.MAX_VALUE respectively.)
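A numerical Python sketch of this scheme, approximating f and F on a grid and inverting F by interpolation (the grid size, bounds and function name are my assumptions; for bounds as extreme as the ±1e38 in the question, the grid would need to be much finer or log-spaced):

    import numpy as np

    def sample_far_from(points, lo, hi, n_grid=10_000):
        # Density f(x) proportional to the distance from x to the nearest
        # existing point, sampled via inverse-transform sampling of its CDF F.
        xs = np.linspace(lo, hi, n_grid)
        f = np.min(np.abs(xs[:, None] - np.asarray(points)[None, :]), axis=1)
        f = f + 1e-12                     # keep the CDF strictly increasing
        F = np.cumsum(f)
        F /= F[-1]                        # normalize so F runs up to 1
        u = np.random.rand()              # uniform draw in [0, 1)
        return float(np.interp(u, F, xs)) # F^-1(u)

    print(sample_far_from([1.5, 2.2, 2.9], lo=0.0, hi=5.0))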
If you have the constraint that the new value must be somewhere within [min, max], then you could sort your values and insert the mean of the two adjacent values with the largest absolute difference.
In your sample case, [-1e38, -1.5, 1e-12, 1e+38] is the ordered list. As you calculate the absolute differences, you'll find the maximum difference between the values (1e-12, 1e+38), so you calculate the new value as ((n[i+1] - n[i]) / 2) + n[i] (a simple midpoint calculation).
Update:
Additionally, you could also check whether FLOAT_MAX or FLOAT_MIN would give good candidates: simply check their distances to max and min, and if those distances are larger than the maximum difference between two adjacent values, pick one of them.
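A short Python sketch of the sort-and-split-the-widest-gap idea (the function name is mine):

    def pick_between(values):
        # Midpoint of the widest gap between adjacent sorted values.
        s = sorted(values)
        gaps = [(b - a, a, b) for a, b in zip(s, s[1:])]
        _, lo, hi = max(gaps)
        return lo + (hi - lo) / 2

    # Picks the midpoint of the (1e-12, 1e+38) gap, roughly 5e37.
    print(pick_between([-1.5, 1e+38, -1e38, 1e-12]))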
If there is no upper bound, just sum up the absolute values of all the numbers, or subtract them all.
Another possible solution would be to get the smallest number and the greatest number in the list, and choose something outside their bounds (maybe double the greatest number).
Or probably the best way would be to compute the average, the smallest and the biggest number, as well as the standard deviation. Then, with all this data, you know how the numbers are structured and can choose accordingly (all clustered around a given negative value? Choose a positive one. All small numbers? Choose a big one. And so on.)
Something along the lines of
number := 1
multiplier := random(1000) + 1
if avg > 0
    number := -number
if min < 1 and max > 1
    multiplier := 1 / (random(1000) + 1)
if stdDev > 1000
    number := avg + random(500) - 250
    multiplier := multiplier / (random(1000) + 1)
(just an example from the top of my head)
Or another possibility would be to XOR all the numbers together; that should yield a good result.

Find the "largest" dense sub matrix in a large sparse matrix

Given a large sparse matrix (say 10k+ by 1M+), I need to find a subset, not necessarily contiguous, of the rows and columns that forms a dense matrix (all non-zero elements). I want this submatrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect-ratio constraints.
Are there any known exact or approximate solutions to this problem?
A quick scan on Google seems to give a lot of close-but-not-exactly results. What terms should I be looking for?
edit: Just to clarify; the submatrix need not be contiguous. In fact the row and column order is completely arbitrary, so adjacency is completely irrelevant.
A thought based on Chad Okere's idea:
1. Order the rows from largest count to smallest count (not necessary, but it might help performance)
2. Select two rows that have a "large" overlap
3. Add all other rows that won't reduce the overlap
4. Record that set
5. Add whatever row reduces the overlap by the least
6. Repeat at #3 until the result gets too small
7. Start over at #2 with a different starting pair
8. Continue until you decide the result is good enough
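A rough Python rendering of steps 2-6, assuming each row is given as the set of its non-zero column indices (the representation and function name are mine; steps 7-8 would simply call this for several starting pairs and keep the best result):

    def grow_from_pair(rows, r1, r2):
        # rows: one set of non-zero column indices per matrix row.
        overlap = rows[r1] & rows[r2]
        chosen = {r1, r2}
        best = (0, set(), set())
        remaining = set(range(len(rows))) - chosen
        while True:
            # Step 3: pull in every row that keeps the current overlap intact.
            keep = {r for r in remaining if overlap <= rows[r]}
            chosen |= keep
            remaining -= keep
            # Step 4: record this candidate (rows x columns, all non-zero).
            score = len(chosen) * len(overlap)
            if score > best[0]:
                best = (score, set(chosen), set(overlap))
            if not remaining or len(overlap) <= 1:
                break
            # Step 5: add the row that shrinks the overlap the least.
            r = max(remaining, key=lambda k: len(overlap & rows[k]))
            overlap &= rows[r]
            chosen.add(r)
            remaining.discard(r)
        return best   # (element count, row set, column set)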
I assume you want something like this. You have a matrix like
1100101
1110101
0100101
You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would be 2x4 with 8 elements. Or you could go with columns 2,5,7 and rows 1,2,3, which would be a 3x3 matrix.
If you want an 'approximate' method, you could start with a single non-zero element, then go on to find another non-zero element and add its row and column to your list of rows and columns. At some point you'll run into a non-zero element that, if its row and column were added to your collection, would leave the collection no longer entirely non-zero.
So for the above matrix, if you added 1,1 and 2,2 you would have rows 1,2 and columns 1,2 in your collection. If you tried to add 3,7 it would cause a problem, because 3,1 is zero. So you couldn't add it. You could add 2,5 and 2,7 though, creating the 2x4 submatrix.
You would basically iterate until you can't find any more new rows and columns to add. That would get you to a local optimum. You could store the result and start again with another starting point (perhaps one that didn't fit into your current solution).
Then just stop when you can't find any more after a while.
That, obviously, would take a long time, but I don't know if you'll be able to do it any more quickly.
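A small Python sketch of that grow-until-stuck idea, 0-indexed (the helper name is mine); starting from the top-left element of the example matrix above, it recovers the 8-element block:

    def grow(matrix, start_row, start_col):
        # matrix: list of lists of 0/1. Greedily add rows and columns that keep
        # every cell of the selected block non-zero; stop at a local optimum.
        rows, cols = {start_row}, {start_col}
        changed = True
        while changed:
            changed = False
            for r in range(len(matrix)):
                if r not in rows and all(matrix[r][c] for c in cols):
                    rows.add(r)
                    changed = True
            for c in range(len(matrix[0])):
                if c not in cols and all(matrix[r][c] for r in rows):
                    cols.add(c)
                    changed = True
        return rows, cols

    m = [[1, 1, 0, 0, 1, 0, 1],
         [1, 1, 1, 0, 1, 0, 1],
         [0, 1, 0, 0, 1, 0, 1]]
    print(grow(m, 0, 0))   # rows {0, 1}, columns {0, 1, 4, 6}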
I know you aren't working on this anymore, but I thought someone might have the same question as me in the future.
So, after realizing this is an NP-hard problem (by reduction from MAX-CLIQUE), I decided to come up with a heuristic that has worked well for me so far:
Given an N x M binary/boolean matrix, find a large dense submatrix:
Part I: Generate reasonable candidate submatrices
Consider each of the N rows to be an M-dimensional binary vector v_i, for i = 1 to N
Compute a distance matrix for the N vectors using the Hamming distance
Use the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to cluster vectors
Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order in which the vectors should be combined into submatrices. So each internal node in the hierarchical clustering tree is a candidate submatrix.
Part II: Score and rank candidate submatrices
For each candidate submatrix, eliminate any column that contains one or more zeros, and calculate D, the number of elements remaining in the dense block.
Select the submatrix that maximizes D
I also had some considerations regarding the minimum number of rows that needed to be preserved from the initial full matrix, and I would discard any candidate submatrices that did not meet this criterion before selecting the submatrix with the maximum D value.
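A compact sketch of Parts I and II using SciPy's average-linkage (UPGMA) clustering over Hamming distances (the function name, the NumPy representation and the min_rows handling are my assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import pdist

    def best_submatrix(M, min_rows=2):
        # M: (N x M) 0/1 NumPy array. Returns (D, row indices) for the best
        # candidate produced by average-linkage clustering of the rows.
        n = M.shape[0]
        Z = linkage(pdist(M, metric="hamming"), method="average")
        members = {i: [i] for i in range(n)}   # leaves of the cluster tree
        best = (0, [])
        for k, (a, b, _, _) in enumerate(Z):
            rows = members[int(a)] + members[int(b)]
            members[n + k] = rows              # each internal node = candidate
            if len(rows) < min_rows:
                continue
            dense_cols = int(np.all(M[rows] == 1, axis=0).sum())
            D = len(rows) * dense_cols         # elements in the dense block
            if D > best[0]:
                best = (D, rows)
        return best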
Is this a Netflix problem?
MATLAB or some other sparse matrix libraries might have ways to handle it.
Is your intent to write your own?
Maybe the 1D approach for each row would help you. The algorithm might look like this:
Loop over each row
Find the index of the first non-zero element
Find the index of the last non-zero element in the row, and store both indexes (giving the span between non-zero columns).
Sort the rows from largest to smallest span between non-zero columns.
At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting point, looking for the maximum non-zero run of column indexes that I could.
You don't specify whether or not the dense matrix has to be square. I'll assume not.
I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.
EDIT: This is NOT the same as the problem below... my bad.
But based on the last comment below, it might be equivalent to the following:
Find the furthest vertically separated pair of zero points that have no zero point between them.
Find the furthest horizontally separated pair of zero points that have no zeros between them ?
Then the horizontal region you're looking for is the rectangle that fits between these two pairs of points?
This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...
The 1-D problem is, effectively: find the largest sum over a contiguous run of a sequence of numbers.
Iterate through the elements, keeping track of a running total starting from some previous element, and the maximum subtotal seen so far (along with the start and end elements that generated it). At each element, if the running subtotal is greater than the maximum seen so far, the maximum and its end element are updated. If the running total goes below zero, the start element is reset to the current element and the running total is reset to zero.
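That 1-D scan, as a short Python function (the names and example are mine):

    def max_subarray(values):
        # Kadane-style scan: best sum of a contiguous run, with its index range.
        best_sum, best_range = values[0], (0, 0)
        running, start = 0, 0
        for i, v in enumerate(values):
            if running <= 0:          # running total went non-positive: restart
                running, start = v, i
            else:
                running += v
            if running > best_sum:
                best_sum, best_range = running, (start, i)
        return best_sum, best_range

    print(max_subarray([2, -5, 3, -1, 4, -2]))   # (6, (2, 4))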
The 2-D problem came from an attempt at a visual image-processing algorithm: within a stream of brightness values representing the pixels of a 2-color image, find the "brightest" rectangular area, i.e. the contiguous 2-D submatrix with the highest sum of brightness values, where "brightness" was measured as the difference between a pixel's brightness value and the overall average brightness of the entire image (so many elements had negative values).
EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.
