measure uniform distribution of discrete values - algorithm

I have an array of n integer values x[] that range from low to high. There are therefore m := high-low+1 possible values.
I'm now searching for an algorithm that calculates how uniformly the input values are distributed over the interval [low, high].
It should output e.g. 1 if the values are distributed as uniformly as possible and 0 if all x[i] are the same.
The problem is that the algorithm has to work with n being much lower than m and also much higher than m.
Thank you

You can compute the Kolmogorov-Smirnov statistic, which is the maximum absolute deviation of the empirical cumulative distribution function from the test CDF, which in this case is a straight line (since the test pmf is a uniform distribution).
Or you can compute the discrepancy of the data.
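For instance, a minimal sketch of the Kolmogorov-Smirnov check in Python/NumPy (the function name and signature are my own; 1 minus this statistic would point the score the way the question asks, roughly 1 for uniform data and near 0 when all values coincide):

import numpy as np

def ks_uniformity(x, low, high):
    # Kolmogorov-Smirnov statistic of x against the discrete uniform
    # distribution on [low, high]: 0 for a perfect fit, larger values
    # for less uniform data.
    x = np.sort(np.asarray(x))
    m = high - low + 1
    values = np.arange(low, high + 1)
    ecdf = np.searchsorted(x, values, side="right") / len(x)
    ucdf = (values - low + 1) / m          # straight-line test CDF
    return np.max(np.abs(ecdf - ucdf))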

I found a solution that works for my case:
First I calculate a cumulative histogram of the values
(a discrete function that maps every possible value v in [low, high] to |{i : x[i] <= v}|).
Then I compute the squared distance to the diagonal line through the histogram (from (0,0) to (m,n)): I sum up the squared distances of every point in the histogram to that line.
This algorithm does not provide a normalized measure, but it works well with very few and with very many samples. I only need to compare two or more sets of values by their uniformity, and this algorithm does that for me.
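For reference, a sketch of that procedure in Python/NumPy as I read it (I use the perpendicular distance to the diagonal; for fixed n and m the vertical distance differs only by a constant factor, so it ranks sets the same way):

import numpy as np

def diagonal_misfit(x, low, high):
    # Sum of squared perpendicular distances from the cumulative
    # histogram to the diagonal from (0, 0) to (m, n). Lower values
    # mean more uniform; the result is not normalized.
    x = np.sort(np.asarray(x))
    n = len(x)
    m = high - low + 1
    j = np.arange(1, m + 1)                          # histogram positions 1..m
    values = np.arange(low, high + 1)
    cum = np.searchsorted(x, values, side="right")   # |{i : x[i] <= v}|
    d = (n * j - m * cum) / np.hypot(n, m)           # distance to line n*j - m*H = 0
    return np.sum(d ** 2)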

Related

A sudoku problem: Efficiently find or approximate probability distribution over chosen numbers at each index of an array with no repeats

I'm looking for an efficient algorithm to generate or iteratively approximate a solution to the problem described below.
You are given an array of length N and a finite set of numbers Si for each index i of the array. Now, if we are to place a number from Si at each index i to fill the entire array, while ensuring that each number is unique across the entire array: given all the possible arrays, what is the probability distribution over the numbers at each index?
Here I give an example:
Assuming we have the following array of length 3, with the candidate set Si listed for each index i:
S0 = {1, 4}
S1 = {1, 2, 4}
S2 = {1, 2, 4}
We will have the following possible arrays:
421
412
124
142
And the following probability distribution over the numbers 1, 2, 4 (rows) at each index (columns):
1: 0.50  0.25  0.25
2: 0.00  0.50  0.50
4: 0.50  0.25  0.25
Brute-forcing this problem is obviously doable, but I have a gut feeling that there must be a more efficient algorithm for this.
The reason I think so is that one can derive the probability distribution from the set of all possibilities but not the other way around, so the distribution itself must contain less information than the set of all possibilities. Therefore, I believe we do not need to generate all possibilities just to obtain the probability distribution.
Hence, I am wondering if there is any smart matrix operation we could use for this problem, or even fixed-point iteration/density evolution to approximate the final probability distribution. Some other potentially more efficient approaches to this problem are also appreciated.
(P.S. The reason I am interested in this problem is that I wanted to generate probability distributions over the candidate numbers for the empty cells in Sudoku and other Sudoku-like games without a unique answer, by only applying all the standard rules.)
Sudoku is a combinatorial problem. It is easy to show that the probability of any independent cell is uniform (because you can relabel a configuration to put any number at a given position). The joint probabilities are more complicated.
If the game is partially filled you have constraints that will affect this distribution.
You must devise an algorithm to calculate the number of solutions from a given initial configuration. Then you compute the fraction of the total solutions that have a specific value at the position of interest.
counts = {}
for i in range(1, 10):
    board[cell] = i
    counts[i] = countSolutions(board)

total = sum(counts.values())
prob = {i: counts[i] / total for i in counts}
The same approach works for joint probabilities but in some cases the number of possibilities may be too high.
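countSolutions is left abstract above. For a 9x9 board stored as a list of lists with 0 marking an empty cell, a minimal, unoptimized backtracking counter could look like this (a sketch only; hard boards will want constraint propagation on top):

def count_solutions(board):
    # board: 9x9 list of lists, 0 marks an empty cell.
    def valid(r, c, v):
        if v in board[r]:                            # row check
            return False
        if any(board[i][c] == v for i in range(9)):  # column check
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)          # 3x3 box check
        return all(board[br + i][bc + j] != v
                   for i in range(3) for j in range(3))

    for r in range(9):
        for c in range(9):
            if board[r][c] == 0:
                total = 0
                for v in range(1, 10):
                    if valid(r, c, v):
                        board[r][c] = v
                        total += count_solutions(board)
                        board[r][c] = 0               # backtrack
                return total
    return 1  # no empty cells: the board itself is one solution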

Compare two lists of distances (floats) in python

I'm looking to compare two lists of distances (floats) in python. The distances represent how far away my robot is from a wall at different angles. One array is my "best guess" distance array and the other is the array of actual distances. I need to return a number between [0, 1] that represents the similarity between these two lists of floats. The distances match up 1 to 1. That is, the distance at index 0 should be compared to the distance at index 0 in the other array. Right now, for each index, I am dividing the smaller number by the larger number to get a percentage difference. Then I am taking the average of these percentage differences (total percentage difference / number of entries in the array) to get a number between 0 and 1. However, my approach does not seem to be accurate enough. Is there a better algorithm for comparing two ordered lists of floats?
It looks like you need a normalized Euclidean distance between the two vectors.
It is simple to calculate, and you can read more about it here.
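For instance, one way to turn the Euclidean distance into a [0, 1] similarity score (this particular normalization is my own choice, assuming the distances are nonnegative; other scalings are equally valid):

import math

def similarity(guess, actual):
    # Euclidean distance between the two lists, mapped into [0, 1]
    # (1 = identical). The scale below is the largest distance
    # achievable for these magnitudes, assuming nonnegative entries,
    # which keeps the score inside [0, 1].
    dist = math.sqrt(sum((g - a) ** 2 for g, a in zip(guess, actual)))
    scale = math.sqrt(sum(max(abs(g), abs(a)) ** 2
                          for g, a in zip(guess, actual)))
    return 1.0 - dist / scale if scale > 0 else 1.0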

discrete fourier transform in Matlab - theoretical confusion

I have a periodic term
v(x) = sum over K of [exp(iKx) V(K) ]
where K = 2*pi*n/a, a is the periodicity of the term, and n = 0, 1, 2, 3, ...
Now I want to find the Fourier coefficient V(K) corresponding to a particular K. Suppose I have a vector for v(x) having 10000 points for
x = 0, 0.01a, 0.02a, ..., a, 1.01a, ..., 2a, ..., 100a
such that the size of my lattice is 100a. FFT on this vector gives 10000 Fourier coefficients. The K values corresponding to these Fourier coefficients are 2*pi*n/(10000*0.01a) with n = 0, 1, 2, 3, ..., 9999.
But my K had the form 2*pi*n/a due to the periodicity of the lattice. What am I missing?
Your function is probably not complex-valued, so you will need negative frequencies in the complex Fourier series expression. For the FFT this does not matter, since the negative frequencies are aliased to the higher positive frequencies, but in the expression as a continuous function this could give strange results.
That means that the range of n is from -N/2 to N/2-1 if N is the size of the sampling.
Note that the points you have given are 10001 in number if you start at 0 with 0.01a steps and end at 100a. So the last point for N = 10000 points should be 100a - 0.01a = 99.99a.
Your sampling frequency is the reciprocal of the sampling step, Fs = 1/(0.01a). The angular frequencies of the FFT are then 2*pi*n*Fs/N = 2*pi*n/(10000*0.01a) = 2*pi*n/(100a); every 100th of them corresponds to one of your K.
This is not astonishing since the sampling is over 100 periods of the function, the longer period results in a much lower basic frequency. If the signal v(x) is truly periodic, all amplitudes except the ones for n divisible by 100 will be zero. If the signal is not exactly periodic due to noise and measurement errors, the peaks will leak out into neighboring frequencies. For a correct result for the original task you will have to integrate the amplitudes over the peaks.
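To see the bin spacing concretely, here is a small check in Python/NumPy rather than Matlab (the indexing story is the same, but remember Matlab arrays are 1-based): a signal with components at K = 2*pi/a and 4*pi/a, sampled as in the question, puts all its energy in bins with n divisible by 100.

import numpy as np

a = 1.0                      # lattice period
N = 10000                    # samples, step 0.01a, 100 periods total
x = np.arange(N) * 0.01 * a
# components at K = 2*pi/a and K = 4*pi/a
v = np.cos(2 * np.pi * x / a) + 0.3 * np.cos(4 * np.pi * x / a)

V = np.fft.fft(v) / N
n = np.arange(N)
peaks = n[np.abs(V) > 1e-6]
print(peaks)  # [100 200 9800 9900]: n = 100, 200 plus their aliased negatives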

generating sorted random numbers without exponentiation involved?

I am looking for a math equation or algorithm which can generate uniform random numbers in ascending order in the range [0,1] without the help of the division operator. I am keen on skipping the division operation because I am implementing it in hardware. Thank you.
Generating the numbers in ascending (or descending) order means generating them sequentially but with the right distribution. That, in turn, means we need to know the distribution of the minimum of a set of size N, and then at each stage we need to use conditioning to determine the next value based on what we've already seen. Mathematically these are both straightforward except for the issue of avoiding division.
You can generate the minimum of N uniform(0,1)'s from a single uniform(0,1) random number U using the algorithm min = 1 - U**(1/N), where ** denotes exponentiation. In other words, the complement of the Nth root of a uniform has the same distribution as the minimum of N uniforms over the range [0,1], which can then be scaled to any other interval length you like.
The conditioning aspect basically says that the k values already generated will have eaten up some portion of the original interval, and that what we now want is the minimum of N-k values, scaled to the remaining range.
Combining the two pieces yields the following logic. Generate the smallest of the N uniforms, scale it by the remaining interval length (1 the first time), and make that result the last value we have generated. Then generate the smallest of N-1 uniforms, scale it by the remaining interval length, and add it to the last one to give you your next value. Lather, rinse, repeat, until you have done them all. The following Ruby implementation gives distributionally correct results, assuming you have read in or specified N prior to this:
last_u = 0.0
N.downto(1) do |i|
  p last_u += (1.0 - last_u) * (1.0 - (rand ** (1.0 / i)))
end
but we have that pesky i-th root, which uses division. However, if we know N ahead of time, we can pre-calculate the inverses of the integers from 1 to N offline and store them in a table:
last_u = 0.0
N.downto(1) do |i|
  p last_u += (1.0 - last_u) * (1.0 - (rand ** inverse[i]))
end
I don't know of any way to get the correct distributional behavior sequentially without using exponentiation. If that's a show-stopper, you're going to have to give up on either the sequential nature of the process or the uniformity requirement.
You can try so-called "stratified sampling", which means you divide the range into bins and then sample randomly from the bins. A sample thus generated is more uniform (less clumping) than a sample generated from the entire interval. For this reason, stratified sampling reduces the variance of Monte Carlo estimates (I don't suppose that's important to you, but that's why the method was invented, as a variance-reduction method).
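A minimal sketch of stratified sampling in Python (note that it happens to emit the values already in ascending order, and the only division is a single precomputed 1/n, in the spirit of the inverse table in the other answer):

import random

def stratified_sorted(n):
    # One uniform sample per bin [i/n, (i+1)/n): the output is in
    # ascending order by construction. Note this is NOT distributed
    # like the sorted order statistics of n i.i.d. uniforms, only
    # "more even".
    width = 1.0 / n  # the single division, precomputable offline
    return [(i + random.random()) * width for i in range(n)]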
It is an interesting problem to generate numbers in order, but my guess is that to get a uniform distribution over the entire interval, you will have to apply some formulas which require more computation. If you want to minimize computation time, I suspect you cannot do better than generating a sample and then sorting it.

Does adding random numbers make them more random?

This is a purely theoretical question.
We all know that most, if not all, random-number generators actually only generate pseudo-random numbers.
Let's say I want a random number from 10 to 20. I can do this as follows (myRandomNumber being an integer-type variable):
myRandomNumber = rand(10, 20);
However, if I execute this statement:
myRandomNumber = rand(5, 10) + rand(5, 10);
Is this method more random?
No.
The randomness is not cumulative. The rand() function uses a uniform distribution between your two defined endpoints.
Adding two uniform distributions does not preserve uniformity. The sum has a triangular, pyramid-like distribution, with most of the probability mass toward the center. This is because the density of the sum is the convolution of the two individual densities.
I urge you to read this:
Uniform Distribution
and this:
Convolution
Pay special attention to what happens with the two uniform distributions on the top right of the screen.
You can prove this to yourself by writing all the sums to a file and then plotting them in Excel. Make sure you give yourself a large enough sample size; 25000 should be sufficient.
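If you'd rather skip Excel, the same experiment takes a few lines of Python; the text histogram below peaks visibly at the center (15):

import random
from collections import Counter

# Tally rand(5,10) + rand(5,10) over 25000 trials and print a crude
# text histogram: the counts peak at 15, not a flat line.
counts = Counter(random.randint(5, 10) + random.randint(5, 10)
                 for _ in range(25000))
for s in sorted(counts):
    print(s, '#' * (counts[s] // 100))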
The best way to understand this is by considering the popular fair ground game "Lucky Seven".
If we roll a six sided die, we know that the probability of obtaining any of the six numbers is the same - 1/6.
What if we roll two dice and add the numbers that appear on the two?
The sum can range from 2 (both dice show 'one') up to 12 (both dice show 'six').
The probabilities of obtaining different numbers from 2 to 12 are no longer uniform. The probability of obtaining a 'seven' is the highest. There can be a 1+6, a 6+1, a 2+5, a 5+2, a 3+4 and a 4+3. Six ways of obtaining a 'seven' out of 36 possibilities.
If we plot the distribution we get a pyramid. The probabilities would be 1,2,3,4,5,6,5,4,3,2,1 (of course each of these has to be divided by 36).
The pyramidal figure (and the probability distribution) of the sum can be obtained by 'convolution'.
If we know the 'expected value' and standard deviation ('sigma') of the two random numbers, we can perform a quick and ready calculation for the expected value and sigma of their sum.
The expected value is simply the addition of the two individual expected values.
The sigma is obtained by applying the "Pythagorean theorem" to the two individual sigmas (the square root of the sum of the squares of the sigmas); this holds when the two random numbers are independent.
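Worked out for the two dice above: each die has expected value 3.5 and sigma = sqrt(35/12) ≈ 1.71, so the sum has expected value 3.5 + 3.5 = 7 and sigma = sqrt(1.71^2 + 1.71^2) = sqrt(35/6) ≈ 2.42, consistent with the peak at 'seven' in the distribution derived earlier.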
