Random numbers in the shape of a bell curve [duplicate]

Random numbers in the shape of a bell curve [duplicate] - algorithm

This question already has answers here:
Generate random numbers following a normal distribution in C/C++
(18 answers)
Normal Distributed Random Number in VB.NET
(2 answers)
Closed 7 years ago.
I need to manipulate random to produce a statistical average with the shape of a bell curve, here is my attempt:
Function slopedrand(max As Single, bias As Single) As Integer
Dim count As Single = 0
While count < bias
max = rand.NextDouble() * max
count += rand.NextDouble()
End While
Return rand.NextDouble() * max
End Function
Function bulgedrand(min As Single, max As Single, bulge As Single, bias As Single)
Dim negative = bulge - min
Dim positive = max - bulge
Dim nr = min + slopedrand(negative, bias)
Dim pr = max - slopedrand(positive, bias)
Return rand.NextDouble() * (pr - nr) + nr
End Function
however, given that I suck at math, all it produces is stuff like this: http://i.imgur.com/JDAW6kM.png
which is more like a bulge...
Since it feels like my skull is boiling, maybe somebody here has figured out how to accomplish what I need and will spare me from further attempts of thinking?
(the example is given in vb.net since it was quick to prototype on but any language is welcome)

The "established" way of doing this is to draw a uniformly distributed random number in the range [0, 1), then apply the quantile function of the Bell curve distribution. This then gives you a random number that has the same distribution as the Bell curve.

A bell curve can be approximated by a normal distribution, so I guess you can use random values inside the formula for a normal distribution and get the effect you need, such as in this answer:
drawing random number from a standard normal distribution using the standard rand() functions
Maybe it can help you?

Related

Weighted random number (without predefined values!)

currently I'm needing a function which gives a weighted, random number.
It should chose a random number between two doubles/integers (for example 4 and 8) while the value in the middle (6) will occur on average, about twice as often than the limiter values 4 and 8.
If this were only about integers, I could predefine the values with variables and custom probabilities, but I need the function to give a double with at least 2 digits (meaning thousands of different numbers)!
The environment I use, is the "Game Maker" which provides all sorts of basic random-generators, but not weighted ones.
Could anyone possibly lead my in the right direction how to achieve this?
Thanks in advance!

The sum of two independent continuous uniform(0,1)'s, U1 and U2, has a continuous symmetrical triangle distribution between 0 and 2. The distribution has its peak at 1 and tapers to zero at either end. We can easily translate that to a range of (4,8) via scaling by 2 and adding 4, i.e., 4 + 2*(U1 + U2).
However, you don't want a height of zero at the endpoints, you want half the peak's height. In other words, you want a triangle sitting on a rectangular base (i.e., uniform), with height h at the endpoints and height 2h in the middle. That makes life easy, because the triangle must have a peak of height h above the rectangular base, and a triangle with height h has half the area of a rectangle with the same base and height h. It follows that 2/3 of your probability is in the base, 1/3 is in the triangle.
Combining the elements above leads to the following pseudocode algorithm. If rnd() is a function call that returns continuous uniform(0,1) random numbers:
define makeValue()
if rnd() <= 2/3 # Caution, may want to use 2.0/3.0 for many languages
return 4 + (4 * rnd())
else
return 4 + (2 * (rnd() + rnd()))
I cranked out a million values using that and plotted a histogram:

For the case someone needs this in Game Maker (or a different language ) as an universal function:
if random(1) <= argument0
return argument1 + ((argument2-argument1) * random(1))
else
return argument1 + (((argument2-argument1)/2) * (random(1) + random(1)))
Called as follows (similar to the standard random_range function):
val = weight_random_range(FACTOR, FROM, TO)
"FACTOR" determines how much of the whole probability figure is the "base" for constant probability. E.g. 2/3 for the figure above.
0 will provide a perfect triangle and 1 a rectangle (no weightning).

Intersection of axis-aligned rectangular cuboids (MBR) in one dimension

Currently I'm doing benchmarks on time series indexing algorithms. Since most of the time no reference implementations are available, I have to write my own implementations (all in Java). At the moment I am stuck a little at section 6.2 of a paper called Indexing multi-dimensional time-series with support for multiple distance measures available here in PDF : http://hadjieleftheriou.com/papers/vldbj04-2.pdf
A MBR (minimum bounding rectangle) is basically a rectanglular cubiod with some coordinates and directions. As an example P and Q are two MBRs with P.coord={0,0,0} and P.dir={1,1,3} and Q.coords={0.5,0.5,1} and Q.dir={1,1,1} where the first entries represent the time dimension.
Now I would like to calculate the MINDIST(Q,P) between Q and P :
However I am not sure how to implement the "intersection of two MBRs in the time dimension" (Dim 1) since I am not sure what the intersection in the time dimension actually means. It is also not clear what h_Q, l_Q, l_P, h_P mean, since this notation is not explained (my guess is they mean something like highest or lowest value of a dimension in the intersection).
I would highly appreciate it, if someone could explain to me how to calculate the intersection of two MBRs in the first dimension and maybe enlighten me with an interpretation of the notation. Thanks!

Well, Figure 14 in your paper explains the time intersection. And the rectangles are axis-aligned, thus it makes sense to use high and low on each coordinate.
The multiplication sign you see is not a cross product, just a normal multiplication, because on both sides of it you have a scalar, and not vectors.
However I must agree that the discussions on page 14 are rather fuzzy, but they seem to tell us that both types of intersections (complete and partial), when they are have a t subscript, mean the norm of the intersection along the t coordinate.
Thus it seems you could factorize the time intersection to get a formula that would be :
It is worth noting that, maybe counter-intuitively, when your objects don't intersect on the time plane, their MINDIST is defined to be 0.
Hence the following pseudo-code ;
mindist(P, Q)
{
if( Q.coord[0] + Q.dir[0] < P.coord[0] ||
Q.coord[0] > P.coord[0] + P.dir[0] )
return 0;
time = min(Q.coord[0] + Q.dir[0], P.coord[0] + P.dir[0]) - max(Q.coord[0], P.coord[0]);
sum = 0;
for(d=1; d<D; ++d)
{
if( Q.coord[d] + Q.dir[d] < P.coord[d] )
x = Q.coord[d] + Q.dir[d] - P.coord[d];
else if( P.coord[d] + P.dir[d] < Q.coord[d] )
x = P.coord[d] + P.dir[d] - Q.coord[d];
else
x = 0;
sum += x*x;
}
return sqrt(time * sum);
}
Note the absolute values in the paper are unnecessary since we just checked which values where bigger, and we thus know we only add positive numbers.

Multiliteration implementation with inaccurate distance data

I am trying to create an android smartphone application which uses Apples iBeacon technology to determine the current indoor location of itself. I already managed to get all available beacons and calculate the distance to them via the rssi signal.
Currently I face the problem, that I am not able to find any library or implementation of an algorithm, which calculates the estimated location in 2D by using 3 (or more) distances of fixed points with the condition, that these distances are not accurate (which means, that the three "trilateration-circles" do not intersect in one point).
I would be deeply grateful if anybody can post me a link or an implementation of that in any common programming language (Java, C++, Python, PHP, Javascript or whatever). I already read a lot on stackoverflow about that topic, but could not find any answer I were able to convert in code (only some mathematical approaches with matrices and inverting them, calculating with vectors or stuff like that).
EDIT
I thought about an own approach, which works quite well for me, but is not that efficient and scientific. I iterate over every meter (or like in my example 0.1 meter) of the location grid and calculate the possibility of that location to be the actual position of the handset by comparing the distance of that location to all beacons and the distance I calculate with the received rssi signal.
Code example:
public Location trilaterate(ArrayList<Beacon> beacons, double maxX, double maxY)
{
for (double x = 0; x <= maxX; x += .1)
{
for (double y = 0; y <= maxY; y += .1)
{
double currentLocationProbability = 0;
for (Beacon beacon : beacons)
{
// distance difference between calculated distance to beacon transmitter
// (rssi-calculated distance) and current location:
// |sqrt(dX^2 + dY^2) - distanceToTransmitter|
double distanceDifference = Math
.abs(Math.sqrt(Math.pow(beacon.getLocation().x - x, 2)
+ Math.pow(beacon.getLocation().y - y, 2))
- beacon.getCurrentDistanceToTransmitter());
// weight the distance difference with the beacon calculated rssi-distance. The
// smaller the calculated rssi-distance is, the more the distance difference
// will be weighted (it is assumed, that nearer beacons measure the distance
// more accurate)
distanceDifference /= Math.pow(beacon.getCurrentDistanceToTransmitter(), 0.9);
// sum up all weighted distance differences for every beacon in
// "currentLocationProbability"
currentLocationProbability += distanceDifference;
}
addToLocationMap(currentLocationProbability, x, y);
// the previous line is my approach, I create a Set of Locations with the 5 most probable locations in it to estimate the accuracy of the measurement afterwards. If that is not necessary, a simple variable assignment for the most probable location would do the job also
}
}
Location bestLocation = getLocationSet().first().location;
bestLocation.accuracy = calculateLocationAccuracy();
Log.w("TRILATERATION", "Location " + bestLocation + " best with accuracy "
+ bestLocation.accuracy);
return bestLocation;
}
Of course, the downside of that is, that I have on a 300m² floor 30.000 locations I had to iterate over and measure the distance to every single beacon I got a signal from (if that would be 5, I do 150.000 calculations only for determine a single location). That's a lot - so I will let the question open and hope for some further solutions or a good improvement of this existing solution in order to make it more efficient.
Of course it has not to be a Trilateration approach, like the original title of this question was, it is also good to have an algorithm which includes more than three beacons for the location determination (Multilateration).

If the current approach is fine except for being too slow, then you could speed it up by recursively subdividing the plane. This works sort of like finding nearest neighbors in a kd-tree. Suppose that we are given an axis-aligned box and wish to find the approximate best solution in the box. If the box is small enough, then return the center.
Otherwise, divide the box in half, either by x or by y depending on which side is longer. For both halves, compute a bound on the solution quality as follows. Since the objective function is additive, sum lower bounds for each beacon. The lower bound for a beacon is the distance of the circle to the box, times the scaling factor. Recursively find the best solution in the child with the lower lower bound. Examine the other child only if the best solution in the first child is worse than the other child's lower bound.
Most of the implementation work here is the box-to-circle distance computation. Since the box is axis-aligned, we can use interval arithmetic to determine the precise range of distances from box points to the circle center.
P.S.: Math.hypot is a nice function for computing 2D Euclidean distances.

Instead of taking confidence levels of individual beacons into account, I would instead try to assign an overall confidence level for your result after you make the best guess you can with the available data. I don't think the only available metric (perceived power) is a good indication of accuracy. With poor geometry or a misbehaving beacon, you could be trusting poor data highly. It might make better sense to come up with an overall confidence level based on how well the perceived distance to the beacons line up with the calculated point assuming you trust all beacons equally.
I wrote some Python below that comes up with a best guess based on the provided data in the 3-beacon case by calculating the two points of intersection of circles for the first two beacons and then choosing the point that best matches the third. It's meant to get started on the problem and is not a final solution. If beacons don't intersect, it slightly increases the radius of each up until they do meet or a threshold is met. Likewise, it makes sure the third beacon agrees within a settable threshold. For n-beacons, I would pick 3 or 4 of the strongest signals and use those. There are tons of optimizations that could be done and I think this is a trial-by-fire problem due to the unwieldy nature of beaconing.
import math
beacons = [[0.0,0.0,7.0],[0.0,10.0,7.0],[10.0,5.0,16.0]] # x, y, radius
def point_dist(x1,y1,x2,y2):
x = x2-x1
y = y2-y1
return math.sqrt((x*x)+(y*y))
# determines two points of intersection for two circles [x,y,radius]
# returns None if the circles do not intersect
def circle_intersection(beacon1,beacon2):
r1 = beacon1[2]
r2 = beacon2[2]
dist = point_dist(beacon1[0],beacon1[1],beacon2[0],beacon2[1])
heron_root = (dist+r1+r2)*(-dist+r1+r2)*(dist-r1+r2)*(dist+r1-r2)
if ( heron_root > 0 ):
heron = 0.25*math.sqrt(heron_root)
xbase = (0.5)*(beacon1[0]+beacon2[0]) + (0.5)*(beacon2[0]-beacon1[0])*(r1*r1-r2*r2)/(dist*dist)
xdiff = 2*(beacon2[1]-beacon1[1])*heron/(dist*dist)
ybase = (0.5)*(beacon1[1]+beacon2[1]) + (0.5)*(beacon2[1]-beacon1[1])*(r1*r1-r2*r2)/(dist*dist)
ydiff = 2*(beacon2[0]-beacon1[0])*heron/(dist*dist)
return (xbase+xdiff,ybase-ydiff),(xbase-xdiff,ybase+ydiff)
else:
# no intersection, need to pseudo-increase beacon power and try again
return None
# find the two points of intersection between beacon0 and beacon1
# will use beacon2 to determine the better of the two points
failing = True
power_increases = 0
while failing and power_increases < 10:
res = circle_intersection(beacons[0],beacons[1])
if ( res ):
intersection = res
else:
beacons[0][2] *= 1.001
beacons[1][2] *= 1.001
power_increases += 1
continue
failing = False
# make sure the best fit is within x% (10% of the total distance from the 3rd beacon in this case)
# otherwise the results are too far off
THRESHOLD = 0.1
if failing:
print 'Bad Beacon Data (Beacon0 & Beacon1 don\'t intersection after many "power increases")'
else:
# finding best point between beacon1 and beacon2
dist1 = point_dist(beacons[2][0],beacons[2][1],intersection[0][0],intersection[0][1])
dist2 = point_dist(beacons[2][0],beacons[2][1],intersection[1][0],intersection[1][1])
if ( math.fabs(dist1-beacons[2][2]) < math.fabs(dist2-beacons[2][2]) ):
best_point = intersection[0]
best_dist = dist1
else:
best_point = intersection[1]
best_dist = dist2
best_dist_diff = math.fabs(best_dist-beacons[2][2])
if best_dist_diff < THRESHOLD*best_dist:
print best_point
else:
print 'Bad Beacon Data (Beacon2 distance to best point not within threshold)'
If you want to trust closer beacons more, you may want to calculate the intersection points between the two closest beacons and then use the farther beacon to tie-break. Keep in mind that almost anything you do with "confidence levels" for the individual measurements will be a hack at best. Since you will always be working with very bad data, you will defintiely need to loosen up the power_increases limit and threshold percentage.

You have 3 points : A(xA,yA,zA), B(xB,yB,zB) and C(xC,yC,zC), which respectively are approximately at dA, dB and dC from you goal point G(xG,yG,zG).
Let's say cA, cB and cC are the confidence rate ( 0 < cX <= 1 ) of each point.
Basically, you might take something really close to 1, like {0.95,0.97,0.99}.
If you don't know, try different coefficient depending of distance avg. If distance is really big, you're likely to be not very confident about it.
Here is the way i'll do it :
var sum = (cA*dA) + (cB*dB) + (cC*dC);
dA = cA*dA/sum;
dB = cB*dB/sum;
dC = cC*dC/sum;
xG = (xA*dA) + (xB*dB) + (xC*dC);
yG = (yA*dA) + (yB*dB) + (yC*dC);
xG = (zA*dA) + (zB*dB) + (zC*dC);
Basic, and not really smart but will do the job for some simple tasks.
EDIT
You can take any confidence coef you want in [0,inf[, but IMHO, restraining at [0,1] is a good idea to keep a realistic result.

random unit vector in multi-dimensional space

I'm working on a data mining algorithm where i want to pick a random direction from a particular point in the feature space.
If I pick a random number for each of the n dimensions from [-1,1] and then normalize the vector to a length of 1 will I get an even distribution across all possible directions?
I'm speaking only theoretically here since computer generated random numbers are not actually random.

One simple trick is to select each dimension from a gaussian distribution, then normalize:
from random import gauss
def make_rand_vector(dims):
vec = [gauss(0, 1) for i in range(dims)]
mag = sum(x**2 for x in vec) ** .5
return [x/mag for x in vec]
For example, if you want a 7-dimensional random vector, select 7 random values (from a Gaussian distribution with mean 0 and standard deviation 1). Then, compute the magnitude of the resulting vector using the Pythagorean formula (square each value, add the squares, and take the square root of the result). Finally, divide each value by the magnitude to obtain a normalized random vector.
If your number of dimensions is large then this has the strong benefit of always working immediately, while generating random vectors until you find one which happens to have magnitude less than one will cause your computer to simply hang at more than a dozen dimensions or so, because the probability of any of them qualifying becomes vanishingly small.

You will not get a uniformly distributed ensemble of angles with the algorithm you described. The angles will be biased toward the corners of your n-dimensional hypercube.
This can be fixed by eliminating any points with distance greater than 1 from the origin. Then you're dealing with a spherical rather than a cubical (n-dimensional) volume, and your set of angles should then be uniformly distributed over the sample space.
Pseudocode:
Let n be the number of dimensions, K the desired number of vectors:
vec_count=0
while vec_count < K
generate n uniformly distributed values a[0..n-1] over [-1, 1]
r_squared = sum over i=0,n-1 of a[i]^2
if 0 < r_squared <= 1.0
b[i] = a[i]/sqrt(r_squared) ; normalize to length of 1
add vector b[0..n-1] to output list
vec_count = vec_count + 1
else
reject this sample
end while

There is a boost implementation of the algorithm that samples from normal distributions: random::uniform_on_sphere

I had the exact same question when also developing a ML algorithm.
I got to the same conclusion as Jim Lewis after drawing samples for the 2-d case and plotting the resulting distribution of the angle.
Furthermore, if you try to derive the density distribution for the direction in 2d when you draw at random from [-1,1] for the x- and y-axis ,you will see that:
f_X(x) = 1/(4*cos²(x)) if 0 < x < 45⁰
and
f_X(x) = 1/(4*sin²(x)) if x > 45⁰
where x is the angle, and f_X is the probability density distribution.
I have written about this here:
https://aerodatablog.wordpress.com/2018/01/14/random-hyperplanes/

#define SCL1 (M_SQRT2/2)
#define SCL2 (M_SQRT2*2)
// unitrand in [-1,1].
double u = SCL1 * unitrand();
double v = SCL1 * unitrand();
double w = SCL2 * sqrt(1.0 - u*u - v*v);
double x = w * u;
double y = w * v;
double z = 1.0 - 2.0 * (u*u + v*v);

How to compute frequency of data using FFT?

I want to know the frequency of data. I had a little bit idea that it can be done using FFT, but I am not sure how to do it. Once I passed the entire data to FFT, then it is giving me 2 peaks, but how can I get the frequency?
Thanks a lot in advance.

Here's what you're probably looking for:
When you talk about computing the frequency of a signal, you probably aren't so interested in the component sine waves. This is what the FFT gives you. For example, if you sum sin(2*pi*10x)+sin(2*pi*15x)+sin(2*pi*20x)+sin(2*pi*25x), you probably want to detect the "frequency" as 5 (take a look at the graph of this function). However, the FFT of this signal will detect the magnitude of 0 for the frequency 5.
What you are probably more interested in is the periodicity of the signal. That is, the interval at which the signal becomes most like itself. So most likely what you want is the autocorrelation. Look it up. This will essentially give you a measure of how self-similar the signal is to itself after being shifted over by a certain amount. So if you find a peak in the autocorrelation, that would indicate that the signal matches up well with itself when shifted over that amount. There's a lot of cool math behind it, look it up if you are interested, but if you just want it to work, just do this:
Window the signal, using a smooth window (a cosine will do. The window should be at least twice as large as the largest period you want to detect. 3 times as large will give better results). (see http://zone.ni.com/devzone/cda/tut/p/id/4844 if you are confused).
Take the FFT (however, make sure the FFT size is twice as big as the window, with the second half being padded with zeroes. If the FFT size is only the size of the window, you will effectively be taking the circular autocorrelation, which is not what you want. see https://en.wikipedia.org/wiki/Discrete_Fourier_transform#Circular_convolution_theorem_and_cross-correlation_theorem )
Replace all coefficients of the FFT with their square value (real^2+imag^2). This is effectively taking the autocorrelation.
Take the iFFT
Find the largest peak in the iFFT. This is the strongest periodicity of the waveform. You can actually be a little more clever in which peak you pick, but for most purposes this should be enough. To find the frequency, you just take f=1/T.

Suppose x[n] = cos(2*pi*f0*n/fs) where f0 is the frequency of your sinusoid in Hertz, n=0:N-1, and fs is the sampling rate of x in samples per second.
Let X = fft(x). Both x and X have length N. Suppose X has two peaks at n0 and N-n0.
Then the sinusoid frequency is f0 = fs*n0/N Hertz.
Example: fs = 8000 samples per second, N = 16000 samples. Therefore, x lasts two seconds long.
Suppose X = fft(x) has peaks at 2000 and 14000 (=16000-2000). Therefore, f0 = 8000*2000/16000 = 1000 Hz.

If you have a signal with one frequency (for instance:
y = sin(2 pi f t)
With:
y time signal
f the central frequency
t time
Then you'll get two peaks, one at a frequency corresponding to f, and one at a frequency corresponding to -f.
So, to get to a frequency, can discard the negative frequency part. It is located after the positive frequency part. Furthermore, the first element in the array is a dc-offset, so the frequency is 0. (Beware that this offset is usually much more than 0, so the other frequency components might get dwarved by it.)
In code: (I've written it in python, but it should be equally simple in c#):
import numpy as np
from pylab import *
x = np.random.rand(100) # create 100 random numbers of which we want the fourier transform
x = x - mean(x) # make sure the average is zero, so we don't get a huge DC offset.
dt = 0.1 #[s] 1/the sampling rate
fftx = np.fft.fft(x) # the frequency transformed part
# now discard anything that we do not need..
fftx = fftx[range(int(len(fftx)/2))]
# now create the frequency axis: it runs from 0 to the sampling rate /2
freq_fftx = np.linspace(0,2/dt,len(fftx))
# and plot a power spectrum
plot(freq_fftx,abs(fftx)**2)
show()
Now the frequency is located at the largest peak.

If you are looking at the magnitude results from an FFT of the type most common used, then a strong sinusoidal frequency component of real data will show up in two places, once in the bottom half, plus its complex conjugate mirror image in the top half. Those two peaks both represent the same spectral peak and same frequency (for strictly real data). If the FFT result bin numbers start at 0 (zero), then the frequency of the sinusoidal component represented by the bin in the bottom half of the FFT result is most likely.
Frequency_of_Peak = Data_Sample_Rate * Bin_number_of_Peak / Length_of_FFT ;
Make sure to work out your proper units within the above equation (to get units of cycles per second, per fortnight, per kiloparsec, etc.)
Note that unless the wavelength of the data is an exact integer submultiple of the FFT length, the actual peak will be between bins, thus distributing energy among multiple nearby FFT result bins. So you may have to interpolate to better estimate the frequency peak. Common interpolation methods to find a more precise frequency estimate are 3-point parabolic and Sinc convolution (which is nearly the same as using a zero-padded longer FFT).

Assuming you use a discrete Fourier transform to look at frequencies, then you have to be careful about how to interpret the normalized frequencies back into physical ones (i.e. Hz).
According to the FFTW tutorial on how to calculate the power spectrum of a signal:
#include <rfftw.h>
...
{
fftw_real in[N], out[N], power_spectrum[N/2+1];
rfftw_plan p;
int k;
...
p = rfftw_create_plan(N, FFTW_REAL_TO_COMPLEX, FFTW_ESTIMATE);
...
rfftw_one(p, in, out);
power_spectrum[0] = out[0]*out[0]; /* DC component */
for (k = 1; k < (N+1)/2; ++k) /* (k < N/2 rounded up) */
power_spectrum[k] = out[k]*out[k] + out[N-k]*out[N-k];
if (N % 2 == 0) /* N is even */
power_spectrum[N/2] = out[N/2]*out[N/2]; /* Nyquist freq. */
...
rfftw_destroy_plan(p);
}
Note it handles data lengths that are not even. Note particularly if the data length is given, FFTW will give you a "bin" corresponding to the Nyquist frequency (sample rate divided by 2). Otherwise, you don't get it (i.e. the last bin is just below Nyquist).
A MATLAB example is similar, but they are choosing the length of 1000 (an even number) for the example:
N = length(x);
xdft = fft(x);
xdft = xdft(1:N/2+1);
psdx = (1/(Fs*N)).*abs(xdft).^2;
psdx(2:end-1) = 2*psdx(2:end-1);
freq = 0:Fs/length(x):Fs/2;
In general, it can be implementation (of the DFT) dependent. You should create a test pure sine wave at a known frequency and then make sure the calculation gives the same number.

Frequency = speed/wavelength.
Wavelength is the distance between the two peaks.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio