2D coordinate normalization - algorithm

I need to implement a function which normalizes coordinates. I define normalize as (please suggest a better term if I'm wrong):
Mapping entries of a data set from their natural range to values between 0 and 1.
Now this was easy in one dimension:
static List<float> Normalize(float[] nums)
{
    float max = nums.Max(); // requires using System.Linq
    float min = nums.Min();
    float delta = max - min;
    List<float> li = new List<float>();
    foreach (float i in nums)
    {
        li.Add((i - min) / delta);
    }
    return li;
}
I need a 2D version as well, and that one has to keep the aspect ratio intact. But I'm having some trouble figuring out the math.
Although the code posted is in C#, the answers need not be.
Thanks in advance. :)

I am posting my response as an answer because I do not have enough points to make a comment.
My interpretation of the question: How do we normalize the coordinates of a set of points in 2 dimensional space?
A normalization operation involves a "shift and scale" operation. In the case of 1-dimensional space this is fairly easy and intuitive (as pointed out by @Mizipzor).
normalizedX=(originalX-minX)/(maxX-minX)
In this case we are first shifting the value by a distance of minX and then scaling it by the range, which is given by (maxX-minX). The shift operation ensures that the minimum moves to 0, and the scale operation squashes the distribution so that it has an upper limit of 1.
In the case of 2D, simply dividing by the largest dimension is not enough. Why?
Consider the simplified case with just two points, A = (5000, 8000) and B = (7000, 10000).
The maximum value in any dimension is the Y value of point B, which is 10000. Dividing everything by it gives:
Coordinates of normalized A => 5000/10000, 8000/10000, i.e. (0.5, 0.8)
Coordinates of normalized B => 7000/10000, 10000/10000, i.e. (0.7, 1.0)
The X and Y values are all within 0 and 1. However, the distribution of the normalized values is far from uniform. The minimum value is 0.5; ideally this should be closer to 0.
Preferred approach for normalizing 2d coordinates
To get a more even distribution we should do a "shift" operation around the minimum of all X values and minimum of all Y values. This could be done around the mean of X and mean of Y as well. Considering the above example,
the minimum of all X is 5000
the minimum of all Y is 8000
Step 1 - Shift operation
A=>(5000-5000,8000-8000), i.e (0,0)
B=>(7000-5000,10000-8000), i.e. (2000,2000)
Step 2 - Scale operation
To scale down the values we need some maximum. We can use the length of the diagonal AB, which is sqrt(2000*2000 + 2000*2000) ≈ 2828.4.
A => (0/2828.4, 0/2828.4), i.e. (0, 0)
B => (2000/2828.4, 2000/2828.4), i.e. approximately (0.71, 0.71)
What happens when there are more than 2 points?
The approach remains similar. We find the coordinates of the smallest bounding box which fits all the points.
We find the minimum value of X (MinX) and minimum value of Y (MinY) from all the points and do a shift operation. This changes the origin to the lower left corner of the bounding box.
We find the maximum value of X (MaxX) and maximum value of Y (MaxY) from all the points.
We calculate the length of the diagonal connecting (MinX,MinY) and (MaxX,MaxY) and use this value to do a scale operation.
length of diagonal=sqrt((maxX-minX)*(maxX-minX) + (maxY-minY)*(maxY-minY))
normalized X = (originalX - minX)/(length of diagonal)
normalized Y = (originalY - minY)/(length of diagonal)
How does this logic change if we have more than 2 dimensions?
The concept remains the same.
- We find the minimum value in each of the dimensions (X,Y,Z)
- We find the maximum value in each of the dimensions (X,Y,Z)
- Compute the length of the diagonal as a scaling factor
- Use the minimum values to shift the origin.
length of diagonal=sqrt((maxX-minX)*(maxX-minX)+(maxY-minY)*(maxY-minY)+(maxZ-minZ)*(maxZ-minZ))
normalized X = (originalX - minX)/(length of diagonal)
normalized Y = (originalY - minY)/(length of diagonal)
normalized Z = (originalZ - minZ)/(length of diagonal)
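Here is a minimal Python sketch of the shift-and-scale approach described above (the function name normalize_nd and its structure are my own illustration, not code from the answer). Dividing every axis by the same bounding-box diagonal is what preserves the aspect ratio, and the same function works for 2D, 3D, or any number of dimensions.
import math

def normalize_nd(points):
    # points: a list of tuples, all with the same number of dimensions.
    # Assumes the points are not all identical (so the diagonal is > 0).
    dims = len(points[0])
    mins = [min(p[d] for p in points) for d in range(dims)]
    maxs = [max(p[d] for p in points) for d in range(dims)]
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    return [tuple((c - lo) / diagonal for c, lo in zip(p, mins)) for p in points]

# Example: normalize_nd([(5000, 8000), (7000, 10000)]) returns [(0.0, 0.0), (0.707..., 0.707...)]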

It seems you want each vector (1D, 2D or ND) to have length <= 1.
If that's the only requirement, you can just divide each vector by the length of the longest one.
double max = maximum(|vector| for each vector in 'data');
foreach (Vector v : data) {
    li.add(v / max);
}
That will make the longest vector in the result list have length 1.
But this won't be equivalent to your current code for the 1-dimensional case, as there is no minimum or maximum of a set of points in the plane, and thus no delta.

Simple idea: find out which dimension has the bigger range and normalize by that range; the second dimension is scaled by the same factor, using the ratio. This way the ratio is kept and your values stay between 0 and 1.
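A minimal Python sketch of this idea (my own illustration, not code from the answer): shift by the per-axis minimum, then divide both axes by the larger of the two ranges.
def normalize_keep_aspect(points):
    # points: list of (x, y) tuples; assumes at least two distinct points.
    min_x = min(p[0] for p in points)
    min_y = min(p[1] for p in points)
    range_x = max(p[0] for p in points) - min_x
    range_y = max(p[1] for p in points) - min_y
    scale = max(range_x, range_y)  # one common scale factor keeps the aspect ratio
    return [((x - min_x) / scale, (y - min_y) / scale) for x, y in points]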

Related

Sorting the following coordinates in the given pattern:

I have the following image:
The coordinates corresponding to the white blobs in the image are sorted according to the increasing value of x-coordinate. However, I want them to follow the following pattern:
(In a zig-zag manner from bottom left to top left.)
Any clue how can I go about it? Any clue regarding the algorithm will be appreciated.
The set of coordinates are as follows:
[46.5000000000000,104.500000000000]
[57.5000000000000,164.500000000000]
[59.5000000000000,280.500000000000]
[96.5000000000000,66.5000000000000]
[127.500000000000,103.500000000000]
[142.500000000000,34.5000000000000]
[156.500000000000,173.500000000000]
[168.500000000000,68.5000000000000]
[175.500000000000,12.5000000000000]
[198.500000000000,37.5000000000000]
[206.500000000000,103.500000000000]
[216.500000000000,267.500000000000]
[225.500000000000,14.5000000000000]
[234.500000000000,62.5000000000000]
[251.500000000000,166.500000000000]
[258.500000000000,32.5000000000000]
[271.500000000000,13.5000000000000]
[284.500000000000,103.500000000000]
[291.500000000000,61.5000000000000]
[313.500000000000,32.5000000000000]
[318.500000000000,10.5000000000000]
[320.500000000000,267.500000000000]
[352.500000000000,57.5000000000000]
[359.500000000000,102.500000000000]
[360.500000000000,167.500000000000]
[366.500000000000,11.5000000000000]
[366.500000000000,34.5000000000000]
[408.500000000000,9.50000000000000]
[414.500000000000,62.5000000000000]
[419.500000000000,34.5000000000000]
[451.500000000000,12.5000000000000]
[456.500000000000,97.5000000000000]
[457.500000000000,168.500000000000]
[465.500000000000,62.5000000000000]
[465.500000000000,271.500000000000]
[468.500000000000,31.5000000000000]
[498.500000000000,10.5000000000000]
[522.500000000000,105.500000000000]
[524.500000000000,32.5000000000000]
[533.500000000000,60.5000000000000]
[534.500000000000,11.5000000000000]
[565.500000000000,164.500000000000]
[576.500000000000,33.5000000000000]
[581.500000000000,10.5000000000000]
[582.500000000000,67.5000000000000]
[586.500000000000,267.500000000000]
[590.500000000000,102.500000000000]
[622.500000000000,10.5000000000000]
[630.500000000000,32.5000000000000]
[646.500000000000,58.5000000000000]
[653.500000000000,94.5000000000000]
[669.500000000000,8.50000000000000]
[678.500000000000,167.500000000000]
[680.500000000000,31.5000000000000]
[705.500000000000,57.5000000000000]
[719.500000000000,9.50000000000000]
[729.500000000000,271.500000000000]
[732.500000000000,33.5000000000000]
[733.500000000000,97.5000000000000]
[757.500000000000,11.5000000000000]
[758.500000000000,59.5000000000000]
[778.500000000000,157.500000000000]
[792.500000000000,31.5000000000000]
[802.500000000000,10.5000000000000]
[812.500000000000,94.5000000000000]
[834.500000000000,59.5000000000000]
[839.500000000000,30.5000000000000]
[865.500000000000,160.500000000000]
[866.500000000000,272.500000000000]
[885.500000000000,58.5000000000000]
[892.500000000000,97.5000000000000]
[955.500000000000,94.5000000000000]
[963.500000000000,163.500000000000]
[972.500000000000,265.500000000000]
Building upon uSeemSurprised's answer, I would go for a 3-step approach:
Sort the points list by y-coord. This is O(n log n)
Determine the y-axis ranges. I simply iterate over the points and take note of where the y-coord difference is larger than a threshold value. This is O(n) of course
Sort each of the sublists that represent the y-axis lines by x-coord. If we had m sublists of k items each this would be O(m (k log k)); so the overall process is still O(n log n)
The code:
def zigzag(points, threshold=10.0):
    # step 1: sort by y-coordinate
    points.sort(key=lambda p: p[1])
    # step 2: find the indices where a new row (y-range) starts
    breaks = []
    for i in range(1, len(points)):
        if points[i][1] - points[i-1][1] > threshold:
            breaks.append(i)
    breaks.append(len(points))  # close the final row
    # step 3: sort each row by x, flipping the direction on every other row
    rev = False
    start = 0
    outpoints = []
    for b in breaks:
        outpoints += sorted(points[start:b], reverse=rev)
        start = b
        rev = not rev
    return outpoints
You can sort the points by x-coordinate within bands of y-coordinates, i.e. the coordinates that are sorted according to x all belong to the same y-axis range. Each time you move up to a different y-axis range you flip the sorting order, i.e. increasing then decreasing and so on.
The most similar algorithm I can think of is Andrew's algorithm for convex hulls, specifically the lower hull (though depending on the coordinate system, you may need to use the upper hull instead).
Running the lower hull algorithm and removing the hull points until no points remain would get you what you want. To get the zig-zag patterning, reverse the ordering every other time you run it.
Here are implementations in most languages:
https://en.wikibooks.org/wiki/Algorithm_Implementation/Geometry/Convex_hull/Monotone_chain
Edit: The downside here is precision in the case of fuzzy measurements. You may need to adjust the algorithm a bit if convex hulls aren't exactly what you need, e.g. if you want to count a point as part of the hull when it is within, say, 0.1 units or 1% of lying on the hull. In the example given, the coordinates are exactly on the line, so it would work well, but not so much if the coordinates were randomly perturbed within, say, 0.1 of their actual positions.
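A rough Python sketch of this peeling idea (my own illustration; the function names are not from the linked page), using Andrew's monotone chain for the lower hull. Whether the peeled layers line up with your rows depends on the data and on the coordinate convention (in image coordinates, with y growing downward, you may want the upper hull instead).
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    # points: list of (x, y) tuples; keeps collinear boundary points on the hull.
    pts = sorted(points)
    hull = []
    for p in pts:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) < 0:
            hull.pop()
        hull.append(p)
    return hull

def peel_zigzag(points):
    remaining = list(points)
    ordered, reverse = [], False
    while remaining:
        layer = lower_hull(remaining)
        ordered += list(reversed(layer)) if reverse else layer
        layer_set = set(layer)
        remaining = [p for p in remaining if p not in layer_set]
        reverse = not reverse
    return ordered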
This approach assumes you know how many rows you expect, although I suspect there are programmatic ways you could estimate that.
nbins = 6; % Number of horizontal rows we expect
[bin,binC] = kmedoids(A(:,2),nbins); % Use a clustering approach to group them
bin = binC(bin); % Clusters in random order, fix it so that clusters
[~,~,bin] = unique(bin); % are ordered by central y value
xord = A(:,1) .* (-1).^mod(bin+1,2); % flip/flop for each row making the x-coord +ve or -ve
% so that we can sort in a zig-zag
[~,idx] = sortrows([bin,xord], [1,2]); % Sort by the clusters and the zig-zag
B = A( idx, : ); % Create re-ordered array
Plotting this, it looks like what you want:
figure(99); clf; hold on;
plot( A(:,1), A(:,2), '-o' );
plot( B(:,1), B(:,2), '-', 'linewidth', 1.5 );
set(gca, 'YDir', 'reverse');
legend( {'Original','Reordered'} );
Use a nearest neighbor search, where you define a custom distance measure which makes distance in the Y direction more expensive than distance in the X direction. Then start the algorithm with the bottom left point.
The "normal" Euclidean distance in Cartesian coordinates is calculated by sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
To make the y direction more expensive, use a custom distance formula where you multiply the y result by a constant:
sqrt( (x2 - x1)^2 + k*(y2 - y1)^2 )
where the constant k is larger than 1, but not by much; I would start with 2.
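A minimal Python sketch of this greedy traversal (my own illustration; the answer itself gives no code). It assumes image coordinates, so the "bottom left" point is taken as the one with the largest y and smallest x; how well the result matches the intended zig-zag depends on the data and on k.
import math

def zigzag_by_weighted_nn(points, k=2.0):
    # points: list of (x, y) tuples
    def dist(p, q):
        return math.sqrt((p[0] - q[0]) ** 2 + k * (p[1] - q[1]) ** 2)
    remaining = list(points)
    current = max(remaining, key=lambda p: (p[1], -p[0]))  # bottom-left in image coordinates
    ordered = [current]
    remaining.remove(current)
    while remaining:
        current = min(remaining, key=lambda q: dist(current, q))
        remaining.remove(current)
        ordered.append(current)
    return ordered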

How to find the closest rotation

Consider points Y given in increasing order from [0,T). We are to consider these points as lying on a circle of circumference T. Now consider points X also from [0,T) and also lying on a circle of circumference T.
We say the distance between X and Y is the sum of the absolute distances between each point in X and its closest point in Y, recalling that both are considered to be lying on a circle. Write this distance as Delta(X, Y).
I am trying to find a quick way of determining a rotation of X which makes this distance as small as possible.
My code for making some data to test with is
#!/usr/bin/python
import random
import numpy as np
from bisect import bisect_left

def simul(rate, T):
    time = np.random.exponential(rate)
    times = [0]
    newtime = times[-1] + time
    while newtime < T:
        times.append(newtime)
        newtime = newtime + np.random.exponential(rate)
    return times[1:]
For each point I use this function to find its closest neighbor.
def takeClosest(myList, myNumber, T):
    """
    Assumes myList is sorted. Returns closest value to myNumber in a circle of circumference T.
    If two numbers are equally close, return the smallest number.
    """
    pos = bisect_left(myList, myNumber)
    before = myList[pos - 1]
    after = myList[pos % len(myList)]
    if after - myNumber < myNumber - before:
        return after
    else:
        return before
So the distance between two circles is:
def circle_dist(timesY, timesX):
    dist = 0
    for t in timesX:
        closest_number = takeClosest(timesY, t, T)
        dist += np.abs(closest_number - t)
    return dist
So to make some data we just do
#First make some data
T = 5000
timesX = simul(1, T)
timesY = simul(10, T)
Finally to rotate circle timesX by offset we can
timesX = [(t + offset)%T for t in timesX]
In practice my timesX and timesY will have about 20,000 points each.
Given timesX and timesY, how can I quickly find (approximately) which rotation of timesX gives
the smallest distance to timesY?
Distance along the circle between a single point and a set of points is a piecewise linear function of rotation. The critical points of this function are the points of the set itself (zero distance) and points midway between neighbouring points of the set (local maximums of distance). Linear coefficients of such function are ±1.
Sum of such functions is again piecewise linear, but now with a quadratic number of critical points. Actually all these functions are the same, except shifted along the argument axis. Linear coefficients of the sum are integers.
To find its minimum one would have to calculate its value in all critical points.
I don't see a way to significantly reduce the amount of work needed, but 1,600,000,000 points is not such a big deal anyway, especially if you can spread the work between several processors.
To calculate sum of two such functions, represent the summands as sequences of critical points and associated coefficients to the left and to the right of each critical point. Then just merge the two point sequences while adding the coefficients.
You can solve your (original) problem with a sweep line algorithm. The trick is to use the right "discretization". Imagine cutting your circle up into two strips:
X: x....x....x..........x................x.........x...x
Y: .....x..........x.....x..x.x...........x.............
Now calculate the score = 5+0+1+1+5+9+6.
The key observation is that if we rotate X very slightly (to the right, say), some of the points will improve and some will get worse. We can call this the "differential". In the above example the differential would be 1 - 1 - 1 + 1 + 1 - 1 + 1, because the first point is matched to something on its right, the second point is matched to something under it or to its left, etc.
Of course, as we move X more, the differential will change. However, only as many times as the matchings change, which is never more than |X||Y| but probably much less.
The proposed algorithm is thus to calculate the initial score and the time (X position) of the next change in differential. Go to that next position and calculate the score again. Continue until you reach your starting position.
This is probably a good example for the iterative closest point (ICP) algorithm:
It repeatedly matches each point with its closest neighbor and moves all points such that the mean squared distance is minimized. (Note that this corresponds to minimizing the sum of squared distances.)
import pylab as pl

T = 10.0
X = pl.array([3, 5.5, 6])
Y = pl.array([1, 1.5, 2, 4])

pl.clf()
pl.subplot(1, 2, 1, polar=True)
pl.plot(X / T * 2 * pl.pi, pl.ones(X.shape), 'r.', ms=10, mew=3)
pl.plot(Y / T * 2 * pl.pi, pl.ones(Y.shape), 'b+', ms=10, mew=3)

circDist = lambda X, Y: (Y - X + T / 2) % T - T / 2

while True:
    D = circDist(pl.reshape(X, (-1, 1)), pl.reshape(Y, (1, -1)))
    closestY = pl.argmin(D**2, axis=1)
    distance = circDist(X, Y[closestY])
    shift = pl.mean(distance)
    if pl.absolute(shift) < 1e-3:
        break
    X = (X + shift) % T

pl.subplot(1, 2, 2, polar=True)
pl.plot(X / T * 2 * pl.pi, pl.ones(X.shape), 'r.', ms=10, mew=3)
pl.plot(Y / T * 2 * pl.pi, pl.ones(Y.shape), 'b+', ms=10, mew=3)
Important properties of the proposed solution are:
The ICP is an iterative algorithm. Thus it depends on an initial approximate solution. Furthermore, it won't always converge to the global optimum. This mainly depends on your data and the initial solution. If in doubt, try evaluating the ICP with different starting configurations and choose the most frequent result.
The current implementation performs a directed match: It looks for the closest point in Y relative to each point in X. It might yield different matches when swapping X and Y.
Computing all pair-wise distances between points in X and points in Y might be intractable for large point clouds (like 20,000 points, as you indicated). Therefore, the line D = circDist(...) might get replaced by a more efficient approach, e.g. not evaluating all possible pairs.
All points contribute to the final rotation. If there are any outliers, they might distort the shift significantly. This can be overcome with a robust average like the median or simply by excluding points with large distance.

What does it mean to get the mean squared error (MSE) for 2 images?

The MSE is the average of the channel error squared.
What does that mean in comparing two same size images?
For two pictures A, B you take the square of the difference between every pixel in A and the corresponding pixel in B, sum that up and divide it by the number of pixels.
Pseudo code:
sum = 0.0
for(x = 0; x < width; ++x){
    for(y = 0; y < height; ++y){
        difference = (A[x,y] - B[x,y])
        sum = sum + difference*difference
    }
}
mse = sum / (width*height)
printf("The mean square error is %f\n", mse)
Conceptually, it would be:
1) Start with the red channel
2) Compute the difference between each pixel's gray level value in the two images' red channels, pixel by pixel (redA(0,0)-redB(0,0), etc., for all pixel locations)
3) Square the difference for every one of those pixels: (redA(0,0)-redB(0,0))^2
4) Compute the sum of the squared differences for all pixels in the red channel
5) Repeat the above for the green and blue channels
6) Add the 3 sums together and divide by 3, i.e., (redsum+greensum+bluesum)/3
7) Divide by the area of the image (Width*Height) to form the mean or average, i.e., (redsum+greensum+bluesum)/(3*Width*Height) = MSE
Note that the E in error is synonymous with difference. So it could be called the Mean Squared Difference. Also mean is the same as average. So it could also be called the Average Squared Difference.
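As a small illustration of the steps above, here is a minimal NumPy sketch (the function name mse_rgb is my own); taking np.mean over the whole difference array performs exactly the division by 3*Width*Height.
import numpy as np

def mse_rgb(img_a, img_b):
    # img_a, img_b: arrays of shape (height, width, 3), assumed to be the same size.
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    return np.mean(diff ** 2)  # sum of squared differences / (3 * width * height)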
You can have a look at the following article: http://en.wikipedia.org/wiki/Mean_squared_error#Definition_and_basic_properties. There "Yi" represents the true values and "hat_Yi" represents the values with which we want to compare the true values.
So, in your case you can consider one image as the reference image and the second image as the image whose pixel values you would like to compare with the first one, and you do so by calculating the MSE, which tells you "how different/similar is the second image to the first one".
Check out Wikipedia for MSE; it's a measure of the difference between each pixel value. Here's a sample implementation:
import numpy as np

def MSE(img1, img2):
    squared_diff = (img1 - img2) ** 2
    summed = np.sum(squared_diff)
    num_pix = img1.shape[0] * img1.shape[1]  # img1 and img2 should have the same shape
    err = summed / num_pix
    return err
Let us assume you have two points in a 2-dimensional space, A(x1,y1) and B(x2,y2); the distance between the two points is calculated as sqrt((x1-x2)^2+(y1-y2)^2). If the two points are in 3-dimensional space, it can be calculated as sqrt((x1-x2)^2+(y1-y2)^2+(z1-z2)^2). For two points in n-dimensional space, the formula extends to the square root of the sum, across dimensions, of (value of A in that dimension - value of B in that dimension)^2.
Now, an image with n pixels can be viewed as a point in n-dimensional space. The distance between two images with n pixels can be thought of as the distance between two points in n-dimensional space. The MSE is the square of this distance divided by the number of pixels n.

How to filter a set of 2D points moving in a certain way

I have a list of points moving in two dimensions (x- and y-axis) represented as rows in an array. I might have N points - i.e., N rows:
1 t1 x1 y1
2 t2 x2 y2
.
.
.
N tN xN yN
where ti, xi, and yi are the time index, x-coordinate, and y-coordinate for point i. The time index ti is an integer from 1 to T. The number of points at each such possible time index can vary from 0 to N (still with only N points in total).
My goal is to filter out all the points that do not move in a certain way, or to keep only those that do. A point must move in a parabolic trajectory, with decreasing x- and y-coordinates (i.e., moving to the left and downwards only). Points with other dynamic behaviour must be removed.
Can I use a simple sorting mechanism on this array and then analyse the order of the time index? I have also considered the fact that points sharing the same time index ti are physically distinct points, and so should be paired up with points at other time indices. The complexity of the problem grew, and now I turn to you.
NOTE: You can assume that the points are confined to a sub-region of the (x,y)-plane between two parabolic curves. These curves intersect at only one point: a point close to the origin of motion for any point.
More Information:
I have made some datafiles available:
MATLAB datafile (1.17 kB)
same data as CSV with semicolon as column separator (2.77 kB)
Necessary context:
The datafile holds one uint32 array with 176 rows and 5 columns. The columns are:
pixel x-coordinate in 175-by-175 lattice
pixel y-coordinate in 175-by-175 lattice
discrete theta angle-index
time index (from 1 to T = 10)
row index for this original sorting
The points "live" in a 175-by-175 pixel-lattice - and again inside the upper quadrant of a circle with radius 175. The points travel on the circle circumference in a counterclockwise rotation to a certain angle theta with horizontal, where they are thrown off into something close to a parabolic orbit. Column 3 holds a discrete index into a list with indices 1 to 45 from 0 to 90 degress (one index thus spans 2 degrees). The theta-angle was originally deduces solely from the points by setting up the trivial equations of motions and solving for the angle. This gives rise to a quasi-symmetric quartic which can be solved in close-form. The actual metric radius of the circle is 0.2 m and the pixel coordinate were converted from pixel-coordinate to metric using simple linear interpolation (but what we see here are the points in original pixel-space).
My problem is that some points are not behaving properly and since I need to statistics on the theta angle, I need to remove the points that certainly do NOT move in a parabolic trajoctory. These error are expected and fully natural, but still need to be filtered out.
MATLAB plot code:
% load data and setup variables:
load mat_points.mat;
num_r = 175;
num_T = 10;
num_gridN = 20;
% begin plotting:
figure(1000);
clf;
plot( ...
num_r * cos(0:0.1:pi/2), ...
num_r * sin(0:0.1:pi/2), ...
'Color', 'k', ...
'LineWidth', 2 ...
);
axis equal;
xlim([0 num_r]);
ylim([0 num_r]);
hold all;
% setup grid (yea... went crazy with one):
vec_tickValues = linspace(0, num_r, num_gridN);
cell_tickLabels = repmat({''}, size(vec_tickValues));
cell_tickLabels{1} = sprintf('%u', vec_tickValues(1));
cell_tickLabels{end} = sprintf('%u', vec_tickValues(end));
set(gca, 'XTick', vec_tickValues);
set(gca, 'XTickLabel', cell_tickLabels);
set(gca, 'YTick', vec_tickValues);
set(gca, 'YTickLabel', cell_tickLabels);
set(gca, 'GridLineStyle', '-');
grid on;
% plot points per timeindex (with increasing brightness):
vec_grayIndex = linspace(0,0.9,num_T);
for num_kt = 1:num_T
    vec_xCoords = mat_points((mat_points(:,4) == num_kt), 1);
    vec_yCoords = mat_points((mat_points(:,4) == num_kt), 2);
    plot(vec_xCoords, vec_yCoords, 'o', ...
        'MarkerEdgeColor', 'k', ...
        'MarkerFaceColor', vec_grayIndex(num_kt) * ones(1,3) ...
    );
end
Thanks :)
Why, it looks almost as if you're simulating a radar tracking debris from the collision of two missiles...
Anyway, let's coin a new term: object. Objects are moving along parabolae and at certain times they may emit flashes that appear as points. There are also other points which we are trying to filter out.
We will need some more information:
Can we assume that the objects obey the physics of things falling under gravity?
Must every object emit a point at every timestep during its lifetime?
Speaking of lifetime, do all objects begin at the same time? Can some expire before others?
How precise is the data? Is it exact? Is there a measure of error? To put it another way, do we understand how poorly the points from an object might fit a perfect parabola?
Sort the data with (index, time) as keys and, for all locations of a point i, see if they follow a parabolic trajectory?
Which part are you having trouble with? Sorting should be very easy. IMHO, it is the second part (testing whether a set of points follows a parabolic trajectory) that is difficult.
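A rough Python sketch of that second part (entirely my own illustration; it assumes the samples have already been grouped into one track per physical point, which is itself non-trivial here). It checks that x and y only decrease, as stated in the question, and that a quadratic fit of y against x has small residuals; the tolerance is an assumed parameter.
import numpy as np

def looks_parabolic(track, tol=2.0):
    # track: list of (t, x, y) samples for one candidate point, any order; tol is in pixels.
    track = sorted(track)
    x = np.array([r[1] for r in track], dtype=float)
    y = np.array([r[2] for r in track], dtype=float)
    if len(track) < 4:
        return False                      # too few samples to judge
    if np.any(np.diff(x) >= 0) or np.any(np.diff(y) >= 0):
        return False                      # must move left and down only
    coeffs = np.polyfit(x, y, 2)          # fit y as a quadratic function of x
    residuals = y - np.polyval(coeffs, x)
    return float(np.max(np.abs(residuals))) <= tol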

random unit vector in multi-dimensional space

I'm working on a data mining algorithm where I want to pick a random direction from a particular point in the feature space.
If I pick a random number for each of the n dimensions from [-1,1] and then normalize the vector to a length of 1 will I get an even distribution across all possible directions?
I'm speaking only theoretically here since computer generated random numbers are not actually random.
One simple trick is to select each dimension from a gaussian distribution, then normalize:
from random import gauss

def make_rand_vector(dims):
    vec = [gauss(0, 1) for i in range(dims)]
    mag = sum(x**2 for x in vec) ** .5
    return [x/mag for x in vec]
For example, if you want a 7-dimensional random vector, select 7 random values (from a Gaussian distribution with mean 0 and standard deviation 1). Then, compute the magnitude of the resulting vector using the Pythagorean formula (square each value, add the squares, and take the square root of the result). Finally, divide each value by the magnitude to obtain a normalized random vector.
If your number of dimensions is large then this has the strong benefit of always working immediately, while generating random vectors until you find one which happens to have magnitude less than one will cause your computer to simply hang at more than a dozen dimensions or so, because the probability of any of them qualifying becomes vanishingly small.
You will not get a uniformly distributed ensemble of angles with the algorithm you described. The angles will be biased toward the corners of your n-dimensional hypercube.
This can be fixed by eliminating any points with distance greater than 1 from the origin. Then you're dealing with a spherical rather than a cubical (n-dimensional) volume, and your set of angles should then be uniformly distributed over the sample space.
Pseudocode:
Let n be the number of dimensions, K the desired number of vectors:
vec_count = 0
while vec_count < K
    generate n uniformly distributed values a[0..n-1] over [-1, 1]
    r_squared = sum over i=0..n-1 of a[i]^2
    if 0 < r_squared <= 1.0
        b[i] = a[i]/sqrt(r_squared)   ; normalize to length of 1
        add vector b[0..n-1] to output list
        vec_count = vec_count + 1
    else
        reject this sample
end while
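For reference, a minimal Python version of this rejection scheme (my own sketch, not code from the answer); note that it becomes impractically slow as the number of dimensions grows, as discussed above.
import math
import random

def rejection_unit_vectors(n, K):
    out = []
    while len(out) < K:
        a = [random.uniform(-1.0, 1.0) for _ in range(n)]
        r_squared = sum(x * x for x in a)
        if 0 < r_squared <= 1.0:          # keep only samples inside the unit ball
            r = math.sqrt(r_squared)
            out.append([x / r for x in a])
        # otherwise reject the sample and draw again
    return out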
There is a boost implementation of the algorithm that samples from normal distributions: random::uniform_on_sphere
I had the exact same question when also developing a ML algorithm.
I got to the same conclusion as Jim Lewis after drawing samples for the 2-d case and plotting the resulting distribution of the angle.
Furthermore, if you try to derive the density distribution for the direction in 2D when you draw at random from [-1,1] for the x- and y-axis, you will see that:
f_X(x) = 1/(4*cos²(x)) if 0 < x < 45°
and
f_X(x) = 1/(4*sin²(x)) if x > 45°
where x is the angle and f_X is the probability density distribution.
I have written about this here:
https://aerodatablog.wordpress.com/2018/01/14/random-hyperplanes/
#define SCL1 (M_SQRT2/2)
#define SCL2 (M_SQRT2*2)
// unitrand in [-1,1].
double u = SCL1 * unitrand();
double v = SCL1 * unitrand();
double w = SCL2 * sqrt(1.0 - u*u - v*v);
double x = w * u;
double y = w * v;
double z = 1.0 - 2.0 * (u*u + v*v);
