'Classifying with k-Nearest Neighbors' for not-number parameters - algorithm

I have a data set of facts, where each record has a set of parameters and a value that corresponds to those parameters.
For example:
Street   Color   Shape    Value
--------------------------------
Versky   Blue    Ball     10
Soll     Green   Square   5
...
Now I need to create a function that takes a set of parameters [Holl, Red, Circle] and returns the predicted 'Value'.
If my parameters were numbers, I could use the 'Classifying with k-Nearest Neighbors' algorithm, but they are not.
Which machine-learning algorithm can I use to solve this task?

Note that nearest neighbor search finds the nearest neighbor according to some distance metric. Euclidean and similar metrics are widely used, but any distance metric works.
You can use a variation of Hamming distance:
Let x[i] be the i'th feature of vector x
Let the number of features be n
d(x,y) = Sum { (x[i] == y[i] ? 0 : 1) | i from 0 to n-1 }
The above is a distance metric, essentially a variation of Hamming distance in which each feature has its own alphabet.
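A minimal Python sketch of k-nearest neighbors with this mismatch-count distance might look as follows. The function names, the training-data layout, and the choice to average the values of the k nearest rows are assumptions made for illustration; for class labels a majority vote would be the usual choice instead.

```python
def hamming(x, y):
    """Count the positions where two equal-length feature tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def knn_predict(train, query, k=3):
    """Predict a value for `query` from `train`, a list of (features, value).

    Assumption: the prediction is the mean value of the k nearest rows.
    """
    nearest = sorted(train, key=lambda row: hamming(row[0], query))[:k]
    values = [value for _, value in nearest]
    return sum(values) / len(values)


# Hypothetical training data in the spirit of the question's table.
train = [
    (("Versky", "Blue", "Ball"), 10),
    (("Soll", "Green", "Square"), 5),
    (("Holl", "Red", "Ball"), 8),
]
prediction = knn_predict(train, ("Holl", "Red", "Circle"), k=1)
```

Ties between equally distant rows are broken by their order in the training list, because Python's sort is stable.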

Related

Finding the maximum number of points where the distance between each pair of points is at least d

I have a set of points {p1,p2,...,pn}. I want to find the largest subset of these points such that the Euclidean distance between every pair of points in the subset is at least d.
Any help will be appreciated. Thanks
You need to iterate over the set of points from the first element to the second-to-last element.
For each point, calculate the distance to the following point with this formula:
sqrt(pow(x2-x1,2)+pow(y2-y1,2)) where (x1,y1) is a point and (x2,y2) is the following point of the set.
If this distance is at least d, increment the variable that counts the number of points you want.
(Sorry, my English is very bad.)
Do you need an example? I can do it in Python 3:
from math import sqrt

x_points = [3, 5.2, 7, 1, 0, 9.8, 5]
y_points = [0, 2, 8, 4, 7, 1, 1.2]
# There are 7 points
# (x_points[0]; y_points[0]) is the point (3; 0)
min_distance = 3
max_number_count = 0
# Compare each point with the one that follows it in the list
for i in range(len(x_points) - 1):
    dx = x_points[i] - x_points[i + 1]
    dy = y_points[i] - y_points[i + 1]
    if sqrt(pow(dx, 2) + pow(dy, 2)) >= min_distance:
        max_number_count += 1
The result is max_number_count
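Note that the loop above only compares consecutive points, while the question asks about every pair in the chosen subset. A brute-force sketch that checks all pairs is given below. It is an illustration under the assumption that the point set is small: the running time is exponential, since the task is equivalent to finding a maximum independent set in the graph that links points closer than d. The function name is made up for this example.

```python
from itertools import combinations
from math import dist


def largest_spread_subset(points, d):
    """Return one largest subset whose pairwise distances are all >= d.

    Tries subset sizes from largest to smallest; only practical for
    small inputs because the number of subsets grows exponentially.
    """
    for size in range(len(points), 0, -1):
        for subset in combinations(points, size):
            if all(dist(p, q) >= d for p, q in combinations(subset, 2)):
                return list(subset)
    return []


points = list(zip([3, 5.2, 7, 1, 0, 9.8, 5], [0, 2, 8, 4, 7, 1, 1.2]))
best = largest_spread_subset(points, 3)
```

The answer to the question is then len(best).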

Triangulate a set of points with a concave domain

Setup
Given some set of nodes within a convex hull, assume the domain contains one or more concave areas:
where blue dots are points, and the black line illustrates the domain. Assume the points are held as a 2D array points of length n, where n is the number of point-pairs.
Let us then triangulate the points, using something like the Delaunay method from scipy.spatial:
As you can see, one may experience the creation of triangles crossing through the domain.
Question
What is a good algorithmic approach to removing any triangles that span outside of the domain? Ideally but not necessarily, where the simplex edges still preserve the domain shape (i.e., no major gaps where triangles are removed).
Since my question is seeming to continue to get a decent amount of activity, I wanted to follow up with the application that I'm currently using.
Assuming that you have your boundary defined, you can use a ray casting algorithm to determine whether or not the polygon is inside the domain.
To do this:
Take the centroid of each polygon as C_i = (x_i,y_i).
Then, imagine a line L = [C_i,(+inf,y_i)]: that is, a line that spans east past the end of your domain.
For each boundary segment s_i in boundary S, check for intersections with L. If yes, add +1 to an internal counter intersection_count; else, add nothing.
After the count of all intersections between L and s_i for i=1..N are calculated:
if intersection_count % 2 == 0:
    return True # triangle outside the domain boundary
else:
    return False # triangle inside the domain boundary
If your boundary is not explicitly defined, I find it helpful to 'map' the shape onto a boolean array and use a neighbor tracing algorithm to define it. Note that this approach assumes a solid domain, and you will need a more complex algorithm for domains with 'holes' in them.
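The centroid test described above could be sketched in plain Python as follows. This is an illustration, not the code from my application: it assumes the boundary is a single closed polygon given as an ordered list of vertices, that triangles are index triples into the point list (as in a Delaunay result), and both function names are made up here. Note that it returns True for inside, the opposite convention of the pseudocode above.

```python
def point_in_polygon(x, y, boundary):
    """Ray casting: count crossings of a ray going east from (x, y)."""
    inside = False
    n = len(boundary)
    for i in range(n):
        x1, y1 = boundary[i]
        x2, y2 = boundary[(i + 1) % n]  # wrap around to close the polygon
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside  # odd number of crossings means inside
    return inside


def keep_inside_triangles(points, triangles, boundary):
    """Keep the triangles whose centroid lies inside the boundary."""
    kept = []
    for ia, ib, ic in triangles:
        cx = (points[ia][0] + points[ib][0] + points[ic][0]) / 3.0
        cy = (points[ia][1] + points[ib][1] + points[ic][1]) / 3.0
        if point_in_polygon(cx, cy, boundary):
            kept.append((ia, ib, ic))
    return kept


# L-shaped domain; the second triangle spans the concave notch.
boundary = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
kept = keep_inside_triangles(boundary, [(0, 1, 3), (2, 3, 4)], boundary)
```

Centroids that fall exactly on the boundary are not treated specially in this sketch.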
Here is some Python code that does what you want.
First, building the alpha shape (see my previous answer):
import numpy as np
from scipy.spatial import Delaunay

def alpha_shape(points, alpha, only_outer=True):
    """
    Compute the alpha shape (concave hull) of a set of points.
    :param points: np.array of shape (n,2) points.
    :param alpha: alpha value.
    :param only_outer: boolean value to specify if we keep only the outer border or also inner edges.
    :return: set of (i,j) pairs representing edges of the alpha-shape. (i,j) are the indices in the points array.
    """
    assert points.shape[0] > 3, "Need at least four points"

    def add_edge(edges, i, j):
        """
        Add a line between the i-th and j-th points,
        if not in the list already
        """
        if (i, j) in edges or (j, i) in edges:
            # already added
            assert (j, i) in edges, "Can't go twice over same directed edge right?"
            if only_outer:
                # if both neighboring triangles are in shape, it's not a boundary edge
                edges.remove((j, i))
            return
        edges.add((i, j))

    tri = Delaunay(points)
    edges = set()
    # Loop over triangles:
    # ia, ib, ic = indices of corner points of the triangle
    # (tri.simplices was called tri.vertices in older SciPy releases)
    for ia, ib, ic in tri.simplices:
        pa = points[ia]
        pb = points[ib]
        pc = points[ic]
        # Computing radius of triangle circumcircle
        # www.mathalino.com/reviewer/derivation-of-formulas/derivation-of-formula-for-radius-of-circumcircle
        a = np.sqrt((pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2)
        b = np.sqrt((pb[0] - pc[0]) ** 2 + (pb[1] - pc[1]) ** 2)
        c = np.sqrt((pc[0] - pa[0]) ** 2 + (pc[1] - pa[1]) ** 2)
        s = (a + b + c) / 2.0
        area = np.sqrt(s * (s - a) * (s - b) * (s - c))
        circum_r = a * b * c / (4.0 * area)
        if circum_r < alpha:
            add_edge(edges, ia, ib)
            add_edge(edges, ib, ic)
            add_edge(edges, ic, ia)
    return edges
To compute the edges of the outer boundary of the alpha shape use the following example call:
edges = alpha_shape(points, alpha=alpha_value, only_outer=True)
Now, after the edges of the outer boundary of the alpha-shape of points have been computed, the following function will determine whether a point (x,y) is inside the outer boundary.
def is_inside(x, y, points, edges, eps=1.0e-10):
    intersection_counter = 0
    for i, j in edges:
        assert abs((points[i,1]-y)*(points[j,1]-y)) > eps, 'Need to handle these end cases separately'
        y_in_edge_domain = ((points[i,1]-y)*(points[j,1]-y) < 0)
        if y_in_edge_domain:
            upper_ind, lower_ind = (i,j) if (points[i,1]-y) > 0 else (j,i)
            upper_x = points[upper_ind, 0]
            upper_y = points[upper_ind, 1]
            lower_x = points[lower_ind, 0]
            lower_y = points[lower_ind, 1]
            # is_left_turn predicate is evaluated with: sign(cross_product(upper-lower, p-lower))
            cross_prod = (upper_x - lower_x)*(y-lower_y) - (upper_y - lower_y)*(x-lower_x)
            assert abs(cross_prod) > eps, 'Need to handle these end cases separately'
            point_is_left_of_segment = (cross_prod > 0.0)
            if point_is_left_of_segment:
                intersection_counter = intersection_counter + 1
    return (intersection_counter % 2) != 0
On the input shown in the above figure (taken from my previous answer) the call is_inside(1.5, 0.0, points, edges) will return True, whereas is_inside(1.5, 3.0, points, edges) will return False.
Note that the is_inside function above does not handle degenerate cases. I added two assertions to detect such cases (you can define any epsilon value that fits your application). In many applications this is sufficient, but if not and you encounter these end cases, they need to be handled separately.
See, for example, here on robustness and precision issues when implementing geometric algorithms.
One of the classic Delaunay triangulation algorithms first generates a bounding triangle (the supertriangle), then adds all new triangles with the points sorted by x, and finally prunes all triangles that have a vertex in the supertriangle.
At least from the provided image, one can derive a heuristic for also pruning some triangles whose vertices all lie on the concave hull. Without proof: the triangles to be pruned have a negative area when their vertices are sorted in the same order in which the concave hull is defined.
This may require the concave hull to be inserted into the triangulation as well, and then pruned.
You can try a constrained Delaunay algorithm, for example Sloan's algorithm or the CGAL library.
[1] A Brute-Force Constrained Delaunay Triangulation?
A simple but elegant way is to loop over the triangles and check whether they are within the domain or not. The shapely package could do the trick for you.
For more on this, please check the following post: https://gis.stackexchange.com/a/352442
Note that triangulation is also implemented in shapely, even for MultiPoint objects.
I used it; the performance was amazing and the code was only about five lines.
Compute each triangle's centroid and check whether it is inside the polygon using this algorithm.

Sorting the following coordinates in the given pattern:

I have the following image:
The coordinates corresponding to the white blobs in the image are sorted according to the increasing value of x-coordinate. However, I want them to follow the following pattern:
(In a zig-zag manner from bottom left to top left.)
Any clue how can I go about it? Any clue regarding the algorithm will be appreciated.
The set of coordinates are as follows:
[46.5000000000000,104.500000000000]
[57.5000000000000,164.500000000000]
[59.5000000000000,280.500000000000]
[96.5000000000000,66.5000000000000]
[127.500000000000,103.500000000000]
[142.500000000000,34.5000000000000]
[156.500000000000,173.500000000000]
[168.500000000000,68.5000000000000]
[175.500000000000,12.5000000000000]
[198.500000000000,37.5000000000000]
[206.500000000000,103.500000000000]
[216.500000000000,267.500000000000]
[225.500000000000,14.5000000000000]
[234.500000000000,62.5000000000000]
[251.500000000000,166.500000000000]
[258.500000000000,32.5000000000000]
[271.500000000000,13.5000000000000]
[284.500000000000,103.500000000000]
[291.500000000000,61.5000000000000]
[313.500000000000,32.5000000000000]
[318.500000000000,10.5000000000000]
[320.500000000000,267.500000000000]
[352.500000000000,57.5000000000000]
[359.500000000000,102.500000000000]
[360.500000000000,167.500000000000]
[366.500000000000,11.5000000000000]
[366.500000000000,34.5000000000000]
[408.500000000000,9.50000000000000]
[414.500000000000,62.5000000000000]
[419.500000000000,34.5000000000000]
[451.500000000000,12.5000000000000]
[456.500000000000,97.5000000000000]
[457.500000000000,168.500000000000]
[465.500000000000,62.5000000000000]
[465.500000000000,271.500000000000]
[468.500000000000,31.5000000000000]
[498.500000000000,10.5000000000000]
[522.500000000000,105.500000000000]
[524.500000000000,32.5000000000000]
[533.500000000000,60.5000000000000]
[534.500000000000,11.5000000000000]
[565.500000000000,164.500000000000]
[576.500000000000,33.5000000000000]
[581.500000000000,10.5000000000000]
[582.500000000000,67.5000000000000]
[586.500000000000,267.500000000000]
[590.500000000000,102.500000000000]
[622.500000000000,10.5000000000000]
[630.500000000000,32.5000000000000]
[646.500000000000,58.5000000000000]
[653.500000000000,94.5000000000000]
[669.500000000000,8.50000000000000]
[678.500000000000,167.500000000000]
[680.500000000000,31.5000000000000]
[705.500000000000,57.5000000000000]
[719.500000000000,9.50000000000000]
[729.500000000000,271.500000000000]
[732.500000000000,33.5000000000000]
[733.500000000000,97.5000000000000]
[757.500000000000,11.5000000000000]
[758.500000000000,59.5000000000000]
[778.500000000000,157.500000000000]
[792.500000000000,31.5000000000000]
[802.500000000000,10.5000000000000]
[812.500000000000,94.5000000000000]
[834.500000000000,59.5000000000000]
[839.500000000000,30.5000000000000]
[865.500000000000,160.500000000000]
[866.500000000000,272.500000000000]
[885.500000000000,58.5000000000000]
[892.500000000000,97.5000000000000]
[955.500000000000,94.5000000000000]
[963.500000000000,163.500000000000]
[972.500000000000,265.500000000000]
Building upon uSeemSurprised's answer, I would go for a 3-step approach:
Sort the points list by y-coord. This is O(n log n)
Determine the y-axis ranges. I simply iterate over the points and take note of where the y-coord difference is larger than a threshold value. This is O(n) of course
Sort each of the sublists that represent the y-axis lines by x-coord. If we had m sublists of k items each this would be O(m (k log k)); so the overall process is still O(n log n)
The code:
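def zigzag(points, threshold=10.0):
    # step 1: sort by y-coordinate (in place)
    points.sort(key=lambda p: p[1])
    # step 2: record the index at which each new row starts
    breaks = []
    for i in range(1, len(points)):
        if points[i][1] - points[i-1][1] > threshold:
            breaks.append(i)
    # close the last row so that the final point is included
    breaks.append(len(points))
    # step 3: sort each row by x, flipping direction on every row
    rev = False
    start = 0
    outpoints = []
    for b in breaks:
        outpoints += sorted(points[start:b], reverse=rev)
        start = b
        rev = not rev
    return outpoints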
You can sort the x-axis coordinates corresponding to y-axis coordinates, where you consider certain y-axis range, i.e the coordinates that are sorted according to x-axis all belong to the same y-axis range. Each time you move up to a different y-axis range you can flip the sorting order, i.e increasing then decreasing and so on.
The most similar algorithm I can think of is Andrew's algorithm for convex hulls, specifically the lower hull (though depending on the coordinate system, you may need to use the upper hull instead).
Running the lower hull algorithm and removing the resulting points, repeatedly, until no points remain would get you what you want. To get the zig-zag pattern, reverse the ordering every other time you run it.
Here are implementations in most languages:
https://en.wikibooks.org/wiki/Algorithm_Implementation/Geometry/Convex_hull/Monotone_chain
Edit: The downside here is precision in the case of fuzzy measurements. You may need to adjust the algorithm a bit if convex hulls aren't exactly what you need, e.g. if you want to treat a point as part of the hull when it is within, say, 0.1 or 1% of being on the hull. In the example given, the coordinates are exactly on the line, so it would work well, but not so well if the coordinates were randomly distributed within, say, 0.1 of their actual positions.
This approach assumes you know how many rows you expect, although I suspect there's programmatic ways you could estimate that.
nbins = 6; % Number of horizontal rows we expect
[bin,binC] = kmedoids(A(:,2),nbins); % Use a clustering approach to group them
bin = binC(bin); % Clusters in random order, fix it so that clusters
[~,~,bin] = unique(bin); % are ordered by central y value
xord = A(:,1) .* (-1).^mod(bin+1,2); % flip/flop for each row making the x-coord +ve or -ve
% so that we can sort in a zig-zag
[~,idx] = sortrows([bin,xord], [1,2]); % Sort by the clusters and the zig-zag
B = A( idx, : ); % Create re-ordered array
Plotting this, it seems like what you want
figure(99); clf; hold on;
plot( A(:,1), A(:,2), '-o' );
plot( B(:,1), B(:,2), '-', 'linewidth', 1.5 );
set(gca, 'YDir', 'reverse');
legend( {'Original','Reordered'} );
Use a nearest neighbor search, where you define a custom distance measure which makes distance in the Y direction more expensive than distance in the X direction. Then start the algorithm with the bottom left point.
The "normal" Euclidean distance in Cartesian coordinates is calculated by sqrt( (x2 - x1)^2 + (y2 - y1)^2 )
To make the y direction more expensive, use a custom distance formula where you multiply the y result by a constant:
sqrt( (x2 - x1)^2 + k*(y2 - y1)^2 )
where the constant k is larger than 1 but not much larger, I would start with 2.
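A greedy nearest-neighbor ordering with this weighted distance could be sketched in Python as below. The function names and the start_index parameter are assumptions for illustration; which point counts as "bottom left" depends on whether the y axis points up or down in your image, so the caller picks the start.

```python
from math import sqrt


def weighted_distance(p, q, k=2.0):
    """Euclidean-style distance where the squared y term is multiplied by k."""
    return sqrt((q[0] - p[0]) ** 2 + k * (q[1] - p[1]) ** 2)


def nearest_neighbor_order(points, start_index=0, k=2.0):
    """Repeatedly jump to the closest unvisited point under weighted_distance."""
    remaining = list(points)
    current = remaining.pop(start_index)
    order = [current]
    while remaining:
        nxt = min(remaining, key=lambda q: weighted_distance(current, q, k))
        remaining.remove(nxt)
        order.append(nxt)
        current = nxt
    return order


# Two rows of three points; the walk should turn around at the row end.
grid = [(0, 0), (1, 0), (2, 0), (0, 10), (1, 10), (2, 10)]
path = nearest_neighbor_order(grid, start_index=0, k=2.0)
```

This is O(n^2), which is fine for a few dozen blobs. Being greedy, it can still jump rows early if a row has a gap wider than the weighted row spacing, so k may need tuning for real data.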

Optimizing the layout of a graph with given (erroneous) node-distances

I have a loosely connected graph. For every edge in this graph, I know the approximate distance d(v,w) between nodes v and w at positions p(v) and p(w) as a vector in R3, not only as a Euclidean distance. The error is small (let's say < 3%), and the first node is at <0,0,0>.
If there were no errors at all, I can calculate the node-positions this way:
set p(first_node) = <0,0,0>
calculate_position(first_node)

calculate_position(v):
    for (v,w) in Edges:
        if p(w) is not set:
            set p(w) = p(v) + d(v,w)
            calculate_position(w)
    for (u,v) in Edges:
        if p(u) is not set:
            set p(u) = p(v) - d(u,v)
            calculate_position(u)
The errors of the distance are not equal. But to keep things simple, assume the relative error (d(v,w)-d'(v,w))/E(v,w) is N(0,1)-normal-distributed. I want to minimize the sum of the squared error
sum( ((p(v)-p(w)) - d(v,w) )^2/E(v,w)^2 ) for all edges
The graph may have a moderate number of nodes ( > 100 ) but only a few connections between them, and it has been "prefiltered" (split into subgraphs wherever there is only one connection between those subgraphs).
I have tried a simplistic "physical model" based on Hooke's law, but it is slow and unstable. Is there a better algorithm or heuristic for this kind of problem?
This looks like linear regression. Take error terms of the following form, i.e. without squares and split into separate coordinates:
(px(v) - px(w) - dx(v,w))/E(v,w)
(py(v) - py(w) - dy(v,w))/E(v,w)
(pz(v) - pz(w) - dz(v,w))/E(v,w)
If I understood you correctly, you are looking for values px(v), py(v) and pz(v) for all nodes v such that the sum of squares of the above terms is minimized.
You can do this by creating a matrix A and a vector b in the following way: every row corresponds to one equation of the above form, and every column of A corresponds to one variable, i.e. a single coordinate. For n vertices and m edges, the matrix A will have 3m rows (since you separate coordinates) and 3n−3 columns (since you also fix the first node px(0)=py(0)=pz(0)=0).
The row for (px(v) - px(w) - dx(v,w))/E(v,w) would have an entry 1/E(v,w) in the column for px(v) and an entry -1/E(v,w) in the column for px(w). All other columns would be zero. The corresponding entry in the vector b would be dx(v,w)/E(v,w).
Now solve the linear equation (A^T·A)x = A^T·b, where A^T denotes the transpose of A. The solution vector x will contain the coordinates for your vertices. You can break this into three independent problems, one for each coordinate direction, to keep the size of the linear equation system down.
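As a rough illustration, the normal equations can be assembled edge by edge and solved one coordinate at a time. The sketch below is pure Python and makes several assumptions: edges are given as (v, w, d, err) with d approximating p(v) - p(w) as in the error terms above, node 0 is the fixed origin, the graph is connected, and all function names are invented here. For larger graphs a sparse solver from a numerical library would be the more sensible choice.

```python
def gauss_solve(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: is the graph connected?")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def solve_axis(n, edges, axis):
    """Build (A^T A) and (A^T b) for one coordinate and solve for nodes 1..n-1."""
    m = n - 1  # node 0 is fixed at 0, so it has no column
    normal = [[0.0] * m for _ in range(m)]
    rhs = [0.0] * m
    for v, w, d, err in edges:
        weight = 1.0 / (err * err)
        # Row of A has +1/err at column v and -1/err at column w.
        for node, sign in ((v, 1.0), (w, -1.0)):
            if node == 0:
                continue
            rhs[node - 1] += sign * weight * d[axis]
            for other, other_sign in ((v, 1.0), (w, -1.0)):
                if other != 0:
                    normal[node - 1][other - 1] += sign * other_sign * weight
    return gauss_solve(normal, rhs)


def layout(n, edges):
    """Return least-squares positions for n nodes as a list of 3-tuples."""
    axes = [solve_axis(n, edges, axis) for axis in range(3)]
    rest = [tuple(axes[a][i] for a in range(3)) for i in range(n - 1)]
    return [(0.0, 0.0, 0.0)] + rest
```

With error-free measurements the result reproduces the exact positions; with noisy ones it spreads the inconsistency over the edges according to their weights 1/E(v,w)^2.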

2D coordinate normalization

I need to implement a function which normalizes coordinates. I define normalize as (please suggest a better term if I'm wrong):
Mapping entries of a data set from their natural range to values between 0 and 1.
Now this was easy in one dimension:
static List<float> Normalize(float[] nums)
{
    float max = Max(nums);
    float min = Min(nums);
    float delta = max - min;
    List<float> li = new List<float>();
    foreach (float i in nums)
    {
        li.Add((i - min) / delta);
    }
    return li;
}
I need a 2D version as well, and that one has to keep the aspect ratio intact. But I'm having some trouble figuring out the math.
Although the code posted is in C#, the answers need not be.
Thanks in advance. :)
I am posting my response as an answer because I do not have enough points to make a comment.
My interpretation of the question: How do we normalize the coordinates of a set of points in 2 dimensional space?
A normalization operation involves a "shift and scale" operation. In one-dimensional space this is fairly easy and intuitive (as pointed out by @Mizipzor).
normalizedX=(originalX-minX)/(maxX-minX)
In this case we first shift the value by a distance of minX and then scale it by the range, which is (maxX-minX). The shift ensures that the minimum moves to 0, and the scale squashes the distribution so that its upper limit is 1.
In 2D, simply dividing by the largest dimension is not enough. Why?
Consider the simplified case with just 2 points, A = (5000, 8000) and B = (7000, 10000).
The maximum value of any dimension is the Y value of point B, which is 10000.
Coordinates of normalized A=>5000/10000,8000/10000, i.e. 0.5,0.8
Coordinates of normalized B=>7000/10000,10000/10000, i.e. 0.7,1.0
The X and Y values are all between 0 and 1. However, the distribution of the normalized values is far from uniform. The minimum value is 0.5. Ideally this should be closer to 0.
Preferred approach for normalizing 2d coordinates
To get a more even distribution we should do a "shift" operation around the minimum of all X values and minimum of all Y values. This could be done around the mean of X and mean of Y as well. Considering the above example,
the minimum of all X is 5000
the minimum of all Y is 8000
Step 1 - Shift operation
A=>(5000-5000,8000-8000), i.e (0,0)
B=>(7000-5000,10000-8000), i.e. (2000,2000)
Step 2 - Scale operation
To scale down the values we need some maximum. We could use the diagonal AB, whose length is sqrt(2000*2000 + 2000*2000), about 2828.43
A=>(0/2828.43,0/2828.43), i.e. (0,0)
B=>(2000/2828.43,2000/2828.43), i.e. about (0.707,0.707)
What happens when there are more than 2 points?
The approach remains similar. We find the coordinates of the smallest bounding box which fits all the points.
We find the minimum value of X (MinX) and minimum value of Y (MinY) from all the points and do a shift operation. This changes the origin to the lower left corner of the bounding box.
We find the maximum value of X (MaxX) and maximum value of Y (MaxY) from all the points.
We calculate the length of the diagonal connecting (MinX,MinY) and (MaxX,MaxY) and use this value to do a scale operation.
length of diagonal=sqrt((maxX-minX)*(maxX-minX) + (maxY-minY)*(maxY-minY))
normalized X = (originalX - minX)/(length of diagonal)
normalized Y = (originalY - minY)/(length of diagonal)
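The three formulas above translate almost directly into code. The following Python sketch is only an illustration; the function name is made up, and returning all zeros when every point coincides is an assumption added to avoid dividing by zero.

```python
from math import hypot


def normalize_by_diagonal(points):
    """Shift to the bounding-box corner, then scale by the box diagonal."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    diagonal = hypot(max(xs) - min_x, max(ys) - min_y)
    if diagonal == 0:
        # Assumption: all points coincide, so map them all to the origin.
        return [(0.0, 0.0) for _ in points]
    return [((x - min_x) / diagonal, (y - min_y) / diagonal) for x, y in points]


normalized = normalize_by_diagonal([(5000, 8000), (7000, 10000)])
```

Both axes are divided by the same number, so the aspect ratio is preserved, and every coordinate lands in [0, 1] because no offset can exceed the diagonal.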
How does this logic change if we have more than 2 dimensions?
The concept remains the same.
- We find the minimum value in each of the dimensions (X,Y,Z)
- We find the maximum value in each of the dimensions (X,Y,Z)
- Compute the length of the diagonal as a scaling factor
- Use the minimum values to shift the origin.
length of diagonal=sqrt((maxX-minX)*(maxX-minX)+(maxY-minY)*(maxY-minY)+(maxZ-minZ)*(maxZ-minZ))
normalized X = (originalX - minX)/(length of diagonal)
normalized Y = (originalY - minY)/(length of diagonal)
normalized Z = (originalZ - minZ)/(length of diagonal)
It seems you want each vector (1D, 2D or ND) to have length <= 1.
If that's the only requirement, you can just divide each vector by the length of the longest one.
double max = maximum (|vector| for each vector in 'data');
foreach (Vector v : data) {
li.add(v / max);
}
That will make the longest vector in result list to have length 1.
But this won't be equivalent of your current code for 1-dimensional case, as you can't find minimum or maximum in a set of points on the plane. Thus, no delta.
Simple idea: Find out which dimension is bigger and normalize in this dimension. The second dimension can be computed by using the ratio. This way the ratio is kept and your values are between 0 and 1.
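One way to read this idea in code is to divide both axes by the larger of the two ranges. The Python sketch below is an interpretation rather than the poster's own code; the function name and the handling of the degenerate all-points-equal case are assumptions.

```python
def normalize_keep_aspect(points):
    """Shift to the minimum corner and scale both axes by the larger range."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    scale = max(max(xs) - min_x, max(ys) - min_y)
    if scale == 0:
        # Assumption: all points coincide, so map them all to the origin.
        return [(0.0, 0.0) for _ in points]
    return [((x - min_x) / scale, (y - min_y) / scale) for x, y in points]


result = normalize_keep_aspect([(0, 0), (4, 2)])
```

The larger dimension then spans the full range from 0 to 1, while the smaller one spans a shorter range in proportion, which is what keeps the aspect ratio intact.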

Resources