What's the best way to find KNN by hand?

Let's say I'm given the following data and need to use KNN to predict the class label of record 15, knowing beforehand that k is set to 3. What are the proper steps to do this, regardless of what the table, the labels, or k happen to be?
The first 10 are training data, and the other 10 are testing data.

First you need to convert the categorical data to numeric data.
For example: In case of the Astigmatism column you may use 1 for 'Yes' and 0 for 'No'.
Similarly do this for Age, Spectacle Prescription and Tear Production Rate.
Now that you have converted your categorical data to numeric values, you are ready to apply KNN.
Considering the testing data, select each row one by one and calculate its distance (this can be the L1 or L2 distance) from each point in the training set. So for the 11th data point you calculate its distance from all of the training points 1 through 10.
Note that calculating a distance has become possible only because of the conversion of categorical data to numerical values.
Then, once you have the 10 distance values giving the distance of the 11th data point from every training data point, you select the 3 (as k = 3) points with the minimum distance, look at their labels, and choose the majority label.
Repeat this for all testing points.
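A minimal sketch of these steps in Python, assuming the categorical columns have already been encoded as 0/1 as described (the data here is randomly generated purely for illustration):
import numpy as np

# Hypothetical encoded data: 10 training rows and 10 testing rows, 4 feature columns.
X_train = np.random.randint(0, 2, size=(10, 4))
y_train = np.random.randint(0, 2, size=10)       # class labels of the training rows
X_test = np.random.randint(0, 2, size=(10, 4))

k = 3
predictions = []
for x in X_test:
    # L2 (Euclidean) distance from this test point to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k training points with minimum distance
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    predictions.append(labels[np.argmax(counts)])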


What is a better algorithm than A* for pair matching?

I have two sets of data.
PreData has 251 points.
PostData has 234 points.
PreData is converted into PostData, but some points go missing in the conversion process.
The picture below shows the distributions of each data set. The orange colour is PostData.
None of the data points have name tags, so I cannot find out which pre data point is connected to which post data point.
But the conversion process is pretty linear (not 100% linear; noise exists),
so I guess two points that are closer together have a higher chance of being a pair.
So, I made a distance matrix like the one below.
Now the problem is this:
find the 234 pairs with the smallest sum in the matrix, where each row and column is used at most once (the matrix is 234 by 251, so some columns cannot be chosen).
First, I tried the A* algorithm.
Select all first-row items as new nodes,
and the heuristic is the sum of the smallest values of each row of the remaining matrix (after removing the first row and the used columns).
1. Start.
2. Remove the row and column.
3. Calculate the heuristic:
the sum of each remaining row's minimum.
This heuristic never overestimates, so it is admissible (a sketch of it follows).
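A rough Python sketch of that heuristic, assuming the distance matrix is a NumPy array; the function name and the used_cols bookkeeping are my own, not from the question:
import numpy as np

def heuristic(dist, next_row, used_cols):
    # Lower bound on the remaining cost: for every row not yet assigned,
    # take its minimum over the still-free columns and sum those minimums.
    # This never overestimates the true remaining cost, so it is admissible.
    free = [c for c in range(dist.shape[1]) if c not in used_cols]
    if not free:
        return 0.0
    rows = np.arange(next_row, dist.shape[0])
    return dist[np.ix_(rows, free)].min(axis=1).sum()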
A* works well on the 7 by 10 and 10 by 10 test data,
but it runs forever on the real data:
234 by 251 is too big.
I guess the heuristic is too optimistic... but...
What can I do? Please, someone tell me a better algorithm or other ideas.

How to make moving average using geopandas nearest neighbors?

I have a geodataframe ("GDF") with one column of "values" and another column of "geometry" (these are actual geographical regions), so each row represents a region.
The "values" column is zero in many rows, and a large number in some rows.
I need to make a "moving average" or rolling average, using the nearest neighbors up to a certain "max_distance" (we can assume that the GDF has a locally projected CRS, so the max_distance has real meaning).
Thus, the averaged_values would have neither zero nor large values in most of the regions, but an average value.
One way to do it might be something like
for region in GDF:
    averaged_values = sjoin_nearest(GDF, GDF, max_distance=1000)["values"].mean()
but really I don't know how to proceed.
The expected output would be a geodataframe with 3 columns:
"values", "averaged_values", and "geometry".
Any ideas?
What you are trying to do is also called a spatial lag. The best way is to create a spatial weights matrix based on a set distance and compute the lag, both using the libpysal library, which is part of the geopandas ecosystem.
import libpysal
# create weights
W = libpysal.weights.DistanceBand.from_dataframe(gdf, threshold=1000)
# row-normalise weights
W.transform = "r"
# create lag
gdf["averaged_values"] = libpysal.weights.lag_spatial(W, gdf["values"])

Algorithm to assign best value between points based on distance

I am having trouble figuring out an algorithm to best assign values to different points on a diagram based on the distance between the points.
Essentially, I am given a diagram with a block and a dynamic amount of points. It should look something like this:
I am then given a list of values to assign to each point. Here are the rules and info:
I know the Lat,Long values for each point and the central block. In other words, I can get the direct distance from every object to another.
The list of values may be shorter than the total number of points. In this case, values can be repeated multiple times.
In the case where values must be repeated, the duplicate values should be as far away as possible from one another.
Here is an example using a value list of {1,2}:
In reality, this is a very simple example. In truth, there may be thousands of points.
Find out how many values you need to repeat. In your example you have 2 values and 5 points, so you need 2 repetitions of the 2 values; then you will have 2x2=4 positions [call this pNum] (you have to use different pairs as much as possible so that they are far apart from each other).
Calculate a distance array, then find the pNum largest values in the array; in other words, find the greatest 4 values in the array in your example.
Assign the repeated values to the points found to be farthest apart, and assign the rest of the points based on the distance values in the array. A greedy sketch of this idea follows.
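A rough greedy Python sketch of that idea; the coordinates and value list are placeholders, and with thousands of points you would want something smarter than this quadratic loop:
import itertools
import math

points = [(0, 0), (1, 5), (4, 1), (5, 5), (2, 3)]   # hypothetical point coordinates
values = [1, 2]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

assigned = {}
remaining = list(range(len(points)))
# Hand out the values in rotation; each time, give the current value to the
# unassigned point farthest from the points already holding that value.
for i in itertools.cycle(range(len(values))):
    if not remaining:
        break
    holders = [p for p, v in assigned.items() if v == values[i]]
    best = max(remaining,
               key=lambda p: min((dist(points[p], points[h]) for h in holders),
                                 default=0))
    assigned[best] = values[i]
    remaining.remove(best)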

How to determine whether 2 sets of data are similar

I've got a problem comparing two sets of data.
Now I have two sets of data, say set A and set B. What I am going to do is:
1.) plot a line graph based on set A's data
2.) plot another line graph based on set B's data and overlay it on set A's graph.
My problem is that set B's data can be much larger (or smaller) than set A's data. But the purpose of drawing these graphs is to compare the patterns of the two graphs, which means I need to multiply or divide every data point in set B by a factor, say N, so that the resulting graph lies in a similar range (so they can be overlaid). My problem is how to find this N. Currently I am getting N this way:
1.) Find Average A, the average of the maximum value and minimum value of set A
2.) Find Average B, the average of the maximum value and minimum value of set B
3.) Divide Average B by Average A to get N.
However, I find the result of this method is not very good. Is there any better algorithm to compare two sets of data and find such an N?
How about using a central moving average: calculate the moving average for both data sets and then divide one by the other. A moving average essentially smooths out spikes.
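One reading of that suggestion in Python; the window size and data are placeholders, and taking the mean of the ratio of the smoothed series is my interpretation:
import numpy as np

def moving_average(x, w=5):
    # centred moving average; smooths out spikes
    return np.convolve(x, np.ones(w) / w, mode="valid")

set_a = np.random.rand(100) * 10      # hypothetical data
set_b = np.random.rand(100) * 1000    # same length, much larger scale

# N as the mean ratio of the two smoothed series
n = (moving_average(set_b) / moving_average(set_a)).mean()
scaled_b = set_b / n                  # now roughly in set A's range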
You could create a best fit line for each set of data and then compute the cosine similarity between the two lines.
This will only work if each data set is linear.
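A sketch of that in Python, fitting a degree-1 polynomial to each set and comparing the coefficient vectors (the data is invented, and equal spacing of the x values is assumed):
import numpy as np

set_a = np.random.rand(100)           # hypothetical data
set_b = np.random.rand(100) * 50

x = np.arange(len(set_a))
line_a = np.polyfit(x, set_a, 1)      # [slope, intercept] of the best-fit line
line_b = np.polyfit(x, set_b, 1)

# cosine similarity between the two coefficient vectors
cos_sim = np.dot(line_a, line_b) / (np.linalg.norm(line_a) * np.linalg.norm(line_b))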

How can I sort a 10 x 10 grid of 100 car images in two dimensions, by price and speed?

Here's the scenario.
I have one hundred car objects. Each car has a property for speed, and a property for price. I want to arrange images of the cars in a grid so that the fastest and most expensive car is at the top right, and the slowest and cheapest car is at the bottom left, and all other cars are in an appropriate spot in the grid.
What kind of sorting algorithm do I need to use for this, and do you have any tips?
EDIT: the results don't need to be exact - in reality I'm dealing with a much bigger grid, so it would be sufficient if the cars were clustered roughly in the right place.
Just an idea inspired by Mr Cantor:
calculate max(speed) and max(price)
normalize all speed and price data into range 0..1
for each car, calculate the "distance" to the possible maximum
based on a²+b²=c², distance could be something like
sqrt( (speed(car[i])/maxspeed)^2 + (price(car[i])/maxprice)^2 )
apply weighting as (visually) necessary
sort cars by distance
place "best" car in "best" square (upper right in your case)
walk the grid in zigzag and fill with next car in sorted list
Result (mirrored, top left is best):
1 - 2 6 - 7
/ / /
3 5 8
| /
4
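A Python sketch of this recipe; the car data is invented, and the zigzag below walks the anti-diagonals in alternating directions, matching the diagram above:
import math
import random

# hypothetical car data
cars = [{"speed": random.uniform(50, 300), "price": random.uniform(5000, 200000)}
        for _ in range(100)]

max_speed = max(c["speed"] for c in cars)
max_price = max(c["price"] for c in cars)

def distance(c):
    # sqrt((speed/maxspeed)^2 + (price/maxprice)^2); larger = closer to the maximum
    return math.hypot(c["speed"] / max_speed, c["price"] / max_price)

ranked = sorted(cars, key=distance, reverse=True)   # "best" car first

def zigzag(n):
    # visit the grid diagonal by diagonal, alternating direction each time
    for s in range(2 * n - 1):
        cells = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        yield from (cells if s % 2 else reversed(cells))

grid = [[None] * 10 for _ in range(10)]
for (r, c), car in zip(zigzag(10), ranked):
    grid[r][c] = car    # (0, 0) is the "best" corner, mirrored as in the diagram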
Treat this as two problems:
1: Produce a sorted list
2: Place members of the sorted list into the grid
The sorting is just a matter of you defining your rules more precisely. "Fastest and most expensive first" doesn't work. Which comes first: my £100,000 Rolls-Royce with a top speed of 120, or my souped-up Mini costing £50,000 with a top speed of 180?
Having got your list, how will you fill the grid? First and last are easy, but where does number two go? Along the top or down? Then where next: along rows, along columns, zigzag? You've got to decide. After that, the coding should be easy.
I guess what you want is to have cars that have "similar" characteristics to be clustered nearby, and additionally that the cost in general increases rightwards, and speed in general increases upwards.
I would try to following approach. Suppose you have N cars and you want to put them in an X * Y grid. Assume N == X * Y.
Put all the N cars in the grid at random locations.
Define a metric that calculates the total misordering in the grid; for example, count the number of car pairs C1=(x,y) and C2=(x',y') such that C1.speed > C2.speed but y < y' plus car pairs C1=(x,y) and C2=(x',y') such that C1.price > C2.price but x < x'.
Run the following algorithm:
Calculate current misordering metric M
Enumerate all pairs of cars in the grid and calculate the misordering metric M' you would obtain if you swapped the cars
Swap the pair of cars that reduces the metric most, if any such pair was found
If you swapped two cars, repeat from step 1
Finish
This is a standard "local search" approach to an optimization problem. What you have here is basically a simple combinatorial optimization problem. Another approach to try might be a self-organizing map (SOM) with a preseeded gradient of speed and cost in the matrix. A sketch of the local search is below.
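A small Python sketch of the local search; the grid is kept tiny and the data random, because evaluating every pairwise swap is expensive (names are hypothetical throughout):
import itertools
import random

n = 4   # keep the grid small; the swap search below is O(n^4) per pass
cells = list(itertools.product(range(n), range(n)))
grid = {pos: {"speed": random.random(), "price": random.random()} for pos in cells}

def misordering(g):
    # a pair is misordered when its speed order disagrees with its y order,
    # or its price order disagrees with its x order, as in the metric above
    m = 0
    for (p1, c1), (p2, c2) in itertools.combinations(g.items(), 2):
        if (c1["speed"] - c2["speed"]) * (p1[1] - p2[1]) < 0:
            m += 1
        if (c1["price"] - c2["price"]) * (p1[0] - p2[0]) < 0:
            m += 1
    return m

improved = True
while improved:
    improved = False
    best_pair, best_m = None, misordering(grid)
    for a, b in itertools.combinations(cells, 2):
        grid[a], grid[b] = grid[b], grid[a]    # try a swap
        m = misordering(grid)
        grid[a], grid[b] = grid[b], grid[a]    # undo it
        if m < best_m:
            best_pair, best_m = (a, b), m
    if best_pair:
        a, b = best_pair
        grid[a], grid[b] = grid[b], grid[a]    # keep the best improving swap
        improved = True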
Basically you have to take one of speed or price as the primary key, then take the cars that share each value of the primary and sort them by the other value in ascending/descending order, with the primary values themselves also taken in ascending/descending order as needed (a one-line Python version follows the example below).
Example:
c1(20,1000) c2(30,5000) c3(20, 500) c4(10, 3000) c5(35, 1000)
Let's assume Car(speed, price) as the measure in the above list, and that the primary is speed.
1 Get the car with minimum speed
2 Then get all the cars with the same speed value
3 Arrange these values in ascending order of car price
4 Get the next car with the next minimum speed value and repeat the above process
c4(10, 3000)
c3(20, 500)
c1(20, 1000)
c2(30, 5000)
c5(35, 1000)
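In Python, this primary/secondary ordering is a plain tuple sort; the list below reuses the example's numbers:
cars = [(20, 1000), (30, 5000), (20, 500), (10, 3000), (35, 1000)]  # (speed, price)
ordered = sorted(cars)   # tuples sort by speed first, then by price
# ordered == [(10, 3000), (20, 500), (20, 1000), (30, 5000), (35, 1000)]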
If you post what language you are using, that would be helpful, as some language constructs make this easier to implement. For example, LINQ makes your life very easy in this situation.
cars.OrderBy(x => x.Speed).ThenBy(p => p.Price);
Edit:
Now you've got the list; as for placing these car items into the grid, unless you know in advance that there will be a predetermined number of cars with these values, you can't do anything except go with some fixed grid size, as you are doing now.
One option would be to go with a nonuniform grid, if you prefer, with each row holding the car items of a specific speed, but this is only applicable when you know there will be a considerable number of cars sharing the same speed value.
So each row will show cars of the same speed in the grid.
Thanks
Is the 10x10 constraint necessary? If it is, you must have ten speeds and ten prices, or else the diagram won't make very much sense. For instance, what happens if the fastest car isn't the most expensive?
I would rather recommend you make the grid size equal to
(number of distinct speeds) x (number of distinct prices),
then it would be a (rather) simple case of ordering by two axes.
If the data originates in a database, then you should order them as you fetch them from the database. This should only mean adding ORDER BY speed, price near the end of your query, but before the LIMIT part (where 'speed' and 'price' are the names of the appropriate fields).
As others have said, "fastest and most expensive" is a difficult thing to do, you ought to just pick one to sort by first. However, it would be possible to make an approximation using this algorithm:
Find the highest price and fastest speed.
Normalize all prices and speeds to e.g. a fraction out of 1. You do this by dividing each price by the highest price you found in step 1, and likewise for speeds.
Multiply the normalized price and speed together to create one "price & speed" number.
Sort by this number.
This ensures that if car A is faster and more expensive than car B, it gets put ahead of it on the list. Cars where one value is higher but the other is lower get roughly sorted. I'd recommend storing these values in the database and sorting as you select.
Putting them in a 10x10 grid is easy. Start outputting items, and when you get to a multiple of 10, start a new row.
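A Python sketch of this approximation with invented numbers:
# hypothetical (price, speed) pairs
cars = [(100000, 120), (50000, 180), (20000, 140)]

max_price = max(p for p, s in cars)
max_speed = max(s for p, s in cars)

# combine the normalized price and speed into one number and sort by it
ranked = sorted(cars,
                key=lambda c: (c[0] / max_price) * (c[1] / max_speed),
                reverse=True)

# start a new row at every multiple of 10
grid = [ranked[i:i + 10] for i in range(0, len(ranked), 10)]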
Another option is to apply a score 0 .. 200% to each car, and sort by that score.
Example:
score_i = speed_percent(min_speed, max_speed, speed_i) + price_percent(min_price, max_price, price_i)
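Reading speed_percent and price_percent as simple min-max scaling, the score might look like this in Python (the scaling itself is my assumption; the function names follow the formula above):
def percent(lo, hi, v):
    # min-max scale v into 0..100
    return 100.0 * (v - lo) / (hi - lo)

def score(car, min_speed, max_speed, min_price, max_price):
    # 0..200: equal weight to speed and price
    return (percent(min_speed, max_speed, car["speed"]) +
            percent(min_price, max_price, car["price"]))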
Hmmm... a kind of bubble sort could be a simple algorithm here.
Make a random 10x10 array.
Find two neighbours (horizontal or vertical) that are in "wrong order", and exchange them.
Repeat (2) until no such neighbours can be found.
Two neighbour elements are in "wrong order" when:
a) they're horizontal neighbours and left one is slower than right one,
b) they're vertical neighbours and top one is cheaper than bottom one.
But I'm not actually sure this algorithm stops for every input. I'm almost sure it is very slow :-). It should be easy to implement, though, and after some finite number of iterations the partial result might be good enough for your purposes. You can also start by generating the array using one of the other methods mentioned here. It will also maintain your condition on the array's shape.
Edit: It is too late here to prove anything, but I made some experiments in Python. It looks like a random 100x100 array can be sorted this way in a few seconds, and I always managed to get a full 2D ordering (that is, at the end there were no wrongly-ordered neighbours). Assuming the OP can precalculate this array, he can put any reasonable number of cars into it and get sensible results. Experimental code: http://pastebin.com/f2bae9a79 (you need matplotlib, and I recommend ipython too). iterchange is the sorting method there. A sketch of the idea is below.
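A minimal sketch of the neighbour-swap idea in Python (not the pastebin code; the data is random, and the iteration cap reflects the fact that termination isn't proven):
import random

n = 10
# hypothetical grid of (speed, price) tuples
grid = [[(random.random(), random.random()) for _ in range(n)] for _ in range(n)]

def swap_one(g):
    # find one wrongly-ordered pair of neighbours, swap it, report success
    for r in range(n):
        for c in range(n):
            # horizontal neighbours: wrong if the left one is slower than the right one
            if c + 1 < n and g[r][c][0] < g[r][c + 1][0]:
                g[r][c], g[r][c + 1] = g[r][c + 1], g[r][c]
                return True
            # vertical neighbours: wrong if the top one is cheaper than the bottom one
            if r + 1 < n and g[r][c][1] < g[r + 1][c][1]:
                g[r][c], g[r + 1][c] = g[r + 1][c], g[r][c]
                return True
    return False

for _ in range(100000):   # cap the passes, since termination is not guaranteed
    if not swap_one(grid):
        break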
