Example data set for the k-Nearest Neighbors algorithm? - computational-geometry

What is an example of a data set one would use with the k-Nearest Neighbors algorithm?
I understand the concept but I am unsure about what kind of data one would use for the x, y coordinates.
Can someone provide an example of a dataset (with x, y coordinates) for the k-nearest-neighbor algorithm?

Put simply, nearest-neighbor (NN) search works like this:
You have a database of elements (here, 2-dimensional points with
coordinates x and y).
A query arrives that is of the same type as the elements of the
database, so a 2D point in your case.
The goal is to find the database point that is most similar to
(that is, closest to) the query point.
Many algorithms avoid searching the whole database and examine only the part that is relevant to the query, so the query can be answered efficiently.
Example:
The database has six 2D points (this is the dataset you are referring to):
0 0
1 1
2 2
3 3
4 4
5 5
A 2D query point arrives:
q = (9, 9)
The answer is the point closest to q, which in this example is (5, 5).
In a kNN search, the query asks for the k most similar elements of the database, which in our example are the k points of the database above that are closest to the query point q.
So, for k = 3, for example, the answer should be:
5 5 // the 1st closest point to q
4 4 // the 2nd closest point to q
3 3 // the 3rd closest point to q
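As a rough illustration of the example above, here is a minimal brute-force sketch in Python. It scans the whole database instead of using one of the faster search structures mentioned earlier. The function name `k_nearest` and the choice of Euclidean distance are assumptions made for this sketch.

```python
import heapq
import math


def k_nearest(points, query, k):
    """Return the k points closest to query, nearest first (brute force)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


database = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
q = (9, 9)

print(k_nearest(database, q, 1))  # [(5, 5)]
print(k_nearest(database, q, 3))  # [(5, 5), (4, 4), (3, 3)]
```

The same function works for points with more than two coordinates, since `math.dist` accepts points of any equal dimension.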

You do not understand the concept.
k-NN isn't limited to datasets with only 2 dimensional points (with x & y coordinates).
Any dataset could be used with k-NN, regardless of the number of features - and you could use many different distance metrics (even ones that are not technically valid metrics).

Related

Concentric spheres

Two lists of hemisphere radii are provided. The first list contains N upper hemispheres indexed 1 to N, and the second contains M lower hemispheres indexed 1 to M. A sphere of radius R can be made by taking one upper half of radius R and one lower half of radius R. Also, you can put a sphere into a bigger one and create a sequence of nested concentric spheres. But you can't put two or more spheres directly into another one.
If there is a sequence of (D+1) nested spheres, we call this sequence a D-sequence.
Find out how many different X-sequences are possible (1 <= X <= C). An X-sequence is different from another if the index of any hemisphere used in one X-sequence is different from the other.
INPUT
The first line contains three integers: N denoting the number of upper sphere halves, M denoting the number of lower sphere halves, and C.
The second line contains N space-separated integers denoting the radii of upper hemispheres.
The third line contains M space-separated integers denoting the radii of lower hemispheres.
OUTPUT
Output a single line containing C space-separated integers, where the i-th integer is the number of ways to build an i-sequence, modulo 1000000007.
Example
Input
3 4 3
1 2 3
1 1 3 2
Output
5 2 0
I am looking for the radii that appear in both the upper and the lower hemisphere lists, since only those can form a sphere, and then taking their maximum count by comparing their counts in both radii lists.
Then, for each C, the sum of products of the counts of C+1 elements yields the answer.
How can I calculate this efficiently, or is there another approach?
Guys, this is my first answer. Spare me the whip for now, as I am here to learn.
First, find the number of spheres possible for each radius:
no. of spheres: 2 1 1
having radius:  1 2 3
Since a sphere of radius r fits inside a sphere of radius R whenever R > r, all we need to do is find the number of strictly increasing subsequences of length 2, 3, ..., C+1 in the list of all possible spheres.
List of possible spheres: [1, 1*, 2, 3] (* is used to mark the second sphere of radius 1)
Consider the 1-sequences: each has 2 spheres, so count the strictly increasing subsequences of length 2 in the list above.
They are:
[1,2], [1*,2], [1,3], [1*,3], [2,3]
Hence the answer is 5.
Get it?
Now, how to solve it:
It can be done using DP. The naive solution has complexity O(n^2 * constant).
You may follow along the lines of the following link: Dp solution.
It is worth mentioning that faster methods exist, which use a BIT, segment trees, etc.
It is similar to this SPOJ problem.
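One possible sketch of such a DP in Python, assuming that any upper hemisphere can be paired with any lower hemisphere of the same radius, so that a radius contributes (upper count) * (lower count) distinct spheres. Because nested spheres must have strictly increasing radii, counting sequences of j spheres amounts to choosing j distinct radii and multiplying their sphere counts. The function name `count_sequences` is hypothetical, and this is not necessarily the approach from the linked solution.

```python
from collections import Counter

MOD = 1_000_000_007


def count_sequences(upper, lower, c):
    """Return the number of X-sequences for X = 1..c, each modulo MOD."""
    up, low = Counter(upper), Counter(lower)
    # Number of distinct spheres that can be built for each shared radius.
    spheres = [up[r] * low[r] % MOD for r in up if r in low]
    # ways[j] = number of ways to pick j spheres with pairwise distinct radii.
    ways = [1] + [0] * (c + 1)
    for count in spheres:
        # Go downwards so each radius is used at most once per selection.
        for j in range(c + 1, 0, -1):
            ways[j] = (ways[j] + ways[j - 1] * count) % MOD
    # An X-sequence consists of X + 1 nested spheres.
    return ways[2:]


print(*count_sequences([1, 2, 3], [1, 1, 3, 2], 3))  # 5 2 0
```

This runs in O(R * C) time, where R is the number of distinct radii that appear in both lists.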

Touching segments

Can anyone please suggest an algorithm for this?
You are given starting and the ending points of N segments over the x-axis.
How many of these segments can be touched, even on their edges, by exactly two lines perpendicular to them?
Sample Input :
3
5
2 3
1 3
1 5
3 4
4 5
5
1 2
1 3
2 3
1 4
1 5
3
1 2
3 4
5 6
Sample Output :
Case 1: 5
Case 2: 5
Case 3: 2
Explanation :
Case 1: We will draw two lines (parallel to the Y-axis) crossing the X-axis at points 2 and 4. These two lines will touch all five segments.
Case 2: We can touch all the segments even with one line crossing the X-axis at 2.
Case 3: It is not possible to touch more than two segments in this case.
Constraints:
1 ≤ N ≤ 10^5
0 ≤ a < b ≤ 10^9
Let's assume that we have a data structure that supports the following operations efficiently:
Add a segment.
Delete a segment.
Return the maximum number of segments that cover one point (that is, the "best" point).
If we have such a structure, we can solve the initial problem efficiently in the following manner:
Let's create an array of events (one event for the start of each segment and one for the end) and sort it by x-coordinate.
Add all segments to the magical data structure.
Iterate over all events and do the following: when a segment starts, add one to the number of currently covered segments and remove the segment from the data structure. When a segment ends, subtract one from the number of currently covered segments and add the segment back to the data structure. After each event, update the answer with the number of currently covered segments (it shows how many segments are covered by the point that corresponds to the current event) plus the maximum returned by the data structure described above (it shows how we can choose another point in the best possible way).
If this data structure can perform all given operations in O(log n), then we have an O(n log n) solution (we sort the events and make one pass over the sorted array, making a constant number of queries to the data structure for each event).
So how can we implement this data structure? Well, a segment tree works fine here. Adding a segment is adding one to all elements in a specific range. Removing a segment is subtracting one from all elements in a specific range. Getting the maximum is just a standard maximum operation on a segment tree. So we need a segment tree that supports two operations: add a constant to a range and get the maximum for the entire tree. Both can be done in O(log n) time per query.
One more note: a standard segment tree requires coordinates to be small. We may assume that they never exceed 2 * n (if that is not the case, we can compress them).
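The sweep described above could be sketched in Python roughly as follows. The names `MaxAddTree` and `max_touched` are hypothetical. The sketch processes all starts at a coordinate before evaluating that coordinate and all ends after it, so that touching a segment at its edge counts. The tree keeps pending additions at the nodes without pushing them down, which is enough here because only the maximum over the entire tree is ever queried.

```python
class MaxAddTree:
    """Segment tree supporting range add and maximum over the whole array."""

    def __init__(self, n):
        self.n = n
        self.mx = [0] * (4 * n)   # maximum of the subtree, pending adds included
        self.add = [0] * (4 * n)  # value added to the whole subtree

    def update(self, lo, hi, delta, node=1, left=0, right=None):
        if right is None:
            right = self.n - 1
        if hi < left or right < lo:
            return
        if lo <= left and right <= hi:
            self.mx[node] += delta
            self.add[node] += delta
            return
        mid = (left + right) // 2
        self.update(lo, hi, delta, 2 * node, left, mid)
        self.update(lo, hi, delta, 2 * node + 1, mid + 1, right)
        self.mx[node] = self.add[node] + max(self.mx[2 * node], self.mx[2 * node + 1])

    def max_all(self):
        return self.mx[1]


def max_touched(segments):
    """Maximum number of segments touched by two vertical lines."""
    xs = sorted({x for seg in segments for x in seg})  # coordinate compression
    index = {x: i for i, x in enumerate(xs)}
    tree = MaxAddTree(len(xs))
    starts = [[] for _ in xs]
    ends = [[] for _ in xs]
    for a, b in segments:
        ia, ib = index[a], index[b]
        tree.update(ia, ib, 1)  # initially every segment is in the tree
        starts[ia].append((ia, ib))
        ends[ib].append((ia, ib))
    best = covering = 0
    for i in range(len(xs)):
        for ia, ib in starts[i]:  # these segments now cover the sweep point
            covering += 1
            tree.update(ia, ib, -1)
        best = max(best, covering + tree.max_all())
        for ia, ib in ends[i]:    # these segments no longer cover later points
            covering -= 1
            tree.update(ia, ib, 1)
    return best


print(max_touched([(2, 3), (1, 3), (1, 5), (3, 4), (4, 5)]))  # 5
print(max_touched([(1, 2), (1, 3), (2, 3), (1, 4), (1, 5)]))  # 5
print(max_touched([(1, 2), (3, 4), (5, 6)]))                  # 2
```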
An O(N*max(logN, M)) solution, where M is the medium segment size, implemented in Common Lisp: touching-segments.lisp.
The idea is to first calculate, from left to right, at every interesting point the number of segments that would be touched by a line there (open-left-to-right in the Lisp code). Cost: O(NlogN).
Then, from right to left, it calculates, again at every interesting point P, the best location for a line considering only segments fully to the right of P (open-right-to-left in the Lisp code). Cost: O(N*max(logN, M)).
Then it is just a matter of looking for the point where the sum of both values is largest. Cost: O(N).
The code is barely tested and may contain bugs. Also, I have not bothered to handle edge cases such as when the number of segments is zero.
The problem can be solved in O(Nlog(N)) time per test case.
Observe that there is an optimal placement of the two vertical lines in which each line goes through some segment endpoint.
Compress the segments' coordinates. More info at What is coordinate compression?
Build a sorted set X of segment endpoints.
Sort the segments [a_i,b_i] by a_i.
Let Q be a priority queue which stores the right endpoints of the segments processed so far.
Let T be a max interval tree built over the x-coordinates. Some useful reading at What are some sources (books, etc.) from where I can learn about Interval, Segment, Range trees?
For each segment, make an [a_i,b_i]-range increment-by-1 query to T. This allows finding the maximum number of segments covering some x in [a,b].
Iterate over the elements x of X. For each x, process the segments (not already processed) with x >= a_i. The processing includes pushing b_i to Q and making an [a_i,b_i]-range increment-by-(-1) query to T. After removing from Q all elements < x, A = Q.size is equal to the number of segments covering x. B = T.rmq(x + 1, M) returns the maximum number of segments that do not cover x and cover some fixed y > x. A + B is a candidate for the answer.
Source:
http://www.quora.com/What-are-the-intended-solutions-for-the-Touching-segments-and-the-Smallest-String-and-Regex-problems-from-the-Cisco-Software-Challenge-held-on-Hackerrank

Computation of Error rate in nearest neighbor classification algorithm

I am trying to find the optimal value of K for the K-Nearest Neighbor algorithm.
I have been running this classification method in Matlab for different numbers of class members,
but I need to calculate the error rate when using different values of K.
I am trying to use this idea as an example:
I have the following data set:
1 3 1
2 3 2
2 1 2
3 3 2
3 4 1
3 3 2
2 2 2
where the first column is the x axis, the second is the y axis, and the third is the class label,
and I need to classify a point (x,y) using the K-NN algorithm. I am using different values of K.
My question is: if I know that the point (4,1) is not included in the source dataset
but I know that it belongs to class label 1, how can I compute the error rate for a
certain value of K based on leave-one-out cross-validation?
Thank you a lot in advance
Regards
Rinadi
Leave-one-out cross-validation simply means that, given your model m, a training set T of size n and some evaluation metric (error measure) E, you proceed as follows:
1. For each point (x,y) from T:
2. You train your model m on T\(x,y) (all points but the one taken in 1.)
3. You check E( m , (x,y) ); for example, you check whether m is able to determine y given x correctly (then E=0) or not (then E=1)
4. You compute the mean of all E values across all points analyzed
As a result, you have an estimate of the mean generalization error: you checked how well your model, trained on the rest of the training set, can predict the label of one point.
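A minimal Python sketch of this procedure for k-NN, applied to the data set from the question. Here each row is (x, y, label), so the label plays the role of y in the answer's notation. The helper names `knn_predict` and `loocv_error` are hypothetical. Ties in distance and in votes are broken by the order of the rows, which is an arbitrary choice made for this sketch and can affect the result on a data set this small.

```python
from collections import Counter


def knn_predict(train, point, k):
    """Predict the label of point by majority vote among its k nearest rows."""
    def squared_distance(row):
        return (row[0] - point[0]) ** 2 + (row[1] - point[1]) ** 2

    nearest = sorted(train, key=squared_distance)[:k]  # stable sort keeps row order on ties
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


def loocv_error(data, k):
    """Fraction of points misclassified when each one is left out in turn."""
    errors = 0
    for i, (x, y, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # T without the held-out point
        if knn_predict(rest, (x, y), k) != label:
            errors += 1
    return errors / len(data)


data = [(1, 3, 1), (2, 3, 2), (2, 1, 2), (3, 3, 2), (3, 4, 1), (3, 3, 2), (2, 2, 2)]
for k in (1, 3, 5):
    print(k, loocv_error(data, k))
```

Comparing the printed error rates across values of k is one way to choose k. A separately known point such as (4,1) is not part of this estimate; it could be checked afterwards with `knn_predict(data, (4, 1), k)`.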

Eliminating symmetry from graphs

I have an algorithmic problem in which I have derived a transfer matrix between a lot of states. The next step is to exponentiate it, but it is very large, so I need to do some reductions on it. Specifically it contains a lot of symmetry. Below are some examples on how many nodes can be eliminated by simple observations.
My question is whether there is an algorithm to efficiently eliminate symmetry in digraphs, similarly to the way I've done it manually below.
In all cases the initial vector has the same value for all nodes.
In the first example we see that b, c, d and e all receive values from a and one of each other. Hence they will always contain an identical value, and we can merge them.
In this example we quickly spot, that the graph is identical from the point of view of a, b, c and d. Also for their respective sidenodes, it doesn't matter to which inner node it is attached. Hence we can reduce the graph down to only two states.
Update: Some people were, reasonably enough, not quite sure what was meant by "state transfer matrix". The idea here is that you can split a combinatorial problem up into a number of state types for each n in your recurrence. The matrix then tells you how to get from n-1 to n.
Usually you are only interested in the value of one of your states, but you need to calculate the others as well, so you can always get to the next level. In some cases, however, multiple states are symmetrical, meaning they will always have the same value. Obviously it's quite a waste to calculate all of these, so we want to reduce the graph until all nodes are "unique".
Below is an example of the transfer matrix for the reduced graph in example 1.
[S_a(n)]   [1 1 1]   [S_a(n-1)]
[S_f(n)] = [1 0 0] * [S_f(n-1)]
[S_B(n)]   [4 0 1]   [S_B(n-1)]
Any suggestions or references to papers are appreciated.
Brendan McKay's nauty (http://cs.anu.edu.au/~bdm/nauty/) is the best tool I know of for computing automorphisms of graphs. It may be too expensive to compute the whole automorphism group of your graph, but you might be able to reuse some of the algorithms described in McKay's paper "Practical Graph Isomorphism" (linked from the nauty page).
I'll just add an extra answer building on what userOVER9000 suggested, in case anybody else is interested.
Below is an example of using nauty on Example 2, through the dreadnaut tool.
$ ./dreadnaut
Dreadnaut version 2.4 (64 bits).
> n=8 d g -- Starting a new 8-node digraph
0 : 1 3 4; -- Entering edge data
1 : 0 2 5;
2 : 3 1 6;
3 : 0 2 7;
4 : 0;
5 : 1;
6 : 2;
7 : 3;
> cx -- Calling nauty
(1 3)(5 7)
level 2: 6 orbits; 5 fixed; index 2
(0 1)(2 3)(4 5)(6 7)
level 1: 2 orbits; 4 fixed; index 4
2 orbits; grpsize=8; 2 gens; 6 nodes; maxlev=3
tctotal=8; canupdates=1; cpu time = 0.00 seconds
> o -- Output "orbits"
0:3; 4:7;
Notice it suggests joining nodes 0:3 which are a:d in Example 2 and 4:7 which are e:h.
The nauty algorithm is not well documented, but the authors describe it as exponential worst case, n^2 average.
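If computing the automorphism group is too expensive, a cheaper alternative that may be worth trying is partition refinement (often called colour refinement). To be clear, this is a swapped-in technique and not nauty's algorithm. It repeatedly splits classes of nodes until all nodes in a class have the same number of neighbours in every class. With a uniform initial vector, nodes in the same final class then keep equal values at every step, so they can be merged. The partition can be coarser than the orbits that nauty reports. The sketch below assumes `neighbours[v]` lists the nodes whose values node v reads in each step; the function name `refine` is hypothetical.

```python
from collections import Counter


def refine(neighbours):
    """Return a dict mapping each node to the id of its class."""
    colour = {v: 0 for v in neighbours}  # start with all nodes in one class
    while True:
        # Signature: own class plus how many neighbours fall in each class.
        signature = {
            v: (colour[v], tuple(sorted(Counter(colour[u] for u in neighbours[v]).items())))
            for v in neighbours
        }
        ids = {}
        new_colour = {v: ids.setdefault(signature[v], len(ids)) for v in neighbours}
        # Classes can only split, so an unchanged count means the partition is stable.
        if len(ids) == len(set(colour.values())):
            return new_colour
        colour = new_colour


# The 8-node graph entered into dreadnaut above.
example2 = {
    0: [1, 3, 4], 1: [0, 2, 5], 2: [3, 1, 6], 3: [0, 2, 7],
    4: [0], 5: [1], 6: [2], 7: [3],
}
print(refine(example2))
# {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
```

On this graph the result agrees with the two orbits 0:3 and 4:7 printed by dreadnaut.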
Computing symmetries seems to be a bit of a second order problem. Taking just a,b,c and d in your second graph, the symmetry would have to be expressed
a(b,c,d) = b(a,d,c)
and all its permutations, or some such. Consider a second subgraph a', b', c', d' added to it. Again, we have the symmetries, but parameterised differently.
For computing people (rather than math people), could we express the problem like so?
Each graph node contains a set of letters. At each iteration, all of the letters in each node are copied to its neighbours by the arrows (some arrows take more than one iteration and can be treated as a pipe of anonymous nodes).
We are trying to find efficient ways of determining things such as
* what letters each set/node contains after N iterations.
* for each node the N after which its set no longer changes.
* what sets of nodes wind up containing the same sets of letters (equivalence class)
?

plane bombing problems- help

I'm practicing coding problems and am having trouble solving this one. Can you give me some tips on how to solve it, please?
The problem is taken from here:
https://www.ieee.org/documents/IEEEXtreme2008_Competitition_book_2.pdf
Problem 12: Cynical Times.
The problem is something like this (but do refer to above link of the source problem, it has a diagram!):
Your task is to find the sequence of points on the map that the bomber is expected to travel such that it hits all vital links. A link from A to B is vital when its absence isolates completely A from B. In other words, the only way to go from A to B (or vice versa) is via that link.
Due to enemy counter-attack, the plane may have to retreat at any moment, so the plane should head, at each moment, for the closest vital link possible, even if in the end the total distance grows larger.
Given all coordinates (the initial position of the plane and the nodes in the map) and the range R, you have to determine the sequence of positions in which the plane has to drop bombs.
This sequence should start (takeoff) and finish (landing) at the initial position. Except for the start and finish, all the other positions have to fall exactly in a segment of the map (i.e. it should correspond to a point in a non-hit vital link segment).
The coordinate system used will be UTM (Universal Transverse Mercator) northing and easting, which basically corresponds to a Euclidean perspective of the world (X=Easting; Y=Northing).
Input
Each input file will start with three floating point numbers indicating the X0 and Y0 coordinates of the airport and the range R. The second line contains an integer, N, indicating the number of nodes in the road network graph. Then, the next N (<10000) lines will each contain a pair of floating point numbers indicating the Xi and Yi coordinates (1 < i<=N). Notice that the index i becomes the identifier of each node. Finally, the last block starts with an integer M, indicating the number of links. Then the next M (<10000) lines will each have two integers, Ak and Bk (1 < Ak,Bk <=N; 0 < k < M) that correspond to the identifiers of the points that are linked together.
No two links will ever cross with each other.
Output
The program will print the sequence of coordinates (pairs of floating point numbers with exactly one decimal place), one per line, in the order that the plane should visit them (starting and ending at the airport).
Sample input 1
102.3 553.9 0.2
14
342.2 832.5
596.2 638.5
479.7 991.3
720.4 874.8
744.3 1284.1
1294.6 924.2
1467.5 659.6
1802.6 659.6
1686.2 860.7
1548.6 1111.2
1834.4 1054.8
564.4 1442.8
850.1 1460.5
1294.6 1485.1
17
1 2
1 3
2 4
3 4
4 5
4 6
6 7
7 8
8 9
8 10
9 10
10 11
6 11
5 12
5 13
12 13
13 14
Sample output 1
102.3 553.9
720.4 874.8
850.1 1460.5
102.3 553.9
Pre-process the input first, so you identify the choke points. Algorithms like Floyd-Warshall would help you.
Model the problem as a heuristic search problem: you can compute an MST that covers all choke points and take the sum of the costs of its edges as a heuristic.
As the commenters said, try to make concrete questions, either here or to the TA supervising your class.
Don't forget to mention where you got these hints.
The problem can be broken down into two parts.
1) Find the vital links.
These are nothing but the Bridges in the graph described. See the wiki page (linked to in the previous sentence), it mentions an algorithm by Tarjan to find the bridges.
2) Once you have the vital links, you need to find the smallest number of points which, given the radius of the bomb, will cover the links. For this, for each link, you create a region around it where dropping the bomb will destroy it. Now you form a graph of these regions (two regions are adjacent if they intersect). You probably need to find a minimum clique partition in this graph.
Haven't thought it through (especially part 2), but hope it helps.
And good luck in the contest!
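For part 1, a bridge-finding pass could be sketched in Python as below. It follows the usual discovery-time / low-link idea behind DFS-based bridge detection, written iteratively because the input allows up to roughly 10000 nodes, which could exceed Python's default recursion limit. The function name `find_bridges` is hypothetical, and nodes are assumed to be numbered 1..n as in the input format.

```python
def find_bridges(n, edges):
    """Return the sorted list of bridges of an undirected graph on nodes 1..n."""
    adj = [[] for _ in range(n + 1)]
    for eid, (a, b) in enumerate(edges):
        adj[a].append((b, eid))
        adj[b].append((a, eid))
    disc = [0] * (n + 1)  # discovery time, 0 means unvisited
    low = [0] * (n + 1)   # earliest discovery time reachable from the subtree
    bridges = []
    timer = 0
    for root in range(1, n + 1):
        if disc[root]:
            continue
        timer += 1
        disc[root] = low[root] = timer
        stack = [(root, -1, iter(adj[root]))]
        while stack:
            node, parent_edge, neighbours = stack[-1]
            descended = False
            for nxt, eid in neighbours:
                if eid == parent_edge:
                    continue  # do not go back over the edge we arrived by
                if disc[nxt]:
                    low[node] = min(low[node], disc[nxt])
                else:
                    timer += 1
                    disc[nxt] = low[nxt] = timer
                    stack.append((nxt, eid, iter(adj[nxt])))
                    descended = True
                    break
            if not descended:
                stack.pop()
                if stack:
                    parent = stack[-1][0]
                    low[parent] = min(low[parent], low[node])
                    if low[node] > disc[parent]:
                        bridges.append(tuple(sorted((parent, node))))
    return sorted(bridges)


# Links from sample input 1.
links = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (6, 7), (7, 8), (8, 9),
         (8, 10), (9, 10), (10, 11), (6, 11), (5, 12), (5, 13), (12, 13), (13, 14)]
print(find_bridges(14, links))  # [(4, 5), (4, 6), (13, 14)]
```

The three bridges found here are consistent with the sample output, whose two bombing positions are the coordinates of nodes 4 and 13.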
I think Moron is right about the first part, but on the second part...
The problem description does not say anything about the "smallest number of points". It says that the plane flies to the closest vital link.
So I think part 2 will be much simpler:
Find the closest non-hit segment to the current location.
Travel to the closest point on the closest segment.
Bomb the current location (remove all segments intersecting a circle of radius R around it).
Repeat until there are no non-hit vital links left.
This straightforward algorithm has a complexity of O(N*N), but this should be sufficient considering the input constraints.
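The greedy loop above could look roughly like this in Python, assuming the vital links have already been identified and are given as pairs of endpoint coordinates. A segment counts as hit when its distance to the bombing position is at most R. The names `closest_point` and `bombing_route` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def closest_point(p, a, b):
    """Closest point to p on the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return (ax, ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def bombing_route(start, radius, vital_segments):
    """Greedy route: always fly to the closest point of the closest non-hit link."""
    route = [start]
    position = start
    remaining = list(vital_segments)
    while remaining:
        target = min(remaining, key=lambda s: math.dist(position, closest_point(position, *s)))
        position = closest_point(position, *target)
        route.append(position)
        # Drop the target and every other link within the bomb radius.
        remaining = [
            s for s in remaining
            if s is not target and math.dist(position, closest_point(position, *s)) > radius
        ]
    route.append(start)
    return route


# Vital links of sample input 1 (4-5, 4-6 and 13-14), given by their coordinates.
vital = [
    ((720.4, 874.8), (744.3, 1284.1)),
    ((720.4, 874.8), (1294.6, 924.2)),
    ((850.1, 1460.5), (1294.6, 1485.1)),
]
for x, y in bombing_route((102.3, 553.9), 0.2, vital):
    print(f"{x:.1f} {y:.1f}")
```

For the sample this prints the airport, then 720.4 874.8, then 850.1 1460.5, then the airport again, matching sample output 1.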
