Computation of Error rate in nearest neighbor classification algorithm - algorithm

I am trying to find the optimal value of K for the K Nearest Neighbor algorithm.
I have been running this classification method in Matlab for different numbers of class members,
but I need to calculate the error rate for different values of K.
I am trying to use this idea as an example:
I have the following data set:
1 3 1
2 3 2
2 1 2
3 3 2
3 4 1
3 3 2
2 2 2
Here the first column is the x coordinate, the second is the y coordinate, and the third is the class label.
I need to classify a point (x,y) using the K-NN algorithm, and I am using different values of K.
My question is: suppose I know that the point (4,1) is not included in the source dataset,
but I know that it belongs to class 1. How can I compute the error rate for a
given value of K using leave-one-out cross-validation?
Thank you a lot in advance
Regards
Rinadi

Leave-one-out cross validation simply means that, given your model m, a training set T of size n, and some evaluation metric (error measure) E, you proceed as follows:
1. For each point (x,y) from T:
   a. Train your model m on T\(x,y) (all points except the one taken in step 1).
   b. Compute E( m , (x,y) ); for example, check whether m determines y correctly given x (then E=0) or not (then E=1).
2. Compute the mean of all E values across all points analyzed.
The result is an estimate of the mean generalization error: you checked how well your model predicts the label of one point when trained on the rest of the training set. To choose K, you can repeat this procedure for each candidate K and compare the resulting error rates.
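The procedure above can be sketched in Python for K-NN. This is a minimal illustration, not the asker's Matlab code: the function names `knn_predict` and `loo_error` are hypothetical, squared Euclidean distance is assumed, and ties in distance or in the vote are broken by the order of the points in the list.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Predict a label for query=(x, y) by majority vote of the k nearest points.

    train is a list of (x, y, label) tuples. Ties are broken by list order,
    which is an arbitrary choice made for this sketch.
    """
    qx, qy = query
    neighbors = sorted(train, key=lambda p: (p[0] - qx) ** 2 + (p[1] - qy) ** 2)[:k]
    votes = Counter(label for _, _, label in neighbors)
    return votes.most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error rate: the mean of the 0/1 errors over all points."""
    errors = 0
    for i, (x, y, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]          # T without the held-out point
        if knn_predict(rest, (x, y), k) != label:
            errors += 1                          # E = 1 for this point
    return errors / len(data)

# The data set from the question: (x, y, label)
question_data = [(1, 3, 1), (2, 3, 2), (2, 1, 2), (3, 3, 2),
                 (3, 4, 1), (3, 3, 2), (2, 2, 2)]

if __name__ == "__main__":
    for k in (1, 3, 5):
        print(k, loo_error(question_data, k))
```

A separately known point such as (4,1) with label 1 is not part of the leave-one-out loop; it could be checked on its own with `knn_predict(question_data, (4, 1), k)`.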

Related

Concentric spheres

Two lists are provided, containing the radii of upper hemispheres and lower hemispheres. The first list consists of N upper hemispheres indexed 1 to N and the second has M lower hemispheres indexed 1 to M. A sphere of radius R can be made by taking one upper half of radius R and one lower half of radius R. Also, you can put a sphere into a bigger one and create a sequence of nested concentric spheres. But you can't put two or more spheres directly into another one.
If there is a sequence of (D+1) nested spheres, we call this sequence a D-sequence.
Find out how many different X-sequences are possible (1 <= X <= C). An X-sequence is different from another if the index of any of the hemispheres used in one X-sequence is different from the other.
INPUT
The first line contains three integers: N denoting the number of upper sphere halves, M denoting the number of lower sphere halves, and C.
The second line contains N space-separated integers denoting the radii of upper hemispheres.
The third line contains M space-separated integers denoting the radii of lower hemispheres.
OUTPUT
Output a single line containing C space-separated integers, the i-th being the number of ways to build an i-sequence, modulo 1000000007.
Example
Input
3 4 3
1 2 3
1 1 3 2
Output
5 2 0
I am looking for the radii that appear in both the upper and the lower hemisphere lists, so that they can form a sphere, and then counting the spheres for each such radius by comparing the counts in both radii lists.
Then, for each C, the sum of products of the counts of C+1 elements yields the answer.
How can I calculate the above efficiently, or is there another approach?
Guys this is my first answer. Spare me the whip for now as i am here to learn.
You first find the number of spheres possible for each radius.
No. of spheres: 2 1 1
Having radii: 1 2 3
Now, since we can fit a sphere of radius r inside a sphere of radius R whenever R > r, all we need to do is find the number of strictly increasing subsequences of length 2, 3, ..., C+1 in the list of all possible spheres formed.
List of possible spheres: [1,1*,2,3] (* is used to mark the second sphere of radius 1)
Consider the 1-sequence (D = 1): it has 2 spheres. Try finding the number of increasing subsequences of length 2 in the above list.
They are:
[1,2], [1*,2], [1,3], [1*,3], [2,3]
Hence the answer is 5.
Get it??
Now, how to solve it:
It can be done using DP. The naive solution has complexity O(n^2 * constant).
You may follow along the lines provided in the following link: Dp solution.
It is worth mentioning that faster methods exist which use a BIT (binary indexed tree), segment trees, etc.
It is similar to this SPOJ problem.
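As a rough sketch of one way to do the counting, the Python below groups spheres by radius instead of scanning the expanded list. This is a swapped-in variant of the DP described above, not the linked solution: since nested spheres need strictly increasing radii, a chain picks at most one sphere per radius, so `dp[j]` (the number of chains of j spheres) can be updated once per distinct radius. The function name `count_sequences` is hypothetical.

```python
from collections import Counter

MOD = 1_000_000_007

def count_sequences(upper, lower, c):
    """Return [ways for 1-sequence, ..., ways for c-sequence] modulo MOD.

    An X-sequence consists of X+1 nested spheres with strictly increasing radii.
    """
    up, low = Counter(upper), Counter(lower)
    # dp[j] = number of ways to choose a chain of exactly j spheres so far
    dp = [0] * (c + 2)
    dp[0] = 1
    for radius in sorted(up):
        # Each upper half can pair with each lower half of the same radius
        ways = (up[radius] * low[radius]) % MOD
        if ways == 0:
            continue
        # Go downwards so each radius is used at most once per chain
        for j in range(c + 1, 0, -1):
            dp[j] = (dp[j] + dp[j - 1] * ways) % MOD
    return dp[2:c + 2]

if __name__ == "__main__":
    # Sample from the question: N=3, M=4, C=3
    print(*count_sequences([1, 2, 3], [1, 1, 3, 2], 3))
```

This runs in time proportional to the number of distinct radii times C, after the counting step.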

Touching segments

Can anyone please suggest an algorithm for this?
You are given starting and the ending points of N segments over the x-axis.
How many of these segments can be touched, even on their edges, by exactly two lines perpendicular to them?
Sample Input :
3
5
2 3
1 3
1 5
3 4
4 5
5
1 2
1 3
2 3
1 4
1 5
3
1 2
3 4
5 6
Sample Output :
Case 1: 5
Case 2: 5
Case 3: 2
Explanation :
Case 1: We will draw two lines (parallel to Y-axis) crossing X-axis at point 2 and 4. These two lines will touch all the five segments.
Case 2: We can touch all the segments even with one line crossing X-axis at 2.
Case 3: It is not possible to touch more than two segments in this case.
Constraints:
1 ≤ N ≤ 10^5
0 ≤ a < b ≤ 10^9
Let's assume that we have a data structure that supports the following operations efficiently:
Add a segment.
Delete a segment.
Return the maximum number of segments that cover one point (that is, the "best" point).
If we have such a structure, we can solve the initial problem efficiently in the following manner:
Let's create an array of events (one event for the start of each segment and one for the end) and sort it by x-coordinate.
Add all segments to the magical data structure.
Iterate over all events and do the following. When a segment starts, add one to the number of currently covered segments and remove the segment from the data structure. When a segment ends, subtract one from the number of currently covered segments and add the segment back to the data structure. After each event, update the answer with the number of currently covered segments (it shows how many segments are covered by the point which corresponds to the current event) plus the maximum returned by the data structure described above (it shows how we can choose another point in the best possible way). Since touching an endpoint counts, when several events share a coordinate it is safest to process all starts at that coordinate before updating the answer and before processing the ends.
If this data structure can perform all given operations in O(log n), then we have an O(n log n) solution (we sort the events and make one pass over the sorted array, making a constant number of queries to this data structure for each event).
So how can we implement this data structure? Well, a segment tree works fine here. Adding a segment is adding one to all elements in a specific range. Removing a segment is subtracting one from all elements in a specific range. Getting the maximum is just a standard maximum operation on a segment tree. So we need a segment tree that supports two operations: add a constant to a range and get the maximum for the entire tree. Each can be done in O(log n) time per query.
One more note: a standard segment tree requires coordinates to be small. We may assume that they never exceed 2 * n (if that is not the case, we can compress them).
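The sweep above might look roughly like the following Python sketch. The names `MaxAddTree` and `max_touched` are hypothetical, the input is assumed to be a non-empty list of (a, b) pairs, and reading the multi-test-case input format is left out. The tree stores a pending add per node and never pushes it down, which is enough when only the global maximum is queried.

```python
class MaxAddTree:
    """Segment tree over positions 0..n-1: range add and global maximum."""

    def __init__(self, n):
        self.n = n
        self.mx = [0] * (4 * n)    # max of the node's range, including its own add
        self.add_ = [0] * (4 * n)  # add applied to the node's whole range

    def add(self, lo, hi, val, node=1, l=0, r=None):
        if r is None:
            r = self.n - 1
        if hi < l or r < lo:
            return
        if lo <= l and r <= hi:
            self.mx[node] += val
            self.add_[node] += val
            return
        mid = (l + r) // 2
        self.add(lo, hi, val, 2 * node, l, mid)
        self.add(lo, hi, val, 2 * node + 1, mid + 1, r)
        self.mx[node] = self.add_[node] + max(self.mx[2 * node], self.mx[2 * node + 1])

    def max_all(self):
        return self.mx[1]


def max_touched(segments):
    """Maximum number of segments touched by two vertical lines."""
    # Coordinate compression: only endpoints matter as line positions
    xs = sorted({x for seg in segments for x in seg})
    index = {x: i for i, x in enumerate(xs)}
    segs = [(index[a], index[b]) for a, b in segments]

    tree = MaxAddTree(len(xs))
    starts = [[] for _ in xs]
    ends = [[] for _ in xs]
    for a, b in segs:
        tree.add(a, b, 1)          # initially every segment is in the structure
        starts[a].append((a, b))
        ends[b].append((a, b))

    best = covering = 0
    for p in range(len(xs)):
        for a, b in starts[p]:     # segments that begin covering point p
            covering += 1
            tree.add(a, b, -1)
        # first line at p, second line at the best point for the other segments
        best = max(best, covering + tree.max_all())
        for a, b in ends[p]:       # segments that stop covering after p
            covering -= 1
            tree.add(a, b, 1)
    return best


if __name__ == "__main__":
    print(max_touched([(2, 3), (1, 3), (1, 5), (3, 4), (4, 5)]))
```

Starts at a coordinate are handled before the answer update and ends after it, so a line through a shared endpoint counts both segments.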
An O(N*max(logN, M)) solution, where M is the medium segment size, implemented in Common Lisp: touching-segments.lisp.
The idea is to first calculate, from left to right at every interesting point, the number of segments that would be touched by a line there (open-left-to-right in the Lisp code). Cost: O(NlogN).
Then, from right to left, it calculates, again at every interesting point P, the best location for a line considering only segments fully to the right of P (open-right-to-left in the Lisp code). Cost: O(N*max(logN, M)).
Then it is just a matter of looking for the point where the sum of both values is highest. Cost: O(N).
The code is barely tested and may contain bugs. Also, I have not bothered to handle edge cases such as when the number of segments is zero.
The problem can be solved in O(Nlog(N)) time per test case.
Observe that there is an optimal placement of two vertical lines each of which go through some segment endpoints
Compress segments' coordinates. More info at What is coordinate compression?
Build a sorted set of segment endpoints X
Sort segments [a_i,b_i] by a_i
Let Q be a priority queue which stores right endpoints of segments processed so far
Let T be a max interval tree built over x-coordinates. Some useful reading at What are some sources (books, etc.) from where I can learn about Interval, Segment, Range trees?
For each segment, make an [a_i,b_i]-range increment-by-1 query to T. This allows finding the maximum number of segments covering some x in [a,b].
Iterate over the elements x of X. For each x, process the segments (not already processed) with x >= a_i. The processing includes pushing b_i to Q and making an [a_i,b_i]-range increment-by-(-1) query to T. After removing from Q all elements < x, A = Q.size is equal to the number of segments covering x. B = T.rmq(x + 1, M) returns the maximum number of segments that do not cover x and cover some fixed y > x. A + B is a candidate for the answer.
Source:
http://www.quora.com/What-are-the-intended-solutions-for-the-Touching-segments-and-the-Smallest-String-and-Regex-problems-from-the-Cisco-Software-Challenge-held-on-Hackerrank

Archers and Pikemen (CODEMASTERS) (Codechef)

There was a recent contest on codechef named CODEMASTER (which has just ended a few minutes back, so I can put this question on a forum now, I believe :P ).
The question is Archers and Pikemen.
This is the problem statement:
You are being attacked by hostile enemy forces, with knights and
swordsmen charging at you. Being the commander of your unit, you have
been given the task of arranging your elite archers and pikemen in a
special formation.
The archers and pikemen must stand between two flag posts in a
straight line (no soldier can stand beyond the flags). Each archer
must have at least two pikemen by his side (one to his left and one to
his right) such that he is at equal distances from both of them. (A
pikeman may be shared between two archers).
The archers stand at given fixed positions and the separations between
them may not be equal. You need to position your troops in the given
formation using the minimum number of pikemen.
Assume that the minimum distance between a pikeman and an archer or a
pikeman and a flag is 1 unit. The minimum distance between two
archers is 2 units.
Input
The first line of the input contains an integer T denoting the number
of test cases. The description of T test cases follows.
The second line contains an integer N denoting the number of
separations.
The following N lines each contain an integer x, which is the
separation of the current archer from the previous one.
The first value of x is the separation of the first archer from the
first flag. The last value of x is the separation between the last
archer and the second flag.
Output
For each test case, output a single line containing the minimum number
of pikemen required.
Constraints
1 ≤ T ≤ 1000
1 ≤ N ≤ 1000
2 ≤ x ≤ 1000
Example
Input:
2
3
4
4
2
4
2
2
2
2
Output:
3
4
Explanation
Example case 1: A possible formation can be :
F---1---p---3---A---3---p---1---A---1---p---1--- F
Example case 2: A possible formation can be :
F---1---p---1---A---1---p---1---A---1---p---1---A---1---p---1---F
F = flag A = archer p = pikeman
---d--- = distance between pikeman and archer/flag
My first instinct was that the number of pikemen is equal to the number of gaps, but then I realized that there can be cases where we have to place two pikemen between two archers, since the distance to the next archer on the right might be less than the distance between the previous two archers.
Can someone please explain the algorithm for this question?
The problem link: http://www.codechef.com/CDMS2014/problems/CM1401
Link to one of the accepted solutions: http://www.codechef.com/viewsolution/5166485
Please help me understand this problem.
Thanks in advance.. ;)

Example data set for the k-Nearest Neighbors algorithm?

What is an example of a data set one would use with the k-Nearest Neighbors algorithm?
I understand the concept but I am unsure about what kind of data one would use for the x, y coordinates.
Can one provide an example of a dataset (with x, y coordinates) for the nearest-neighbor-k algorithm?
NN search, put simply, works like this:
You have a database of elements (here you have 2-dimensional points, with
coordinates x and y).
A query comes in, which is of the same type as the elements of the
database, thus a 2D point in your case.
The goal is to find the point in the database that is most similar
(closest) to the query point.
There are many algorithms which allow us to avoid searching the whole database and to search only the part that is of interest for the query, thus answering the query efficiently.
Example:
The database has six 2D points (this is the dataset you are referring to):
0 0
1 1
2 2
3 3
4 4
5 5
A query 2D point comes:
q = (9, 9)
The answer is the closest point to q, which in this example is (5, 5).
In a kNN search, the query asks for the k most similar elements of the database, which in our example are the k points of the database presented above that are closest to the query point q.
So, for k = 3, for example, the answer should be:
5 5 // the 1st closest point to q
4 4 // the 2nd closest point to q
3 3 // the 3rd closest point to q
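The example above can be reproduced with a brute-force search that simply scans the whole database. This is a minimal Python sketch assuming Euclidean distance; the function name `k_nearest` is hypothetical, and it does not use any of the faster index structures mentioned earlier.

```python
import heapq

def k_nearest(points, query, k):
    """Return the k points closest to query, nearest first (brute force)."""
    def sq_dist(p):
        # Squared Euclidean distance; works for any number of dimensions
        return sum((a - b) ** 2 for a, b in zip(p, query))
    return heapq.nsmallest(k, points, key=sq_dist)

database = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]

if __name__ == "__main__":
    for point in k_nearest(database, (9, 9), 3):
        print(*point)
```

Because the distance is computed coordinate by coordinate, the same function accepts points with more than two features.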
You do not understand the concept.
k-NN isn't limited to datasets with only 2 dimensional points (with x & y coordinates).
Any dataset could be used with k-NN, regardless of the number of features - and you could use many different distance metrics (even ones that are not technically valid metrics).

how to apply dynamic programming in finding the minimum cost to create the tower in a field

You are given an N x M rectangular field with its bottom-left point at the origin. You have to construct a tower with a square base in the field. There are trees in the field, each with an associated cost to uproot it. You have to place the tower so that the total cost of the trees that must be uprooted is minimized.
Example Input:
N = 4
M = 3
Length of side of Tower = 1
Number of Trees in the field = 4
1 3 5
3 3 4
2 2 1
2 1 2
The 4 rows in the input are the coordinates of the trees, with the cost of uprooting as the third integer.
A tree coinciding with the edge of the tower is considered to be inside the tower and has to be uprooted as well.
I'm having trouble formulating the dynamic programming relation for this problem.
Thanks
It sounds like your problem boils down to: find the KxK subblock of an MxN matrix with the smallest sum. You can solve this problem efficiently (proportional to the size of your input) by using an integral transform (the same idea is commonly called a summed-area table or 2D prefix sum). Of course, this doesn't necessarily help you with your dynamic programming issue: I'm not sure this solution is equivalent to any dynamic programming formulation.
At any rate, for each index pair (a,b) of your original matrix M, compute an "integral transform" matrix I[a,b] = sum[i<=a, j<=b](M[i,j]). This is computable by traversing the matrix in order, referring to the values computed for the previous row and column: I[a,b] = M[a,b] + I[a-1,b] + I[a,b-1] - I[a-1,b-1]. (With a bit of thought, you can also do this efficiently with a sparse matrix.)
Then, you can compute the sum of any subblock (a1..a2, b1..b2) in constant time as I[a2,b2] - I[a1-1,b2] - I[a2,b1-1] + I[a1-1,b1-1]. Iterating through all KxK subblocks to find the smallest sum will then take time proportional to the size of your original matrix also.
Since the original problem is phrased as a list of integral coordinates (and, presumably, expects the tower location to be output as an integral coordinate pair), you likely do need to represent your field as a sparse matrix for an efficient solution -- this involves sorting your trees' coordinates in lexicographic order (e.g. first by x-coordinate, then by y-coordinate). Note that this sorting step may take O(L log L) for input of size L, dominating the following steps, which take only O(L) in the size of the input.
Also note that, due to the problem specifying that "trees coinciding with the edge of the tower are uprooted...", a tower with edge length K actually corresponds to a (K+1)x(K+1) subblock of grid points.
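A small Python sketch of this approach follows. It uses a dense matrix rather than the sparse representation discussed above, and it rests on several assumptions not stated in the question: trees sit on integer grid points with 0 <= x <= N and 0 <= y <= M, the tower's corners are at integer coordinates, and the tower must lie entirely inside the field. The function name `min_uproot_cost` is hypothetical.

```python
def min_uproot_cost(n, m, k, trees):
    """Minimum total uprooting cost over all placements of a k-by-k tower.

    trees is a list of (x, y, cost). A tower with lower-left corner (x0, y0)
    covers grid points x0..x0+k and y0..y0+k inclusive (edges count).
    """
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for x, y, c in trees:
        cost[x][y] += c

    # prefix[a][b] = sum of cost[i][j] for i < a and j < b (shifted by one
    # so that the first row and column are zero and need no special case)
    prefix = [[0] * (m + 2) for _ in range(n + 2)]
    for a in range(n + 1):
        for b in range(m + 1):
            prefix[a + 1][b + 1] = (cost[a][b] + prefix[a][b + 1]
                                    + prefix[a + 1][b] - prefix[a][b])

    best = None
    for x0 in range(n - k + 1):
        for y0 in range(m - k + 1):
            x1, y1 = x0 + k, y0 + k  # inclusive upper corner
            total = (prefix[x1 + 1][y1 + 1] - prefix[x0][y1 + 1]
                     - prefix[x1 + 1][y0] + prefix[x0][y0])
            if best is None or total < best:
                best = total
    return best

if __name__ == "__main__":
    trees = [(1, 3, 5), (3, 3, 4), (2, 2, 1), (2, 1, 2)]
    print(min_uproot_cost(4, 3, 1, trees))
```

Building the table and scanning all placements both take time proportional to the field area, which is only practical when N*M is small; for large fields the sparse variant would be needed.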
