I don't fully understand how the triangular inequality is used to optimise distance calculations in KNN classification.
I had written a python script referring the steps mentioned below
Calculate the distance between each training pixel to the other.
For each test sample
Calculate the distance from the first training sample as dn. This would be the current minimum distance.
Calculate the distance from the second training sample(p) as dp.
If dp < dn assign dn =dp
For each remaining training sample(c)
If distance between the sample c and sample p measured as dcp meets
dp - dn < dcp < dp + dn
Calculate distance from test sample to the sample c as dp
If dp < dn, assign: dn = dp
Else, skip this training sample.
Stop if there are no more training samples
The class to which n belongs is the estimate.
Python Script:
def get_distance(p1 = (0, 0), p2 = (0, 0)):
return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])
def algorithm(train_set, new_point):
d_n = get_distance(new_point, train_set[0])
d_p = get_distance(new_point, train_set[1])
min_index = 0
if d_p < d_n:
d_n = d_p
min_index = 1
for c in range(2, len(train_set)):
dcp = get_distance(train_set[min_index], train_set[c])
if d_p - d_n < dcp < d_p + d_n:
d_p = get_distance(new_point, train_set[c])
if d_p < d_n:
d_n = d_p
min_index = c
print(train_set[min_index], d_n)
train_set = [
(0, 1, 'A'),
(1, 1, 'A'),
(2, 5, 'B'),
(1, 8, 'A'),
(5, 3, 'C'),
(4, 2, 'C'),
(3, 2, 'A'),
(1, 7, 'B'),
(4, 8, 'B'),
(4, 0, 'A'),
]
for new_point in train_set:
# Checking the distances from the points within training set iteself: min distance = 0, used for validation
result_point = min(train_set, key = lambda x : get_distance(x, new_point))
print(result_point, get_distance(result_point, new_point))
algorithm(train_set, new_point)
print('----------')
But it doesn't give the required result for 1 point.
Is my understanding of the optimization wrong?
Thank you in advance for any help.
Related
Given a list of paired numbers [(13, 4), (8, 12), (8, 4), (13, 2)] where each number appears exactly once or twice, what's a good algorithm to create a "linked list", where tuples are connected if they shared a number?
Linking the above tuples would return either (2, 13)-(13, 4)-(4, 8)-(8, 12) or just (2-13-4-8-12) (or either the reverse order, the order in a pair doesn't matter).
(As a second question: is there a name for this type of problem?)
Provided that the input defines a graph that is a single path, you could use this algorithm. I used Python:
from collections import defaultdict
# The function that takes a list of pairs as input and returns the path/chain
def chain(pairs):
# create graph
edges = defaultdict(set) # each node will be associated with a set of neighbors
for a, b in pairs:
edges[a].add(b)
edges[b].add(a) # define both the forward as backward edge
# find a node with just one edge
start = pairs[0][0] # default node
for node, second in edges.items():
if len(second) == 1:
start = node
break
# start at that node and follow the edges
result = [start]
prev = None
while len(result) <= len(pairs):
neighbors = edges[start]
neighbors.discard(prev) # discard the back referencing edge
prev = start
start = next(iter(neighbors)) # get the only other neighbor
result.append(start)
return result
pairs = [(13, 4), (8, 12), (8, 4), (13, 2)]
print(chain(pairs))
Here is another possible implementation which avoids using graphs, and instead joins edges to sub-chains, until they form a single long chains:
def chain(edges):
from collections import deque
link = lambda c, e: (c.appendleft if c[0] == a else c.append)
chains = {}
for a, b in edges:
if a in chains and b in chains:
c1 = chains.pop(a)
c2 = chains.pop(b)
if c1[0] == a: c1.reverse()
if c2[0] != b: c2.reverse()
c1.extend(c2)
chains[c1[-1]] = c1
elif a in chains:
chain = chains.pop(a)
link(chain, a)(b)
elif b in chains:
chain = chains.pop(b)
link(chain, b)(a)
else:
chains[a] = chains[b] = deque([a, b])
return set(map(tuple, chains.values())).pop()
edges = [(13, 4), (8, 12), (8, 4), (13, 2)]
path = chain(edges)
This is based on trincot's answer, which avoids mutating the graph:
def make_graph(edges):
graph = defaultdict(set)
for a, b in pairs:
edges[a].add(b)
edges[b].add(a
return graph
def find_endpoints(edges):
from collections import Counter
from itertools import chain
return [v for v, count in Counter(chain(*edges)).items() if count == 1]
def find_path(graph, start, end):
chain = [start, next(iter(graph[start]))]
while chain[-1] != end:
for neighbour in graph[chain[-1]]:
if neighbour != chain[-2]:
chain.append(neighbour)
break
return chain
edges = [(13, 4), (8, 12), (8, 4), (13, 2)]
graph = make_graph(edges)
endpoints = find_endpoints(edges)
path = find_path(graph, *endpoints)
Within a super-region S, there are k small subregions. The number k can be up to 200. There may be overlap between subregions. I have millions of regions S.
For each super-region, my goal is to find out all combinations in which there are 2 or more non-overlapped subregions.
Here is an example:
Super region: 1-100
Subregions: 1-8, 2-13, 9-18, 15-30, 20-35
Goal:
Combination1: 1-8, 9-18
Combination2: 1-8, 20-35
Combination3: 1-8, 9-18, 20-35
Combination4: 1-8, 15-30
...
Number of subsets might be exponential (max 2^k), so there is nothing wrong to traverse all possible independent subsets with recursion. I've used linear search of the next possible interval, but it is worth to exploit binary search.
def nonovl(l, idx, right, ll):
if idx == len(l):
if ll:
print(ll)
return
#find next non-overlapping interval without using l[idx]
next = idx + 1
while next < len(l) and right >= l[next][0]:
next += 1
nonovl(l, next, right, ll)
#find next non-overlapping interval after using l[idx]
next = idx + 1
right = l[idx][1]
while next < len(l) and right >= l[next][0]:
next += 1
nonovl(l, next, right, ll + str(l[idx]))
l=[(1,8),(2,13),(9,18),(15,30),(20,35)]
l.sort()
nonovl(l, 0, -1, "")
(20, 35)
(15, 30)
(9, 18)
(9, 18)(20, 35)
(2, 13)
(2, 13)(20, 35)
(2, 13)(15, 30)
(1, 8)
(1, 8)(20, 35)
(1, 8)(15, 30)
(1, 8)(9, 18)
(1, 8)(9, 18)(20, 35)
I found this problem in a programming forum Ohjelmointiputka:
https://www.ohjelmointiputka.net/postit/tehtava.php?tunnus=ahdruu and
https://www.ohjelmointiputka.net/postit/tehtava.php?tunnus=ahdruu2
Somebody said that there is a solution found by a computer, but I was unable to find a proof.
Prove that there is a matrix with 117 elements containing the digits such that one can read the squares of the numbers 1, 2, ..., 100.
Here read means that you fix the starting position and direction (8 possibilities) and then go in that direction, concatenating the numbers. For example, if you can find for example the digits 1,0,0,0,0,4 consecutively, you have found the integer 100004, which contains the square numbers of 1, 2, 10, 100 and 20, since you can read off 1, 4, 100, 10000, and 400 (reversed) from that sequence.
But there are so many numbers to be found (100 square numbers, to be precise, or 81 if you remove those that are contained in another square number with total 312 digits) and so few integers in a matrix that you have to put all those square numbers so densely that finding such a matrix is difficult, at least for me.
I found that if there is such a matrix mxn, we may assume without loss of generalty that m<=n. Therefore, the matrix must be of the type 1x117, 3x39 or 9x13. But what kind of algorithm will find the matrix?
I have managed to do the program that checks if numbers to be added can be put on the board. But how can I implemented the searching algorithm?
# -*- coding: utf-8 -*-
# Returns -1 if can not put and value how good a solution is if can be put. Bigger value of x is better.
def can_put_on_grid(grid, number, start_x, start_y, direction):
# Check that the new number lies inside the grid.
x = 0
if start_x < 0 or start_x > len(grid[0]) - 1 or start_y < 0 or start_y > len(grid) - 1:
return -1
end = end_coordinates(number, start_x, start_y, direction)
if end[0] < 0 or end[0] > len(grid[0]) - 1 or end[1] < 0 or end[1] > len(grid) - 1:
return -1
# Test if new number does not intersect any previous number.
A = [-1,-1,-1,0,0,1,1,1]
B = [-1,0,1,-1,1,-1,0,1]
for i in range(0,len(number)):
if grid[start_x + A[direction] * i][start_y + B[direction] * i] not in ("X", number[i]):
return -1
else:
if grid[start_x + A[direction] * i][start_y + B[direction] * i] == number[i]:
x += 1
return x
def end_coordinates(number, start_x, start_y, direction):
end_x = None
end_y = None
l = len(number)
if direction in (1, 4, 7):
end_x = start_x - l + 1
if direction in (3, 6, 5):
end_x = start_x + l - 1
if direction in (2, 0):
end_x = start_x
if direction in (1, 2, 3):
end_y = start_y - l + 1
if direction in (7, 0, 5):
end_y = start_y + l - 1
if direction in (4, 6):
end_y = start_y
return (end_x, end_y)
if __name__ == "__main__":
A = [['X' for x in range(13)] for y in range(9)]
numbers = [str(i*i) for i in range(1, 101)]
directions = [0, 1, 2, 3, 4, 5, 6, 7]
for i in directions:
C = can_put_on_grid(A, "10000", 3, 5, i)
if C > -1:
print("One can put the number to the grid!")
exit(0)
I also found think that brute force search or best first search is too slow. I think there might be a solution using simulated annealing, genetic algorithm or bin packing algorithm. I also wondered if one can apply Markov chains somehow to find the grid. Unfortunately those seems to be too hard for me to implemented at current skills.
There is a program for that in https://github.com/minkkilaukku/square-packing/blob/master/sqPackMB.py . Just change M=9, N=13 from the lines 20 and 21.
I have a sparse matrix that has been exported to this format:
(1, 3) = 4
(0, 5) = 88
(6, 0) = 100
...
Strings are stored into a Trie data structure. The numbers in the previous exported sparse matrix correspond to the result of the lookup on the Trie.
Lets say the word "stackoverflow" is mapped to number '0'. I need to iterate the exported sparse matrix where the first element is equals to '0' and find the highest value.
For example:
(0, 1) = 4
(0, 3) = 8
(0, 9) = 100 <-- highest value
(0, 9) is going to win.
What would be the best implementation to store the exported sparse matrix?
In general, what would be the best approach (data structure, algorithm) to handle this functionality?
Absent memory or dynamism constraints, probably the best approach is to slurp the sparse matrix into a map from first number to the pairs ordered by value, e.g.,
matrix_map = {} # empty map
for (first_number, second_number, value) in matrix_triples:
if first_number not in matrix_map:
matrix_map[first_number] = [] # empty list
matrix_map[first_number].append((second_number, value))
for lst in matrix_map.values():
lst.sort(key=itemgetter(1), reverse=True) # sort by value descending
Given a matrix like
(0, 1) = 4
(0, 3) = 8
(0, 5) = 88
(0, 9) = 100
(1, 3) = 4
(6, 0) = 100,
the finished product looks like this:
{0: [(9, 100), (5, 88), (3, 8), (1, 4)],
1: [(3, 4)],
6: [(0, 100)]}.
Given some sets (or lists) of numbers, I would like to iterate through the cross product of these sets in the order determined by the sum of the returned numbers. For example, if the given sets are { 1,2,3 }, { 2,4 }, { 5 }, then I would like to retrieve the cross-products in the order
<3,4,5>,
<2,4,5>,
<3,2,5> or <1,4,5>,
<2,2,5>,
<1,2,5>
I can't compute all the cross-products first and then sort them, because there are way too many. Is there any clever way to achieve this with an iterator?
(I'm using Perl for this, in case there are modules that would help.)
For two sets A and B, we can use a min heap as follows.
Sort A.
Sort B.
Push (0, 0) into a min heap H with priority function (i, j) |-> A[i] + B[j]. Break ties preferring small i and j.
While H is not empty, pop (i, j), output (A[i], B[j]), insert (i + 1, j) and (i, j + 1) if they exist and don't already belong to H.
For more than two sets, use the naive algorithm and sort to get down to two sets. In the best case (which happens when each set is relatively small), this requires storage for O(√#tuples) tuples instead of Ω(#tuples).
Here's some Python to do this. It should transliterate reasonably straightforwardly to Perl. You'll need a heap library from CPAN and to convert my tuples to strings so that they can be keys in a Perl hash. The set can be stored as a hash as well.
from heapq import heappop, heappush
def largest_to_smallest(lists):
"""
>>> print list(largest_to_smallest([[1, 2, 3], [2, 4], [5]]))
[(3, 4, 5), (2, 4, 5), (3, 2, 5), (1, 4, 5), (2, 2, 5), (1, 2, 5)]
"""
for lst in lists:
lst.sort(reverse=True)
num_lists = len(lists)
index_tuples_in_heap = set()
min_heap = []
def insert(index_tuple):
if index_tuple in index_tuples_in_heap:
return
index_tuples_in_heap.add(index_tuple)
minus_sum = 0 # compute -sum because it's a min heap, not a max heap
for i in xrange(num_lists): # 0, ..., num_lists - 1
if index_tuple[i] >= len(lists[i]):
return
minus_sum -= lists[i][index_tuple[i]]
heappush(min_heap, (minus_sum, index_tuple))
insert((0,) * num_lists)
while min_heap:
minus_sum, index_tuple = heappop(min_heap)
elements = []
for i in xrange(num_lists):
elements.append(lists[i][index_tuple[i]])
yield tuple(elements) # this is where the tuple is returned
for i in xrange(num_lists):
neighbor = []
for j in xrange(num_lists):
if i == j:
neighbor.append(index_tuple[j] + 1)
else:
neighbor.append(index_tuple[j])
insert(tuple(neighbor))