How to make this sparse matrix and trie work in tandem - algorithm

I have a sparse matrix that has been exported to this format:
(1, 3) = 4
(0, 5) = 88
(6, 0) = 100
...
Strings are stored in a Trie data structure. The numbers in the exported sparse matrix above are the results of lookups on the Trie.
Let's say the word "stackoverflow" is mapped to the number 0. I need to iterate over the exported sparse matrix entries whose first element equals 0 and find the highest value.
For example:
(0, 1) = 4
(0, 3) = 8
(0, 9) = 100 <-- highest value
(0, 9) is going to win.
What would be the best implementation to store the exported sparse matrix?
In general, what would be the best approach (data structure, algorithm) to handle this functionality?

Absent memory or dynamism constraints, probably the best approach is to slurp the sparse matrix into a map from first number to the pairs ordered by value, e.g.,
from operator import itemgetter

matrix_map = {}  # empty map
for (first_number, second_number, value) in matrix_triples:
    if first_number not in matrix_map:
        matrix_map[first_number] = []  # empty list
    matrix_map[first_number].append((second_number, value))

for lst in matrix_map.values():
    lst.sort(key=itemgetter(1), reverse=True)  # sort by value descending
Given a matrix like
(0, 1) = 4
(0, 3) = 8
(0, 5) = 88
(0, 9) = 100
(1, 3) = 4
(6, 0) = 100,
the finished product looks like this:
{0: [(9, 100), (5, 88), (3, 8), (1, 4)],
1: [(3, 4)],
6: [(0, 100)]}.
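Answering the original query is then a dictionary lookup followed by taking the head of the already sorted list. A minimal sketch, assuming the Trie lookup has already mapped "stackoverflow" to 0 and matrix_map was built as above:

row = matrix_map.get(0, [])           # 0 is the number the Trie returned for "stackoverflow"
if row:
    best_column, best_value = row[0]  # lists are sorted by value, descending
    print((best_column, best_value))  # (9, 100) for the example matrix above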

Related

KNN - Triangular Inequality Optimization

I don't fully understand how the triangular inequality is used to optimise distance calculations in KNN classification.
I wrote a Python script following the steps below:
Calculate the distance between each pair of training pixels.
For each test sample:
Calculate the distance from the first training sample as dn. This is the current minimum distance.
Calculate the distance from the second training sample (p) as dp.
If dp < dn, assign dn = dp.
For each remaining training sample (c):
If the distance between sample c and sample p, measured as dcp, satisfies
dp - dn < dcp < dp + dn,
calculate the distance from the test sample to sample c as dp.
If dp < dn, assign dn = dp.
Else, skip this training sample.
Stop when there are no more training samples.
The class of the nearest sample found (the one at distance dn) is the estimate.
Python Script:
def get_distance(p1=(0, 0), p2=(0, 0)):
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

def algorithm(train_set, new_point):
    d_n = get_distance(new_point, train_set[0])
    d_p = get_distance(new_point, train_set[1])
    min_index = 0
    if d_p < d_n:
        d_n = d_p
        min_index = 1
    for c in range(2, len(train_set)):
        dcp = get_distance(train_set[min_index], train_set[c])
        if d_p - d_n < dcp < d_p + d_n:
            d_p = get_distance(new_point, train_set[c])
            if d_p < d_n:
                d_n = d_p
                min_index = c
    print(train_set[min_index], d_n)

train_set = [
    (0, 1, 'A'),
    (1, 1, 'A'),
    (2, 5, 'B'),
    (1, 8, 'A'),
    (5, 3, 'C'),
    (4, 2, 'C'),
    (3, 2, 'A'),
    (1, 7, 'B'),
    (4, 8, 'B'),
    (4, 0, 'A'),
]
for new_point in train_set:
    # Checking the distances from the points within the training set itself:
    # min distance = 0, used for validation
    result_point = min(train_set, key=lambda x: get_distance(x, new_point))
    print(result_point, get_distance(result_point, new_point))
    algorithm(train_set, new_point)
    print('----------')
But it doesn't give the expected result for one of the points.
Is my understanding of the optimization wrong?
Thank you in advance for any help.
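For reference, here is a minimal sketch of how the triangle-inequality lower bound is commonly applied in nearest-neighbour search; it reuses get_distance from the script above, and pairwise (a precomputed table of training-to-training distances) is an illustrative name, not something the script builds:

def nn_with_pruning(train_set, new_point, pairwise):
    # pairwise[i][j] holds the precomputed distance between training samples i and j.
    best_i = 0
    best_d = get_distance(new_point, train_set[0])
    for c in range(1, len(train_set)):
        # Triangle inequality: dist(new_point, c) >= |best_d - pairwise[best_i][c]|,
        # so c can be skipped when that lower bound is already >= best_d.
        if abs(best_d - pairwise[best_i][c]) >= best_d:
            continue
        d = get_distance(new_point, train_set[c])
        if d < best_d:
            best_d, best_i = d, c
    return train_set[best_i], best_d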

Filtering Spatial Data in Apache Spark

I am currently solving a problem involving GPS data from buses. The issue I am facing is reducing the computation in my process.
There are about 2 billion GPS-coordinate points (Lat-Long degrees) in one table and about 12,000 bus-stops with their Lat-Long in another table. It is expected that only 5-10% of the 2-billion points are at bus-stops.
Problem: I need to tag and extract only those points (out of the 2-billion) that are at bus-stops (the 12,000 points). Since this is GPS data, I cannot do exact matching of the coordinates, but rather do a tolerance based geofencing.
Issue: The process of tagging bus-stops is taking an extremely long time with the current naive approach. Currently, we pick each of the 12,000 bus-stop points and query the 2 billion points against it with a tolerance of 100 m (by converting degree differences into distance).
Question: Is there an algorithmically efficient process to achieve this tagging of points?
Yes, you can use something like SpatialSpark. It only works with Spark 1.6.1, but you can use BroadcastSpatialJoin to create an R-tree, which is extremely efficient.
Here's an example of me using SpatialSpark with PySpark to check whether different polygons are within each other or intersect:
from ast import literal_eval as make_tuple
from shapely.geometry import Polygon, Point  # provides the .wkt strings used below

print "Java Spark context version:", sc._jsc.version()
spatialspark = sc._jvm.spatialspark

rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((-1, -1))

def geomABWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt)
    ])

def geomCWithId():
    return sc.parallelize([
        (0L, rectangleC.wkt)
    ])

def geomABCWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt),
        (2L, rectangleC.wkt)])

def geomDWithId():
    return sc.parallelize([
        (0L, pointD.wkt)
    ])

dfAB = sqlContext.createDataFrame(geomABWithId(), ['id', 'wkt'])
dfABC = sqlContext.createDataFrame(geomABCWithId(), ['id', 'wkt'])
dfC = sqlContext.createDataFrame(geomCWithId(), ['id', 'wkt'])
dfD = sqlContext.createDataFrame(geomDWithId(), ['id', 'wkt'])

# Supported Operators: Within, WithinD, Contains, Intersects, Overlaps, NearestD
SpatialOperator = spatialspark.operator.SpatialOperator
BroadcastSpatialJoin = spatialspark.join.BroadcastSpatialJoin

joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
joinRDD.count()
results = joinRDD.collect()
map(lambda result: make_tuple(result.toString()), results)

# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
Note the line
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
The last argument is a buffer value; in your case it would be the tolerance you want to use. It will probably be a very small number if you are using lat/lon, since it is an angular system; depending on the tolerance in meters you want, you will have to convert it into degrees for your area of interest.
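As a rough guide for that conversion: one degree of latitude is about 111,320 m, and a degree of longitude shrinks by cos(latitude). A small helper along these lines could produce the degree buffer (meters_to_degrees is just an illustrative name, not part of SpatialSpark):

import math

def meters_to_degrees(meters, latitude_deg):
    # ~111,320 m per degree of latitude; a degree of longitude is shorter by cos(latitude).
    # Taking the larger of the two keeps the buffer from missing points near the tolerance edge.
    lat_deg = meters / 111320.0
    lon_deg = meters / (111320.0 * math.cos(math.radians(latitude_deg)))
    return max(lat_deg, lon_deg)

print(meters_to_degrees(100, 52.5))  # roughly 0.0015 degrees of buffer at latitude 52.5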

Sorting vector of x/y coordinates

I have a vector of (u32, u32) tuples which represent coordinates on a 10 x 10 grid. The coordinates are unsorted. Because the standard sort function also didn't yield the result I wanted, I wrote a sort function like this for them:
vec.sort_by(|a, b| {
    if a.0 > b.0 { return Ordering::Greater; }
    if a.0 < b.0 { return Ordering::Less; }
    if a.1 > b.1 { return Ordering::Greater; }
    if a.1 < b.1 { return Ordering::Less; }
    return Ordering::Equal;
});
The resulting grid for my custom function looks like this:
(0/0) (0/1) (0/2) (0/3) (0/4) (0/5) (0/6) (0/7) (0/8) (0/9)
(1/0) (1/1) (1/2) (1/3) (1/4) (1/5) (1/6) (1/7) (1/8) (1/9)
(2/0) (2/1) (2/2) (2/3) (2/4) (2/5) (2/6) (2/7) (2/8) (2/9)
...
(9/0) (9/1) (9/2) (9/3) (9/4) (9/5) (9/6) (9/7) (9/8) (9/9)
This is not what I want, because the lower left should start with (0/0), as I would expect on a mathematical coordinate grid.
I probably can manage to add more cases to the sort algorithm, but is there an easier way to do what I want besides writing a big if .. return Ordering ...; block?
You didn't show how you are populating or printing your tuples, so this is a guess. Flip around and/or negate parts of your coordinates. I'd also recommend using sort_by_key as it's easier, as well as just reusing the existing comparison of tuples:
fn main() {
    let mut points = [(0, 0), (1, 1), (1, 0), (0, 1)];
    points.sort_by_key(|&(x, y)| (!y, x));
    println!("{:?}", points);
}
Adding an extra newline in the output:
[(0, 1), (1, 1),
(0, 0), (1, 0)]
Originally, this answer suggested negating the value ((-y, x)). However, as pointed out by Francis Gagné, this fails for unsigned integers, or for signed integers when the value is the minimum value. Negating the bits happens to work fine, but is a bit too "clever".
Nowadays, I would use Ordering::reverse and Ordering::then for clarity:
fn main() {
    let mut points = [(0u8, 0u8), (1, 1), (1, 0), (0, 1)];
    points.sort_by(|&(x0, y0), &(x1, y1)| y0.cmp(&y1).reverse().then(x0.cmp(&x1)));
    println!("{:?}", points);
}
[(0, 1), (1, 1),
(0, 0), (1, 0)]

System of integer inequalities: Counting solutions

I have a system of inequalities and constraints:
Let A=[F1,F2,F3,F4,F5,F6] where F1 through F6 are given.
Let B=[a,b,c,d,e,f] where a<=b<=c<=d<=e<=f.
Let C=[u,v,w,x,y,z] where u<=v<=w<=x<=y<=z.
Equation 1: if(a>F1, 1, 0) + if(a>F2, 1, 0) + ... + if(f>F6, 1, 0) > 18
Equation 2: if(u>a, 1, 0) + if(u>b, 1, 0) + ... + if (z>f, 1, 0) > 18
Equation 3: if(F1>u, 1, 0) + if(F1>v, 1, 0) + ... + if(F6>z, 1, 0) > 18
Other constraints: All variables must be integers between 1 and N (N is given).
I wish to merely count the number of integer solutions to my variables (I do not wish to actually solve for them). I know how to use solvers to calculate systems of equations in matrices, but that usually assumes the equations use = as opposed to >=, >, <, or <=.
Here's a stab at it.
This is horribly inefficient, as I compute the Cartesian product of the two vectors, then compare each tuple combination. This also won't scale past 2 dimensions.
Also, I'm worried this isn't exactly what you are looking for, because I'm solving each equation independently. If you're looking for all the integer values that satisfy a 3-dimensional space bound by the system of inequalities, well, that's a bit of a brain bender for me, albeit very interesting.
Python anyone?
# sample data
A = [12, 2, 15, 104, 54, 20]
B = [10, 20, 30, 40, 50, 60]
C = [100, 200, 300, 400, 500, 600]

import itertools

def eq1():
    product = itertools.product(B, A)  # construct Cartesian product of 2 lists
    # list(product) returns a Cartesian product of tuples
    #   [(12, 10), (12, 20), (12, 30)... (2, 10), (2, 20)... (20, 60)]
    # now, use a list comprehension to compare the values in each tuple,
    # generating a list of only those that satisfy the inequality...
    # then return the length of that list - which is the count
    return len([Bval for Bval, Aval in list(product) if Bval > Aval])

def eq2():
    product = itertools.product(C, B)
    return len([Cval for Cval, Bval in list(product) if Cval > Bval])

def eq3():
    product = itertools.product(A, C)
    return len([Aval for Aval, Cval in list(product) if Aval > Cval])

print eq1()
print eq2()
print eq3()
This sample data returns:
eq1 : 21
eq2 : 36
eq3 : 1
But this doesn't know how to combine these answers into a single integer count satisfying all 3 - some kind of combination has to happen between the lists.
My sanity check is equation 3, which returns 1 - because only Aval = 104 satisfies Aval > Cval, and only for Cval = 100.
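If N is small, one way to make the combined count concrete is to brute-force it: enumerate every non-decreasing 6-tuple for B and for C with entries in 1..N and test the three counts together. This is only a sketch (the candidate set grows very quickly with N), and exceeds / count_solutions are illustrative names:

from itertools import combinations_with_replacement, product

def exceeds(X, Y):
    # Count ordered pairs (x, y) with x from X, y from Y and x > y, as in the three equations.
    return sum(1 for x, y in product(X, Y) if x > y) > 18

def count_solutions(A, N):
    # combinations_with_replacement yields exactly the non-decreasing 6-tuples over 1..N.
    candidates = list(combinations_with_replacement(range(1, N + 1), 6))
    total = 0
    for B in candidates:
        if not exceeds(B, A):                    # Equation 1
            continue
        for C in candidates:
            if exceeds(C, B) and exceeds(A, C):  # Equations 2 and 3
                total += 1
    return total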

How to generate cross product of sets in specific order

Given some sets (or lists) of numbers, I would like to iterate through the cross product of these sets in the order determined by the sum of the returned numbers. For example, if the given sets are { 1,2,3 }, { 2,4 }, { 5 }, then I would like to retrieve the cross-products in the order
<3,4,5>,
<2,4,5>,
<3,2,5> or <1,4,5>,
<2,2,5>,
<1,2,5>
I can't compute all the cross-products first and then sort them, because there are way too many. Is there any clever way to achieve this with an iterator?
(I'm using Perl for this, in case there are modules that would help.)
For two sets A and B, we can use a min heap as follows.
Sort A.
Sort B.
Push (0, 0) into a min heap H with priority function (i, j) |-> A[i] + B[j]. Break ties preferring small i and j.
While H is not empty, pop (i, j), output (A[i], B[j]), insert (i + 1, j) and (i, j + 1) if they exist and don't already belong to H.
For more than two sets, use the naive algorithm and sort to get down to two sets. In the best case (which happens when each set is relatively small), this requires storage for O(√#tuples) tuples instead of Ω(#tuples).
Here's some Python to do this. It should transliterate reasonably straightforwardly to Perl. You'll need a heap library from CPAN and to convert my tuples to strings so that they can be keys in a Perl hash. The set can be stored as a hash as well.
from heapq import heappop, heappush

def largest_to_smallest(lists):
    """
    >>> print list(largest_to_smallest([[1, 2, 3], [2, 4], [5]]))
    [(3, 4, 5), (2, 4, 5), (3, 2, 5), (1, 4, 5), (2, 2, 5), (1, 2, 5)]
    """
    for lst in lists:
        lst.sort(reverse=True)
    num_lists = len(lists)
    index_tuples_in_heap = set()
    min_heap = []

    def insert(index_tuple):
        if index_tuple in index_tuples_in_heap:
            return
        index_tuples_in_heap.add(index_tuple)
        minus_sum = 0  # compute -sum because it's a min heap, not a max heap
        for i in xrange(num_lists):  # 0, ..., num_lists - 1
            if index_tuple[i] >= len(lists[i]):
                return
            minus_sum -= lists[i][index_tuple[i]]
        heappush(min_heap, (minus_sum, index_tuple))

    insert((0,) * num_lists)
    while min_heap:
        minus_sum, index_tuple = heappop(min_heap)
        elements = []
        for i in xrange(num_lists):
            elements.append(lists[i][index_tuple[i]])
        yield tuple(elements)  # this is where the tuple is returned
        for i in xrange(num_lists):
            neighbor = []
            for j in xrange(num_lists):
                if i == j:
                    neighbor.append(index_tuple[j] + 1)
                else:
                    neighbor.append(index_tuple[j])
            insert(tuple(neighbor))
