Filtering Spatial Data in Apache Spark - algorithm

I am currently working on a problem involving GPS data from buses. My goal is to reduce the computation in my process.
There are about 2 billion GPS-coordinate points (Lat-Long degrees) in one table and about 12,000 bus-stops with their Lat-Long in another table. It is expected that only 5-10% of the 2-billion points are at bus-stops.
Problem: I need to tag and extract only those points (out of the 2 billion) that are at bus-stops (the 12,000 points). Since this is GPS data, I cannot match the coordinates exactly and have to use tolerance-based geofencing instead.
Issue: Tagging the bus-stops takes an extremely long time with the current naive approach. Currently, we pick each of the 12,000 bus-stop points in turn and query the 2 billion points with a tolerance of 100m (by converting degree differences into distance).
Question: Is there an algorithmically efficient process to achieve this tagging of points?

Yes, you can use something like SpatialSpark. It only works with Spark 1.6.1, but you can use BroadcastSpatialJoin to create an RTree, which is extremely efficient.
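The idea behind a broadcast spatial join can be sketched without Spark: index the small table (the bus stops) once, then probe that index for every GPS point instead of scanning all stops. The sketch below swaps the R-tree for a plain uniform grid, which is simpler to show; it is not SpatialSpark's API. The names haversine_m, build_grid and nearest_stop, the cell size and the sample coordinates are all made up for illustration. It assumes cell_deg is at least as large as the tolerance expressed in degrees of longitude at the highest latitude in the data, so that checking the 3x3 block of neighbouring cells is sufficient.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def build_grid(stops, cell_deg):
    """Bucket (stop_id, lat, lon) tuples by grid cell."""
    grid = defaultdict(list)
    for stop_id, lat, lon in stops:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        grid[cell].append((stop_id, lat, lon))
    return grid


def nearest_stop(grid, cell_deg, lat, lon, tolerance_m):
    """Return (stop_id, distance_m) of the closest stop within tolerance, else None."""
    ci, cj = math.floor(lat / cell_deg), math.floor(lon / cell_deg)
    best = None
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for stop_id, slat, slon in grid.get((ci + di, cj + dj), ()):
                d = haversine_m(lat, lon, slat, slon)
                if d <= tolerance_m and (best is None or d < best[1]):
                    best = (stop_id, d)
    return best


# Hypothetical data: one stop, one nearby GPS fix, one distant GPS fix.
CELL_DEG = 0.002  # covers 100 m of longitude up to roughly 60 degrees latitude
grid = build_grid([(1, 12.9716, 77.5946)], CELL_DEG)
near = nearest_stop(grid, CELL_DEG, 12.9720, 77.5946, 100.0)
far = nearest_stop(grid, CELL_DEG, 12.9816, 77.5946, 100.0)
```

Each GPS point then costs a handful of dictionary lookups rather than 12,000 distance computations. In Spark, a grid built from 12,000 stops should be small enough to ship to the executors with sc.broadcast and probe from a map or filter over the large table.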
Here's an example of me using SpatialSpark with PySpark to check if different polygons are within each other or are intersecting:
# Python 2 / PySpark on Spark 1.6.1. Assumes `sc` and `sqlContext` already exist
# (as in the PySpark shell) and that Polygon and Point come from shapely.
from ast import literal_eval as make_tuple
from shapely.geometry import Point, Polygon

print "Java Spark context version:", sc._jsc.version()
spatialspark = sc._jvm.spatialspark

rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((-1, -1))

# Each helper returns an RDD of (id, WKT string) pairs.
def geomABWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt)
    ])

def geomCWithId():
    return sc.parallelize([
        (0L, rectangleC.wkt)
    ])

def geomABCWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt),
        (2L, rectangleC.wkt)])

def geomDWithId():
    return sc.parallelize([
        (0L, pointD.wkt)
    ])

dfAB = sqlContext.createDataFrame(geomABWithId(), ['id', 'wkt'])
dfABC = sqlContext.createDataFrame(geomABCWithId(), ['id', 'wkt'])
dfC = sqlContext.createDataFrame(geomCWithId(), ['id', 'wkt'])
dfD = sqlContext.createDataFrame(geomDWithId(), ['id', 'wkt'])

# Supported Operators: Within, WithinD, Contains, Intersects, Overlaps, NearestD
SpatialOperator = spatialspark.operator.SpatialOperator
BroadcastSpatialJoin = spatialspark.join.BroadcastSpatialJoin

joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
joinRDD.count()
results = joinRDD.collect()
map(lambda result: make_tuple(result.toString()), results)
# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
Note the line
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
The last argument is a buffer value; in your case it would be the tolerance you want to use. If you are using lat/lon, it will probably be a very small number, because the coordinates are angles in degrees rather than distances. You will need to convert the tolerance you want in meters into degrees for your area of interest.
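As a rough illustration of that conversion, the sketch below turns a tolerance in meters into degrees of latitude and longitude. It assumes a spherical Earth with about 111,320 m per degree of latitude; the function name meters_to_degrees and the constant are my own, not part of SpatialSpark. A degree of longitude shrinks with the cosine of the latitude, so the longitude tolerance has to be widened accordingly.

```python
import math

METERS_PER_DEGREE_LAT = 111320.0  # approximation for a spherical Earth


def meters_to_degrees(tolerance_m, latitude_deg):
    """Return (lat_tolerance_deg, lon_tolerance_deg) for a tolerance in meters."""
    lat_deg = tolerance_m / METERS_PER_DEGREE_LAT
    # One degree of longitude covers fewer meters away from the equator.
    lon_deg = tolerance_m / (METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg)))
    return lat_deg, lon_deg


lat_tol_equator, lon_tol_equator = meters_to_degrees(100.0, 0.0)
lat_tol_60, lon_tol_60 = meters_to_degrees(100.0, 60.0)
```

For a 100 m tolerance this gives a latitude tolerance of a little under 0.0009 degrees. If a single buffer value has to cover both axes, the larger longitude figure at the highest latitude in the data is the conservative choice, followed by an exact distance check on the candidates.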

How to design a heuristic algorithm to solve this location optimization problem?

I simplified the problem to the following description:
To produce an item, it must pass through three devices, A, B and C, in the order A->B->C. Each device must be installed at an address (from 0 to 7) before it can be used. The installation cost of a device differs from address to address, as shown in the figure below.
The addresses are ordered along the arrows. Device A can be installed at address 0 at a cost of 800, or at address 1 at a cost of 700. The other devices have similar installation locations and costs. The following is a correct placement (A chooses 2, B chooses 3, C chooses 5):
If we first install device A at 2 and device C at 3, then B has no address left to be installed at, so this is an incorrect installation.
Finding a correct installation is very simple, but because the cost of installing a device differs between addresses, how do I find the cheapest correct installation? Because of the large scale of the problem, I want to use a heuristic search algorithm. How should I design the algorithm? Thank you very much if you can answer my question!
This isn’t an assignment problem because of the ordering constraint on the installations. A dynamic program will do. In Python 3:
import math

input_instance = [
    ("A", [(0, 800), (1, 700), (2, 500), (3, 1000)]),
    ("B", [(2, 200), (3, 1500)]),
    ("C", [(3, 1000), (4, 200), (5, 500), (6, 700)]),
]

# Indices into a solution tuple.
LAST_ADDRESS = 0
TOTAL_COST = 1
INSTALLATIONS = 2

# Partial solutions, sorted by last address ascending and total cost descending.
solutions = [(-math.inf, 0, [])]
for device, address_cost_pairs in input_instance:
    next_solutions = []
    i = 0
    for address, cost in sorted(address_cost_pairs):
        if address <= solutions[i][LAST_ADDRESS]:
            continue
        # Advance to the cheapest partial solution that ends before this address.
        while i + 1 < len(solutions) and solutions[i + 1][LAST_ADDRESS] < address:
            i += 1
        next_solutions.append(
            (
                address,
                solutions[i][TOTAL_COST] + cost,
                solutions[i][INSTALLATIONS] + [(device, address)],
            )
        )
    del solutions[:]
    # Keep only solutions that are cheaper than every solution ending earlier.
    total_cost_limit = math.inf
    for solution in next_solutions:
        if total_cost_limit <= solution[TOTAL_COST]:
            continue
        solutions.append(solution)
        total_cost_limit = solution[TOTAL_COST]
    if not solutions:
        break
else:
    print(*solutions[-1][1:])
Output:
1100 [('A', 1), ('B', 2), ('C', 4)]
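For a small instance like this one, the result can be cross-checked by brute force. The sketch below is not part of the dynamic program; it simply enumerates every combination of addresses, keeps those that are strictly increasing in device order, and picks the cheapest. The function name brute_force is my own, and this only scales to tiny inputs.

```python
import itertools
import math

input_instance = [
    ("A", [(0, 800), (1, 700), (2, 500), (3, 1000)]),
    ("B", [(2, 200), (3, 1500)]),
    ("C", [(3, 1000), (4, 200), (5, 500), (6, 700)]),
]


def brute_force(instance):
    """Try every address combination; return (cost, plan) of the cheapest valid one."""
    names = [name for name, _ in instance]
    best_cost, best_plan = math.inf, None
    for choice in itertools.product(*(pairs for _, pairs in instance)):
        addresses = [address for address, _ in choice]
        # Devices must sit at strictly increasing addresses (A before B before C).
        if any(a >= b for a, b in zip(addresses, addresses[1:])):
            continue
        cost = sum(c for _, c in choice)
        if cost < best_cost:
            best_cost, best_plan = cost, list(zip(names, addresses))
    return best_cost, best_plan


result = brute_force(input_instance)
print(*result)  # 1100 [('A', 1), ('B', 2), ('C', 4)]
```

The enumeration is exponential in the number of devices, which is why the dynamic program above is the one to use on a large instance.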

LightGBM predict with pred_contrib=True for multiclass: order of SHAP values in the returned array

LightGBM predict method with pred_contrib=True returns an array of shape =(n_samples, (n_features + 1) * n_classes).
What is the order of data in the second dimension of this array?
In other words, there are two questions:
What is the correct way to reshape this array to use it: shape = (n_samples, n_features + 1, n_classes) or shape = (n_samples, n_classes, n_features + 1)?
In the feature dimension, there are n_features entries, one for each feature, and a (useless) entry for the contribution not related to any feature. What is the order of these entries: feature contributions in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0, or some other way?
The answers are as follows:
The correct shape is (n_samples, n_classes, n_features + 1).
Within each class block, the feature contributions are in the entries 0, ..., n_features - 1 (zero-based), in the same order as the features appear in the dataset. The remaining entry, which is the expected value of the model output and is not related to any feature, is at the last index, n_features. This agrees with the LightGBM documentation for pred_contrib, which states that the last column is the expected value.
The following code illustrates it:
import lightgbm, pandas, numpy
params = {'objective': 'multiclass', 'num_classes': 4, 'num_iterations': 10000,
          'metric': 'multiclass', 'early_stopping_rounds': 10}
train_df = pandas.DataFrame({'f0': [0, 1, 2, 3] * 50, 'f1': [0, 0, 1] * 66 + [1, 2]}, dtype=float)
val_df = train_df.copy()
train_target = pandas.Series([0, 1, 2, 3] * 50)
val_target = pandas.Series([0, 1, 2, 3] * 50)
train_set = lightgbm.Dataset(train_df, train_target)
val_set = lightgbm.Dataset(val_df, val_target)
model = lightgbm.train(params=params, train_set=train_set, valid_sets=[val_set, train_set])
feature_contribs = model.predict(val_df, pred_contrib=True)
print('Shape of SHAP:', feature_contribs.shape)
# Shape of SHAP: (200, 12)
print('Averages over samples:', numpy.mean(feature_contribs, axis=0))
# Averages over samples: [ 3.99942301e-13 -4.02281771e-13 -4.30029167e+00 -1.90606677e-05
# 1.90606677e-05 -4.04157656e+00 2.24205077e-05 -2.24205077e-05
# -4.04265615e+00 -3.70370401e-15 5.20335728e-18 -4.30029167e+00]
feature_contribs = feature_contribs.reshape(200, 4, 3)
print('Mean feature contribs:', numpy.mean(feature_contribs, axis=(0, 1)))
# Mean feature contribs: [ 8.39960111e-07 -8.39960113e-07 -4.17120401e+00]
(Each output appears as a comment on the following line.)
The explanation is as follows.
I have created a dataset with two features and with labels identical to the first of these features (f0).
The expected value is the same constant for every sample within a class, so it survives averaging over the samples, whereas the per-sample feature contributions are deviations from it and average out to roughly zero here.
After averaging the SHAP output over the samples, we get an array of the shape (12,) with nonzero values at the positions 2, 5, 8, 11 (zero-based).
The nonzero values repeat with a period of 3 = n_features + 1, which shows that the correct shape of this array is (4, 3), that is, (n_classes, n_features + 1).
After reshaping this way and averaging over the samples and the classes, we get an array of the shape (3,) with the nonzero entry at the end.
This shows that the last entry in each class block is the expected value. This means that the entries at the positions 0 and 1 correspond to the features f0 and f1, and the entry at the last position does not correspond to any feature.
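The reshape itself can be tried on a stand-in array without training a model. In the sketch below, flat is a made-up substitute for the output of model.predict(X, pred_contrib=True): each entry encodes 100 * sample + 10 * class + slot, so it is easy to see where every value lands after reshaping to (n_samples, n_classes, n_features + 1). The sizes and variable names are chosen only for illustration.

```python
import numpy as np

n_samples, n_classes, n_features = 2, 4, 2
block = n_features + 1  # entries per class in the flat output

# Stand-in for the (n_samples, n_classes * (n_features + 1)) prediction array:
# classes are laid out one after another, each with `block` consecutive entries.
flat = np.array([
    [100 * s + 10 * c + k for c in range(n_classes) for k in range(block)]
    for s in range(n_samples)
])

# Class-major reshape: axis 1 selects the class, axis 2 the entry within its block.
contribs = flat.reshape(n_samples, n_classes, block)

class2_of_sample0 = contribs[0, 2].tolist()
last_entry = int(contribs[1, 3, 2])
```

With this layout, contribs[:, k, :] is the (n_samples, n_features + 1) block belonging to class k, and flat column j maps to class j // block and slot j % block.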

GridSearch in Keras + TensorFlow resulting in Resource exhausted

I know that this error is recurrent and I understand what can cause it.
For example, running this model with 163 images of 150x150 gives me the error (however, it's not clear to me why, even with batch_size set, Keras still seems to try to allocate all the images on the GPU at once):
model = Sequential()
model.add(Conv2D(64, kernel_size=(6, 6), activation='relu', input_shape=input_shape, padding='same', name='b1_conv'))
model.add(MaxPooling2D(pool_size=(2, 2), name='b1_pool'))
model.add(Conv2D(128, kernel_size=(6, 6), activation='relu', padding='same', name='b2_conv'))
model.add(MaxPooling2D(pool_size=(2, 2), name='b2_pool'))
model.add(Conv2D(256, kernel_size=(6, 6), activation='relu', padding='same', name='b3_conv'))
model.add(MaxPooling2D(pool_size=(2, 2), name='b3_pool'))
model.add(Flatten())
model.add(Dense(500, activation='relu', name='fc1'))
model.add(Dropout(0.5))
model.add(Dense(500, activation='relu', name='fc2'))
model.add(Dropout(0.5))
model.add(Dense(n_targets, activation='softmax', name='prediction'))
model.compile(optimizer=optim, loss='categorical_crossentropy', metrics=['accuracy'])
Given that, I reduced the image size to 30x30 (which resulted in an accuracy drop, as expected). However, running a grid search on this model still results in Resource exhausted.
model = KerasClassifier(build_fn=create_model, verbose=0)
# grid initial weight, batch size and optimizer
sgd = optimizers.SGD(lr=0.0005)
rms = optimizers.RMSprop(lr=0.0005)
adag = optimizers.Adagrad(lr=0.0005)
adad = optimizers.Adadelta(lr=0.0005)
adam = optimizers.Adam(lr=0.0005)
adamm = optimizers.Adamax(lr=0.0005)
nadam = optimizers.Nadam(lr=0.0005)
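# Renamed from `optimizers` so the list does not shadow the keras `optimizers` module used above.
optimizer_choices = [sgd, rms, adag, adad, adam, adamm, nadam]
init = ['glorot_uniform', 'normal', 'uniform', 'he_normal']
batches = [32, 64, 128]
param_grid = dict(optim=optimizer_choices, batch_size=batches, init=init)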
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
I wonder if it's possible to "clean" things before each combination used by the grid search (not sure if I made myself clear, this is all new to me).
EDIT
Using the fit_generator also gives me the same error:
def generator(features, labels, batch_size):
    # Create empty arrays to contain a batch of features and labels
    batch_features = np.zeros((batch_size, size, size, 1))
    batch_labels = np.zeros((batch_size, n_targets))
    while True:
        for i in range(batch_size):
            # choose random index in features
            index = np.random.choice(len(features), 1)
            batch_features[i] = features[index]
            batch_labels[i] = labels[index]
        yield batch_features, batch_labels
sgd = optimizers.SGD(lr=0.0005)
rms = optimizers.RMSprop(lr=0.0005)
adag = optimizers.Adagrad(lr=0.0005)
adad = optimizers.Adadelta(lr=0.0005)
adam = optimizers.Adam(lr=0.0005)
adamm = optimizers.Adamax(lr=0.0005)
nadam = optimizers.Nadam(lr=0.0005)
optim = [rms, adag, adad, adam, adamm, nadam]
init = ['normal', 'uniform', 'he_normal']
combinations = [(a, b) for a in optim for b in init]
for combination in combinations:
    init = combination[1]
    optim = combination[0]
    model = create_model(init=init, optim=optim)
    model.fit_generator(generator(X_train, y_train, batch_size=32),
                        steps_per_epoch=X_train.shape[0] // 32,
                        epochs=100, verbose=0, validation_data=(X_test, y_test))
    scores = model.model.evaluate(X_test, y_test, verbose=0)
    print("%s: %.2f%% Model %s %s" % (model.model.metrics_names[1], scores[1]*100, optim, init))
You should work with generators + yield, they discard from the memory the data they already used. Check out my answer to a similar question.
You have to clear the TensorFlow session after each training/evaluation run:
K.clear_session()
where K is the Keras backend (from keras import backend as K).
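One way to make sure the cleanup runs after every combination, even when a run fails, is to wrap the loop as in the sketch below. It is deliberately library-agnostic: run_search and build_and_score are hypothetical names, and the cleanup is passed in as a callable, which with Keras would be K.clear_session. It is probably safest to construct the optimizer inside build_and_score, since objects built before a session is cleared may still refer to the old graph.

```python
def run_search(combinations, build_and_score, clear_session):
    """Train one model per combination, releasing backend state after each run."""
    results = []
    for combination in combinations:
        try:
            # build_and_score is a hypothetical user function that builds,
            # trains and evaluates one model, returning its score.
            score = build_and_score(*combination)
            results.append((combination, score))
        finally:
            # With Keras: clear_session=K.clear_session,
            # where K comes from `from keras import backend as K`.
            clear_session()
    return results


# Stand-ins so the sketch runs without a GPU or Keras installed.
cleanup_calls = []
demo_results = run_search(
    [("adam", "normal"), ("sgd", "uniform")],
    build_and_score=lambda optim, init: len(optim) + len(init),
    clear_session=lambda: cleanup_calls.append("cleared"),
)
```

In the real loop, the call would look like run_search(combinations, train_one, K.clear_session), with train_one wrapping create_model, fit_generator and evaluate from the question.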

How to make this sparse matrix and trie work in tandem

I have a sparse matrix that has been exported to this format:
(1, 3) = 4
(0, 5) = 88
(6, 0) = 100
...
Strings are stored into a Trie data structure. The numbers in the previous exported sparse matrix correspond to the result of the lookup on the Trie.
Let's say the word "stackoverflow" is mapped to number '0'. I need to iterate over the entries of the exported sparse matrix whose first element equals '0' and find the highest value.
For example:
(0, 1) = 4
(0, 3) = 8
(0, 9) = 100 <-- highest value
(0, 9) is going to win.
What would be the best implementation to store the exported sparse matrix?
In general, what would be the best approach (data structure, algorithm) to handle this functionality?
Absent memory or dynamism constraints, probably the best approach is to slurp the sparse matrix into a map from first number to the pairs ordered by value, e.g.,
from operator import itemgetter

# matrix_triples: iterable of (first_number, second_number, value) read from the export
matrix_map = {}  # empty map
for (first_number, second_number, value) in matrix_triples:
    if first_number not in matrix_map:
        matrix_map[first_number] = []  # empty list
    matrix_map[first_number].append((second_number, value))
for lst in matrix_map.values():
    lst.sort(key=itemgetter(1), reverse=True)  # sort by value descending
Given a matrix like
(0, 1) = 4
(0, 3) = 8
(0, 5) = 88
(0, 9) = 100
(1, 3) = 4
(6, 0) = 100,
the finished product looks like this:
{0: [(9, 100), (5, 88), (3, 8), (1, 4)],
1: [(3, 4)],
6: [(0, 100)]}.
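To tie this back to the trie, here is a small runnable sketch that parses the exported lines, builds the same kind of map, and looks up the winning pair for a word. The regular expression, the function names load_matrix and best_for, and the word_to_id dictionary (a stand-in for the real trie lookup) are assumptions made for the example.

```python
import re
from collections import defaultdict
from operator import itemgetter

# Matches lines such as "(0, 9) = 100"; anything after the value is ignored.
LINE_RE = re.compile(r"\((\d+),\s*(\d+)\)\s*=\s*(\d+)")


def load_matrix(lines):
    """Build {row: [(col, value), ...]} with each list sorted by value, descending."""
    matrix_map = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not matrix entries
        row, col, value = map(int, match.groups())
        matrix_map[row].append((col, value))
    for pairs in matrix_map.values():
        pairs.sort(key=itemgetter(1), reverse=True)
    return dict(matrix_map)


def best_for(matrix_map, row):
    """Return the (col, value) pair with the highest value in this row, or None."""
    pairs = matrix_map.get(row)
    return pairs[0] if pairs else None


exported = ["(0, 1) = 4", "(0, 3) = 8", "(0, 5) = 88",
            "(0, 9) = 100", "(1, 3) = 4", "(6, 0) = 100"]
word_to_id = {"stackoverflow": 0}  # stand-in for the trie lookup

matrix_map = load_matrix(exported)
winner = best_for(matrix_map, word_to_id["stackoverflow"])
print(winner)  # (9, 100)
```

If only the single highest value per row is ever needed, storing just that pair per row instead of the full sorted list would use less memory.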

top down ranges merge?

I want to merge some intervals like this:
>>> ranges = [(30, 45), (40, 50), (10, 50), (60, 90), (90, 100)]
>>> merge(ranges)
[(10, 50), (60, 100)]
I'm not from a CS background. I know how to do it by iteration, but I wonder whether there is a more efficient "top-down" approach, maybe using some special data structure?
Thanks.
Interval tree definitely works, but it is more complex than what you need. Interval tree is an "online" solution, and so it allows you to add some intervals, look at the union, add more intervals, look again, etc.
If you have all the intervals upfront, you can do something simpler:
Start with the input
ranges = [(30, 45), (40, 50), (10, 50)]
Convert the range list into a list of endpoints. If you have range (A, B), you'll convert it to two endpoints: (A, 0) will be the left endpoint and (B, 1) will be the right endpoint.
endpoints = [(30, 0), (45, 1), (40, 0), (50, 1), (10, 0), (50, 1)]
Sort the endpoints
endpoints = [(10, 0), (30, 0), (40, 0), (45, 1), (50, 1), (50, 1)]
Scan forward through the endpoints list. Increment a counter when you see a left endpoint and decrement the counter when you see a right endpoint. Whenever the counter hits 0, you close the current merged interval.
This solution can be implemented in a few lines.
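A minimal Python sketch of the endpoint scan described above could look like this. Because left endpoints are tagged 0 and right endpoints 1, a left endpoint sorts before a right endpoint at the same coordinate, so touching ranges such as (60, 90) and (90, 100) are merged, as in the question's expected output.

```python
def merge(ranges):
    """Merge overlapping or touching (start, end) ranges with an endpoint sweep."""
    endpoints = []
    for start, end in ranges:
        endpoints.append((start, 0))  # left endpoint
        endpoints.append((end, 1))    # right endpoint
    endpoints.sort()

    merged = []
    open_count = 0  # number of ranges currently open
    for value, kind in endpoints:
        if kind == 0:
            if open_count == 0:
                current_start = value  # a new merged interval begins
            open_count += 1
        else:
            open_count -= 1
            if open_count == 0:
                merged.append((current_start, value))  # interval closes
    return merged


ranges = [(30, 45), (40, 50), (10, 50), (60, 90), (90, 100)]
print(merge(ranges))  # [(10, 50), (60, 100)]
```

The sort dominates the running time, so the whole merge is O(n log n) for n ranges.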
Yeah, the efficient way to do it is to use an interval tree.
The following algorithm in C# does what you want. It uses DateTime interval ranges, but you can adapt it however you like. Once the collection is sorted in ascending start order, if the start of the next interval is at or before the end of the previous one, they overlap, and you extend the end time outward if needed. Otherwise they don't overlap, and you save the prior one off to the results.
public static List<DateTimeRange> MergeTimeRanges(List<DateTimeRange> inputRanges)
{
    List<DateTimeRange> mergedRanges = new List<DateTimeRange>();
    // Nothing to merge; this also avoids indexing into an empty list below.
    if (inputRanges.Count == 0)
    {
        return mergedRanges;
    }
    // Sort in ascending start order (assumes DateTimeRange is comparable by Start).
    inputRanges.Sort();
    DateTime currentStart = inputRanges[0].Start;
    DateTime currentEnd = inputRanges[0].End;
    for (int i = 1; i < inputRanges.Count; i++)
    {
        if (inputRanges[i].Start <= currentEnd)
        {
            if (inputRanges[i].End > currentEnd)
            {
                currentEnd = inputRanges[i].End; // Extend range.
            }
        }
        else
        {
            // Save current range to output.
            mergedRanges.Add(new DateTimeRange(currentStart, currentEnd));
            currentStart = inputRanges[i].Start;
            currentEnd = inputRanges[i].End;
        }
    }
    mergedRanges.Add(new DateTimeRange(currentStart, currentEnd));
    return mergedRanges;
}
