Pyspark: weighted average by a column - matrix

For example, I have a dataset like this
test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
]).toDF("orderid", "customerid", "price", "transactiondate", "location")
and I can obtain the customer-region order count matrix by
from pyspark.sql.functions import col, count

overall_stat = test.groupBy("customerid").agg(count("orderid")) \
    .withColumnRenamed("count(orderid)", "overall_count")
temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")) \
    .na.fill(0).join(overall_stat, ["customerid"])
# Turn each region's order count into that customer's share of orders
for field in temp_result.schema.fields:
    if str(field.name) not in ["customerid", "overall_count", "overall_amount"]:
        name = str(field.name)
        temp_result = temp_result.withColumn(name, col(name) / col("overall_count"))
The data would look like this
Now, I want to calculate the weighted average by the overall_count, how can I do it?
The result should be (0.66*3+1*1)/4 for region A, and (0.33*3+1*1)/4 for region B
My thoughts:
This can certainly be done by converting the data to Python/pandas and doing the calculation there, but then in what cases should we use PySpark?
I can get something like
temp_result.agg(sum(col("Region A") * col("overall_count")), sum(col("Region B")*col("overall_count"))).show()
but it doesn't feel right, especially if there are many regions to count.

You can achieve a weighted average by breaking your above steps into multiple stages.
Consider the following:
Dataframe Name: sales_table
[ total_sales, count_of_orders, location]
[ 50 , 9 , A ]
[ 80 , 4 , A ]
[ 90 , 7 , A ]
Calculating the grouped weighted average of the above (70) breaks down into three steps:
Multiplying total_sales by its weight, count_of_orders
Aggregating the sales_x_count products
Dividing the sum of sales_x_count by the sum of count_of_orders
If we break the above into several stages within our PySpark code, you can get the following:
import pyspark.sql.functions as sf
from pyspark.sql.functions import col

new_sales = sales_table \
    .withColumn("sales_x_count", col("total_sales") * col("count_of_orders")) \
    .groupBy("location") \
    .agg(sf.sum("count_of_orders").alias("sum_count_of_orders"),
         sf.sum("sales_x_count").alias("sum_sales_x_count")) \
    .withColumn("count_weighted_average",
                col("sum_sales_x_count") / col("sum_count_of_orders"))
So... no fancy UDF is really necessary here (and would likely slow you down).
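As a plain-Python sanity check of the three steps on the sales_table rows above (a minimal sketch that needs no Spark; the variable names are illustrative, and the weight is assumed to be count_of_orders):

```python
# Rows of sales_table from the example: (total_sales, count_of_orders, location)
rows = [(50, 9, "A"), (80, 4, "A"), (90, 7, "A")]

sums = {}  # location -> [sum of total_sales * count_of_orders, sum of count_of_orders]
for total_sales, count_of_orders, location in rows:
    acc = sums.setdefault(location, [0, 0])
    acc[0] += total_sales * count_of_orders  # steps 1 and 2: multiply, then aggregate
    acc[1] += count_of_orders                # running sum of the weights

# Step 3: divide the aggregated product by the sum of the weights
weighted_average = {loc: sxc / cnt for loc, (sxc, cnt) in sums.items()}
print(weighted_average)  # {'A': 70.0}
```

Here (50*9 + 80*4 + 90*7) / (9 + 4 + 7) = 1400 / 20 = 70, which matches the figure quoted above.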


How to design a heuristic algorithm to solve this location optimization problem?

I simplified the problem to the following description:
To produce an item, it must pass through three devices, A, B and C, in the order A->B->C. Each device must be installed at an address (from 0 to 7) before it can be used. The installation cost of a device differs from address to address, as shown in the figure below.
The addresses are ordered in the direction of the arrows. Device A can be installed at address 0 at a cost of 800, or at address 1 at a cost of 700. The other devices have similar candidate addresses and costs. The following is a valid placement (A at 2, B at 3, C at 5):
If we first install device A at 2 and device C at 3, then B has no address left to install at, so that is an invalid placement.
Finding a valid placement is easy, but because the installation cost differs by address, how can I find a cost-optimal placement among the valid ones? Because the problem is large, I want to use a heuristic search algorithm. How should I design it? Thank you very much if you can answer my question!
This isn’t an assignment problem because of the ordering constraint on the installations. A dynamic program will do. In Python 3:
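import math

input_instance = [
    ("A", [(0, 800), (1, 700), (2, 500), (3, 1000)]),
    ("B", [(2, 200), (3, 1500)]),
    ("C", [(3, 1000), (4, 200), (5, 500), (6, 700)]),
]

# Field indices of a solution tuple: (last address, total cost, installations)
LAST_ADDRESS = 0
TOTAL_COST = 1
INSTALLATIONS = 2

solutions = [(-math.inf, 0, [])]
for device, address_cost_pairs in input_instance:
    next_solutions = []
    i = 0
    for address, cost in sorted(address_cost_pairs):
        if address <= solutions[i][LAST_ADDRESS]:
            continue  # no earlier device fits before this address
        # Advance to the cheapest partial solution that ends before this address
        while i + 1 < len(solutions) and solutions[i + 1][LAST_ADDRESS] < address:
            i += 1
        next_solutions.append((
            address,
            solutions[i][TOTAL_COST] + cost,
            solutions[i][INSTALLATIONS] + [(device, address)],
        ))
    # Keep only solutions that are strictly cheaper than every one ending earlier
    del solutions[:]
    total_cost_limit = math.inf
    for solution in next_solutions:
        if total_cost_limit <= solution[TOTAL_COST]:
            continue
        solutions.append(solution)
        total_cost_limit = solution[TOTAL_COST]
if not solutions:
    print("No solution")
else:
    print(*solutions[-1][1:])
Output:
1100 [('A', 1), ('B', 2), ('C', 4)]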
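On an instance this small, the result can be cross-checked by exhaustive search. The following is only a sketch for verification, not a replacement for the dynamic program: it is exponential in the number of devices, and brute_force is a hypothetical helper name. It assumes, as in the question, that addresses must be strictly increasing in device order.

```python
from itertools import product

# Same instance as above: device -> list of (address, cost) options
input_instance = [
    ("A", [(0, 800), (1, 700), (2, 500), (3, 1000)]),
    ("B", [(2, 200), (3, 1500)]),
    ("C", [(3, 1000), (4, 200), (5, 500), (6, 700)]),
]

def brute_force(instance):
    """Try every combination of options and keep the cheapest one whose
    addresses are strictly increasing in device order."""
    devices = [device for device, _ in instance]
    best = None
    for choice in product(*(options for _, options in instance)):
        addresses = [address for address, _ in choice]
        if any(a >= b for a, b in zip(addresses, addresses[1:])):
            continue  # violates the A -> B -> C ordering
        total = sum(cost for _, cost in choice)
        if best is None or total < best[0]:
            best = (total, list(zip(devices, addresses)))
    return best

best = brute_force(input_instance)
print(best)  # (1100, [('A', 1), ('B', 2), ('C', 4)])
```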

LightGBM predict with pred_contrib=True for multiclass: order of SHAP values in the returned array

LightGBM predict method with pred_contrib=True returns an array of shape =(n_samples, (n_features + 1) * n_classes).
What is the order of data in the second dimension of this array?
In other words, there are two questions:
What is the correct way to reshape this array to use it: shape = (n_samples, n_features + 1, n_classes) or shape = (n_samples, n_classes, n_features + 1)?
In the feature dimension, there are n_features entries, one for each feature, and a (useless) entry for the contribution not related to any feature. What is the order of these entries: feature contributions in the entries 1,..., n_features in the same order they appear in the dataset, with the remaining (useless) entry at index 0, or some other way?
The answers are as follows:
The correct shape is (n_samples, n_classes, n_features + 1).
The feature contributions are in the entries 0, ..., n_features - 1 in the same order they appear in the dataset, and the remaining entry (the expected value, which is not attributed to any feature) is at the last index, n_features.
The following code shows it:
import lightgbm, pandas, numpy
params = {'objective': 'multiclass', 'num_classes': 4, 'num_iterations': 10000,
          'metric': 'multiclass', 'early_stopping_rounds': 10}
train_df = pandas.DataFrame({'f0': [0, 1, 2, 3] * 50, 'f1': [0, 0, 1] * 66 + [1, 2]}, dtype=float)
val_df = train_df.copy()
train_target = pandas.Series([0, 1, 2, 3] * 50)
val_target = pandas.Series([0, 1, 2, 3] * 50)
train_set = lightgbm.Dataset(train_df, train_target)
val_set = lightgbm.Dataset(val_df, val_target)
model = lightgbm.train(params=params, train_set=train_set, valid_sets=[val_set, train_set])
feature_contribs = model.predict(val_df, pred_contrib=True)
print('Shape of SHAP:', feature_contribs.shape)
# Shape of SHAP: (200, 12)
print('Averages over samples:', numpy.mean(feature_contribs, axis=0))
# Averages over samples: [ 3.99942301e-13 -4.02281771e-13 -4.30029167e+00 -1.90606677e-05
# 1.90606677e-05 -4.04157656e+00 2.24205077e-05 -2.24205077e-05
# -4.04265615e+00 -3.70370401e-15 5.20335728e-18 -4.30029167e+00]
feature_contribs = feature_contribs.reshape(200, 4, 3)
print('Mean feature contribs:', numpy.mean(feature_contribs, axis=(0, 1)))
# Mean feature contribs: [ 8.39960111e-07 -8.39960113e-07 -4.17120401e+00]
(Each output appears as a comment on the lines after the statement that prints it.)
The explanation is as follows.
I have created a dataset with two features and with labels identical to the first of these features (f0).
The per-feature SHAP values explain how a prediction deviates from the expected value, so over the data the model was trained on their sum should average to approximately zero, while the expected value is the same constant for every sample.
After averaging the SHAP output over the samples, we get an array of the shape (12,) with nonzero values at the positions 2, 5, 8, 11 (zero-based).
The nonzero value recurs every 3 entries, which shows that the correct shape of this array is (4, 3): one block of n_features + 1 entries per class.
After reshaping this way and averaging over the samples and the classes, we get an array of the shape (3,) with the nonzero entry at the end.
Within each block, the first two averages are tiny and cancel each other (for example -1.90606677e-05 and 1.90606677e-05), as feature contributions should, so they correspond to f0 and f1. The nonzero entry at the last position is therefore the expected value. This agrees with the LightGBM documentation, which states that the last column of the pred_contrib output is the expected value.
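The index arithmetic behind the reshape can be checked without LightGBM. This is a minimal sketch that reuses the averaged values printed above; the 1e-3 threshold separating large entries from near-zero ones is an arbitrary choice for illustration.

```python
# Per-column averages copied from the output above (12 = 4 classes * 3 entries)
averages = [
    3.99942301e-13, -4.02281771e-13, -4.30029167e+00, -1.90606677e-05,
    1.90606677e-05, -4.04157656e+00, 2.24205077e-05, -2.24205077e-05,
    -4.04265615e+00, -3.70370401e-15, 5.20335728e-18, -4.30029167e+00,
]
n_features = 2
block = n_features + 1  # entries per class

# In a (n_classes, n_features + 1) layout, flat column j is
# (class, entry) = divmod(j, block)
large = [divmod(j, block) for j, value in enumerate(averages) if abs(value) > 1e-3]
print(large)  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

Every large entry falls on the same within-class index, once per class, which is what the (n_samples, n_classes, n_features + 1) layout predicts. Under the alternative (n_features + 1, n_classes) layout the pairs would be divmod(j, 4), and the large entries would not line up on a single index.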

Coin change with limited coins complexity

If there is an unlimited number of every coin, then the complexity is O(n*m), where n is the total change and m is the number of coin types. When the coins of each type are limited, we have to take the remaining coins into account. I managed to make it work with a complexity of O(n*m^2), using another for loop of size n to track the remaining coins of each type. Is there a trick to improve the complexity? EDIT: The problem is to compute the least amount of coins required to make the exact given change, and the number of times each coin type is used.
There is no need for an extra loop. You need to:
Recurse with a depth of at most m (the number of coin types) levels, dealing with one specific coin type per recursion level.
Loop at most n times at each recursion level in order to decide how many you will take of a given coin.
Here is how the code would look in Python 3:
def getChange(coins, amount, coinIndex = 0):
    if amount == 0:
        return [] # success
    if coinIndex >= len(coins):
        return None # failure
    coin = coins[coinIndex]
    coinIndex += 1
    # Start by taking as many as possible from this coin
    canTake = min(amount // coin["value"], coin["count"])
    # Reduce the number taken from this coin until success
    for count in range(canTake, -1, -1): # count will go down to zero
        # Recurse to decide how many to take from the next coins
        change = getChange(coins, amount - coin["value"] * count, coinIndex)
        if change is not None: # We had success
            if count: # Register this number for this coin:
                return change + [{ "value": coin["value"], "count": count }]
            return change
# Example data and call:
coins = [
    { "value": 20, "count": 2 },
    { "value": 10, "count": 2 },
    { "value": 5, "count": 3 },
    { "value": 2, "count": 2 },
    { "value": 1, "count": 10 }
]
result = getChange(coins, 84)
print(result)
Output for the given example:
[{'value': 1, 'count': 5},
 {'value': 2, 'count': 2},
 {'value': 5, 'count': 3},
 {'value': 10, 'count': 2},
 {'value': 20, 'count': 2}]
Minimising the number of coins used
As stated in comments, the above algorithm returns the first solution it finds. If the number of individual coins must be minimised when there are multiple solutions, then you cannot return partway through the loop, but must retain the "best" solution found so far.
Here is the modified code to achieve that:
def getchange(coins, amount):
    minCount = None

    def recurse(amount, coinIndex, coinCount):
        nonlocal minCount
        if amount == 0:
            if minCount is None or coinCount < minCount:
                minCount = coinCount
                return [] # success
            return None # not optimal
        if coinIndex >= len(coins):
            return None # failure
        bestChange = None
        coin = coins[coinIndex]
        # Start by taking as many as possible from this coin
        cantake = min(amount // coin["value"], coin["count"])
        # Reduce the number taken from this coin until 0
        for count in range(cantake, -1, -1):
            # Recurse, taking out this coin as a possible choice
            change = recurse(amount - coin["value"] * count, coinIndex + 1,
                             coinCount + count)
            # Do we have a solution that is better than the best so far?
            if change is not None:
                if count: # Does it involve this coin?
                    change.append({ "value": coin["value"], "count": count })
                bestChange = change # register this as the best so far
        return bestChange

    return recurse(amount, 0, 0)
coins = [{ "value": 10, "count": 2 },
         { "value": 8, "count": 2 },
         { "value": 3, "count": 10 }]
result = getchange(coins, 26)
print(result)
Output:
[{'value': 8, 'count': 2},
 {'value': 10, 'count': 1}]
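For comparison, when only the minimum number of coins is needed, a different technique from the recursion above also works: a 0/1-knapsack-style dynamic program that treats every physical coin as a separate item. This is a sketch under that assumption, not the method of this answer; min_coins_limited is a hypothetical helper, it takes (value, count) pairs rather than dictionaries, and it does not report how many coins of each type were used. It runs in O(amount * total number of coins) time.

```python
def min_coins_limited(coins, amount):
    """coins is a list of (value, count) pairs. Returns the minimum number
    of coins that sum exactly to amount, or None if that is impossible."""
    INF = float("inf")
    best = [0] + [INF] * amount  # best[a] = fewest coins seen so far for amount a
    for value, count in coins:
        for _ in range(count):  # each physical coin is one 0/1 item
            # Go downwards so this physical coin is used at most once
            for a in range(amount, value - 1, -1):
                if best[a - value] + 1 < best[a]:
                    best[a] = best[a - value] + 1
    return None if best[amount] == INF else best[amount]

# Same coins as the example above: two 10s, two 8s, ten 3s
print(min_coins_limited([(10, 2), (8, 2), (3, 10)], 26))  # 3
```

The result agrees with the recursive solution above, which uses 8 + 8 + 10.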
Here's an implementation of an O(nm) solution in Python.
If one defines C(c, k) = 1 + x^c + x^(2c) + ... + x^(kc), then the program calculates the first n+1 coefficients of the polynomial product(C(c[i], k[i]), i = 1...ncoins). The j'th coefficient of this polynomial is the number of ways of making change for j.
When all the ks are unlimited, this polynomial product is easy to calculate. When they are limited, one needs to be able to calculate running sums of k terms efficiently, which is done in the program using the rs array.
# cs is a list of pairs (c, k) where there's k
# coins of value c.
def limited_coins(cs, n):
    r = [1] + [0] * n
    for c, k in cs:
        # rs[i] will contain the sum r[i] + r[i-c] + r[i-2c] + ...
        rs = r[:]
        for i in range(c, n+1):
            rs[i] += rs[i-c]
            # This line effectively performs:
            # r'[i] = sum(r[i-j*c] for j=0...k)
            # but using rs[] so that the computation is O(1)
            # and in place.
            r[i] += rs[i-c] - (0 if i<c*(k+1) else rs[i-c*(k+1)])
    return r[n]

for n in range(50):
    print(n, limited_coins([(1, 3), (2, 2), (5, 3), (10, 2)], n))
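To see what the polynomial product counts, here is a small cross-check (a sketch with hypothetical helper names). It multiplies the truncated polynomials C(c, k) directly, without the running-sum trick, and compares the coefficient of x^n with a brute-force enumeration of how many coins of each type are taken. Both are far slower than the O(nm) program above and are meant only for small inputs.

```python
from itertools import product

def count_ways_bruteforce(cs, n):
    """Enumerate 0..k copies of every coin value and count the combinations
    that sum to n."""
    ranges = [range(k + 1) for _, k in cs]
    return sum(
        1
        for counts in product(*ranges)
        if sum(c * t for (c, _), t in zip(cs, counts)) == n
    )

def count_ways_polynomial(cs, n):
    """Coefficient of x^n in the product of C(c, k) = 1 + x^c + ... + x^(kc),
    computed by plain multiplication truncated at degree n."""
    r = [1] + [0] * n
    for c, k in cs:
        new_r = [0] * (n + 1)
        for i, coeff in enumerate(r):
            for j in range(k + 1):
                if i + j * c > n:
                    break
                new_r[i + j * c] += coeff
        r = new_r
    return r[n]

cs = [(1, 3), (2, 2), (5, 3), (10, 2)]
for n in range(30):
    assert count_ways_bruteforce(cs, n) == count_ways_polynomial(cs, n)
```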

Filtering Spatial Data in Apache Spark

I am currently solving a problem involving GPS data from buses. The issue I am facing is to reduce computation in my process.
There are about 2 billion GPS-coordinate points (Lat-Long degrees) in one table and about 12,000 bus-stops with their Lat-Long in another table. It is expected that only 5-10% of the 2-billion points are at bus-stops.
Problem: I need to tag and extract only those points (out of the 2-billion) that are at bus-stops (the 12,000 points). Since this is GPS data, I cannot do exact matching of the coordinates, but rather do a tolerance based geofencing.
Issue: Tagging the bus-stops takes an extremely long time with the current naive approach. Currently, we pick each of the 12,000 bus-stop points and query the 2 billion points with a tolerance of 100 m (by converting degree differences into distances).
Question: Is there an algorithmically efficient process to achieve this tagging of points?
Yes, you can use something like SpatialSpark. It only works with Spark 1.6.1, but you can use its BroadcastSpatialJoin, which builds an R-tree and is extremely efficient.
Here's an example of me using SpatialSpark with PySpark to check if different polygons are within each other or are intersecting:
from ast import literal_eval as make_tuple
from shapely.geometry import Point, Polygon  # assumed: the .wkt attribute suggests Shapely

print("Java Spark context version:", sc._jsc.version())
spatialspark = sc._jvm.spatialspark

rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((-1, -1))

def geomABWithId():
    return sc.parallelize([
        (0, rectangleA.wkt),
        (1, rectangleB.wkt)
    ])

def geomCWithId():
    return sc.parallelize([
        (0, rectangleC.wkt)
    ])

def geomABCWithId():
    return sc.parallelize([
        (0, rectangleA.wkt),
        (1, rectangleB.wkt),
        (2, rectangleC.wkt)])

def geomDWithId():
    return sc.parallelize([
        (0, pointD.wkt)
    ])

dfAB = sqlContext.createDataFrame(geomABWithId(), ['id', 'wkt'])
dfABC = sqlContext.createDataFrame(geomABCWithId(), ['id', 'wkt'])
dfC = sqlContext.createDataFrame(geomCWithId(), ['id', 'wkt'])
dfD = sqlContext.createDataFrame(geomDWithId(), ['id', 'wkt'])
# Supported Operators: Within, WithinD, Contains, Intersects, Overlaps, NearestD
SpatialOperator = spatialspark.operator.SpatialOperator
BroadcastSpatialJoin = spatialspark.join.BroadcastSpatialJoin
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
results = joinRDD.collect()
list(map(lambda result: make_tuple(result.toString()), results))
# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
Note the line
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
The last argument is a buffer value; in your case it would be the tolerance you want to use. It will probably be a very small number if you are using lat/lon, since those are angular coordinates measured in degrees: you will need to convert the tolerance you want in meters into degrees for your area of interest.
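A rough way to turn a tolerance in meters into degrees is sketched below. It assumes a spherical Earth with about 111,320 m per degree of latitude, which is an approximation, and tolerance_in_degrees is a hypothetical helper, not part of SpatialSpark.

```python
import math

METERS_PER_DEGREE_LAT = 111_320.0  # approximate; assumes a spherical Earth

def tolerance_in_degrees(meters, latitude_deg):
    """Approximate (latitude, longitude) extents, in degrees, of a distance
    given in meters at the given latitude."""
    dlat = meters / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    dlon = meters / (METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg)))
    return dlat, dlon

print(tolerance_in_degrees(100, 0.0))   # at the equator both extents are equal
print(tolerance_in_degrees(100, 60.0))  # at 60 degrees the longitude extent is about twice as large
```

Since the buffer is a single number, one option is to pass the larger of the two extents and then discard false positives with an exact distance check on the much smaller joined result.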

Bayesian Algorithm for sorting hotel reviews

I'm trying to find a formula or algorithm to sort a set of hotels by the most useful rating.
The thing is that, in a given place, I have three different hotels with the following information:
Hotel A: 124 reviews, 8.6 avg rating.
Hotel B: 10 reviews, 8.8 avg rating.
Hotel C: 1000 reviews, 8 avg rating.
I tried the algorithm used here: What is a better way to sort by a 5 star rating?
WR = (v * R + m * C) / (v + m)
But I'm not able to reflect that the "score" should be higher for Hotel C, because its number of reviews is the largest.
If I can get that solved, I imagine the sort order should be close to: 1) Hotel C; 2) Hotel A; 3) Hotel B.
Thank you!
This seems to be a duplicate of this one. It seems that, to get the proper order, you have to assign a predefined number of downvotes equal to the greatest number of reviews any item has, instead of summing the total number of votes of all items and dividing by the number of items as I had thought before. In Python:
pretend_votes = [1000, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rating = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
def score(item_votes):
    votes = [iv+pv for (iv,pv) in zip(item_votes,pretend_votes)]
    return sum(v*u for (v,u) in zip(votes,rating))/float(sum(votes))
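The question gives only review counts and averages, not per-rating vote counts, so score() cannot be called on it directly. Because the score depends only on the sum of the ratings and the number of votes, a sketch can work from the averages instead, assuming each hotel's ratings sum to v * R. This is the WR formula from the question with m = 1000 pretend votes and C = 1; score_from_average is a hypothetical helper.

```python
# (number of reviews v, average rating R) from the question
hotels = {"Hotel A": (124, 8.6), "Hotel B": (10, 8.8), "Hotel C": (1000, 8.0)}

PRETEND_VOTES = 1000  # m: the largest review count among the hotels
PRETEND_RATING = 1    # C: every pretend vote is the lowest rating

def score_from_average(v, r, m=PRETEND_VOTES, c=PRETEND_RATING):
    # Same value as score() when the item's ratings sum to v * r
    return (v * r + m * c) / (v + m)

ranking = sorted(hotels, key=lambda name: score_from_average(*hotels[name]), reverse=True)
print(ranking)  # ['Hotel C', 'Hotel A', 'Hotel B']
```

With these assumptions Hotel C scores 4.5, Hotel A about 1.84 and Hotel B about 1.08, which gives the order the question asks for.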
