Poor weak scaling of Apache Spark join operation - performance

I run the "join" operation on Apache Spark and see that there is no weak scalability. I would be grateful if anyone could explain this.
I create two dataframes ("a", "b") and ("a", "c") and join the dataframes on the first column. I generate the dataframe values for a "one to one" join. Also, I use the same partitioner to avoid a shuffle.
Number of rows in each dataframe: 1024 * 1024 * 16 * cores_total (where cores_total is the total number of cores on which the program is launched).
Column "a" consists of random Int values, all values of column "b" equal 1, and all values of column "c" equal 2.
Theoretically, when the data size and the number of cores are both increased 64 times, the execution time should remain the same, but in practice it grows slightly. I obtain the following execution times:
Apache Spark version: 2.1.0. We use 8 cluster nodes connected by 1 Gbit Ethernet; each node has 2x Intel Xeon E5-2630 and 64 GB RAM.
/* join perf */
import scala.io.Source
import scala.math._
import org.apache.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.util.control.Breaks._
import scala.collection.mutable._
import org.apache.spark.rdd._
import org.apache.spark.sql._
import scala.util.Random
import org.apache.spark.util.SizeEstimator
import org.apache.spark.HashPartitioner
object joinPerf {
  def get_array(n: Int): Array[Int] = {
    var res = Array[Int]()
    for (x <- 1 to n) {
      res :+= Random.nextInt
    }
    return res
  }

  def main(args: Array[String]) {
    val start_time = System.nanoTime
    val conf = new SparkConf().setAppName("joinPerf")
    val sc = new SparkContext(conf)
    val cores_total = sc.getConf.get("spark.cores.max").toInt
    val partitions_total = sc.getConf.get("spark.default.parallelism").toInt
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._
    println("start")
    val elems_total = 1024 * 1024 * 16 * cores_total
    val start_cnt = 1024 * 1024
    Random.setSeed(785354)
    var vals = Vector[Int]()
    for (x <- 1 to start_cnt) {
      vals :+= Random.nextInt
    }
    var test_rdd = sc.parallelize(vals)
    println(test_rdd.count)
    test_rdd = test_rdd.flatMap(x => get_array(elems_total / start_cnt)).distinct
    println("test_rdd count = " + test_rdd.count)
    println("partitions count = " + test_rdd.getNumPartitions)
    var test_rdd1 = test_rdd.map(x => (x, 1)).toDF("a", "b").repartition(partitions_total, $"a").cache
    var test_rdd2 = test_rdd.map(x => (x, 2)).toDF("a", "c").repartition(partitions_total, $"a").cache
    println("test_rdd1 count = " + test_rdd1.count)
    println("test_rdd2 count = " + test_rdd2.count)
    var start_test_time = System.nanoTime
    var test_res = test_rdd1.join(test_rdd2, test_rdd1("a") === test_rdd2("a"))
    println(test_res.count)
    print("join time = ")
    println((System.nanoTime - start_test_time) / 1e9d + " sec. ")
    print("all time = ")
    println((System.nanoTime - start_time) / 1e9d + " sec. ")
    sc.stop()
  }
}
config parameters:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1024
spark.kryo.unsafe true
spark.kryo.referenceTracking false
spark.driver.memory 22g
spark.executor.memory 22g
spark.driver.maxResultSize 22g
spark.rpc.message.maxSize 2047
spark.memory.fraction 0.8
spark.memory.storageFraction 0.5
spark.executor.extraJavaOptions "-XX:+UseParallelGC"
Partitions per core - 4.
Example of launching the program:
./bin/spark-submit --class "joinPerf" --conf spark.executor.cores=8 --conf spark.cores.max=64 --conf spark.default.parallelism=256 ./joinPerf.jar

Theoretically, with the increase of the data size and cores by 64 times, the execution time should remain the same, but the execution time slightly grows
It shouldn't. While one could expect linear scalability, assuming no IO bottlenecks, when performing strictly local operations on uniformly distributed data, this is no longer the case when transformations require data exchange (RDD shuffles, Dataset Exchange). Among wide transformations, joins belong to the most expensive category (next to groupByKey-like operations), due to their non-reducing nature and their use of large, local, supporting collections.
Shuffles not only have higher-than-linear complexity (at least O(N log N) for sort-based methods), but can also induce a non-uniform distribution of data, and require significant disk and network IO.
This is even more severe in the case of your code, which shuffles the data twice - once to repartition the RDDs and once to join the Datasets (the HashPartitioner used for RDDs is not compatible with Dataset partitioning).
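To check how many shuffles actually happen, you can inspect the physical plan and count the Exchange nodes. A minimal sketch in PySpark (the DataFrame API is analogous to the Scala code above; the column names and partition count here are only illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("exchange-check").getOrCreate()

# Two toy dataframes keyed by column "a" (sizes and names are illustrative only)
df1 = spark.range(1000).withColumnRenamed("id", "a").withColumn("b", lit(1))
df2 = spark.range(1000).withColumnRenamed("id", "a").withColumn("c", lit(2))

# repartition() introduces one shuffle per dataframe
df1 = df1.repartition(8, "a")
df2 = df2.repartition(8, "a")

# Every Exchange node in the printed plan is a shuffle, so you can count how
# many the whole pipeline triggers for your combination of repartitioning
# and join keys.
df1.join(df2, "a").explain()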
Finally, increasing the cluster size has its own performance impact, related to increased communication and synchronization overhead and decreased data locality.
Overall you'll rarely see truly linear scalability, and even if you do, you can expect the slope to be < 1.
On a side note, I wouldn't depend on the cache / count idiom when working with Datasets. It is likely to be unreliable.
See also Spark: Inconsistent performance number in scaling number of cores

Related

Why is the computing efficiency of torch.tanh much higher than the direct expression?

The two ways of computing tanh are shown below. Why is the computing efficiency of torch.tanh (1) much higher than the direct expression (2)? I am confused. And where can I find the original code of torch.tanh in PyTorch? Is it written in C/C++?
import torch
import time

def tanh(x):
    return (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

class Function(torch.nn.Module):
    def __init__(self):
        super(Function, self).__init__()
        self.Linear1 = torch.nn.Linear(3, 50)
        self.Linear2 = torch.nn.Linear(50, 50)
        self.Linear3 = torch.nn.Linear(50, 50)
        self.Linear4 = torch.nn.Linear(50, 1)
    def forward(self, x):
        # (1) for torch.tanh
        x = torch.tanh(self.Linear1(x))
        x = torch.tanh(self.Linear2(x))
        x = torch.tanh(self.Linear3(x))
        x = torch.tanh(self.Linear4(x))
        # (2) for the direct expression
        # x = tanh(self.Linear1(x))
        # x = tanh(self.Linear2(x))
        # x = tanh(self.Linear3(x))
        # x = tanh(self.Linear4(x))
        return x

func = Function()
x = torch.ones(1000, 3)
T1 = time.time()
for i in range(10000):
    y = func(x)
T2 = time.time()
print(T2 - T1)
The mathematical functions are written in highly optimized code; they can use advanced CPU features and multiple cores, and can even take advantage of GPUs.
Your tanh function evaluates exp four times and does two subtractions and one division; creating the temporary tensors requires memory allocations that can be slow as well, not to mention the overhead of the Python interpreter. Being 4 to 10 times slower is reasonable.
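If you do want a hand-written version, you can at least cut the four exp calls down to one using the identity tanh(x) = (1 - e^(-2x)) / (1 + e^(-2x)). A rough, machine-dependent timing sketch (torch.tanh should still win, since it runs as a single native kernel):

import torch
import time

def tanh_naive(x):
    # four exp calls and several temporaries
    return (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))

def tanh_one_exp(x):
    # one exp call; note this form can overflow for very negative x
    e = torch.exp(-2 * x)
    return (1 - e) / (1 + e)

x = torch.randn(1000, 3)
for fn in (tanh_naive, tanh_one_exp, torch.tanh):
    t0 = time.time()
    for _ in range(10000):
        y = fn(x)
    print(fn.__name__, time.time() - t0)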

Problem with dictionaries when using numba njit parallelization to accelerate the code

I have written some code and am trying to use numba to accelerate it. The main goal of the code is to group some values based on a condition. In this regard, iter_ is used to make the code converge to satisfy the condition. I prepared a small case below to reproduce the sample code:
import numpy as np
import numba as nb
rng = np.random.default_rng(85)
# --------------------------------------- small data volume ---------------------------------------
# values_ = {'R0': np.array([0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723, 0.00990762,
# 0.01090976, 0.01069902, 0.00724112, 0.0068463 , 0.01135723]),
# 'R1': np.array([0.01836379, 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088,
# 0.01900166, 0.01864162, 0.0182823 , 0.01840322, 0.01653088]),
# 'R2': np.array([0.02430913, 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411,
# 0.02239156, 0.02225379, 0.02093393, 0.02408692, 0.02110411])}
#
# params = {'R0': [3, 0.9490579204466154, 1825, 7.070272000000002e-05],
# 'R1': [0, 0.9729203826820172, 167 , 7.070272000000002e-05],
# 'R2': [1, 0.6031363088057902, 1316, 8.007296000000003e-05]}
#
# Sno, dec_, upd_ = 2, 100, 200
# -------------------------------------------------------------------------------------------------
# ----------------------------- UPDATED (medium and large data volumes) ---------------------------
# values_ = np.load("values_med.npy", allow_pickle=True)[()]
# params = np.load("params_med.npy", allow_pickle=True)[()]
values_ = np.load("values_large.npy", allow_pickle=True)[()]
params = np.load("params_large.npy", allow_pickle=True)[()]
Sno, dec_, upd_ = 2000, 1000, 200
# -------------------------------------------------------------------------------------------------
# values_ = [*values_.values()]
# params = [*params.values()]
# @nb.jit(forceobj=True)
# def test(values_, params, Sno, dec_, upd_):
final_dict = {}
for i, j in enumerate(values_.keys()):
    Rand_vals = []
    goal_sum = params[j][1] * params[j][3]
    tel = goal_sum / dec_ * 10
    if params[j][0] != 0:
        for k in range(Sno):
            final_sum = 0.0
            iter_ = 0
            t = 1
            while not np.allclose(goal_sum, final_sum, atol=tel):
                iter_ += 1
                vals_group = rng.choice(values_[j], size=params[j][0], replace=False)
                # final_sum = 0.0016 * np.sum(vals_group)   # -----> For small data volume
                final_sum = np.sum(vals_group ** 3)          # -----> UPDATED For med or large data volume
                if iter_ == upd_:
                    t += 1
                    tel = t * tel
            values_[j] = np.delete(values_[j], np.where(np.in1d(values_[j], vals_group)))
            Rand_vals.append(vals_group)
    else:
        Rand_vals = [np.array([])] * Sno
    final_dict["R" + str(i)] = Rand_vals
# return final_dict
# test(values_, params, Sno, dec_, upd_)
At first, for applying numba to this code, @nb.jit was used (forceobj=True is used for avoiding warnings and …), which has an adverse effect on the performance. nopython mode was checked too, with @nb.njit, which gives the following error because the dictionary type of the inputs is not supported (as mentioned in 1, 2):
cannot determine Numba type of <class 'dict'>
I don't know if (or how) it could be handled by Dict from numba.typed (by converting the created Python dictionaries to numba Dict), or whether converting the dictionaries to lists of arrays has any advantage. I think parallelization may be possible if some code lines, e.g. Rand_vals.append(vals_group) or the else section or …, were taken out of the function or modified, to get the same results as before, but I don't have any idea how to do so.
I will be grateful for any help utilizing numba on this code. numba parallelization would be the most desired (probably the best applicable method in terms of performance) solution if it is feasible.
Data:
medium data volume: values_med, params_med
large data volume: values_large, params_large
This code can be converted to Numba but it is not straightforward.
First of all, the dictionary and list types must be defined, since Numba njit functions cannot directly operate on reflected lists (aka pure-Python lists). This is a bit tedious to do in Numba and the resulting code is a bit verbose:
String = nb.types.unicode_type
ValueArray = nb.float64[::1]
ValueDict = nb.types.DictType(String, ValueArray)
ParamDictValue = nb.types.Tuple([nb.int_, nb.float64, nb.int_, nb.float64])
ParamDict = nb.types.DictType(String, ParamDictValue)
FinalDictValue = nb.types.ListType(ValueArray)
FinalDict = nb.types.DictType(String, FinalDictValue)
Then you need to convert the input dictionaries:
nbValues = nb.typed.typeddict.Dict.empty(String, ValueArray)
for key, value in values_.items():
    nbValues[key] = value.copy()

nbParams = nb.typed.typeddict.Dict.empty(String, ParamDictValue)
for key, value in params.items():
    nbParams[key] = (nb.int_(value[0]), nb.float64(value[1]), nb.int_(value[2]), nb.float64(value[3]))
Then, you need to write the core function. np.allclose and np.isin are not implemented in Numba, so they have to be reimplemented manually. But the main point is that Numba does not support the rng Numpy object (the Generator returned by np.random.default_rng). I think it will certainly not support it any time soon. Note that Numba has a random number implementation that tries to mimic the behavior of Numpy, but the management of the seed is a bit different. Note also that results should be the same as with the np.random.xxx Numpy functions if the seed is set to the same value (Numpy and Numba have different seed variables that are not synchronized).
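For reference, a tiny sketch of how the np.allclose criterion can be reproduced inside an njit function (this mirrors the inline check used in the code below; 1e-05 is np.allclose's default rtol):

import numba as nb
import numpy as np

@nb.njit
def allclose_scalar(a, b, rtol, atol):
    # same criterion as np.allclose for two scalars: |a - b| <= atol + rtol * |b|
    return np.abs(a - b) <= atol + rtol * np.abs(b)

# allclose_scalar(goal_sum, final_sum, 1e-05, tel) behaves like
# np.allclose(goal_sum, final_sum, atol=tel)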
@nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
    final_dict = nb.typed.Dict.empty(String, FinalDictValue)
    for i, j in enumerate(values_.keys()):
        Rand_vals = nb.typed.List.empty_list(ValueArray)
        goal_sum = params[j][1] * params[j][3]
        tel = goal_sum / dec_ * 10
        if params[j][0] != 0:
            for k in range(Sno):
                final_sum = 0.0
                iter_ = 0
                t = 1
                vals_group = np.empty(0, dtype=nb.float64)
                while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
                    iter_ += 1
                    vals_group = np.random.choice(values_[j], size=params[j][0], replace=False)
                    # final_sum = 0.0016 * np.sum(vals_group)   # (for small data volume)
                    final_sum = np.sum(vals_group ** 3)          # (for med or large data volume)
                    if iter_ == upd_:
                        t += 1
                        tel = t * tel
                # Perform an in-place deletion
                vals, gr = values_[j], vals_group
                cur = 0
                for l in range(vals.size):
                    found = False
                    for m in range(gr.size):
                        found |= vals[l] == gr[m]
                    if not found:
                        # Keep the value (delete it otherwise)
                        vals[cur] = vals[l]
                        cur += 1
                values_[j] = vals[:cur]
                Rand_vals.append(vals_group)
        else:
            for k in range(Sno):
                Rand_vals.append(np.empty(0, dtype=nb.float64))
        final_dict["R" + str(i)] = Rand_vals
    return final_dict
Note that the replacement implementation of np.isin is quite naive but it works pretty well in practice on your input example.
The function can be called in the following way:
nbFinalDict = nbTest(nbValues, nbParams, Sno, dec_, upd_)
Finally, the dictionary should be converted back to basic Python objects:
finalDict = dict()
for key, value in nbFinalDict.items():
    finalDict[key] = list(value)
This implementation is fast for small inputs but not for large ones, since np.random.choice takes almost all the time (>96%). The thing is that this function is clearly not optimal when the number of requested items is small (which is your case). Indeed, it surprisingly runs in time linear in the size of the input array, not in the number of requested items.
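You can see this behaviour with a quick, machine-dependent micro-benchmark: drawing the same small sample without replacement gets slower as the source array grows, even though the sample size is fixed:

import numpy as np
import timeit

for m in (1_000, 100_000, 1_000_000):
    arr = np.random.rand(m)
    t = timeit.timeit(lambda: np.random.choice(arr, size=12, replace=False), number=50)
    print(f"array size {m:>9}: {t / 50 * 1e6:9.1f} us per draw")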
Further Optimizations
The algorithm can be completely rewritten to extract only 12 random items and discard them from the main current array in a much more efficient way. The idea is to swap n items (the small target sample) at the end of the array with other items at random locations, then check the sum, repeat this process until a condition is fulfilled, and finally extract the view of the last n items before shrinking the view so as to discard them. All of this can be done in O(n) time rather than O(m) time, where m is the size of the main current array and n << m (e.g. 12 vs 20_000). It can also be computed without any expensive allocation. Here is the resulting code:
@nb.njit(nb.void(ValueArray, nb.int_, nb.int_))
def swap(arr, i, j):
    arr[i], arr[j] = arr[j], arr[i]

@nb.njit(FinalDict(ValueDict, ParamDict, nb.int_, nb.int_, nb.int_))
def nbTest(values_, params, Sno, dec_, upd_):
    final_dict = nb.typed.Dict.empty(String, FinalDictValue)
    for i, j in enumerate(values_.keys()):
        Rand_vals = nb.typed.List.empty_list(ValueArray)
        goal_sum = params[j][1] * params[j][3]
        tel = goal_sum / dec_ * 10
        values = values_[j]
        n = params[j][0]
        if n != 0:
            for k in range(Sno):
                final_sum = 0.0
                iter_ = 0
                t = 1
                m = values.size
                assert n <= m
                group = values[-n:]
                while np.abs(goal_sum - final_sum) > (1e-05 * np.abs(final_sum) + tel):
                    iter_ += 1
                    # Swap the group view with other random items
                    for pos in range(m - n, m):
                        swap(values, pos, np.random.randint(0, m))
                    # For small data volume:
                    # final_sum = 0.0016 * np.sum(group)
                    # For med/large data volume
                    final_sum = 0.0
                    for v in group:
                        final_sum += v ** 3
                    if iter_ == upd_:
                        t += 1
                        tel *= t
                assert iter_ > 0
                values = values[:m-n]
                Rand_vals.append(group)
        else:
            for k in range(Sno):
                Rand_vals.append(np.empty(0, dtype=nb.float64))
        final_dict["R" + str(i)] = Rand_vals
    return final_dict
In addition to being faster, this implementation has the benefit of being simpler. The results look quite similar to those of the previous implementation, although the randomness makes checking the results tricky (especially since this function does not use the same method to choose the random sample). Note that this implementation does not remove the items of values that are in group, as opposed to the previous one (this is probably not wanted, though).
Benchmark
Here are the results of the last implementation on my machine (compilation and conversion timings excluded):
Provided small input (embedded in the question):
- Initial code: 42.71 ms
- Numba code: 0.11 ms
Medium input:
- Initial code: 3481 ms
- Numba code: 11 ms
Large input:
- Initial code: 6728 ms
- Numba code: 20 ms
Note that the conversion takes about the same time as the computation.
This last implementation is 316~388 times faster than the initial code on small inputs.
Notes
Note that the compilation takes a few seconds due to the dict and list types.
Note that while it may be possible to parallelise the implementation, only the most encompassing loop can be parallelised. The thing is that there are only a few items to compute and the time is already quite small (not the best case for multi-threading). Additionally, the creation of many temporary arrays (created by rng.choice) would certainly cause the parallel loop not to scale well anyway. Additionally, the list/dict cannot be written from multiple threads safely, so one needs to use Numpy arrays in the whole function to be able to do that (or add additional conversions that are already expensive). Moreover, Numba parallelism tends to significantly increase the compilation time, which is already significant. Finally, the result will be less deterministic, since each Numba thread has its own random number generator seed and the items computed by the threads cannot be predicted with prange (it depends on the parallel runtime chosen on the target platform). Note that in Numpy there is one global seed used by the usual random functions (the deprecated way), while RNG objects have their own seed (the new preferred way).

How to speed up this python loop script or parallelize it

I am currently working on a script which takes in data for a correlation matrix and computes a bunch of values. This step right here is very costly, and I would like to find ways to speed it up or parallelize it. I have tried using Parallel (from Python's joblib); however, because of CPU overhead (at least because of the way I parametrized it), it is significantly slower than a sequential loop.
import time
import numpy as np
import itertools
from sklearn.datasets import make_blobs

N = 5000
data, _ = make_blobs(n_samples=N, n_features=500)
G = np.corrcoef(data)

''' the cluster function '''
def clus_lc(i, j, G, ns=2):
    ''' c_s '''
    cs = 2*(G[i,j]+1) - 1e-6
    ''' A and B '''
    if cs <= ns:
        return 0
    return 0.5*( np.log(ns/cs) + (ns - 1)*np.log( (ns**2 - ns) / ( ns**2 - cs) ) )

''' merge and time '''
indices = list(itertools.combinations(range(N), 2))
t0 = time.time()
costs = np.zeros(len(indices))
k = 0
for i, j in indices:
    costs[k] = clus_lc(i, j, G)
    k += 1
t1 = time.time()
toseq = t1 - t0
print(toseq)
I think I solved the issue by using numba and adding a @jit decorator. This seems to work fine because all operations in the function are calls to numpy functions. On a dataset with N=5000 it goes from 75 sec to 10 sec. Fantastic improvement. Whether this can be further improved, I am interested in hearing other inputs.
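For completeness, here is a hedged sketch of what such a numba version might look like (using @njit and also moving the pair loop into a compiled function, which removes the per-call Python overhead; exact speedups will vary):

import numpy as np
from numba import njit

@njit
def clus_lc_nb(i, j, G, ns=2.0):
    cs = 2.0 * (G[i, j] + 1.0) - 1e-6
    if cs <= ns:
        return 0.0
    return 0.5 * (np.log(ns / cs) + (ns - 1.0) * np.log((ns**2 - ns) / (ns**2 - cs)))

@njit
def all_costs(G, N):
    # same (i, j) order as itertools.combinations(range(N), 2)
    out = np.empty(N * (N - 1) // 2)
    k = 0
    for i in range(N):
        for j in range(i + 1, N):
            out[k] = clus_lc_nb(i, j, G)
            k += 1
    return out

# costs = all_costs(G, N)   # with G and N as defined in the question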

Integration of orbits with solar system gravity fields from Skyfield - speed issues

In the time tests shown below, I found that Skyfield takes several hundred microseconds up to a millisecond to return obj.at(jd).position.km for a single time value in jd, but the incremental cost for longer JulianDate objects (a list of points in time) is only about one microsecond per point. I see similar speeds using Jplephem and with two different ephemerides.
My question here is: if I want to random-access points in time, for example as a slave to an external Runge-Kutta routine which uses its own variable stepsize, is there a way I can do this faster within python (without having to learn to compile code)?
I understand this is not at all the typical way Skyfield is intended to be used. Normally we'd load a JulianDate object with a long list of time points and then calculate them at once, and probably do that a few times, not thousands of times (or more), the way an orbit integrator might do.
Workaround: I can imagine a work-around where I build my own NumPy database by running Skyfield once using a JulianDate object with fine time granularity, then writing my own Runge-Kutta routine which changes step sizes up and down by discrete amounts such that the timesteps always correspond directly to the striding of the NumPy array.
Or I could even try re-interpolating. I am not doing highly precise calculations, so a simple NumPy or SciPy 2nd-order interpolation might be fine.
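A hedged sketch of that re-interpolation idea, using jplephem to fill a fine regular grid once and scipy.interpolate.interp1d for cheap random access afterwards (the ephemeris file, grid span, step size and target pair are placeholders):

import numpy as np
from scipy.interpolate import interp1d
from jplephem.spk import SPK

kernel = SPK.open('de421.bsp')

# Evaluate the ephemeris once on a fine, regular JD grid (here ~0.1 day steps
# over one year; adjust to the accuracy actually needed)
jd_grid = np.arange(2457000.5, 2457365.5, 0.1)
xyz_grid = kernel[0, 3].compute(jd_grid)    # shape (3, npoints), km, SSB -> EMB

# One cubic interpolator over all three coordinates; evaluation afterwards is
# microseconds per call, cheap enough for a variable-step Runge-Kutta RHS
interp = interp1d(jd_grid, xyz_grid, kind='cubic', axis=1)

def emb_position_km(jd):
    return interp(jd)

print(emb_position_km(2457123.456))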
Ultimately I'd like to try integrating the path of objects under the influence of the gravity field of the solar system (e.g. a deep-space satellite, comet, or asteroid). When looking for an orbit solution one might try millions of starting state vectors in 6D phase space. I know I should be using things like the ob.at(jd).observe(large_body).position.km method because gravity travels at the speed of light like everything else. This seems to cost substantial time since (I'm guessing) it's an iterative calculation ("Let's see... where would Jupiter have been such that I feel its gravity right NOW"). But let's peel the cosmic onion one layer at a time.
Figure 1. Skyfield and JPLephem performance on my laptop for different length JulianDate objects, for de405 and de421. They are all about the same - (very) roughly about a half-millisecond for the first point and a microsecond for each additional point. Also the very first point to be calculated when the script runs (Earth (blue) with len(jd) = 1) has an additional millisecond artifact.
Earth and Moon are slower because it is a two-step calculation internally (the Earth-Moon Barycenter plus the individual orbits about the Barycenter). Mercury may be slower because it moves so fast compared to the ephemeris time steps that it requires more coefficients in the (costly) Chebyshev interpolation?
SCRIPT FOR SKYFIELD DATA (the JPLephem script is farther down)
import numpy as np
import matplotlib.pyplot as plt
from skyfield.api import load, JulianDate
import time

ephem = 'de421.bsp'
ephem = 'de405.bsp'
de = load(ephem)

earth = de['earth']
moon = de['moon']
earth_barycenter = de['earth barycenter']
mercury = de['mercury']
jupiter = de['jupiter barycenter']
pluto = de['pluto barycenter']

things = [earth, moon, earth_barycenter, mercury, jupiter, pluto]
names = ['earth', 'moon', 'earth barycenter', 'mercury', 'jupiter', 'pluto']

ntimes = [i*10**n for n in range(5) for i in [1, 2, 5]]
years = [np.zeros(1)] + [np.linspace(0, 100, n) for n in ntimes[1:]]  # 100 years

microsecs = []
for y in years:
    jd = JulianDate(utc=(1900 + y, 1, 1))
    mics = []
    for thing in things:
        tstart = time.clock()
        answer = thing.at(jd).position.km
        mics.append(1E+06 * (time.clock() - tstart))
    microsecs.append(mics)

microsecs = np.array(microsecs).T
many = [len(y) for y in years]

fig = plt.figure()
ax = plt.subplot(111, xlabel='length of JD object',
                 ylabel='microseconds',
                 title='time for thing.at(jd).position.km with ' + ephem)

for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(item.get_fontsize() + 4)  # http://stackoverflow.com/a/14971193/3904031

for name, mics in zip(names, microsecs):
    ax.plot(many, mics, lw=2, label=name)

plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.xscale('log')
plt.yscale('log')
plt.savefig("skyfield speed test " + ephem.split('.')[0])
plt.show()
SCRIPT FOR JPLEPHEM DATA (the Skyfield script is above)
import numpy as np
import matplotlib.pyplot as plt
from jplephem.spk import SPK
import time

ephem = 'de421.bsp'
ephem = 'de405.bsp'
kernel = SPK.open(ephem)

jd_1900_01_01 = 2415020.5004882407

ntimes = [i*10**n for n in range(5) for i in [1, 2, 5]]
years = [np.zeros(1)] + [np.linspace(0, 100, n) for n in ntimes[1:]]  # 100 years

barytup = (0, 3)
earthtup = (3, 399)
# moontup = (3, 301)

microsecs = []
for y in years:
    mics = []
    # for thing in things:
    jd = jd_1900_01_01 + y * 365.25  # roughly, it doesn't matter here
    tstart = time.clock()
    answer = kernel[earthtup].compute(jd) + kernel[barytup].compute(jd)
    mics.append(1E+06 * (time.clock() - tstart))
    microsecs.append(mics)

microsecs = np.array(microsecs)
many = [len(y) for y in years]

fig = plt.figure()
ax = plt.subplot(111, xlabel='length of JD object',
                 ylabel='microseconds',
                 title='time for jplephem [0,3] and [3,399] with ' + ephem)

# from here: http://stackoverflow.com/a/14971193/3904031
for item in ([ax.title, ax.xaxis.label, ax.yaxis.label] +
             ax.get_xticklabels() + ax.get_yticklabels()):
    item.set_fontsize(item.get_fontsize() + 4)

# for name, mics in zip(names, microsecs):
ax.plot(many, microsecs, lw=2, label='earth')

plt.legend(loc='upper left', shadow=False, fontsize='x-large')
plt.xscale('log')
plt.yscale('log')
plt.ylim(1E+02, 1E+06)
plt.savefig("jplephem speed test " + ephem.split('.')[0])
plt.show()

Implicit recommender: tuning hyperparameters in PySpark

The computeMAPK function takes the model, the actual data, and the validation data (user, product) to generate ratings. It then sorts the predicted ratings for every user and takes the top K to compare with the actual data, to calculate Mean Average Precision at K.
I am using this function to tune the hyperparameters, i.e. fit multiple models and select the best lambda, alpha, and rank with the highest MAPK. This works for small data sets, but when the matrix becomes 10 million users * 200 products it breaks, especially at the reduceByKey step and the joins. Is there a better way to tune the hyperparameters for implicit ALS? I am using Spark 1.3.
Actual RDD is of the form (user,product)
Valid RDD is of the form (user,product)
from operator import add  # needed for the reduce(add) below

def apk(act_pred):
    predicted = act_pred[0]
    actual = act_pred[1]
    k = act_pred[2]
    if len(predicted) > k:
        predicted = predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not actual:
        return 1.0
    #return num_hits
    return (score / min(len(actual), k))

def computeMAPKR(model, actual, valid, k):
    pred = model.predictAll(valid).map(lambda x: (x[0], [(x[1], x[2])])).cache()
    gp = pred.reduceByKey(lambda x, y: x + y)
    #gp = pred.groupByKey().map(lambda x: (x[0], list(x[1])))

    # for every user, sort the items by predicted ratings and get user, item pairs
    def f(x):
        s = sorted(x, key=lambda x: x[1], reverse=True)
        sm = map(lambda x: x[0], s)
        return sm
    sp = gp.mapValues(f)

    # actual data
    ac = actual.map(lambda x: (x[0], [(x[1])]))
    #gac = ac.reduceByKey(lambda x,y:(x,y)).map(lambda x: (x[0], list(x[1])))
    gac = ac.reduceByKey(lambda x, y: x + y)

    ap = sp.join(gac)
    apk_result = ap.map(lambda x: (x[0], (x[1][0], x[1][1], k))).mapValues(apk)
    mapk = apk_result.map(lambda x: x[1]).reduce(add) / ap.count()
    #print(apk_result.collect())
    return mapk
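For context, this is roughly how the function is meant to be used for the grid search described above (a hedged sketch; the parameter grids, the train_ratings RDD and the choice of k are placeholders):

from pyspark.mllib.recommendation import ALS

best = None
for rank in [10, 20, 50]:
    for lam in [0.01, 0.1]:
        for alpha in [1.0, 10.0]:
            model = ALS.trainImplicit(train_ratings, rank,
                                      iterations=10, lambda_=lam, alpha=alpha)
            mapk = computeMAPKR(model, actual, valid, 10)
            if best is None or mapk > best[0]:
                best = (mapk, rank, lam, alpha)
print(best)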
