I currently use one master (local machine) and two workers (2×32 cores, 2×61.9 GB of memory) for the standard ALS algorithm of Spark, and I wrote the following code for the time evaluation:
import numpy as np
from scipy.sparse.linalg import spsolve
import random
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
import hashlib
#Spark configuration settings
conf = (SparkConf()
        .setAppName("Temp")
        .setMaster("spark://<myip>:7077")
        .set("spark.cores.max", "64")
        .set("spark.executor.memory", "61g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
#first time
t1 = time.time()
#load the DataFrame and transform it into RDD<Rating>
rddob = sqlContext.read.json("file.json").rdd
rdd1 = rddob.map(lambda line:(line.ColOne, line.ColTwo))
rdd2 = rdd1.map(lambda line: (line, 1))
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
ratings = rdd3.map(lambda (line, rating): Rating(int(hash(line[0]) % (10 ** 8)), int(line[1]), float(rating)))
ratings.cache()
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 5
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
#second time
t2 = time.time()
#print results
print "Time of ALS",t2-t1
In this code I hold all parameters constant except set("spark.cores.max", "x"), for which I use the following values of x: 1, 2, 4, 8, 16, 32, 64. I got the following time measurements:
#cores    time [s]
     1      20722
     2      11803
     4       5596
     8       3131
    16       2125
    32       2000
    64       2051
These evaluation results look a bit strange to me. I see good, nearly linear scalability for a small number of cores, but in the range of 16, 32 and 64 cores I no longer see any scalability or improvement in run time. How is that possible? My input file is approximately 70 GB and has 200,000,000 lines.
Linear scalability in a distributed system like Spark is only to a small extent the result of increasing the number of cores. The most important part is the opportunity to distribute disk / network IO. If you have a constant number of workers and don't scale storage at the same time, you'll quickly reach the point where throughput is limited by IO.
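To make the plateau in the table explicit, the timings can be converted into speedup and parallel-efficiency figures; a minimal sketch (using only the numbers reported above):

import numpy as np

# measured (cores, seconds) pairs from the question
cores = np.array([1, 2, 4, 8, 16, 32, 64])
times = np.array([20722, 11803, 5596, 3131, 2125, 2000, 2051])

speedup = times[0] / times        # speed-up relative to the 1-core run
efficiency = speedup / cores      # fraction of the ideal (linear) speed-up realised
for c, s, e in zip(cores, speedup, efficiency):
    print("%2d cores: speedup %5.1fx, parallel efficiency %3.0f%%" % (c, s, 100 * e))

Once the efficiency drops well below 100 % as cores are added, the job is no longer compute-bound; with a fixed number of workers and fixed storage, that is exactly the IO limit described above.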
Let's consider a very large numpy array a of shape (M, N),
where M is typically 1 or 100 and N is 10 to 100,000,000.
We have an array of indices that splits it into many (K = 1,000,000) sub-arrays along axis=1.
We want to efficiently perform an operation like integration along axis=1 (np.sum, in the simplest case) on each sub-array and return an (M, K) array.
An elegant and efficient solution was proposed by @Divakar in question 41920367 ("how to split numpy array and perform certain actions on split arrays [Python]"), but my understanding is that it only applies to cases where all sub-arrays have the same shape, which allows for reshaping.
But in our case the sub-arrays don't have the same shape, which so far has forced me to loop over the indices... please take me out of my misery...
Example
a = np.random.random((10, 100000000))
ind = np.sort(np.random.randint(10, 9000000, 1000000))
The sizes of the sub-arrays are not homogeneous:
sizes = np.diff(ind)
print(sizes.min(), sizes.max())
2, 8732
So far, the best I found is:
output = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
Possible feature request for numpy and scipy:
If looping is really unavoidable, at least having it done in C inside the numpy and scipy.integrate.simps (or romb) functions would probably speed up the computation.
Something like
output = np.sum(a, axis=1, split_ind=ind)
output = scipy.integrate.simps(a, x=x, axis=1, split_ind=ind)
output = scipy.integrate.romb(a, x=x, axis=1, split_ind=ind)
would be very welcome!
(where x itself could be splittable, or not)
Side note:
While trying this example, I noticed that with these numbers there was almost always an element of sizes equal to 0 (sizes.min() is almost always zero).
This looks peculiar to me: since we are picking 10,000 integers between 10 and 9,000,000, I would expect the odds that the same number comes up twice (so that diff = 0) to be close to 0, yet it seems to be very close to 1.
Would that be due to the algorithm behind np.random.randint?
What you want is np.add.reduceat
output = np.add.reduceat(a, ind, axis = 1)
output.shape
Out[]: (10, 1000000)
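For reference, np.add.reduceat sums the slices between consecutive indices (a[:, ind[j]:ind[j+1]], plus a final slice running from ind[-1] to the end), so if you also want the segment before the first index you prepend a 0. A tiny example:

import numpy as np

x = np.array([[1, 2, 3, 4, 5, 6]])
idx = np.array([0, 2, 5])                 # segments [0:2], [2:5] and [5:]
print(np.add.reduceat(x, idx, axis=1))    # [[ 3 12  6]]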
Universal Functions (ufunc) are a very powerful tool in numpy
As for the repeated indices, that's simply the Birthday Problem cropping up.
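To make that concrete, the standard birthday-problem approximation for k draws from n possible values gives a collision probability of roughly 1 - exp(-k(k-1)/(2n)); a quick check with the numbers from the example:

import numpy as np

n = 9_000_000 - 10                 # size of the value range in the example
for k in (10_000, 1_000_000):      # number of draws
    p = 1 - np.exp(-k * (k - 1) / (2 * n))
    print("k = %9d: P(at least one repeat) ~ %.6f" % (k, p))

Even for 10,000 draws the probability of a repeat is already above 99 %, so a zero in sizes is expected rather than peculiar.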
Great !
Thanks! On my CentOS 6.9 VM I get the following results:
In [71]: a = np.random.random((10, 10000000))
In [72]: ind = np.unique(np.random.randint(10, 9000000, 100000))
In [73]: ind2 = np.append([0], ind)
In [74]: out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
In [75]: out2 = np.add.reduceat(a, ind2, axis=1)
In [83]: np.allclose(out, out2)
Out[83]: True
In [84]: %timeit out = np.concatenate([np.sum(vv, axis=1)[:, None] for vv in np.split(a, ind, axis=1)], axis=1)
2.7 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [85]: %timeit out2 = np.add.reduceat(a, ind2, axis=1)
179 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That's a good 93 % reduction in time (a factor of ~15 speed-up) over the list concatenation :-)
I'd like to evaluate np.random.dirichlet with large dimension as quickly as possible. More precisely, I'd like a function that approximates the one below and is at least 10 times faster. Empirically, I observed that the small-dimension version of this function outputs one or two entries on the order of 0.1, while all other entries are so small that they are immaterial, but this observation isn't based on any rigorous assessment. The approximation doesn't need to be very accurate, but I want something not too crude, as I'm using this noise for MCTS.
def g():
    np.random.dirichlet([0.03]*4840)
>>> timeit.timeit(g,number=1000)
0.35117408499991143
Assuming your alpha is fixed over components and used for many iterations, you could tabulate the ppf (inverse CDF) of the corresponding gamma distribution. This is available as scipy.stats.gamma.ppf, but we can also use scipy.special.gammaincinv. This function seems rather slow, so tabulating is a significant upfront investment.
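For context (this is background, not part of the original answer): the tabulation works because of the standard Gamma-to-Dirichlet construction, i.e. independent Gamma(alpha, 1) draws normalised by their sum are jointly Dirichlet(alpha, ..., alpha) distributed, and the class below simply replaces the Gamma sampling with an inverse-CDF table lookup. A minimal illustration of the construction:

import numpy as np

# if g_i ~ Gamma(alpha, 1) independently, then g / g.sum() ~ Dirichlet(alpha, ..., alpha)
alpha, k = 0.03, 4840
g = np.random.gamma(alpha, size=k)
x = g / g.sum()                    # one symmetric Dirichlet(alpha) sample
print(x.shape, np.isclose(x.sum(), 1.0))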
Here is a crude implementation of the general idea:
import numpy as np
from scipy import special
class symm_dirichlet:
    def __init__(self, alpha, resolution=2**16):
        self.alpha = alpha
        self.resolution = resolution
        self.range, delta = np.linspace(0, 1, resolution,
                                        endpoint=False, retstep=True)
        self.range += delta / 2
        self.table = special.gammaincinv(self.alpha, self.range)

    def draw(self, n_sampl, n_comp, interp='nearest'):
        if interp != 'nearest':
            raise NotImplementedError
        gamma = self.table[np.random.randint(0, self.resolution,
                                             (n_sampl, n_comp))]
        return gamma / gamma.sum(axis=1, keepdims=True)
import time, timeit
t0 = time.perf_counter()
X = symm_dirichlet(0.03)
t1 = time.perf_counter()
print(f'Upfront cost {t1-t0:.3f} sec')
print('Running cost per 1000 samples of width 4840')
print('tabulated {:3f} sec'.format(timeit.timeit(
    'X.draw(1, 4840)', number=1000, globals=globals())))
print('np.random.dirichlet {:3f} sec'.format(timeit.timeit(
    'np.random.dirichlet([0.03]*4840)', number=1000, globals=globals())))
Sample output:
Upfront cost 13.067 sec
Running cost per 1000 samples of width 4840
tabulated 0.059365 sec
np.random.dirichlet 0.980067 sec
Better check whether it is roughly correct:
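For instance (a sketch of such a check, not the original comparison), one can compare simple summary statistics of the tabulated sampler against np.random.dirichlet:

import numpy as np

X = symm_dirichlet(0.03)                           # the class (and its table) from above
approx = X.draw(1000, 4840)                        # tabulated samples
exact = np.random.dirichlet([0.03] * 4840, 1000)   # reference samples

# rows sum to 1 by construction, so compare the spread and the typical largest entry
print("std of entries    :", approx.std(), "vs", exact.std())
print("mean largest entry:", approx.max(axis=1).mean(), "vs", exact.max(axis=1).mean())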
I gave the two GPUs on my machine a try, and I expected the Titan Xp to be faster than the Quadro P400. However, both gave almost the same execution time.
I need to know whether PyTorch will dynamically choose one GPU over the other, or whether I have to specify myself, at run time, which one PyTorch should use.
Here is the code snippet used in the test:
import torch
import time
def do_something(gpu_device):
    torch.cuda.set_device(gpu_device)  # torch.cuda.set_device(device_num)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    a = torch.randn(100000000).cuda()
    xx = time.time() - strt
    print("execution time, to create 1E8 random numbers, is ", xx)
    # print(a)
    # print(a + 2)

no_of_GPUs = torch.cuda.device_count()
print("how many GPUs are there:", no_of_GPUs)

for i in range(0, no_of_GPUs):
    print(i, "th GPU is", torch.cuda.get_device_name(i))
    do_something(i)
Sample output:
how many GPUs are there: 2
0 th GPU is TITAN Xp COLLECTORS EDITION
current GPU device 0
execution time, to create 1E8 random numbers, is 5.527713775634766
1 th GPU is Quadro P400
current GPU device 1
execution time, to create 1E8 random numbers, is 5.511776685714722
Despite what you might believe, the lack of a performance difference you are seeing is because the random number generation is run on your host CPU, not on the GPU. If I modify your do_something routine like this:
def do_something(gpu_device, ongpu=False, N=100000000):
    torch.cuda.set_device(gpu_device)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    if ongpu:
        a = torch.cuda.FloatTensor(N).normal_()
    else:
        a = torch.randn(N).cuda()
    print("execution time, to create 1E8 random no, is ", time.time() - strt)
    return a
and run it two ways, I get very different execution times:
In [4]: do_something(0)
current GPU device 0
execution time, to create 1E8 random no, is 7.736972808837891
Out[4]:
-9.3955e-01
-1.9721e-01
-1.1502e+00
......
-1.2428e+00
3.1547e-01
-2.1870e+00
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
In [5]: do_something(0,True)
current GPU device 0
execution time, to create 1E8 random no, is 0.001735687255859375
Out[5]:
4.1403e+06
5.7016e+06
1.2710e+07
......
8.9790e+06
1.3779e+07
8.0731e+06
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
i.e. your version takes about 7 seconds and mine takes about 1.7 ms. I think it is obvious which one ran on the GPU...
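As a side note on the question of choosing a GPU (an aside, not part of the original answer): assuming a reasonably recent PyTorch (0.4+), you can create the tensor directly on the selected device with the device= keyword, and it is safer to call torch.cuda.synchronize() before stopping the clock, since CUDA kernels are launched asynchronously and can otherwise be under-measured. A sketch:

import time
import torch

def time_randn_on(device_id, N=100_000_000):
    torch.cuda.set_device(device_id)       # make this GPU the current device
    torch.cuda.synchronize()               # make sure earlier work has finished
    strt = time.time()
    a = torch.randn(N, device='cuda')      # generate directly on the current GPU
    torch.cuda.synchronize()               # wait for the kernel before stopping the clock
    print("GPU", device_id, "took", time.time() - strt, "s")
    return a

time_randn_on(0)   # Titan Xp in the setup above
time_randn_on(1)   # Quadro P400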
I am running a Pyspark MLLib job on EMR.
The RDD has 98000 rows.
When I execute KMeans on it, it takes hours and still shows 0%.
I tried enabling maximizeResourceAllocation and increasing the memory of the executors and the driver, but it is still the same.
How can I speed it up?
Following is the code I am executing:
from numpy import array
from math import sqrt
import time
from pyspark.mllib.clustering import KMeans, KMeansModel
start=time.time()
clusters = KMeans.train(parsedata, 88000, maxIterations=5, initializationMode="random")

def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))
WSSSE = parsedata.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
print time.time()-start
Any help or suggestions greatly appreciated.
I'd like to perform inference on a simple Ising model with pymc3:
# W (coupling matrix), f (field vector), N and t0 are defined elsewhere
with pm.Model() as model:
    mu = pm.Uniform('mu', lower=0, upper=1, shape=(N, 1))
    energy = mu.T * W * mu + f.T * mu
    logp = pm.Potential('logp', energy)
    start = model.test_point
    step = pm.NUTS(vars=[mu])
print 'creating NUTS took', time.time() - t0
However, the last pm.NUTS step takes 2 minutes on average to complete, and uses ~1 gigabyte of memory as well. This is for N=15, so a pretty small model. Any tips on speeding this up? It's already using very basic operations for which the second-order info should be easy to compute.