What is the maximum speedup vectorization can give? - parallel-processing

I mostly develop in Python. There I noticed that vectorizing NumPy operations gives a HUGE speedup; sometimes 1000x faster.
I've just heard in the talk Performance: SIMD, Vectorization and Performance Tuning by James Reinders (former Intel Director) that vectorization gives at most a 16x speedup (minute 03:00 - 03:09), but parallelization can give up to a 256x speedup.
Where do those numbers come from? I thought the speedup from parallelization was simply the number of threads, hence 8x on my Intel i7-6700HQ?
Python vectorization example
This is one example where I see a huge difference:
import timeit
import numpy as np

def print_durations(durations):
    print('min: {min:5.1f}ms, mean: {mean:5.1f}ms, max: {max:6.1f}ms (total: {len})'
          .format(min=min(durations) * 10**3,
                  mean=np.mean(durations) * 10**3,
                  max=max(durations) * 10**3,
                  len=len(durations)))

def test_speed(nb_items=1000):
    print('## nb_items={}'.format(nb_items))
    durations = timeit.repeat('cosine_similarity(mat)',
                              setup='from sklearn.metrics.pairwise import cosine_similarity;import numpy as np;mat = np.random.random(({}, 50))'.format(nb_items),
                              repeat=10, number=1)
    print_durations(durations)
    durations = timeit.repeat('for i, j in combinations(range({}), 2): cosine_similarity([mat[i], mat[j]])'.format(nb_items),
                              setup='from itertools import combinations;from sklearn.metrics.pairwise import cosine_similarity;import numpy as np;mat = np.random.random(({}, 50))'.format(nb_items),
                              repeat=10, number=1)
    print_durations(durations)

print('First vectorized, second with loops')
test_speed(nb_items=100)
test_speed(nb_items=200)
test_speed(nb_items=300)
test_speed(nb_items=400)
test_speed(nb_items=500)
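A stripped-down comparison (a rough sketch with plain NumPy only, not scikit-learn; exact numbers will of course depend on the machine) shows the same kind of gap between a Python-level loop and a single vectorized call:
import timeit
import numpy as np

vec = np.random.random(10**6)

def python_loop_sum(values):
    # plain Python loop: one interpreted iteration per element
    total = 0.0
    for v in values:
        total += v
    return total

def vectorized_sum(values):
    # single NumPy call: the loop runs in compiled (and possibly SIMD) code
    return np.sum(values)

print('loop      : {:.3f}s'.format(timeit.timeit(lambda: python_loop_sum(vec), number=10)))
print('vectorized: {:.3f}s'.format(timeit.timeit(lambda: vectorized_sum(vec), number=10)))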

Related

Hyperparameter tuning in KNN

I am trying to find the best k for my model. I tried GridSearchCV and ended up with k=1,
but k=1 is usually not the best k, as it won't perform well on test data.
Find the code below.
from sklearn import neighbors
from sklearn import metrics
from sklearn import model_selection
import matplotlib.pyplot as plt

knn_parameters = {'n_neighbors': [1, 3, 5, 7, 11],
                  'weights': ['uniform', 'distance'],
                  'metric': ['euclidean', 'manhattan']}
knn = neighbors.KNeighborsClassifier()
features = df2.iloc[:, :-1]
labels = df2.target

# Train - Test Split
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(
    features, labels, test_size=0.3, random_state=42)

# kNN
knn_best_k = model_selection.GridSearchCV(knn, knn_parameters)
knn_best_k.fit(train_features, train_labels)
print("The best classifier for kNN is:", knn_best_k.best_estimator_)
print("kNN accuracy is:", knn_best_k.best_score_)
print("kNN parameters are:", knn_best_k.best_params_)
I was expecting an optimal k, but found k = 1.
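One way to see why k=1 wins, and how close the other settings are, is to look at the full cross-validation results instead of only best_params_ (a sketch, assuming the same df2, knn_parameters and fitted knn_best_k as above):
import pandas as pd

# Sketch: inspect the mean CV score for every k tried in the grid search
cv_results = pd.DataFrame(knn_best_k.cv_results_)
score_per_k = (cv_results
               .groupby(cv_results['param_n_neighbors'].astype(int))['mean_test_score']
               .max())
print(score_per_k)  # shows whether k=1 really stands out or is a near-tie

plt.plot(score_per_k.index, score_per_k.values, marker='o')
plt.xlabel('n_neighbors (k)')
plt.ylabel('mean CV accuracy')
plt.show()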

Python Gekko - How to use built-in maximum function with sequential solvers?

Solving models sequentially in Python GEKKO (i.e. with IMODE >= 4) fails when using the max2 and max3 functions that come with GEKKO.
This is for use cases where np.maximum or the standard max function treats a GEKKO parameter like an array, which is not always the intended usage or can create errors, for example when comparing against integers.
minimal code example:
from gekko import GEKKO
import numpy as np
m = GEKKO()
m.time = np.arange(0,20)
y = m.Var(value=5)
forcing = m.Param(value=np.arange(-5,15))
m.Equation(y.dt()== m.max2(forcing,0) * y)
m.options.IMODE=4
m.solve(disp=False)
returns:
Exception: #error: Degrees of Freedom
* Error: DOF must be zero for this mode
STOPPING...
I know from looking at the code that both max2 and max3 use inequality expressions in the equations, which understandably introduces the degrees of freedom. So was this functionality never intended? Could there be some workaround to fix this?
Any help would be much appreciated!
Note:
I hope this is not a duplicate of How to define maximum of Intermediate and another value in Python Gekko, when using sequential solver?, but is instead asking a more concise and different question about essentially the same issue.
You can get a successful solution by switching to IMODE=6. IMODE=4 (simultaneous simulation) and IMODE=7 (sequential simulation) require zero degrees of freedom. Both m.max2() and m.max3() require degrees of freedom and an optimizer to solve.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
m.time = np.arange(0,20)
y = m.Var(value=5)
forcing = m.Param(value=np.arange(-5,15))
m.Equation(y.dt()== -m.max2(forcing,0) * y)
m.options.IMODE=6
m.solve(disp=True)
The original equation y.dt() == m.max2(forcing,0) * y increases exponentially beyond machine precision, so I switched the sign to give an equation that can be solved.
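If the forcing term is known data (as in this example) rather than something the solver has to decide, another possible workaround (a sketch, assuming the forcing values are fixed up front) is to clip the array with NumPy before passing it to m.Param, so no max2/max3 equation is needed and IMODE=4 keeps zero degrees of freedom:
from gekko import GEKKO
import numpy as np

m = GEKKO(remote=False)
m.time = np.arange(0, 20)
y = m.Var(value=5)
# clip the known forcing data in NumPy instead of calling m.max2/m.max3,
# so the simulation keeps zero degrees of freedom
forcing = m.Param(value=np.maximum(np.arange(-5, 15), 0))
m.Equation(y.dt() == -forcing * y)
m.options.IMODE = 4
m.solve(disp=False)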

A very quick method to approximate np.random.dirichlet with large dimension

I'd like to evaluate np.random.dirichlet with large dimension as quickly as possible. More precisely, I'd like a function that approximates the one below and is at least 10 times faster. Empirically, I observed that small-dimension versions of this function output one or two entries on the order of 0.1, while every other entry is so small as to be immaterial. But this observation isn't based on any rigorous assessment. The approximation doesn't need to be very accurate, but I want something not too crude, as I'm using this noise for MCTS.
def g():
    np.random.dirichlet([0.03]*4840)

>>> timeit.timeit(g, number=1000)
0.35117408499991143
Assuming your alpha is fixed over components and used for many iterations, you could tabulate the ppf of the corresponding gamma distribution. This is probably available as scipy.stats.gamma.ppf, but we can also use scipy.special.gammaincinv. This function seems rather slow, so the tabulation is a significant upfront investment.
Here is a crude implementation of the general idea:
import numpy as np
from scipy import special

class symm_dirichlet:
    def __init__(self, alpha, resolution=2**16):
        self.alpha = alpha
        self.resolution = resolution
        self.range, delta = np.linspace(0, 1, resolution,
                                        endpoint=False, retstep=True)
        self.range += delta / 2
        self.table = special.gammaincinv(self.alpha, self.range)

    def draw(self, n_sampl, n_comp, interp='nearest'):
        if interp != 'nearest':
            raise NotImplementedError
        gamma = self.table[np.random.randint(0, self.resolution,
                                             (n_sampl, n_comp))]
        return gamma / gamma.sum(axis=1, keepdims=True)

import time, timeit

t0 = time.perf_counter()
X = symm_dirichlet(0.03)
t1 = time.perf_counter()
print(f'Upfront cost {t1-t0:.3f} sec')
print('Running cost per 1000 samples of width 4840')
print('tabulated {:3f} sec'.format(timeit.timeit(
    'X.draw(1, 4840)', number=1000, globals=globals())))
print('np.random.dirichlet {:3f} sec'.format(timeit.timeit(
    'np.random.dirichlet([0.03]*4840)', number=1000, globals=globals())))
Sample output:
Upfront cost 13.067 sec
Running cost per 1000 samples of width 4840
tabulated 0.059365 sec
np.random.dirichlet 0.980067 sec
Better check whether it is roughly correct.
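One quick way to do that (a sketch, not part of the original benchmark) is to compare a few order statistics of samples from both generators at a moderate dimension:
import numpy as np

# Sketch: compare the largest entries produced by the tabulated approximation
# and by np.random.dirichlet; with alpha=0.03 most of the mass should sit in
# one or two components in both cases.
d = symm_dirichlet(0.03)
approx = d.draw(1000, 100)
exact = np.random.dirichlet([0.03] * 100, size=1000)

for name, s in (('tabulated', approx), ('dirichlet', exact)):
    top = np.sort(s, axis=1)
    print('{:>10}: mean largest entry {:.3f}, mean 2nd largest {:.3f}'.format(
        name, top[:, -1].mean(), top[:, -2].mean()))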

Titan XP vs Quadro P400 GPU in Pytorch

I gave the two GPUs on my machine a try and I expected the Titan XP to be faster than the Quadro P400. However, both gave almost the same execution time.
I need to know whether PyTorch will dynamically choose one GPU over the other, or whether I have to specify which one PyTorch will use at run time.
Here is the code snippet used in the test:
import torch
import time
def do_something(gpu_device):
    torch.cuda.set_device(gpu_device)  # torch.cuda.set_device(device_num)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    a = torch.randn(100000000).cuda()
    xx = time.time() - strt
    print("execution time, to create 1E8 random numbers, is ", xx)
    # print(a)
    # print(a + 2)

no_of_GPUs = torch.cuda.device_count()
print("how many GPUs are there:", no_of_GPUs)

for i in range(0, no_of_GPUs):
    print(i, "th GPU is", torch.cuda.get_device_name(i))
    do_something(i)
Sample output:
how many GPUs are there: 2
0 th GPU is TITAN Xp COLLECTORS EDITION
current GPU device 0
execution time, to create 1E8 random numbers, is 5.527713775634766
1 th GPU is Quadro P400
current GPU device 1
execution time, to create 1E8 random numbers, is 5.511776685714722
Despite what you might believe, the lack of a performance difference you are seeing is because the random number generation is being run on your host CPU, not the GPU. If I modify your do_something routine like this:
def do_something(gpu_device, ongpu=False, N=100000000):
    torch.cuda.set_device(gpu_device)
    print("current GPU device ", torch.cuda.current_device())
    strt = time.time()
    if ongpu:
        a = torch.cuda.FloatTensor(N).normal_()
    else:
        a = torch.randn(N).cuda()
    print("execution time, to create 1E8 random no, is ", time.time() - strt)
    return a
and run it two ways, I get very different execution times:
In [4]: do_something(0)
current GPU device 0
execution time, to create 1E8 random no, is 7.736972808837891
Out[4]:
-9.3955e-01
-1.9721e-01
-1.1502e+00
......
-1.2428e+00
3.1547e-01
-2.1870e+00
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
In [5]: do_something(0,True)
current GPU device 0
execution time, to create 1E8 random no, is 0.001735687255859375
Out[5]:
4.1403e+06
5.7016e+06
1.2710e+07
......
8.9790e+06
1.3779e+07
8.0731e+06
[torch.cuda.FloatTensor of size 100000000 (GPU 0)]
i.e. your version takes 7 seconds and mine takes 1.7ms. I think it is obvious which one ran on the GPU....
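One caveat (not part of the original answer): CUDA kernel launches are asynchronous, so the sub-millisecond figure mostly measures the launch rather than the work itself. For a stricter comparison one would usually synchronize before reading the clock, roughly like this:
import time
import torch

def timed_gpu_randn(gpu_device, N=100000000):
    # sketch: bracket the work with torch.cuda.synchronize() so the measured
    # interval includes the GPU computation, not just the asynchronous launch
    torch.cuda.set_device(gpu_device)
    torch.cuda.synchronize()
    strt = time.time()
    a = torch.cuda.FloatTensor(N).normal_()
    torch.cuda.synchronize()
    print("execution time, including the GPU work, is", time.time() - strt)
    return a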

Spark scalability

I currently use one master (my local machine) and two workers (2 x 32 cores, 2 x 61.9 GB of memory) for the standard ALS algorithm of Spark, and wrote the following code for the time evaluation:
import numpy as np
from scipy.sparse.linalg import spsolve
import random
import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
import hashlib
#Spark configuration settings
conf = SparkConf().setAppName("Temp").setMaster("spark://<myip>:7077").set("spark.cores.max","64").set("spark.executor.memory", "61g")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
#first time
t1 = time.time()
#load the DataFrame and transform it into RDD<Rating>
rddob = sqlContext.read.json("file.json").rdd
rdd1 = rddob.map(lambda line:(line.ColOne, line.ColTwo))
rdd2 = rdd1.map(lambda line: (line, 1))
rdd3 = rdd2.reduceByKey(lambda a,b: a+b)
ratings = rdd3.map(lambda (line, rating): Rating(int(hash(line[0]) % (10 ** 8)), int(line[1]), float(rating)))
ratings.cache()
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 5
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))
#second time
t2 = time.time()
#print results
print "Time of ALS",t2-t1
In this code I hold all parameters constant except set("spark.cores.max","x"), for which I use the following values of x: 1, 2, 4, 8, 16, 32, 64. I got these timings:
#cores time [s]
1 20722
2 11803
4 5596
8 3131
16 2125
32 2000
64 2051
The results of the evaluation are a little bit strange to me. I see good, roughly linear scalability for small numbers of cores, but in the range of 16, 32 and 64 cores I no longer see any scalability or improvement in runtime. How is this possible? My input file is approximately 70 GB and has 200,000,000 lines.
Linear scalability in a distributed system like Spark is only in small part a result of increasing the number of cores. The most important part is the opportunity to distribute disk / network IO. If you have a constant number of workers and don't scale storage at the same time, you'll quickly get to the point where throughput is limited by IO.
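As a rough illustration (a sketch using only the timings from the question), fitting a simple Amdahl-style model t(n) = t_serial + t_parallel / n to the 1-core and 64-core runs shows how a fixed non-scaling portion, such as IO, caps the achievable speedup:
# Sketch: estimate the non-scaling (serial/IO) part from the measured timings
measured = {1: 20722, 2: 11803, 4: 5596, 8: 3131, 16: 2125, 32: 2000, 64: 2051}

t1, t64 = measured[1], measured[64]
t_parallel = (t1 - t64) / (1.0 - 1.0 / 64)  # part that scales with cores
t_serial = t1 - t_parallel                  # part that does not scale (IO etc.)

for n, t in sorted(measured.items()):
    predicted = t_serial + t_parallel / n
    print('{:2d} cores: measured {:6d} s, model {:7.0f} s'.format(n, t, predicted))

print('estimated non-scaling portion: {:.0f} s'.format(t_serial))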
