Hyperparameter tuning in KNN

I am trying to find the best k for my model. I tried GridSearchCV and ended up with k=1,
but k=1 is usually not the best choice of k, since it tends not to perform well on test data.
Find the code below.
from sklearn import neighbors
from sklearn import metrics
from sklearn import model_selection
import matplotlib.pyplot as plt

knn_parameters = {'n_neighbors': [1, 3, 5, 7, 11],
                  'weights': ['uniform', 'distance'],
                  'metric': ['euclidean', 'manhattan']}
knn = neighbors.KNeighborsClassifier()
features = df2.iloc[:, :-1]
labels = df2.target

# Train - test split
train_features, test_features, train_labels, test_labels = model_selection.train_test_split(
    features, labels, test_size=0.3, random_state=42)

# kNN grid search
knn_best_k = model_selection.GridSearchCV(knn, knn_parameters)
knn_best_k.fit(train_features, train_labels)
print("The best classifier for kNN is:", knn_best_k.best_estimator_)
print("kNN accuracy is:", knn_best_k.best_score_)
print("kNN parameters are:", knn_best_k.best_params_)
I was expecting an optimal k but got k = 1.
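To see whether k=1 merely wins on the cross-validation folds or actually generalizes, one option is to inspect cv_results_ and score the refit model on the held-out test split. A minimal sketch, reusing the names from the code above (the pandas import is my addition):

# Sketch: inspect CV scores per parameter setting and confirm the
# chosen model on the held-out test set.
import pandas as pd

cv_results = pd.DataFrame(knn_best_k.cv_results_)
print(cv_results[['param_n_neighbors', 'param_weights',
                  'param_metric', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))

# best_estimator_ is already refit on the full training split,
# so it can be scored directly on the test split.
test_accuracy = knn_best_k.score(test_features, test_labels)
print("Held-out test accuracy:", test_accuracy)

If k=1 scores well in cross-validation but clearly worse on the test split, that gap is the overfitting you are worried about.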

Related

Unexpected results from sum using gekko variable

I am optimizing a simple problem in which I sum intermediate variables for a constraint that requires the sum to stay below a certain budget.
When I print the sum, using either sum or np.sum, I get the following result:
(((((((((((((((((((((((((((((i429+i430)+i431)+i432)+i433)+i434)+i435)+i436)+i437)+i438)+i439)+i440)+i441)+i442)+i443)+i444)+i445)+i446)+i447)+i448)+i449)+i450)+i451)+i452)+i453)+i454)+i455)+i456)+i457)+i458)
Here is the command to create the variables and the sum.
x = m.Array(m.Var, len(bounds), integer=True)
sums = [m.Intermediate(objective_inverse2(x, y)) for x, y in zip(x, reg_feats)]
My understanding of an intermediate variable is that it is dynamically calculated from the values of x, which are the decision variables.
Here is the summing function for the max budget constraint.
m.Equation(np.sum(sums) < max_budget)
Solving the problem returns an error saying there is no feasible solution, even though trivial solutions exist. Furthermore, removing this constraint returns a solution that naturally does not violate the max budget constraint.
What am I misunderstanding about intermediate variables and how to sum them?
It is difficult to diagnose the problem without a complete, minimal example. Here is an attempt to recreate the problem:
from gekko import GEKKO
import numpy as np

m = GEKKO()
nb = 5
x = m.Array(m.Var, nb, value=1, lb=0, ub=1, integer=True)
y = m.Array(m.Var, nb, lb=0)
i = []  # list of intermediate variables
for xi, yi in zip(x, y):
    i.append(m.Intermediate(xi * yi))
m.Maximize(m.sum(i))
m.Equation(m.sum(i) <= 100)
m.options.SOLVER = 1
m.solve()
print(x)
print(y)
Instead of creating a list of Intermediates, the summation can also happen with the result of the list comprehension. This way, only one Intermediate value is created.
from gekko import GEKKO
import numpy as np

m = GEKKO()
nb = 5
x = m.Array(m.Var, nb, value=1, lb=0, ub=1, integer=True)
y = m.Array(m.Var, nb, lb=0)
sums = m.Intermediate(m.sum([xi * yi for xi, yi in zip(x, y)]))
m.Maximize(sums)
m.Equation(sums <= 100)
m.options.SOLVER = 1
m.solve()
print(sums.value)
print(x)
print(y)
In both cases, the optimal solution is:
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 1.560000001336448E-002 sec
Objective : -100.000000000000
Successful solution
---------------------------------------------------
[100.0]
[[1.0] [1.0] [1.0] [1.0] [1.0]]
[[20.0] [20.0] [20.0] [20.0] [20.0]]
Try using the Gekko m.sum() function to improve solution efficiency, especially for large problems.
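Applied back to the question's constraint, the fix might look like this (a sketch; sums and max_budget come from the question and are assumed to be defined; Gekko treats inequalities as non-strict, hence the <=):

# Sketch: Gekko's own summation instead of np.sum for the budget bound
m.Equation(m.sum(sums) <= max_budget)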

Python Gekko - How to use built-in maximum function with sequential solvers?

Solving models sequentially in Python GEKKO (i.e., with IMODE >= 4) fails when using the max2 and max3 functions that come with GEKKO.
This matters for use cases where np.maximum or the built-in max treat a GEKKO parameter like an array, which is not always the intended usage and can create errors when comparing against integers, for example.
Minimal code example:
from gekko import GEKKO
import numpy as np

m = GEKKO()
m.time = np.arange(0, 20)
y = m.Var(value=5)
forcing = m.Param(value=np.arange(-5, 15))
m.Equation(y.dt() == m.max2(forcing, 0) * y)
m.options.IMODE = 4
m.solve(disp=False)
returns:
Exception: #error: Degrees of Freedom
* Error: DOF must be zero for this mode
STOPPING...
I know from looking at the code that both max2 and max3 use inequality expressions in their equations, which understandably introduces degrees of freedom, so was this functionality never intended? Could there be some workaround to fix this?
Any help would be much appreciated!
Note:
I hope this is not a duplicate of How to define maximum of Intermediate and another value in Python Gekko, when using sequential solver?, but rather a more concise and different question about essentially the same issue.
You can get a successful solution by switching to IMODE=6. IMODE=4 (simultaneous simulation) and IMODE=7 (sequential simulation) require zero degrees of freedom, while both m.max2() and m.max3() require degrees of freedom and an optimizer to solve.
from gekko import GEKKO
import numpy as np

m = GEKKO(remote=False)
m.time = np.arange(0, 20)
y = m.Var(value=5)
forcing = m.Param(value=np.arange(-5, 15))
m.Equation(y.dt() == -m.max2(forcing, 0) * y)
m.options.IMODE = 6
m.solve(disp=True)
The original equation y.dt() == m.max2(forcing, 0) * y increases exponentially beyond machine precision, so I flipped the sign to give an equation that can solve.
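As a side note, when the forcing trajectory is known ahead of time (it is a Param here), one workaround for the zero-degrees-of-freedom modes is to apply the maximum in numpy before handing the values to Gekko, so no max2/max3 slack variables are introduced. A sketch under that assumption:

from gekko import GEKKO
import numpy as np

m = GEKKO(remote=False)
m.time = np.arange(0, 20)
y = m.Var(value=5)
# Clip the known forcing values with numpy before building the model,
# so the simulation keeps zero degrees of freedom.
forcing = m.Param(value=np.maximum(np.arange(-5, 15), 0))
m.Equation(y.dt() == -forcing * y)
m.options.IMODE = 4  # simultaneous simulation now works
m.solve(disp=False)

This only helps when the argument to the maximum is fixed data; if it depends on a Var, the max2/max3 route (and an optimizer mode such as IMODE=6) is still needed.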

What is the maximum speedup vectorization can give?

I mostly develop in Python. There I noticed that vectorizing numpy operations gives a HUGE speedup; sometimes 1000x faster.
I've just heard in Performance: SIMD, Vectorization and Performance Tuning by James Reinders (former Intel Director) that vectorization gives at most a 16x speedup (minute 03:00 - 03:09), but parallelization can give up to a 256x speedup.
Where do those numbers come from? I thought the speedup from parallelization was the number of threads, hence 8x on my Intel i7-6700HQ?
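A plausible reading of those figures (an assumption on my part; the talk is not quoted in detail here) is that 16x is the SIMD lane count of a 512-bit vector register on single-precision floats, and 256x is that multiplied by the core count of a large Xeon:

# Sketch of the arithmetic behind the 16x / 256x figures; the register
# width and core count are assumptions, not quoted from the talk.
vector_register_bits = 512                         # e.g. AVX-512
float32_bits = 32
simd_lanes = vector_register_bits // float32_bits  # 16 lanes -> ~16x
cores = 16                                         # e.g. a 16-core Xeon
print(simd_lanes, simd_lanes * cores)              # 16, 256

The much larger speedups seen with numpy measure more than SIMD: a vectorized numpy call also replaces per-element Python interpreter and object overhead with a compiled loop, which is where most of the 1000x comes from.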
Python vectorization example
This is one example where I see a huge difference:
import timeit
import numpy as np

def print_durations(durations):
    print('min: {min:5.1f}ms, mean: {mean:5.1f}ms, max: {max:6.1f}ms (total: {len})'
          .format(min=min(durations) * 10**3,
                  mean=np.mean(durations) * 10**3,
                  max=max(durations) * 10**3,
                  len=len(durations)))

def test_speed(nb_items=1000):
    print('## nb_items={}'.format(nb_items))
    durations = timeit.repeat('cosine_similarity(mat)',
                              setup='from sklearn.metrics.pairwise import cosine_similarity;import numpy as np;mat = np.random.random(({}, 50))'.format(nb_items),
                              repeat=10, number=1)
    print_durations(durations)
    durations = timeit.repeat('for i, j in combinations(range({}), 2): cosine_similarity([mat[i], mat[j]])'.format(nb_items),
                              setup='from itertools import combinations;from sklearn.metrics.pairwise import cosine_similarity;import numpy as np;mat = np.random.random(({}, 50))'.format(nb_items),
                              repeat=10, number=1)
    print_durations(durations)

print('First vectorized, second with loops')
test_speed(nb_items=100)
test_speed(nb_items=200)
test_speed(nb_items=300)
test_speed(nb_items=400)
test_speed(nb_items=500)

A very quick method to approximate np.random.dirichlet with large dimension

I'd like to evaluate np.random.dirichlet with large dimension as quickly as possible. More precisely, I'd like a function that approximates the call below while being at least 10 times faster. Empirically, I observed that the small-dimension version of this function outputs one or two entries on the order of 0.1, while every other entry is so small as to be immaterial. But this observation isn't based on any rigorous assessment. The approximation doesn't need to be very accurate, but I want something not too crude, as I'm using this noise for MCTS.
def g():
    np.random.dirichlet([0.03]*4840)

>>> timeit.timeit(g, number=1000)
0.35117408499991143
Assuming your alpha is fixed over components and used for many iterations, you could tabulate the ppf of the corresponding gamma distribution. This is probably available as scipy.stats.gamma.ppf, but we can also use scipy.special.gammaincinv. This function seems rather slow, so the tabulation is a significant upfront investment.
Here is a crude implementation of the general idea:
import numpy as np
from scipy import special

class symm_dirichlet:
    def __init__(self, alpha, resolution=2**16):
        self.alpha = alpha
        self.resolution = resolution
        self.range, delta = np.linspace(0, 1, resolution,
                                        endpoint=False, retstep=True)
        self.range += delta / 2
        self.table = special.gammaincinv(self.alpha, self.range)

    def draw(self, n_sampl, n_comp, interp='nearest'):
        if interp != 'nearest':
            raise NotImplementedError
        gamma = self.table[np.random.randint(0, self.resolution,
                                             (n_sampl, n_comp))]
        return gamma / gamma.sum(axis=1, keepdims=True)

import time, timeit

t0 = time.perf_counter()
X = symm_dirichlet(0.03)
t1 = time.perf_counter()
print(f'Upfront cost {t1-t0:.3f} sec')
print('Running cost per 1000 samples of width 4840')
print('tabulated {:3f} sec'.format(timeit.timeit(
    'X.draw(1, 4840)', number=1000, globals=globals())))
print('np.random.dirichlet {:3f} sec'.format(timeit.timeit(
    'np.random.dirichlet([0.03]*4840)', number=1000, globals=globals())))
Sample output:
Upfront cost 13.067 sec
Running cost per 1000 samples of width 4840
tabulated 0.059365 sec
np.random.dirichlet 0.980067 sec
It is better to check whether the approximation is roughly correct:
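The original check is not shown here; a minimal sketch of one way to do it (my own suggestion, comparing the mean sorted component weights of both samplers):

# Sketch: compare mean sorted component weights of the tabulated sampler
# against np.random.dirichlet (a rough sanity check, not the original one).
exact = np.sort(np.random.dirichlet([0.03] * 4840, size=1000), axis=1)
approx = np.sort(X.draw(1000, 4840), axis=1)
# The largest few components dominate; their averages should be close.
print('exact  top-5 mean weights:', exact.mean(axis=0)[-5:])
print('approx top-5 mean weights:', approx.mean(axis=0)[-5:])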

Speed up Pyspark mllib job

I am running a Pyspark MLLib job on EMR.
The RDD has 98000 rows.
When I run KMeans on it, it takes hours and still shows 0%.
I tried enabling maximizeResourceAllocation and increasing the memory of the executors and the driver, but it is still the same.
How can I speed it up?
Following is the code I am executing:
from numpy import array
from math import sqrt
import time
from pyspark.mllib.clustering import KMeans, KMeansModel

start = time.time()
clusters = KMeans.train(parsedata, 88000, maxIterations=5, initializationMode="random")

def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedata.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
print(time.time() - start)
Any help or suggestions greatly appreciated.
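One thing worth flagging: the second argument to KMeans.train is k, the number of clusters, and 88000 clusters for a 98000-row RDD is close to one cluster per point, which makes every iteration enormously expensive. A sketch of a more typical call (k=50 is a placeholder assumption; choose it for your data), with the input cached since KMeans makes repeated passes:

# Sketch: cache the RDD and use a much smaller k; 88000 clusters for
# 98000 rows is nearly one cluster per point, which is why the job crawls.
parsedata.cache()
clusters = KMeans.train(parsedata, k=50,
                        maxIterations=5,
                        initializationMode="random")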
