Perfect scores in multiclassclassification? - knn

I am working on a multiclass classification problem with 3 (1, 2, 3) classes being perfectly distributed. (70 instances of each class resulting in (210, 8) dataframe).
Now my data has all the 3 classes distributed in order i.e first 70 instances are class1, next 70 instances are class 2 and last 70 instances are class 3. I know that this kind of distribution will lead to good score on train set but poor score on test set as the test set has classes that the model has not seen. So I used stratify parameter in train_test_split. Below is my code:-
# SPLITTING
train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size = 0.2, random_state =
69, stratify = y)
cross_val_model = cross_val_score(pipe, train_x, train_y, cv = 5,
n_jobs = -1, scoring = 'f1_macro')
s_score = cross_val_model.mean()
def objective(trial):
model__n_neighbors = trial.suggest_int('model__n_neighbors', 1, 20)
model__metric = trial.suggest_categorical('model__metric', ['euclidean', 'manhattan',
'minkowski'])
model__weights = trial.suggest_categorical('model__weights', ['uniform', 'distance'])
params = {'model__n_neighbors' : model__n_neighbors,
'model__metric' : model__metric,
'model__weights' : model__weights}
pipe.set_params(**params)
return np.mean( cross_val_score(pipe, train_x, train_y, cv = 5,
n_jobs = -1, scoring = 'f1_macro'))
knn_study = optuna.create_study(direction = 'maximize')
knn_study.optimize(objective, n_trials = 10)
knn_study.best_params
optuna_gave_score = knn_study.best_value
pipe.set_params(**knn_study.best_params)
pipe.fit(train_x, train_y)
pred = pipe.predict(test_x)
c_matrix = confusion_matrix(test_y, pred)
c_report = classification_report(test_y, pred)
Now the problem is that I am getting perfect scores on everything. The f1 macro score from performing cv is 0.898. Below are my confusion matrix and classification report:-
14 0 0
0 14 0
0 0 14
Classification Report:-
precision recall f1-score support
1 1.00 1.00 1.00 14
2 1.00 1.00 1.00 14
3 1.00 1.00 1.00 14
accuracy 1.00 42
macro avg 1.00 1.00 1.00 42
weighted avg 1.00 1.00 1.00 42
Am I overfitting or what?

Finally got the answer. The dataset I was using was the issue. The dataset was tailor made for knn algorithm and that was why I was getting perfect scores as I was using the same algorithm.
I got came to this conclusion after I performed a clustering exercise on this dataset and the K-Means algorithm perfectly predicted the clusters.

Related

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty()?

I am working on a dataset to test the association between empirical antibiotics (variable emp, the antibiotics are cefuroxime or ceftriaxone compared with a reference antibiotic) and 30-day mortality (variable mort30). The data comes from patients admitted in 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients on hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals there were violations of the proportional hazards assumption and I tried adding a time transformation (tt) to the model. Unfortunately, the coxme() does not offer the possibility for time transformations.
Therfore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and cluster. Surprisingly, the standard errors I get using the cluster() option are much smaller than using the coxme() or frailty().
**Does anyone know what is the explanation for this and which option would provide the most reliable estimates?
**
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>

Fitness Proportionate Selection when some fitnesses are 0

I have a question about what to do with the fitnesses (fitness'?) that are 0 when getting the fitness proportionate probabilities. Should the container for the members be sorted by highest fitness first, then do code similar to this:
for all members of population
sum += fitness of this individual
end for
for all members of population
probability = sum of probabilities + (fitness / sum)
sum of probabilities += probability
end for
loop until new population is full
do this twice
number = Random between 0 and 1
for all members of population
if number > probability but less than next probability then you have been selected
end for
end
create offspring
end loop
My problem that I am seeing as I go through one iteration by hand with randomly generated members is that I have some member's fitness as 0, but when getting the probability of those members, it keeps the same probability as the last non zero member. Is there a way I can separate the non zero probabilities from the zero probabilities? I was thinking that even if I sort based on highest fitness, the last non zero member would have the same probability as the zero probabilities.
Consider this example:
individual fitness(i) probability(i) partial_sum(i)
1 10 10/20 = 0.50 0.50
2 3 3/20 = 0.15 0.5+0.15 = 0.65
3 2 2/20 = 0.10 0.5+0.15+0.1 = 0.75
4 0 0/20 = 0.00 0.5+0.15+0.1+0.0 = 0.75
5 5 5/20 = 0.25 0.5+0.15+0.1+0.0+0.25 = 1.00
------
Sum 20
Now if number = Random between [0;1[ we are going to pick individual i if:
individual condition
1 0.00 <= number < partial_sum(1) = 0.50
2 0.50 = partial_sum(1) <= number < partial_sum(2) = 0.65
3 0.65 = partial_sum(2) <= number < partial_sum(3) = 0.75
4 0.75 = partial_sum(3) <= number < partial_sum(4) = 0.75
5 0.75 = partial_sum(4) <= number < partial_sum(5) = 1.00
If an individual has fitness 0 (e.g. I4) it cannot be selected because of its selection condition (e.g. I4 has the associated condition 0.75 <= number < 0.75).

Algorithm for optimal expected amount in a profit/loss game

I came upon the following question recently,
"You have a box which has G green and B blue coins. Pick a random coin, G gives a profit of +1 and blue a loss of -1. If you play optimally what is the expected profit."
I was thinking of using a brute force algorithm where I consider all possibilities of combinations of green and blue coins but I'm sure there must be a better solution for this (range of B and G was from 0 to 5000). Also what does playing optimally mean? Does it mean that if i pick all blue coins then I would continue playing till all green coins are also picked? If so then this means I shouldn't consider all possibilities of green and blue coins?
The "obvious" answer is to play whenever there's more green coins than blue coins. In fact, this is wrong. For example, if there's 999 green coins and 1000 blue coins, here's a strategy that takes an expected profit:
Take 2 coins
If GG -- stop with a profit of 2
if BG or GB -- stop with a profit of 0
if BB -- take all the remaining coins for a profit of -1
Since the first and last possibilities both occur with near 25% probability, your overall expectation is approximately 0.25*2 - 0.25*1 = 0.25
This is just a simple strategy in one extreme example that shows that the problem is not as simple as it first seems.
In general, the expectations with g green coins and b blue coins is given by a recurrence relation:
E(g, 0) = g
E(0, b) = 0
E(g, b) = max(0, g(E(g-1, b) + 1)/(b+g) + b(E(g, b-1) - 1)/(b+g))
The max in the final row occurs because if it's -EV to play, then you're better stopping.
These recurrence relations can be solved using dynamic programming in O(gb) time.
from fractions import Fraction as F
def gb(G, B):
E = [[F(0, 1)] * (B+1) for _ in xrange(G+1)]
for g in xrange(G+1):
E[g][0] = F(g, 1)
for b in xrange(1, B+1):
for g in xrange(1, G+1):
E[g][b] = max(0, (g * (E[g-1][b]+1) + b * (E[g][b-1]-1)) * F(1, (b+g)))
for row in E:
for v in row:
print '%5.2f' % v,
print
print
return E[G][B]
print gb(8, 10)
Output:
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2.00 1.33 0.67 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3.00 2.25 1.50 0.85 0.34 0.00 0.00 0.00 0.00 0.00 0.00
4.00 3.20 2.40 1.66 1.00 0.44 0.07 0.00 0.00 0.00 0.00
5.00 4.17 3.33 2.54 1.79 1.12 0.55 0.15 0.00 0.00 0.00
6.00 5.14 4.29 3.45 2.66 1.91 1.23 0.66 0.23 0.00 0.00
7.00 6.12 5.25 4.39 3.56 2.76 2.01 1.34 0.75 0.30 0.00
8.00 7.11 6.22 5.35 4.49 3.66 2.86 2.11 1.43 0.84 0.36
7793/21879
From this you can see that the expectation is positive to play with 8 green and 10 blue coins (EV=7793/21879 ~= 0.36), and you even have positive expectation with 2 green and 3 blue coins (EV=0.2)
Simple and intuitive answer:
you should start off with an estimate for the total number of blue and green coins. After each pick you will update this estimate. If you estimate there are more blue coins than green coins at any point you should stop.
Example:
you start and you pick a coin. Its green so you estimate 100% of the coins are green. You pick a blue so you estimate 50% of coins are green. You pick another blue coin so you estimate 33% of the coins are green. At this point is isn't worth playing anymore, according to your estimate, so you stop.
This answer is wrong; see Paul Hankin's answer for counterexamples and a proper analysis. I leave this answer here as a learning example for all of us.
Assuming that your choice is only when to stop picking coins, you continue as long as G > B. That part is simple. If you start with G < B, then you never start drawing, and your gain is 0. For G = B, no strategy will get you a mathematical advantage; the gain there is also 0.
For the expected reward, take this in two steps:
(1) Expected value on any draw sequence. Do this recursively, figuring the chance of getting green or blue on the first draw, and then the expected values for the new state (G-1, B) or (G, B-1). You will quickly see that the expected value of any given draw number (such as all possibilities for the 3rd draw) is the same as the original.
Therefore, your expected value on any draw is e = (G-B) / (G+B). Your overall expected value is e * d, where d is the number of draws you choose.
(2) What is the expected number of draws? How many times do you expect to draw before G = B? I'll leave this as an exercise for the student, but note the previous idea of doing this recursively. You might find it easier to describe the state of the game as (extra, total), where extra = G-B and total = G+B.
Illustrative exercise: given G=4, B=2, what is the chance that you'll draw GG on the first two draws (and then stop the game)? What is the gain from that? How does that compare with the (4-2)/(4+2) advantage on each draw?

How to implement branch selection based on probability?

I want the program to choose something with a set probability. For example, there is 0.312 probability of choosing path A and 0.688 probability of choosing path B. The only way I can think is a naive way to select randomly from the interval 0-1 and checking if <=0.312. Is there some better approach that extends to more than 2 elements?
Following is a way to do it with more efficiently than multiple if else statements: -
Suppose
a = 0.2, b = 0.35, c = 0.15, d = 0.3.
Make an array where p[0] corresponds to a and p[1] corresponds to b and so on
run a loop evaluating sum of probabilties
p[0] = 0.2
p[1] = 0.2 + 0.35 = 0.55
p[2] = 0.55 + 0.15 = 0.70
p[3] = 0.70 + 0.30 = 1
Generate a random number in [0,1]. Do binary search on p for random number. The interval that search returns will be your branch
eg.
random no = 0.6
result = binarySearch(0.6)
result = 2 using above intervals
2 => branch c

order of growth in algorithms

Suppose that you time a program as a function of N and produce
the following table.
N seconds
-------------------
19683 0.00
59049 0.00
177147 0.01
531441 0.08
1594323 0.44
4782969 2.46
14348907 13.58
43046721 74.99
129140163 414.20
387420489 2287.85
Estimate the order of growth of the running time as a function of N.
Assume that the running time obeys a power law T(N) ~ a N^b. For your
answer, enter the constant b. Your answer will be marked as correct
if it is within 1% of the target answer - we recommend using
two digits after the decimal separator, e.g., 2.34.
Can someone explain how to calculate this?
Well, it is a simple mathematical problem.
I : a*387420489^b = 2287.85 -> a = 387420489^b/2287.85
II: a*43046721^b = 74.99 -> a = 43046721^b/74.99
III: (I and II)-> 387420489^b/2287.85 = 43046721^b/74.99 ->
-> http://www.purplemath.com/modules/solvexpo2.htm
Use logarithms to solve.
1.You should calculate the ratio of the growth change from one row to the one next
N seconds
--------------------
14348907 13.58
43046721 74.99
129140163 414.2
387420489 2287.85
2.Calculate the change's ratio for N
43046721 / 14348907 = 3
129140163 / 43046721 = 3
therefore the rate of change for N is 3.
3.Calculate the change's ratio for seconds
74.99 / 13.58 = 5.52
Now let check the ratio between one more pare of rows to be sure
414.2 / 74.99 = 5.52
so the change's ratio for seconds is 5.52
4.Build the following equitation
3^b = 5.52
b = 1.55
Finally we get that the order of growth of the running time is 1.55.

Resources