How to use both binary and continuous features in the k-Nearest-Neighbor algorithm?

My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:
If symmetric vs. asymmetric is represented as 0 and 1, and some less important ratio ranges from 0 to 100, then changing from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.
I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?

You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.
It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
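For illustration, a minimal Python/NumPy sketch of that idea (the toy matrix X and its two-feature layout are made up for the example): each feature is divided by its standard deviation before computing the Euclidean distance.
import numpy as np

def normalized_euclidean(a, b, std):
    # divide each feature by its standard deviation, then take the usual distance
    return np.sqrt(np.sum(((a - b) / std) ** 2))

# toy data: column 0 = binary symmetry flag, column 1 = ratio on a 0-100 scale
X = np.array([[0.0,  5.0],
              [1.0, 30.0],
              [0.0, 55.0],
              [1.0, 80.0]])
std = X.std(axis=0)                  # per-feature standard deviation
print(normalized_euclidean(X[0], X[1], std))
After this scaling, flipping the binary feature and a one-standard-deviation change in the ratio contribute comparably to the distance.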

If I understand your question correctly, normalizing (aka 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighted dimensions, e.g.,
ev_scaled = (ev_raw - ev_min) / (ev_max - ev_min)
In R, for instance, you can write this function:
ev_scaled = function(x) {
(x - min(x)) / (max(x) - min(x))
}
which works like this:
# generate some data:
# v1, v2 are two expectation variables in the same dataset
# but have very different 'scale':
> v1 = seq(100, 550, 50)
> v1
[1] 100 150 200 250 300 350 400 450 500 550
> v2 = sort(sample(seq(.1, 20, .1), 10))
> v2
[1] 0.2 3.5 5.1 5.6 8.0 8.3 9.9 11.3 15.5 19.4
> mean(v1)
[1] 325
> mean(v2)
[1] 8.68
# now normalize v1 & v2 using the function above:
> v1_scaled = ev_scaled(v1)
> v1_scaled
[1] 0.000 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
> v2_scaled = ev_scaled(v2)
> v2_scaled
[1] 0.000 0.172 0.255 0.281 0.406 0.422 0.505 0.578 0.797 1.000
> mean(v1_scaled)
[1] 0.5
> mean(v2_scaled)
[1] 0.442
> range(v1_scaled)
[1] 0 1
> range(v2_scaled)
[1] 0 1

You can also try Mahalanobis distance instead of Euclidean.
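If you want to try that, here is a small SciPy sketch (the random X is placeholder data): the Mahalanobis distance uses the inverse covariance matrix of the features, so differently scaled and correlated features are accounted for automatically.
import numpy as np
from scipy.spatial.distance import mahalanobis

# placeholder feature matrix: rows = samples, columns = features
X = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance matrix
print(mahalanobis(X[0], X[1], VI))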

Related

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty())?

I am working on a dataset to test the association between empirical antibiotics (variable emp; the antibiotics are cefuroxime or ceftriaxone, compared with a reference antibiotic) and 30-day mortality (variable mort30). The data come from patients admitted to 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients at the hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals there were violations of the proportional hazards assumption, so I tried adding a time transformation (tt) to the model. Unfortunately, coxme() does not offer the possibility of time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and with cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than with coxme() or frailty().
**Does anyone know what the explanation for this is, and which option would provide the most reliable estimates?**
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
                      NULL  Integrated     Fitted
Log-likelihood   -313.8427   -307.6543  -305.8967

                  Chisq   df         p  AIC  BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik  15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>

A variant of the Knapsack algorithm

I have a list of items, a, b, c,..., each of which has a weight and a value.
The 'ordinary' Knapsack algorithm will find the selection of items that maximises the value of the selected items, whilst ensuring that the weight is below a given constraint.
The problem I have is slightly different. I wish to minimise the value (easy enough by using the reciprocal of the value), whilst ensuring that the weight is at least the value of the given constraint, not less than or equal to the constraint.
I have tried re-routing the idea through the ordinary Knapsack algorithm, but this can't be done. I was hoping there is another combinatorial algorithm that I am not aware of that does this.
In the German Wikipedia article it's formalized as:
finite set of objects U
w: U -> R   (weight function)
v: U -> R   (value function)
B in R      (constraint right-hand side)
Find a subset K of U subject to:
sum of w(u) over all u in K  <=  B
such that:
sum of v(u) over all u in K  is maximized
So there is no restriction like nonnegativity.
Just use negative weights, negative values and a negative B.
The basic concept is:
sum of w(u) over u in K  <=  B
<=>
-sum of w(u) over u in K  >=  -B
So in your case:
classic constraint: x0 + x1 <= B | 3 + 7 <= 12 Y | 3 + 10 <= 12 N
becomes: -x0 - x1 <= -B |-3 - 7 <=-12 N |-3 - 10 <=-12 Y
So whether this is allowed depends on the particular implementation/software. In terms of the optimization problem itself there is no issue: the integer-programming formulation for your case is as natural as the classic one (and bounded).
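(Aside: for the 0-1 case, "minimise total value subject to total weight >= B" can also be attacked directly with a small dynamic program over weights capped at B. A sketch in plain Python, assuming non-negative integer weights; the function name is mine, not from any library:)
def min_value_cover(weights, values, B):
    # dp[j] = minimum total value of a subset whose total weight is >= j
    INF = float("inf")
    dp = [0.0] + [INF] * B
    for w, v in zip(weights, values):
        for j in range(B, 0, -1):          # descending j: each item used at most once
            prev = dp[max(0, j - w)]
            if prev + v < dp[j]:
                dp[j] = prev + v
    return dp[B]

# the instance from the IP demo below; expected answer: 5 (items with weights 43 and 8)
print(min_value_cover([37, 43, 12, 8, 9], [11, 5, 15, 0, 16], 50))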
Python Demo based on Integer-Programming
Code
import numpy as np
import scipy.sparse as sp
from cylp.cy import CyClpSimplex
np.random.seed(1)
""" INSTANCE """
weight = np.random.randint(50, size = 5)
value = np.random.randint(50, size = 5)
capacity = 50
""" SOLVE """
n = weight.shape[0]
model = CyClpSimplex()
x = model.addVariable('x', n, isInt=True)
model.objective = value # MODIFICATION: default = minimize!
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int) # assumes existence
print("INSTANCE")
print(" weights: ", weight)
print(" values: ", value)
print(" capacity: ", capacity)
print("Solution")
print(x_sol)
print("sum weight: ", x_sol.dot(weight))
print("value: ", x_sol.dot(value))
Small remarks
This code is just a demo using a somewhat low-level library; other tools might be better suited (e.g. on Windows: pulp)
it's the classic integer-programming formulation from the Wikipedia article, modified as mentioned above
it will scale very well as the underlying solver is pretty good
as written, it's solving the 0-1 knapsack (only variable bounds would need to be changed)
Small look at the core-code:
# create model
model = CyClpSimplex()
# create one variable for each how-often-do-i-pick-this-item decision
# variable needs to be integer (or binary for 0-1 knapsack)
x = model.addVariable('x', n, isInt=True)
# the objective value of our IP: a linear-function
# cylp only needs the coefficients of this function: c0*x0 + c1*x1 + c2*x2...
# we only need our value vector
model.objective = value # MODIFICATION: default = minimize!
# WARNING: typically one should always use variable-bounds
# (cylp problems...)
# workaround: express bounds lower_bound <= var <= upper_bound as two constraints
# a constraint is an affine-expression
# sp.eye creates a sparse-diagonal with 1's
# example: sp.eye(3) * x >= 5
# 1 0 0 -> 1 * x0 + 0 * x1 + 0 * x2 >= 5
# 0 1 0 -> 0 * x0 + 1 * x1 + 0 * x2 >= 5
# 0 0 1 -> 0 * x0 + 0 * x1 + 1 * x2 >= 5
model += sp.eye(n) * x >= np.zeros(n) # could be improved
model += sp.eye(n) * x <= np.ones(n) # """
# cylp somewhat outdated: need numpy's matrix class
# apart from that it's just the weight-constraint as defined at wiki
# same affine-expression as above (but only a row-vector-like matrix)
model += np.matrix(-weight) * x <= -capacity # MODIFICATION
# internal conversion of type needed to treat it as an IP (or else it would be an LP)
cbcModel = model.getCbcModel()
cbcModel.logLevel = True
status = cbcModel.solve()
# type-casting
x_sol = np.array(cbcModel.primalVariableSolution['x'].round()).astype(int)
Output
Welcome to the CBC MILP Solver
Version: 2.9.9
Build Date: Jan 15 2018
command line - ICbcModel -solve -quit (default strategy 1)
Continuous objective value is 4.88372 - 0.00 seconds
Cgl0004I processed model has 1 rows, 4 columns (4 integer (4 of which binary)) and 4 elements
Cutoff increment increased from 1e-05 to 0.9999
Cbc0038I Initial state - 0 integers unsatisfied sum - 0
Cbc0038I Solution found of 5
Cbc0038I Before mini branch and bound, 4 integers at bound fixed and 0 continuous
Cbc0038I Mini branch and bound did not improve solution (0.00 seconds)
Cbc0038I After 0.00 seconds - Feasibility pump exiting with objective of 5 - took 0.00 seconds
Cbc0012I Integer solution of 5 found by feasibility pump after 0 iterations and 0 nodes (0.00 seconds)
Cbc0001I Search completed - best objective 5, took 0 iterations and 0 nodes (0.00 seconds)
Cbc0035I Maximum depth 0, 0 variables fixed on reduced cost
Cuts at root node changed objective from 5 to 5
Probing was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Gomory was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Knapsack was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Clique was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
MixedIntegerRounding2 was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
FlowCover was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
TwoMirCuts was tried 0 times and created 0 cuts of which 0 were active after adding rounds of cuts (0.000 seconds)
Result - Optimal solution found
Objective value: 5.00000000
Enumerated nodes: 0
Total iterations: 0
Time (CPU seconds): 0.00
Time (Wallclock seconds): 0.00
Total time (CPU seconds): 0.00 (Wallclock seconds): 0.00
INSTANCE
weights: [37 43 12 8 9]
values: [11 5 15 0 16]
capacity: 50
Solution
[0 1 0 1 0]
sum weight: 51
value: 5

Clustering points to reach global minimum

For a number of points, for example 100 points, each pair has a 'connection' (a number). The goal of the algorithm is to split those points into a given number of clusters (say, 5 clusters), minimizing the total connection weight inside the clusters.
input:
A matrix with shape n * n, where matrix[i][j] describes the connection between points i and j (the matrix is symmetric), and the number of clusters m.
output:
m clusters for the n points, such that the total connection weight inside the clusters is minimized:
T = Σ_{C in clusters} Σ_{i<j, i,j in C} M_ij
(The goal is to minimize T above.)
For example: 5 points with the matrix
      1    2    3    4    5
1   0.1  0.1  0.3  0.5  0.7
2   0.1  0.1  0.7  0.9  1.1
3   0.3  0.7  0.5  0.1  0.2
4   0.5  0.9  0.1  0.3  0.5
5   0.7  1.1  0.2  0.5  0.1
To split into 2 clusters, the splitting
Cluster 1: {1,2}
Cluster 2: {3,4,5}
has the total internal connection of T = C1 + C2 = M12 + M34 + M35 + M45 = 0.1 + 0.1 + 0.2 + 0.5 = 0.9
The splitting
Cluster 1: {1,3,4}
Cluster 2: {2,5}
has the total internal connection T = C1 + C2 = M13 + M14 + M34 + M25 = 0.3 + 0.5 + 0.1 + 1.1 = 2.0.
The goal is to find the splitting with the lowest total internal connection.
This is easy when n and m are small: just loop over all possible cases to find the global minimum. But when n and m get bigger, exhaustive enumeration is not possible.
I have tried the Kernighan–Lin algorithm on this problem: initialize with a random splitting, then define two moves, moving a point into another cluster and swapping two points between clusters; each step applies whichever move lowers the total connection inside the clusters the most, then recalculates, until no more insertion/swap can lower the total (a greedy strategy).
However, it only reaches a local minimum, and different initializations give different results. Is there a standard way to solve this problem and reach the global minimum?
The problem is supposedly NP-hard, so either you accept a local optimum, or you have to try all O(k^n) possibilities.
You can use a local optimum to bound your search, but there is no guarantee that this helps much.
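One practical mitigation (not a guarantee of the global minimum) is to run the greedy, move-based local search from many random starts and keep the best local optimum found. A minimal sketch in Python, with 0-based point indices and helper names of my own choosing:
import itertools
import random

def total_internal(M, clusters):
    # T = sum of M[i][j] over all pairs i < j that share a cluster
    return sum(M[i][j] for c in clusters
                       for i, j in itertools.combinations(sorted(c), 2))

def greedy_moves(M, n, m, rng):
    # one descent from a random balanced start, using only "move one point" steps
    labels = [p % m for p in range(n)]
    rng.shuffle(labels)
    clusters = [set() for _ in range(m)]
    for p, c in enumerate(labels):
        clusters[c].add(p)
    improved = True
    while improved:
        improved = False
        for p in range(n):
            src = labels[p]
            if len(clusters[src]) == 1:
                continue                          # keep every cluster non-empty
            cost_here = sum(M[p][q] for q in clusters[src] if q != p)
            for dst in range(m):
                if dst == src:
                    continue
                cost_there = sum(M[p][q] for q in clusters[dst])
                if cost_there < cost_here - 1e-12:
                    clusters[src].remove(p)
                    clusters[dst].add(p)
                    labels[p] = src = dst
                    cost_here = cost_there
                    improved = True
    return clusters

def best_of_restarts(M, m, restarts=50, seed=0):
    # random restarts: keep the best local optimum found
    n = len(M)
    rng = random.Random(seed)
    best, best_T = None, float("inf")
    for _ in range(restarts):
        clusters = greedy_moves(M, n, m, rng)
        T = total_internal(M, clusters)
        if T < best_T:
            best, best_T = clusters, T
    return best, best_T

# the 5-point example from the question, with 0-based indices
M = [[0.1, 0.1, 0.3, 0.5, 0.7],
     [0.1, 0.1, 0.7, 0.9, 1.1],
     [0.3, 0.7, 0.5, 0.1, 0.2],
     [0.5, 0.9, 0.1, 0.3, 0.5],
     [0.7, 1.1, 0.2, 0.5, 0.1]]
print(best_of_restarts(M, m=2))
For the small example this should find the T = 0.9 split; for larger instances it still only returns the best local optimum encountered, so it complements rather than replaces an exact method.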

How to create a uniformly random matrix in Julia?

I want to get a matrix with uniformly random values sampled from [-1, 2]:
x= rand([-1,2],(3,3))
3x3 Array{Int64,2}:
-1 -1 -1
2 -1 -1
-1 -1 -1
but it only takes the two values -1 and 2 into consideration, and I'm looking for continuous values, for instance -0.9, 0.75, -0.09, 1.80.
How can I do that?
Note: I am assuming here that you're looking for uniform random variables.
You can also use the Distributions package:
## Pkg.add("Distributions") # If you don't already have it installed.
using Distributions
rand(Uniform(-1,2), 3,3)
I do quite like isebarn's solution though, as it gets you thinking about the actual properties of the underlying probability distributions.
For a random number in the range [a,b]:
rand() * (b-a) + a
and it works for a matrix as well:
rand(3,3) * (2 - (-1)) - 1
3x3 Array{Float64,2}:
1.85611 0.456955 -0.0219579
1.91196 -0.0352324 0.0296134
1.63924 -0.567682 0.45602
You can also use a FloatRange{Float64} with the desired step (note this gives values on a discrete grid rather than truly continuous ones):
julia> rand(-1.0:0.01:2.0, 3, 3)
3x3 Array{Float64,2}:
0.79 1.73 0.95
0.73 1.4 -0.46
1.42 1.68 -0.55

Math algorithm question

I'm not sure if this can be done without some determining factor... but I wanted to see if someone knows of a way to do this.
I want to create a shifting scale for numbers.
Let's say I have the number 26000. I want the outcome of this algorithm to be 6500; or 25% of the original number. But if I have the number 5000, I want the outcome to be 2500; or 50% of the original number.
The percentages don't have to be exact, this is just an example.
I just want to have like a sine wave sort of thing. As the input number gets higher, the output number is a lower percentage of the input.
Does that make sense?
Plot some points in Excel and use the "show formula" option on the line.
Something like f(x) = x / log x (base-10 logarithm)?
x | f(x)
=======================
26000 | 5889 (22.6 %)
5000 | 1351 (27.2 %)
100000 | 20000 (20 %)
1000000 | 166666 (16.6 %)
Just a simple example. You can tweak it by playing with the base of the logarithm, by adding multiplicative constants on the numerator (x) or denominator (log x), by using square roots, squaring (or taking the root of) log x or x etc.
Here's what f(x) = 2*log(x)^2*sqrt(x) gives:
x | f(x)
=======================
26000 | 6285 (24 %)
5000 | 1934 (38 %)
500 | 325 (65 %)
100 | 80 (80 %)
1000000 | 72000 (7.2 %)
100000 | 15811 (15 %)
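(If you want to experiment with the constants, here is a tiny Python sketch that evaluates both suggestions; base-10 logarithms are assumed, since that is what the tables above use:)
import math

def f1(x):
    return x / math.log10(x)                      # x / log x

def f2(x):
    return 2 * math.log10(x) ** 2 * math.sqrt(x)  # 2 * log(x)^2 * sqrt(x)

for x in (100, 500, 5000, 26000, 100000, 1000000):
    print(x, round(f1(x)), round(f2(x)))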
A suitable logarithmic scale might help.
It may be possible to define the function you want exactly if you specify a third transformation in addition to the two you've already mentioned. If you have some specific aim in mind, it's quite likely to fit a well-known mathematical definition which at least one poster could identify for you. It does sound as though you're talking about a logarithmic function. You'll have to be more specific about your requirements to define a useful algorithm however.
I'd suggest you play with the power-law family of functions, c*x^a == c * pow(x,a), where a is the power. If you wanted the output to be an exact fraction of the input, you would choose a=1 and it would just be a constant fraction. But you want the percentage to slowly decrease, so you should choose a<1. For example, we might choose a = 0.9 and c = 0.2 and get
1 0.2
10 1.59
100 12.6
1000 100.2
So it ranges from 20% at 1 to 10% at 1000. You can pick smaller a to make the fraction decrease more rapidly. (And you can scale everything to fit your range.)
In particular, if c*5000^a = 2500 and c*26000^a = 6500, then by dividing we get (5.2)^a = 2.6, which we can solve as a = log(2.6)/log(5.2) = 0.57957.... Then we plug back in: 5000^a ≈ 139.3, so c*139.3 = 2500 and c = 17.95...
Now the progression goes like so
1000 984
3000 1859
5000 2500
10000 3736
15000 4726
26000 6500
50000 9495
90000 13349
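For reproducibility, here is a short Python sketch of the same two-point fit (the helper name is mine; the anchors 5000 → 2500 and 26000 → 6500 come from the example above):
import math

def fit_power_law(x1, y1, x2, y2):
    # solve y = c * x**a through the two anchor points
    a = math.log(y2 / y1) / math.log(x2 / x1)
    c = y1 / x1 ** a
    return c, a

c, a = fit_power_law(5000, 2500, 26000, 6500)
print(a, c)                      # roughly a = 0.5796, c = 17.95

for x in (1000, 3000, 5000, 10000, 15000, 26000, 50000, 90000):
    y = c * x ** a
    print(x, round(y), round(100 * y / x, 1), "%")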
This is somewhat similar to simple compression schemes used for analogue audio. See Wikipedia entry for Companding.
