How can I find k minimum bounding rectangles to enclose all the given points?

Given a parameter k for the number of boxes and n data points, is there any way I can find or approximate k axis-aligned bounding rectangles that enclose all the points while keeping the sum of the areas of the rectangles minimal?

One way is to write this directly as a mathematical optimization problem.
A high-level optimization model can look as follows:
We first define the decision variables:

r(k,c)   coordinates of box k (e.g. c = {x,y,w,h}); continuous variables with appropriate bounds
x(i,k)   1 if point i is assigned to box k, 0 otherwise; binary variables

Then the 2D model can look like:

minimize    sum(k, r(k,w)*r(k,h))                       (sum of areas)
subject to  sum(k, x(i,k)) = 1   for all i              (assign each point to exactly one box)
            x(i,k) = 1 ==> point i is inside box k      (can be formulated as linear big-M constraints or indicator constraints)
For testing, I generated 30 points in the [0,1]x[0,1] unit box.
Using |k| = 5 boxes, we get the solution reported below.
This can be directly generalized to more dimensions (plotting becomes more difficult).
This is essentially a model that combines an assignment problem (for the x variables) with a location problem (for the r variables). It probably only works for relatively small data sets. For this example I used Gurobi (as a non-convex MIQP), and the solution shown is the proven global optimum. Note that even for higher-dimensional problems, we can reformulate things into a non-convex MIQCP (i.e. still solvable by Gurobi).
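To make this concrete, here is a minimal sketch of how such a model could be set up with gurobipy (my own illustration with made-up random points and variable names; it is not the code that produced the results below):

import random
import gurobipy as gp
from gurobipy import GRB

random.seed(1)
points = [(random.random(), random.random()) for _ in range(30)]
K = 5

m = gp.Model("k_min_area_boxes")
m.Params.NonConvex = 2                      # the w*h objective is bilinear (non-convex)

# box k: lower-left corner (bx, by) and size (bw, bh)
bx = m.addVars(K, lb=0, ub=1, name="bx")
by = m.addVars(K, lb=0, ub=1, name="by")
bw = m.addVars(K, lb=0, ub=1, name="bw")
bh = m.addVars(K, lb=0, ub=1, name="bh")
# assign[i, k] = 1 if point i is placed in box k
assign = m.addVars(len(points), K, vtype=GRB.BINARY, name="assign")

# every point is assigned to exactly one box
m.addConstrs(assign.sum(i, "*") == 1 for i in range(len(points)))

# indicator constraints: assignment forces the point to lie inside the box
for i, (px, py) in enumerate(points):
    for k in range(K):
        m.addConstr((assign[i, k] == 1) >> (bx[k] <= px))
        m.addConstr((assign[i, k] == 1) >> (bx[k] + bw[k] >= px))
        m.addConstr((assign[i, k] == 1) >> (by[k] <= py))
        m.addConstr((assign[i, k] == 1) >> (by[k] + bh[k] >= py))

# minimize the total area of the boxes
m.setObjective(gp.quicksum(bw[k] * bh[k] for k in range(K)), GRB.MINIMIZE)
m.optimize()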
For completeness, the data and results for the above model were:
---- 52 PARAMETER p data points
x y
i1 0.806 0.173
i2 0.530 0.149
i3 0.648 0.692
i4 0.352 0.020
i5 0.431 0.554
i6 0.641 0.775
i7 0.235 0.781
i8 0.268 0.082
i9 0.973 0.114
i10 0.874 0.667
i11 0.756 0.968
i12 0.199 0.240
i13 0.220 0.261
i14 0.989 0.172
i15 0.066 0.930
i16 0.806 0.832
i17 0.105 0.029
i18 0.229 0.094
i19 0.130 0.903
i20 0.437 0.728
i21 0.248 0.575
i22 0.360 0.516
i23 0.710 0.746
i24 0.704 0.746
i25 0.185 0.936
i26 0.817 0.673
i27 0.463 0.578
i28 0.089 0.657
i29 0.973 0.691
i30 0.894 0.078
---- 52 VARIABLE x.L assignment variables (all nonzero entries equal 1.000; listed here as box: assigned points)
k1: i7, i15, i19, i21, i25, i28
k2: i4, i8, i12, i13, i17, i18
k3: i5, i20, i22, i27
k4: i1, i2, i9, i14, i30
k5: i3, i6, i10, i11, i16, i23, i24, i26, i29
---- 52 VARIABLE r.L rectangles
x y w h
k1 0.066 0.575 0.181 0.361
k2 0.105 0.020 0.247 0.241
k3 0.360 0.516 0.103 0.211
k4 0.530 0.078 0.459 0.095
k5 0.641 0.667 0.332 0.301
---- 52 VARIABLE z.L = 0.290 objective

One way of doing it, and I have absolutely no idea if it's any good:
Start with N zero-size boxes, one enclosing each of the N points.
Find the two points with the shortest distance that are in different boxes, and replace their two boxes with one enclosing box. Now you have one box less.
Repeat the previous step until you have K boxes.
And now that we have an algorithm that takes us closer to the goal with each iteration, we can use A* search to find the best solution, in case the straightforward greedy one isn't the best. The heuristic function is, of course, the total covered area.
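Here is a rough Python sketch of that greedy merging step (my own illustration of the idea above; the A* refinement is not included):

from itertools import combinations
from math import dist, inf

def greedy_boxes(points, k):
    # start with one zero-size box (cluster) per point
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the closest pair of points lying in different clusters
        best = (inf, None, None)
        for i, j in combinations(range(len(clusters)), 2):
            d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
            if d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # replace their two boxes with one enclosing box
        clusters[i] += clusters[j]
        del clusters[j]
    # return the bounding box (xmin, ymin, xmax, ymax) of each cluster
    return [(min(x for x, _ in c), min(y for _, y in c),
             max(x for x, _ in c), max(y for _, y in c)) for c in clusters]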

Why does coxph() combined with cluster() give much smaller standard errors than other methods to adjust for clustering (e.g. coxme() or frailty())?

I am working on a dataset to test the association between empirical antibiotics (variable emp; the antibiotics are cefuroxime or ceftriaxone, compared with a reference antibiotic) and 30-day mortality (variable mort30). The data come from patients admitted to 6 hospitals (variable site2) with a specific type of infection. Therefore, I would like to adjust for this clustering of patients at the hospital level.
First I did this using the coxme() function for mixed models. However, based on visual inspection of the Schoenfeld residuals there were violations of the proportional hazards assumption, so I tried adding a time transformation (tt) to the model. Unfortunately, coxme() does not offer the possibility of time transformations.
Therefore, I tried other options to adjust for the clustering, including coxph() combined with frailty() and with cluster(). Surprisingly, the standard errors I get using the cluster() option are much smaller than those from coxme() or frailty().
Does anyone know the explanation for this, and which option would provide the most reliable estimates?
1) Using coxme:
> uni.mort <- coxme(Surv(FUdur30, mort30num) ~ emp + (1 | site2), data = total.pop)
> summary(uni.mort)
Cox mixed-effects model fit by maximum likelihood
Data: total.pop
events, n = 58, 253
Iterations= 24 147
NULL Integrated Fitted
Log-likelihood -313.8427 -307.6543 -305.8967
Chisq df p AIC BIC
Integrated loglik 12.38 3.00 0.0061976 6.38 0.20
Penalized loglik 15.89 3.56 0.0021127 8.77 1.43
Model: Surv(FUdur30, mort30num) ~ emp + (1 | site2)
Fixed coefficients
coef exp(coef) se(coef) z p
empCefuroxime 0.5879058 1.800214 0.6070631 0.97 0.33
empCeftriaxone 1.3422317 3.827576 0.5231278 2.57 0.01
Random effects
Group Variable Std Dev Variance
site2 Intercept 0.2194737 0.0481687
> confint(uni.mort)
2.5 % 97.5 %
empCefuroxime -0.6019160 1.777728
empCeftriaxone 0.3169202 2.367543
2) Using frailty()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp + frailty(site2), data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp + frailty(site2),
data = total.pop)
n= 253, number of events= 58
coef se(coef) se2 Chisq DF p
empCefuroxime 0.6302 0.6023 0.6010 1.09 1.0 0.3000
empCeftriaxone 1.3559 0.5221 0.5219 6.75 1.0 0.0094
frailty(site2) 0.40 0.3 0.2900
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.878 0.5325 0.5768 6.114
empCeftriaxone 3.880 0.2577 1.3947 10.796
Iterations: 7 outer, 27 Newton-Raphson
Variance of random effect= 0.006858179 I-likelihood = -307.8
Degrees of freedom for terms= 2.0 0.3
Concordance= 0.655 (se = 0.035 )
Likelihood ratio test= 12.87 on 2.29 df, p=0.002
3) Using cluster()
uni.mort <- coxph(Surv(FUdur30, mort30num) ~ emp, cluster = site2, data = total.pop)
> summary(uni.mort)
Call:
coxph(formula = Surv(FUdur30, mort30num) ~ emp, data = total.pop,
cluster = site2)
n= 253, number of events= 58
coef exp(coef) se(coef) robust se z Pr(>|z|)
empCefuroxime 0.6405 1.8975 0.6009 0.3041 2.106 0.035209 *
empCeftriaxone 1.3594 3.8937 0.5218 0.3545 3.834 0.000126 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
empCefuroxime 1.897 0.5270 1.045 3.444
empCeftriaxone 3.894 0.2568 1.944 7.801
Concordance= 0.608 (se = 0.027 )
Likelihood ratio test= 12.08 on 2 df, p=0.002
Wald test = 15.38 on 2 df, p=5e-04
Score (logrank) test = 10.69 on 2 df, p=0.005, Robust = 5.99 p=0.05
(Note: the likelihood ratio and score tests assume independence of
observations within a cluster, the Wald and robust score tests do not).
>

Algorithm for optimal expected amount in a profit/loss game

I came upon the following question recently,
"You have a box which has G green and B blue coins. Pick a random coin, G gives a profit of +1 and blue a loss of -1. If you play optimally what is the expected profit."
I was thinking of using a brute force algorithm where I consider all possibilities of combinations of green and blue coins but I'm sure there must be a better solution for this (range of B and G was from 0 to 5000). Also what does playing optimally mean? Does it mean that if i pick all blue coins then I would continue playing till all green coins are also picked? If so then this means I shouldn't consider all possibilities of green and blue coins?
The "obvious" answer is to play whenever there's more green coins than blue coins. In fact, this is wrong. For example, if there's 999 green coins and 1000 blue coins, here's a strategy that takes an expected profit:
Take 2 coins
If GG -- stop with a profit of 2
If BG or GB -- stop with a profit of 0
If BB -- take all the remaining coins for a profit of -1
Since the first and last possibilities both occur with near 25% probability, your overall expectation is approximately 0.25*2 - 0.25*1 = 0.25
This is just a simple strategy in one extreme example that shows that the problem is not as simple as it first seems.
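To double-check that 999 green / 1000 blue example, here is a small exact computation (a Python snippet of my own, not part of the original answer):

from fractions import Fraction as F

G, B = 999, 1000
N = G + B
p_gg = F(G, N) * F(G - 1, N - 1)   # first two draws both green
p_bb = F(B, N) * F(B - 1, N - 1)   # first two draws both blue
# GG -> stop at +2, mixed -> stop at 0, BB -> take every remaining coin for G - B = -1
ev = 2 * p_gg + (G - B) * p_bb
print(ev, float(ev))               # 498/1999, about 0.249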
In general, the expectation with g green coins and b blue coins is given by a recurrence relation:
E(g, 0) = g
E(0, b) = 0
E(g, b) = max(0, g(E(g-1, b) + 1)/(b+g) + b(E(g, b-1) - 1)/(b+g))
The max in the final equation occurs because if continuing to play has negative expected value, then you're better off stopping.
These recurrence relations can be solved using dynamic programming in O(gb) time.
from fractions import Fraction as F

def gb(G, B):
    # E[g][b] = expected profit with g green and b blue coins left, playing optimally
    E = [[F(0, 1)] * (B + 1) for _ in range(G + 1)]
    for g in range(G + 1):
        E[g][0] = F(g, 1)      # only green coins left: take them all
    for b in range(1, B + 1):
        for g in range(1, G + 1):
            # draw once (green w.p. g/(g+b), blue w.p. b/(g+b)), or stop at 0 if that is better
            E[g][b] = max(0, (g * (E[g-1][b] + 1) + b * (E[g][b-1] - 1)) * F(1, b + g))
    for row in E:
        print(' '.join('%5.2f' % v for v in row))
    print()
    return E[G][B]

print(gb(8, 10))
Output:
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1.00 0.50 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2.00 1.33 0.67 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3.00 2.25 1.50 0.85 0.34 0.00 0.00 0.00 0.00 0.00 0.00
4.00 3.20 2.40 1.66 1.00 0.44 0.07 0.00 0.00 0.00 0.00
5.00 4.17 3.33 2.54 1.79 1.12 0.55 0.15 0.00 0.00 0.00
6.00 5.14 4.29 3.45 2.66 1.91 1.23 0.66 0.23 0.00 0.00
7.00 6.12 5.25 4.39 3.56 2.76 2.01 1.34 0.75 0.30 0.00
8.00 7.11 6.22 5.35 4.49 3.66 2.86 2.11 1.43 0.84 0.36
7793/21879
From this you can see that the expectation is positive when playing with 8 green and 10 blue coins (EV = 7793/21879 ≈ 0.36), and you even have a positive expectation with 2 green and 3 blue coins (EV = 0.2).
Simple and intuitive answer:
You should start off with an estimate of the total number of blue and green coins. After each pick you update this estimate. If at any point you estimate that there are more blue coins than green coins, you should stop.
Example:
You start and pick a coin. It's green, so you estimate that 100% of the coins are green. You pick a blue one, so you estimate that 50% of the coins are green. You pick another blue coin, so you estimate that 33% of the coins are green. At this point it isn't worth playing anymore, according to your estimate, so you stop.
This answer is wrong; see Paul Hankin's answer for counterexamples and a proper analysis. I leave this answer here as a learning example for all of us.
Assuming that your choice is only when to stop picking coins, you continue as long as G > B. That part is simple. If you start with G < B, then you never start drawing, and your gain is 0. For G = B, no strategy will get you a mathematical advantage; the gain there is also 0.
For the expected reward, take this in two steps:
(1) Expected value on any draw sequence. Do this recursively, figuring the chance of getting green or blue on the first draw, and then the expected values for the new state (G-1, B) or (G, B-1). You will quickly see that the expected value of any given draw number (such as all possibilities for the 3rd draw) is the same as the original.
Therefore, your expected value on any draw is e = (G-B) / (G+B). Your overall expected value is e * d, where d is the number of draws you choose.
(2) What is the expected number of draws? How many times do you expect to draw before G = B? I'll leave this as an exercise for the student, but note the previous idea of doing this recursively. You might find it easier to describe the state of the game as (extra, total), where extra = G-B and total = G+B.
Illustrative exercise: given G=4, B=2, what is the chance that you'll draw GG on the first two draws (and then stop the game)? What is the gain from that? How does that compare with the (4-2)/(4+2) advantage on each draw?

plot matrix with lines gnuplot

I'm making a chart, but I would like to use lines rather than points.
When I use the lines style, all the points are connected and the graph has a network appearance, which I don't want.
set grid
set ticslevel 0.1
set samples 51, 51
set isosamples 20, 20
set border 1+2+4+8
unset key
splot 'matrix.dat' matrix
Part of the data for the matrix plot:
0.261 0.665 0.225 0.382 0.255 0.574 0.356
0.338 0.845 0.0363 0.167 0.727 0.0805 0.764
0.225 0.196 0.107 0.153 0.347 0.338 0.168
0.157 0.443 0.0671 0.135 0.312 0.408 0.362
0.151 0.281 0.0572 0.103 0.309 0.49 0.242
0.12 0.336 0.0604 0.173 0.19 0.395 0.153
0.119 0.173 0.0336 0.145 0.156 0.219 0.177
0.123 0.0452 0.0165 0.149 0.0932 0.0663 0.133
0.123 0.0741 0.00373 0.136 0.0346 0.485 0.131
0.111 0.241 0.0124 0.105 0.0127 1.01 0.122
0.096 0.475 0.0194 0.0569 0.0284 1.67 0.102
0.0777 0.773 0.0175 0.00929 0.0375 2.42 0.0831
0.059 1.11 0.0123 0.0322 0.0408 3.23 0.0635
0.0438 1.48 6.44E-4 0.0659 0.0265 4.07 0.0445
0.0349 1.92 0.0192 0.078 0.00585 4.92 0.0254
0.0392 2.42 0.0446 0.0632 0.0306 5.73 0.00774
0.0518 2.97 0.0745 0.031 0.0729 6.46 0.00716
This cannot be done automatically. You must determine the rows and columns of your matrix. First, to get the number of rows, use
stats 'matrix.dat' using 1 nooutput
rows = STATS_records
Then, for the number of columns, use
stats 'matrix.dat' matrix nooutput
cols = STATS_records/rows
And now plot each column of the matrix as a separate line:
unset key
splot for [i=0:cols-1] 'matrix.dat' matrix every ::i::i lt 1 with lines
The result (with gnuplot 4.6.4) is a plot of the matrix columns as separate lines rather than a mesh.
I think Christoph's solution is just what you need, but to make the point clear: providing the matrix and using splot matrix alone will just generate a mesh.
So you will need to specify the lines with complete X, Y and Z vectors and then plot them using splot with lines/linespoints. I'm adding an example below in case it is helpful for anyone else.
You arrange your data file as follows:
10 1 0.261 2 0.665 3 0.225 4 0.382 5 0.255 6 0.574 7 0.356
20 1 0.338 2 0.845 3 0.0363 4 0.167 5 0.727 6 0.0805 7 0.764
30 1 0.225 2 0.196 3 0.107 4 0.153 5 0.347 6 0.338 7 0.168
40 1 0.157 2 0.443 3 0.0671 4 0.135 5 0.312 6 0.408 7 0.362
And then plot as follows:
set grid
set ticslevel 0.1
#set samples 51, 51
#set isosamples 20, 20
#set border 1+2+4+8
unset key
splot 'matrix.dat' using 1:2:3 with linespoints, \
'matrix.dat' using 1:4:5 with linespoints, \
'matrix.dat' using 1:6:7 with linespoints, \
'matrix.dat' using 1:8:9 with linespoints, \
'matrix.dat' using 1:10:11 with linespoints, \
'matrix.dat' using 1:12:13 with linespoints, \
'matrix.dat' using 1:14:15 with linespoints
This produces the desired plot with separate lines instead of a mesh.

Column-wise sort or top-n of matrix

I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:
> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92
But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.
Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice by experienced users.
If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method with on the order of 1 million elements.
Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.
Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.
I have put together a quick testing framework for the solutions proposed so far.
library(rbenchmark)

# plain sort with the "quick" method
sort.q <- function(m) {
  sort(m, method='quick')
}

# top-TOP values via partial sorting
sort.p <- function(m) {
  mm <- sort(m, partial=TOP)[1:TOP]
  sort(mm)
}

# sort the whole matrix at once using the column-offset trick from the question
sort.all.g <- function(f) {
  function(m) {
    o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
    matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
  }
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)

# column-wise sort via apply
apply.sort.g <- function(f) {
  function(m) {
    apply(m, 2, f)[1:TOP,]
  }
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)

# benchmark over a grid of matrix sizes and top-n values
bb <- NULL
SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5
for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
  for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
    print(c(SIZE, TOP))
    TOP <- min(TOP, SIZE)
    m <- matrix(runif(SIZE*SIZE), floor(SIZE))
    if (SIZE < 1000) {
      # check that all variants agree on small inputs
      mr <- apply.sort(m)
      stopifnot(apply.sort.q(m) == mr)
      stopifnot(apply.sort.p(m) == mr)
      stopifnot(sort.all(m) == mr)
      stopifnot(sort.all.q(m) == mr)
    }
    b <- benchmark(apply.sort(m),
                   apply.sort.q(m),
                   apply.sort.p(m),
                   sort.all(m),
                   sort.all.q(m),
                   columns= c("test", "elapsed", "relative",
                              "user.self", "sys.self"),
                   replications=1,
                   order=NULL)
    b$SIZE <- SIZE
    b$TOP <- TOP
    b$test <- factor(x=b$test, levels=b$test)
    bb <- rbind(bb, b)
  }
}
ftable(xtabs(user.self ~ SIZE+test+TOP, bb))
The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.
Please feel free to add your own sorting routine :-)
TOP 10 31 100 316
SIZE test
31 apply.sort(m) 0.004 0.012 0.000 0.000
apply.sort.q(m) 0.008 0.016 0.000 0.000
apply.sort.p(m) 0.008 0.020 0.000 0.000
sort.all(m) 0.000 0.008 0.000 0.000
sort.all.q(m) 0.000 0.004 0.000 0.000
100 apply.sort(m) 0.012 0.016 0.028 0.000
apply.sort.q(m) 0.016 0.016 0.036 0.000
apply.sort.p(m) 0.020 0.020 0.040 0.000
sort.all(m) 0.000 0.004 0.008 0.000
sort.all.q(m) 0.004 0.004 0.004 0.000
316 apply.sort(m) 0.060 0.060 0.056 0.060
apply.sort.q(m) 0.064 0.060 0.060 0.072
apply.sort.p(m) 0.064 0.068 0.108 0.076
sort.all(m) 0.016 0.016 0.020 0.024
sort.all.q(m) 0.020 0.016 0.024 0.024
1000 apply.sort(m) 0.356 0.276 0.276 0.292
apply.sort.q(m) 0.348 0.316 0.288 0.296
apply.sort.p(m) 0.256 0.264 0.276 0.320
sort.all(m) 0.268 0.244 0.213 0.244
sort.all.q(m) 0.260 0.232 0.200 0.208
3162 apply.sort(m) 1.997 1.948 2.012 2.108
apply.sort.q(m) 1.916 1.880 1.892 1.901
apply.sort.p(m) 1.300 1.316 1.376 1.544
sort.all(m) 2.424 2.452 2.432 2.480
sort.all.q(m) 2.188 2.184 2.265 2.244
10000 apply.sort(m) 18.193 18.466 18.781 18.965
apply.sort.q(m) 15.837 15.861 15.977 16.313
apply.sort.p(m) 9.005 9.108 9.304 9.925
sort.all(m) 26.030 25.710 25.722 26.686
sort.all.q(m) 23.341 23.645 24.010 24.073
31622 apply.sort(m) 201.265 197.568 196.181 196.104
apply.sort.q(m) 163.190 160.810 158.757 160.050
apply.sort.p(m) 82.337 81.305 80.641 82.490
sort.all(m) 296.239 288.810 289.303 288.954
sort.all.q(m) 260.872 249.984 254.867 252.087
Does
apply(m, 2, sort)
do the job? :)
Or for top-10, say, use:
apply(m, 2, function(x) {sort(x, dec=TRUE)[1:10]})
Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.
R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine
set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)
system.time(sm <- apply(m, 2, sort))
user system elapsed
2.62 0.14 2.79
The first 5 columns:
sm[1:15, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
[2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
[3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
[4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
[5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
[6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
[7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
[8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
[9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03
They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).
topn = 30
sdmult = max(1, qnorm(1 - (topn / length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec), sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult * calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
  thresh = thresh + calcsd
  subs = which(vec > thresh)
}
while (length(subs) < topn) {
  thresh = thresh - calcsd
  subs = which(vec > thresh)
}
topvals = sort(vec[subs], dec=TRUE)[1:topn]
The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.
Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.
Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed, we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.
My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.
What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!
Cheers :)

How to use both binary and continuous features in the k-Nearest-Neighbor algorithm?

My feature vector has both continuous (or widely ranging) and binary components. If I simply use Euclidean distance, the continuous components will have a much greater impact:
With symmetric vs. asymmetric represented as 0 and 1, and some less important ratio ranging from 0 to 100, changing from symmetric to asymmetric has a tiny distance impact compared to changing the ratio by 25.
I can add more weight to the symmetry (by making it 0 or 100 for example), but is there a better way to do this?
You could try using the normalized Euclidean distance, described, for example, at the end of the first section here.
It simply scales every feature (continuous or discrete) by its standard deviation. This is more robust than, say, scaling by the range (max-min) as suggested by another poster.
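For example, a small numpy sketch of that idea (toy data of my own; each feature is divided by its standard deviation before distances are computed):

import numpy as np

# toy features: column 0 is the binary symmetric/asymmetric flag,
# column 1 is the less important ratio on a 0-100 scale
X = np.array([[0.0, 12.0],
              [1.0, 37.0],
              [0.0, 88.0],
              [1.0, 40.0],
              [1.0, 65.0]])

X_scaled = X / X.std(axis=0)       # "normalized" Euclidean: scale each feature by its sd

# after scaling, flipping the binary flag and a typical change in the ratio
# contribute on comparable scales to the distance
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))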
If I correctly understand your question, normalizing (aka 'rescaling') each dimension or column in the data set is the conventional technique for dealing with over-weighting dimensions, e.g.,
ev_scaled = (ev_raw - ev_min) / (ev_max - ev_min)
In R, for instance, you can write this function:
ev_scaled = function(x) {
  (x - min(x)) / (max(x) - min(x))
}
which works like this:
# generate some data:
# v1, v2 are two expectation variables in the same dataset
# but have very different 'scale':
> v1 = seq(100, 550, 50)
> v1
[1] 100 150 200 250 300 350 400 450 500 550
> v2 = sort(sample(seq(.1, 20, .1), 10))
> v2
[1] 0.2 3.5 5.1 5.6 8.0 8.3 9.9 11.3 15.5 19.4
> mean(v1)
[1] 325
> mean(v2)
[1] 8.68
# now normalize v1 & v2 using the function above:
> v1_scaled = ev_scaled(v1)
> v1_scaled
[1] 0.000 0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1.000
> v2_scaled = ev_scaled(v2)
> v2_scaled
[1] 0.000 0.172 0.255 0.281 0.406 0.422 0.505 0.578 0.797 1.000
> mean(v1_scaled)
[1] 0.5
> mean(v2_scaled)
[1] 0.442
> range(v1_scaled)
[1] 0 1
> range(v2_scaled)
[1] 0 1
You can also try Mahalanobis distance instead of Euclidean.
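For instance, a small scipy illustration of that suggestion (toy data of my own; the inverse covariance is estimated from the sample itself):

import numpy as np
from scipy.spatial.distance import mahalanobis

X = np.array([[0.0, 12.0],
              [1.0, 37.0],
              [0.0, 88.0],
              [1.0, 40.0],
              [1.0, 65.0]])

VI = np.linalg.inv(np.cov(X, rowvar=False))    # inverse covariance matrix
print(mahalanobis(X[0], X[1], VI))             # distance that accounts for scale and correlation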
