Given a parameter k for the number of boxes and n data points, is there any way I can find or approximate k axis-aligned bounding rectangles that enclose all the points while keeping the sum of the areas of the rectangles to a minimum?
One way is to directly write this as a mathematical optimization problem.
A high-level optimization model can look as follows:
We first define the decision variables:
r(k,c)   coordinates of the k-th box (c = {x, y, w, h});
         continuous variables with appropriate bounds
x(i,k)   1 if point i is assigned to box k, 0 otherwise;
         binary variable
Then the 2D model can look like:
minimize    sum(k, r(k,w)*r(k,h))                       (sum of areas)
subject to  sum(k, x(i,k)) = 1     for all i            (assign each point to exactly one box)
            x(i,k) = 1 ==> point i is inside box k      (can be formulated as linear big-M
                                                         or indicator constraints)
For testing, I generated 30 points in the [0,1]x[0,1] unit box.
Using |k| = 5 boxes, we obtain the solution reported below.
This can be directly generalized to more dimensions (plotting becomes more difficult).
This is essentially a model that combines an assignment problem (for the x variables) with a location problem (for the r variables). It probably only works for relatively small data sets. For this example, I used Gurobi (non-convex MIQP), and this is the proven globally optimal solution. Note that even for higher-dimensional problems, we can reformulate things as a non-convex MIQCP (i.e. still solvable by Gurobi).
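The actual model code was not posted, but as a rough illustration, here is a minimal gurobipy sketch of the same formulation (my own version, not the original: box coordinates stored as lower-left corner plus width/height, containment enforced via indicator constraints, points assumed to lie in the unit box):

import gurobipy as gp
from gurobipy import GRB
import random

random.seed(0)
n, K = 30, 5
pts = [(random.random(), random.random()) for _ in range(n)]   # 30 random points in [0,1]x[0,1]

m = gp.Model("min-area-boxes")
x  = m.addVars(n, K, vtype=GRB.BINARY, name="x")   # x[i,k] = 1 if point i is assigned to box k
lo = m.addVars(K, 2, lb=0.0, ub=1.0, name="lo")    # lower-left corner of box k: (x, y)
wh = m.addVars(K, 2, lb=0.0, ub=1.0, name="wh")    # width and height of box k

# each point is assigned to exactly one box
m.addConstrs((x.sum(i, "*") == 1 for i in range(n)), name="assign")

# x[i,k] = 1  ==>  point i lies inside box k (indicator constraints; big-M would also work)
for i, (px, py) in enumerate(pts):
    for k in range(K):
        m.addGenConstrIndicator(x[i, k], True, lo[k, 0] <= px)
        m.addGenConstrIndicator(x[i, k], True, lo[k, 0] + wh[k, 0] >= px)
        m.addGenConstrIndicator(x[i, k], True, lo[k, 1] <= py)
        m.addGenConstrIndicator(x[i, k], True, lo[k, 1] + wh[k, 1] >= py)

# minimize the total area: a bilinear (non-convex) objective
m.setObjective(gp.quicksum(wh[k, 0] * wh[k, 1] for k in range(K)), GRB.MINIMIZE)
m.Params.NonConvex = 2
m.optimize()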
For completeness, the data and results for the above model were:
---- 52 PARAMETER p data points
x y
i1 0.806 0.173
i2 0.530 0.149
i3 0.648 0.692
i4 0.352 0.020
i5 0.431 0.554
i6 0.641 0.775
i7 0.235 0.781
i8 0.268 0.082
i9 0.973 0.114
i10 0.874 0.667
i11 0.756 0.968
i12 0.199 0.240
i13 0.220 0.261
i14 0.989 0.172
i15 0.066 0.930
i16 0.806 0.832
i17 0.105 0.029
i18 0.229 0.094
i19 0.130 0.903
i20 0.437 0.728
i21 0.248 0.575
i22 0.360 0.516
i23 0.710 0.746
i24 0.704 0.746
i25 0.185 0.936
i26 0.817 0.673
i27 0.463 0.578
i28 0.089 0.657
i29 0.973 0.691
i30 0.894 0.078
---- 52 VARIABLE x.L assignment variables
          k1      k2      k3      k4      k5
i1                             1.000
i2                             1.000
i3                                     1.000
i4             1.000
i5                     1.000
i6                                     1.000
i7     1.000
i8             1.000
i9                             1.000
i10                                    1.000
i11                                    1.000
i12            1.000
i13            1.000
i14                            1.000
i15    1.000
i16                                    1.000
i17            1.000
i18            1.000
i19    1.000
i20                    1.000
i21    1.000
i22                    1.000
i23                                    1.000
i24                                    1.000
i25    1.000
i26                                    1.000
i27                    1.000
i28    1.000
i29                                    1.000
i30                            1.000
---- 52 VARIABLE r.L rectangles
x y w h
k1 0.066 0.575 0.181 0.361
k2 0.105 0.020 0.247 0.241
k3 0.360 0.516 0.103 0.211
k4 0.530 0.078 0.459 0.095
k5 0.641 0.667 0.332 0.301
---- 52 VARIABLE z.L = 0.290 objective
One way of doing it, and I have absolutely no idea if it's any good:
Start with N zero-size boxes enclosing N points.
Find the two closest points that lie in different boxes, and replace their two boxes with one enclosing box. Now you have one box fewer.
Repeat previous step until you have K boxes.
And now that we have an algorithm that takes us closer to the goal with each iteration, we can use A* search to find the best solution, in case the straightforward greedy one isn't the best. The heuristic function is of course the total covered area.
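The greedy merge was only described in words; a rough Python sketch of it (my own illustration, with boxes stored as [xmin, ymin, xmax, ymax] and a simple quadratic closest-pair search) could look like this:

from itertools import combinations

def merge_boxes(points, k):
    # start with one zero-size box per point
    boxes = [[px, py, px, py] for px, py in points]
    owner = list(range(len(points)))            # owner[i] = index of the box holding point i

    while len(set(owner)) > k:
        # closest pair of points currently held by different boxes
        i, j = min((pair for pair in combinations(range(len(points)), 2)
                    if owner[pair[0]] != owner[pair[1]]),
                   key=lambda p: (points[p[0]][0] - points[p[1]][0]) ** 2 +
                                 (points[p[0]][1] - points[p[1]][1]) ** 2)
        a, b = owner[i], owner[j]
        # replace box a by the bounding box of a and b, retire box b
        ba, bb = boxes[a], boxes[b]
        boxes[a] = [min(ba[0], bb[0]), min(ba[1], bb[1]),
                    max(ba[2], bb[2]), max(ba[3], bb[3])]
        owner = [a if o == b else o for o in owner]

    return [boxes[o] for o in sorted(set(owner))]

The A* refinement on top of this is left out; the greedy pass alone already produces a feasible set of k boxes to start from.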
I'm making a chart but I would like to use lines rather than points.
Using the lines style, all the points get connected and the graph takes on a network appearance, which I don't want.
set grid
set ticslevel 0.1
set samples 51, 51
set isosamples 20, 20
set border 1+2+4+8
unset key
splot 'matrix.dat' matrix
Part of the data for the matrix plot:
0.261 0.665 0.225 0.382 0.255 0.574 0.356
0.338 0.845 0.0363 0.167 0.727 0.0805 0.764
0.225 0.196 0.107 0.153 0.347 0.338 0.168
0.157 0.443 0.0671 0.135 0.312 0.408 0.362
0.151 0.281 0.0572 0.103 0.309 0.49 0.242
0.12 0.336 0.0604 0.173 0.19 0.395 0.153
0.119 0.173 0.0336 0.145 0.156 0.219 0.177
0.123 0.0452 0.0165 0.149 0.0932 0.0663 0.133
0.123 0.0741 0.00373 0.136 0.0346 0.485 0.131
0.111 0.241 0.0124 0.105 0.0127 1.01 0.122
0.096 0.475 0.0194 0.0569 0.0284 1.67 0.102
0.0777 0.773 0.0175 0.00929 0.0375 2.42 0.0831
0.059 1.11 0.0123 0.0322 0.0408 3.23 0.0635
0.0438 1.48 6.44E-4 0.0659 0.0265 4.07 0.0445
0.0349 1.92 0.0192 0.078 0.00585 4.92 0.0254
0.0392 2.42 0.0446 0.0632 0.0306 5.73 0.00774
0.0518 2.97 0.0745 0.031 0.0729 6.46 0.00716
This cannot be done automatically. You must determine the rows and columns of your matrix. First, to get the number of rows, use
stats 'matrix.dat' using 1 nooutput
rows = STATS_records
Then, for the number of columns, use
stats 'matrix.dat' matrix nooutput
cols = STATS_records/rows
And now plot every line
unset key
splot for [i=0:cols-1] 'matrix.dat' matrix every ::i::i lt 1 with lines
The result (with gnuplot 4.6.4) is:
I think Christoph's solution is just what you need, but to make the point clear: providing the matrix and plotting it with splot 'matrix.dat' matrix alone will just generate a mesh.
So you will need to specify the lines with complete X, Y and Z vectors and then plot them using splot with lines/linespoints. I'm adding an example below in case it may be helpful for anyone else.
You arrange your data file as follows:
10 1 0.261 2 0.665 3 0.225 4 0.382 5 0.255 6 0.574 7 0.356
20 1 0.338 2 0.845 3 0.0363 4 0.167 5 0.727 6 0.0805 7 0.764
30 1 0.225 2 0.196 3 0.107 4 0.153 5 0.347 6 0.338 7 0.168
40 1 0.157 2 0.443 3 0.0671 4 0.135 5 0.312 6 0.408 7 0.362
And then plot as follows:
set grid
set ticslevel 0.1
#set samples 51, 51
#set isosamples 20, 20
#set border 1+2+4+8
unset key
splot 'matrix.dat' using 1:2:3 with linespoints, \
'matrix.dat' using 1:4:5 with linespoints, \
'matrix.dat' using 1:6:7 with linespoints, \
'matrix.dat' using 1:8:9 with linespoints, \
'matrix.dat' using 1:10:11 with linespoints, \
'matrix.dat' using 1:12:13 with linespoints, \
'matrix.dat' using 1:14:15 with linespoints
With the resulting plot:
I have been told that in order to calculate the expected residence time for a set of states I can use the following approach:
Construct the Markov chain's transition matrix, with entry (i, j) being the probability of a transition from state i to state j.
Transpose the matrix, so that each column contains the inbound probabilities for that state.
Invert the diagonal so that a value p becomes (1-p).
Add a row at the bottom, containing 1's
Construct a right-hand-side vector of 0's, with the last element 1.
Solve the system. The resulting vector should contain the expected residence time for each state.
Let me give an example:
I have the initial Markov Chain:
0.25 ; 0.25 ; 0.25 ; 0.25
0.00 ; 0.50 ; 0.50 ; 0.00
0.33 ; 0.33 ; 0.33 ; 0.00
0.00 ; 0.00 ; 0.50 ; 0.50
After steps 1-3 it looks like this:
0.75 ; 0.00 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.67 ; 0.50
0.25 ; 0.00 ; 0.00 ; 0.50
I add the last line:
0.75 ; 0.00 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.33 ; 0.00
0.25 ; 0.50 ; 0.67 ; 0.50
0.25 ; 0.00 ; 0.00 ; 0.50
1.00 ; 1.00 ; 1.00 ; 1.00
The right-hand-side vector will be:
0 ; 0 ; 0 ; 0 ; 1
The added row of 1's should enforce that the solution sums to 1. However, my solution is the set:
{0.42; 0.84; -0.79; 0.32}
Which sums to 0.79, so clearly something is wrong.
I also note that the expected residence time of state 3 is negative, which in my mind should not be possible.
I have it implemented in Java and I use Commons.Math to handle the matrix calculations. I have tried the various algorithms described in the documentation, but I get the same result.
I have also tried to substitute one of the rows with the line of 1's in order to make the matrix square. When I do that, I get the following set of solutions:
{0.79; 0.79; -1.79; 1.2}
Even though these values sum to 1, they must still be wrong, as the probabilities should be in the range 0..1 AND sum to 1.
Is this an entirely wrong approach to the problem? Where am I off?
Unfortunately I am not very mathematical, but I hope I have given enough information for you to see the problem.
I found the answer:
Let all probabilities p except those on the diagonal become -p in step 3:
0.75 ; -0.00 ; -0.33 ; -0.00
-0.25 ; 0.50 ; -0.33 ; -0.00
-0.25 ; -0.50 ; 0.67 ; -0.50
-0.25 ; -0.00 ; -0.00 ; 0.50
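As an illustration (not the original Java code), the corrected construction can be checked with a few lines of numpy: steps 1-3 with the sign fix simply give I - P^T, to which the row of ones is appended before solving the over-determined system by least squares:

import numpy as np

P = np.array([[0.25, 0.25, 0.25, 0.25],
              [0.00, 0.50, 0.50, 0.00],
              [0.33, 0.33, 0.33, 0.00],
              [0.00, 0.00, 0.50, 0.50]])

A = np.eye(4) - P.T                       # steps 1-3 with the sign fix
A = np.vstack([A, np.ones(4)])            # step 4: append the normalization row of 1's
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # step 5: right-hand side

pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                                 # nonnegative and sums to 1 (up to rounding)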
I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:
> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92
But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.
Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice by experienced users.
If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method for inputs on the order of 1 million elements.
Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.
Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.
I have put together a quick testing framework for the solutions proposed so far.
library(rbenchmark)
# full vectorized sort of the whole matrix, quicksort method
sort.q <- function(m) {
  sort(m, method='quick')
}
# partial sort: the TOP smallest values are guaranteed to land in the first
# TOP positions (unordered), then fully sort just those
sort.p <- function(m) {
  mm <- sort(m, partial=TOP)[1:TOP]
  sort(mm)
}
# offset trick from the question: add j to every element of column j (values
# are in [0,1)), sort everything at once with f, then remove the offset again
# (uses the globals SIZE and TOP)
sort.all.g <- function(f) {
  function(m) {
    o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
    matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
  }
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)
# column-wise sort via apply, keeping only the first TOP rows
apply.sort.g <- function(f) {
  function(m) {
    apply(m, 2, f)[1:TOP,]
  }
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)
bb <- NULL
SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5
for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
print(c(SIZE, TOP))
TOP <- min(TOP, SIZE)
m <- matrix(runif(SIZE*SIZE), floor(SIZE))
if (SIZE < 1000) {
mr <- apply.sort(m)
stopifnot(apply.sort.q(m) == mr)
stopifnot(apply.sort.p(m) == mr)
stopifnot(sort.all(m) == mr)
stopifnot(sort.all.q(m) == mr)
}
b <- benchmark(apply.sort(m),
apply.sort.q(m),
apply.sort.p(m),
sort.all(m),
sort.all.q(m),
columns= c("test", "elapsed", "relative",
"user.self", "sys.self"),
replications=1,
order=NULL)
b$SIZE <- SIZE
b$TOP <- TOP
b$test <- factor(x=b$test, levels=b$test)
bb <- rbind(bb, b)
}
}
ftable(xtabs(user.self ~ SIZE+test+TOP, bb))
The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.
Please feel free to add your own sorting routine :-)
TOP 10 31 100 316
SIZE test
31 apply.sort(m) 0.004 0.012 0.000 0.000
apply.sort.q(m) 0.008 0.016 0.000 0.000
apply.sort.p(m) 0.008 0.020 0.000 0.000
sort.all(m) 0.000 0.008 0.000 0.000
sort.all.q(m) 0.000 0.004 0.000 0.000
100 apply.sort(m) 0.012 0.016 0.028 0.000
apply.sort.q(m) 0.016 0.016 0.036 0.000
apply.sort.p(m) 0.020 0.020 0.040 0.000
sort.all(m) 0.000 0.004 0.008 0.000
sort.all.q(m) 0.004 0.004 0.004 0.000
316 apply.sort(m) 0.060 0.060 0.056 0.060
apply.sort.q(m) 0.064 0.060 0.060 0.072
apply.sort.p(m) 0.064 0.068 0.108 0.076
sort.all(m) 0.016 0.016 0.020 0.024
sort.all.q(m) 0.020 0.016 0.024 0.024
1000 apply.sort(m) 0.356 0.276 0.276 0.292
apply.sort.q(m) 0.348 0.316 0.288 0.296
apply.sort.p(m) 0.256 0.264 0.276 0.320
sort.all(m) 0.268 0.244 0.213 0.244
sort.all.q(m) 0.260 0.232 0.200 0.208
3162 apply.sort(m) 1.997 1.948 2.012 2.108
apply.sort.q(m) 1.916 1.880 1.892 1.901
apply.sort.p(m) 1.300 1.316 1.376 1.544
sort.all(m) 2.424 2.452 2.432 2.480
sort.all.q(m) 2.188 2.184 2.265 2.244
10000 apply.sort(m) 18.193 18.466 18.781 18.965
apply.sort.q(m) 15.837 15.861 15.977 16.313
apply.sort.p(m) 9.005 9.108 9.304 9.925
sort.all(m) 26.030 25.710 25.722 26.686
sort.all.q(m) 23.341 23.645 24.010 24.073
31622 apply.sort(m) 201.265 197.568 196.181 196.104
apply.sort.q(m) 163.190 160.810 158.757 160.050
apply.sort.p(m) 82.337 81.305 80.641 82.490
sort.all(m) 296.239 288.810 289.303 288.954
sort.all.q(m) 260.872 249.984 254.867 252.087
Does
apply(m, 2, sort)
do the job? :)
Or for top-10, say, use:
apply(m, 2 ,function(x) {sort(x,dec=TRUE)[1:10]})
Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.
R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine
set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)
system.time(sm <- apply(m, 2, sort))
user system elapsed
2.62 0.14 2.79
The first 5 columns:
sm[1:15, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
[2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
[3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
[4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
[5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
[6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
[7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
[8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
[9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03
They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).
topn = 30                                    # number of top values wanted
sdmult = max(1,qnorm(1-(topn/length(vec))))  # rough z-score where the top n should start
sdmin = 1e-5                                 # guard against a zero standard deviation
acceptmult = 10                              # accept a candidate set of up to 10*topn values
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd              # initial threshold: mean + z*sd
subs = which(vec > thresh)
# too many candidates: raise the threshold in steps of one sd
while (length(subs) > topn * acceptmult) {
  thresh = thresh + calcsd
  subs = which(vec > thresh)
}
# too few candidates: lower the threshold until at least topn values remain
while (length(subs) < topn) {
  thresh = thresh - calcsd
  subs = which(vec > thresh)
}
# sort only the small candidate set and keep the top n
topvals = sort(vec[subs],dec=TRUE)[1:topn]
The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.
Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.
Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed, we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.
My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.
What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!
Cheers :)