python average of random sample many times - random

I am working with pandas and I wish to sample 2 stocks from each trade date and store, as part of the dataset, the average "Stock_Change" and the average "Vol_Change" for the day in question based on the sample taken (in this case, 2 stocks per day). The actual data is much larger, spanning years and hundreds of names. My real sample will be 100 names; I just use 2 for the purposes of this question.
Sample data set:
In [3]:
df
Out[3]:
Date Symbol Stock_Change Vol_Change
0 1/1/2008 A -0.05 0.07
1 1/1/2008 B -0.06 0.17
2 1/1/2008 C -0.05 0.07
3 1/1/2008 D 0.05 0.13
4 1/1/2008 E -0.03 -0.10
5 1/2/2008 A 0.03 -0.17
6 1/2/2008 B 0.08 0.34
7 1/2/2008 C 0.03 0.17
8 1/2/2008 D 0.06 0.24
9 1/2/2008 E 0.02 0.16
10 1/3/2008 A 0.02 0.05
11 1/3/2008 B 0.01 0.39
12 1/3/2008 C 0.05 -0.17
13 1/3/2008 D -0.01 0.37
14 1/3/2008 E -0.06 0.23
15 1/4/2008 A 0.03 0.31
16 1/4/2008 B -0.07 0.16
17 1/4/2008 C -0.06 0.29
18 1/4/2008 D 0.00 0.09
19 1/4/2008 E 0.00 -0.02
20 1/5/2008 A 0.04 -0.04
21 1/5/2008 B -0.06 0.16
22 1/5/2008 C -0.08 0.07
23 1/5/2008 D 0.09 0.16
24 1/5/2008 E 0.06 0.18
25 1/6/2008 A 0.00 0.22
26 1/6/2008 B 0.08 -0.13
27 1/6/2008 C 0.07 0.18
28 1/6/2008 D 0.03 0.32
29 1/6/2008 E 0.01 0.29
30 1/7/2008 A -0.08 -0.10
31 1/7/2008 B -0.09 0.23
32 1/7/2008 C -0.09 0.26
33 1/7/2008 D 0.02 -0.01
34 1/7/2008 E -0.05 0.11
35 1/8/2008 A -0.02 0.36
36 1/8/2008 B 0.03 0.17
37 1/8/2008 C 0.00 -0.05
38 1/8/2008 D 0.08 -0.13
39 1/8/2008 E 0.07 0.18
One other point: the samples cannot contain the same security more than once (sample without replacement). My guess is that this is a good R question, but I don't know the first thing about R.
I have no idea how to even start on this question.
thanks in advance for any help.
Edit by OP
I tried this, but I don't seem to be able to get it to work on the grouped dataframe (grouped by Symbol and Date):
In [35]:
import numpy as np
import pandas as pd
from random import sample
# create random index
rindex = np.array(sample(range(len(df)), 10))
# get 10 random rows from df
dfr = df.ix[rindex]
In [36]:
dfr
Out[36]:
Date Symbol Stock_Change Vol_Change
6 1/2/2008 B 8% 34%
1 1/2/2008 B -6% 17%
37 1/3/2008 C 0% -5%
25 1/1/2008 A 0% 22%
3 1/4/2008 D 5% 13%
12 1/3/2008 C 5% -17%
10 1/1/2008 A 2% 5%
2 1/3/2008 C -5% 7%
26 1/2/2008 B 8% -13%
17 1/3/2008 C -6% 29%
OP Edit #2
As I read the question I realize that I may not have been very clear. What I want to do is sample the data many times (call it X) for each day and, in essence, end up with X times the number of dates as my new dataset. This may not look like it makes sense with the data I am showing, but my actual data has 500 names and 2 years (2 x 365 = 730) of dates, and I wish to sample 50 random names for each day, for a total of 50 x 730 = 36,500 data points.
The first attempt gave this:
In [10]:
# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, sample_size=3):
    symbols = group.Symbol.values
    symbols_selected = np.random.choice(symbols, size=sample_size, replace=False)
    return group.loc[group.Symbol.isin(symbols_selected)]
df.groupby(['Date']).apply(get_subsample).reset_index(drop=True)
Out[10]:
Date Symbol Stock_Change Vol_Change
0 1/1/2008 A -5% 7%
1 1/1/2008 A 3% -17%
2 1/1/2008 A 2% 5%
3 1/1/2008 A 3% 31%
4 1/1/2008 A 4% -4%
5 1/1/2008 A 0% 22%
6 1/1/2008 A -8% -10%
7 1/1/2008 A -2% 36%
8 1/2/2008 B -6% 17%
9 1/2/2008 B 8% 34%
10 1/2/2008 B 1% 39%
11 1/2/2008 B -7% 16%
12 1/2/2008 B -6% 16%
13 1/2/2008 B 8% -13%
14 1/2/2008 B -9% 23%
15 1/2/2008 B 3% 17%
16 1/3/2008 C -5% 7%
17 1/3/2008 C 3% 17%
18 1/3/2008 C 5% -17%
19 1/3/2008 C -6% 29%
20 1/3/2008 C -8% 7%
21 1/3/2008 C 7% 18%
22 1/3/2008 C -9% 26%
23 1/3/2008 C 0% -5%
24 1/4/2008 D 5% 13%
25 1/4/2008 D 6% 24%
26 1/4/2008 D -1% 37%
27 1/4/2008 D 0% 9%
28 1/4/2008 D 9% 16%
29 1/4/2008 D 3% 32%
30 1/4/2008 D 2% -1%
31 1/4/2008 D 8% -13%
32 1/5/2008 E -3% -10%
33 1/5/2008 E 2% 16%
34 1/5/2008 E -6% 23%
35 1/5/2008 E 0% -2%
36 1/5/2008 E 6% 18%
37 1/5/2008 E 1% 29%
38 1/5/2008 E -5% 11%
39 1/5/2008 E 7% 18%

import pandas as pd
import numpy as np
# replicate your data structure
# ==============================
np.random.seed(0)
dates = pd.date_range('2008-01-01', periods=100, freq='B')
symbols = 'A B C D E'.split()
multi_index = pd.MultiIndex.from_product([dates, symbols], names=['Date', 'Symbol'])
stock_change = np.random.randn(500)
vol_change = np.random.randn(500)
df = pd.DataFrame({'Stock_Change': stock_change, 'Vol_Change': vol_change}, index=multi_index).reset_index()
# do sampling: get a random subsample with size 3 out of 5 symbols for each date
# ==============================
def get_subsample(group, X=100, sample_size=3):
    frame = pd.DataFrame(columns=['sample_{}'.format(x) for x in range(1, X + 1)])
    for col in frame.columns.values:
        frame[col] = group.loc[group.Symbol.isin(np.random.choice(symbols, size=sample_size, replace=False)), ['Stock_Change', 'Vol_Change']].mean()
    return frame.mean(axis=1)
result = df.groupby(['Date']).apply(get_subsample)
Out[169]:
Stock_Change Vol_Change
Date
2008-01-01 1.3937 0.2005
2008-01-02 0.0406 -0.7280
2008-01-03 0.6073 -0.2699
2008-01-04 0.2310 0.7415
2008-01-07 0.0718 -0.7269
2008-01-08 0.3808 -0.0584
2008-01-09 -0.5595 -0.2968
2008-01-10 0.3919 -0.2741
2008-01-11 -0.4856 0.0386
2008-01-14 -0.4700 -0.4090
... ... ...
2008-05-06 0.1510 0.1628
2008-05-07 -0.1452 0.2824
2008-05-08 -0.4626 0.2173
2008-05-09 -0.2984 0.6324
2008-05-12 -0.3817 0.7698
2008-05-13 0.5796 -0.4318
2008-05-14 0.2875 0.0067
2008-05-15 0.0269 0.3559
2008-05-16 0.7374 0.1065
2008-05-19 -0.4428 -0.2014
[100 rows x 2 columns]
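A follow-up note for newer pandas (1.1+): DataFrameGroupBy.sample can draw the per-date subsample directly, which makes the "X samples per day" variant from the second edit short to write. The helper below is only a sketch (its name and parameters are mine, not from the thread); it keeps one mean row per (date, draw), giving X times the number of dates rows in total:

import numpy as np
import pandas as pd

def repeated_sample_means(df, n_symbols=2, n_repeats=100, seed=0):
    # expects Date, Symbol, Stock_Change and Vol_Change columns, as in the question
    rng = np.random.RandomState(seed)
    pieces = []
    for draw in range(n_repeats):
        # n_symbols distinct symbols per Date (sampled without replacement within each day)
        sampled = df.groupby('Date').sample(n=n_symbols, replace=False, random_state=rng)
        means = sampled.groupby('Date')[['Stock_Change', 'Vol_Change']].mean()
        means['Draw'] = draw
        pieces.append(means)
    # one row per (Date, draw): n_repeats * number-of-dates rows in total
    return pd.concat(pieces).reset_index()

samples = repeated_sample_means(df, n_symbols=2, n_repeats=100)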

Related

F1 score - Sklearn

What is the F1-score of the model in the following? I used the scikit-learn package.
print(classification_report(y_true, y_pred, target_names=target_names))
              precision    recall  f1-score   support

     class 0       0.50      1.00      0.67         1
     class 1       0.00      0.00      0.00         1
     class 2       1.00      0.67      0.80         3

    accuracy                           0.60         5
   macro avg       0.50      0.56      0.49         5
weighted avg       0.70      0.60      0.61         5
This article explains it pretty well
Basically it's
F1 = 2 * precision * recall / (precision + recall)
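To tie the formula back to the report above, here is a quick check in Python using the rounded precision/recall values it prints:

# precision/recall per class, read off the report (class 0, class 1, class 2)
precisions = [0.50, 0.00, 1.00]
recalls    = [1.00, 0.00, 0.67]

f1s = [0.0 if p + r == 0 else 2 * p * r / (p + r) for p, r in zip(precisions, recalls)]
print([round(f, 2) for f in f1s])      # [0.67, 0.0, 0.8]  -> the f1-score column
print(round(sum(f1s) / len(f1s), 2))   # 0.49              -> the "macro avg" f1-score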

How do I rank this data by percentage and total possible?

Given this set in Excel:
Group Enrolled Eligible Percent
A 0 76 0%
B 10 92 11%
C 0 38 0%
D 2 50 4%
E 0 111 0%
F 4 86 5%
G 3 97 3%
H 4 178 2%
I 2 77 3%
J 0 64 0%
K 0 37 0%
L 11 54 20%
Is there a way to sort (for charting) to achieve the following order?
Group Enrolled Eligible Percent
L 11 54 20%
B 10 92 11%
F 4 86 5%
D 2 50 4%
G 3 97 3%
I 2 77 3%
H 4 178 2%
K 0 37 0%
C 0 38 0%
J 0 64 0%
A 0 76 0%
E 0 111 0%
My goal is to rank/visualize using these criteria:
Percent desc (when Enrolled > 0)
Eligible asc (when Enrolled = 0)
After writing this question, the answer looks obvious: sort by Percent descending, then Eligible ascending (when Percent or Enrolled = 0). But I feel like I'm missing an obvious method/term to achieve the results I'm looking for.
Thanks.
With Google Sheets, QUERY is the easy way. Goal 1 (rows with Enrolled > 0):
=QUERY(A1:D13,"Select A,B,C,D Where B>0 Order By D desc,1")
Goal 2 (rows with Enrolled = 0):
=QUERY(A1:D13,"Select A,B,C,D Where B=0 Order By C ,1")
The term you're missing is SORT.
Here's the formula you are looking for:
=SORT(A1:D13,4,0,3,1)
Note: the values should be formatted as numbers (not text).
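For anyone doing this outside a spreadsheet, the same two-key sort can be sketched in pandas (illustrative only, not part of the original answers; column names and values taken from the table above):

import pandas as pd

df = pd.DataFrame({
    'Group':    list('ABCDEFGHIJKL'),
    'Enrolled': [0, 10, 0, 2, 0, 4, 3, 4, 2, 0, 0, 11],
    'Eligible': [76, 92, 38, 50, 111, 86, 97, 178, 77, 64, 37, 54],
})
df['Percent'] = df['Enrolled'] / df['Eligible']

# Percent descending pushes every Enrolled = 0 row (Percent = 0) to the bottom,
# and Eligible ascending then orders that bottom block.
ranked = df.sort_values(['Percent', 'Eligible'], ascending=[False, True])
print(ranked)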

Storing multiway data from a for loop

I have the following three-way data (I X J X K) for my polymerization system: Z (23x4x3)
Z(:,:,1) = [0 6.70 NaN NaN
0.14 5.79 27212.52 17735.36
0.26 5.04 26545.98 17279.95
0.35 4.43 26007.91 16902.22
0.43 3.92 25567.61 16586.18
0.49 3.50 25202.48 16319.65
0.54 3.15 24898.99 16094.87
0.59 2.85 24648.07 15906.19
0.63 2.60 24441.06 15748.28
0.66 2.38 24270.42 15616.51
0.68 2.20 24130.05 15506.90
0.71 2.05 24014.78 15415.87
0.73 1.92 23921.74 15341.59
0.74 1.80 23847.57 15281.63
0.76 1.70 23789.06 15233.54
0.77 1.61 23744.29 15195.99
0.78 1.54 23710.83 15167.01
0.79 1.47 23687.05 15145.38
0.80 1.41 23671.47 15129.72
0.81 1.36 23662.99 15119.14
0.81 1.31 23660.58 15112.77
0.82 1.27 23663.32 15109.86
0.82 1.23 23670.44 15109.74];
Z(:,:,2) = [0 6.70 NaN NaN
0.17 5.63 24826.03 16191.26
0.30 4.80 24198.87 15757.83
0.40 4.14 23720.27 15417.52
0.47 3.61 23347.38 15147.16
0.54 3.19 23058.01 14933.52
0.59 2.85 22836.18 14766.65
0.63 2.57 22667.24 14637.38
0.66 2.34 22539.27 14537.68
0.69 2.15 22445.60 14463.08
0.71 2.00 22379.90 14409.04
0.73 1.87 22336.70 14371.44
0.75 1.76 22311.74 14347.04
0.76 1.66 22301.57 14333.13
0.77 1.58 22303.32 14327.31
0.78 1.51 22314.83 14327.75
0.79 1.45 22334.27 14333.00
0.80 1.40 22360.11 14341.81
0.81 1.36 22391.09 14353.22
0.81 1.32 22426.11 14366.39
0.82 1.28 22464.22 14380.67
0.82 1.25 22504.61 14395.53
0.82 1.23 22546.61 14410.57];
Z(:,:,3) = [0 6.70 NaN NaN
0.19 5.45 22687.71 14805.97
0.34 4.53 22119.24 14408.55
0.44 3.84 21720.37 14120.95
0.52 3.31 21437.68 13912.54
0.58 2.90 21244.60 13766.39
0.63 2.59 21117.60 13667.05
0.66 2.34 21040.03 13602.91
0.69 2.14 21000.70 13565.85
0.72 1.98 20990.89 13549.24
0.73 1.85 21003.53 13547.54
0.75 1.74 21033.19 13556.41
0.76 1.65 21075.85 13572.54
0.77 1.58 21128.37 13593.46
0.78 1.52 21188.17 13617.25
0.79 1.47 21253.16 13642.44
0.80 1.42 21321.69 13668.02
0.80 1.39 21392.34 13693.18
0.81 1.36 21463.83 13717.38
0.81 1.33 21535.27 13740.33
0.81 1.31 21605.87 13761.81
0.82 1.29 21674.84 13781.70
0.82 1.27 21741.68 13799.97];
where I is time (y-axis), J is variables (x-axis) and K is batch (z-axis). However, since I want to use this data for PCA and PLS analysis, I must rearrange it from (time x variables x batch) to (batch x variables x time), meaning the new Z is 3 x 4 x 23.
To do this, I can extract the values at the first time point from every batch (across the K dimension) and rearrange them as one slab of the new array using the following command:
T1 = squeeze(Z(1,:,:))'
Thus, I use a for loop to get the results for all 23 slabs, but I can't (don't know how to) store the results in the workspace, except for the last one. The code I used:
[I,J,K] = size(Z);
SLAB = zeros(K,J,I); % preallocating the matrix; where I=23, J=4, K=3
for t = 1 : I % here I = 23
    slab = squeeze(Z(t,:,:))'; % removing semicolon here I can see the wanted results in command window
    SLAB = slab;
end
Hope anyone here can help me with this.
Thank you.
I found the solution: since I know the result will have size (K,J,I), I must index into a preallocated array of the same shape inside the for loop:
[I,J,K] = size(Z);
SLAB = zeros(K,J,I); % preallocating the matrix; where I=23, J=4, K=3
for t = 1 : I % here I = 23
    SLAB(:,:,t) = squeeze(Z(t,:,:))';
end
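For what it's worth, MATLAB's built-in permute should do the whole reshuffle in one call, SLAB = permute(Z, [3 2 1]), with no loop. The equivalent in Python/NumPy, sketched here purely for comparison, is a transpose over the axes:

import numpy as np

Z = np.random.rand(23, 4, 3)        # stand-in for the 23 x 4 x 3 array above
SLAB = np.transpose(Z, (2, 1, 0))   # shape (3, 4, 23); SLAB[k, j, i] == Z[i, j, k]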

Why is it slower to prespecify type in a data.frame?

I was preallocating a big data.frame to fill in later, which I normally do with NA's like this:
n <- 1e6
a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
and I wondered if it would make things any faster later if I specified data types up front, so I tested
f1 <- function() {
    a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
    a$c2 <- 1:n
    a$c3 <- sample(LETTERS, size = n, replace = TRUE)
}
f2 <- function() {
    b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
    b$c2 <- 1:n
    b$c3 <- sample(LETTERS, size = n, replace = TRUE)
}
> system.time(f1())
user system elapsed
0.219 0.042 0.260
> system.time(f2())
user system elapsed
1.018 0.052 1.072
So it was actually much slower! I tried again with a factor column too, and the difference was closer to 2x than 4x, but I'm curious about why this is slower, and I wonder if it is ever appropriate to initialize with data types rather than NAs.
--
Edit: Flodel pointed out that 1:n is integer, not numeric. With that correction the runtimes are nearly identical; of course it hurts to incorrectly specify a data type and change it later!
Assigning any data to a large data frame takes time. If you're going to assign your data all at once in a vector (as you should), it's much faster not to assign the c2 and c3 columns in the original definition at all. For example:
f3 <- function() {
    c <- data.frame(c1 = 1:n)
    c$c2 <- 1:n
    c$c3 <- sample(LETTERS, size = n, replace = TRUE)
}
print(system.time(f1()))
# user system elapsed
# 0.194 0.023 0.216
print(system.time(f2()))
# user system elapsed
# 0.336 0.037 0.374
print(system.time(f3()))
# user system elapsed
# 0.057 0.007 0.063
The reason for this is that when you preassign, a column of length n is created, e.g.
str(data.frame(x=1:2, y = character(2)))
## 'data.frame': 2 obs. of 2 variables:
## $ x: int 1 2
## $ y: Factor w/ 1 level "": 1 1
Note that the character column has been converted to a factor, which will be slower than setting stringsAsFactors = FALSE.
@David Robinson's answer is correct, but I will add some profiling here to show how to investigate why some things are slower than you might expect.
The best thing to do here is some profiling to see what is being called; that can give a clue as to why some calls are slower than others.
library(profr)
profr(f1())
## Read 9 items
## f level time start end leaf source
## 8 f1 1 0.16 0.00 0.16 FALSE <NA>
## 9 data.frame 2 0.04 0.00 0.04 TRUE base
## 10 $<- 2 0.02 0.04 0.06 FALSE base
## 11 sample 2 0.04 0.06 0.10 TRUE base
## 12 $<- 2 0.06 0.10 0.16 FALSE base
## 13 $<-.data.frame 3 0.12 0.04 0.16 TRUE base
profr(f2())
## Read 15 items
## f level time start end leaf source
## 8 f2 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.frame 2 0.12 0.00 0.12 TRUE base
## 10 : 2 0.02 0.12 0.14 TRUE base
## 11 $<- 2 0.02 0.18 0.20 FALSE base
## 12 sample 2 0.02 0.20 0.22 TRUE base
## 13 $<- 2 0.06 0.22 0.28 FALSE base
## 14 as.data.frame 3 0.08 0.04 0.12 FALSE base
## 15 $<-.data.frame 3 0.10 0.18 0.28 TRUE base
## 16 as.data.frame.character 4 0.08 0.04 0.12 FALSE base
## 17 factor 5 0.08 0.04 0.12 FALSE base
## 18 unique 6 0.06 0.04 0.10 FALSE base
## 19 match 6 0.02 0.10 0.12 TRUE base
## 20 unique.default 7 0.06 0.04 0.10 TRUE base
profr(f3())
## Read 4 items
## f level time start end leaf source
## 8 f3 1 0.06 0.00 0.06 FALSE <NA>
## 9 $<- 2 0.02 0.00 0.02 FALSE base
## 10 sample 2 0.04 0.02 0.06 TRUE base
## 11 $<-.data.frame 3 0.02 0.00 0.02 TRUE base
Clearly f2() is slower than f1(), as there are a lot of character-to-factor conversions, recreating of levels, etc.
For efficient use of memory I would suggest the data.table package. This avoids (as much as possible) the internal copying of objects:
library(data.table)
f4 <- function() {
    f <- data.table(c1 = 1:n)
    f[, c2 := 1L:n]
    f[, c3 := sample(LETTERS, size = n, replace = TRUE)]
}
system.time(f1())
## user system elapsed
## 0.15 0.02 0.18
system.time(f2())
## user system elapsed
## 0.19 0.00 0.19
system.time(f3())
## user system elapsed
## 0.09 0.00 0.09
system.time(f4())
## user system elapsed
## 0.04 0.00 0.04
Note that using data.table you could add two columns at once (and by reference):
# Thanks to @Thell for pointing this out.
f[,`:=`(c('c2','c3'), list(1L:n, sample(LETTERS,n, T))), with = F]
EDIT -- functions that will return the required object (well picked up, @Dwin):
n <- 1e7
f1 <- function() {
    a <- data.frame(c1 = 1:n, c2 = NA, c3 = NA)
    a$c2 <- 1:n
    a$c3 <- sample(LETTERS, size = n, replace = TRUE)
    a
}
f2 <- function() {
    b <- data.frame(c1 = 1:n, c2 = numeric(n), c3 = character(n))
    b$c2 <- 1:n
    b$c3 <- sample(LETTERS, size = n, replace = TRUE)
    b
}
f3 <- function() {
    c <- data.frame(c1 = 1:n)
    c$c2 <- 1:n
    c$c3 <- sample(LETTERS, size = n, replace = TRUE)
    c
}
f4 <- function() {
    f <- data.table(c1 = 1:n)
    f[, `:=`(c2, 1L:n)]
    f[, `:=`(c3, sample(LETTERS, size = n, replace = TRUE))]
}
system.time(f1())
## user system elapsed
## 1.62 0.34 2.13
system.time(f2())
## user system elapsed
## 2.14 0.66 2.79
system.time(f3())
## user system elapsed
## 0.78 0.25 1.03
system.time(f4())
## user system elapsed
## 0.37 0.08 0.46
profr(f1())
## Read 105 items
## f level time start end leaf source
## 8 f1 1 2.08 0.00 2.08 FALSE <NA>
## 9 data.frame 2 0.66 0.00 0.66 FALSE base
## 10 : 2 0.02 0.66 0.68 TRUE base
## 11 $<- 2 0.32 0.84 1.16 FALSE base
## 12 sample 2 0.40 1.16 1.56 TRUE base
## 13 $<- 2 0.32 1.76 2.08 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.12 0.54 0.66 TRUE base
## 17 $<-.data.frame 3 1.24 0.84 2.08 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f2())
## Read 145 items
## f level time start end leaf source
## 8 f2 1 2.88 0.00 2.88 FALSE <NA>
## 9 data.frame 2 1.40 0.00 1.40 FALSE base
## 10 : 2 0.04 1.40 1.44 TRUE base
## 11 $<- 2 0.36 1.64 2.00 FALSE base
## 12 sample 2 0.40 2.00 2.40 TRUE base
## 13 $<- 2 0.36 2.52 2.88 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 numeric 3 0.06 0.02 0.08 TRUE base
## 16 character 3 0.04 0.08 0.12 TRUE base
## 17 as.data.frame 3 1.06 0.12 1.18 FALSE base
## 18 unlist 3 0.20 1.20 1.40 TRUE base
## 19 $<-.data.frame 3 1.24 1.64 2.88 TRUE base
## 20 as.data.frame.integer 4 0.04 0.12 0.16 TRUE base
## 21 as.data.frame.numeric 4 0.16 0.18 0.34 TRUE base
## 22 as.data.frame.character 4 0.78 0.40 1.18 FALSE base
## 23 factor 5 0.74 0.40 1.14 FALSE base
## 24 as.data.frame.vector 5 0.04 1.14 1.18 TRUE base
## 25 unique 6 0.38 0.40 0.78 FALSE base
## 26 match 6 0.32 0.78 1.10 TRUE base
## 27 unique.default 7 0.38 0.40 0.78 TRUE base
profr(f3())
## Read 37 items
## f level time start end leaf source
## 8 f3 1 0.72 0.00 0.72 FALSE <NA>
## 9 data.frame 2 0.10 0.00 0.10 FALSE base
## 10 : 2 0.02 0.10 0.12 TRUE base
## 11 $<- 2 0.08 0.14 0.22 FALSE base
## 12 sample 2 0.26 0.22 0.48 TRUE base
## 13 $<- 2 0.16 0.56 0.72 FALSE base
## 14 : 3 0.02 0.00 0.02 TRUE base
## 15 as.data.frame 3 0.04 0.02 0.06 FALSE base
## 16 unlist 3 0.02 0.08 0.10 TRUE base
## 17 $<-.data.frame 3 0.58 0.14 0.72 TRUE base
## 18 as.data.frame.integer 4 0.04 0.02 0.06 TRUE base
profr(f4())
## Read 15 items
## f level time start end leaf source
## 8 f4 1 0.28 0.00 0.28 FALSE <NA>
## 9 data.table 2 0.02 0.00 0.02 FALSE data.table
## 10 [ 2 0.26 0.02 0.28 FALSE base
## 11 : 3 0.02 0.00 0.02 TRUE base
## 12 [.data.table 3 0.26 0.02 0.28 FALSE <NA>
## 13 eval 4 0.26 0.02 0.28 FALSE base
## 14 eval 5 0.26 0.02 0.28 FALSE base
## 15 : 6 0.02 0.02 0.04 TRUE base
## 16 sample 6 0.24 0.04 0.28 TRUE base

Column-wise sort or top-n of matrix

I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:
> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92
But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.
Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice from experienced users.
If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method when there are on the order of 1 million elements.
Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.
Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.
I have put together a quick testing framework for the solutions proposed so far.
library(rbenchmark)
sort.q <- function(m) {
    sort(m, method='quick')
}
sort.p <- function(m) {
    mm <- sort(m, partial=TOP)[1:TOP]
    sort(mm)
}
sort.all.g <- function(f) {
    function(m) {
        o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
        matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
    }
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)
apply.sort.g <- function(f) {
    function(m) {
        apply(m, 2, f)[1:TOP,]
    }
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)
bb <- NULL
SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5
for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
    for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
        print(c(SIZE, TOP))
        TOP <- min(TOP, SIZE)
        m <- matrix(runif(SIZE*SIZE), floor(SIZE))
        if (SIZE < 1000) {
            mr <- apply.sort(m)
            stopifnot(apply.sort.q(m) == mr)
            stopifnot(apply.sort.p(m) == mr)
            stopifnot(sort.all(m) == mr)
            stopifnot(sort.all.q(m) == mr)
        }
        b <- benchmark(apply.sort(m),
                       apply.sort.q(m),
                       apply.sort.p(m),
                       sort.all(m),
                       sort.all.q(m),
                       columns= c("test", "elapsed", "relative",
                                  "user.self", "sys.self"),
                       replications=1,
                       order=NULL)
        b$SIZE <- SIZE
        b$TOP <- TOP
        b$test <- factor(x=b$test, levels=b$test)
        bb <- rbind(bb, b)
    }
}
ftable(xtabs(user.self ~ SIZE+test+TOP, bb))
The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.
Please feel free to add your own sorting routine :-)
TOP 10 31 100 316
SIZE test
31 apply.sort(m) 0.004 0.012 0.000 0.000
apply.sort.q(m) 0.008 0.016 0.000 0.000
apply.sort.p(m) 0.008 0.020 0.000 0.000
sort.all(m) 0.000 0.008 0.000 0.000
sort.all.q(m) 0.000 0.004 0.000 0.000
100 apply.sort(m) 0.012 0.016 0.028 0.000
apply.sort.q(m) 0.016 0.016 0.036 0.000
apply.sort.p(m) 0.020 0.020 0.040 0.000
sort.all(m) 0.000 0.004 0.008 0.000
sort.all.q(m) 0.004 0.004 0.004 0.000
316 apply.sort(m) 0.060 0.060 0.056 0.060
apply.sort.q(m) 0.064 0.060 0.060 0.072
apply.sort.p(m) 0.064 0.068 0.108 0.076
sort.all(m) 0.016 0.016 0.020 0.024
sort.all.q(m) 0.020 0.016 0.024 0.024
1000 apply.sort(m) 0.356 0.276 0.276 0.292
apply.sort.q(m) 0.348 0.316 0.288 0.296
apply.sort.p(m) 0.256 0.264 0.276 0.320
sort.all(m) 0.268 0.244 0.213 0.244
sort.all.q(m) 0.260 0.232 0.200 0.208
3162 apply.sort(m) 1.997 1.948 2.012 2.108
apply.sort.q(m) 1.916 1.880 1.892 1.901
apply.sort.p(m) 1.300 1.316 1.376 1.544
sort.all(m) 2.424 2.452 2.432 2.480
sort.all.q(m) 2.188 2.184 2.265 2.244
10000 apply.sort(m) 18.193 18.466 18.781 18.965
apply.sort.q(m) 15.837 15.861 15.977 16.313
apply.sort.p(m) 9.005 9.108 9.304 9.925
sort.all(m) 26.030 25.710 25.722 26.686
sort.all.q(m) 23.341 23.645 24.010 24.073
31622 apply.sort(m) 201.265 197.568 196.181 196.104
apply.sort.q(m) 163.190 160.810 158.757 160.050
apply.sort.p(m) 82.337 81.305 80.641 82.490
sort.all(m) 296.239 288.810 289.303 288.954
sort.all.q(m) 260.872 249.984 254.867 252.087
Does
apply(m, 2, sort)
do the job? :)
Or for top-10, say, use:
apply(m, 2 ,function(x) {sort(x,dec=TRUE)[1:10]})
Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.
R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine
set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)
system.time(sm <- apply(m, 2, sort))
user system elapsed
2.62 0.14 2.79
The first 15 rows of the first 5 columns:
sm[1:15, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
[2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
[3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
[4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
[5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
[6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
[7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
[8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
[9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03
They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).
topn = 30
sdmult = max(1,qnorm(1-(topn/length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
    thresh = thresh + calcsd
    subs = which(vec > thresh)
}
while (length(subs) < topn) {
    thresh = thresh - calcsd
    subs = which(vec > thresh)
}
topvals = sort(vec[subs],dec=TRUE)[1:topn]
The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.
Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.
Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed, we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.
My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.
What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!
Cheers :)
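For comparison outside R: NumPy's partition primitive implements the same "top n per column without a full sort" idea discussed in this thread. A rough sketch, not benchmarked here:

import numpy as np

n_top = 30
m = np.random.rand(10_000, 1_000)                   # stand-in matrix

# partition so the n_top largest values of each column end up in the last n_top rows (unordered)
part = np.partition(m, m.shape[0] - n_top, axis=0)
# then fully sort just those n_top rows, largest first
topn = np.sort(part[-n_top:, :], axis=0)[::-1]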
