I have the below dummy dataframe:
import pandas
tab_size = 300000
tab_de_merde = [[i*1, 0, i*3, [i*7%3, i*11%3] ] for i in range(tab_size)]
colnames = ['Id', "Date", "Account","Value"]
indexnames = ['Id']
df = pandas.DataFrame(tab_de_merde, columns = colnames ).set_index(indexnames)
And I want to check if the column "Value" contains a 0.
I've tried 3 different solution and I was wondering if the third one (Python Vectorization) was correctly implemented since it doesn't seem to fasten the whole code.
%timeit df[[(0 in x) for x in df['Value'].values]]
#108 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df['Value'].apply(lambda x: 0 in x)]
#86.2 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def list_contains_element(np_array): return [(0 in x) for x in np_array]
%timeit df[list_contains_element(df['Value'].values)]
#106 ms ± 807 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I would be very glad if someone could help me understand better how to be faster with vector manipulation.
There are accessors such as h2o.logloss and h2o.residual_deviance but I want the penalized likelihood which is the value optimized during training. Is there a way to access this value without re-computing it by hand?
Yes you can get this from the scoring history table h2o.scoreHistory(model) which will give you something like:
Scoring History:
timestamp duration iterations negative_log_likelihood objective
1 2019-03-19 11:58:57 0.000 sec 0 509.21546 1.34004
2 2019-03-19 11:58:58 0.053 sec 1 489.50869 1.28818
3 2019-03-19 11:58:58 0.073 sec 2 488.06524 1.28438
4 2019-03-19 11:58:58 0.090 sec 3 487.83225 1.28377
5 2019-03-19 11:58:58 0.106 sec 4 487.79622 1.28367
6 2019-03-19 11:58:58 0.116 sec 5 487.78870 1.28365
7 2019-03-19 11:58:58 0.126 sec 6 487.78701 1.28365
8 2019-03-19 11:58:58 0.136 sec 7 487.78659 1.28365
Which gives you the negative log likelihood and the penalized negative log likelihood - which is in the objective column.
so you would just need to grab the last iteration to get the value of interest, which you can do with:
score_table = h2o.scoreHistory(model)
score_table[length(score_table),'objective']
where model is your H2O-3 model object.
Scala 2.11.8
I'm measuring iteration through flattened and non-flattened iterator. I wrote the following benchmark:
#State(Scope.Benchmark)
class SerializeBenchmark
var list = List(
List("test", 12, 34, 56),
List("test-test-test", 123, 444, 0),
List("test-test-test-tes", 145, 443, 4333),
List("testdsfg-test-test-tes", 3145, 435, 333),
List("test-tessdfgsdt-tessdfgt-tes", 1455, 43, 333),
List("tesewrt-test-tessdgdsft-tes", 13345, 4533, 3222333),
List("ewrtes6yhgfrtyt-test-test-tes", 122245, 433444, 322233),
List("tserfest-test-testtryfgd-tes", 143345, 43, 3122233),
List("test-reteytest-test-tes", 1121145, 4343, 3331212),
List("test-test-ertyeu6test-tes", 14115, 4343, 33433),
List("test-lknlkkn;lkntest-ertyeu6test-tes", 98141115, 4343, 33433),
List("tkknknest-test-ertyeu6test-tes", 914111215, 488343, 33433),
List("test-test-ertyeu6test-tes", 1411125, 437743, 93433),
List("test-test-ertyeu6testo;kn;lkn;lk-tes", 14111215, 5409343, 39823),
List("telnlkkn;lnih98st-test-ertyeu6test-tes", 1557215, 498343, 3377433)
)
#Benchmark
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#BenchmarkMode(Array(Mode.AverageTime))
def flattenerd(bh: Blackhole): Any = {
list.iterator.flatten.foreach(bh.consume)
}
#Benchmark
#OutputTimeUnit(TimeUnit.NANOSECONDS)
#BenchmarkMode(Array(Mode.AverageTime))
def raw(bh: Blackhole): Any = {
list.iterator.foreach(_.foreach(bh.consume))
}
}
After running these benchmarks several times I got the following results:
Benchmark Mode Cnt Score Error Units
SerializeBenchmark.flattenerd avgt 5 10311,373 ± 1189,448 ns/op
SerializeBenchmark.raw avgt 5 3463,902 ± 141,145 ns/op
Almost 3 times difference in performance. And the larger I make the source list the bigger performance difference. Why?
I expected some performance difference but not 3 times.
I re-ran your test with a bit more iterations running under the hs_gc profile.
These are the results:
[info] Benchmark Mode Cnt Score Error Units
[info] IteratorFlatten.flattenerd avgt 50 0.708 â–’ 0.120 us/op
[info] IteratorFlatten.flattenerd:â•–sun.gc.collector.0.invocations avgt 50 8.840 â–’ 2.259 ?
[info] IteratorFlatten.raw avgt 50 0.367 â–’ 0.014 us/op
[info] IteratorFlatten.raw:â•–sun.gc.collector.0.invocations avgt 50 0 ?
IteratorFlatten.flattenerd had an average of 8 GC cycles during the test runs, where raw had 0. This means that because of the noise generated by the allocation by FlattenOps (the wrapper class and it's method, particularly hasNext which allocates an iterator per list), which is what is needed in order to provide the flatten method on Iterator, we suffer in running time.
If I re-run the test and give it a minimum heap size of 2G, the results get closer:
[info] Benchmark Mode Cnt Score Error Units
[info] IteratorFlatten.flattenerd avgt 50 0.615 â–’ 0.041 us/op
[info] IteratorFlatten.raw avgt 50 0.434 â–’ 0.064 us/op
The gist of it is, the more you allocate, the more work the GC has to do, more pauses, slower execution.
Note that these kind of micro benchmarks are very fragile and may yield different results. Make sure you measure enough allocations for the stats to become significant.
I use OpenJDK JMH 0.9.3 and sometimes i get a result log file like this one below.
What means (*interrupt*) here?
Forking 1 times using command: [/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/...]
# Run progress: 53,85% complete, ETA 00:01:26
# Fork: 1 of 1
# Warmup Iteration 1: 19950765,000 us/op
# Warmup Iteration 2: (*interrupt*) (*interrupt*) (*interrupt*) 18107134,000 us/op
# Warmup Iteration 3: (*interrupt*) 14439157,500 us/op
Iteration 1: 13571806,000 us/op
Iteration 2: 7484946,500 us/op
Iteration 3: (*interrupt*) 12386565,000 us/op
Iteration 4: (*interrupt*) 7245477,500 us/op
Iteration 5: (*interrupt*) 9047236,000 us/op
Result: 9947206,200 ±(99.9%) 11103651,130 us/op [Average]
Statistics: (min, avg, max) = (7245477,500, 9947206,200, 13571806,000), stdev = 2883582,937
Confidence interval (99.9%): [-1156444,930, 21050857,330]
That means your workload was interrupted by JMH, quite possibly because it overrun the iteration time. It prints "(interrupt)" to let you know the score was obtained with active intervention of JMH, and may be unreliable.
You can annotate your benchmark with #Timeout(time = 1, timeUnit = TimeUnit.HOURS) annotation and it will know that it will take longer. By default when you run some batch Measurement (without time limit set), it will default to 10 minutes time limit.
I need to sort a matrix so that all elements stay in their columns and each column is in ascending order. Is there a vectorized column-wise sort for a matrix or a data frame in R? (My matrix is all-positive and bounded by B, so I can add j*B to each cell in column j and do a regular one-dimensional sort:
> set.seed(100523); m <- matrix(round(runif(30),2), nrow=6); m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.47 0.32 0.29 0.54 0.38
[2,] 0.38 0.91 0.76 0.43 0.92
[3,] 0.71 0.32 0.48 0.16 0.85
[4,] 0.88 0.83 0.61 0.95 0.72
[5,] 0.16 0.57 0.70 0.82 0.05
[6,] 0.77 0.03 0.75 0.26 0.05
> offset <- rep(seq_len(5), rep(6, 5)); offset
[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
> m <- matrix(sort(m + offset), nrow=nrow(m)) - offset; m
[,1] [,2] [,3] [,4] [,5]
[1,] 0.16 0.03 0.29 0.16 0.05
[2,] 0.38 0.32 0.48 0.26 0.05
[3,] 0.47 0.32 0.61 0.43 0.38
[4,] 0.71 0.57 0.70 0.54 0.72
[5,] 0.77 0.83 0.75 0.82 0.85
[6,] 0.88 0.91 0.76 0.95 0.92
But is there something more beautiful already included?) Otherwise, what would be the fastest way if my matrix has around 1M (10M, 100M) entries (roughly a square matrix)? I'm worried about the performance penalty of apply and friends.
Actually, I don't need "sort", just "top n", with n being around 30 or 100, say. I am thinking about using apply and the partial parameter of sort, but I wonder if this is cheaper than just doing a vectorized sort. So, before doing benchmarks on my own, I'd like to ask for advice by experienced users.
If you want to use sort, ?sort indicates that method = "quick" can be twice as fast as the default method with on the order of 1 million elements.
Start with apply(m, 2, sort, method = "quick") and see if that provides sufficient speed.
Do note the comments on this in ?sort though; ties are sorted in a non-stable manner.
I have put down a quick testing framework for the solutions proposed so far.
library(rbenchmark)
sort.q <- function(m) {
sort(m, method='quick')
}
sort.p <- function(m) {
mm <- sort(m, partial=TOP)[1:TOP]
sort(mm)
}
sort.all.g <- function(f) {
function(m) {
o <- matrix(rep(seq_len(SIZE), rep(SIZE, SIZE)), nrow=SIZE)
matrix(f(m+o), nrow=SIZE)[1:TOP,]-o[1:TOP,]
}
}
sort.all <- sort.all.g(sort)
sort.all.q <- sort.all.g(sort.q)
apply.sort.g <- function(f) {
function(m) {
apply(m, 2, f)[1:TOP,]
}
}
apply.sort <- apply.sort.g(sort)
apply.sort.p <- apply.sort.g(sort.p)
apply.sort.q <- apply.sort.g(sort.q)
bb <- NULL
SIZE_LIMITS <- 3:9
TOP_LIMITS <- 2:5
for (SIZE in floor(sqrt(10)^SIZE_LIMITS)) {
for (TOP in floor(sqrt(10)^TOP_LIMITS)) {
print(c(SIZE, TOP))
TOP <- min(TOP, SIZE)
m <- matrix(runif(SIZE*SIZE), floor(SIZE))
if (SIZE < 1000) {
mr <- apply.sort(m)
stopifnot(apply.sort.q(m) == mr)
stopifnot(apply.sort.p(m) == mr)
stopifnot(sort.all(m) == mr)
stopifnot(sort.all.q(m) == mr)
}
b <- benchmark(apply.sort(m),
apply.sort.q(m),
apply.sort.p(m),
sort.all(m),
sort.all.q(m),
columns= c("test", "elapsed", "relative",
"user.self", "sys.self"),
replications=1,
order=NULL)
b$SIZE <- SIZE
b$TOP <- TOP
b$test <- factor(x=b$test, levels=b$test)
bb <- rbind(bb, b)
}
}
ftable(xtabs(user.self ~ SIZE+test+TOP, bb))
The results so far indicate that for all but the biggest matrices, apply really hurts performance unless doing a "top n". For "small" matrices < 1e6, just sorting the whole thing without apply is competitive. For "huge" matrices, sorting the whole array becomes slower than apply. Using partial works best for "huge" matrices and is only a slight loss for "small" matrices.
Please feel free to add your own sorting routine :-)
TOP 10 31 100 316
SIZE test
31 apply.sort(m) 0.004 0.012 0.000 0.000
apply.sort.q(m) 0.008 0.016 0.000 0.000
apply.sort.p(m) 0.008 0.020 0.000 0.000
sort.all(m) 0.000 0.008 0.000 0.000
sort.all.q(m) 0.000 0.004 0.000 0.000
100 apply.sort(m) 0.012 0.016 0.028 0.000
apply.sort.q(m) 0.016 0.016 0.036 0.000
apply.sort.p(m) 0.020 0.020 0.040 0.000
sort.all(m) 0.000 0.004 0.008 0.000
sort.all.q(m) 0.004 0.004 0.004 0.000
316 apply.sort(m) 0.060 0.060 0.056 0.060
apply.sort.q(m) 0.064 0.060 0.060 0.072
apply.sort.p(m) 0.064 0.068 0.108 0.076
sort.all(m) 0.016 0.016 0.020 0.024
sort.all.q(m) 0.020 0.016 0.024 0.024
1000 apply.sort(m) 0.356 0.276 0.276 0.292
apply.sort.q(m) 0.348 0.316 0.288 0.296
apply.sort.p(m) 0.256 0.264 0.276 0.320
sort.all(m) 0.268 0.244 0.213 0.244
sort.all.q(m) 0.260 0.232 0.200 0.208
3162 apply.sort(m) 1.997 1.948 2.012 2.108
apply.sort.q(m) 1.916 1.880 1.892 1.901
apply.sort.p(m) 1.300 1.316 1.376 1.544
sort.all(m) 2.424 2.452 2.432 2.480
sort.all.q(m) 2.188 2.184 2.265 2.244
10000 apply.sort(m) 18.193 18.466 18.781 18.965
apply.sort.q(m) 15.837 15.861 15.977 16.313
apply.sort.p(m) 9.005 9.108 9.304 9.925
sort.all(m) 26.030 25.710 25.722 26.686
sort.all.q(m) 23.341 23.645 24.010 24.073
31622 apply.sort(m) 201.265 197.568 196.181 196.104
apply.sort.q(m) 163.190 160.810 158.757 160.050
apply.sort.p(m) 82.337 81.305 80.641 82.490
sort.all(m) 296.239 288.810 289.303 288.954
sort.all.q(m) 260.872 249.984 254.867 252.087
Does
apply(m, 2, sort)
do the job? :)
Or for top-10, say, use:
apply(m, 2 ,function(x) {sort(x,dec=TRUE)[1:10]})
Performance is strong - for 1e7 rows and 5 cols (5e7 numbers in total), my computer took around 9 or 10 seconds.
R is very fast at matrix calculations. A matrix with 1e7 elements in 1e4 columns gets sorted in under 3 seconds on my machine
set.seed(1)
m <- matrix(runif(1e7), ncol=1e4)
system.time(sm <- apply(m, 2, sort))
user system elapsed
2.62 0.14 2.79
The first 5 columns:
sm[1:15, 1:5]
[,1] [,2] [,3] [,4] [,5]
[1,] 2.607703e-05 0.0002085913 9.364448e-05 0.0001937598 1.157424e-05
[2,] 9.228056e-05 0.0003156713 4.948019e-04 0.0002542199 2.126186e-04
[3,] 1.607228e-04 0.0003988042 5.015987e-04 0.0004544661 5.855639e-04
[4,] 5.756689e-04 0.0004399747 5.762535e-04 0.0004621083 5.877446e-04
[5,] 6.932740e-04 0.0004676797 5.784736e-04 0.0004749235 6.470268e-04
[6,] 7.856274e-04 0.0005927107 8.244428e-04 0.0005443178 6.498618e-04
[7,] 8.489799e-04 0.0006210336 9.249109e-04 0.0005917936 6.548134e-04
[8,] 1.001975e-03 0.0006522120 9.424880e-04 0.0007702231 6.569310e-04
[9,] 1.042956e-03 0.0007237203 1.101990e-03 0.0009826915 6.810103e-04
[10,] 1.246256e-03 0.0007968422 1.117999e-03 0.0009873926 6.888523e-04
[11,] 1.337960e-03 0.0009294956 1.229132e-03 0.0009997757 8.671272e-04
[12,] 1.372295e-03 0.0012221676 1.329478e-03 0.0010375632 8.806398e-04
[13,] 1.583430e-03 0.0012781983 1.433513e-03 0.0010662393 8.886999e-04
[14,] 1.603961e-03 0.0013518191 1.458616e-03 0.0012068383 8.903167e-04
[15,] 1.673268e-03 0.0013697683 1.590524e-03 0.0013617468 1.024081e-03
They say there's a fine line between genius and madness... take a look at this and see what you think of the idea. As in the question, the goal is to find the top 30 elements of a vector vec that might be long (1e7, 1e8, or more elements).
topn = 30
sdmult = max(1,qnorm(1-(topn/length(vec))))
sdmin = 1e-5
acceptmult = 10
calcsd = max(sd(vec),sdmin)
calcmn = mean(vec)
thresh = calcmn + sdmult*calcsd
subs = which(vec > thresh)
while (length(subs) > topn * acceptmult) {
thresh = thresh + calcsd
subs = which(vec > thresh)
}
while (length(subs) < topn) {
thresh = thresh - calcsd
subs = which(vec > thresh)
}
topvals = sort(vec[subs],dec=TRUE)[1:topn]
The basic idea is that even if we don't know much about the distribution of vec, we'd certainly expect the highest values in vec to be several standard deviations above the mean. If vec were normally distributed, then the qnorm expression on line 2 gives a rough idea how many sd's above the mean we'd need to look to find the highest topn values (e.g. if vec contains 1e8 values, the top 30 values are likely to be located in the region starting 5 sd's above the mean.) Even if vec isn't normal, this assumption is unlikely to be massively far away from the truth.
Ok, so we compute the mean and sd of vec, and use these to propose a threshold to look above - a certain number of sd's above the mean. We're hoping to find in this upper tail a subset of slightly more than topn values. If we do, we can sort it and easily identify the highest topn values - which will be the highest topn values in vec overall.
Now the exact rules here can probably be tweaked a bit, but the idea is that we need to guard against the original threshold being "out" for some reason. We therefore exploit the fact that it's quick to check how many elements lie above a certain threshold. So, we first raise the threshold, in increments of calcsd, until there are fewer than 10 * topn elements above the threshold. Then, if needed. we reduce thresh (again in steps of calcsd) until we definitely have at least topn elements above the threshold. This bi-directional search should always lead to a "threshold set" whose size is fairly close to topn (hopefully within a factor of 10 or 100). As topn is relatively small (typical value 30), it will be really fast to sort this threshold set, which of course immediately gives us the highest topn elements in the original vector vec.
My claim is that the calculations involved in generating a decent threshold set are all quick in R, so if only the top 30 or so elements of a very large vector are required, this indirect approach will beat any approach that involves sorting the whole vector.
What do you think?! If you think it's an interesting idea, please like/vote up :) I'll look at doing some proper timings but my initial tests on randomly generated data were really promising - it'd be great to test it out on "real" data though...!
Cheers :)