OpenJDK JMH sometimes prints (*interrupt*) in results. What exactly does this mean?

I use OpenJDK JMH 0.9.3, and sometimes I get a result log file like the one below.
What does (*interrupt*) mean here?
Forking 1 times using command: [/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/...]
# Run progress: 53,85% complete, ETA 00:01:26
# Fork: 1 of 1
# Warmup Iteration 1: 19950765,000 us/op
# Warmup Iteration 2: (*interrupt*) (*interrupt*) (*interrupt*) 18107134,000 us/op
# Warmup Iteration 3: (*interrupt*) 14439157,500 us/op
Iteration 1: 13571806,000 us/op
Iteration 2: 7484946,500 us/op
Iteration 3: (*interrupt*) 12386565,000 us/op
Iteration 4: (*interrupt*) 7245477,500 us/op
Iteration 5: (*interrupt*) 9047236,000 us/op
Result: 9947206,200 ±(99.9%) 11103651,130 us/op [Average]
Statistics: (min, avg, max) = (7245477,500, 9947206,200, 13571806,000), stdev = 2883582,937
Confidence interval (99.9%): [-1156444,930, 21050857,330]

That means your workload was interrupted by JMH, quite possibly because it overran the iteration time. JMH prints "(*interrupt*)" to let you know the score was obtained with active intervention from JMH and may be unreliable.
You can annotate your benchmark with @Timeout(time = 1, timeUnit = TimeUnit.HOURS), and JMH will know that it is allowed to take longer. By default, when you run a batch measurement without a time limit set, it falls back to a 10-minute time limit.
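For example, a minimal sketch (the class name and workload body here are made up for illustration):
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Timeout;

public class SlowBenchmark {
    @Benchmark
    @Timeout(time = 1, timeUnit = TimeUnit.HOURS)
    public void longRunningWorkload() throws InterruptedException {
        // Hypothetical stand-in for work that can overrun the
        // default 10-minute iteration time limit.
        TimeUnit.MINUTES.sleep(15);
    }
}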

Related

Where can I find documentation for BenchmarkDotNet output?

In the output from BenchmarkDotNet I have the following lines:
For the first benchmark
WorkloadResult 100: 1 op, 614219700.00 ns, 614.2197 ms/op
GC: 123 1 0 518085976 1
Threading: 2 0 1
For the second benchmark
WorkloadResult 73: 1 op, 464890400.00 ns, 464.8904 ms/op
GC: 14 1 0 59217312 1
Threading: 7469 0 1
What do the values in GC and Threading mean?
You can find it documented at https://benchmarkdotnet.org/articles/guides/how-it-works.html and at https://adamsitnik.com/the-new-Memory-Diagnoser/#how-to-read-the-results.
To make a long story short, you should care only about the results printed in the summary table, which are explained by the Legend printed below it.

Numpy structured array performance

I've got a look-up problem that boils down to the following situation: three columns of positive integers. For some value i, which values in 'column_3' have a value in 'column_1' below i and a value in 'column_2' above i?
import numpy as np
rows = int(1e6)
i = 5e8
ts = np.zeros((rows,), dtype=[('column_1','int64'),('column_2','int64'),('column_3','int64')])
ts['column_1'] = np.random.randint(low=0,high=1e9,size=rows)
ts['column_2'] = np.random.randint(low=0,high=1e9,size=rows)
ts['column_3'] = np.random.randint(low=0,high=1e9,size=rows)
This is the operation I'd like to optimize:
%%timeit
a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
Is there anything I'm overlooking that could make this faster?
I'd be grateful for any advice!
Assigning your 3 arrays to A,B,C at creation as well:
In [3]: %%timeit
...: a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
...:
22.5 ms ± 838 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %%timeit
...: a = C[(A < i)&(B > i)]
...:
...:
9.36 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using a,b,c = ts['column_1'],ts['column_2'],ts['column_3'] instead falls in between.
Those are the variants and timings you can play with. As far as I can see, there are just minor differences due to indexing overhead; nothing like an order-of-magnitude difference.
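For reference, here is a sketch of the setup those timings assume, with A, B, C created as plain 1-D arrays from the start rather than as fields of a structured array:
import numpy as np

rows = int(1e6)
i = 5e8

# Three standalone contiguous arrays instead of structured-array fields.
A = np.random.randint(low=0, high=int(1e9), size=rows)
B = np.random.randint(low=0, high=int(1e9), size=rows)
C = np.random.randint(low=0, high=int(1e9), size=rows)

# The same selection as the structured-array version.
a = C[(A < i) & (B > i)]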

Why is iterating through flattened iterator slow?

Scala 2.11.8
I'm measuring iteration through flattened and non-flattened iterators. I wrote the following benchmark:
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
class SerializeBenchmark {
var list = List(
List("test", 12, 34, 56),
List("test-test-test", 123, 444, 0),
List("test-test-test-tes", 145, 443, 4333),
List("testdsfg-test-test-tes", 3145, 435, 333),
List("test-tessdfgsdt-tessdfgt-tes", 1455, 43, 333),
List("tesewrt-test-tessdgdsft-tes", 13345, 4533, 3222333),
List("ewrtes6yhgfrtyt-test-test-tes", 122245, 433444, 322233),
List("tserfest-test-testtryfgd-tes", 143345, 43, 3122233),
List("test-reteytest-test-tes", 1121145, 4343, 3331212),
List("test-test-ertyeu6test-tes", 14115, 4343, 33433),
List("test-lknlkkn;lkntest-ertyeu6test-tes", 98141115, 4343, 33433),
List("tkknknest-test-ertyeu6test-tes", 914111215, 488343, 33433),
List("test-test-ertyeu6test-tes", 1411125, 437743, 93433),
List("test-test-ertyeu6testo;kn;lkn;lk-tes", 14111215, 5409343, 39823),
List("telnlkkn;lnih98st-test-ertyeu6test-tes", 1557215, 498343, 3377433)
)
@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Array(Mode.AverageTime))
def flattenerd(bh: Blackhole): Any = {
list.iterator.flatten.foreach(bh.consume)
}
@Benchmark
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Array(Mode.AverageTime))
def raw(bh: Blackhole): Any = {
list.iterator.foreach(_.foreach(bh.consume))
}
}
After running these benchmarks several times I got the following results:
Benchmark Mode Cnt Score Error Units
SerializeBenchmark.flattenerd avgt 5 10311,373 ± 1189,448 ns/op
SerializeBenchmark.raw avgt 5 3463,902 ± 141,145 ns/op
Almost a 3x difference in performance, and the larger I make the source list, the bigger the performance difference gets. Why?
I expected some performance difference, but not 3x.
I re-ran your test with a few more iterations, running under the hs_gc profiler.
These are the results:
[info] Benchmark Mode Cnt Score Error Units
[info] IteratorFlatten.flattenerd avgt 50 0.708 ± 0.120 us/op
[info] IteratorFlatten.flattenerd:·sun.gc.collector.0.invocations avgt 50 8.840 ± 2.259 ?
[info] IteratorFlatten.raw avgt 50 0.367 ± 0.014 us/op
[info] IteratorFlatten.raw:·sun.gc.collector.0.invocations avgt 50 0 ?
IteratorFlatten.flattenerd had an average of 8 GC cycles during the test runs, where raw had 0. This means that we pay in running time for the allocation noise generated by FlattenOps (the wrapper class and its methods, particularly hasNext, which allocates an iterator per list), which is what is needed in order to provide the flatten method on Iterator.
If I re-run the test and give it a minimum heap size of 2G, the results get closer:
[info] Benchmark Mode Cnt Score Error Units
[info] IteratorFlatten.flattenerd avgt 50 0.615 ± 0.041 us/op
[info] IteratorFlatten.raw avgt 50 0.434 ± 0.064 us/op
The gist of it is: the more you allocate, the more work the GC has to do, the more pauses, and the slower the execution.
Note that these kinds of microbenchmarks are very fragile and may yield different results. Make sure you run enough measurements for the statistics to become significant.
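If you want to reproduce the second run, here is a minimal sketch of pinning the forked JVM's minimum heap through JMH's fork options (the class name is illustrative; the profiler itself is attached separately on the JMH command line, e.g. with -prof hs_gc):
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._
import org.openjdk.jmh.infra.Blackhole

@State(Scope.Benchmark)
@Fork(value = 1, jvmArgsAppend = Array("-Xms2G")) // give the forked JVM a 2G minimum heap
class IteratorFlattenBenchmark {
  var list: List[List[Any]] = List(List("test", 12, 34, 56))

  @Benchmark
  @OutputTimeUnit(TimeUnit.NANOSECONDS)
  @BenchmarkMode(Array(Mode.AverageTime))
  def flattenerd(bh: Blackhole): Unit =
    list.iterator.flatten.foreach(bh.consume)
}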

What exactly does OpenJDK JMH "score error" mean?

I am using http://openjdk.java.net/projects/code-tools/jmh/ for benchmarking, and I get a result like:
Benchmark Mode Samples Score Score error Units
o.a.f.c.j.b.TestClass.test1 avgt 5 2372870,600 210897,743 us/op
o.a.f.c.j.b.TestClass.test2 avgt 5 2079931,850 394727,671 us/op
o.a.f.c.j.b.TestClass.test3 avgt 5 26585,818 21105,739 us/op
o.a.f.c.j.b.TestClass.test4 avgt 5 19113,230 8012,852 us/op
o.a.f.c.j.b.TestClass.test5 avgt 5 2586,413 1949,487 us/op
o.a.f.c.j.b.TestClass.test6 avgt 5 1942,963 1619,967 us/op
o.a.f.c.j.b.TestClass.test7 avgt 5 233,902 73,861 us/op
o.a.f.c.j.b.TestClass.test8 avgt 5 191,970 126,682 us/op
What exactly does the column "Score error" mean, and how should I interpret it?
This is the margin of error for the score. In most cases, it is half of the confidence interval width. Think of it as a "±" sign between "Score" and "Score error". In fact, the human-readable log shows exactly that:
Result: 1.986 ±(99.9%) 0.009 ops/ns [Average]
Statistics: (min, avg, max) = (1.984, 1.986, 1.990), stdev = 0.002
Confidence interval (99.9%): [1.977, 1.995]
# Run complete. Total time: 00:00:12
Benchmark Mode Samples Score Score error Units
o.o.j.s.HelloWorld.hello thrpt 5 1.986 0.009 ops/ns
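Concretely, the error is half the width of the confidence interval printed above:
score error = (1.995 - 1.977) / 2 = 0.009 ops/ns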

Reduce computing time for reshape

I have the following dataset, which I would like to reshape from wide to long format:
Name Code CURRENCY 01/01/1980 02/01/1980 03/01/1980 04/01/1980
Abengoa 4256 USD 1.53 1.54 1.51 1.52
Adidas 6783 USD 0.23 0.54 0.61 0.62
The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:
library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")
However, for .csv files that are about 50MB in size, it already takes about two hours. The computing time shouldn't be driven by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16GB of RAM. Is there any other, more efficient way to do this?
Many thanks!
Benchmarks summary:
Using stack (as suggested by @AnandaMahto) is definitely the way to go for smaller data sets (N < 3,000).
As the data sets get larger, data.table begins to outperform stack.
Here is an option using data.table:
dtt <- data.table(data)
# non value columns, ie, the columns to keep post reshape
nvc <- c("Name","Code", "CURRENCY")
# name of columns being transformed
dateCols <- setdiff(names(data), nvc)
# use rbindlist to combine the subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
dtt[, Date := d]
cols <- c(nvc, "Date", d)
setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
}))
## Results:
dtt2
# Name Code CURRENCY Date value
# 1: Abengoa 4256 USD X_01_01_1980 1.53
# 2: Adidas 6783 USD X_01_01_1980 0.23
# 3: Abengoa 4256 USD X_02_01_1980 1.54
# 4: Adidas 6783 USD X_02_01_1980 0.54
# 5: ... <cropped>
Updated Benchmarks with larger sample data
As per the suggestion from @AnandaMahto, below are benchmarks using larger sample data.
Please feel free to improve any of the methods used below and/or add new methods.
Benchmarks
Resh <- quote(reshape::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
Resh2 <- quote(reshape2::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
DT <- quote({
  nvc <- c("Name", "Code", "CURRENCY")
  dateCols <- setdiff(names(data), nvc)
  rbindlist(lapply(dateCols, function(d) {
    dtt[, Date := d]
    cols <- c(nvc, "Date", d)
    setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
  }))
})
Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))
# SAMPLE SIZE: ROWS = 900; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 Stack 0.813 0.623 0.192 5
# 2.530 DT 2.057 2.035 0.026 5
# 40.470 Resh 32.902 18.410 14.602 5
# 40.578 Resh2 32.990 18.419 14.728 5
# SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.00 DT 2.407 2.336 0.076 5
# 1.08 Stack 2.600 1.626 0.983 5
# SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 DT 10.450 7.418 3.058 5
# 2.232 Stack 23.329 14.180 9.266 5
Sample Data Creation
# rm(list=ls(all=TRUE))
set.seed(1)
LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")
size <- 900
dateSamples <- 380
startDate <- as.Date("1980-01-01")
Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
Dates <- seq(startDate, length.out=dateSamples, by="mon")
Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2
# Calling the sample data frame `data` to keep consistency, though I don't like this practice
data <- data.frame(Name, Code, CURRENCY,
matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
)
data[1:6, 1:8]
# Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
# 1 AJAAQNFA 3389 YEN 0.37 0.33 3.58 4.33 1.06
# 2 BJAARNFA 4348 YEN 1.14 2.69 2.57 0.27 3.02
# 3 CJAASNFA 6154 USD 2.47 3.72 3.32 0.36 4.85
# 4 DJAATNFA 9171 USD 2.22 2.48 0.71 0.79 2.85
# 5 EJAAUNFA 2814 USD 2.63 2.17 1.66 0.55 3.12
# 6 FJAAVNFA 9081 USD 1.92 1.47 3.51 3.23 3.68
From the question:
data <- read.csv("data.csv")
and
... for .csv files that are about 50MB big, it already takes about two
hours ...
So although stack/melt/reshape comes into play, I'm guessing (since this is your first-ever S.O. question) that the biggest factor here is read.csv, assuming you're including that in your timing as well as melt (it isn't clear).
Default arguments to read.csv are well known to be slow. A few quick searches should reveal hints and tips (e.g. stringsAsFactors, colClasses), such as:
http://cran.r-project.org/doc/manuals/R-data.html
Quickly reading very large tables as dataframes
But I'd suggest fread (available since data.table 1.8.7). To get a feel for fread, its manual page in raw text form is here:
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
The examples section there, as it happens, has a 50MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers which is great to see.
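A minimal sketch of both routes (the column classes and the 3 + 8,609 column split are assumptions about your file):
library(data.table)

# Base read.csv with the usual speedups applied: pre-declared
# column classes and no factor conversion.
classes <- c("character", "integer", "character", rep("numeric", 8609))
data <- read.csv("data.csv", colClasses = classes, stringsAsFactors = FALSE)

# fread detects the column types itself and is typically much
# faster on a file of this size.
data <- as.data.frame(fread("data.csv"))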
Then the stack/reshape/melt answers are next order, if I guessed correctly.
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
For that matter, this might also be worth considering:
data.frame(data[1:3],
vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious about benchmarking on such a small sample of data, though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
Update: Results of some more benchmarks
Using @RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation. You can see the results of the benchmarks here, on datasets ranging from 1,000 to 3,000 rows, in 500-row increments, all with 8,003 columns (8,000 data columns plus the three initial columns).
The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo's correct: there seems to be something about 3,000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has an explanation of what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
Here are the results for just the last three runs:
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!
