Can I use 0 and 1 values in quantilesExact - clickhouse

From the quantile function documentation:
We recommend using a level value in the range of [0.01, 0.99]. Don't use a level value equal to 0 or 1 – use the min and max functions for these cases.
Does this also apply to the quantileExact and quantilesExact functions?
In my experiments I've found that quantileExact(0) = min and quantileExact(1) = max, but I cannot be sure about it.

That recommendation is not about accuracy but about the computational cost of quantile*.
quantileExact is much heavier than max / min.
See the time difference: min / max is about 8 times faster, even on a small dataset.
create table Speed Engine=MergeTree order by X
as select number X from numbers(1000000000);
SELECT min(X), max(X) FROM Speed;
┌─min(X)─┬────max(X)─┐
│      0 │ 999999999 │
└────────┴───────────┘
1 rows in set. Elapsed: 1.040 sec. Processed 1.00 billion rows, 8.00 GB (961.32 million rows/s., 7.69 GB/s.)
SELECT quantileExact(0)(X), quantileExact(1)(X) FROM Speed;
┌─quantileExact(0)(X)─┬─quantileExact(1)(X)─┐
│                   0 │           999999999 │
└─────────────────────┴─────────────────────┘
1 rows in set. Elapsed: 8.561 sec. Processed 1.00 billion rows, 8.00 GB (116.80 million rows/s., 934.43 MB/s.)

It turns out it is safe (in terms of correctness) to use levels 0 and 1 with the quantileExact and quantilesExact functions; the recommendation is only about performance.
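For completeness, the multi-level variant behaves the same way. Here is a minimal sketch, reusing the Speed table created above (the alias q is only for illustration):
SELECT quantilesExact(0, 0.5, 1)(X) AS q FROM Speed;
-- q[1] should equal min(X) and q[3] should equal max(X).
-- Note this is still a full exact quantile computation, so it remains
-- far more expensive than calling min() / max() directly.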

Related

Oracle Format Models Million / Billion

I am looking to take a number, and truncate / round it and display it with a notation indicator like this:
10000000 --> 10M
1000000045.76 --> 1B
If < 1 Million, just leave it as is. If >= 1 Million, display the M / B / T notation. Does Oracle have a format model for this?

How to understand granularity and blocks in ClickHouse?

I am not clear about these two terms.
Does one block have a fixed number of rows?
Is a block the minimum unit read from disk?
Are different blocks stored in different files?
Is the range of one block bigger than a granule? That is, can one block cover several granules of a skip index?
https://clickhouse.tech/docs/en/operations/table_engines/mergetree/#primary-keys-and-indexes-in-queries
The primary key index is sparse. By default it contains one value for every 8192 rows (= 1 granule).
Let's disable adaptive granularity (for the test) -- index_granularity_bytes=0
create table X (A Int64)
Engine=MergeTree order by A
settings index_granularity=16,index_granularity_bytes=0;
insert into X select * from numbers(32);
index_granularity=16 -- 32 rows = 2 granules; the primary index has 2 values: 0 and 16
select marks, primary_key_bytes_in_memory from system.parts where table = 'X';
┌─marks─┬─primary_key_bytes_in_memory─┐
│     2 │                          16 │
└───────┴─────────────────────────────┘
16 bytes = 2 values of Int64.
Adaptive index granularity means that granule sizes vary, because wide rows (many bytes per row) need, for performance, fewer than 8192 rows per granule.
index_granularity_bytes = 10MB ≈ 1 KB per row × 8192 rows, so each granule holds about 10MB. If a row is about 100 KB (long Strings), a granule will hold about 100 rows (not 8192).
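A minimal sketch to observe adaptive granularity, assuming the default settings (index_granularity=8192, index_granularity_bytes≈10MB); the table name Wide and the 100 KB string are made up for illustration:
create table Wide (A Int64, S String)
Engine=MergeTree order by A;  -- adaptive granularity left at its defaults
insert into Wide select number, repeat('x', 100000) from numbers(1000);
select rows, marks, round(rows / marks) as rows_per_granule
from system.parts
where table = 'Wide' and active;
-- with ~100 KB rows, rows_per_granule should come out far below 8192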
A skip index's GRANULARITY N means that the index stores one value for each N table granules (GRANULARITY 4 in the example below stores one value per 4 table granules).
create table X (A Int64, B Int64, INDEX IX1 (B) TYPE minmax GRANULARITY 4)
Engine=MergeTree order by A
settings index_granularity=16,index_granularity_bytes=0;
insert into X select number, number from numbers(128);
128/16 = 8, so the table has 8 granules, and INDEX IX1 stores 2 minmax values (8/4).
So the minmax index stores 2 values -- (0..63) and (64..127).
0..63 -- covers the first 4 of the table's granules.
64..127 -- covers the second 4 granules.
set send_logs_level='debug'
select * from X where B=77
[ 84 ] <Debug> dw.X (SelectExecutor): **Index `IX1` has dropped 1 granules**
[ 84 ] <Debug> dw.X (SelectExecutor): Selected 1 parts by date, 1 parts by key, **4 marks** to read from 1 ranges
The SelectExecutor checked the skip index: 4 table granules (one index granule) can be skipped because 77 is not in 0..63.
The other 4 table granules (4 marks) must be read, because 77 is in 64..127 -- some of those 4 granules may contain B=77.
https://clickhouse.tech/docs/en/development/architecture/#block
A block can contain any number of rows.
For example, 1-row blocks:
set max_block_size=1;
SELECT * FROM numbers_mt(1000000000) LIMIT 3;
┌─number─┐
│      0 │
└────────┘
┌─number─┐
│      2 │
└────────┘
┌─number─┐
│      3 │
└────────┘
set max_block_size=100000000000;
create table X (A Int64) Engine=Memory;
insert into X values(1);
insert into X values(2);
insert into X values(3);
SELECT * FROM X;
┌─A─┐
│ 1 │
└───┘
┌─A─┐
│ 3 │
└───┘
┌─A─┐
│ 2 │
└───┘
The three rows come back in three separate 1-row blocks: each INSERT into a Memory table produces its own block.
drop table X;
create table X (A Int64) Engine=Memory;
insert into X values(1)(2)(3);
select * from X;  -- a single INSERT produces a single 3-row block
┌─A─┐
│ 1 │
│ 2 │
│ 3 │
└───┘

Hive percentile_approx function is broken, isn't it?

I am using Hive 1.2.1000.2.4.2.0-258.
There are 4,850,000+ rows in the table, 14,511 rows with A between 73 and 74, and 3 columns: group_id, A, and B.
group_id is a constant, equal to 0.
Almost all of A and B are integers.
I was using the following script to compute summary statistics from the table:
select group_id, --group_id=0 a constant
percentile_approx(A , 0.5) as A_mdn,
percentile_approx(A , 0.25) as A_Q1,
percentile_approx(A , 0.75) as A_Q3,
percentile_approx(A , array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile_approx(B , 0.5) as B_mdn,
percentile_approx(B , 0.25) as B_Q1,
percentile_approx(B , 0.75) as B_Q3,
percentile_approx(B , array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id;
The result I got is:
group_id: 0
A_mdn:    73.21058033222496
A_Q1:     73.21058033222496
A_Q3:     462.16968382794516
A_i:      [73.21058033222496, 73.21058033222496, 73.21058033222496, 73.21058033222496, 73.21058033222496, 73.21058033222496]
min_A:    0.0
B_mdn:    1.0
B_Q1:     1.0
B_Q3:     2.0
B_i:      [2.0, 3.0, 4.0, 8.11278644563614, 17.0]
Then I changed the code as follows:
select group_id, --group_id=0 a constant
percentile(cast(A as bigint), 0.5) as A_mdn,
percentile(cast(A as bigint), 0.25) as A_Q1,
percentile(cast(A as bigint), 0.75) as A_Q3,
percentile(cast(A as bigint), array(0.2,0.15, 0.1,0.05,0.025,0.001)) as A_i,
min(A) as min_A,
percentile(cast(B as bigint), 0.5) as B_mdn,
percentile(cast(B as bigint), 0.25) as B_Q1,
percentile(cast(B as bigint), 0.75) as B_Q3,
percentile(cast(B as bigint), array(0.8,0.85, 0.9, 0.95,0.975)) as B_i
from table
group by group_id
The new result is:
group_id: 0
A_mdn:    72.0
A_Q1:     6.0
A_Q3:     762.0
A_i:      [3.0, 1.0, 1.0, 0.0, 0.0, 0.0]
min_A:    0.0
B_mdn:    1.0
B_Q1:     1.0
B_Q3:     2.0
B_i:      [2.0, 3.0, 4.0, 9.0, 17.0]
To double-check, I also loaded this table into R. Here is the R result:
A:
Min:            0
Q1:             6
Median:         72
Q3:             762
0.2 quantile:   3
0.15 quantile:  1.5
0.1 quantile:   1
0.05 quantile:  0
0.025 quantile: 0
0.001 quantile: 0
B:
Q1:             1
Median:         1
Q3:             2
0.8 quantile:   2
0.85 quantile:  3
0.9 quantile:   4
0.95 quantile:  9
0.975 quantile: 17
Obviously, the R result is consistent with the percentile function, but percentile_approx gives me the wrong answer.
Yeah, percentile_approx doesn't have any approximation guarantees, except when you set the accuracy to be greater than or equal to the number of data points.
The source for it is here: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java
From a quick reading, the gist is that it creates as many buckets as the accuracy setting, and then, when it runs out of buckets, it merges the two closest buckets with a weighted sum.
This breaks down for various inputs, though. In particular, data points that are very high or very low and spaced far apart from each other will break the algorithm. If you first clip your data to a range without many outliers, it should perform better.
If your data is too skewed, you might instead consider randomly sampling it and computing the exact (non-approx) percentile.
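One hedged variant of that idea: percentile_approx accepts an optional third argument, the number of histogram bins used for the approximation (it defaults to 10000). Raising it towards the row count should tighten the estimate, at the cost of memory; the 5000000 below is just an illustrative value exceeding the ~4.85M rows in the table:
select group_id,
       percentile_approx(A, array(0.25, 0.5, 0.75), 5000000) as A_q,
       percentile_approx(B, array(0.25, 0.5, 0.75), 5000000) as B_q
from table
group by group_id;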
The percentile function returns a true value only if all the values are integers; you said that almost all of A and B are integers.
Try casting the complete column A to int and see if you come close to the answer.
I don't think you will ever get exactly the same answer as R, because R's quantile function most likely accepts non-integers as well.
One way to get the exact answer would be to write your own UDF and use it instead.
Hope this helps!

Order of growth in algorithms

Suppose that you time a program as a function of N and produce
the following table.
N            seconds
--------------------
19683           0.00
59049           0.00
177147          0.01
531441          0.08
1594323         0.44
4782969         2.46
14348907       13.58
43046721       74.99
129140163     414.20
387420489    2287.85
Estimate the order of growth of the running time as a function of N.
Assume that the running time obeys a power law T(N) ~ a N^b. For your
answer, enter the constant b. Your answer will be marked as correct
if it is within 1% of the target answer - we recommend using
two digits after the decimal separator, e.g., 2.34.
Can someone explain how to calculate this?
Well, it is a simple mathematical problem.
I : a*387420489^b = 2287.85 -> a = 2287.85 / 387420489^b
II: a*43046721^b = 74.99 -> a = 74.99 / 43046721^b
III (set I equal to II): 2287.85 / 387420489^b = 74.99 / 43046721^b -> (387420489 / 43046721)^b = 2287.85 / 74.99 ->
-> http://www.purplemath.com/modules/solvexpo2.htm
Use logarithms to solve.
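Making the logarithm step explicit (just the arithmetic that link walks through, using the last two rows of the table):
\[
\left(\frac{387420489}{43046721}\right)^{b} = \frac{2287.85}{74.99}
\;\Longrightarrow\; 9^{b} \approx 30.51
\;\Longrightarrow\; b = \frac{\ln 30.51}{\ln 9} \approx 1.56
\]
which agrees, within the stated 1% tolerance, with the 1.55 derived in the next answer.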
1. Calculate the ratio of growth from one row to the next:
N seconds
--------------------
14348907 13.58
43046721 74.99
129140163 414.2
387420489 2287.85
2. Calculate the ratio of change for N:
43046721 / 14348907 = 3
129140163 / 43046721 = 3
therefore the rate of change for N is 3.
3. Calculate the ratio of change for seconds:
74.99 / 13.58 = 5.52
Now let's check the ratio for one more pair of rows, to be sure:
414.2 / 74.99 = 5.52
so the ratio of change for seconds is 5.52.
4. Build the following equation:
3^b = 5.52
b = log(5.52) / log(3) ≈ 1.55
Finally, we get that the order of growth of the running time is about N^1.55.

Reduce computing time for reshape

I have the following dataset, which I would like to reshape from wide to long format:
Name      Code   CURRENCY   01/01/1980   02/01/1980   03/01/1980   04/01/1980
Abengoa   4256   USD              1.53         1.54         1.51         1.52
Adidas    6783   USD              0.23         0.54         0.61         0.62
The data consists of stock prices for different firms on each day from 1980 to 2013. Therefore, I have 8,612 columns in my wide data (and about 3,000 rows). Now, I am using the following command to reshape the data into long format:
library(reshape)
data <- read.csv("data.csv")
data1 <- melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date")
However, for .csv files that are about 50 MB, it already takes about two hours. The computation shouldn't be limited by weak hardware, since I am running this on a 2.7 GHz Intel Core i7 with 16GB of RAM. Is there a more efficient way to do this?
Many thanks!
Benchmarks Summary:
Using stack (as suggested by @AnandaMahto) is definitely the way to go for smaller data sets (N < 3,000).
As the data sets get larger, data.table begins to outperform stack.
Here is an option using data.table
dtt <- data.table(data)
# non value columns, ie, the columns to keep post reshape
nvc <- c("Name","Code", "CURRENCY")
# name of columns being transformed
dateCols <- setdiff(names(data), nvc)
# use rbind list to combine subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
  dtt[, Date := d]
  cols <- c(nvc, "Date", d)
  setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
}))
## Results:
dtt2
# Name Code CURRENCY Date value
# 1: Abengoa 4256 USD X_01_01_1980 1.53
# 2: Adidas 6783 USD X_01_01_1980 0.23
# 3: Abengoa 4256 USD X_02_01_1980 1.54
# 4: Adidas 6783 USD X_02_01_1980 0.54
# 5: ... <cropped>
Updated Benchmarks with larger sample data
As per the suggestion from @AnandaMahto, below are benchmarks using larger sample data.
Please feel free to improve any of the methods used below and/or add new methods.
Benchmarks
Resh <- quote(reshape::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
Resh2 <- quote(reshape2::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
DT <- quote({ nvc <- c("Name","Code", "CURRENCY"); dateCols <- setdiff(names(data), nvc); rbindlist(lapply(dateCols, function(d) { dtt[, Date := d]; cols <- c(nvc, "Date", d); setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))}))})
Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))
# SAMPLE SIZE: ROWS = 900; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 Stack 0.813 0.623 0.192 5
# 2.530 DT 2.057 2.035 0.026 5
# 40.470 Resh 32.902 18.410 14.602 5
# 40.578 Resh2 32.990 18.419 14.728 5
# SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.00 DT 2.407 2.336 0.076 5
# 1.08 Stack 2.600 1.626 0.983 5
# SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 DT 10.450 7.418 3.058 5
# 2.232 Stack 23.329 14.180 9.266 5
Sample Data Creation
# rm(list=ls(all=TRUE))
set.seed(1)
LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")
size <- 900
dateSamples <- 380
startDate <- as.Date("1980-01-01")
Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
Dates <- seq(startDate, length.out=dateSamples, by="mon")
Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2
# Calling the sample data frame `data` to keep consistency, but I don't like this practice
data <- data.frame(Name, Code, CURRENCY,
                   matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
)
data[1:6, 1:8]
# Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
# 1 AJAAQNFA 3389 YEN 0.37 0.33 3.58 4.33 1.06
# 2 BJAARNFA 4348 YEN 1.14 2.69 2.57 0.27 3.02
# 3 CJAASNFA 6154 USD 2.47 3.72 3.32 0.36 4.85
# 4 DJAATNFA 9171 USD 2.22 2.48 0.71 0.79 2.85
# 5 EJAAUNFA 2814 USD 2.63 2.17 1.66 0.55 3.12
# 6 FJAAVNFA 9081 USD 1.92 1.47 3.51 3.23 3.68
From the question :
data <- read.csv("data.csv")
and
... for .csv files that are about 50MB big, it already takes about two
hours ...
So although stack/melt/reshape comes into play, I'm guessing (since this is your first ever S.O. question) that the biggest factor here is read.csv, assuming you're including that in your timing as well as melt (it isn't clear).
Default arguments to read.csv are well known to be slow. A few quick searches should reveal hints and tips (e.g. stringsAsFactors, colClasses) such as:
http://cran.r-project.org/doc/manuals/R-data.html
Quickly reading very large tables as dataframes
But I'd suggest fread (available since data.table 1.8.7). To get a feel for fread, its manual page is here:
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
The examples section there, as it happens, has a 50MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers which is great to see.
Then the stack/reshape/melt answers are next order, if I guessed correctly.
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
For that matter, this might also be worth considering:
data.frame(data[1:3],
           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious about benchmarking on such a small sample of data, though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
Update: Results of some more benchmarks
Using #RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation. You can see the results of the benchmarks here, on datasets ranging from 1000 rows to 3000 rows, in 500 row increments, and all with 8003 columns (8000 data columns, plus the three initial columns).
The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo's correct--there seems to be something about 3000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has any explanation about what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
Here are the results for just the last three runs:
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!
