How to generate random numbers in a specific range in ClickHouse

ClickHouse only has a few random functions, like rand(). But how can I create random numbers within a specific range, say 0-50?
for example something like:
select rand(0,50) as random_0_50
random_0_50
5
12
32
0
27

ClickHouse lets you apply the modulo operator in a SELECT,
so you can constrain rand() to a range with select rand() % n, which yields values 0..n-1.
For the 0-50 case above, the code will be:
select rand() % 51 as random_0_50

ClickHouse has a built-in generateRandom() table function that helps quickly generate random data and populate tables.
Here is an article about this function: https://medium.com/p/45e92c2645c5
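For example, a minimal sketch (the schema string below is an arbitrary assumption, not from the question):
-- generateRandom takes a schema string and yields rows of random data
SELECT * FROM generateRandom('id UInt32, value Float64') LIMIT 3;
-- to stay within 0-50, the modulo trick can still be applied on top
SELECT id % 51 AS random_0_50 FROM generateRandom('id UInt32') LIMIT 3;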

The rand function returns a number in the range [0, 4294967295]. You can take its modulo to constrain it to the range you want, like this:
SELECT rand() % 51 AS random_0_50
Query id: e4addce1-37b2-44a7-ab51-f37b6ef4ff58
┌─random_0_50─┐
│          13 │
└─────────────┘
Or you can create a function for it (note the range width upper_bound - lower_bound + 1, so it works for any lower bound):
CREATE FUNCTION rand_range AS (lower_bound, upper_bound) -> (lower_bound + (rand() % (upper_bound - lower_bound + 1)))
SELECT rand_range(0, 50) AS random_0_50
Query id: 9efa566d-e825-4e34-b228-e6ba6e210b16
┌─random_0_50─┐
│          50 │
└─────────────┘
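With the corrected range width, the function also works for a nonzero lower bound, for example:
SELECT rand_range(10, 20) AS random_10_20 -- always within 10..20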

Related

Report Builder expressions: sum three fields, then divide by 5

I currently have three columns in Report Builder that look like this:
PU  PI  LO  Total SUM
 0  13  31         44
The Total Sum column is an expression that sums the first three columns with =Fields!Put_Away.Value+Fields!Picked.Value+Fields!Loaded.Value. I now want to create one more column that grabs the sum of those three fields and divides it by 5. How do I do this? I tried =Fields!PU.Value+Fields!PI.Value+Fields!LO.Value/5 but it gives me 19.2 as the result for the example above.
You need to use brackets.
Currently you are doing =Fields!Put_Away.Value+Fields!Picked.Value+Fields!Loaded.Value/5, which converts to 0 + 13 + 31 / 5, or if we include the inferred brackets, 0 + 13 + (31/5).
You want =(Fields!Put_Away.Value+Fields!Picked.Value+Fields!Loaded.Value)/5, which becomes (0 + 13 + 31)/5 = 8.8.

How to understand granularity and blocks in ClickHouse?

I am not clear about these two terms.
Does one block have a fixed number of rows?
Is one block the minimum unit read from disk?
Are different blocks stored in different files?
Is the range of one block bigger than a granule? That is, can one block cover several granules of a skip index?
https://clickhouse.tech/docs/en/operations/table_engines/mergetree/#primary-keys-and-indexes-in-queries
The primary key index is sparse. By default it contains one value for each 8192 rows (= 1 granule).
Let's disable adaptive granularity (for the test) -- index_granularity_bytes=0:
create table X (A Int64)
Engine=MergeTree order by A
settings index_granularity=16,index_granularity_bytes=0;
insert into X select * from numbers(32);
index_granularity=16 -- 32 rows = 2 granules, so the primary index has 2 values: 0 and 16
select marks, primary_key_bytes_in_memory from system.parts where table = 'X';
┌─marks─┬─primary_key_bytes_in_memory─┐
│     2 │                          16 │
└───────┴─────────────────────────────┘
16 bytes = 2 values of Int64.
Adaptive index granularity means that granule sizes vary, because wide rows (many bytes per row) need, for performance, fewer (<8192) rows per granule.
The default index_granularity_bytes is 10MB (~1.2KB per row × 8192 rows). So a granule is capped at about 10MB: if rows are ~100KB each (long Strings), a granule will contain ~100 rows (not 8192).
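A minimal sketch to observe this, assuming a scratch table W (not part of the original answer):
-- leave adaptive granularity at its defaults; wide rows shrink the granules
create table W (A Int64, S String) Engine=MergeTree order by A;
insert into W select number, repeat('x', 100000) from numbers(1000);
-- rows_per_granule should land near 100, far below the default 8192
select marks, rows, round(rows / marks) as rows_per_granule
from system.parts where table = 'W' and active;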
A skip index declared with GRANULARITY n means that the index stores one value for each n table granules.
create table X (A Int64, B Int64, INDEX IX1 (B) TYPE minmax GRANULARITY 4)
Engine=MergeTree order by A
settings index_granularity=16,index_granularity_bytes=0;
insert into X select number, number from numbers(128);
128/16 = 8, so the table has 8 granules, and INDEX IX1 stores 2 minmax values (8/4).
So the minmax index stores 2 values: (0..63) and (64..127).
0..63 points to the first 4 of the table's granules.
64..127 points to the second 4 of the table's granules.
set send_logs_level='debug'
select * from X where B=77
[ 84 ] <Debug> dw.X (SelectExecutor): Index `IX1` has dropped 1 granules
[ 84 ] <Debug> dw.X (SelectExecutor): Selected 1 parts by date, 1 parts by key, 4 marks to read from 1 ranges
SelectExecutor checked the skip index: 4 table granules can be skipped because 77 is not in 0..63.
The other 4 granules (4 marks) must be read, because 77 is in 64..127 -- some of those 4 granules may contain B=77.
https://clickhouse.tech/docs/en/development/architecture/#block
A block can contain any number of rows.
For example, 1-row blocks:
set max_block_size=1;
SELECT * FROM numbers_mt(1000000000) LIMIT 3;
┌─number─┐
│      0 │
└────────┘
┌─number─┐
│      2 │
└────────┘
┌─number─┐
│      3 │
└────────┘
set max_block_size=100000000000;
create table X (A Int64) Engine=Memory;
insert into X values(1);
insert into X values(2);
insert into X values(3);
SELECT * FROM X;
┌─A─┐
│ 1 │
└───┘
┌─A─┐
│ 3 │
└───┘
┌─A─┐
│ 2 │
└───┘
Each INSERT created its own block, so the SELECT returned three 1-row blocks. A single INSERT of three rows produces one block with 3 rows:
drop table X;
create table X (A Int64) Engine=Memory;
insert into X values (1), (2), (3);
select * from X
┌─A─┐
│ 1 │
│ 2 │
│ 3 │
└───┘

How to use the results of the WITH TOTALS modifier

We have the WITH TOTALS modifier that can summarize values across all rows and get the total result with key value = 0 or NULL or something like this.
The problem is that I don't understand how I can use these values in subsequent calculations.
Maybe I'm using the wrong format:
select processing_date,count(*)
from `telegram.message`
where processing_date>='2019-05-01'
group by processing_date with totals
The documentation says:
You can use WITH TOTALS in subqueries, including subqueries in the
JOIN clause (in this case, the respective total values are combined).
An example with subqueries in a JOIN (from the ClickHouse test scripts on GitHub):
SELECT k, s1, s2
FROM
(
SELECT intDiv(number, 3) AS k, sum(number) AS s1
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
GROUP BY k WITH TOTALS
)
ANY LEFT JOIN
(
SELECT intDiv(number, 4) AS k, sum(number) AS s2
FROM
(
SELECT *
FROM system.numbers
LIMIT 10
)
GROUP BY k WITH TOTALS
) USING (k)
ORDER BY k ASC
/* Result:
┌─k─┬─s1─┬─s2─┐
│ 0 │  3 │  6 │
│ 1 │ 12 │ 22 │
│ 2 │ 21 │ 17 │
│ 3 │  9 │  0 │
└───┴────┴────┘
Totals:
┌─k─┬─s1─┬─s2─┐
│ 0 │ 45 │ 45 │
└───┴────┴────┘
*/
As a workaround, you can combine the results of several totals using client libraries.
Using WITH ROLLUP instead of WITH TOTALS avoids this formatting problem: the summary rows become part of the main result set, so they can be used in further calculations.

Subset a data frame in R based on above and below a threshold value

I searched a lot to find a similar post to mine, but no luck yet.
I have 1 column of data like below (extracted from an original big file having many columns):
C1
0
1
2
3
4
3
3
2
1
From this data I want to generate a new column C2 which should indicate where my C1 column values are above or below a threshold relative to the max value.
In this case max(C1) is 4, so if I set a threshold of 2, the new data should look like this:
C1 C2
0 0
1 0
2 1
3 1
4 1
3 1
3 1
2 1
1 0
Note: my data always has an increasing trend up to some point and then a decreasing trend after that.
I know how to do a simple, plain subset on a particular column, but I am not getting the logic to subset when there is an increasing and then a decreasing trend.
Thanks in advance.
I would use the plyr package in R, with an ifelse statement inside the mutate function. I will write my code and then explain. I assume you already have the C1 vector in a data frame named df:
install.packages('plyr')
library(plyr)
df2 <- mutate(df, C2 = ifelse(C1 >= 2, 1, 0))  # R is case-sensitive, so match the column name C1
The mutate function creates a new column computed from whatever expression you give it. In this case I used the ifelse function, which is similar to Excel's IF() function and takes:
a condition, what happens if true, what happens if false.
Hope that helps =)

Stata - How to Generate Random Integers

I am learning Stata and want to know how to generate random integers (without replacement). If I had 10 total rows, I would want each row to have a unique integer from 1 to 10 assigned to it. In R, one could simply do:
sample(1:10, 10)
But it seems more difficult to do in Stata. From this Stata page, I saw:
generate ui = floor((b-a+1)*runiform() + a)
If I substitute a=1 and b=10, I get something close to what I want, but it samples with replacement.
After getting that part figured out, how would I handle the following wrinkle: my data come in pairs. For example, in the 10 observations, there are 5 groups of 2. Each group of 2 has a unique identifier. How would I arrange the groups (and not the observations) in random order? The data would look something like this:
obs  group  mem  value
  1    A     x    9345
  2    A     y     129
  3    B     x     251
  4    B     y     373
  5    C     x     788
  6    C     y     631
  7    D     x     239
  8    D     y     481
  9    E     x     224
 10    E     y     585
obs is the observation number. group is the group the observation (row) belongs to. mem is the member identifier in the group. Each group has one x and one y in it.
First question:
You could just shuffle observation identifiers.
set obs 10
gen y = _n
gen rnd = runiform()
sort rnd
Or in Mata
jumble(1::10)
Second question: Several ways. Here's one.
gen rnd = runiform()
bysort group (rnd): replace rnd = rnd[1]
sort rnd
General comment: For reproducibility, set the random number seed beforehand.
set seed 2803
or whatever.
