I know ClickHouse may not be the right database for getting millions of rows (only a few columns) back with SQL like the following:
select col1, col2, ..., date from table where col0='a1' and date >= 'start_date' and date <= 'end_date'
But I currently face exactly that situation. Could someone tell me what performance to expect, specifically:
1. The maximum number of rows one SELECT query can return
2. Any optimizations for a faster response
Thanks in advance.
ClickHouse is able to return millions of rows per second.
Test with 100mil rows / 3GB:
CREATE TABLE XX
ENGINE = MergeTree
ORDER BY A AS
SELECT
number AS A,
toDate(number % 103) AS B,
toString(number) AS C,
number % 1003 AS D
FROM numbers(100000000)
time curl -o t 'http://localhost:8123/?query=select%20A,%20B,%20C,%20D%20from%20XX'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 3115M 0 3115M 0 0 413M 0 --:--:-- 0:00:07 --:--:-- 404M
real 0m7.840s
time clickhouse-client -q 'select A,B,C,D from XX' > /dev/null
real 0m6.451s
1 mil rows / 30MB
time curl -o /dev/null 'http://localhost:8123/?query=select%20A,%20B,%20C,%20D%20from%20XX%20prewhere%20B=toDate(1)'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 30.2M 0 30.2M 0 0 93.0M 0 --:--:-- --:--:-- --:--:-- 92.7M
real 0m0.329s
time clickhouse-client -q 'select A,B,C,D from XX prewhere B=toDate(1)' > /dev/null
real 0m0.344s
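For the original query shape (a filter on col0 plus a date range), response time mostly depends on whether those columns lead the table's ORDER BY (primary) key, so the range is read via the index instead of a full scan. A minimal sketch, assuming the column names from the question and made-up types, partitioning and date values:

CREATE TABLE t
(
    col0 String,
    col1 String,
    col2 String,
    date Date
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (col0, date);

-- only the matching granules are read, because (col0, date) is the primary key
SELECT col1, col2, date
FROM t
WHERE col0 = 'a1' AND date >= '2020-01-01' AND date <= '2020-12-31';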
A user provides a set of inputs in one table, e.g. "request_table" a:
User Input            | Value   | Field Name in Database
Product               | Deposit | product_type
Deposit Term (months) | 24      | term
Deposit Amount        | 200,000 | amount
Customer Type         | Charity | customer_type
Existing Customer     | Y       | existing_customer
I would like to use the product selection to pick out SQL scripts embedded in a "pricing_table" b, where the price is made up of components, each of which is affected by one or more of the above inputs:
Product | Grid | Measures | Value1 | Value1Min | Value1Max | Value2 | Value2Min | Value2Max | Price
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 0 | 12 | | 0 | 100000 | 1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 12 | 36 | | 0 | 100000 | 2
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 36 | 9999 | | 0 | 100000 | 3
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 0 | 12 | | 100000 | 500000 | 1.1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 12 | 36 | | 100000 | 500000 | 2.1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 36 | 9999 | | 100000 | 500000 | 3.1
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 0 | 12 | | 500000 | 99999999 | 1.2
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 12 | 36 | | 500000 | 99999999 | 2.2
Deposit | Term_Amount | a.term>=b.value1min and a.term<b.value1max and a.amount>=b.value2min and a.amount<b.value2max | | 36 | 9999 | | 500000 | 99999999 | 3.2
Deposit | Customer_Type | a.customer_type=b.value1 | Personal | | | | | | 0
Deposit | Customer_Type | a.customer_type=b.value1 | Charity | | | | | | 0.1
Deposit | Customer_Type | a.customer_type=b.value1 | Business | | | | | | -0.1
Deposit | Existing_Customer | a.existing_customer=b.value1 | Y | | | | | | 0.1
Deposit | Existing_Customer | a.existing_customer=b.value1 | N | | | | | | 0
The query is: select distinct measures from pricing_table where product=(select product_type from request_table). This returns multiple rows, each holding a piece of SQL logic.
I would like to run this SQL logic in a loop, e.g.:
select b.* from pricing_table b where :measures
This would return all the rows where the specific measures are matched.
I'm doing it this way because the number of input columns can grow to hundreds, so I don't want a really wide table.
Any help appreciated, thanks.
I've created the tables but am unsure how to loop over the measures and apply the values from that field in a looped query. Thanks.
In a PL/SQL pipelined function, you can build the SQL query dynamically, open a cursor on it, loop over the results, and PIPE the rows.
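A minimal sketch of that approach, assuming Oracle, the request_table/pricing_table columns shown above, and hypothetical names (price_component_t, price_component_tab, get_price_components). The measures text is concatenated into dynamic SQL, so this is only safe if pricing_table is fully controlled:

CREATE OR REPLACE TYPE price_component_t AS OBJECT (grid VARCHAR2(50), price NUMBER);
/
CREATE OR REPLACE TYPE price_component_tab AS TABLE OF price_component_t;
/
CREATE OR REPLACE FUNCTION get_price_components RETURN price_component_tab PIPELINED IS
  l_cur   SYS_REFCURSOR;
  l_grid  pricing_table.grid%TYPE;
  l_price pricing_table.price%TYPE;
BEGIN
  -- one looped query per distinct grid/measures pair for the requested product
  FOR m IN (SELECT DISTINCT grid, measures
              FROM pricing_table
             WHERE product = (SELECT product_type FROM request_table)) LOOP
    OPEN l_cur FOR
      'SELECT b.grid, b.price
         FROM pricing_table b, request_table a
        WHERE b.product = a.product_type
          AND b.grid = :g
          AND ' || m.measures
      USING m.grid;
    LOOP
      FETCH l_cur INTO l_grid, l_price;
      EXIT WHEN l_cur%NOTFOUND;
      PIPE ROW (price_component_t(l_grid, l_price));  -- emit each matched component
    END LOOP;
    CLOSE l_cur;
  END LOOP;
  RETURN;
END;
/

The function can then be queried like a table, e.g. SELECT grid, price FROM TABLE(get_price_components), and the component prices summed.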
I have a Matrix that is calculating percentages, and it works properly for one row but not for multiple rows.
It's calculating the individual departments in the row by item number to equal 100%.
When using multiple rows it calculates all the rows together for a total of 100%.
This is not what I want.
I want all rows to act like the first picture, with each row calculating across the row.
Like this:

         dept 1   dept 2   dept 3   total
item 1   71%      14%      14%      100%
item 2   50%      25%      25%      100%
I have figured this out. This is how I needed to have my SQL: SUM(B.RDCQTY) OVER (PARTITION BY RDICDE) AS SMDSTRDCQTY and RDCQTY / SUM(B.RDCQTY) OVER (PARTITION BY RDICDE) AS PER, and in the last CTE SUM(PER) OVER (PARTITION BY RDICDE) AS TTLPER. Then in SSRS the percentage column is sum(PER) and the total % column is =(Fields!TTLPER.Value). Now the report is calculating properly per row.
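A sketch of how those pieces fit together on the SQL side, assuming a hypothetical source table dept_detail with a DEPT column; only RDICDE, RDCQTY and the aliases come from the post:

WITH base AS (
    SELECT
        B.RDICDE,
        B.DEPT,
        B.RDCQTY,
        SUM(B.RDCQTY) OVER (PARTITION BY B.RDICDE) AS SMDSTRDCQTY,          -- total quantity per item
        B.RDCQTY * 1.0 / SUM(B.RDCQTY) OVER (PARTITION BY B.RDICDE) AS PER  -- this department's share of the item
    FROM dept_detail B
)
SELECT
    RDICDE,
    DEPT,
    PER,
    SUM(PER) OVER (PARTITION BY RDICDE) AS TTLPER  -- sums to 1.0 (100%) per item on every row
FROM base;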
From the quantile function documentation:
We recommend using a level value in the range of [0.01, 0.99]. Don't use a level value equal to 0 or 1 – use the min and max functions for these cases.
Does this also apply to the quantileExact and quantilesExact functions?
In my experiments, I've found that quantileExact(0) = min and quantileExact(1) = max, but I cannot be sure about it.
That recommendation is not about accuracy but about the complexity of quantile*.
quantileExact is much, much heavier than max/min.
See the time difference: min/max is 8 times faster, even on a small dataset.
create table Speed Engine=MergeTree order by X
as select number X from numbers(1000000000);
SELECT min(X), max(X) FROM Speed;
┌─min(X)─┬────max(X)─┐
│ 0 │ 999999999 │
└────────┴───────────┘
1 rows in set. Elapsed: 1.040 sec. Processed 1.00 billion rows, 8.00 GB (961.32 million rows/s., 7.69 GB/s.)
SELECT quantileExact(0)(X), quantileExact(1)(X) FROM Speed;
┌─quantileExact(0)(X)─┬─quantileExact(1)(X)─┐
│ 0 │ 999999999 │
└─────────────────────┴─────────────────────┘
1 rows in set. Elapsed: 8.561 sec. Processed 1.00 billion rows, 8.00 GB (116.80 million rows/s., 934.43 MB/s.)
It turns out it is safe to use 0 and 1 values for quantileExact and quantilesExact functions.
The MQTT 3.1.1 documentation is very clear and helpful, however I am having trouble understanding the meaning of one section regarding the Keep Alive byte structure in the connect message.
The documentation states:
The Keep Alive is a time interval measured in seconds. Expressed as a 16-bit word, it is the maximum time interval that is permitted to elapse between the point at which the Client finishes transmitting one Control Packet and the point it starts sending the next.
And gives an example of a keep alive payload:
Keep Alive MSB (0) 0 0 0 0 0 0 0 0
Keep Alive LSB (10) 0 0 0 0 1 0 1 0
I have interpreted this to represent a keep alive interval of 10 seconds, as the interval is given in seconds and that makes the most sense. However, I'm not sure how you would represent longer intervals of, for example, 10 minutes.
Finally, would the maximum keep alive interval of 65535 seconds (~18 hours) be represented by the following bytes?
Keep Alive MSB (255) 1 1 1 1 1 1 1 1
Keep Alive LSB (255) 1 1 1 1 1 1 1 1
Thank you for your help
The maximum value of the 16-bit field is 2^16 - 1 = 65535 seconds. 65535 / 3600 = 18.2 hours; the remaining 735 seconds are 12 minutes and 15 seconds.
Result: 18 hours, 12 minutes, 15 seconds.
10 minutes = 600 seconds
600 in binary -> 0000 0010 0101 1000
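That is, the two bytes encode keep alive = MSB × 256 + LSB, so 600 = 2 × 256 + 88: Keep Alive MSB = 2 (0000 0010) and Keep Alive LSB = 88 (0101 1000).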
And yes, 65535 is the largest number that can be represented by a 16-bit binary field, but there are very few situations where an 18-hour keep alive interval would make sense.
I want to ask about the correct way to use bucketing and TABLESAMPLE.
There is a table X which I created with:
CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC
Then I inserted 500M rows into X with:
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
Then I want to count or search some rows with a condition, roughly:
SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
But it should be better to use TABLESAMPLE, since I clustered X by (action_id, classifier).
So the better query would be:
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
Is there anything wrong above? I can't find any performance gain between these two queries.
Query 1 and result (with no TABLESAMPLE):
SELECT COUNT(*) FROM X
WHERE action_id='aaa' AND classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 s
--------------------------------------------------------------------------------
It scans the full data.
Query 2 and result:
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.82 s
--------------------------------------------------------------------------------
It ALSO scans the full data.
The query 2 result I expected is something like the following (using 1 map, and relatively faster than without TABLESAMPLE):
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.xx s
--------------------------------------------------------------------------------
The values of action_id and classifier are well distributed and there is no data skew.
So what would be the correct query to target only 1 bucket and use 1 map?