ClickHouse: Difference in row results between two versions

In version 19.13.3.26 the following query returns 1 row:
select -1 as brandidTotal, toDecimal64(Sum(CostpriceSEK),2) as costprice
from mytable
where Stylenumber = 'a row which does not exist'
group by brandidTotal
But in version 22.2.2.1 it returns an empty result (which I can understand, since the WHERE clause does not match any rows).
It seems like the aggregate function SUM has changed behaviour (if the second column is removed, both versions return an empty set).
Is it possible to make version 22.x handle it like 19.x does?

This behaviour is controlled by the setting empty_result_for_aggregation_by_constant_keys_on_empty_set: "Return empty result when aggregating by constant keys on empty set."
select -1 x, count() from (select 1 yyy where 0) group by x;
0 rows in set. Elapsed: 0.002 sec.
set empty_result_for_aggregation_by_constant_keys_on_empty_set=0;
select -1 x, count() from (select 1 yyy where 0) group by x;
┌──x─┬─count()─┐
│ -1 │       0 │
└────┴─────────┘
To enable it by default for all users:
cat /etc/clickhouse-server/users.d/const_aggr_emp.xml
<?xml version="1.0" ?>
<yandex>
    <profiles>
        <default>
            <empty_result_for_aggregation_by_constant_keys_on_empty_set>0</empty_result_for_aggregation_by_constant_keys_on_empty_set>
        </default>
    </profiles>
</yandex>
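Alternatively, the setting can be applied per query without touching the server config, using a SETTINGS clause; a minimal sketch with the query from the question:
select -1 as brandidTotal, toDecimal64(Sum(CostpriceSEK), 2) as costprice
from mytable
where Stylenumber = 'a row which does not exist'
group by brandidTotal
settings empty_result_for_aggregation_by_constant_keys_on_empty_set = 0; -- restores the 19.x behaviour for this query only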

Related

Data skipping index for Map or pair-wise arrays in ClickHouse?

I am migrating a table from Postgres to ClickHouse, and one of the columns is a jsonb column which includes custom attributes. These attributes can be different per tenant, hence we currently have 100k different custom attribute keys stored in Postgres.
I checked ClickHouse's semi-structured JSON data options, and it seems we can use either Map(String, String) or two Array(String) columns holding the keys and values.
However, I cannot make a proper assessment of which one is best, as I get pretty similar results.
To test performance I created the following table:
CREATE TABLE maptest
(
    `k` Int64,
    `keys` Array(String),
    `values` Array(String),
    `map` Map(String, String)
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 8192;
insert into maptest
select
    number,
    mapKeys(map(concat('custom', toString(number % 87000)), toString(number % 87000))),
    mapValues(map(concat('custom', toString(number % 87000)), toString(number % 87000))),
    map(concat('custom', toString(number % 87000)), toString(number % 87000))
from numbers(200000000);
-- the data looks like this:
SELECT *
FROM maptest
LIMIT 1
Query id: 9afcb888-94d9-42ec-a4b3-1d73b8cadde0
┌─k─┬─keys────────┬─values─┬─map─────────────┐
│ 0 │ ['custom0'] │ ['0']  │ {'custom0':'0'} │
└───┴─────────────┴────────┴─────────────────┘
However, whichever method I choose to query for a specific key-value pair, the whole table always gets scanned, e.g.:
SELECT count()
FROM maptest
WHERE length(arrayFilter((v, k) -> ((k = 'custom2') AND (v = '2')), values, keys)) > 0
┌─count()─┐
│    2299 │
└─────────┘
1 row in set. Elapsed: 10.541 sec. Processed 200.00 million rows, 9.95 GB (18.97 million rows/s., 943.85 MB/s.)
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2'
┌─count()─┐
│    2299 │
└─────────┘
1 row in set. Elapsed: 11.142 sec. Processed 200.00 million rows, 8.35 GB (17.95 million rows/s., 749.32 MB/s.)
SELECT count()
FROM maptest
WHERE (values[indexOf(keys, 'custom2')]) = '2'
┌─count()─┐
│    2299 │
└─────────┘
1 row in set. Elapsed: 3.458 sec. Processed 200.00 million rows, 9.95 GB (57.83 million rows/s., 2.88 GB/s.)
Any suggestions on data skipping indexes for either of the two options?
You can add a data skipping index for a Map field, although you will need to set a lower index_granularity to find the optimal balance between index size and how many granules will be skipped. You should build your index using the mapValues (or mapKeys, depending on your needs) map function:
CREATE TABLE maptest
(
    `k` Int64,
    `keys` Array(String),
    `values` Array(String),
    `map` Map(String, String),
    INDEX b mapValues(map) TYPE tokenbf_v1(2048, 16, 42) GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY k
SETTINGS index_granularity = 2048; -- < lowered index_granularity!
insert into maptest
select
    number,
    mapKeys(map(concat('custom', toString(number % 87000)), toString(number % 87000))),
    mapValues(map(concat('custom', toString(number % 87000)), toString(number % 87000))),
    map(concat('custom', toString(number % 87000)), toString(number % 87000))
from numbers(20000000);
Now let's test it:
set send_logs_level='trace';
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
(...)
[LAPTOP-ASLS2SOJ] 2023.02.01 11:44:52.171103 [ 96 ] {3638972e-baf3-4b48-bf10-7b944e46fc64} <Debug> default.maptest (11baab32-a7a8-4b0f-b879-ad1541cbe282) (SelectExecutor): Index `b` has dropped 9123/9767 granules.
(...)
┌─count()─┐
│     230 │
└─────────┘
(...)
1 row in set. Elapsed: 0.107 sec. Processed 1.32 million rows, 54.52 MB (12.30 million rows/s., 508.62 MB/s.)
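If you prefer not to dig through trace logs, newer ClickHouse versions can also report skipping-index usage via EXPLAIN; a quick sketch (the exact output shape varies by version):
EXPLAIN indexes = 1
SELECT count()
FROM maptest
WHERE (map['custom2']) = '2';
-- the plan output includes, per index, how many parts/granules were selected vs. dropped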

ClickHouse: Get value from JSON

I use the ClickHouse database. There is a table with a string column (data). All rows contain data like:
'[{"a":23, "b":1}]'
'[{"a":7, "b":15}]'
I want to get all values of the key "b":
1
15
The following query:
Select JSONExtractInt('data', 0, 'b') from table
returns 0 every time. How can I get the values of the key "b"?
SELECT tupleElement(JSONExtract(j, 'Array(Tuple(a Int64, b Int64))'), 'b')[1] AS res
FROM
(
    SELECT '[{"a":23, "b":1}]' AS j
    UNION ALL
    SELECT '[{"a":7, "b":15}]'
)
┌─res─┐
│   1 │
└─────┘
┌─res─┐
│  15 │
└─────┘
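For what it's worth, the JSONExtractInt approach from the question appears to fail for two reasons: 'data' is passed as a string literal rather than the column name, and ClickHouse JSON indices are 1-based, not 0-based. A minimal sketch with the literal from the question:
SELECT JSONExtractInt('[{"a":7, "b":15}]', 1, 'b') AS res -- 1 = first array element, 'b' = key; returns 15
With the real column it would read JSONExtractInt(data, 1, 'b').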

Hive: Exception when using LAG with window function

I'm trying to calculate the time difference between 2 rows and applied the solution from this SO question. However, I get an exception:
> org.apache.hive.service.cli.HiveSQLException: Error while compiling
> statement: FAILED: SemanticException Failed to breakup Windowing
> invocations into Groups. At least 1 group must only depend on input
> columns. Also check for circular dependencies. Underlying error:
> Expecting left window frame boundary for function
> LAG((tok_table_or_col time), 1, 0) Window
> Spec=[PartitioningSpec=[partitionColumns=[(tok_table_or_col
> client_id)]orderColumns=[(tok_table_or_col time) ASC
> NULLS_FIRST]]window(type=ROWS, start=1 PRECEDING, end=currentRow)] as
> LAG_window_0 to be unbounded. Found : 1
HiveQL:
SELECT id, loc, LAG(time, 1, 0) OVER (PARTITION BY id, loc ORDER BY time ROWS 1 PRECEDING) - time AS response_time FROM mytable
How do I fix this? What is the issue?
EDIT:
Sample data:
id  loc  time
0   1    1414250523591
0   1    1414250523655
1   2    1414250523655
1   2    1414250523661
1   3    1414250523661
1   3    1414250523662
And what I want is the difference in time between rows with the same id and loc (always pairs of 2).
EDIT2: I should also mention that I'm new to the Hadoop/Hive ecosystem.
As the error said, the window should be unbounded, so I just removed the ROWS clause. Now it at least does something, but the result is still wrong. So I wanted to check what the LAG value actually is:
SELECT id, loc, LAG(time, 1) OVER (PARTITION BY id, loc ORDER BY time) AS lag_col FROM mytable
And I get this as output:
id  loc  lag_col
1   2    null
1   2    -1
1   3    null
1   3    -1
The null is clear because I removed the default value, but why -1? Are the large values in the time column leading to some kind of overflow? The column is defined as bigint, so it should actually fit without a problem, but maybe there is a conversion to int during the query?
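For reference, a minimal sketch of the corrected query, following the error's hint that the LAG frame must be unbounded (the ROWS clause is dropped) and subtracting the previous time from the current one so the difference comes out positive; table and column names as in the question:
SELECT id, loc,
       time - LAG(time) OVER (PARTITION BY id, loc ORDER BY time) AS response_time
FROM mytable;
-- the first row of each (id, loc) group yields NULL, since it has no predecessor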

DateWise Query with sum count in Oracle

Below is my actual result set in the Oracle database:
TIMESTAMP   SUCESS  FAILURE
26-01-2017  1       0
31-01-2017  0       1
If I select from 26-01-2017 to 31-01-2017, the query has to return the expected result set below:
Timestamp  26-01-2017  27-01-2017  28-01-2017  29-01-2017  30-01-2017  31-01-2017
Sucess     1           0           0           0           0           0
Failure    0           0           0           0           0           1
Can anyone give me suggestions on how to write the logic for the above expected result set?
You would need a PIVOT (I made the assumption that you always have either a success or a failure):
select * from (
    select decode(success, 1, 'success', 'failure') as res_name,
           success + failure as res,
           to_char(time_stamp, 'DD-MM-YYYY') as ts
    from your_table)
pivot (max(res) for ts in ('26-01-2017', '27-01-2017', '28-01-2017', '29-01-2017', '30-01-2017', '31-01-2017'))
The list of columns is always defined up front, so if you need a variable list of columns you either need to generate the above query dynamically or use PIVOT XML. With PIVOT XML you can use a subquery instead of a predefined list of values, but you get XML back.
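For reference, a hedged sketch of the PIVOT XML variant, reusing the same inline view; the subquery in the IN list is only allowed with PIVOT XML, and the pivoted result comes back as a single XMLTYPE column:
select * from (
    select decode(success, 1, 'success', 'failure') as res_name,
           success + failure as res,
           to_char(time_stamp, 'DD-MM-YYYY') as ts
    from your_table)
pivot xml (max(res) for ts in (select distinct to_char(time_stamp, 'DD-MM-YYYY') from your_table))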

Passing dynamic values to order records in oracle

I want to sort records in the following way:
1. Arrange records in groups (by the ID column).
2. Sort the step 1 results in ascending order (by the NAME column).
2.1. If the NAME column has the same values, then order by the FLAG column value (ascending order).
3. Order the step 2 results by the ORDER_ASSIST column (I will be passing a dynamic value to sort using the order assist column).
My Query:
SELECT IDENTIFIER, CODE, INC_EXC_FLAG, ORDER_ASSIST
FROM DUMMY_SORT
WHERE METHOD_ID = '1'
GROUP BY (IDENTIFIER, CODE, INC_EXC_FLAG, ORDER_ASSIST)
ORDER BY ORDER_ASSIST ASC, CODE ASC, INC_EXC_FLAG ASC
Result of the above query:
ID         NAME  FLAG  ORDER_ASSIST
A_EC       AEC   0     EC1
B_EC_DET   BEC   1     EC2
A_NIT      ANIT  0     NIT1
A_NIT      ANIT  1     NIT1
A_NIT      BNIT  0     NIT1
B_NIT_DET  BNIT  0     NIT2
B_NIT_DET  BNIT  1     NIT2
A_SC       ASC   0     SC1
A_SC       ASC   1     SC1
B_SC_DET   BSC   0     SC2
B_SC_DET   BSC   1     SC2
C_SC_FUN   CSC   0     SC3
D_SC_GRP   DSC   0     SC4
But I want to generate the result according to the dynamic value of ORDER_ASSIST.
For example:
If I pass the dynamic value "SC", I want to order the records SC1, SC2, SC3 first, then NIT1, NIT2, then EC1, EC2.
If I pass the dynamic value "NITG", I want to order the records NIT1, NIT2 first, then SC1, SC2, SC3, then EC1, EC2.
Expected result when the dynamic value is "SC":
ID         NAME  FLAG  ORDER_ASSIST
A_SC       ASC   0     SC1
A_SC       ASC   1     SC1
B_SC_DET   BSC   0     SC2
B_SC_DET   BSC   1     SC2
C_SC_FUN   CSC   0     SC3
D_SC_GRP   DSC   0     SC4
A_NIT      ANIT  0     NIT1
A_NIT      ANIT  1     NIT1
A_NIT      BNIT  0     NIT1
B_NIT_DET  BNIT  0     NIT2
B_NIT_DET  BNIT  1     NIT2
A_EC       AEC   0     EC1
B_EC_DET   BEC   1     EC2
Sounds like maybe you're after something like:
order by case when p_sort_param = 'SC'   and order_assist like 'SC%'  then 1
              when p_sort_param = 'SC'   and order_assist like 'NIT%' then 2
              when p_sort_param = 'NITG' and order_assist like 'NIT%' then 1
              when p_sort_param = 'NITG' and order_assist like 'SC%'  then 2
              else 3
         end,
         order_assist
where p_sort_param is the parameter that gets passed in to provide the "dynamic" value. This assumes you're running the query via a stored procedure. If it's a manually run query (e.g. in Toad), then add a colon in front of the parameter name to make it :p_sort_param.
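For a quick manual test of the bind-variable form, a sketch using standard SQL*Plus syntax (table and column names taken from the question):
variable p_sort_param varchar2(10)
exec :p_sort_param := 'SC'

select identifier, code, inc_exc_flag, order_assist
from dummy_sort
where method_id = '1'
order by case when :p_sort_param = 'SC'   and order_assist like 'SC%'  then 1
              when :p_sort_param = 'SC'   and order_assist like 'NIT%' then 2
              when :p_sort_param = 'NITG' and order_assist like 'NIT%' then 1
              when :p_sort_param = 'NITG' and order_assist like 'SC%'  then 2
              else 3
         end,
         order_assist;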
I cannot understand your specific ordering rules, but you should be able to achieve what you want using CASE expressions:
order by
    case order_assist
        when 'SC'   then <first thing to order by for SC>
        when 'NITG' then <first thing to order by for NITG>
        ...
    end,
    case order_assist
        when 'SC'   then <second thing to order by for SC>
        when 'NITG' then <second thing to order by for NITG>
        ...
    end,
    ... etc.
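For instance, a hedged sketch that fills in the skeleton by deriving the group from the leading letters of ORDER_ASSIST (p_sort_param is an assumed parameter name; decode and regexp_substr are standard Oracle functions):
order by
    case
        -- regexp_substr strips the trailing digits, so 'SC1' -> 'SC', 'NIT2' -> 'NIT', 'EC1' -> 'EC'
        when p_sort_param = 'SC'   then decode(regexp_substr(order_assist, '^[A-Z]+'), 'SC', 1, 'NIT', 2, 3)
        when p_sort_param = 'NITG' then decode(regexp_substr(order_assist, '^[A-Z]+'), 'NIT', 1, 'SC', 2, 3)
    end,
    order_assist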
