I have Kafka integration objects:
CREATE TABLE topic_kafka
(
topic_data String
) ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'kafka:9092',
kafka_topic_list = 'topic',
kafka_group_name = 'clickhouse_group',
kafka_format = 'JSONAsString',
kafka_num_consumers = 1;
CREATE TABLE topic
(
time DateTime64(3),
user_id Int32 NOT NULL,
version String
) ENGINE = MergeTree()
ORDER BY (user_id, time);
CREATE MATERIALIZED VIEW topic_consumer
TO topic AS
SELECT
JSONExtract(topic_data, 'time', 'Int64') as time,
toInt32(JSON_VALUE(topic_data, '$.data.user_id')) as user_id,
JSON_VALUE(topic_data, '$.data.version') as version
FROM topic_kafka;
And a Kafka topic of JSON data with nested objects, like this:
{"time":1639387657456,"data":{"user_id":42,"version":"1.2.3"}}
The problem is that time has values 2282-12-31 00:00:00.000 in the topic table.
It can also be checked with the following query:
select cast (1639387657456 as DateTime64(3)) as dt
But for the DML query below, implicit date conversion works fine, as the documentation states:
insert into topic (time, user_id) values ( 1640811600000, 42)
I've found that such a cast works fine too:
select cast (1639387657.456 as DateTime64(3)) as dt
Looks like I've missed something from the documentation.
What is the problem with view topic_consumer above? Is it ok to divide milliseconds by 1000 to convert it to DateTime explicitly?
Use fromUnixTimestamp64Milli:
https://clickhouse.com/docs/en/sql-reference/functions/type-conversion-functions/#tounixtimestamp64nano
select fromUnixTimestamp64Milli(toInt64(1640811600000));
┌─fromUnixTimestamp64Milli(toInt64(1640811600000))─┐
│                          2021-12-29 21:00:00.000 │
└──────────────────────────────────────────────────┘
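Applied to the materialized view from the question, a sketch of the fix could look like this (same table names and JSON layout as above; not tested against the original setup, and the old view has to be dropped and recreated):
DROP TABLE IF EXISTS topic_consumer;
CREATE MATERIALIZED VIEW topic_consumer
TO topic AS
SELECT
    fromUnixTimestamp64Milli(JSONExtract(topic_data, 'time', 'Int64')) AS time,
    toInt32(JSON_VALUE(topic_data, '$.data.user_id')) AS user_id,
    JSON_VALUE(topic_data, '$.data.version') AS version
FROM topic_kafka;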
Related
I want to create a materialized view in ClickHouse that stores the final product of an aggregation function. The best practice is to store the state and calculate the final result at query time, but that is too costly at query time for my use case.
Base table:
CREATE TABLE IF NOT EXISTS active_events
(
`event_name` LowCardinality(String),
`user_id` String,
`post_id` String
)
My current materialization:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
`post_id` String,
`event_name` LowCardinality(String),
`unique_users_state` AggregateFunction(uniq, String)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
POPULATE AS
SELECT
post_id,
event_name,
uniqState(user_id) unique_users_state
FROM active_events
GROUP BY post_id, event_name
And then at query time, I can use uniqMerge to calculate the exact number of users who've done a certain event.
I don't mind a small delay in the materialization but I want the full product to be calculated during ingestion rather than the query.
Here's the query:
SELECT post_id, sumIf(total, event_name = 'click') / sumIf(total, event_name = 'impression') as ctr
FROM (
SELECT post_id, event_name, uniqMerge(unique_users_state) as total
FROM inventory
WHERE event_name IN ('click', 'impression')
GROUP BY post_id, event_name
) as res
GROUP BY post_id
HAVING ctr > 0.1
ORDER BY ctr DESC
It's literally impossible.
Imagine you insert some user_id, say 3456, into the table: how many unique users is that? One. But you cannot store the value 1, because if you insert 3456 again, the count should still be 1. So ClickHouse stores states, and they are HLL (HyperLogLog) structures that are not fully aggregated/calculated, because you may later query GROUP BY event_name, or GROUP BY event_name, post_id, or with no GROUP BY at all.
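A quick way to see why only the state can be stored (a sketch, not from the original answer): two separately produced states for the same user_id still merge to a count of 1.
SELECT uniqMerge(s) AS unique_users
FROM
(
    SELECT uniqState('3456') AS s   -- state produced by one insert
    UNION ALL
    SELECT uniqState('3456') AS s   -- state produced by a later insert of the same user_id
);
-- returns 1: the final number only exists after merging the stored states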
Another reason your query may be slow: you did not originally provide it, so I can only guess that the issue is index_granularity, and ClickHouse reads a lot of excessive data from disk if you query WHERE event_name = ...
It can be solved like this:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
`post_id` String,
`event_name` LowCardinality(String),
`unique_users_state` AggregateFunction(uniq, String) CODEC(NONE) -- uniq is not compressible in 99% cases
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
SETTINGS index_granularity = 256 -- 256 instead of the default 8192.
Another approach is to use a different HLL function, because uniq is too heavy.
Try this:
CREATE MATERIALIZED VIEW IF NOT EXISTS inventory
(
`post_id` String,
`event_name` LowCardinality(String),
`unique_users_state` AggregateFunction(uniqCombined64(14), String) CODEC(NONE)
)
ENGINE = AggregatingMergeTree
ORDER BY (event_name, post_id)
SETTINGS index_granularity = 256
POPULATE AS
SELECT
post_id,
event_name,
uniqCombined64State(14)(user_id) unique_users_state
FROM active_events
GROUP BY post_id, event_name
select uniqCombined64Merge(14)(unique_users_state)
from inventory
where event_name = ...
Attention: you need to use (14) in all 3 places: uniqCombined64(14) / uniqCombined64State(14) / uniqCombined64Merge(14).
uniqCombined64(14) is less accurate than uniq, but it can be 10-100 times faster in some cases, with an error rate below 5%.
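A rough way to compare the two functions on synthetic data (a sketch; the exact error depends on your data and on the chosen precision parameter):
SELECT
    uniq(number)               AS with_uniq,
    uniqCombined64(14)(number) AS with_uniqCombined64_14
FROM numbers(1000000);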
I'm trying to edit this code to be dynamic, as I'm going to schedule it to run.
Normally I would input the date in the WHERE statement as 'YYYY-MM-DD', so to make it dynamic I changed it to DATE(). I'm not erroring out, but I'm also not pulling any data. I just need help with the format, and my Google searching isn't helping.
PROC SQL;
CONNECT TO Hadoop (server=disregard this top part);
CREATE TABLE raw_daily_fcast AS SELECT * FROM connection to Hadoop(
SELECT DISTINCT
a.RUN_DATE,
a.SCHEDSHIPDATE,
a.SOURCE,
a.DEST ,
a.ITEM,
b.U_OPSTUDY,
a.QTY,
c.case_pack_qty
FROM CSO.RECSHIP a
LEFT JOIN CSO.UDT_ITEMPARAM b
ON a.ITEM = b.ITEM
LEFT JOIN SCM.DIM_PROD_PLN c
ON a.ITEM = c.PLN_NBR
WHERE a.RUN_DATE = DATE()
AND a.SOURCE IN ('88001', '88003', '88004', '88006', '88008', '88010', '88011', '88012',
'88017', '88018', '88024', '88035', '88040', '88041', '88042', '88047')
);
DISCONNECT FROM Hadoop;
QUIT;
When RUN_DATE is a string, you can generate the current date string in-line on the SAS side:
WHERE a.RUN_DATE = %str(%')%sysfunc(date(),yymmdd10.)%str(%')
AND ...
or
WHERE a.RUN_DATE = %sysfunc(quote(%sysfunc(date(),yymmdd10.),%str(%')))
AND ...
If RUN_DATE is a string containing DATE9-formatted values, change yymmdd10. to date9.
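For example, assuming RUN_DATE is stored as a yyyy-mm-dd string, the macro expression resolves on the SAS side before the query is shipped, so the pass-through SQL that Hadoop receives ends up containing an ordinary quoted literal (hypothetical run date shown):
WHERE a.RUN_DATE = '2021-12-29'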
change:
WHERE a.RUN_DATE = DATE()
to:
WHERE a.RUN_DATE = PUT(date(), YYMMDD10.)
I have a query as below which returns the expected records when run from SQL Developer:
SELECT *
FROM MY_TABLE WHERE ( CRT_TS > TO_DATE('25-Aug-2016 15:08:18', 'DD-MON-YYYY HH24:MI:SS')
or UPD_TS > TO_DATE('25-Aug-2016 15:08:18', 'DD-MON-YYYY HH24:MI:SS'));
I think we should not need to apply TO_DATE when passing a java.util.Date object as a date parameter, but the code snippet below silently returns 0 records.
My SQL query in Java class is as below:
SELECT *
FROM MY_TABLE WHERE ( CRT_TS > :lastSuccessfulReplicationTimeStamp1
or UPD_TS > :lastSuccessfulReplicationTimeStamp2);
The code which executes the above query is as below:
parameters.put("lastSuccessfulReplicationTimeStamp1", new java.sql.Date(outputFileMetaData.getLastSuccessfulReplicationTimeStamp().getTime()));
parameters.put("lastSuccessfulReplicationTimeStamp2", new java.sql.Date(outputFileMetaData.getLastSuccessfulReplicationTimeStamp().getTime()));
list = namedParameterJdbcTemplateOracle.query(sql, parameters, myTabRowMapper);
Please advise.
I guess you already found the answer but if anybody else needs it, here's what I've found:
java.sql.Date doesn't have a time portion, just the date fields. Either use java.sql.Timestamp or java.util.Date. Both seem to work for me with NamedParameterJdbcTemplate.
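For reference, the difference on the Oracle side is between a date-only comparison and one that keeps the time of day (a sketch with hypothetical literals; the actual fix is the parameter type change described in this answer):
-- a value bound as a timestamp keeps the time portion
SELECT * FROM MY_TABLE WHERE CRT_TS > TIMESTAMP '2016-08-25 15:08:18';
-- a date-only value compares as midnight of that day
SELECT * FROM MY_TABLE WHERE CRT_TS > DATE '2016-08-25';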
A little variation on the above solution is when your input (lastSuccessfulReplicationTimeStamp1/lastSuccessfulReplicationTimeStamp2) is a String instead of a Date/Timestamp, which is what I was looking for and found at this link; maybe it can help someone:
MapSqlParameterSource parameters = new MapSqlParameterSource();
parameters.addValue("lastSuccessfulReplicationTimeStamp1", lastSuccessfulReplicationTimeStamp1, Types.TIMESTAMP);
parameters.addValue("lastSuccessfulReplicationTimeStamp2", lastSuccessfulReplicationTimeStamp2, Types.TIMESTAMP);
list = namedParameterJdbcTemplateOracle.query(sql, parameters, myTabRowMapper);
When I run this code,
$entry_id = Entry::where(function($q)use($intern,$input){
$q->where('interna_id',$intern['id']);
$q->where('created_at','=','2015-06-06');
})->first(['id']);
it gives
select `id` from `entries` where (`interna_id` = 1 and `created_at` = 2015-06-06) limit 1
and as you can see, the date is not enclosed in quotes. This causes a problem and MySQL will not be able to find the record. For example, if I add the quotes, like so,
select `id` from `entries` where (`interna_id` = 1 and `created_at` = '2015-06-06') limit 1
The query is able to retrieve the correct record.
Note: created_at uses the Date format, not DateTime.
created_at is of type date for Laravel. It uses the Carbon package and expects such an object. You would have to do something like:
$entry_id = Entry::where(function($q)use($intern,$input){
$q->where('interna_id',$intern['id']);
$q->where('created_at','=', Carbon::parse('2015-06-06'));
})->first(['id']);
Please note you have to include the library at the top of the file using:
use Carbon\Carbon;
Alternatively you could use DB::raw for the date as well (untested):
$entry_id = Entry::where(function($q)use($intern,$input){
$q->where('interna_id',$intern['id']);
$q->where('created_at','=', \DB::raw("'2015-06-06'"));
})->first(['id']);
I need some help optimizing the following two queries, which are almost identical but select slightly different data. Here is my table definition:
CREATE TABLE public.rates (
rate_id bigserial NOT NULL,
prefix varchar(50),
rate_name varchar(30),
rate numeric(8,6),
intrastate_cost numeric(8,6),
interstate_cost numeric(8,6),
status char(3) DEFAULT 'act'::bpchar,
min_duration integer,
call_increment integer,
connection_cost numeric(8,6),
rate_type varchar(3) DEFAULT 'lcr'::character varying,
owner_type varchar(10),
start_date timestamp WITHOUT TIME ZONE,
end_date timestamp WITHOUT TIME ZONE,
rev integer,
ratecard_id integer,
/* Keys */
CONSTRAINT rates_pkey
PRIMARY KEY (rate_id)
) WITH (
OIDS = FALSE
);
and the two queries I am using,
SELECT
rates.* ,
rc.ratecard_prefix ,
rc.default_lrn ,
rc.lrn_lookup_method ,
customers.customer_id ,
customers.balance ,
customers.channels AS customer_channels ,
customers.cps AS customer_cps ,
customers.balance AS customer_balance
FROM
rates
JOIN ratecards rc
ON rc.card_type = 'customer' AND
rc.ratecard_id = rates.ratecard_id
JOIN customers
ON rc.customer_id = customers.customer_id
WHERE
customers.status = 'act' AND
rc.status = 'act' AND
rc.customer_id = 'AB8KA191' AND
owner_type = 'customer' AND
'17606109973' LIKE concat(rc.ratecard_prefix, rates.prefix, '%') AND
rates.status = 'act' AND
now() BETWEEN rates.start_date AND
rates.end_date AND
customers.balance > 0
ORDER BY
LENGTH(PREFIX) DESC LIMIT 1;
and the second one,
SELECT
*
FROM
rates
JOIN ratecards rc
ON rc.card_type = 'carrier' AND
rc.ratecard_id = rates.ratecard_id
JOIN carriers
ON rc.carrier_id = carriers.carrier_id
JOIN carrier_switches cswitch
ON carriers.carrier_id = cswitch.carrier_id
WHERE
rates.intrastate_cost < 0.011648 AND
owner_type = 'carrier' AND
'16093960411' LIKE concat(rates.prefix, '%') AND
rates.status = 'act' AND
carriers.status = 'act' AND
now() BETWEEN rates.start_date AND
rates.end_date AND
rates.intrastate_cost <> -1 AND
cswitch.enabled = 't' AND
rates.rate_type = 'lrn' AND
rates.min_duration >= 6
ORDER BY
rates.intrastate_cost ASC,
LENGTH(rates.prefix) DESC,
cswitch.priority DESC
I created an index on the owner_type field (not shown in the schema above), but the query performance is not what I expected. CPU usage becomes too high on the DB server and everything starts to slow down. The EXPLAIN output for the first query is here and for the second one is here.
When there are fewer records in the table, things work fine, naturally, but as the number of records increases, CPU usage goes up. I currently have around 341821 records in the table.
How can I improve the query execution or possibly change the query in order to speed things up?
I have set enable_bitmapscan = off because I think this gives me better performance. If set to on, every index scan is followed up with a Bitmap heap scan.
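Rather than leaving enable_bitmapscan off globally, it may be worth comparing plans for the same query with the setting toggled in a single session, e.g. (a sketch using a simplified stand-in for the real query):
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT rates.*
FROM rates
WHERE rates.status = 'act'
  AND now() BETWEEN rates.start_date AND rates.end_date;

SET enable_bitmapscan = on;
-- re-run the same EXPLAIN and compare the "Buffers" counts and timings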
Things did ease up a little bit after changing the query to:
SELECT
rates.*,
rc.ratecard_prefix,
rc.default_lrn,
rc.lrn_lookup_method,
customers.customer_id,
customers.balance,
customers.channels AS customer_channels,
customers.cps AS customer_cps,
customers.balance AS customer_balance
FROM
rates
JOIN ratecards rc
ON rc.card_type = 'customer' AND
rc.ratecard_id = rates.ratecard_id
JOIN customers
ON rc.customer_id = customers.customer_id
WHERE
customers.status = 'act' AND
rc.status = 'act' AND
rc.customer_id = 'AB8KA191' AND
owner_type = 'customer' AND
(CONCAT(rc.ratecard_prefix, rates.prefix) IN ('16026813306',
'1602681330',
'160268133',
'16026813',
'1602681',
'160268',
'16026',
'1602',
'160',
'16',
'1')) AND
rates.status = 'act' AND
now() BETWEEN rates.start_date AND
rates.end_date AND
customers.balance > 0
ORDER BY
LENGTH(PREFIX) DESC LIMIT 1
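For reference, the prefix list used in the IN (...) above can be generated from the dialed number itself rather than hard-coded, e.g. (a sketch, not part of the original query):
SELECT left('16026813306', n) AS prefix
FROM generate_series(length('16026813306'), 1, -1) AS n;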
My postgresql.conf is here.
But each Postgres process still takes around 25%+ CPU. I am now also using pgbouncer for connection pooling, but it is still not helping.