I have a maybe slightly weird use case. I have to perform expensive counts on data and build snapshots from them (something like "number of users who can access this entity"). I store these numbers per entity with a timestamp (so basically "at this point in time, x users could access this entity").
Now it might be that the number doesn't change between snapshots because no access lists have been changed and/or no users have been added. This might actually even be the default case. So of course I would like to avoid having tens of thousands of identical lines ("5 users at 10pm", "5 users at 11pm", "5 users at 12pm" and so on). Therefore a ReplacingMergeTree comes to mind. The ORDER BY would be (entity, count).
There is a problem though. If I understand the documentation correctly, the ReplacingMergeTree would always keep the latest row, so the timestamp would change. I would like to keep the oldest timestamp instead, so I know the first time this count was calculated. I can then use that to fill the gaps (if the count is 3h old and there is no newer count in between, the same count can obviously be assumed for 2h ago and 1h ago).
Is there any way to achieve this?
The only thing that comes to mind is using a UInt as the version, starting at MaxUint and decrementing. But this feels slightly weird.
The best way I have found so far is to add a version column: UInt32 DEFAULT 4294967295 - toUnixTimestamp(create_time)
CREATE TABLE test
(
`id` UInt8,
`value` UInt8,
`version` UInt32 DEFAULT 4294967295 - toUnixTimestamp(create_time),
`create_time` DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(version)
ORDER BY id;
INSERT INTO test (id, value) VALUES (1,1);
INSERT INTO test (id, value) VALUES (1,2);
SELECT * FROM test;
┌─id─┬─value─┬────version─┬─────────create_time─┐
│ 1 │ 1 │ 2670403264 │ 2021-06-25 03:47:11 │
└────┴───────┴────────────┴─────────────────────┘
┌─id─┬─value─┬────version─┬─────────create_time─┐
│ 1 │ 2 │ 2670403251 │ 2021-06-25 03:47:24 │
└────┴───────┴────────────┴─────────────────────┘
SELECT * FROM test FINAL;
┌─id─┬─value─┬────version─┬─────────create_time─┐
│ 1 │ 1 │ 2670403264 │ 2021-06-25 03:47:11 │
└────┴───────┴────────────┴─────────────────────┘
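As a side note (not part of the original answer), the same "oldest row wins" result can also be obtained at query time, without FINAL, by aggregating on the version column. A minimal sketch against the test table above:

-- argMax picks the value from the row with the highest version,
-- which, thanks to the inverted version, is the row with the earliest create_time.
SELECT
    id,
    argMax(value, version) AS first_value,
    argMax(create_time, version) AS first_seen
FROM test
GROUP BY id;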
I have the following query with potentially infinite execution time. It makes no sense why it was issued to the ClickHouse server, but it has already been launched and is still running:
SELECT Count("SN".*) FROM (SELECT sleepEachRow(3) FROM system.numbers) "SN"
Okay, let's find the associated query_id (or we already have one). For instance, query_id = 'd02f4bdb-8928-4347-8641-4da4b9c0f486'. Let's kill it with the following query:
KILL QUERY WHERE query_id = 'd02f4bdb-8928-4347-8641-4da4b9c0f486'
The result of the KILL query looks fine at first glance:
┌─kill_status─┬─query_id─────────────────────────────┬─user────┬─query────────────────────────────────────────────────────────────────────────┐
│ waiting │ d02f4bdb-8928-4347-8641-4da4b9c0f486 │ default │ SELECT Count("SN".*) FROM (SELECT sleepEachRow(3) FROM system.numbers) "SN"; │
└─────────────┴──────────────────────────────────────┴─────────┴──────────────────────────────────────────────────────────────────────────────┘
Okay, let's wait several seconds and make sure the original query has been terminated successfully. Let's check it with the following query against the system tables:
SELECT "query_id", "query", "is_cancelled" FROM system.processes WHERE query_id = 'd02f4bdb-8928-4347-8641-4da4b9c0f486';
Unfortunately, the original query is still running in some sense. It has switched to the "is_cancelled" state and still hangs:
┌─query_id─────────────────────────────┬─query────────────────────────────────────────────────────────────────────────┬─is_cancelled─┐
│ d02f4bdb-8928-4347-8641-4da4b9c0f486 │ SELECT Count("SN".*) FROM (SELECT sleepEachRow(3) FROM system.numbers) "SN"; │ 1 │
└──────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────┴──────────────┘
After waiting an hour or more, the result is the same: the original query still hangs in the "is_cancelled" state, and subsequent KILL queries with the same query_id do nothing.
Most likely, restarting the server would solve the problem, but I do not want to do this. How can I deal with a stuck query without restarting the server?
ClickHouse queries can't be killed during the sleep.
If you are using a recent CH release (21.12+), then the KILL flag will be checked after each block is processed (on older releases it might never be checked). Since the default block size is 65536 rows, the query will sleep for 65536 * 3 seconds ≈ 54 hours before checking anything.
In future releases of CH it will be impossible to sleep for more than 3 seconds (which is currently the limit for sleep but not for sleepEachRow). In the meantime you can either wait or restart the server.
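To illustrate the block-size arithmetic above (this is not part of the original answer, and it would not help a query that is already running), a smaller max_block_size makes the cancellation flag get checked much sooner:

-- Hypothetical illustration: with max_block_size = 100, one block takes
-- 100 * 3 s = 300 s, so the KILL flag is re-checked roughly every 5 minutes
-- instead of after ~54 hours with the default 65536-row blocks.
SELECT sleepEachRow(3) FROM system.numbers SETTINGS max_block_size = 100;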
I can't find a way to extract a table's expiration date other than via the console (https://console.cloud.google.com/).
We maintain thousands of tables in BQ and we want to enforce the use of table expiration date - so the only way is to collect the data automatically.
Is this possible via query/CLI/Go/Python/Perl/whatever?
This can be done by querying INFORMATION_SCHEMA.TABLE_OPTIONS:
SELECT * FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
where option_name='expiration_timestamp'
The value will be in the option_value column.
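If you want just the table name and its expiration side by side, a slight variant of the same query should work (my_project and my_dataset are the placeholder names from above):

SELECT table_name, option_value AS expiration_timestamp
FROM `my_project.my_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS`
WHERE option_name = 'expiration_timestamp';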
If you want to extract the expiration time from a number of tables in a dataset, you can refer to this documentation [1] (the same query provided by @EldadT), which returns the catalog of tables in a dataset with the expiration time option.
Therefore, if you want to create a script in Python to get the result of this query, you can also check the BigQuery client library [2] to run that query and get the expiration time for each table in a dataset.
[1] https://cloud.google.com/bigquery/docs/tables#example_1_2
[2] https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-usage-python
You can use BigQuery CLI:
bq show mydataset.mytable
Output:
  Last modified        Schema          Total Rows   Total Bytes     Expiration       Time Partitioning   Clustered Fields   Labels
 ----------------- ------------------- ------------ ------------- ----------------- ------------------- ------------------ --------
  16 Aug 10:42:13   |- col_1: integer        7           106       21 Aug 10:42:13
                    |- col_2: string
I want to get the row ID or record ID of the last inserted record in a table in Trafodion.
Example:
1 | John
2 | Michael
When executing an INSERT statement, I want it to return the created ID, i.e. 3.
Could anyone tell me how to do that in Trafodion, or is it not possible?
Are you using a sequence generator to generate unique ids for this table? Something like this:
create table idcol (a largeint generated always as identity not null,
b int,
primary key(a desc));
Either way, with or without a sequence generator, you could get the highest key with this statement:
select max(a) from idcol;
The problem is that this statement could be very inefficient. Trafodion has a built-in optimization to read the min of a key column, but it doesn't use the same optimization for the max value, because HBase didn't have a reverse scan until recently. We should make use of the reverse scan; please feel free to file a JIRA. To make this more efficient with the current code, I added DESC to the primary key declaration. With a descending key, getting the max key will be very fast:
explain select max(a) from idcol;
However, having the data grow from higher to lower values might cause issues in HBase; I'm not sure whether this is a problem or not.
Here is yet another solution: Use the Trafodion feature that allows you to select the inserted data, showing you the inserted values right away:
select * from (insert into idcol(b) values (11),(12),(13)) t(a,b);
A B
-------------------- -----------
1 11
2 12
3 13
--- 3 row(s) selected.
I am running a query in Oracle SQL Developer which looks something like this:
select * from dummy_table where col1 < 10 and col2 < 20 and col3 < 40
and rownum <= x
The query takes around 3 seconds and returns x rows if the value of x is <= 12.
But if I replace x with anything greater than 12, the query takes more than 7 seconds and returns only 12 results (in other words, there are only 12 rows satisfying the WHERE clause).
Why is rownum behaving like this? I was expecting this query to take almost the same time when the value of x is changed from 12 to 13.
Edit: Another thing I noticed is that there is a composite index on col1, col2 and col3. If I remove the index (or disable it using a hint), the query runs quite fast.
It's difficult to give a complete explanation without knowing the table structure, the indexes, etc.
However, to keep it simple, if your table only has 12 rows matching your condition, asking for the first 12 rows means that Oracle simply looks for 12 rows and returns them, no matter the number of rows that do not match your condition.
If you ask for, say, 13 rows, Oracle needs to scan the whole table to check whether a 13th row exists.
So, without indexes and hints, asking for the first 13 rows where only 12 exist may need a full table scan, and this can be slow.
Please consider this a very simplified explanation that does not take indexes, cache, or hints into account. For example, we're not considering that even checking the performance of a query by simply running it may be misleading, because Oracle may use the cache, and you can see better performance after the first run.
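If you want to see what the optimizer actually does for each value of x, one way (a sketch, using the table and column names from the question) is to look at the execution plan:

-- Show the plan Oracle chose (e.g. index range scan vs. full table scan):
EXPLAIN PLAN FOR
SELECT * FROM dummy_table
WHERE col1 < 10 AND col2 < 20 AND col3 < 40 AND ROWNUM <= 13;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);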
I'm using DataStax Community v2.1.2-1 (AMI v2.5) with preinstalled default settings.
And I have a table:
CREATE TABLE notificationstore.note (
user_id text,
real_time timestamp,
insert_time timeuuid,
read boolean,
PRIMARY KEY (user_id, real_time, insert_time))
WITH CLUSTERING ORDER BY (real_time DESC, insert_time ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND default_time_to_live = 20160;
The other configurations are:
I have 2 nodes on m3.large, each having 1 x 32 (SSD).
I'm facing timeouts on this particular table even if consistency is set to ONE.
I increased the heap space to 3 GB (RAM size is 8 GB).
I increased the read timeout to 10 seconds.
select count(*) from note where user_id = 'xxx' limit 2; -- times out with: errors={}, last_host=127.0.0.1
I am wondering if the problem could be with the time to live, or is there any other configuration or tuning that matters here?
The data in the database is pretty small.
Also, this problem does not occur right after inserting; it happens after some time (more than 6 hours).
Thanks.
[Copying my answer from here because it's the same environment/problem: amazon ec2 - Cassandra Timing out because of TTL expiration.]
You're running into a problem where the number of tombstones (deleted values) is passing a threshold, and then timing out.
You can see this if you turn on tracing and then try your select statement, for example:
cqlsh> tracing on;
cqlsh> select count(*) from test.simple;
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------+--------------+--------------+----------------
...snip...
Scanned over 100000 tombstones; query aborted (see tombstone_failure_threshold) | 23:36:59,324 | 172.31.0.85 | 123932
Scanned 1 rows and matched 1 | 23:36:59,325 | 172.31.0.85 | 124575
Timed out; received 0 of 1 responses for range 2 of 4 | 23:37:09,200 | 172.31.13.33 | 10002216
You're kind of running into an anti-pattern for Cassandra where data is stored for just a short time before being deleted. There are a few options for handling this better, including revisiting your data model if needed. Here are some resources:
The cassandra.yaml configuration file - See section on tombstone settings
Cassandra anti-patterns: Queues and queue-like datasets
About deletes
For your sample problem, I tried lowering the gc_grace_seconds setting to 300 (5 minutes). That causes the tombstones to be cleaned up more frequently than the default 10 days, but that may or may not be appropriate for your application. Read up on the implications of deletes and you can adjust as needed for your application.
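For reference, the change described above can be applied with a single CQL statement (a sketch, using the table from the question):

-- Lower gc_grace_seconds so tombstones become eligible for cleanup after 5 minutes
-- instead of the default 10 days (864000 seconds).
ALTER TABLE notificationstore.note WITH gc_grace_seconds = 300;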