When is data really deleted in Clickhouse with the new `DELETE` query - clickhouse

Clickhouse recently released the new DELETE query, that allows you to quickly mark data as deleted (https://clickhouse.com/docs/en/sql-reference/statements/delete/).
The actual data deletion is done in the background afterwards, as is written in the docs.
My question is, is there some indication to when data would be deleted?
Or is there a way to make sure it gets deleted?
I'm asking for GDPR compliance purposes.
Thanks

For GDRP better use
ALTER TABLE ... DELETE WHERE ...
and poll
SELECT * FROM system.muations WHERE is_done=0
when mutation complete (is_done=1), that mean data deleted physically

According to CH support on slack:
If using the DELETE query, the only way to make sure data is deleted is to use the OPTIMIZE FINAL query.
This would trigger a merge and the deleted data won't be included in the outcome

Related

Clear a DynamoDb trigger iterator

Is it possible to "purge"/"clear" the iterator of a dynamoDb table's trigger ?
Context:
A lambda is processing updates on a given table using a trigger.
Badly formatted records are inserted in the table and the lamdba is unable to process it, stalling the iterator.
A few time after a fix is issued to insert correctly formatted records in the table. So we want to "fast forward" the processing of updates and skip the old ones?
I suppose deleting/recreating the trigger would do that. Is there a "better" way ?
Old question, but this can help somebody else:
as stated in the comments, I don't think it's possible to simply purge the queue.
However, when you re-create it, you can chose the "starting position" between:
"Latest" (will skip all previously stored messages)
"Trim horizon" (will process all the available messages)
Re-creating the trigger is pretty easy, that's what I have done in a similar case recently

Is it possible to query the un-modified data within an Oracle (9.2) Transaction?

I'm looking at doing a data-fix and need to be able to prove that the data I have intended to change is the only data changed. (For example - that a trigger hasn't modified additional columns that I wasn't expecting)
I've been looking at Oracle's Flashback Query process, which would be great, except that this is not enabled on the database in question.
Since this check would be carried out prior to committing the transaction, Oracle must have the "before" information squirreled away somewhere, and I wondered if there is any way of accessing this undo information?
Otherwise, I would potentially have to make a temporary copy of each table and do a compare between the live table and the backup, which may also result in inconsistencies between the backup query time and the update transaction time.
While I'm expecting the answer "no", I'm hoping someone can point me in a better direction than that which I appear to be headed at present.
Thanks!

Cassandra Best Practice on edits: delete & re-insert vs. update?

I am new to Cassandra. I am looking at many examples online. Here is one from JHipster Cassandra examples on GitHub:
https://gist.github.com/jdubois/c3d3bedb869466731316
The repository save(user) method does a read (to look for existence) then a delete and re-insert of the existing user across all the denormalized tables whenever the user data changed.
Is this best practice?
Is this only because of how the data model for this sample is designed?
Is this sample's design a result of twisting a POJO framework into a NoSQL database design?
When would I want to just do a update in Cassandra? It supports updates at the field-level, so it seems like that would be preferred.
First of all, the delete operations should be part of the batch for more robust error handling. But it looks like there are also some concurrency issues with the code. It will update the user based on the current user value read before. It's not save to assume this will still be the latest value while save() is actually executed. It will also just overwrite any keys in the lookup table that might be in use for a different user at that point. E.g. the login could already exist for another user while executing insertByLoginStmt.
It is not necessary to delete a row before inserting a new one.
But if you are replacing rows and new columns are different from existing columns then you need to delete all existing columns and insert new columns. Or insert new and delete old, does not matter if happens in batch.

MERGE in Vertica

I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be
combined to get the desired effect.
The merge sentence looks like this:
MERGE INTO table c USING (select b.field1,field2 aeg from table a, table b
where a.field3='Y'
and a.field4=b.field4
group by b.field1) t
on (c.field1=t.field1)
WHEN MATCHED THEN
UPDATE
set c.UUS_NAIT=t.field2;
Would just like to see an example of MERGE being used as insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So since you want to do an operation that does 1 of the 2 I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend, not knowing the details of what you want to do so this is subject to change:
1) Insert data from an outside source into a staging table.
2) Perform and INSERT-SELECT from that table into the table you desire using the criteria you are thinking about. Either using a join or in two statements with subqueries to the table you want to test against.
3) Truncate the staging table.
It seems convoluted I guess, but you really don't want to do UPDATE's. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
If you want an example of a MERGE statement follow the link. That is the link to the Vertica documentation. Remember to follow the instructions clearly. You cannot write a Merge with WHEN NOT MATCHED followed and WHEN MATCHED. It has to follow the sequence as given in the usage description in the documentation (which is the other way round). But you can choose to omit one completely.
I'm not sure, if you are aware of the fact that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be manually removed by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that Merge is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to save a lot of history (deleted) data, it will slow down your updates. The update speeds might go from what is bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supportes Subquery in Merge. You would get the following.
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact merge also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in ON clause, which should ideally be a join on the primary keys.
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.

Exclusive table (read) lock on Oracle 10g?

Is there a way to exclusively lock a table for reading in Oracle (10g) ? I am not very familiar with Oracle, so I asked the DBA and he said it's impossible to lock a table for reading in Oracle?
I am actually looking for something like the SQL Server (TABLOCKX HOLDLOCK) hints.
EDIT:
In response to some of the answers: the reason I need to lock a table for reading is to implement a queue that can be read by multiple clients, but it should be impossible for 2 clients to read the same record. So what actually happens is:
Lock table
Read next item in queue
Remove item from the queue
Remove table lock
Maybe there's another way of doing this (more efficiently)?
If you just want to prevent any other session from modifying the data you can issue
LOCK TABLE whatever
/
This blocks other sessions from updating the data but we cannot block other peple from reading it.
Note that in Oracle such table locking is rarely required, because Oracle operates a policy of read consistency. Which means if we run a query that takes fifteen minutes to run the last row returned will be consistent with the first row; in other words, if the result set had been sorted in reverse order we would still see exactly the same rows.
edit
If you want to implement a queue (without actually using Oracle's built-in Advanced Queueing functionality) then SELECT ... FOR UPDATE is the way to go. This construct allows one session to select and lock one or more rows. Other sessions can update the unlocked rows. However, implementing a genuine queue is quite cumbersome, unless you are using 11g. It is only in the latest version that Oracle have supported the SKIP LOCKED clause. Find out more.
1. Lock table
2. Read next item in queue
3. Remove item from the queue
4. Remove table lock
Under this model a lot of sessions are going to be doing nothing but waiting for the lock, which seems a waste. Advanced Queuing would be a better solution.
If you want a 'roll-your-own' solution, you can look into SKIP LOCKED. It wasn't documented until 11g, but it is present in 10g. In this algorithm you would do
1. SELECT item FROM queue WHERE ... FOR UPDATE SKIP LOCKED
2. Process item
3. Delete the item from the queue
4. COMMIT
That would allow multiple processes to consume items off the queue.
The TABLOCKX and HOLDLOCK hints you mentioned appear to be used for writes, not reads (based on http://www.tek-tips.com/faqs.cfm?fid=3141). If that's what you're after, would a SELECT FOR UPDATE fit your need?
UPDATE: Based on your update, SELECT FOR UPDATE should work, assuming all clients use it.
UPDATE 2: You may not be in a position to do anything about it right now, but this sort of problem is actually an ideal fit for something other than a relational database, such as AMQP.
If you mean, lock a table so that no other session can read from the table, then no, you can't. Why would you want to do that anyway?

Resources