This problem has me very confused, could anybody help me? Thank you in advance.
I use ReplicatedMergeTree to create the table.
Here is the update SQL:
When I run it the first time, the result is correct, but when I run it again, the count is zero.
Did you try running
SELECT count(*) FROM ods.fb_daily_inventory WHERE uds_load_date >= '2021-05-04'
on the MySQL server, to ensure the data still exists in MySQL?
Mutations are asynchronous. Your delete was sent to ZooKeeper, then your insert ran, then your select, and most likely only after that did the delete actually execute. If you are updating from MySQL, you should look at binlog replication to replicate only new or changed data over. ClickHouse works best with new data, not updates/deletes, as those are asynchronous.
A mutation query returns immediately after the mutation entry is added (in case of replicated tables to ZooKeeper, for non-replicated tables - to the filesystem). The mutation itself executes asynchronously using the system profile settings. To track the progress of mutations you can use the system.mutations table. A mutation that was successfully submitted will continue to execute even if ClickHouse servers are restarted. There is no way to roll back the mutation once it is submitted, but if the mutation is stuck for some reason it can be cancelled with the KILL MUTATION query.
https://clickhouse.com/docs/en/sql-reference/statements/alter/#mutations
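For example, if the delete is an ALTER TABLE ... DELETE, a minimal sketch along those lines (the table name is taken from the count query above; mutations_sync is the setting that makes the query wait):
SET mutations_sync = 2;  -- 1 = wait for the current server, 2 = wait for all replicas
ALTER TABLE ods.fb_daily_inventory DELETE WHERE uds_load_date >= '2021-05-04';
-- or leave the mutation asynchronous and poll until it finishes
SELECT mutation_id, command, is_done, latest_fail_reason
FROM system.mutations
WHERE database = 'ods' AND table = 'fb_daily_inventory' AND is_done = 0;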
Related
ClickHouse recently released the new DELETE query, which allows you to quickly mark data as deleted (https://clickhouse.com/docs/en/sql-reference/statements/delete/).
The actual data deletion is done in the background afterwards, as is written in the docs.
My question is, is there some indication of when the data will actually be deleted?
Or is there a way to make sure it gets deleted?
I'm asking for GDPR compliance purposes.
Thanks
For GDPR it is better to use
ALTER TABLE ... DELETE WHERE ...
and poll
SELECT * FROM system.mutations WHERE is_done=0
When the mutation completes (is_done=1), that means the data has been deleted physically.
According to CH support on Slack:
If using the DELETE query, the only way to make sure data is deleted is to use the OPTIMIZE FINAL query.
This would trigger a merge, and the deleted data won't be included in the resulting parts.
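A minimal sketch of that approach (the table name here is just a placeholder):
DELETE FROM user_events WHERE user_id = 42;  -- lightweight delete: rows are only masked
OPTIMIZE TABLE user_events FINAL;            -- forces a merge; the rewritten parts no longer contain the masked rows
Keep in mind that OPTIMIZE ... FINAL rewrites entire parts, so it can be expensive on large tables.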
Let's start with the background. I have an API endpoint that I have to query every 15 minutes and that returns complex data. Unfortunately this endpoint does not provide any information about what exactly changed, so I have to compare everything against the data I already have in the db and then execute an update, add or delete. This is pretty tedious...
I came up with the idea that I could simply remove all data from certain tables and rebuild everything from scratch... But I also have to return this cached data to my clients, so there might be a situation where the db is empty during a client request because it is "refreshing/rebuilding". And that can't happen, because I have to return something.
So I came up with the idea to either:
Lock the affected db tables so that the client has to wait until the db refresh is done
or
CQRS https://martinfowler.com/bliki/CQRS.html
Do you have any suggestions how to solve the problem?
It sounds like you're using a relational database, so I'll try to outline a solution using database terms. The idea, however, is more general than that; it's similar to Blue-Green deployment.
Have two data tables (or two databases, for that matter); one is active, and one is inactive.
When the software starts the update process, it can wipe the inactive table and write new data into it. During this process, the system keeps serving data from the active table.
Once the data update is entirely done, the system can begin to serve data from the previously inactive table. In other words, the inactive table becomes the active table, and vice versa.
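A rough sketch of the swap in SQL (PostgreSQL-style DDL with placeholder table names; in PostgreSQL the renames are transactional, so readers see the switch atomically):
-- 1. rebuild the inactive copy while clients keep reading items_active
TRUNCATE items_inactive;
-- ...bulk-insert the fresh API snapshot into items_inactive here...
-- 2. flip the two tables in one atomic step
BEGIN;
ALTER TABLE items_active   RENAME TO items_tmp;
ALTER TABLE items_inactive RENAME TO items_active;
ALTER TABLE items_tmp      RENAME TO items_inactive;
COMMIT;
An alternative to renaming is to keep a view that clients query and repoint it with CREATE OR REPLACE VIEW after each rebuild.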
I am inserting/updating data into a table. The database system does not provide "upsert" functionality, so I am using a staging table for the insert, followed by a merge into the "final" table, and finally truncating the staging table.
This leads to a race condition: if new data is inserted into the staging table between the merge and the truncate, that data is lost.
How can I make sure this does not happen?
I have tried to model this via Wait/Notify, but that is not a clean solution either: the queue for the "Put Data into staging table" PutDatabaseRecord processor could be filled while the "MergeVertica for Insert/Update" ExecuteSQL still executes.
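For clarity, the sequence the flow ends up running against Vertica is roughly this (table and column names are made up):
-- PutDatabaseRecord keeps appending rows here at any time
-- INSERT INTO staging_table (id, value) VALUES (...);
MERGE INTO final_table t
USING staging_table s ON (t.id = s.id)
WHEN MATCHED THEN UPDATE SET value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);
-- race: any row that lands in staging_table after the MERGE read it
-- but before this statement runs is silently thrown away
TRUNCATE TABLE staging_table;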
I would use a MonitorActivity processor with a 30 or 60 second threshold and use the Inactive output, with "Continually Send Messages" set to "false".
Route the success relationship of the SQL inserts into staging into your MonitorActivity; that way, if no activity is seen in the last X seconds, it will trigger a flowfile that starts your merge process.
Download the template from https://codeshare.io/aJNNkn
I have an issue in the production environment: one of the workflows has been running for more than one day, inserting records into a SQL Server db. It's just a direct load mapping; there is no SQ override either. The Monitor shows the SQ count as 7 million and the same number of records being inserted into the target, but the source db shows only around 3 million records. How can this be possible?
Have you checked if the Source Qualifier is joining more than one table? A screenshot of the affected mapping pipeline and an obfuscated logfile would help.
Another thought... given your job ran for a day, were there any jobs run in that time to purge old records from the source table?
Cases where I've seen this kind of thing happen:
There's a SQL Query override doing something different than expected (e.g. joining some tables; see the sketch after this list)
I'm looking at a different source - verify the connections and make sure to check the same object in the same database on the same server that PowerCenter is connecting to.
It's a reusable session being executed multiple times by different workflows. In such a case, the Source/Target Statistics in the Workflow Monitor may refer to another execution.
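To illustrate the first case, an override like this (purely hypothetical table names) makes the Source Qualifier emit far more rows than the source table holds:
SELECT o.order_id, o.amount, l.line_no
FROM   orders o
JOIN   order_lines l ON l.order_id = o.order_id;
-- a one-to-many join: ~3 million orders can easily become ~7 million result rows,
-- which is what the session statistics would then report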
I have 2 databases, DBa and DBb, and 2 record sets, RecordsA and RecordsB. The concept is that in our app you can add records from A to B. I am having an issue where I add a record from A to B and then try to query the records again; a particular property on the added record is stale/incorrect.
RecordsA lives on DBa and RecordsB lives on DBb. I make my stored proc call to add the record to the B side and modify a column's value on DBa, which performs the insert/update using a dblink to DBb. The problem is, when I do an insert/update followed by an immediate get call on DBa (calling DBb), that modified property is incorrect: it's null, as if the insert never went through. However, if I put a breakpoint before the pull call and wait about 1 second, the correct data is returned, which makes me wonder whether there are latency issues with dblinks.
This seems like an async issue, but we verified that no async calls are being made and everything runs on the same thread. Would this type of behavior be likely with a dblink? That is, does inserting/updating a record on a remote server and retrieving it right away involve some latency, where the record isn't quite updated yet at the time of the re-pull?
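To make it concrete, the sequence is roughly this (Oracle-style dblink syntax; the names are placeholders, not the real objects):
-- stored proc on DBa: write to RecordsB over the database link
UPDATE records_b@dbb_link SET added_flag = 'Y' WHERE record_id = :rec_id;
COMMIT;
-- the same request immediately reads it back over the link
SELECT added_flag FROM records_b@dbb_link WHERE record_id = :rec_id;
-- returns NULL unless I pause for about a second before the SELECT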