ClickHouse: How to delete on *AggregatingMergeTree tables from a materialized view

We have a structure with a base table base, a materialized view base_mv that aggregates it, sending the result TO an AggregatingMergeTree table base_agg_by_id, and then a view base_unique over that final table, similar to this blog post: https://www.altinity.com/blog/clickhouse-continues-to-crush-time-series.
However, if I delete from base, I would expect base_mv to pick up the mutation and reflect it in base_agg_by_id, but it doesn't.
Is this the expected behaviour? How to DELETE in such a schema?
I've seen here that in MVs that keep their own data you can act on the .inner tables. However, in this case, since the table is an AggregatingMergeTree and its fields are defined as aggregate function states (e.g. AggregateFunction(argMax, String, DateTime)), I cannot delete by value with something like ALTER TABLE base_agg_by_id DELETE WHERE field = 'myval'.
Note: for the record, we have these tables in a replicated environment using Replicated* engines: base_d, base_agg_by_id_d, base_unique_d.

Mutations are not propagated to materialized views.
The reason is simple: it is not possible in the general case, and even in cases where it is theoretically possible, it can be a very expensive operation.
For example, say you're deleting one record from the table which references some userid, and your materialized view contains uniqState(userid). The data structures used for calculating uniqState don't support a 'remove' operation; and even if they did, there would be no way to decide whether that userid should be removed without rereading the whole partition again, because that userid could appear in other records too.
So in the general case, you need to refill the whole partition of your AggregatingMergeTree.
I.e. something like this (daily partitioning case):
ALTER TABLE amt_table DROP PARTITION '2019-03-01';
-- use same select as in your materialized view
INSERT INTO amt_table SELECT ... WHERE date = '2019-03-01';
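
As a rough sketch for the schema in the question, assuming base has columns id, val, updated and date, and that base_agg_by_id is partitioned by day with columns id, date and last_val (all of these names are assumptions), the refill could look like:

ALTER TABLE base_agg_by_id DROP PARTITION '2019-03-01';

-- re-run the materialized view's SELECT for that partition only,
-- producing aggregate states with the -State combinators
INSERT INTO base_agg_by_id (id, date, last_val)
SELECT
    id,
    date,
    argMaxState(val, updated) AS last_val
FROM base
WHERE date = '2019-03-01'
GROUP BY id, date;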

Related

Are materialized views virtual tables or real tables with real data?

When materialized views are created in Oracle, do they store indices or do they store actual table values?
I am asking this because creating an index on a table and using views on that table, versus using materialized views (created with refresh complete start with (sysdate) next (sysdate+1) with rowid as) on the unindexed table, gives similar performance.
Whereas I would expect the materialized views to be far faster.
Update
I slightly modified the content/title. My current concern after the discussion is whether materialized views are actual real tables or virtual tables with some optimization.
Materialized views create a copy of the data. To all intents and purposes they are actual tables. In fact we can create a materialized view from an existing table using the PREBUILT clause. The only difference is how the data is mastered - a materialized view doesn't own its data, a table does.
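For instance, a hedged sketch of the PREBUILT approach (sales_summary, sales and their columns are made up for illustration):

-- sales_summary already exists as an ordinary table with matching columns
CREATE MATERIALIZED VIEW sales_summary
ON PREBUILT TABLE
REFRESH COMPLETE ON DEMAND
AS
SELECT product_id, SUM(amount) AS total_amount
FROM sales
GROUP BY product_id;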
As to your performance conundrum:
When you say "on unindexed table" do you literally mean one table? If so, we wouldn't expect any difference in the time to query a view, a materialized view or the actual data: they all execute a full table scan on the same volume of data.
Consider the case where the views are of the form select * from <table> where <condition>.
We would expect a SELECT against a materialised view built on that query to execute quicker than the same SELECT against the actual table, provided the WHERE clause restricts the data to a significantly smaller subset of the original data. Simply because a full table scan over a small table (the materialised view) takes less time than a full table scan over a big table. The same applies if the materialised view's projection has fewer columns than the base table.
Indexing is a different matter. Unless the query selects a very small subset of the data, an index is not going to be more efficient than a full table scan and a filter.
To sum up: the only universal tuning heuristic is: it takes less time to do less work. Beyond that it is impossible to generalise. We can't discuss some vague "consider the case where views have select * from <table> where <condition>." It's all about the specifics.
Fundamentally, a materialized view is just a table with an associated query to populate it.
Given static data, one would generally expect the performance of a SELECT * from the materialized view (with no WHERE clause) to be at least as fast as running the query that underlies the materialized view, regardless of indexing.
If we add a WHERE clause to a SELECT * against the mview, however, that query could perform significantly slower than running the query that underlies the mview with the same WHERE clause. That's because the tables referenced in the query underlying the mview could have indexes to support the conditions in the WHERE clause, whereas the mview might not have such indexes.
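If the WHERE clause is selective, one option, sketched here with made-up names, is to index the mview itself, since underneath it is just a table:

-- index the materialized view's filter column like any ordinary table
CREATE INDEX orders_mv_cust_idx ON orders_mv (customer_id);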

Oracle 12c - refreshing the data in my tables based on the data from warehouse tables

I need to update some tables in my application from other warehouse tables which are updated weekly or biweekly, and my tables have foreign keys referenced by other tables. So I cannot just truncate the table and reinsert the whole data every time; I have to take the delta and update accordingly, based on a few primary key columns which don't change. I need some input on how to implement this approach.
My approach:
Check the last updated time of those tables and views.
If it is more recent, then compare each row based on the primary key in my table and the warehouse table.
Update each column if it is different.
Do nothing if there is no change in columns.
Insert if there is a new record.
My Question:
How do I implement this? Is writing PL/SQL code a good and efficient way, given that the expected number of records is around 800K?
Please provide any sample code or links.
I would go for PL/SQL and the BULK COLLECT / FORALL method. You can use MINUS in your cursor in order to reduce the data size and calculate the difference.
You can check this site for more information about bulk collect, forall and engines: http://www.oracle.com/technetwork/issue-archive/2012/12-sep/o52plsql-1709862.html
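As a minimal sketch of that pattern, assuming a warehouse table wh_items and an application table app_items that share the columns item_id and item_name (all names are hypothetical):

DECLARE
  TYPE t_ids   IS TABLE OF app_items.item_id%TYPE;
  TYPE t_names IS TABLE OF app_items.item_name%TYPE;
  l_ids   t_ids;
  l_names t_names;

  -- rows that are new or changed in the warehouse copy
  CURSOR c_delta IS
    SELECT item_id, item_name FROM wh_items
    MINUS
    SELECT item_id, item_name FROM app_items;
BEGIN
  OPEN c_delta;
  LOOP
    FETCH c_delta BULK COLLECT INTO l_ids, l_names LIMIT 1000;
    EXIT WHEN l_ids.COUNT = 0;

    -- upsert each batch; FORALL with MERGE needs Oracle 11g or later
    FORALL i IN 1 .. l_ids.COUNT
      MERGE INTO app_items t
      USING (SELECT l_ids(i) AS item_id, l_names(i) AS item_name FROM dual) s
      ON (t.item_id = s.item_id)
      WHEN MATCHED THEN UPDATE SET t.item_name = s.item_name
      WHEN NOT MATCHED THEN INSERT (item_id, item_name)
                            VALUES (s.item_id, s.item_name);
  END LOOP;
  CLOSE c_delta;
  COMMIT;
END;
/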
There are many parts to your question above and I will answer as best I can:
While it is possible to disable the referencing foreign keys, truncate the table, repopulate the table with the updated data and then re-enable the foreign keys, given your requirements described above I don't believe truncating the table each time is optimal.
Yes, in principle PL/SQL is a good way to achieve what you are wanting to achieve, as this is too complex to deal with in plain SQL and PL/SQL is an efficient alternative.
Conceptually, the approach I would take is something like as follows:
Initial set up:
Create a sequence called activity_seq
Add an "activity_id" column of type NUMBER to your source table/s with a unique constraint
Add a trigger to the source table/s setting activity_id = activity_seq.nextval for each insert / update of a table row
Create some kind of master table to hold the "last processed activity id" value
Then bi/weekly:
retrieve the value of "last processed activity id" from the master table
select all rows in the source table/s having an activity_id value > the "last processed activity id" value
iterate through the selected source rows and update the target if a match is found based on whatever your match criterion is, or insert a new row into the target if no match is found (I assume there is no delete as you do not mention it)
on completion, update the master table's "last processed activity id" to the greatest value of activity_id for the source rows processed in step 3 above.
(please note that, depending on your environment and the number of rows processed, the above process may need to be split and repeated over a number of transactions)
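A rough sketch of the above, with every object name (source_tab, target_tab, item_id, item_name, sync_control) assumed for illustration:

-- one-off set-up
CREATE SEQUENCE activity_seq;

ALTER TABLE source_tab ADD (activity_id NUMBER);
ALTER TABLE source_tab ADD CONSTRAINT source_tab_activity_uk UNIQUE (activity_id);

CREATE OR REPLACE TRIGGER source_tab_activity_trg
BEFORE INSERT OR UPDATE ON source_tab
FOR EACH ROW
BEGIN
  :NEW.activity_id := activity_seq.NEXTVAL;
END;
/

CREATE TABLE sync_control (last_processed_id NUMBER);
INSERT INTO sync_control VALUES (0);

-- bi/weekly run: upsert only rows touched since the last run, then move the watermark
MERGE INTO target_tab t
USING (SELECT s.item_id, s.item_name, s.activity_id
       FROM source_tab s, sync_control c
       WHERE s.activity_id > c.last_processed_id) src
ON (t.item_id = src.item_id)
WHEN MATCHED THEN UPDATE SET t.item_name = src.item_name
WHEN NOT MATCHED THEN INSERT (item_id, item_name)
                      VALUES (src.item_id, src.item_name);

-- for simplicity this takes the table-wide maximum; strictly it should be
-- the maximum over the rows processed by the MERGE above
UPDATE sync_control c
SET c.last_processed_id = (SELECT NVL(MAX(s.activity_id), c.last_processed_id)
                           FROM source_tab s);
COMMIT;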
I hope this proves helpful

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table? What is the best way to solve this when using Cassandra?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The query SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system; it requires you to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1 requires a full table scan. Indeed, as long as you don't fetch all records, the ordering cannot be achieved, unless you use a Cassandra clustering column to order at insertion time. But in that case, all posts are stored in the same partition and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
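For reference, a hedged sketch of the materialized-view option (Cassandra 3.0+), reusing the columns of the posts table from the question; the view name is made up:

CREATE MATERIALIZED VIEW posts_by_content_mv AS
    SELECT username, content, creation
    FROM posts
    WHERE username IS NOT NULL AND content IS NOT NULL AND creation IS NOT NULL
    PRIMARY KEY ((username), content, creation);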
Question 1:
Depending on the range of times you're interested in, I bet you could model this with time buckets.
You can do this by making the partition key a year, year-month, or year-month-day depending on your use case (or finer time intervals).
The basic idea is that you bucket changes in whatever way suits your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
    creation_year int,
    creation_month int,
    creation_day int,
    creation timeuuid,
    username text,  -- using text instead of varchar; they're essentially the same
    content text,
    PRIMARY KEY ((creation_year, creation_month, creation_day), creation)
);
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
We can then insert with the partition key (PK) columns creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
As a usage example with this table, let's say you wanted to see all of the changes made today; your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 05:00-0600');
minTimeuuid() is another CQL function; it creates the smallest possible timeUUID for the given time, which guarantees that you get all of the changes from that time onward.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
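A hedged sketch of such a batch, reusing posts and posts_by_creation from above plus a hypothetical posts_by_content table that clusters by content:

BEGIN BATCH
    INSERT INTO posts (username, creation, content)
        VALUES ('luke', '2016-04-02 17:00:00+0000', 'hello world');
    INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content)
        VALUES (2016, 4, 2, now(), 'luke', 'hello world');
    INSERT INTO posts_by_content (username, content, creation)
        VALUES ('luke', 'hello world', '2016-04-02 17:00:00+0000');
APPLY BATCH;

In practice you would generate the timestamp/timeuuid once in the application so that all copies of the row carry the same value.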
The solution is to create other tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special columns for grouping, for example by month and year, e.g. PRIMARY KEY ((year, month), creation). This way Cassandra will have better read performance because it doesn't need to scan the whole cluster to get all the data, and it also saves data transfer between nodes.
The same goes for SELECT * FROM posts WHERE username='luke' ORDER BY content;: you must create another table for this query too. All columns may be the same as in your first table but with a different primary key, because you cannot order by a column that is not a clustering column.
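A hedged sketch of those two extra tables (names and the month bucketing are illustrative; the columns follow the posts example above):

-- serves: SELECT * FROM posts_by_month WHERE year = 2016 AND month = 4;
CREATE TABLE posts_by_month (
    year int,
    month int,
    creation timestamp,
    username varchar,
    content varchar,
    PRIMARY KEY ((year, month), creation)
);

-- serves: SELECT * FROM posts_by_content WHERE username = 'luke';
-- rows come back ordered by content within each username partition
CREATE TABLE posts_by_content (
    username varchar,
    content varchar,
    creation timestamp,
    PRIMARY KEY ((username), content, creation)
);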

Creating Materialized view in oracle taking forever

I have a query whose SELECT returns data in 5 seconds. But when I add the CREATE MATERIALIZED VIEW command in front of it, it takes forever to create the materialized view.
When you create a materialized view, you actually create a copy of the data that Oracle takes care to keep synchronized (which makes those views somewhat like indexes). If your view operates over a big amount of data or over data from other servers, it's natural that creating this view can take time.
From docs.oracle.com:
A materialized view is a replica of a target master from a single point in time.
Just for "yuks", try
create table temp_tab nologging as select ...
I've seen cases where MV creation is long for some reason, probably logging.
Also, query development tools sometimes begin returning the data to the screen right away, but if you "paged" to the last row, you would find out how long it really takes to get all the data.
You should profile the SELECT statement with EXPLAIN PLAN and understand the table cardinality, indexes, wait states while running, etc., in order to see if the query needs tuning.
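For example, a quick way to look at the plan of the defining SELECT (table and column names are made up):

EXPLAIN PLAN FOR
SELECT owner, COUNT(*) FROM big_source_table GROUP BY owner;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);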

oracle - moving data from one database to an identical database

I have two databases with identical table layouts. There are a dozen or so tables of interest, and there are a number of FKs between them.
I have been asked to write a stored procedure to copy data from database A to database B based on the PK of the parent table at the top of the hierarchy. I may receive just one value, or a list of values. I'm supposed to select all records from database A that match the value(s) and insert/update them into database B. This includes all the records in the child tables too.
My question is: what's the best (most efficient / best-practice) way to do this?
Should I write a dozen select from... insert into... statements?
Should I join the tables together and try to insert into all the tables at the same time?
Thanks!
Additional info:
The record should be inserted if it is not already there (based on the PK of the respective table); otherwise it should be updated.
Obviously I need to traverse down to all child tables, so there would only be one record to copy in the parent table, but the child table might have 10, and the child's child table might have 500. I would of course need to update a record if it already exists, and insert if it does not, for the child tables too...
UPDATE:
I think it would make the solution simpler if I just deleted all records related to the top-level key, and then inserted all the new records rather than trying to do updates.
So I guess the question is: is it best to just do a dozen of these:
delete from ... where ... in ...
select from ... where ... in ...
insert into...
or is it better to do some kind of fancy joins to do all the inserts in one SQL statement?
I would do this by disabling all the foreign key constraints, then doing a set of MERGE statements to deal with the updates and inserts, then enable all the constraints.
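A rough sketch of that pattern, with the table, column, constraint and database link names all assumed:

-- disable the referencing constraints on the target side
ALTER TABLE child_tab DISABLE CONSTRAINT child_parent_fk;

-- one MERGE per table, driven by the requested parent key(s);
-- source_db is a database link pointing at database A
MERGE INTO parent_tab t
USING (SELECT * FROM parent_tab@source_db WHERE parent_id IN (101, 102)) s
ON (t.parent_id = s.parent_id)
WHEN MATCHED THEN UPDATE SET t.parent_name = s.parent_name
WHEN NOT MATCHED THEN INSERT (parent_id, parent_name)
                      VALUES (s.parent_id, s.parent_name);

-- ... repeat for child_tab, grandchild_tab, and so on down the hierarchy ...

-- re-enable and validate the constraints afterwards
ALTER TABLE child_tab ENABLE VALIDATE CONSTRAINT child_parent_fk;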
Think about logging. How much redo do you want to generate?
You might find that it's quicker and better to truncate all the target tables and then do inserts of everything with NOLOGGING. That could be simpler than the merges.
One major alternative would be to drop all the target tables and use export and import. That might be a lot faster.
A second alternative would be to use materialized views, particularly if you don't need to do updates on the target tables. That way, Oracle does all the heavy lifting for you. You can force integrity by choosing refresh groups carefully.
There are several ways to deal with this business problem. A PL/SQL program may not be the best.
