How do I create a table on vertica that keep the full history of data changes, mainly the update of a row - vertica

I have to create a table that has primary key, but the name changes like the time and a wish to keep historical of the changes, is that a way to do this on Vertica?
I'm new to vertica, it will be good if you could explains well to me.

Pretty sure there is no system table that tracks it that specifically, but you can get what you're looking for by querying the v_monitor.query_requests and looking for requests that start with UPDATE.
SELECT *
FROM v_monitor.query_requests
WHERE request ILIKE 'UPDATE%';
ILIKE is a case insensitive LIKE statement.
If you want a more precise way of looking for updates, you can execute your updates with a query label: /*+LABEL('update')*/
UPDATE /*+LABEL('update')*/ table
SET col1 = col1 + 1;
You can also label your other queries similarly, INSERT /*+LABEL('insert') INTO table ...
Then you can query the query_requests table searching for those specific labels.
SELECT *
FROM v_monitor.query_requests
WHERE request_label = 'update';
Update:
As of 9.2 there is a new system table called LOG_QUERIES that keeps track of some DDL changes such as CREATE, ALTER, TRUNCATE, etc. Unfortunately—from my tests—it does not look like it keeps track of updates, so it may not help with your exact question, but it may be useful if you want to keep track of CREATE or ALTER statements in the future. If you have premium support you could request that they include UPDATE statements in this table as well.

Related

ClickHouse: How to delete on *AggregatingMergeTree tables from a materialized view

Having a structure where there is a base table, then a materialized view base_mv that aggregates sending the result TO an AggregatedMergeTree table base_agg_by_id. Then we have a view over this final table. base_unique. Similarly as in this blog post](https://www.altinity.com/blog/clickhouse-continues-to-crush-time-series).
However, if I delete from base, I would expect the base_mv would trigger the mutation and then act on it, and reflected on the base_agg_by_id, but it doesn't.
Is this the expected behaviour? How to DELETE in such a schema?
I've seen here that in MVs that keep data you can act on .inner tables. However in this case, since the table is from an AggregatedMergeTree and its fields are defined as functions (e.g. AggregateFunction(argMax, String, DateTime) ), I cannot apply a deletion via a value such as ALTER base_agg_by_id DELETE WHERE field = 'myval'.
Note. For the record, we have these tables in a replicated environment using Replicated* engine: base_d, base_agg_by_id_d, base_unique_d
Mutations are not propagated to materialized views.
The reason is very simple: it not possible in common case. And even in cases when it is theoretically possible it can be very expensive operation.
For example, let's say you're deleting one record from the table which references some userid. And your materialized view contains uniqState( userid ). Data structures used for calculating uniqState don't support 'remove' operation; but even if they would - the is no way to decide if that userid should be removed or not without rereading whole data for the partition again because that userid could be seen in other records too.
So in general case, you need to refill the whole partition for your AggregatedMergeTree.
I.e. something like (daily partitioning case):
ALTER amt_table DROP PARTITION '2019-03-01';
-- use same select as in your materialized view
INSERT INTO amt_table SELECT ... WHERE date = '2019-03-01';

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table. What is the best way to solve this when using Cassandra ?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this ?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The SELECT * FROM posts ORDER BY creation; will results in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement I need to get all posts regardless of the username, in order of time is very hard to achieve in a distributed system, it supposes to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1. require a full table scan. Indeed as long as you don't fetch all records, the ordering can not be achieve. Unless you use Cassandra clustering column to order at insertion time. But in this case, it means that all posts are being stored in the same partition and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
Question 1:
Depending on your use case I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year,year-month, or year-month-day depending on your use case (or finer time intervals)
The basic idea is that you bucket changes for what suites your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
)
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can then insert the Partition Key (PK): creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update1';
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update2';
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
For a usage case using this table, let's say you wanted to see all of the changes today, your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another cql function, it will create the smallest possible timeUUID for the given time, this will guarantee that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly if your not on Cassandra 3.x+ or don't want to use materialized views you can use Atomic batches to ensure data consistency across your several de-normalized tables (that's what it was designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
The solution is to create another tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special column for grouping it, maybe by month and year, e.g. PRIMARY KEY((year, month), timestamp) this way the cassandra will have a better performance on read because it doesn't need to scan the whole cluster to get all data, it will also save the data transfer between nodes too.
Same as SELECT * FROM posts WHERE username='luke' ORDER BY content;, you must create another table for this query too. All column may be same as your first table but with the different Primary Key, because you cannot order by the column that is not the clustering column.

Optimizing a delete... where query with rownum

I'm working with an application that has a large amount of outdated data clogging up a table in my databank. Ideally, I'd want to delete all entries in the table whose reference date is too old:
delete outdatedTable where referenceDate < :deletionCutoffDate
If this statement were to be run, it would take ages to complete, so I'd rather break it up into chunks with the following:
delete outdatedTable where referenceData < :deletionCutoffDate and rownum <= 10000
In testing, this works suprisingly slowly. The following query, however, runs dramatically faster:
delete outdatedTable where rownum <= 10000
I've been reading through multiple blogs and similar questions on StackOverflow, but I haven't yet found a straightforward description of how/whether using rownum affects the Oracle optimizer when there are other Where clauses in the query. In my case, it seems to me as if Oracle checks
referenceData < :deletionCutoffDate
on every single row, executes a massive Select on all matching rows, and only then filters out the top 10000 rows to return. Is this in fact the case? If so, is there any clever way to make Oracle stop checking the Where clause as soon as it's found enough matching rows?
How about a different approach without so much DML on the table. As a permanent solution for future you could go for table partitioning.
Create a new table with required partition(s).
Move ONLY the required rows from your existing table to the new partitioned table.
Once the new table is populated, add the required constraints and indexes.
Drop the old table.
In future, you would just need to DROP the old partitions.
CTAS(create table as select) is another way, however, if you want to have a new table with partition, you would have to go for exchange partition concept.
First of all, you should read about SQL statement's execution plan and learn how to explain in. It will help you to find answers on such questions.
Generally, one single delete is more effective than several chunked. It's main disadvantage is extremal using of undo tablespace.
If you wish to delete most rows of table, much faster way usially a trick:
create table new_table as select * from old_table where date >= :date_limit;
drop table old_table;
rename table new_table to old_table;
... recreate indexes and other stuff ...
If you wish to do it more than once, partitioning is a much better way. If table partitioned by date, you can select actual date quickly and you can drop partion with outdated data in milliseconds.
At last, paritioning if a way to dismiss 'deleting outdated records' at all. Sometimes we need old data, and it's sad if we delete it by own hands. With paritioning you can archive outdated partitions outside of the database, but connects them when you need to access old data.
This is an old request, but I'd like to show another approach (also using partitions).
Depending on what you consider old, you could create corresponding partitions (optimally exactly two; one current, one old; but you could just as well make more), e.g.:
PARTITION BY LIST ( mod(referenceDate,2) )
(
PARTITION year_odd VALUES (1),
PARTITION year_even VALUES (0)
);
This could as well be months (Jan, Feb, ... Dec), decades (XX0X, XX1X, ... XX9X), half years (first_half, second_half), etc. Anything circular.
Then whenever you want to get rid of old data, truncate:
ALTER TABLE mytable TRUNCATE PARTITION year_even;
delete from your_table
where PK not in
(select PK from your_table where rounum<=...) -- these records you want to leave

Is there any use to create index on all the table columns in oracle?

In our one of production database, we have 4 column table and there are no PK,UK constraints on it. only one notnull constraint on one column. The inserts are slow on this table and when I checked the indexes , there is one index which is built on all columns.
It is a normal table and not IOT. I really don't see a need of all column index, but wondering why the developers has created it?
Appreciate your thoughts?
It might be usefull, i.e. if you (mainly) query all columns oracle doesn't have to access the table at all, but can get all the data from the index. Though inserts take longer because a larger index has to be maintained by the dbms everytime.
One case where it could be useful is,
Say for example, you are trying to check the existence of records in this table and for that you have to have joins on all four columns. So in such a case if you have written a correlated query like below,
SELECT <something>
FROM table_1 t1
WHERE EXISTS
(SELECT 1 FROM table_t2 t2 where t1.c1=t2.c1 and t1.c2=t2.c2 and t1.c3=t2.c3 and t1.c4=t2.c4)
Apart from above case, it looks an error to me from developer's side.
Indexes are good to better query optimization but causes slow updates/inserts because the indexes needs to be updated at each modification.
If these tables first use is querying and inserts happens only in a specific periods like a batch at the beginning or the end of the day only, then you can remove the indexes before updating tables and then restore them.
In addition, all the queries all these tables need to be analysed to see which indexes are useful and which are not?
Anyway, You need to ask developers before removing these indexes.

Make column unique in two tables in our database

I have come into a bump at my current company where they have an account and a member. For some reason or another both are stored in separate tables.
Right now a member and an account can be registered. That's fine, except the users of both member and an account can have the same username. This is of course as you all know just wrong. Especially since they use the username to login to the same system except with different functionality levels.
Right now we are doing a check at the application level, and we're just wondering if it's possible to get the database to enforce two columns to be unique, say like a union of the two tables.
Can't set them up as primary or foreign key at the moment but that's for future anyway. Right now looking for a quick fix. In the future I will probably merge databases and get all members added on as new rows in the account table and add a boolean for IsMember.
In general, I agree with the consensus opinion that it's better to fix the design than to kluge a fix using triggers. However, a properly implemented trigger-based solution is still probably better than your current situation.
If you're going to use triggers, the right way to do it is to:
Create a new table that will contain nothing but usernames, with a primary key enforcing uniqueness (this may, in fact, be a good candidate for an indes-organized table).
Create before-insert triggers on both existing tables that add the new username to the new table. If the new username already exists, an error will be thrown, preventing the insert of both rows. Of course, the application will need to be able to handle this error gracefully (presumably it already can, for scenarios in which the new username already exists in the table it's being added to).
The wrong way to do this would be to make the trigger select from the other table, in order to verify uniqueness.
You can add a trigger that enforces your requirement.
The recommended triggers tend to be really brittle with concurrent transaction.
What you can do (AFAIK) is to create a materialized view containing the union of the column in question and put a unique constraint on that column.
Make sure you do some performance tests though.
As you use a soft delete pattern.
A trigger could be used (on each table) as a temporary measure.
By inserting a disabled record in the the other table, you will get a failure if the other record already exists
Remember this will not enforce the rule on existing data, only records that are inserted will be checked
Something like this:
-- Insert into the accounts table too
CREATE OR REPLACE TRIGGER tr_member_chk
BEFORE INSERT ON members
FOR EACH ROW
BEGIN
INSERT INTO account (name, id, etc, isenabled) VALUES(:new.name, :new.id, :new.etc, 0);
END;
-- Insert into the members table too
CREATE OR REPLACE TRIGGER tr_account_chk
BEFORE INSERT ON accounts
FOR EACH ROW
BEGIN
INSERT INTO members (name, id, etc,isenabled) VALUES(:new.name, :new.id, :new.etc,0);
END;

Resources