MERGE in Vertica - insert

I would like to write a MERGE statement in a Vertica database.
I know it can't be used directly, and that an insert and an update have to be
combined to get the desired effect.
The MERGE statement looks like this:
MERGE INTO table c USING (select b.field1,field2 aeg from table a, table b
where a.field3='Y'
and a.field4=b.field4
group by b.field1) t
on (c.field1=t.field1)
WHEN MATCHED THEN
UPDATE
set c.UUS_NAIT=t.field2;
I would just like to see an example of MERGE being done as an insert/update.

You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. Since your statement performs an update, which is one of the two operations to avoid, I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend, not knowing the details of what you want to do so this is subject to change:
1) Insert data from an outside source into a staging table.
2) Perform an INSERT-SELECT from that table into the target table using the criteria you have in mind, either with a join or in two statements with subqueries against the table you want to test against.
3) Truncate the staging table.
It seems convoluted, I guess, but you really don't want to do UPDATEs. And if you think that is a hassle, please remember that what causes the hassle is also what gives you your gains on SELECT statements.
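The three steps above can be sketched as follows. This is only a minimal illustration that uses SQLite through Python's sqlite3 module as a stand-in, not Vertica itself; the table names `target` and `staging`, their columns, and the helper `load_via_staging` are all made up for the example, and the "criteria" are assumed to be "the key does not exist in the target yet".

```python
import sqlite3


def load_via_staging(conn, incoming_rows):
    """Load rows through a staging table, inserting only keys not yet in target."""
    cur = conn.cursor()
    # 1) Land the outside data in the staging table.
    cur.executemany("INSERT INTO staging (id, val) VALUES (?, ?)", incoming_rows)
    # 2) INSERT-SELECT into the target, using a join to skip existing keys.
    cur.execute(
        """
        INSERT INTO target (id, val)
        SELECT s.id, s.val
        FROM staging AS s
        LEFT JOIN target AS t ON t.id = s.id
        WHERE t.id IS NULL
        """
    )
    # 3) Empty the staging table (SQLite has no TRUNCATE, so DELETE stands in).
    cur.execute("DELETE FROM staging")
    conn.commit()
    return conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT);
    CREATE TABLE staging (id INTEGER, val TEXT);
    INSERT INTO target VALUES (1, 'old');
    """
)
result = load_via_staging(conn, [(1, "new"), (2, "fresh")])
staging_left = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(result, staging_left)
```

Note that the existing row with id 1 is left alone: the pattern only ever inserts, which is the point of the advice.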

If you want an example of a MERGE statement, follow the link to the Vertica documentation. Remember to follow the instructions closely. You cannot write a MERGE with WHEN NOT MATCHED followed by WHEN MATCHED. The clauses have to follow the sequence given in the usage description in the documentation (which is the other way round), but you can choose to omit one of them completely.
I'm not sure if you are aware that in Vertica, data that is updated or deleted is not really removed from the table, but just marked as 'deleted'. Such data can be removed manually by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need superuser permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that MERGE is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to keep a lot of history (deleted) data, it will slow down your updates. The update speeds might go from bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supports a subquery in MERGE. You would get the following error:
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
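As a rough sketch of that workaround, the subquery from the question can be wrapped in a view and the view then used as the source. The example below runs on SQLite through Python's sqlite3 module, which has no MERGE, so the "WHEN MATCHED THEN UPDATE" half is written as a correlated UPDATE; in Vertica the view would instead be named in the USING clause. The view name `merge_source`, the sample data, and the use of MAX() on field2 are assumptions (the original query does not say how field2 should be aggregated).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (field3 TEXT, field4 INTEGER);
    CREATE TABLE b (field1 INTEGER, field2 INTEGER, field4 INTEGER);
    CREATE TABLE c (field1 INTEGER PRIMARY KEY, uus_nait INTEGER);
    INSERT INTO a VALUES ('Y', 10), ('N', 20);
    INSERT INTO b VALUES (1, 5, 10), (1, 7, 10), (2, 9, 20);
    INSERT INTO c VALUES (1, NULL), (2, NULL);

    -- The former subquery, now a view. MAX() is an assumed aggregate.
    CREATE VIEW merge_source AS
    SELECT b.field1 AS field1, MAX(b.field2) AS aeg
    FROM a JOIN b ON a.field4 = b.field4
    WHERE a.field3 = 'Y'
    GROUP BY b.field1;
    """
)

# Update only the rows of c that have a match in the view.
conn.execute(
    """
    UPDATE c
    SET uus_nait = (SELECT s.aeg FROM merge_source AS s WHERE s.field1 = c.field1)
    WHERE EXISTS (SELECT 1 FROM merge_source AS s WHERE s.field1 = c.field1)
    """
)
conn.commit()
rows = conn.execute("SELECT field1, uus_nait FROM c ORDER BY field1").fetchall()
print(rows)
```

Only the row with field1 = 1 has a match in the view, so the other row keeps its NULL.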
In fact, MERGE also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in the ON clause, which should ideally be a join on the primary keys.

I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table of under 1 million rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.

To be clear, Vertica is not well suited for single-row updates, but large bulk updates are much less of an issue. I would not recommend re-creating the entire table; instead, I would look into strategies around recreating partitions or doing bulk updates from staging tables.

Related

Best approaches to UPDATE the data in tables - Teradata

I am new to Teradata & fortunately got a chance to work on both DDL-DML statements.
One thing I observed is that Teradata is very slow when it comes to UPDATEs on a table with a large number of records.
The simplest way I found on Google to perform this update is to write an INSERT-SELECT statement with a CASE on the column holding the values to be updated.
But what about when this situation arises in a data warehouse environment, where we need to update multiple columns in a table holding millions of rows?
Which would be the best approach to follow?
INSERT-SELECT only, MERGE-UPDATE, or MLOAD?
I am not sure whether any of the above approaches is appropriate for this UPDATE operation.
Thank you in advance!
At enterprise level, we expect volumes to be huge and updates are often part of some scheduled jobs/scripts.
With huge volumes of data, updates are a costly operation that carries the risk of blocking the table for some time if the update fails (due to the fallback journal). Although scripts are tested well and failures seldom happen in production environments, it is always better to load the data that needs to be updated into a temporary table in the required form, delete the matching records from the target table, and then insert the new rows back into it. This maintains SCD-1 (where we don't keep history).
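A minimal sketch of that delete-then-insert pattern is shown below. It uses SQLite through Python's sqlite3 module rather than Teradata, and the table names `customer` and `customer_stage`, their columns, and the helper `apply_scd1` are hypothetical names chosen for the example.

```python
import sqlite3


def apply_scd1(conn):
    """Replace matching rows in customer with the staged versions (no history kept)."""
    with conn:  # commits on success, rolls back if any statement fails
        # Remove the current versions of every staged key.
        conn.execute(
            "DELETE FROM customer WHERE id IN (SELECT id FROM customer_stage)"
        )
        # Insert the staged rows, which covers both changed and brand-new keys.
        conn.execute(
            "INSERT INTO customer (id, name, city) "
            "SELECT id, name, city FROM customer_stage"
        )
        # Clear the temporary data once it has been applied.
        conn.execute("DELETE FROM customer_stage")


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE customer_stage (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customer VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome');
    INSERT INTO customer_stage VALUES (2, 'Bob', 'Paris'), (3, 'Cy', 'Lima');
    """
)
apply_scd1(conn)
final = conn.execute("SELECT id, name, city FROM customer ORDER BY id").fetchall()
print(final)
```

Running the delete and the insert in one transaction means a failure part-way through does not leave the target table missing rows.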

Is there a way to make selecting query faster?

I want to select multiple rows from multiple tables, one of which has billions of rows. It sometimes takes 20 seconds, and there are thousands of users using it, so it is pretty bad.
I looked into COLUMNSTORE and tried it on my local machine, and the performance is 50x faster than usual! (Note that I was clearing the cache to see the difference.)
However, the downside is that I can't update, insert and delete rows, which is done constantly on the table with the billions of rows.
Is there a way to optimize it? (Besides the (NOLOCK) dirty read; security is not an issue here, by the way.)
There are already indexes on that table, but they don't help.
Is there a way to perform BATCH EXECUTION (I see it does row execution)? Or any optimization advice?
Using Microsoft SQL Server 2012
When you get to the scale of billions of rows, you often need to take different approaches to handling the data. Separating the content into multiple databases and storing them on different machines might be more effective; however, the design is considerably more complex.
An alternative is to consider using a combination of partitioned tables with a column-based index. That way, at least, you can stage the updated data for a partition and then swap the updated partition for the existing one to perform updates. See: http://technet.microsoft.com/en-us/library/gg492088.aspx#Update
Another alternative is to consider using three tables: one that is static (and perhaps uses column-based storage), a second that is dynamic and holds only recent updates and inserts, and a third that holds just a list of deleted rows identified by primary key. You then have to use a view to reconcile the content for queries.
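The three-table idea can be sketched roughly like this. The example runs on SQLite through Python's sqlite3 module, not SQL Server, and the names `base_rows`, `recent_rows`, `deleted_ids` and `current_rows` are invented for illustration. It assumes a recent row overrides the static row with the same key and that a listed deletion hides the key everywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE base_rows   (id INTEGER PRIMARY KEY, val TEXT);  -- static, read-optimized
    CREATE TABLE recent_rows (id INTEGER PRIMARY KEY, val TEXT);  -- recent inserts and updates
    CREATE TABLE deleted_ids (id INTEGER PRIMARY KEY);            -- keys of deleted rows

    -- Reconciling view: recent rows win over static rows, deleted keys are hidden.
    CREATE VIEW current_rows AS
    SELECT id, val FROM recent_rows
    WHERE id NOT IN (SELECT id FROM deleted_ids)
    UNION ALL
    SELECT id, val FROM base_rows
    WHERE id NOT IN (SELECT id FROM recent_rows)
      AND id NOT IN (SELECT id FROM deleted_ids);

    INSERT INTO base_rows   VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO recent_rows VALUES (2, 'b2'), (4, 'd');  -- update of 2, insert of 4
    INSERT INTO deleted_ids VALUES (3);                  -- delete of 3
    """
)
visible = conn.execute("SELECT id, val FROM current_rows ORDER BY id").fetchall()
print(visible)
```

Periodically folding the recent and deleted rows back into the static table (for example by rebuilding it off-hours) keeps the two small tables from growing without bound.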

Most efficient way to update database

I have a table that is auto-updating from time to time (say daily). All updated fields are of type TEXT, and might have lots of data. What I definitely know is that the data will not change a lot. Usually up to 30 characters added or deleted.
So what would be more efficient: to somehow merge the changes, or to delete the old data and retrieve the new data?
And, if the merge way is the way to do it, how should I do that? Is there any keyword or something in order to make this easier and more efficient?
P.S. I am completely new to databases in general. It's the very first time I have ever created and used a database, so sorry if it is a silly question.
Due to the MVCC model, PostgreSQL always writes a new row for any set of changes applied in a single UPDATE. It doesn't matter how much you change. There is no "merge way".
It's similar to (but not the same as) deleting the row and inserting a new one.
Since your columns are obviously big, they are going to be TOASTed, meaning they are compressed and stored out-of-line in a separate table. In an UPDATE, these columns can be preserved as-is if they remain unchanged, so it's considerably cheaper to UPDATE than to DELETE and INSERT. Quoting the manual:
During an UPDATE operation, values of unchanged fields are normally
preserved as-is; so an UPDATE of a row with out-of-line values incurs
no TOAST costs if none of the out-of-line values change.
If your rows have lots of columns and only some get updated a lot, it might pay to have two separate tables with a 1:1 relationship. But that's an extreme case.

Make row in a table read only on oracle?

I have a table with many rows.
For testing purposes, my colleagues are also using the same table. The problem is that sometimes a colleague deletes a row that I was testing with, and sometimes I delete one of theirs.
So is there any way in Oracle to make some specific rows read-only, so that others cannot delete or edit them?
Thanks.
There are a number of different ways to tackle this problem.
As Sun Tzu said, the best thing would be if you and your colleagues use data sets which do not collide.
For instance, perhaps you could each have your own database instance on local PCs; whether this will suit depends on a number of factors, not the least of which is your licensing arrangement with Oracle. Alternatively, you could have separate schemas in a shared database; depending on your application, you may need to use synonyms or special connections.
Another approach: everybody builds their own data sets, known as test fixtures. This is a good policy, because testing is only truly valid when it runs against a known state; if we make assumptions regarding the presence or absence of data how valid are our test results? The point is, the tests should clean up after themselves, removing any data created in fixtures and by the running of tests. With this tactic you need to agree ranges of IDs for each team member: they must only use records within their ranges for testing or development work.
I prefer these sorts of approach because they don't really change the way the application works (arguably except using different schemas and synonyms). More draconian methods are available.
If you have Enterprise Edition you can use Row Level Security to protect your records. This is an extension of the last point: you will need a mechanism for identifying your records, and some infrastructure to identify ownership within the session. But in addition to preventing other users from deleting your data, you can also prevent them from inserting, updating or even viewing records which are within your range of IDs. Find out more.
A lighter solution is to use a trigger, as A B Cade suggests. You will still need to identify your records and who is connected (because presumably from time to time you will still want to delete your own records).
One last strategy: take your ball home. Get the table in the state you want it and make a data pump export. For extra vindictiveness you can truncate the table at this point. Then any time you want to use the table you run a data pump import. This will reset the table's state, wiping out any existing data. This is just an extreme version of test scripts creating their own data.
You can create a trigger that prevents deleting some specific rows.
CREATE OR REPLACE TRIGGER trg_dont_delete
  BEFORE DELETE
  ON <your_table_name>
  FOR EACH ROW
BEGIN
  -- Reject the delete when the row is one of the protected IDs
  IF :OLD.ID IN (<IDs of rows you don't want to be deleted>) THEN
    raise_application_error(-20001, 'Do not delete my records!!!');
  END IF;
END;
/
Of course, you can make it smarter: make the IF statement depend on the user, or get the record IDs from another table, and so on.
Oracle supports row-level locking. You can prevent others from deleting the row you are using. For more detail, check this link.

overcoming 'log file sync' by design?

Advice/suggestions needed for a bit of application design.
I have an application which uses 2 tables. One is a staging table, which many separate processes write to. Once a 'group' of processes has finished, another job comes along and aggregates the results together into a final table, then deletes that 'group' from the staging table.
The problem that I'm having is that when the staging table is being cleared out, lots of redo is generated and I'm seeing a lot of 'log file sync' waits in the database. This is a shared database with many other applications and this is causing some issues.
When applying the aggregate, the rows are reduced to about 1 row in the final table for every 20 rows in the staging table.
I'm thinking of getting around this by creating a table for each 'group' rather than having a single 'staging' table. Once done, this table can just be dropped, which should result in much less redo.
I only have SE, so partitioned tables aren't an option. Faster disks for the redo probably aren't an option in the short term either.
Is this a bad idea? Any better solutions to be offered?
Thanks.
Would it be possible to solve the problem by having your process do a logical delete (i.e. set a DELETE_FLAG column in the table to 'Y') and then having a nightly process that truncates the table (potentially writing any non-deleted rows to a separate table before the truncate and then copy them back after the table is truncated)?
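A rough sketch of that logical-delete idea follows. It uses SQLite through Python's sqlite3 module instead of Oracle, so the nightly TRUNCATE is approximated with an unqualified DELETE; the table layout, the temporary table `staging_keep`, and the helper names `mark_group_deleted` and `nightly_reset` are all assumptions made for the example.

```python
import sqlite3


def mark_group_deleted(conn, group_id):
    """Logical delete: flag the group's rows instead of removing them."""
    with conn:
        conn.execute(
            "UPDATE staging SET delete_flag = 'Y' WHERE group_id = ?", (group_id,)
        )


def nightly_reset(conn):
    """Keep the non-deleted rows, empty the table, then copy the survivors back."""
    with conn:
        conn.execute(
            "CREATE TEMP TABLE staging_keep AS "
            "SELECT group_id, payload, delete_flag FROM staging "
            "WHERE delete_flag = 'N'"
        )
        conn.execute("DELETE FROM staging")  # TRUNCATE TABLE in Oracle
        conn.execute(
            "INSERT INTO staging (group_id, payload, delete_flag) "
            "SELECT group_id, payload, delete_flag FROM staging_keep"
        )
        conn.execute("DROP TABLE staging_keep")


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging (
        group_id    INTEGER,
        payload     TEXT,
        delete_flag TEXT DEFAULT 'N'
    );
    INSERT INTO staging (group_id, payload) VALUES (1, 'x'), (1, 'y'), (2, 'z');
    """
)
mark_group_deleted(conn, 1)   # group 1 has been aggregated
nightly_reset(conn)
remaining = conn.execute(
    "SELECT group_id, payload, delete_flag FROM staging ORDER BY group_id"
).fetchall()
print(remaining)
```

Any job reading the staging table between resets would need to filter on delete_flag = 'N' so that flagged groups are not aggregated twice.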
Are you certain that the source of the log file sync waits is that your disks can't keep up with the I/O? That's certainly possible, of course, but there are other possible causes of excessive log file sync waits including excessive commits. There is an excellent article on tuning log file sync events on the Pythian blog.
The most common cause of excessive log file syncs is too frequent commits, which are often deliberately coded in a mistaken attempt to reduce system load due to locking. You should commit only when your business transaction is complete.
Loading each group into a separate table sounds like a fine plan to reduce redo. You can truncate each individual group table following its aggregation.
Another (but I think probably worse) option is to create a new staging table containing the groups that haven't been aggregated, then drop the original and rename the new table to replace the staging table.
I prefer Justin's suggestion ("logical delete"), but another option to consider might be a partitioned table, if you have the EE licence. The aggregation process could drop a partition instead of deleting the rows.
