I have a table that is auto-updating from time to time (say daily). All updated fields are of type TEXT, and might have lots of data. What I definitely know is that the data will not change a lot. Usually up to 30 characters added or deleted.
So what would be more efficient? To merge somehow the changes or delete the old data and retrieve the new one?
And, if the merge way is the way to do it, how should I do that? Is there any keyword or something in order to make this easier and more efficient?
P.S I am completely new to databases in general, it's the very first time I ever create and use a database, so sorry if it is a silly question
Due to the MVCC model, PostgreSQL always writes a new row for any set of changes applied in a single UPDATE. Doesn't matter, how much you change. There is no "merge way".
It's similar to (but not the same as) deleting the row and inserting a new one.
Since your columns are obviously big, they are going to be TOASTed, meaning they are compressed and stored out-of-line in a separate table. In an UPDATE, these columns can be preserved as-is if they remain unchanged, so it's considerably cheaper to UPDATE than to DELETE and INSERT. Quoting the manual here
During an UPDATE operation, values of unchanged fields are normally
preserved as-is; so an UPDATE of a row with out-of-line values incurs
no TOAST costs if none of the out-of-line values change.
If your rows have lots of columns and only some get updated a lot, it might pay to have two separate tables with a 1:1 relationship. But that's an extreme case.
Related
I am new to Teradata & fortunately got a chance to work on both DDL-DML statements.
One thing I observed is Teradata is very slow when time comes to UPDATE the data in a table having large number of records.
The simplest way I found on the Google to perform this update is to write an INSERT-SELECT statement with a CASE on column holding values to be update with new values.
But what when this situation arrives in Data Warehouse environment, when we need to update multiple columns from a table holding millions of rows ?
Which would be the best approach to follow ?
INSERT-SELECT only OR MERGE-UPDATE OR MLOAD ?
Not sure if any of the above approach is not used for this UPDATE operation.
Thank you in advance!
At enterprise level, we expect volumes to be huge and updates are often part of some scheduled jobs/scripts.
With huge volume of data, Updates comes as a costly operation that involve risk of blocking table for some time in case the update fails (due to fallback journal). Although scripts are tested well, and failures seldom happen in production environments, it's always better to have data that needs to be updated loaded to a temporary table in required form and inserted back to same table after deleting matching records to maintain SCD-1 (Where we don't maintain history).
First of all I want to apologize because I do not have the vocabulary to talk about hive properly, I'm not sure if what goes into a row is called data and so on, I'm trying to be as correct as possible.
I want to know if it's possible, without adding an extra column to a hive table (where you would put the date/some metadata), what where the new rows added.
The case is as follows: A very large number of data is going to be processed, and the data selected ends in another hive table. If some new data is added to the original tables, I want to only process that new data, not to re-process the whole process, because it seems waste(we're talking several million entries).
I would normally add a new column with dates, or just metadata that tells me whether or not a row was already "computed" with.
edit: I have been updated with more info. Turns out, there are actually two problems, imo.
One, new data may come, and it would be infinitely better to just insert thus new ones in the destination table.
Second, data might be updated. I've been told that hive does not allow updates in the normal sense, since for example insert overwrite would just rewrite the whole set (turns out it's Hive 0.12.0, and in 0.14 SOME functionality has been added but updating is not a possibility).
I have configured free text search on a table in my postgres database. Pretty simple stuff, with firstname, lastname and email. This works well and is fast.
I do however sometimes experience looong delays when inserting a new entry into the table, where the insert keeps running for minutes and also generates huge WAL files. (We use the WAL files for replication).
Is there anything I need to be aware of with my free text index? Like Postgres maybe randomly restructuring it for performance reasons? My index is currently around 400 MB big.
Thanks in advance!
Christian
Given the size of the WAL files, I suspect you are right that it is an index update/rebalancing that is causing the issue. However I have to wonder what else is going on.
I would recommend against storing tsvectors in separate columns. A better way is to run an index on to_tsvector()'s output. You can have multiple indexes for multiple languages if you need. So instead of a trigger that takes, say, a field called description and stores the tsvector in desc_tsvector, I would recommend just doing:
CREATE INDEX mytable_description_tsvector_idx ON mytable(to_tsvector(description));
Now, if you need a consistent search interface across a whole table, there are more elegant ways of doing this using "table methods."
In general the functional index approach has fewer issues associated with it than anything else.
Now a second thing you should be aware of are partial indexes. If you need to, you can index only records of interest. For example, if most of my queries only check the last year, I can:
CREATE INDEX mytable_description_tsvector_idx ON mytable(to_tsvector(description))
WHERE created_at > now() - '1 year'::interval;
I would like to write a MERGE statement in Vertica database.
I know it can't be used directly, and insert/update has to be
combined to get the desired effect.
The merge sentence looks like this:
MERGE INTO table c USING (select b.field1,field2 aeg from table a, table b
where a.field3='Y'
and a.field4=b.field4
group by b.field1) t
on (c.field1=t.field1)
WHEN MATCHED THEN
UPDATE
set c.UUS_NAIT=t.field2;
Would just like to see an example of MERGE being used as insert/update.
You really don't want to do an update in Vertica. Inserting is fine. Selects are fine. But I would highly recommend staying away from anything that updates or deletes.
The system is optimized for reading large amounts of data and for inserting large amounts of data. So since you want to do an operation that does 1 of the 2 I would advise against it.
As you stated, you can break apart the statement into an insert and an update.
What I would recommend, not knowing the details of what you want to do so this is subject to change:
1) Insert data from an outside source into a staging table.
2) Perform and INSERT-SELECT from that table into the table you desire using the criteria you are thinking about. Either using a join or in two statements with subqueries to the table you want to test against.
3) Truncate the staging table.
It seems convoluted I guess, but you really don't want to do UPDATE's. And if you think that is a hassle, please remember that what causes the hassle is what gives you your gains on SELECT statements.
If you want an example of a MERGE statement follow the link. That is the link to the Vertica documentation. Remember to follow the instructions clearly. You cannot write a Merge with WHEN NOT MATCHED followed and WHEN MATCHED. It has to follow the sequence as given in the usage description in the documentation (which is the other way round). But you can choose to omit one completely.
I'm not sure, if you are aware of the fact that in Vertica, data which is updated or deleted is not really removed from the table, but just marked as 'deleted'. This sort of data can be manually removed by running: SELECT PURGE_TABLE('schemaName.tableName');
You might need super user permissions to do that on that schema.
More about this can be read here: Vertica Documentation; Purge Data.
An example of this from Vertica's Website: Update and Insert Simultaneously using MERGE
I agree that Merge is supported in Vertica version 6.0. But if Vertica's AHM or epoch management settings are set to save a lot of history (deleted) data, it will slow down your updates. The update speeds might go from what is bad, to worse, to horrible.
What I generally do to get rid of deleted (old) data is run the purge on the table after updating the table. This has helped maintain the speed of the updates.
Merge is useful where you definitely need to run updates. Especially incremental daily updates which might update millions of rows.
Getting to your answer: I don't think Vertica supportes Subquery in Merge. You would get the following.
ERROR 0: Subquery in MERGE is not supported
When I had a similar use-case, I created a view using the sub-query and merged into the destination table using the newly created view as my source table. That should let you keep using MERGE operations in Vertica and regular PURGEs should let you keep your updates fast.
In fact merge also helps avoid duplicate entries during inserts or updates if you use the correct combination of fields in ON clause, which should ideally be a join on the primary keys.
I like geoff's answer in general. It seems counterintuitive, but you'll have better results creating a new table with the rows you want in it versus modifying an existing one.
That said, doing so would only be worth it once the table gets past a certain size, or past a certain number of UPDATEs. If you're talking about a table <1mil rows, I might chance it and do the updates in place, and then purge to get rid of tombstoned rows.
To be clear, Vertica is not well suited for single row updates but large bulk updates are much less of an issue. I would not recommend re-creating the entire table, I would look into strategies around recreating partitions or bulk updates from staging tables.
We've got an Oracle 11g installation that is starting to get big. This database is the backend to a parallel optimization system running on a cluster. Input to the process is contained in the database along with output from the optimization steps. The input includes rote configuration data and some binary files (using 11g's SecureFiles). The output includes 1D, 2D, 3D, and 4D data currently stored in the DB.
DB Structure:
/* Metadata tables */
Case(CaseId, DeleteFlag, ...) On Delete Cascade CaseId
OptimizationRun(OptId, CaseId, ...) On Delete Cascade OptId
OptimizationStep(StepId, OptId, ...) On Delete Cascade StepId
/* Data tables */
Files(FileId, CaseId, Blob) /* deletes are near instantateous here */
/* Data per run */
OnedDataX(OptId, ...)
TwoDDataY1(OptId, ...) /* packed representation of a 1D slice */
/* Data not only per run, but per step */
TwoDDataY2(StepId, ...) /* packed representation of a 1D slice */
ThreeDDataZ(StepId, ...) /* packed representation of a 2D slice */
FourDDataZ(StepId, ...) /* packed representation of a 3D slice */
/* ... About 10 or so of these tables exist */
A reaper script comes around daily and looks for cases with the DeleteFlag = 1 and proceeds with the DELETE FROM Case WHERE DeleteFlag = 1, allowing the cascades to continue.
This strategy works great for read/write, but is now outstripping our capabilities when we want to purge data! The rub is deleting a Case takes ~20-40 minutes depending on the size and often overloads our archiver space. The next major version of the product will take a "from the ground up" approach to solving the problem. The next minor release needs to stay within the confines of data stored in the database.
So, for the minor release we need an approach that can improve delete performance and at most require moderate changes to the database.
REF Partitioning, but the question is HOW? I would love to do INTERVAL on Case and REF on the rest, but that isn't supported. Is there some way to manually partition OptimizationRun by CaseId through a trigger?
Disable archiving/redo logs for deletes? Couldn't find a HINT to go with this one. Not sure it is even feasible.
Truncate? This likely would need some sorta complicated table setup. But maybe I'm not considering all of my option. (per answer, stricken)
To help illustrate the issue, the data in question per case ranges from 15MiB to 1.5GiB with anywhere from 20k to 2M rows.
Update: Current size of the DB is ~1.5TB.
Deleting data is a hell of a job, for the database. It has to create before images, update indexes, write redo logs and remove the data. This is a slow process. If you can have a window to perform this task, easiest and fastest is to build new tables, containing the wanted data. Drop the old tables and rename the new tables.
This requires some setup work, that is obvious but is very well possible to make.
One step less drastic is to drop the indexes before the delete takes place. My vote would go for CTAS (Create Table As Select from) and build the new tables.
A nice partitioning schema would certainly be helpful, maybe in the next release Oracle can combine interval and reference partitioning. It would be very nice to have.
Disabling logging .... can not be done for deletes but CTAS can use nologging. Make a backup when ready and make sure to transfer the datafiles to the standby database, if you have one.
Just some thoughts:
I assume you have indexes on all foreign keys. ON DELETE CASCADE will hold row level locks until the Case delete is complete, and with no indexes will hold table locks I believe and be super slow of course
Do you have any deferred constraints? This would most likely slow things down for Oracle cascading through the various table deletes
Have you tried to do the deletes separately for all affected tables (instead of relying on on delete cascade)? Not as easy, but you may be surprised.
EDIT:
One more thought. You may consider doing a SOFT delete on Case table, meaning you have a status field that will tell your app if that Case should be considered. This flag could have many different values, but maybe 'A' for active and 'I' for inactive. Assuming you are always using Case as a driving/primary table in joins to other tables, you can avoid the HARD deletes all-together (and occasionally do a cleanup off hours on whatever schedule if you like). Apps would need to be aware of this flag of course, and you'd be tied to joining back to Case table. May or may not fit for your situation...
CASCADE DELETE runs internally slow-by-slow, er, row-by-row.
Some options:
Have your purge job snapshot all the cases to be purged into a scratch table with a CTAS. Then have your purge job loop over that table, deleting each case (and its children) individually. This can be unpleasant, especially if you run into millions of descendant rows. We had to change one of the processes recently at [business redacted] which did that to determine which ultimate parents had child counts that would be problematic, and then use a rownum limiter on a delete against the problematic child table(s). It's not fast, but at least it's safer from an undo/redo management perspective by placing an upper bound on how big any transaction can be.
If you're using CASCADE DELETE as a convenience, you could always not do so. You'd have to write a more sophisticated purge routine that deletes from your dependency tree "bottom up".
If you can afford the undo/redo generation on the soft delete, you could range-partition the ultimate parent on DeleteFlag, then partition the children BY REFERENCE, all tables using ENABLE ROW MOVEMENT. You'd incur undo/redo costs for moving the rows when soft-deleted, but when it came time to finally purge, it would be truncating partitions where DeleteFlag = 1, nothing more.
Adding storage is relatively cheap. If there's a date-based retention option, use it, and just have the soft delete option hide the data from the application front end. It's inelegant, but then, so is CASCADE DELETE.
Not advised for live database.
I disabled the foreign key constraints referencing the table which is slow to delete.
I executed the delete
Enabled the foreign keys again.
Use Enterprise Manager to create a AWR report and run it through statspack analyzer which will give you detailed instructions about the bottlenecks in your system. A AWR report is a textfile containing all kinds of data about what the database has done during a certain time and how long it took.... That statspack analyzer ist sort of an automatic DBA telling you what to do.
Forget partitions until Statspack Analyzer tells you that they could be useful and you've got a few idle disks that you can use to distribute the I/O.
Don't think about truncate. It forces a commit...
BTW, I'm not affiliated with Statspack Analyzer, but I think it's a very viable general tuning approach for Oracle, especially if there's no DBA around.