Techniques for removing old data on Oracle databases

We have a mature Oracle database application (in production for over 10 years), and during that time we have been using scripts of our own devising to remove old data that is no longer needed. They work by issuing delete statements against the appropriate tables, in a loop with frequent commits, to avoid overloading the system with I/O or using too much undo space.
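For reference, the reaping loop is essentially of this shape (a minimal sketch only; the table, retention rule and batch size here are illustrative, not our real schema):
BEGIN
  LOOP
    DELETE FROM invoice_lines              -- illustrative table name
     WHERE created_date < SYSDATE - 90     -- illustrative retention rule
       AND ROWNUM <= 10000;                -- cap each pass to limit undo and I/O
    EXIT WHEN SQL%ROWCOUNT = 0;            -- nothing left to delete
    COMMIT;                                -- frequent commits keep undo usage small
  END LOOP;
  COMMIT;
END;
/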
They work fine, for the most part. They run daily, and it takes about an hour to remove the oldest day's worth of data from the system. The main concerns I have are the effects that all this deleting may have on tables and indexes, and the fact that even though the scripts don't overly load the system, deleting one day's worth of data in that short time does blow out the instance's buffer cache, so subsequent queries run slightly slower for the next few hours while the cache is gradually restored.
For years we've been considering better methods. In the past, I had heard that people used partitioned tables to manage old data reaping - one month per partition, for example, and dropping the oldest partition on a monthly basis. The main drawback to this approach is that our reaping rules go beyond "remove month X". Users are allowed to specify how long data must stay in the system, based on key values (e.g., in an invoice table, account foo can be removed after 3 months, but account bar may need to remain for 2 years).
There is also the issue of referential integrity; Oracle documentation talks about using partitions for purging data mostly in the context of data warehouses, where tables tend to be hypercubes. Ours is closer to the OLTP end of things, and it is common for data in month X to have relationships to data in month Y. Creating the right partitioning keys for these tables would be ticklish at best.
As for the cache blowouts, I have read a bit about setting up dedicated buffer caches, but it seems like it's more on a per-table basis, as opposed to a per-user or per-transaction basis. To preserve the cache, I'd really like the reaping job to only keep one transaction's worth of data in the cache at any time, since there is no need to keep the data around once deleted.
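For reference, the per-table mechanism I mean is the KEEP/RECYCLE buffer pools, along these lines (a sketch; the table name and cache size are illustrative):
ALTER SYSTEM SET db_recycle_cache_size = 256M;       -- carve out a small recycle pool
ALTER TABLE invoices STORAGE (BUFFER_POOL RECYCLE);  -- blocks from this table age out quickly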
Are we stuck using deletes for the foreseeable future, or are there other, more clever ways to deal with reaping?

For the most part I think that you're stuck doing deletes.
Your comments on the difficulty of using partitions in your case probably do prevent them from being used effectively (different delete dates apply depending on the type of record), but is it possible that you could create a "delete date" column on the records and partition on that? It would have the disadvantage of making updates quite expensive, as a change in the delete date might cause row migration, so your update would really be implemented as a delete and insert.
It could be that even then you cannot use DDL partition operations to remove old data because of the referential integrity issues, but partitioning still might serve the purpose of physically clustering the rows to be deleted so that fewer blocks need to be modified in order to delete them, mitigating the impact on the buffer cache.
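As a sketch of the idea (the table, columns, and partition boundaries are purely illustrative), a range-partitioned table on such a column might look like this:
CREATE TABLE invoices_part
( invoice_id   NUMBER       NOT NULL
, account_code VARCHAR2(30) NOT NULL
, delete_date  DATE         NOT NULL   -- derived from the per-account retention rule
)
PARTITION BY RANGE (delete_date)
( PARTITION p_2012_q1 VALUES LESS THAN (DATE '2012-04-01')
, PARTITION p_2012_q2 VALUES LESS THAN (DATE '2012-07-01')
, PARTITION p_future  VALUES LESS THAN (MAXVALUE)
)
ENABLE ROW MOVEMENT;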

Deletes aren't that bad, provided that you rebuild your indexes. Oracle will recover the pages that no longer contain data.
However, as of 8i (and quite possibly in later releases), it would not properly recover index pages that no longer contained valid references. Worse, since the index leaves were chained, you could get into a situation where it would start walking the leaf nodes to find a row. This would cause a rather significant drop in performance: queries that would normally take seconds could take minutes. The drop was also very sudden: one day it would be fine, the next day it wouldn't.
I discovered this behavior (there was an Oracle bug for it, so other people have too) with an application that used increasing keys and regularly deleted data. Our solution was to invert portions of the key, but that's not going to help you with dates.

What if you temporarily deactivate indexes, perform the deletes and then rebuild them? Would it improve the performance of your deletes? Of course, in this case you have to make sure the scripts are correct and ensure proper delete order and referential integrity.
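A minimal sketch of that approach (the index and table names are illustrative; note that indexes enforcing primary key or unique constraints cannot simply be marked unusable like this):
ALTER INDEX invoice_date_ix UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;   -- let DML ignore the unusable index

DELETE FROM invoices WHERE created_date < SYSDATE - 90;
COMMIT;

ALTER INDEX invoice_date_ix REBUILD;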

We have the same problem, using the same strategy.
If the situation becomes really bad (very fragmented allocation of indexes, tables, ...), we try to apply space reclamation actions.
Tables have to allow row movement (as is also required for flashback table):
alter table TTT enable row movement;
alter table TTT shrink space;
and then rebuild all indexes.
I don't know how you are placed for maintenance windows; if the application has to be usable all the time it is harder, but if not you can do some "repacking" while it is offline. "alter table TTT move tablespace SSSS" does a lot of work cleaning up the mess, since the table is rewritten. You can also specify new storage parameters such as extent management and sizes; take a look in the docs.
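For example (the tablespace, table, and index names are illustrative):
alter table TTT move tablespace SSSS storage (initial 1m next 1m);
-- a MOVE rewrites the table and leaves its indexes unusable, so:
alter index TTT_PK rebuild tablespace SSSS;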
I use a script like this to create a script for the whole database:
SET SQLPROMPT "-- "
SET ECHO OFF
SET NEWPAGE 0
SET SPACE 0
SET PAGESIZE 0
SET FEEDBACK OFF
SET HEADING OFF
SET TRIMSPOOL ON
SET TERMOUT OFF
SET VERIFY OFF
SET TAB OFF
spool doit.sql
select 'prompt Enabling row movement in '||table_name||'...'||CHR(10)||'alter table '||table_name||' enable row movement;'
  from user_tables
 where table_name not like '%$%' and table_name not like '%QTAB' and table_name not like 'SYS_%';
select 'prompt Setting initial ext for '||table_name||'...'||CHR(10)||'alter table '||table_name||' move storage (initial 1m);'
  from user_tables
 where table_name not like '%$%' and table_name not like '%QTAB' and table_name not like 'SYS_%';
select 'prompt Shrinking space for '||table_name||'...'||CHR(10)||'alter table '||table_name||' shrink space;'
  from user_tables
 where table_name not like '%$%' and table_name not like '%QTAB' and table_name not like 'SYS_%';
select 'prompt Rebuilding index '||index_name||'...'||CHR(10)||'alter index '||index_name||' rebuild;'
  from user_indexes
 where status = 'UNUSABLE';
spool off
prompt Now check and then run @doit.sql
exit

Related

Delete from temporary tables takes 100% CPU for a long time

I have a pretty complex query where we make use of a temporary table (this is in Oracle running on AWS RDS service).
INSERT INTO TMPTABLE (inserts about 25,000 rows in no time)
SELECT FROM X JOIN TMPTABLE (joins with temp table also in no time)
DELETE FROM TMPTABLE (takes no time in a copy of the production database, up to 10 minutes in the production database)
If I change the delete to a truncate it is as fast as in development.
So I will of course deploy this change, but I would like to understand why this occurs. The AWS team has been quite helpful, but they are a bit biased toward AWS and like to tell me that my 3,000 USD a month database server is not fast enough (I don't think that's it). I am not that fluent in Oracle administration, but I understand that if the redo logs are constantly filling up this can cause issues. I have increased their size quite substantially, but then again, that doesn't really explain the difference.
This is a fairly standard issue when deleting large amounts of data. The delete operation has to process each and every row individually: each row is deleted one at a time, and the change is logged (undo and redo in Oracle, a transaction log entry with an LSN in SQL Server).
TRUNCATE, on the other hand, skips all that and simply deallocates the table's storage.
You'll find this behavior is consistent across RDBMS products. Oracle, MSSQL, PostgreSQL, and MySQL all have the same issue.
I suggest you use an Oracle global temporary table. They are fast, and the rows don't need to be explicitly deleted; they are cleared automatically at commit (or at the end of the session, depending on how the table is defined).
For example:
CREATE GLOBAL TEMPORARY TABLE TMP_T
(
ID NUMBER(32)
)
ON COMMIT DELETE ROWS;
See https://docs.oracle.com/cd/B28359_01/server.111/b28310/tables003.htm#ADMIN11633

Removing bloat from greenplum table

I have created some tables in Greenplum and perform insert, update and delete operations on them. I am also running VACUUM regularly, yet I still found bloat in the tables. I found a solution to remove the bloat: https://discuss.pivotal.io/hc/en-us/articles/206578327-What-are-the-different-option-to-remove-bloat-from-a-table
However, if I truncate the table and reinsert the data, that also removes the bloat. Is it good practice to truncate the data from the table?
If you are performing UPDATE and DELETE statements on a heap table (default storage) and running VACUUM regularly, you will get some bloat by design. Heap storage, which is similar to the default PostgreSQL storage mechanism, provides read consistency using Multi-Version Concurrency Control (MVCC).
When you UPDATE or DELETE a record, the old value is still in the table and is able to be read by transactions that are still inflight and started before you issued the UPDATE or DELETE command. This provides the read consistency to the table.
When you execute a VACUUM statement, the database will mark the stale rows as available to be overwritten. It doesn't shrink the files. It just marks rows so they can be overwritten. The next time you execute an INSERT or UPDATE, the stale rows are now able to be used for the new data.
So if you UPDATE or DELETE 10% of a table between running VACUUM, you will probably have about 10% bloat.
Greenplum also has Append-Optimized (AO) storage, which doesn't use MVCC and uses a visibility map instead. The files are a bit smaller too, so you should get better performance. The stale rows are hidden with the visibility map and VACUUM won't do anything until you hit the gp_appendonly_compaction_threshold percentage. The default is 10%. When you have 10% bloat in an AO table and execute VACUUM, the table will automatically get rebuilt for you.
Append-Optimized is called "appendonly" for backwards compatibility reasons but it does allow UPDATE and DELETE. Here is an example of an AO table:
CREATE TABLE sales
(txn_id int, qty int, date date)
WITH (appendonly=true)
DISTRIBUTED BY (txn_id);
Instead of truncating, it is better to drop the table, re-create it, and then insert the data.
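A sketch of that approach, reusing the sales table above (sales_staging is a hypothetical source for the reloaded data):
DROP TABLE IF EXISTS sales;
CREATE TABLE sales
(txn_id int, qty int, date date)
WITH (appendonly=true)
DISTRIBUTED BY (txn_id);
INSERT INTO sales SELECT txn_id, qty, date FROM sales_staging;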

In Oracle, what dictionary table tells me the "store in" value for partitioned indexes?

We are running Oracle 11g and have some partitioned tables. I am trying to write an automated process to script out the indexes on these tables. (Basically when we do bulk loads, we want to drop all the indexes beforehand and recreate them afterward.)
The problem I have is knowing how to script out the partitioned indexes. Some are created with "LOCAL STORE IN (tablespacename)" and others just with "LOCAL" (which stores index extents in the same partition as the data). In either case, dba_indexes.tablespace_name is null, and I am having a heck of a time scripting out the two different cases correctly.
I know I can simply re-run the original DDL to recreate the indexes, but multiple parts of the organization can make changes, and there would be less risk if the loader tool could be self-contained and simply rebuild whatever was there to begin with.
I can query dba_ind_subpartitions, and if the tablespace_name values for every subpartition all match, then I can assume/infer that I should STORE IN that tablespace name. But, if the table is in a small single-partition state (e.g. newly created or just after archival), then the ones created with just LOCAL also match this test, so this is also not a perfect way of telling them apart.
I can compare the names of the index subpartition tablespaces to the data table partition tablespaces, and if they match, then I can assume/infer that those should be created with just LOCAL. But, that drags a bunch of extra tables into my query and makes it really hard to read, so I am worried about maintainability going forward. Plus, it just seems like a kludge.
It seems like there should be someplace in Oracle's data dictionaries where it is simply keeping track of this, and where I can just directly look it up instead of having to do a bunch of math and rely on assumptions. But, I have done a good deal of digging and haven't yet found it. So, any help would be much appreciated.
Although an insert alone is faster without the presence of indexes, have you benchmarked a load into tables with indexes enabled and established that it is slower than disabling (more robust than dropping!) and rebuilding them?
When you direct path insert into a table with indexes, Oracle optimises the index maintenance process by creating temporary segments to hold just the data required for the index builds. This generally allows the index maintenance to scan much smaller segments than otherwise required -- the temp segments plus the existing indexes.
Well, as jonearles describes, the dbms_metadata package is the way to generate DDL for existing objects.
But, it seems to me, this is more work than is required for what you're trying to achieve. If this is all for loading data, I recommend you simply alter the indexes to be unusable, set 'skip_unusable_indexes=true', do the data load, and then rebuild the indexes.
This should achieve what you want, without having to drop and re-create the indexes.
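A sketch of that sequence (the index name and parallel degree are illustrative):
ALTER INDEX big_table_ix1 UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;   -- already the default from 10g onwards

-- ... perform the bulk load here ...

ALTER INDEX big_table_ix1 REBUILD PARALLEL 4 NOLOGGING;
ALTER INDEX big_table_ix1 NOPARALLEL LOGGING;     -- reset the attributes afterwards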
DBMS_METADATA.GET_DDL is easier than querying the data dictionary:
--Sample table and index.
create table test1(a number);
create index test1_idx on test1(a);
--Store the DDL, drop the index, then re-create it.
declare
  ddl_before clob;
begin
  ddl_before := dbms_metadata.get_ddl('INDEX', 'TEST1_IDX');
  execute immediate 'drop index test1_idx';
  --Do some processing here.
  execute immediate ddl_before;
end;
/

Oracle flashbacks, query for past data

Do you know how exactly querying for past data works?
The Oracle version is 10g.
With this query I can recover some data, but sometimes the query
select *
from table as of timestamp systimestamp - 1
returns an error (snapshot too old).
Is it possible to increase the time window so that this works and I can retrieve data from about 24 hours ago? Thanks!
The key issue here is the sizing of the undo segments, and the undo retention and guarantee.
The long and short of it is that you need your undo tablespace sized to hold all of the changes that can be made within the maximum period that you want to flash back over, and you'd want to set the undo retention parameter to that value. If it is really critical to your application that the undo is preserved, then set the undo guarantee on the undo tablespace.
Useful docs: http://docs.oracle.com/cd/B12037_01/server.101/b10739/undo.htm#i1008577
Be aware that performance of flashback is rather poor for bulk data, as the required undo blocks need to be found in the tablespace. 11g has better options for high performance flashback.
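For example, to target a 24-hour flashback window (the undo tablespace name is illustrative, and the tablespace must also be sized to hold a day's worth of undo):
ALTER SYSTEM SET undo_retention = 86400;        -- 24 hours, in seconds
ALTER TABLESPACE undotbs1 RETENTION GUARANTEE;  -- never overwrite unexpired undo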
What the error means is that the undo (rollback) information the query needs has been overwritten, usually because the query took too long or looks too far back. There are other causes, such as rollback segment sizing.
How many rows are in the table? You can get an idea from this:
select num_rows
from all_tables
where table_name='MYTABLE_NAME_GOES_HERE';
If there are LOTS of rows, you may need to look at adding some kind of index to support your query, because a full table scan takes too long. If not, then it is a DBA issue. Maybe adding an index is a DBA issue in your shop as well.
If this worked well a few days ago, and started happening lately, you probably just passed the threshold for the rollback.

Improving DELETE and INSERT times on a large table that has an index structure

Our application manages a table containing a per-user set of rows that is the
result of a computationally-intensive query. Storing this result in a table
seems a good way of speeding up further calculations.
The structure of that table is basically the following:
CREATE TABLE per_user_result_set
( user_login VARCHAR2(N)
, result_set_item_id VARCHAR2(M)
, CONSTRAINT result_set_pk PRIMARY KEY(user_login, result_set_item_id)
)
;
A typical user of our application will have this result set computed 30 times a
day, with a result set consisting of anywhere between a single item and 500,000 items.
A typical customer will declare about 500 users into the production database.
So, this table will typically consist of 5 million rows.
The typical query that we use to update this table is:
BEGIN
DELETE FROM per_user_result_set WHERE user_login = :x;
INSERT INTO per_user_result_set(...) SELECT :x, ... FROM ...;
END;
/
After running into performance issues (the DELETE part would take a long time),
we decided to use a GLOBAL TEMPORARY TABLE (on commit delete rows) to hold a
“delta” of the rows to remove from the table and the rows to insert into it:
BEGIN
INSERT INTO _tmp
SELECT ... FROM ...
MINUS SELECT result_set_item_id
FROM per_user_result_set
WHERE user_login = :x;
DELETE FROM per_user_result_set
WHERE user_login = :x
AND result_set_item_id NOT IN (SELECT result_set_item_id
FROM _tmp
);
INSERT INTO per_user_result_set
SELECT :x, result_set_item_id
FROM _tmp;
COMMIT;
END;
/
This has improved performance a bit, but still this is not satisfactory. So
we're exploring ways to speed up that process and here are the issues that
we experience:
We would have loved to use table partitioning (partitioning by user_login).
But partitioning is not always available (on our test databases we hit
ORA-00439). Our customers cannot all afford Oracle Enterprise Edition with
paid additional features.
We could make the per_user_result_set table GLOBAL TEMPORARY, so that it
is isolated and we can TRUNCATE it for example… but our application
sometimes loses connection to Oracle due to network problems, and will
automatically reconnect. By that time we lose the contents of our
computation.
We could split that table into a certain number of buckets, create a view that
UNION ALLs all those buckets, add INSTEAD OF UPDATE and DELETE triggers on that
view, and distribute rows according to ORA_HASH(user_login) % num_buckets.
But we are afraid this could make SELECT operations much slower.
It would give us a constant number of tables, with smaller indexes
affected by DELETE or INSERT operations. In short, “partitioning for the
poor”.
We've tried to ALTER TABLE per_user_result_set NOLOGGING. This does not
improve things much.
We've tried to CREATE TABLE ... ORGANIZATION INDEX COMPRESS 1. This speeds
things up by a ratio of 1:5.
We've tried to have one table per user_login. That's exactly what we could
have by partitioning using a number of partitions equal to the number of
distinct user_logins and a well-chosen hash function. Performance factor is
1:10. But I would really like to avoid this solution: having to maintain a
huge number of indexes, tables, and views on a per-user basis. This would be
an interesting performance gain for the users, but not for us maintainers of
the systems.
Since the users work at the same time, there is no way for us to create a new
table and swap it with the old one.
What could you please suggest in complement to these approaches?
Note: our customers run Oracle databases from 9i to 11g, and editions from XE to
Enterprise Edition. That's a wide variety of versions and editions that we need
to be compatible with.
Thanks.
We've tried to have one table per user_login. That's exactly what we
could have by partitioning using a number of partitions equal to the
number of distinct user_logins and a well-chosen hash function.
Performance factor is 1:10. But I would really like to avoid this
solution: having to maintain a huge number of indexes, tables, and views on
a per-user basis. This would be an interesting performance gain for
the users, but not for us maintainers of the systems.
Can you then write a stored procedure to generate these tables on a per-user basis? Or, better yet, have that stored procedure do the most appropriate thing depending on the Oracle edition and licensed options in use?
If Partitioning option
then create or truncate user-specific list partition
Else
drop user-specific result table
Create user-specific result table
as Select from template result table
create indexes
create constraints
perform grants
end if
Perform insert
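A hedged PL/SQL sketch of that branching logic follows; the table, partition, and template names are hypothetical, and querying V$OPTION assumes the schema has been granted access to it:
DECLARE
  l_user         VARCHAR2(30) := 'SOME_USER';   -- the user_login being refreshed
  l_partitioning VARCHAR2(64);
BEGIN
  SELECT value
    INTO l_partitioning
    FROM v$option
   WHERE parameter = 'Partitioning';

  IF l_partitioning = 'TRUE' THEN
    -- Partitioning is licensed: empty this user's list partition.
    EXECUTE IMMEDIATE
      'ALTER TABLE per_user_result_set TRUNCATE PARTITION p_' || l_user;
  ELSE
    -- No partitioning: drop and re-create a per-user table from a template.
    BEGIN
      EXECUTE IMMEDIATE 'DROP TABLE result_set_' || l_user;
    EXCEPTION
      WHEN OTHERS THEN NULL;   -- ignore "table does not exist"
    END;
    EXECUTE IMMEDIATE
      'CREATE TABLE result_set_' || l_user ||
      ' AS SELECT * FROM result_set_template WHERE 1 = 0';
    -- indexes, constraints and grants would be re-created here
  END IF;

  -- ... then perform the insert for this user ...
END;
/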
If all your users were on 11g Enterprise Edition I would recommend you to use Oracle's built-in result-set caching rather than trying to roll your own. But that is not the case, so let's move on.
Another attractive option might be to use PL/SQL collections rather than tables. Being in memory these are faster to retrieve and require less maintenance. They are also supported in all the versions you need. However, they are session variables, so if you have lots of users with big result sets that would put stress on your PGA allocations. Also their data would be lost when the network connection drops. So that's probably not the solution you're looking for.
The core of your problem is this statement:
DELETE FROM per_user_result_set WHERE user_login = :x;
It's not a problem in itself but you have extreme variations in data distribution. Bluntly, the deletion of a single row is going to have a very different performance profile from the deletion of half a million rows. And because your users are constantly refreshing their data there is no way you can handle that, except by giving your users their own tables.
You say you don't want to have a table per user because
"[it] would be an interesting performance gain for the users, but not
for us maintainers of the systems,"
Systems exist for the benefit of our users. Convenience for us is great as long as it helps us to provide better service to them. But their need for a good working experience trumps ours: they pay the bills.
But I question whether having individual tables for each user really increases the work load. I presume each user has their own account, and hence schema.
I suggest you stick with index-organized tables. You only need columns which are in the primary key and maintaining a separate index is unnecessary overhead (for both inserting and deleting). The big advantage of having a table per user is that you can use TRUNCATE TABLE in the refresh process, which is a lot faster than deletion.
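For instance, the per-user IOT could be as simple as this (a sketch; it lives in each user's own schema, so no user_login column is needed, and the VARCHAR2 length stands in for your real one):
CREATE TABLE per_user_result_set
( result_set_item_id VARCHAR2(100)
, CONSTRAINT per_user_result_set_pk PRIMARY KEY (result_set_item_id)
)
ORGANIZATION INDEX;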
So your refresh procedure will look like this:
BEGIN
EXECUTE IMMEDIATE 'TRUNCATE TABLE per_user_result_set REUSE STORAGE';  -- TRUNCATE is DDL, so it must go through EXECUTE IMMEDIATE inside PL/SQL
INSERT INTO per_user_result_set(...)
SELECT ... FROM ...;
DBMS_STATS.GATHER_TABLE_STATS(user
, 'PER_USER_RESULT_SET'
, estimate_percent=>10);
COMMIT;
END;
/
Note that you don't need to include the user column any more, so your table will just have the single column result_set_item_id (another indication of the suitability of an IOT).
Gathering the table stats isn't mandatory but it is advisable. You have a wide variability in the size of result sets, and you don't want to be using an execution plan devised for 500000 rows when the table has only one row, or vice versa.
The only overhead is the need to create the table in the user's schema. But presumably you already have some set-up for a new user - creating the account, granting privileges, etc - so this shouldn't be a big hardship.
