I have tried to find examples, but they are all simple, with a single WHERE clause. Here is the situation. I have a bunch of legacy data transferred from another database. I also have the "good" tables in that same database. I need to transfer (data-conversion) data from the legacy tables to the new tables. Because this is a different set of tables, the data conversion requires complex joins to put the old data into the new tables correctly.
So, old tables old data.
New tables must have the old data but it requires lots of joins to get that old data into the new tables correctly.
Can I use direct path with lots of joins like this? INSERT SELECT (lots of joins)
Does direct path apply to tables that are already on the same database (transfer between tables)? Is it only for loading tables from say a text file?
Thank you.
The query in your SELECT can be as complex as you'd like with a direct-path insert. The direct-path refers only to the destination table. It has nothing to do with the way that data is read or processed.
If you're doing a direct-path insert, you're asking Oracle to insert the new data above the high water mark of the table so you bypass the normal code that reuses space in existing blocks for new rows to be inserted. It also has to block other inserts since you can't have the high water mark of the table change during a direct-path insert. This probably isn't a big deal if you've got a downtime window in which to do the load but it would be quite problematic if you wanted the existing tables to be available for other applications during the load.
No, on the contrary, it means you need to do a backup after a NOLOGGING load, not that you can't backup the database.
Allow me to elaborate a bit. Normally, when you do DML in Oracle, the before images of the changes you are making get logged in UNDO, and all the changes (including the UNDO changes) are first written to REDO. This is how Oracle manages transactions, instance recovery, and database recovery. If a transaction is aborted or rolled back, Oracle uses the information in UNDO to undo the changes your transaction made. If the instance crashes, then on instance restart, Oracle will use the information in REDO and UNDO to recover up to the last committed transaction. First, Oracle will read the REDO and roll forward, then use UNDO to roll back all the transactions that were not committed at the time of the crash. In this way, Oracle is able to recover up to the last committed transaction.
Now, when you specify an APPEND hint on an insert statement, Oracle will execute the INSERT with direct load. This means that data is loaded into brand new, never before used blocks, above the high water mark. Because the blocks being loaded are brand new, there is no "before image", so Oracle can avoid writing UNDO, which improves performance. If the database is in NOARCHIVELOG mode, then Oracle will also not write REDO. On a database in ARCHIVELOG mode, Oracle will still write REDO, unless, before you do the insert /*+ append */, you set the table to NOLOGGING (i.e. alter table tab_name nologging;). In that case, REDO logging is disabled for the table. However, this is where you could run into backup/recovery implications. If you do a NOLOGGING direct load, and then you suffer a media failure, and the datafile containing the segment with the nologging operation is restored from a backup taken before the nologging load, then the redo log will not contain the changes required to recover that segment. So, what happens? Well, when you do a NOLOGGING load, Oracle writes extent invalidation records to the redo log, instead of the actual changes. Then, if you use that redo in recovery, those data blocks will be marked logically corrupt. Any subsequent queries against that segment will get an ORA-26040 error.
So, how to avoid this? Well, you should always take a backup immediately following any NOLOGGING direct load. If you restore/recover from a backup taken after the nologging load, there is no problem, because the data will be in the data blocks in the file that was restored.
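One way to see which datafiles have received unlogged changes is to look at the unrecoverable columns of V$DATAFILE. This is only a sketch: it assumes your account can read the V$ views, and comparing the times against your own backup history is left to you.

```sql
-- Datafiles that have had NOLOGGING (unrecoverable) operations.
-- If unrecoverable_time is later than the file's most recent backup,
-- that file should be backed up again.
select file#,
       name,
       unrecoverable_change#,
       unrecoverable_time
from   v$datafile
where  unrecoverable_time is not null
order  by unrecoverable_time desc;
```

If you back up with RMAN, its REPORT UNRECOVERABLE command gives a similar list.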
Hope that's clear,
-Mark
Yes, there should not be any arbitrary limits on query complexity.
If you do
insert /*+ APPEND */ into target_table select .... from source1, source2..., sourceN where
It should work fine. Consider though, that the performance of the load will be limited by the performance of that query, so, be sure it's well-tuned, if you're expecting good performance.
Finally, consider whether setting NOLOGGING on the target table would improve performance significantly. But, also consider the backup recovery implications, if you decide to implement NOLOGGING.
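As a rough sketch of how the pieces fit together, the load could look something like the following. All table and column names here (target_table, legacy_orders, legacy_customers and their columns) are made up for illustration.

```sql
-- Hypothetical names throughout; adapt to your own schema.
alter table target_table nologging;

insert /*+ APPEND */ into target_table (order_id, customer_name, amount)
select o.order_id, c.customer_name, o.amount
from   legacy_orders o
join   legacy_customers c on c.customer_id = o.customer_id;

-- After a direct-path insert, this session cannot read or modify
-- target_table again until the transaction ends (ORA-12838),
-- so commit before checking the results.
commit;

alter table target_table logging;
-- Back up the affected datafiles after this point.
```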
Hope that helps,
-Mark
I have created some tables in Greenplum and am performing insert, update, and delete operations on them. I am also running VACUUM regularly. I found bloat in the tables, and found a solution for removing it: https://discuss.pivotal.io/hc/en-us/articles/206578327-What-are-the-different-option-to-remove-bloat-from-a-table
However, if I truncate the table and reinsert the data, it removes bloat. Is it good practice to truncate the data from the table?
If you are performing UPDATE and DELETE statements on a heap table (default storage) and running VACUUM regularly, you will get some bloat by design. Heap storage, which is similar to the default PostgreSQL storage mechanism, provides read consistency using Multi-Version Concurrency Control (MVCC).
When you UPDATE or DELETE a record, the old value is still in the table and is able to be read by transactions that are still inflight and started before you issued the UPDATE or DELETE command. This provides the read consistency to the table.
When you execute a VACUUM statement, the database will mark the stale rows as available to be overwritten. It doesn't shrink the files. It just marks rows so they can be overwritten. The next time you execute an INSERT or UPDATE, the stale rows are now able to be used for the new data.
So if you UPDATE or DELETE 10% of a table between running VACUUM, you will probably have about 10% bloat.
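To see how much bloat a heap table has actually accumulated, something along these lines may help. It is a sketch that assumes the gp_toolkit schema is available in your Greenplum version; my_heap_table is a hypothetical table name.

```sql
-- my_heap_table is a hypothetical name.
ANALYZE my_heap_table;                    -- the bloat views rely on current statistics
SELECT * FROM gp_toolkit.gp_bloat_diag;   -- lists tables with noticeable bloat
VACUUM my_heap_table;                     -- marks dead rows reusable; the files do not shrink
```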
Greenplum also has Append-Optimized (AO) storage, which doesn't use MVCC and uses a visibility map instead. The files are a bit smaller too, so you should get better performance. The stale rows are hidden with the visibility map, and VACUUM won't do anything until you hit the gp_appendonly_compaction_threshold percentage. The default is 10%. When you have 10% bloat in an AO table and execute VACUUM, the table will automatically get rebuilt for you.
Append-Optimized is called "appendonly" for backwards compatibility reasons but it does allow UPDATE and DELETE. Here is an example of an AO table:
CREATE TABLE sales
(txn_id int, qty int, date date)
WITH (appendonly=true)
DISTRIBUTED BY (txn_id);
Instead of truncating, it is better to drop the table, create it again, and then insert the data.
What happens if I don't specify LOGGING/NOLOGGING on database objects in Oracle? In other words, how do database objects behave with LOGGING/NOLOGGING specified, and how do they behave without it?
LOGGING/NOLOGGING helps manage enabling direct path writes in order to reduce the generation of REDO and UNDO. It is one of several ways to control the delicate balance between recoverability and performance.
Oracle Architecture Background Information
REDO is how Oracle provides durability, the "D" in ACID. When a transaction is committed the changes are not necessarily stored neatly in the datafiles. That keeps things fast and lets background processes handle some work. REDO is a description of the change. It is stored quickly, on multiple disks, in a "dumb" log. Changes are fast and if the server loses power one microsecond after the commit returned, Oracle can go through the REDO logs to make sure that change isn't lost.
UNDO helps Oracle provide consistency, the "C" in ACID. It stores a description of how to reverse the change. This information may be needed by another process that's reading the table and needs to know what the value used to be at an older point-in-time.
Direct path writes skip REDO, UNDO, the cache, and some other features, and directly modify data files. This is a fast but potentially dangerous option in many environments, which is why there are so many confusing options to control it. Direct path writes only apply to INSERTS, and only in the scenarios described below.
If you do nothing the default option is the safest, LOGGING.
The Many Ways to Control Direct Path Writes
LOGGING/NOLOGGING is one of several options to control direct path writes. Look at this table from AskTom to understand how the different options all work together:
Table Mode   Insert Mode   ArchiveLog Mode       Result
-----------  ------------  --------------------  --------------
LOGGING      APPEND        ARCHIVE LOG           redo generated
NOLOGGING    APPEND        ARCHIVE LOG           no redo
LOGGING      no append     ARCHIVE LOG           redo generated
NOLOGGING    no append     ARCHIVE LOG           redo generated
LOGGING      APPEND        noarchive log mode    no redo
NOLOGGING    APPEND        noarchive log mode    no redo
LOGGING      no append     noarchive log mode    redo generated
NOLOGGING    no append     noarchive log mode    redo generated
FORCE LOGGING can override all those settings. There are probably some other switches I'm not aware of. And of course there are the many limitations that prevent direct path - triggers, foreign keys, cluster, index organized tables, etc.
The rules are even more restrictive for indexes. An index will always generate REDO during DML statements. Only DDL statements, like CREATE INDEX ... NOLOGGING or ALTER INDEX ... REBUILD on a NOLOGGING index will not generate REDO.
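For illustration, the index DDL mentioned above might look like this. The table, column, and index names are hypothetical.

```sql
-- Hypothetical names: big_table, col1, big_table_ix.
create index big_table_ix on big_table (col1) nologging;   -- the build itself skips most redo

alter index big_table_ix rebuild nologging;                -- so does a later rebuild

-- Ordinary DML against big_table still generates redo for the index.
-- Switch the attribute back if the index should be fully logged from now on.
alter index big_table_ix logging;
```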
Why are there so many ways? Because recoverability is incredibly important and different roles may have different views on the matter. And sometimes some people's decisions need to override others.
Developers decide at the statement level, "Insert Mode". Many weird things can happen with an /*+ APPEND */ hint and developers need to choose carefully when to use it.
Architects decide at the object level, "Table Mode". Some tables, regardless of how fast a developer may want to insert into it, must always be recoverable.
Database Administrators decide at the database or tablespace mode, "Archive log" and FORCE LOGGING. Maybe the organization just doesn't care about recovering a specific database, so set it to NOARCHIVELOG mode. Or maybe the organization has a strict rule that everything must be recoverable, so set the tablespace to FORCE LOGGING.
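To find out what is currently set at each of those levels, you can query the data dictionary. A minimal sketch, assuming you have access to the V$ and DBA_ views:

```sql
-- Database level: archive log mode and FORCE LOGGING
select log_mode, force_logging from v$database;

-- Tablespace level
select tablespace_name, logging, force_logging from dba_tablespaces;

-- Object level ("Table Mode"), for the current schema
select table_name, logging from user_tables;
select index_name, logging from user_indexes;
```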
If you have a table/index with nologging, then redo will not be generated when data is inserted into the object using direct path approaches such as insert /*+ append */.
However, if the database is in force logging mode, then nologging will not have any effect. Redo is generated whether the table/index is in logging or nologging mode.
If the nologging option is set, redo won't be generated for direct-path inserts into the object. You can use this to significantly increase the performance of, for example, an INSERT statement that loads a large amount of data.
Be careful never to use the nologging option in a Data Guard setup. Database replication relies on the redo logs, so it will create a pretty big mess that you certainly want to avoid.
I need to delete a large amount of data from my database on a regular basis. The process generates huge volume of archive logs. We had a database crash at one point because there was no storage space available on archive destination. How can I avoid generation of logs while I delete data?
The data to be deleted is already marked as inactive in the database. Application code ignores inactive data. I do not need the ability to rollback the operation.
I cannot partition the data in such a way that inactive data falls in one partition that can be dropped. I have to delete the data with delete statements.
I can ask DBAs to set certain configuration at table level/schema level/tablespace level/server level if needed.
I am using Oracle 11g.
What proportion of the data on the table would be deleted, what volume? Are there any referential integrity constraints to manage or is this table childless?
Depending on the answers, you might consider:
CREATE TABLE keep_data UNRECOVERABLE AS SELECT * FROM ... WHERE [keep condition]
Then drop the original table
Then rename keep_data to the original table name
Rebuild the indexes (again with unrecoverable to prevent redo), constraints, etc.
The problem with this approach is that it's a multi-step DDL process, which you will have a job to make fault tolerant and reversible.
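Spelled out, those steps could look roughly like the sketch below. The names big_table, id, and status are hypothetical, and NOLOGGING is used in place of the older UNRECOVERABLE keyword, which it superseded.

```sql
-- Hypothetical names: big_table, keep_data, id, status.
create table keep_data nologging
as select * from big_table where status <> 'INACTIVE';

drop table big_table;
rename keep_data to big_table;

-- CTAS does not copy indexes, most constraints, grants, or triggers,
-- so they have to be recreated by hand.
create unique index big_table_pk on big_table (id) nologging;
alter table big_table
  add constraint big_table_pk primary key (id) using index big_table_pk;

-- Return the table to normal logging once the swap is done.
alter table big_table logging;
```

Because each DDL statement commits implicitly, a failure between the DROP and the RENAME leaves you without the original table, which is the fault-tolerance concern raised above.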
A safer option might be to use data-pump to:
Data-pump expdp to extract the "Keep" data
TRUNCATE the table
Data-pump impdp import of data from step 1, with direct-path
At this point I suggest you read the Oracle manual on Data Pump, particularly the section on Direct Path Loads to be sure this will work for you.
My preferred option would be partitioning.
Of course, the best way would be TenG's solution (CTAS, drop and rename table), but it seems that's impossible for you.
Your only problem is the volume of archive logs and the resulting database crash. In this case, maybe you could split your delete into batches (for example, 10,000 rows at a time).
Something like:
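declare
  e number; -- total number of rows matching the delete condition
  f number; -- number of batches needed
begin
  select count(*) into e from myTable where [delete condition];
  f := trunc(e/10000) + 1;
  for i in 1 .. f
  loop
    -- delete at most 10,000 rows per pass, then commit
    delete from myTable where [delete condition] and rownum <= 10000;
    commit;
    dbms_lock.sleep(600); -- purge old archive if it's possible
  end loop;
end;
/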
After this operation, you should reorganize your table which is surely fragmented.
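Two common ways to reorganize the table afterwards are sketched below. The index name myTable_idx is hypothetical, and SHRINK SPACE assumes the table lives in a tablespace with automatic segment space management.

```sql
-- Option 1: rebuild the segment. Indexes become UNUSABLE and must be rebuilt.
alter table myTable move;
alter index myTable_idx rebuild;   -- myTable_idx is a hypothetical index name

-- Option 2: shrink in place (requires row movement to be enabled).
alter table myTable enable row movement;
alter table myTable shrink space;
```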
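Alter the table to set NOLOGGING, delete the rows, then turn logging back on. Bear in mind, though, that NOLOGGING only affects direct-path operations, so a conventional DELETE will still generate redo.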
How can I verify an Oracle database rollback action is successful? Can I use Number of rows in activity log and Number of rows in event log?
V$TRANSACTION does not contain historical information but it does contain information about all active transactions. In practice this is often enough to quickly and easily monitor rollbacks and estimate when they will complete.
Specifically the columns USED_UBLK and USED_UREC contain the number of UNDO blocks and records remaining. USED_UREC is not always the same as the number of rows; sometimes the number is higher because it includes index entries and sometimes the number is lower because it groups inserts together.
During a long rollback those numbers will decrease until they hit 0. If V$TRANSACTION returns no rows, the transactions have either committed or finished rolling back. Below is a simple example.
create table table1(a number);
create index table1_idx on table1(a);
insert into table1 values(1);
insert into table1 values(1);
insert into table1 values(1);
select used_ublk, used_urec, ses_addr from v$transaction;
USED_UBLK  USED_UREC  SES_ADDR
---------  ---------  ----------------
        1          6  000007FF1C5A8EA0
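To tell which session each transaction belongs to, you can join on SES_ADDR. A small sketch; run it repeatedly during a rollback and watch the numbers fall:

```sql
-- Map each active transaction to its session.
select s.sid,
       s.serial#,
       s.username,
       t.used_ublk,
       t.used_urec
from   v$transaction t
join   v$session s on s.saddr = t.ses_addr;
```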
Oracle LogMiner, which is part of Oracle Database, enables you to query online and archived redo log files through a SQL interface. Redo log files contain information about the history of activity on a database.
LogMiner Benefits
All changes made to user data or to the database dictionary are
recorded in the Oracle redo log files so that database recovery
operations can be performed.
Because LogMiner provides a well-defined, easy-to-use, and
comprehensive relational interface to redo log files, it can be used
as a powerful data audit tool, as well as a tool for sophisticated
data analysis. The following list describes some key capabilities of
LogMiner:
Pinpointing when a logical corruption to a database, such as errors
made at the application level, may have begun. These might include
errors such as those where the wrong rows were deleted because of
incorrect values in a WHERE clause, rows were updated with incorrect
values, the wrong index was dropped, and so forth. For example, a user
application could mistakenly update a database to give all employees
100 percent salary increases rather than 10 percent increases, or a
database administrator (DBA) could accidently delete a critical system
table. It is important to know exactly when an error was made so that
you know when to initiate time-based or change-based recovery. This
enables you to restore the database to the state it was in just before
corruption. See Querying V$LOGMNR_CONTENTS Based on Column Values
for details about how you can use LogMiner to accomplish this.
Determining what actions you would have to take to perform
fine-grained recovery at the transaction level. If you fully
understand and take into account existing dependencies, it may be
possible to perform a table-specific undo operation to return the
table to its original state. This is achieved by applying
table-specific reconstructed SQL statements that LogMiner provides in
the reverse order from which they were originally issued. See
Scenario 1: Using LogMiner to Track Changes Made by a Specific
User for an example.
Normally you would have to restore the table to its previous state,
and then apply an archived redo log file to roll it forward.
Performance tuning and capacity planning through trend analysis. You
can determine which tables get the most updates and inserts. That
information provides a historical perspective on disk access
statistics, which can be used for tuning purposes. See Scenario 2:
Using LogMiner to Calculate Table Access Statistics for an
example.
Performing postauditing. LogMiner can be used to track any data
manipulation language (DML) and data definition language (DDL)
statements executed on the database, the order in which they were
executed, and who executed them. (However, to use LogMiner for such a
purpose, you need to have an idea when the event occurred so that you
can specify the appropriate logs for analysis; otherwise you might
have to mine a large number of redo log files, which can take a long
time. Consider using LogMiner as a complementary activity to auditing
database use. See the Oracle Database Administrator's Guide for
information about database auditing.)
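A minimal LogMiner session might look like the sketch below. The log file path and the HR.EMPLOYEES filter are placeholders, and it assumes your account has the privileges needed to run DBMS_LOGMNR.

```sql
-- The file name is a placeholder; use a real online or archived redo log.
begin
  dbms_logmnr.add_logfile(
    logfilename => '/path/to/archived_redo.arc',
    options     => dbms_logmnr.new);
  -- Use the online catalog as the dictionary for translating object names.
  dbms_logmnr.start_logmnr(
    options => dbms_logmnr.dict_from_online_catalog);
end;
/

-- Reconstructed statements, plus the SQL needed to undo each of them.
select scn, timestamp, username, operation, sql_redo, sql_undo
from   v$logmnr_contents
where  seg_owner = 'HR'
and    seg_name  = 'EMPLOYEES';

begin
  dbms_logmnr.end_logmnr;
end;
/
```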
Enjoy.
I need to perform a query 2.5 million times. This query generates some rows which I need to AVG(column) and then use this AVG to filter the table from all values below average. I then need to INSERT these filtered results into a table.
The only way to do such a thing with reasonable efficiency, seems to be by creating a TEMPORARY TABLE for each query-postmaster python-thread. I am just hoping these TEMPORARY TABLEs will not be persisted to hard drive (at all) and will remain in memory (RAM), unless they are out of working memory, of course.
I would like to know if a TEMPORARY TABLE will incur disk writes (which would interfere with the INSERTs, i.e. slow the whole process down).
Please note that, in Postgres, the default behaviour for temporary tables is that they are not automatically dropped, and data is persisted on commit. See ON COMMIT.
Temporary tables are, however, dropped at the end of a database session:
Temporary tables are automatically dropped at the end of a session, or
optionally at the end of the current transaction.
There are multiple considerations you have to take into account:
If you do want to explicitly DROP a temporary table at the end of a transaction, create it with the CREATE TEMPORARY TABLE ... ON COMMIT DROP syntax.
In the presence of connection pooling, a database session may span multiple client sessions; to avoid clashes in CREATE, you should drop your temporary tables -- either prior to returning a connection to the pool (e.g. by doing everything inside a transaction and using the ON COMMIT DROP creation syntax), or on an as-needed basis (by preceding any CREATE TEMPORARY TABLE statement with a corresponding DROP TABLE IF EXISTS, which has the advantage of also working outside transactions e.g. if the connection is used in auto-commit mode.)
While the temporary table is in use, how much of it will fit in memory before overflowing on to disk? See the temp_buffers option in postgresql.conf
Anything else I should worry about when working often with temp tables? A vacuum is recommended after you have DROPped temporary tables, to clean up any dead tuples from the catalog. With the default settings, the autovacuum daemon will do this for you periodically (see autovacuum_naptime, which defaults to one minute).
Also, unrelated to your question (but possibly related to your project): keep in mind that, if you have to run queries against a temp table after you have populated it, then it is a good idea to create appropriate indices and issue an ANALYZE on the temp table in question after you're done inserting into it. By default, the cost based optimizer will assume that a newly created temp table has ~1000 rows, and this may result in poor performance should the temp table actually contain millions of rows.
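Putting these points together, one iteration of the workload could be sketched as follows. The names measurements, results, run_id, and value are hypothetical, and the 64MB figure is only an example.

```sql
-- Hypothetical names: measurements, results, run_id, value.
SET temp_buffers = '64MB';   -- must be set before the session first uses a temp table

BEGIN;

CREATE TEMPORARY TABLE tmp_rows ON COMMIT DROP AS
SELECT id, value
FROM   measurements
WHERE  run_id = 42;

ANALYZE tmp_rows;            -- give the planner real row counts

-- Keep only the rows at or above the average.
INSERT INTO results (id, value)
SELECT id, value
FROM   tmp_rows
WHERE  value >= (SELECT avg(value) FROM tmp_rows);

COMMIT;                      -- tmp_rows is dropped here
```

Because the table is dropped at commit, the same pooled connection can run the next iteration without a name clash.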
Temporary tables provide only one guarantee - they are dropped at the end of the session. For a small table you'll probably have most of your data in the backing store. For a large table I guarantee that data will be flushed to disk periodically as the database engine needs more working space for other requests.
EDIT:
If you're absolutely in need of RAM-only temporary tables, you can create a tablespace for your database on a RAM disk (/dev/shm works). This reduces the amount of disk IO, but beware that it is currently not possible to do this without a physical disk write; the DB engine will flush the table list to stable storage when you create the temporary table.
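A sketch of that setup, assuming a directory /dev/shm/pgtemp that already exists, is empty, and is owned by the PostgreSQL operating-system user. The tablespace name ramdisk is made up.

```sql
-- Requires superuser; cannot be run inside a transaction block.
CREATE TABLESPACE ramdisk LOCATION '/dev/shm/pgtemp';

-- Route temporary tables (and temporary sort files) for this session to it.
SET temp_tablespaces = 'ramdisk';

CREATE TEMPORARY TABLE tmp_in_ram (id int, value numeric);
```

Keep in mind that the contents of /dev/shm vanish on reboot, so only throwaway temporary objects should ever be placed in such a tablespace.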