INSERT: when should I not be using the APPEND hint? - oracle

I'm trying to insert batches of data in an Oracle table, with an INSERT statement, namely:
INSERT INTO t1 SELECT * FROM all_objects;
I've come across the APPEND hint, which seems to increase performance in some cases.
Are there situations where it might decrease performance and it should not be used?
Thanks

When specified, the APPEND hint performs a direct-path insert, the same direct-path mechanism used by SQL*Loader. For large data sets, you should see dramatic improvements.
One of the main caveats you need to be aware of is that one of the reasons it is so fast is that it inserts all new rows past the high water mark. This means that if you are frequently deleting rows and re-inserting, a conventional insert could potentially be better than a direct path because it will reclaim the freed space from the deleted rows.
If, for example, you had a table with 5 million rows where you did a delete from followed by a direct path insert, after a few iterations you would notice things slow to a crawl. The insert itself would continue to be nice and speedy, but your queries against the table will gradually get worse.
The only way I know of to reset the HWM is to truncate the table. If you plan to use direct path on a table with minimal dead rows, or if you are going to somehow reset the HWM, then I think in most cases it will be fine -- preferable, in fact, if you are inserting large amounts of data.
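To make the mechanics concrete, here is a minimal sketch (using the table names from the question) of a direct-path insert and the truncate-based HWM reset I mentioned:

```sql
-- Direct-path insert: rows are written above the high water mark.
INSERT /*+ APPEND */ INTO t1
SELECT * FROM all_objects;

-- Oracle requires a commit before the table can be queried again.
COMMIT;

-- Resetting the high water mark before a full reload:
TRUNCATE TABLE t1;
```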
Here's a nice article that explains the details of the differences:
https://sshailesh.wordpress.com/2009/05/03/conventional-path-load-and-direct-path-load-simple-to-use-in-complex-situations/
A final parting shot -- with all Oracle hints, know everything you can before you use them. Using them haphazardly can be haphazard to your health.

I think performance may be decreased in the special case where your select retrieves only one or a small number of rows. In that case I would not use the APPEND hint. The OracleBase article describes the impact of the APPEND hint very well, and it also provides a link to the manual page.
There are 3 different situations:
The APPEND hint will have no effect because it is silently ignored. This happens if a trigger or a referential constraint is defined on the table, among other circumstances.
The APPEND hint will raise an error message, or a statement following the one with the APPEND hint will raise an error message. Here you have two possibilities: either remove the APPEND hint, or split the transaction into two or more separate transactions.
The APPEND hint will work. Here you will get better performance (except when you have only a small number of rows to insert, as stated at the beginning). But you will also need more space: the insert writes to new extents rather than filling the free space in existing extents. If you do a parallel insert, each process uses its own extents. This can result in a lot of unused space and be a drawback in some situations.
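As an illustration of the second situation: after a direct-path insert, the table cannot be read again in the same transaction, so it is the follow-up statement that fails:

```sql
INSERT /*+ APPEND */ INTO t1 SELECT * FROM all_objects;

-- This next statement raises
-- ORA-12838: cannot read/modify an object after modifying it in parallel
SELECT COUNT(*) FROM t1;

-- Committing (or rolling back) first makes the query legal again:
COMMIT;
SELECT COUNT(*) FROM t1;
```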

It might negatively affect performance if you are using it for inserting small sets of data.
That's because it allocates new space every time instead of reusing free space, so using it with multiple small sets can fragment your table, which may result in performance issues.
That hint is a good idea for large inserts scheduled for times where usage is low.


Data Types and Indexes

Is there some sort of performance difference for inserting, updating, or deleting data when you use the TEXT data type?
I went here and found this:
Tip: There is no performance difference among these three types, apart
from increased storage space when using the blank-padded type, and a
few extra CPU cycles to check the length when storing into a
length-constrained column. While character(n) has performance
advantages in some other database systems, there is no such advantage
in PostgreSQL; in fact character(n) is usually the slowest of the
three because of its additional storage costs. In most situations text
or character varying should be used instead.
This makes me believe there should not be a performance difference, but my friend, who is much more experienced than I am, says inserts, updates, and deletes are slower for the TEXT data type.
I had a table that was partitioned with a trigger and function, and extremely heavily indexed, but the inserts did not go that slow.
Now I have another table, with 5 more columns all of which are text data type, the same exact trigger and function, no indexes, but the inserts are terribly slow.
From my experience, I think he is correct, but what do you guys think?
Edit #1:
I am uploading the same exact data, just the 2nd version has 5 more columns.
Edit #2:
By "Slow" I mean with the first scenario, I was able to insert 500 or more rows per second, but now I can only insert 20 rows per second.
Edit #3: I didn't add the indexes to the 2nd scenario like they are in the 1st scenario because indexes are supposed to slow down inserts, updates, and deletes, from my understanding.
Edit #4: I guarantee it is exactly the same data, because I'm the one uploading it. The only difference is, the 2nd scenario has 5 additional columns, all text data type.
Edit #5: Even when I removed all of the indexes on scenario 2 and left all of them on scenario 1, the inserts were still slower on scenario 2.
Edit #6: Both scenarios have the same exact trigger and function.
Edit #7:
I am using an ETL tool, Pentaho, to insert the data, so there is no way for me to show you the code being used to insert the data.
I think I might have had too many transformation steps in the ETL tool. When I tried to insert data in the same transformation as the steps that actually transform the data, it was massively slow, but when I simply inserted the already-transformed data into an empty table and then inserted data from this table into the actual table I'm using, the inserts were much faster than scenario 1, at 4000 rows per second.
The only difference between scenario 1 and scenario 2, other than the increase in columns in scenario 2, is the number of steps in the ETL transformation. Scenario 2 has about 20 or more steps in the ETL transformation; in some cases, there are 50 or more.
I think I can solve my problem by reducing the number of transformation steps, or putting the transformed data into an empty table and then inserting the data from this table into the actual table I'm using.
PostgreSQL text and character varying are the same, with the exception of the (optional) length limit for the latter. They will perform identically.
The only reasons to prefer character varying are
you want to impose a length limit
you want to conform with the SQL standard
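A short sketch (hypothetical table) of the only practical difference between the two types:

```sql
CREATE TABLE demo (
    t text,            -- no length limit
    v varchar(100)     -- identical storage and performance, plus a length check
);

INSERT INTO demo VALUES (repeat('x', 200), 'short');  -- fine: t has no limit
-- INSERT INTO demo (v) VALUES (repeat('x', 200));    -- fails: value too long
```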

How to get around it when a normal index or bitmap index isn't useful

The columns in the WHERE clause are not selective, and they are all in a single table. In addition, the predicates used are NOT EQUAL, OR, IS NULL, and IS NOT NULL. The primary key is on the customer ID. I am not sure how to work around this kind of data. Are there different indexing methods that can be created on the table, or other ways to solve the problem? I guess partitioning won't be helpful either, since it would break the table into one major section holding most of the data. Any thoughts or workarounds would be useful.
I'm putting below the data for reference and sample queries for ease of understanding.
sample query
colA = 'Marketable' OR colA is null
NORMAL index: gets ignored due to the OR and NULL operators. Moreover, the queried data covers more than 95% of the data in the table.
BITMAP index: gets ignored due to more than 96% data coverage.
sample query
colB = '7' OR colB = '6' OR colB = '5'
NORMAL or BITMAP: neither is useful due to the large data selection. The optimizer goes with a full table scan instead of using the primary key cust_id.
sample query
colC <> 'SPECIAL SEGMENT' OR colC is null (since the values can change, no specific value is passed)
combination sample query
NOT (colB = '6' OR colB = '3') AND
(colC <> 'SPECIAL SEGMENT' OR colC is null)
Full table scans are not evil. Index access is not always more efficient.
If you want to return the majority of the data in a table, you want to use a full table scan since that's the most efficient way of accessing large fractions of the data in the table. Indexes are great when you want to access relatively small fractions of the data in the table. But if you want most of the data, doing millions of index accesses is not going to be more efficient. In your first example, you want to return 9.2 million rows from a 9.3 million row table. A full table scan is the plan you want-- that's the most efficient way to retrieve 99% of the rows in the table. Anything else is going to be less efficient. You could, I suppose, potentially partition the table on A leading to full partition scans of the two large partitions. That's only going to cut, say 1% of the work your query needs to do, though, and may have negative impacts on other queries on that table.
Now, I'm always a bit suspicious about queries that want to return 99% of the rows in a table in the first place. It would make no sense to have such a query in an OLTP system, for example, because no human is going to page through 9.2 million rows of data. It wouldn't make sense to have that sort of query if the goal is to replicate data because it would almost certainly be more efficient to just replicate incremental changes rather than the entire data set every time. It might make sense to read almost all the rows if the goal is to perform some aggregations. But if this is something that happens enough to care about optimizing the analysis, you'd be better off looking at ways of pre-aggregating the data using materialized views and dimensions so that you can read and aggregate the data once and then just read your pre-aggregated values at runtime.
If you do really need to read all that data, you may also want to look into parallel query. If there are relatively few readers, it is more efficient to let Oracle do the full scan in parallel so that your session can utilize more of the available hardware. Of course, that means that you can have fewer simultaneous sessions since more hardware for you means less for others, so that's a trade-off you need to understand. If you're building an ETL process where there will only be a couple sessions loading data at any point, parallel query can provide substantial performance improvements.
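For example, a hinted parallel full scan for an ETL-style read might look like the sketch below (the table name and the degree of parallelism are assumptions; the predicate is taken from the question):

```sql
SELECT /*+ FULL(t) PARALLEL(t, 4) */ *
FROM   customer_table t
WHERE  t.colA = 'Marketable' OR t.colA IS NULL;
```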

Why does the Oracle IN clause have a limit of 1000 only for static data?

Oracle's IN clause has a limit of 1000 items for static data, but it accepts unlimited data from subqueries. Why?
It's a restriction on any expression list:
A comma-delimited list of expressions can contain no more than 1000 expressions.
Why 1000? Presumably the implementation needs some kind of limit, and that probably seemed like more than enough. There may well be, or certainly may have been when that limit was set decades ago, a performance reason for the limit as well, particularly as the IN is converted to multiple OR statements by the optimiser in this case (which you can see if you look at the execution plan).
I'd struggle to come up with a reasonable scenario that needed to get anywhere near that, with fixed values that couldn't be derived from other data anyway as a subquery.
I suspect it's somewhat related to the logical database limits which say you can't have more than 1000 columns in a table, for instance; since an expression list is used in an insert statement to list both the columns and the values being inserted, the expression list has to be able to match that, but maybe has no reason to exceed it.
Speculation of course... without seeing the internals of the software you're unlikely to get a definitive answer.
This is because IN has very poor performance with a large number of values in the list. It's just a shortcut for an OR clause, and at the database level the engine will change the IN to ORs.
You should also avoid subqueries inside an IN clause; it is better to use EXISTS.
Try using EXISTS rather than IN. You can also rewrite subqueries using EXISTS.
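For what it's worth, the rewrite both answers are suggesting looks like this (the table and column names are hypothetical):

```sql
-- IN with a subquery (not subject to the 1000-expression limit):
SELECT *
FROM   orders o
WHERE  o.cust_id IN (SELECT c.cust_id
                     FROM   customers c
                     WHERE  c.region = 'EU');

-- Equivalent EXISTS form:
SELECT *
FROM   orders o
WHERE  EXISTS (SELECT 1
               FROM   customers c
               WHERE  c.cust_id = o.cust_id
               AND    c.region  = 'EU');
```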

Full table scan when querying against an indexed column

I created an index for one table, a simple index just like that:
CREATE INDEX IDX_TRANSACAO_NOVA_STATUS ON TRANSACAO_NOVA(STATUS) TABLESPACE COMVENIF;
This table has 1000K records inside, and the status column has just 5 or 6 possible values. After creating the index I expected that the query below would have better performance:
select * from transacao_nova tn where tn.status = 'XXX'
but the explain plan still shows me a full scan with a cost of 16,000.
Any help? I'm not a DBA but I need to improve this performance.
Thanks in advance.
If there are only 5 or 6 different status values and a million records, the query optimizer may be deciding it is not worth using the index, since a range scan would still return a substantial fraction of the records in the table.
You might look into using an index-clustered table for this application.
If data in the status column is skewed (not uniform: some values appear very often and others very rarely), you can accelerate queries for the rare values by refreshing statistics (and verifying that you are calculating a histogram for the status column). This will make Oracle use the index in the cases where it is more efficient.
http://docs.oracle.com/cd/E11882_01/server.112/e16638/stats.htm#autoId12
Be aware that automatically determining if a column needs a histogram is not a good idea as it may lead to inconsistent behaviour. It is better to manually specify histograms when needed. Also, histograms affect every query that uses those columns, so they should be collected with care.
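A sketch of gathering statistics with an explicit histogram on the STATUS column (the table and column names come from the question; the bucket count of 254 is an assumption):

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'TRANSACAO_NOVA',
    method_opt => 'FOR COLUMNS STATUS SIZE 254');
END;
/
```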
You might need to generate new statistics on the table.
http://docs.oracle.com/cd/B19306_01/server.102/b14211/stats.htm
A common mistake is to assume that an index range scan will be better than a full scan because you only want some "small" fraction of the total rows in the table. But if the rows you want are scattered throughout the table's storage extents, locating them by an index lookup can be slower than just scanning the entire table. I can't say for sure that's the case in your situation, but it's a possibility.
For a more in-depth discussion of this topic I recommend this paper.

TSql, building indexes before or after data input

Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexes (indices?) up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then perform the index building?
I'd temper af's answer by saying that "index first, insert after" would probably be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting them in the natural order of that index. The reason is that for each insert, the data rows themselves would have to be reordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indices are in place causes the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indices afterwards, especially with that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
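A sketch of the "insert first, index after" pattern in T-SQL (all object and column names are made up for illustration):

```sql
-- 1. Bulk load into the bare table first:
INSERT INTO dbo.BigTable WITH (TABLOCK) (Col1, Col2, Col3, Col4)
SELECT Col1, Col2, Col3, Col4
FROM   dbo.StagingTable;

-- 2. Then build the four nonclustered indexes in one pass each:
CREATE NONCLUSTERED INDEX IX_BigTable_Col1 ON dbo.BigTable (Col1);
CREATE NONCLUSTERED INDEX IX_BigTable_Col2 ON dbo.BigTable (Col2);
CREATE NONCLUSTERED INDEX IX_BigTable_Col3 ON dbo.BigTable (Col3);
CREATE NONCLUSTERED INDEX IX_BigTable_Col4 ON dbo.BigTable (Col4);
```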
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.
