Gathering statistics on tables without indexes - Oracle

Does it make sense to regularly gather statistics on a table without indexes in an Oracle database? I'm asking from an optimization point of view. I assume that a FULL TABLE SCAN would always be performed on that table.

Yes, it's still worth gathering the statistics. Information about the number and size of rows is of use to the optimizer even though there are no indexes.
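A minimal sketch of a regular gather for such a table (the schema and table names here are just placeholders):

    -- Placeholder names; adjust to your schema.
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => 'MY_SCHEMA',
        tabname          => 'MY_UNINDEXED_TABLE',
        estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,  -- let Oracle choose the sample size
        method_opt       => 'FOR ALL COLUMNS SIZE AUTO'   -- histograms where column usage warrants them
      );
    END;
    /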

In a nutshell, statistics are as important to the optimizer as food is to human beings. If you don't eat for a long time, your brain degrades in its functioning.
The more up-to-date the statistics available to the optimizer, the better the execution plan it can decide on.
Let me try to explain with an example:
Let's say you are asked to reach a particular destination on a fine day, but you are not given a map or any location information. There could be any number of ways to reach the destination, but without proper information you might take the worst possible route. If you are smart enough, you will ask for directions; this is where you start gathering statistics. Just imagine: if you had the entire plan in mind before you start your journey, i.e. if you could gather all the statistics, you could make the best plan.
UPDATE: Saw a comment about auto optimizer stats collection.
Yes, of course there is automatic optimizer statistics collection in Oracle 11g Release 1. Please see more information here.
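On 11g you can check whether the automatic job is enabled (assuming you have access to the DBA views) with something like:

    SELECT client_name, status
    FROM   dba_autotask_client
    WHERE  client_name = 'auto optimizer stats collection';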

Related

Indexes vs full scans, how do I determine which to use and when

Is there a rule of thumb on when to use full scans rather than indexes? I am new to Oracle and am still struggling to wrap my mind around performance tuning. Thanks!
The intent of the optimizer is to make these decisions for you. Granted it won't always be right, but generally I'll let the optimizer do all the heavy lifting for me, and only worry about the issues where it went awry and I had to "intervene" so to speak.
But in terms of full scan versus index, I'd encourage you not to think in these terms, simply because your customer and hence your applications don't care. What customers care about is response time, so the driving factor here should always be response time.
If for query 1, a full scan is faster than an index scan, then the full scan is the better option. That doesn't make a full scan better "always" or somehow "philosophically" or "technically" better than an index scan. It simply means for this query, the full scan was best.
If for another query, the index is better, then by all means, we should use the index.
It's common to see advisories in books and blog posts saying "Look for TABLE ACCESS FULL in the plan - that's bad", and similar. That is BS in my opinion. Whatever execution plan gives you the best performance for your query is the best one... no matter whether it uses index scans, full scans, or any other form of optimizer path.
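As a rough illustration of "measure, don't assume", you can check what the optimizer actually chose and then judge the variants by elapsed time; the table and predicate below are made up for the example:

    SET TIMING ON

    EXPLAIN PLAN FOR
      SELECT * FROM orders WHERE order_date >= DATE '2023-01-01';  -- hypothetical query

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

    -- Then run the query (and any hinted variant you want to compare)
    -- and judge by response time, not by whether the plan says TABLE ACCESS FULL.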
If you'd like a more holistic look at SQL tuning, I've written a four-part series here https://connor-mcdonald.com/2019/10/24/the-holistic-sql-tuning-series/

Does it matter if I don't stress test Oracle 11g with random data

We need to stress test our Oracle database with about 5 million row inserts. According to our DBA, the only columns that need to be different are the primary or foreign keys; all other columns can be the same. He said if we do that, then Oracle will not do any sort of caching when inserting the data.
I just want to make sure that he is right and that by doing this, the stress testing results would be nearly as accurate as using random data. Thank you for your help.
In a very narrow set of circumstances, the DBA is correct. If ALL your queries are lookups based upon primary and foreign keys, then they may be right. In the past, when the rule-based optimizer was king, the data didn't matter so much. Record counts, yes, but not really the data.
In the real world, though, this is not the case. Do you have any other indexes? Then the data matters. Do you join against things other than primary/foreign keys? Then the data matters. Are your strings all 1 byte or null? I doubt it, and the size of these variable-length fields may affect the amount of IO. Basically, for any non-trivial schema in a non-trivial application, having "realistic" data can be significant. The Oracle optimizer takes into account a large variety of statistics when determining how to perform a query.
Are you REALLY only doing inserts in this load test? That's kinda silly. 5 million records is chump change by modern standards. Desktops do that in seconds, typically. Even simple applications will perform some select to do a lookup, or get a set of records based upon a non-key value.
You seem to be smart enough to evaluate the DBA's statement. If you can get him to put that in writing, sign off on it, and have the responsibility fall on him when his idea of a load test doesn't work as expected, then that's great. It sounds like you're the one responsible for this test, though.
If I were in your shoes, I would want to load test with the most accurate data possible. Copying from a production system or a known test data set is a much better option than "random" data, and light-years better than the "nulls except for the primary key" approach.
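If you do need to generate varied (rather than identical) non-key data, DBMS_RANDOM makes it cheap; the table and columns below are placeholders, not your schema:

    -- Hypothetical table: adjust names/types to your schema.
    -- (For very large row counts you may want to generate in batches.)
    INSERT /*+ APPEND */ INTO stress_test_table (id, customer_name, amount, created_at)
    SELECT level,
           DBMS_RANDOM.STRING('U', TRUNC(DBMS_RANDOM.VALUE(5, 30))),  -- variable-length strings
           ROUND(DBMS_RANDOM.VALUE(1, 10000), 2),                     -- spread of numeric values
           SYSDATE - DBMS_RANDOM.VALUE(0, 365)                        -- dates over the last year
    FROM   dual
    CONNECT BY level <= 5000000;
    COMMIT;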

Calculating results in a scalable way based on transaction data in a web app?

Possible duplicate:
Database design: Calculating the Account Balance
I work with a web app which stores transaction data (e.g. like "amount x on date y", but more complicated) and provides calculation results based on details of all relevant transactions[1]. We are investing a lot of time into ensuring that these calculations perform efficiently, as they are an interactive part of the application: i.e. a user clicks a button and waits to see the result. We are confident that, for the current levels of data, we can optimise the database fetching and calculation to complete in an acceptable amount of time. However, I am concerned that the time taken will still grow linearly as the number of transactions grows[2]. I'd like to be able to say that we could handle an order of magnitude more transactions without excessive performance degradation.
I am looking for effective techniques, technologies, patterns or algorithms which can improve the scalability of calculations based on transaction data.
There are, however, real and significant constraints on any suggestion:
We currently have to support two highly incompatible database implementations, MySQL and Oracle. Thus, for example, using database-specific stored procedures has roughly twice the maintenance cost.
The actual transactions are more complex than the example transaction given, and the business logic involved in the calculation is complicated and regularly changing. Thus having the calculations stored directly in SQL is not something we can easily maintain.
Any of the transactions previously saved can be modified at any time (e.g. the date of a transaction can be moved a year forward or back) and calculations are expected to be updated instantly. This has a knock-on effect for caching strategies.
Users can query across a large space, in several dimensions. To explain, consider being able to calculate a result as it would stand at any given date, for any particular transaction type, where transactions are filtered by several arbitrary conditions. This makes it difficult to pre-calculate the results a user would want to see.
One instance of our application is hosted on a client's corporate network, on their hardware. Thus we can't easily throw money at the problem in terms of CPUs and memory (even if those are actually the bottleneck).
I realise this is very open ended and general, however...
Are there any suggestions for achieving a scalable solution?
[1] Where 'relevant' can be: the date queried for; the type of transaction; the type of user; formula selection; etc.
[2] Admittedly, this is an improvement over the previous performance, where an ORM's n+1 problems saw time taken grow either exponentially, or at least a much steeper gradient.
I have worked against similar requirements, and have some suggestions. A lot of this depends on what is possible with your data. It is difficult to make every imaginable case quick, but you can optimize for the common cases and have enough hardware grunt available for the others.
Summarise
We create summaries on a daily, weekly and monthly basis. For us, most of the transactions happen in the current day, but old transactions can also change. We keep a batch record and, under it, the individual transaction records. Each batch has a status to indicate whether the transaction summary (in table batch_summary) can be used. If an old transaction in a summarised batch changes, the batch is flagged as part of that transaction to indicate that the summary is not to be trusted. A background job will re-calculate the summary later.
Our software then uses the summary when possible and falls back to the individual transactions where there is no summary.
We played around with Oracle's materialized views, but ended up rolling our own summary process.
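A stripped-down version of the invalidation idea might look like this (the tables and columns are illustrative, not our actual schema):

    -- Illustrative tables only.
    CREATE TABLE batch (
        batch_id       NUMBER PRIMARY KEY,
        summary_valid  CHAR(1) DEFAULT 'Y' NOT NULL   -- 'N' means the summary must be rebuilt
    );

    CREATE TABLE batch_summary (
        batch_id      NUMBER REFERENCES batch (batch_id),
        total_amount  NUMBER,
        txn_count     NUMBER
    );

    -- When a transaction in an already-summarised batch changes,
    -- flag the batch so readers fall back to the detail rows:
    UPDATE batch SET summary_valid = 'N' WHERE batch_id = :changed_batch_id;

    -- A background job later recomputes batch_summary and sets summary_valid back to 'Y'.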
Limit the Requirements
Your requirements sound very wide. There can be a temptation to put all the query fields on a web page and let the users choose any combination of fields and output results. This makes it very difficult to optimize. I would suggest digging deeper into what they actually need to do, or have done in the past. It may not make sense to query on very unselective dimensions.
In our application, for certain queries we limit the date range to not more than 1 month. We have also aligned some features with the date-based summaries, e.g. you can get results for the whole of Jan 2011, but not for 5-20 Jan 2011.
Provide User Interface Feedback for Slow Operations
On occasions we have found it difficult to optimize some things to take less than a few minutes. In those cases we shift the job off to a background server rather than have a very slow-loading web page. The user can fire off a request and go about their business while we get the answer.
I would suggest using Materialized Views. Materialized Views allow you to store a View as you would a table. Thus all of the complex queries you need to have done are pre-calculated before the user queries them.
The tricky part is of course updating the Materialized View when the tables it is based on change. There's a nice article about it here: Update materialized view when underlying tables change.
Materialized Views are not (yet) available without plugins in MySQL and are horribly complicated to implement otherwise. However, since you have Oracle I would suggest checking out the link above for how to add a Materialized View in Oracle.
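As a hedged sketch (Oracle only; the table and column names are invented for the example), an aggregate materialized view that refreshes on commit could look like:

    -- Illustrative only: a daily summary maintained as the base table commits.
    CREATE MATERIALIZED VIEW LOG ON transactions
      WITH ROWID, SEQUENCE (txn_date, txn_type, amount)
      INCLUDING NEW VALUES;

    CREATE MATERIALIZED VIEW txn_daily_summary
      BUILD IMMEDIATE
      REFRESH FAST ON COMMIT
      AS
      SELECT txn_date,
             txn_type,
             SUM(amount)   AS total_amount,
             COUNT(amount) AS amount_count,   -- needed for fast refresh of SUM after updates/deletes
             COUNT(*)      AS txn_count
      FROM   transactions
      GROUP  BY txn_date, txn_type;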

How to increase Oracle CBO cost estimation for hash joins, group by's and order by's without hints

It seems that on some of the servers we have, the cost of hash joins, group-bys and order-bys is too low compared to the actual cost. That is, execution plans with index range scans often outperform the former, but in the explain plan their cost shows up as higher.
Some further notes:
I already set optimizer_index_cost_adj to 20 and it's still not good enough. I do NOT want to increase the cost for pure full table scans, in fact I wouldn't mind the optimizer decreasing the cost.
I've noticed that pga_aggregate_target makes an impact on CBO cost estimates, but I definitely do NOT want to lower this parameter as we have plenty of RAM.
As opposed to using optimizer hints in individual queries, I want the settings to be global.
Edit 1: I'm thinking about experimenting with dynamic sampling, but I don't have enough intimate knowledge to predict how this could affect overall performance, i.e. how frequently the execution plans could change. I would definitely prefer something which is very stable; in fact, for some of our largest clients we have a policy of locking all the stats (which will change with Oracle 11g SQL Plan Management).
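For reference, dynamic sampling can be tried at session or statement scope before making anything global; these are standard controls, though the right sampling level is workload-dependent:

    -- Session-level experiment (levels 0-10; higher means larger samples):
    ALTER SESSION SET optimizer_dynamic_sampling = 4;

    -- Or per statement, so only the query under test is affected:
    SELECT /*+ dynamic_sampling(t 4) */ *
    FROM   some_table t          -- placeholder table name
    WHERE  some_column = :val;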
Quite often when execution plans with index range scans outperform those with full scans + sorts or hash joins, but the CBO is picking the full scans, it's because the optimiser believes it's going to find more matching results than it actually gets in real life.
In other words, if the optimiser thinks it's going to get 1M rows from table A and 1000 rows from table B, it may very well choose full scans + sort merge or hash join; if, however, when it actually runs the query, it only gets 1 row from table A, an index range scan may very well be better.
I'd first look at some poorly performing queries and analyse the selectivity of the predicates, determine whether the optimiser is making reasonable estimates of the number of rows for each table.
EDIT:
You've mentioned that the cardinality estimates are incorrect. This is the root cause of your problems; the costing of hash joins and sorts is probably quite OK. In some cases the optimiser may be using wrong estimates because it doesn't know how correlated the data is. Histograms on some columns may help (if you haven't already got them), and in some cases you can create function-based indexes and gather statistics on the hidden columns to provide even better data to the optimiser.
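For instance (the table and column names are invented), a histogram on a skewed column and stats on a function-based index's hidden column can be gathered like this:

    -- Histogram on a skewed column (placeholder names):
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname    => USER,
        tabname    => 'ORDERS',
        method_opt => 'FOR COLUMNS STATUS SIZE 254'
      );
    END;
    /

    -- Function-based index; gather stats on the hidden virtual column it creates:
    CREATE INDEX orders_upper_cust_ix ON orders (UPPER(customer_name));
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname    => USER,
        tabname    => 'ORDERS',
        method_opt => 'FOR ALL HIDDEN COLUMNS SIZE AUTO'
      );
    END;
    /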
At the end of the day, your trick of specifying the cardinalities of various tables in the queries may very well be required to get satisfactory performance.
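That trick is presumably the (undocumented but widely used) CARDINALITY hint, roughly like this, with placeholder table names:

    SELECT /*+ cardinality(a 1) */ a.*, b.*
    FROM   table_a a
    JOIN   table_b b ON b.a_id = a.id
    WHERE  a.some_filter = :val;
    -- Tells the optimiser to assume table_a contributes about 1 row,
    -- nudging it towards nested loops and index range scans.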

Does having several indices all starting with the same columns negatively affect Sybase optimizer speed or accuracy?

We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have detrimental effect on performance, and thus one of the indices needs to have the first two fields switched.
I would prefer to avoid change if it is not necessary, since they didn't back up their assertion with any facts/reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.
The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes - updating five indexes takes noticeably longer than just one index, plus you are using roughly five times as much disk storage for five indexes as for one.
I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of the indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean, you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation: having five indices with the leading columns all the same would not cause a "performance problem" beyond the one you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query, but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.
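In Sybase ASE you can at least observe which index the optimiser picks for a given query (the table name below is a placeholder):

    -- Show the chosen plan, including which index (if any) is used:
    set showplan on
    go
    select * from my_table where created_date >= '2023-01-01'
    go
    set showplan off
    go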
I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.
I'm not familiar with the recent versions of Sybase, but in general, with all SQL servers, the main (and almost only) performance impact indexes have is on INSERT, DELETE and UPDATE queries. Basically, each change to the database requires the data table per se (or the clustered index) to be updated, as well as all the indexes.
With regard to SELECT queries, having "too many" indexes may have a minor performance impact, for example by introducing competing hard disk pages for the cache. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, and assuming a generally monotonic progression of the date value, is a positive thing (with regard to CRUD operations), for it will keep the need for splitting/balancing the indexes to a minimum (since most inserts are at the end of the indexes).
Also, this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation with it, to assess performance issues in a more systematic fashion, could probably be done relatively safely and easily without interfering much with production. (Unless the 10k or so records are very wide, or the queries-per-second rate is high, etc.)
