How to increase Oracle CBO cost estimation for hash joins, group by's and order by's without hints


It seems that on some of the servers that we have, the cost of hash joins, group by's and order by's is too low compared to the actual cost. I.e. often execution plans with index range scans outperform the former, but on explain plan the cost shows up as higher.
Some further notes:
I already set optimizer_index_cost_adj to 20 and it's still not good enough. I do NOT want to increase the cost for pure full table scans, in fact I wouldn't mind the optimizer decreasing the cost.
I've noticed that pga_aggregate_target makes an impact on CBO cost estimates, but I definitely do NOT want to lower this parameter as we have plenty of RAM.
As opposed to using optimizer hints in individual queries, I want the settings to be global.
Edit 1: I'm thinking about experimenting with dynamic sampling, but I don't have enough intimate knowledge to predict how this could affect the overall performance, i.e. how frequently the execution plans could change. I would definitely prefer something which is very stable; in fact, for some of our largest clients we have a policy of locking all the stats (which will change with Oracle 11g SQL Plan Management).

Quite often when execution plans with index range scans outperform those with full scans + sorts or hash joins, but the CBO is picking the full scans, it's because the optimiser believes it's going to find more matching results than it actually gets in real life.
In other words, if the optimiser thinks it's going to get 1M rows from table A and 1000 rows from table B, it may very well choose full scans + sort merge or hash join; if, however, when it actually runs the query, it only gets 1 row from table A, an index range scan may very well be better.
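A toy cost model makes that crossover concrete. This is only a sketch: the constants and formulas below are invented for illustration and are not Oracle's actual costing; only the shape of the comparison matters.

```python
# Toy plan chooser. All constants are hypothetical, not Oracle's real cost formulas.
TABLE_BLOCKS = 10_000        # blocks read by a full scan of table A (assumed)
MULTIBLOCK_FACTOR = 8        # blocks fetched per multiblock read (assumed)
INDEX_COST_PER_ROW = 3       # single-block reads per row via the index (assumed)


def full_scan_cost() -> float:
    """Cost of scanning the whole table; independent of how many rows match."""
    return TABLE_BLOCKS / MULTIBLOCK_FACTOR


def index_scan_cost(rows: int) -> float:
    """Cost of an index range scan; grows with the number of matching rows."""
    return 2 + rows * INDEX_COST_PER_ROW


def choose_plan(rows: int) -> str:
    """Pick the cheaper access path for a given row count."""
    if index_scan_cost(rows) < full_scan_cost():
        return "INDEX RANGE SCAN"
    return "FULL SCAN"


estimated_rows = 1_000_000   # what the optimiser believes it will get
actual_rows = 1              # what the query really returns

print("plan chosen from the estimate:", choose_plan(estimated_rows))
print("plan that fits reality:       ", choose_plan(actual_rows))
```

With these made-up numbers the estimate leads to a full scan, while the real row count would have favoured the index. The costing itself is consistent in both cases; only the row count fed into it is wrong.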
I'd first look at some poorly performing queries and analyse the selectivity of the predicates, determine whether the optimiser is making reasonable estimates of the number of rows for each table.
EDIT:
You've mentioned that the cardinality estimates are incorrect. This is the root cause of your problems; the costing of hash joins and sorts is probably quite OK. In some cases the optimiser may be using wrong estimates because it doesn't know how much the data is correlated. Histograms on some columns may help (if you haven't already got them), and in some cases you can create function-based indexes and gather statistics on the hidden columns to provide even better data to the optimiser.
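To see how correlation skews an estimate, here is a small sketch on synthetic data. It assumes the estimator multiplies the selectivities of the individual predicates as if the columns were independent, which is a simplification used here for illustration; the table, column names and values are made up.

```python
# Synthetic table: every row's "country" is fully determined by its "city".
CITY_TO_COUNTRY = {"Paris": "FR", "Lyon": "FR", "Berlin": "DE", "Munich": "DE"}
rows = [(city, CITY_TO_COUNTRY[city]) for city in CITY_TO_COUNTRY for _ in range(250)]
total = len(rows)  # 1000 rows, 250 per city


def selectivity(column: int, value: str) -> float:
    """Fraction of rows whose given column equals the value."""
    return sum(1 for r in rows if r[column] == value) / total


# Predicate: city = 'Paris' AND country = 'FR'
# Independence assumption (hypothetical estimator): multiply the selectivities.
independent_estimate = total * selectivity(0, "Paris") * selectivity(1, "FR")

# What the data really contains.
actual = sum(1 for r in rows if r == ("Paris", "FR"))

print(independent_estimate, actual)  # 125.0 250
```

The second predicate adds no filtering at all because it is implied by the first, yet the independence assumption halves the estimate. Real data can be off by far larger factors, in either direction.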
At the end of the day, your trick of specifying the cardinalities of various tables in the queries may very well be required to get satisfactory performance.

Related

Elasticsearch: Approximate quantitative difference between insert and update performance

I have read many posts I could find here and on the internet on inserts vs. updates, but was unable to find any quantitative statements, even order-of-magnitude ones. Let's assume I do bulk update operations on 50% of my document fields: can I expect the performance, compared with bulk index operations with data from all fields, to be 80%, 50%, 20%, 10% or 1% of an insert? Just a rough number from experience would be very helpful.
Disclaimer: I understand that inserts are preferable in terms of performance but often there's a difficult trade-off between access/query performance/complexity and insert performance, especially if you have data that you want to query in one place but individual components have different lifecycles. So in my case, I would probably be OK with a certain, even significant write performance hit to keep all other properties of my ES index ideal.

I think there is a particular reason for this not being discussed commonly.
The update operation (add + delete) does not delete the document instantly. Instead, it marks the document as deleted by marking a bit. Therefore, the performance of document addition and document update is not very different.
However, once enough documents are marked deleted, the Lucene indexes are merged and deleted documents are completely removed. Until that time, the documents keep piling up as they do not free the space.
So, the only performance indicator that is considered is the performance of search (as it is affected both in terms of results and in terms of time complexity).
More on document merging here.
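The mark-then-merge behaviour described above can be sketched with a toy model. This is not Elasticsearch or Lucene code; the class and method names are hypothetical and only mirror the idea that an update appends a new copy and flags the old one, with space reclaimed later by a merge.

```python
class ToySegmentStore:
    """Toy append-only store: an update tombstones the old copy and appends a new one."""

    def __init__(self):
        self.docs = []        # append-only list of (doc_id, source) entries
        self.deleted = set()  # positions in self.docs that are marked deleted
        self.live = {}        # doc_id -> position of the current live copy

    def index(self, doc_id, source):
        """Insert or update: the same append path, plus one mark if the id exists."""
        if doc_id in self.live:
            self.deleted.add(self.live[doc_id])
        self.docs.append((doc_id, source))
        self.live[doc_id] = len(self.docs) - 1

    def merge(self):
        """Rewrite the store without the marked entries, reclaiming their space."""
        kept = [d for pos, d in enumerate(self.docs) if pos not in self.deleted]
        self.docs = kept
        self.deleted.clear()
        self.live = {doc_id: pos for pos, (doc_id, _) in enumerate(kept)}


store = ToySegmentStore()
store.index("1", {"title": "first"})
store.index("1", {"title": "second"})        # update: old copy is only marked
print(len(store.docs), len(store.deleted))   # 2 1
store.merge()
print(len(store.docs), len(store.deleted))   # 1 0
```

In this model the write path of an update differs from an insert only by setting one mark, and the cost of the stale copy is deferred to the merge and to searches that have to skip it.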

Hive union all efficiency and best practice

I have a hive efficiency question. I have 2 massive queries that need to be filtered, joined with mapping tables, and unioned. All the joins are identical for both tables. Would it be more efficient to union them before applying the joins to the combined table or to apply the joins to each massive query individually then union the results? Does it make a difference?
I tried the second way and the query ran for 24 hours before I killed it. I feel like I did everything I could to optimize it except potentially rearrange the union statement. On the one hand, I feel like it should not matter because the number of rows being joined by the mapping table is the same, and since everything is parallelized, it should take roughly the same amount of time. On the other hand, maybe doing the union first would guarantee that the two big queries are given full system resources before the joins are run. Then again, that might mean that there are only 2 jobs running at a time, so the system is not being fully used or something.
I simply do not know enough about how Hive and its multi-threading work. Anybody have any ideas?

There is no such best practice; both approaches are applicable. Subqueries in a UNION ALL run as parallel jobs, so with join-before-union the joins run as parallel tasks on smaller datasets, Tez can optimize the execution, and the common joined tables are read only once, in a single mapper stage for each table.
You can also skip the join for some subqueries, for example if their keys are not applicable to the join.
Joining the bigger union-ed dataset may also run with very high parallelism depending on your settings (bytes per reducer, for example), and the optimizer may also rewrite the query plan. So I suggest you try both methods, measure the speed, study the plan and check whether you can change something. Change, measure, study the plan... repeat.
A few more suggestions:
Try to limit the datasets before joining them. If your join multiplies rows, then analytics and aggregation may run slower on the bigger datasets, and the first approach may be preferable if you can apply the analytics/aggregation before the union.
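As a sanity check that the two orderings are interchangeable, here is a minimal sketch on tiny in-memory lists. The tables, keys and values are hypothetical, and it assumes an inner join against the mapping table combined with UNION ALL (duplicates kept); it says nothing about which ordering Hive executes faster.

```python
from collections import Counter

mapping = {1: "a", 2: "b"}               # hypothetical mapping table: key -> label
big_a = [(1, "x"), (2, "y"), (3, "z")]   # hypothetical rows from the first big query
big_b = [(2, "p"), (2, "q"), (4, "r")]   # hypothetical rows from the second big query


def inner_join(rows, mapping):
    """Inner join on the key: rows without a mapping entry are dropped."""
    return [(key, value, mapping[key]) for key, value in rows if key in mapping]


# Approach 2 in the question: join each input, then UNION ALL the results.
join_then_union = inner_join(big_a, mapping) + inner_join(big_b, mapping)

# Approach 1 in the question: UNION ALL the inputs, then join once.
union_then_join = inner_join(big_a + big_b, mapping)

# Compare as multisets, since row order is not guaranteed in either form.
print(Counter(join_then_union) == Counter(union_then_join))  # True
```

Since both forms return the same rows, the choice comes down to what the execution plan does with each, which is why measuring both is the practical way to decide.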

gathering statistics on tables without indexes

Does it make sense to regularly gather statistics on a table without indexes in an Oracle database? I'm asking from an optimization point of view. I assume that a FULL TABLE SCAN would always be performed on that table.

Yes, it's still worth gathering the statistics. Information about the number and size of rows is of use to the optimizer, even though there are no indexes.

In a nutshell, statistics are as important to the optimizer as food is to human beings. If you don't get to eat for a long time, your brain's functioning degrades.
The more current the statistics the optimizer has, the better the execution plan it can choose.
Let me try to explain with an example:
Let's say you are asked to reach a particular destination on a fine day. However, you are not provided with a map or any location information. Now, there could be N ways to reach the destination, but without proper information you might take the worst possible route. If you are smart enough, you might ask for directions; this is where you start gathering statistics. Just imagine: if you had the entire plan in mind before you started your journey, i.e. if you could gather all the statistics, you could make the best plan.
UPDATE Saw a comment about auto optimizer stats collection.
Yes, of course there is auto optimizer stats collection in Oracle 11g Release 1. Please see more information here.

Oracle Execution plan: Cost vs Actual issue: Query with higher cost runs faster

I know this has been asked many times before but none has a definite answer.
Here
Here
In my case my DBA has optimized a select query which takes around 1.05 mins to execute. I made further enhancements, making it run within 1 sec. But the one which I optimized costs more in the execution plan. My DBA is suggesting that since my query is costlier we should not change the query.
I know that execution plans for two queries are not comparable. But how should I convince my DBA that an execution plan is a "plan" but results are "actual"?
Could anyone point me in the right direction?

As Tom Kyte says:
You cannot compare the cost of 2 queries with eachother. They are simply not
comparable.
...
If we get another query -- another similar query -- we go through the same steps, build
lots of plans, assign a cost, pick the one with the lowest cost from that set.
We cannot however take those 2 queries and compare the relative cost of each to the
other. Those costs are specific to those queries.
Many things influence the "cost" of a query. The "cost" is just some artificial number
we arrived at to select a query given a certain environment. ...
Don't compare them, you cannot -- they might as well be random numbers.
A DBA should already know that, as should anyone trying to tune queries (and you said in the question that you do). They should also know to trust Tom's opinion.

The calculated costs are the result of statistics gathered in the past and also depend on configuration. For example, if you set optimizer_index_cost_adj, it will affect the costs but not the actual time it takes to execute the query. And statistics that are a second old are already a second old and not 100% accurate anymore.
On the other hand, when you measure the execution time of your query, the measured time is subject to your cache hit ratio. You can of course flush all caches before every query, but this would not resemble a live situation. So be careful when you claim that your statement takes 1.00 seconds instead of 1.05, because you might just be running experiments in a lab that does not reflect your production reality.

Does having several indices all starting with the same columns negatively affect Sybase optimizer speed or accuracy?

We have a table with, say, 5 indices (one clustered).
Question: will it somehow negatively affect optimizer performance - either speed or accuracy of index picks - if all 5 indices start with the same exact field? (all other things being equal).
It was suggested by someone at the company that it may have detrimental effect on performance, and thus one of the indices needs to have the first two fields switched.
I would prefer to avoid change if it is not necessary, since they didn't back up their assertion with any facts/reasoning, but the guy is senior and smart enough that I'm inclined to seriously consider what he suggests.
NOTE1: The basic answer "tailor the index to the where clauses and overall queries" is not going to help me - the index that would be changed is a covered index for the only query using it and thus the order of the fields in it would not affect the IO amount. I have asked a separate SO question just to confirm that assertion.
NOTE2: That field is a date when the records are inserted, and the table is pretty big, if this matters. It has data for ~100 days, about equal # of rows per date, and the first index is a clustered index starting with that date field.

The optimizer has to think more about which if any of the indexes to use if there are five. That cost is usually not too bad, but it depends on the queries you're asking of it. In principle, once the query is optimized, the time taken to execute it should be about the same. If you are preparing SELECT statements for multiple uses, that won't matter much. If every query is prepared afresh and never reused, then the overhead may become a drag on the system performance - particularly if it turns out that it really doesn't matter which of the indexes is actually used for most queries (a moderately strong danger when five indexes all share the same leading columns).
There is also the maintenance cost when the data changes - updating five indexes takes noticeably longer than just one index, plus you are using roughly five times as much disk storage for five indexes as for one.

I do not wish to speak for your senior colleague but I believe you have misinterpreted what he said, or he has not expressed himself explicitly enough for you to understand.
One of the things that stands out about poorly designed, and therefore poorly performing, tables is that they have many indices on them, and the leading columns of the indices are all the same. Every single time.
So it is pointless debating (the debate is too isolated) whether there is a server cost for indices which all have the same leading columns; the problem is the poorly designed table which exposes itself in myriad ways. That is a massive server cost on every access. I suspect that that is where your esteemed colleague was coming from.
A monotonic column is a very poor choice for an index (understood, you need at least one). But when you use that monotonic column to force uniqueness in some other index, which would otherwise be irrelevant (due to low cardinality, such as SexCode), that is another red flag to me. You've merely forced an irrelevant index to be slightly relevant; the queries, except for the single covered query, perform poorly on anything beyond the simplest select via primary key.
There is no such thing as a "covered index", but I understand what you mean, you have added an index so that a certain query will execute as a covered query. Another flag.
I am with Mitch, but I am not sure you get his drift.
Last, responding to your question in isolation, having five indices with the leading columns all the same would not cause a "performance problem", beyond that which you already have due to the poor table design, but it will cause angst and unnecessary manual labour for the developers chasing down weird behaviour, such as "how come the optimiser used index_1 for my query but today it is using index_4?".
Your language consistently (and particularly in the comments) displays a manner of dealing with issues in isolation. The concept of a server and a database, is that it is a shared central resource, the very opposite of isolation. A problem that is "solved" in isolation will usually result in negative performance impact for everyone outside that isolated space.
If you really want the problem dealt with, fully, post the CREATE TABLE statement.

I doubt it would have any major impact on SELECT performance.
BUT it probably means you could reorganise those indexes (based on a representative query workload) to serve queries more efficiently.

I'm not familiar with the recent version of Sybase, but in general, with all SQL servers, the main (and almost the only) performance impact indexes have is on INSERT, DELETE and UPDATE queries. Basically each change to the database requires the data table per se (or the clustered index) to be updated, as well as all the indexes.
With regards to SELECT queries, having "too many" indexes may have a minor performance impact for example by introducing competing hard disk pages for cache. But I doubt this would be a significant issue in most cases.
The fact that the first column in all these indexes is the date, and assuming a generally monotonic progression of the date value, is a positive thing (with regards to CRUD operations), for it will keep the need for splitting/balancing the index tables to a minimum (since most inserts are at the end of the indexes).
Also this table appears to be small enough ("big" is a relative word ;-) ) that some experimentation with it to assert performance issues in a more systematic fashion could probably be done relatively safely and easily without interfering much with production. (Unless the 10k or so records are very wide or the queries-per-second rate is high etc.)
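A small sketch of why a leading, ever-increasing date keeps inserts near the end. A sorted Python list stands in for the index's key order, which is a simplification; the composite key (insert date, other column) and the values are hypothetical.

```python
import bisect

index_keys = []      # sorted composite keys: (insert_date, other_column)
all_in_tail = True   # stays True while every insert lands in the current day's tail

for day in range(1, 6):              # dates arrive in increasing order
    day_start = len(index_keys)      # entries before this position belong to earlier days
    for other in (7, 2, 9, 4):       # the second column arrives in arbitrary order
        key = (day, other)
        pos = bisect.bisect_left(index_keys, key)
        all_in_tail = all_in_tail and pos >= day_start
        index_keys.insert(pos, key)

print(all_in_tail)  # True
```

No insert ever lands among the keys of an earlier date, so in this model only the most recent portion of each index changes, whatever the trailing columns are.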
