I have a large query, part of this query contains several joins in the where clause and the joins are according to the execution plan causing TABLE ACCESS (FULL), which is causing the query to run very slow, obviously.
Here is the part of the query that seems to be causing the issue
WHERE ......
A.CN= B.CN(+) AND
A.CI= B.CI(+) AND
A.SO= B.SO(+) AND
A.CN= C.CN(+) AND
The execution plan shows
HASH JOIN (RIGHT OUTER)
Access Predicates: "A"."CN"="C"."CN"
Estimated bytes is 700MB which is 1/3 of this entire queries cost.
I have checked indexes and both tables have indexes on CN.
Im just beginning to learn about performance and how things work so im sorry if this is a dumb question :x
Looking for advice on how to improve performance.
Related
Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName="F" and LastName="L"
select * from PEOPLE where LastName="L" and FirstName="F"
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here
No, all the RDBMs first start by analysing the query and optimize it by reordering your where clause.
Depending on which RDBM you are you using can display what is the result of the analyse (search for explain plan in oracle for instance)
M.
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong though. In order to find out which way to do it, which could differ every time, the DBMS would have to run a distinct count query for each column and compare the numbers, that would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in >the where clause, and records will get eliminated more efficiently. I don't think the optimizer is >smart enough to optimize the first sql.
I guess you are confusing this with selecting the order of columns while creating the indexes where you have to put the more selective columns first than second most selective and so on.
BTW, for the above two query SQL server optimizer will not do any optimization but will use Trivila plan as long as the total cost of the plan is less than parallelism threshold cost.
I am investigating a problem where our application takes too much time to get data from Oracle Database. In my investigation, I found that the slowness of the query traces to the join between tables and because of the aggregate function- SUM.
This may look simple but I am not a good with SQL query optimization.
The query is below
SELECT T1.TONNES, SUM(R.TONNES) AS TOTAL_TONNES
FROM
RECLAIMED R ,
(SELECT DELIVERY_OUT_ID, SUM(TONNES) AS TONNES FROM RECLAIMED WHERE DELIVERY_IN_ID=53773 GROUP BY DELIVERY_OUT_ID) T1
where
R.DELIVERY_OUT_ID = T1.DELIVERY_OUT_ID
GROUP BY
T1.TONNES
SUM(R.TONNES) is the total tonnes per delivery out.
SUM(TONNES) is the total tonnes per delivery in.
My table looks like
I have 16 million entries in this table, and by trying multiple delivery_in_id's by average I am getting about 6 seconds for the query to comeback.
I have similar database (complete copy but only have 4 million entries) and when the same query is applied I am getting less than 1 seconds.
They have both the same indexes so I am confident that index is not a problem.
I am certain that it is just the data, it is heavy on the first database(16 million). I have a feeling that when this query is optimized then the problem will be solved.
Open for suggestions : )
Are the two DB on the same server? If it's not, first, compare the computer configuration, settings and running applications.
If there is no differences, you can try to check if your have NULL values in the column you want to SUM. Use NVL-function to improve your query if there are some.
Also, you may "Analyse index" (or "Rebuild Index"). It cleans up the index. (it's quite fast and safe for your data).
If it is not helping, look if the TABLESPACE of your table is not full. It might have some impact... but I am not sure.
;-)
I've solved the performance problem by updating the stored procedure. It is optimized in a way by adding a filter in the first table before joining the second table. Below is the outcome stored procedure
SELECT R.DELIVERY_IN_ID, R.DELIVERY_OUT_ID, SUM(R.TONNES),
(SELECT SUM(TONNES) AS TONNES FROM RECLAIMED WHERE DELIVERY_OUT_ID=R.DELIVERY_OUT_ID) AS TOTAL_TONNES
FROM
CTSBT_RECLAIMED R
WHERE DELIVERY_IN_ID=53733
GROUP BY DELIVERY_IN_ID, R.DELIVERY_OUT_ID
The result in timing/performance is huge for my case since I am joining a huge table(16M).This query now goes for less than a second. The slowness, I am suspecting is due to 1 table having no index(see T1) and even though in my case it is only about 20 items, it is enough to slow the query down because it compares it to 16 million entries.
The optimized query does the filtering of this 16 million and merge to T1 after.
Should there be a better way how to optimized this? Probably. But I am happy with this result and solved what I intended to solve. Now moving on.
Thanks for those who commented.
SO is full of work-arounds, but I'm wondering about the historical reasons behind the 1000 limit for "maximum number of expressions" in IN clause?
It might be because, there is potential of being abused with tons of values. And every value in it will be transformed into equivalent OR condition.
For example NAME IN ('JOHN', 'CHARLES'..) would be transformed into NAME = 'JOHN' OR NAME = 'CHARLES'
So, it might impact the performance..
But note Oracle still supports
SELECT ID FROM EMP WHERE NAME IN (SELECT NAME FROM ATTENDEES)
In this case, the optimizer doesn't convert into multiple OR conditions, but make a JOIN instead..
This restriction is not only for IN list, but on any expression list. Documentation says :
A comma-delimited list of expressions can contain no more than 1000 expressions.
Your question is WHY the limit is 1000. Why not 100 or 10000 or a million? I guess it relates to the limit of the number of columns in a table, which is 1000. Perhaps, this relation is true in Oracle internally to make the expression list and the columns to match with the DML statement.
But, for a good design, the limit 1000 itself is big. Practically, you won't reach the limit.
And, a quote from the famous AskTom site on similar topic,
We'll spend more time parsing queries then actually executing them!
Update My own thoughts
I think Oracle is quite old in DB technology, that these limits were made then once and they never had to think about it again. All expression list have 1000 limit. And a robust design never let the users to ask Oracle for an explanation. And Tom's answer abour parsing always make me think that all this limit purpose back then in 70s or 80s was more of computation issue. The algorithms based on C might have needed some limit and Oracle came uo with 1000.
Update 2 : From application and it's framework point of view
As a DBA, I have seen so many develpers approaching me with performance issues which are actually issues with application framework generating the queries to fetch the data from database. The application provides the functionality to the users to add filters, which eventually form the AND, OR logic within the IN list of the query. Internally Oracle expands it as query rewrite in the optimization stage as OR logic. And the query becomes huge, thus increasing the time to PARSE it. Most of the times, it suppresses the index usage. So, this is one of the cases where a query is generated with huge IN list, via application framework.
I am trying to select a distinct list of a certain column from a table with many millions of rows, such as:
select distinct stylecode from bass.stock_snapshot
This query obviously takes a very long time. What performance tuning can I do on this table?
If there are no predicates to my query, will an index help at all?
" just did this on a test table and the explain plan shows it did use
the index."
Please bear in mind that you have to maintain that index for ever more. I don't understand your data but it seems unlikely this index will be useful for other queries, and this query doesn't seem like the sort of query you ought to be running on a frequent basis.
If this is a one-off, some other approach such as parallel query might be better.
If on the other hand it is a frequent requirement perhaps a reference table for STYLECODE would be a good idea.
I have a basically sql select question that people gave me different answers over the years. Say I have a couple of tables designed each with over 40 columns and potentially will hold ten and thousands of row, I'm using SqlServer2005.
On joining these tables, in the where clause if I have things like
select * from t1, t2
where t1.UserID = 5
and t1.SomeID = t2.SomeOtherID
some people say you should alwasys have the constant "t1.UserID = 5" up front rather than after the "t1.SomeID = t2.SomeOtherID", it boosts the select performance. While others say it doesn't matter.
What is the correct answer?
Also, if I use ADO.NET Entity Framework to implement my DAL, will modeling tables that have over 40 columns and doing CRUD operations be a performance issue to it?
Thank you,
Ray.
In general, with database optimization, you should write SQL which is conceptually correct first, and then tweak performance if profiling shows it to be necessary. When doing an inner join, it is better to use SQL-92, explicit INNER JOINs than Cartesian products. So I would begin by writing your SQL as follows:
SELECT *
FROM t1
INNER JOIN t2
ON t1.SomeID = t2.SomeOtherID
WHERE
t1.UserID = 5
The t1.SomeID = t2.SomeOtherID that goes in the ON part of the INNER JOIN, because it expresses the relationship between the two tables. The UserID that goes in the WHERE clause because it is a filter to limit the result set. Writing your SQL in this way gives more information to the database optimizer, because it expresses your intentions about the join versus the filtering.
Now IF you are not getting acceptable performance with this syntax in a real-world database, then do feel free to experiment with moving bits around. But like I said, start with something which is conceptually correct.
With regards to the second part of your question, the most obvious performance implication is that when you select a collection of entities, the Entity Framework needs to bring back all properties for the entities it is materializing. So if you have 40 columns, then you will be pulling that data back over the wire, if you materialize them as entities. It is, however, possible to write LINQ queries which return anonymous types containing only the columns you need. However, to do full CRUD, you will need to return entities.
People's opinion on this will change over time because RDBMS query optimisation has evolved over time, and different RDBMSs will have different approaches. I can't speak for every syste out there but it's really unlikely that in 2008 this is going to make any difference. YMMV if you are interested only in a specific system.
I can tell you that for any recent version of Oracle it makes no difference.
I know this answer is kind of trite, but I would suggest writing benchmarks. Whip up a console app and test it out yourself. Run the query a couple hundred times and see how long it takes for each way.
There is a lot of superstition when it comes to SQL query performance and optimization. Some people do things thinking it is faster but they don't actually check their facts. Also, the way EF or LinqToSql work and interact with the DB may introduce performance differences not evident in SQL.
If you're optimizing code you may also want to use a profiler like RedGate ANTS. Its not free, but it can help a lot to find bottlenecks in your code. Then you can find places in your code to optimize much easier. It's not always your database slowing your apps down. Or sometimes you're executing a fast query, but doing it a jillion times when you could actually be caching the result.
Firstly, construct the query using an explicit JOIN syntax, rather than the cartesian product. It probably won't make any difference performance-wise for any modern optimiser, but it does make the information on how the JOINs work more accessible for the programmers.
SELECT Player.Name, Game.Date
FROM Player
INNER JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Game.WinnerFrags > Game.TotalFrags/2
ORDER BY Player.Name
Which will give us all the players sorted by name who have take more frags in a game than all the other players in the game put together, and the dates of the games. Putting both the conditions are in the JOIN probably won't affect performance either, since the optimiser is likely do the filtering as part of the JOIN anyway. It does start to matter for LEFT JOINs though. Lets say we're looking for how many games the week's top ten players have ever won by the margin described above. Since it is possible that some of them have never one this spectacularly, we'll need LEFT JOIN.
SELECT Player.WeekRank, Player.Name, COUNT(Game.*) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Player.WeekRank >= 10
AND Game.WinnerFrags > Game.TotalFrags/2
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Well, not quite. The JOIN will return records for each game played by a player, or the player data and NULL game data if the player has played no games. These results will get filtered, during or after the JOIN depending on the optimiser's decision, based on the frag criteria. This will eliminate all the records that don't meet the frag criteria. So there will be no records to group for players who have never had such a spectacular win. Effectively creating an INNER JOIN .... FAIL.
SELECT Player.WeekRank, Player.Name, COUNT(Game.*) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
AND Game.WinnerFrags > Game.TotalFrags/2
WHERE Player.WeekRank >= 10
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Once we move the frag criteria into the JOIN the query will behave correctly, returning records for all players in the week's top ten, irrespective of whether they've achieved a whitewash.
After all of that, the short answer is:
For INNER JOIN situations it probably doesn't make a to performance difference where you put the conditions. The queries are more readable if you separate the the join and filtering conditions though. And getting a condition in the wrong place can seriously mess up the results of a LEFT JOIN.