Database and EF performance concern?

I have a basic SQL SELECT question that people have given me different answers to over the years. Say I have a couple of tables, each designed with over 40 columns and potentially holding tens of thousands of rows, and I'm using SQL Server 2005.
On joining these tables, in the where clause if I have things like
select * from t1, t2
where t1.UserID = 5
and t1.SomeID = t2.SomeOtherID
some people say you should always put the constant "t1.UserID = 5" up front rather than after "t1.SomeID = t2.SomeOtherID", claiming it boosts SELECT performance, while others say it doesn't matter.
What is the correct answer?
Also, if I use the ADO.NET Entity Framework to implement my DAL, will modeling tables that have over 40 columns and doing CRUD operations against them be a performance issue?
Thank you,
Ray.

In general, with database optimization, you should write SQL which is conceptually correct first, and then tweak performance if profiling shows it to be necessary. When doing an inner join, it is better to use the explicit SQL-92 INNER JOIN syntax than a Cartesian product with a WHERE filter. So I would begin by writing your SQL as follows:
SELECT *
FROM t1
INNER JOIN t2
ON t1.SomeID = t2.SomeOtherID
WHERE
t1.UserID = 5
The t1.SomeID = t2.SomeOtherID condition goes in the ON part of the INNER JOIN because it expresses the relationship between the two tables. The UserID condition goes in the WHERE clause because it is a filter limiting the result set. Writing your SQL this way gives more information to the database optimizer, because it expresses your intentions about the join versus the filtering.
Now IF you are not getting acceptable performance with this syntax in a real-world database, then do feel free to experiment with moving bits around. But like I said, start with something which is conceptually correct.
With regards to the second part of your question, the most obvious performance implication is that when you select a collection of entities, the Entity Framework needs to bring back all properties for the entities it is materializing. So if you have 40 columns, you will be pulling that data back over the wire whenever you materialize rows as entities. It is, however, possible to write LINQ queries which return anonymous types containing only the columns you need. To do full CRUD, though, you will need to return entities.

People's opinions on this change over time because RDBMS query optimisation has evolved over time, and different RDBMSs take different approaches. I can't speak for every system out there, but it's really unlikely that in 2008 this is going to make any difference. YMMV if you are interested only in a specific system.
I can tell you that for any recent version of Oracle it makes no difference.

I know this answer is kind of trite, but I would suggest writing benchmarks. Whip up a console app and test it out yourself. Run the query a couple hundred times and see how long it takes for each way.
There is a lot of superstition when it comes to SQL query performance and optimization. Some people do things thinking it is faster but they don't actually check their facts. Also, the way EF or LinqToSql work and interact with the DB may introduce performance differences not evident in SQL.
If you're optimizing code you may also want to use a profiler like RedGate ANTS. It's not free, but it can help a lot to find bottlenecks in your code. Then you can find places in your code to optimize much more easily. It's not always your database slowing your apps down. Or sometimes you're executing a fast query, but doing it a jillion times when you could actually be caching the result.
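A benchmark harness like the one suggested above takes only a few lines. Here is a minimal sketch using Python and an in-memory SQLite database as stand-ins for the poster's console app and SQL Server; the table shapes and row counts are made up for illustration, and the point is the shape of the harness (run each variant a couple hundred times and compare), not the specific engine:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t1 (UserID INTEGER, SomeID INTEGER)")
cur.execute("CREATE TABLE t2 (SomeOtherID INTEGER, Payload TEXT)")
cur.executemany("INSERT INTO t1 VALUES (?, ?)", [(i % 100, i) for i in range(10000)])
cur.executemany("INSERT INTO t2 VALUES (?, ?)", [(i, "x") for i in range(10000)])
cur.execute("CREATE INDEX idx_t2 ON t2 (SomeOtherID)")

def bench(sql, runs=200):
    """Run the query `runs` times and return total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        cur.execute(sql).fetchall()
    return time.perf_counter() - start

# Same query, with the constant filter and the join condition swapped.
constant_first = bench(
    "SELECT * FROM t1, t2 WHERE t1.UserID = 5 AND t1.SomeID = t2.SomeOtherID")
join_first = bench(
    "SELECT * FROM t1, t2 WHERE t1.SomeID = t2.SomeOtherID AND t1.UserID = 5")

print(f"constant first: {constant_first:.3f}s, join first: {join_first:.3f}s")
```

On any engine with a real optimizer, the two timings should be statistically indistinguishable; if they aren't, that is worth investigating rather than assuming.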

Firstly, construct the query using explicit JOIN syntax rather than a Cartesian product. It probably won't make any difference performance-wise for any modern optimiser, but it does make the information on how the JOINs work more accessible to the programmers.
SELECT Player.Name, Game.Date
FROM Player
INNER JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Game.WinnerFrags > Game.TotalFrags/2
ORDER BY Player.Name
This gives us all the players, sorted by name, who have taken more frags in a game than all the other players in the game put together, along with the dates of those games. Putting both conditions in the JOIN probably won't affect performance either, since the optimiser is likely to do the filtering as part of the JOIN anyway. It does start to matter for LEFT JOINs, though. Let's say we're looking for how many games the week's top ten players have ever won by the margin described above. Since it is possible that some of them have never won this spectacularly, we'll need a LEFT JOIN.
SELECT Player.WeekRank, Player.Name, COUNT(Game.GameID) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
WHERE Player.WeekRank <= 10
AND Game.WinnerFrags > Game.TotalFrags/2
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Well, not quite. The JOIN will return records for each game played by a player, or the player data and NULL game data if the player has played no games. These results will get filtered, during or after the JOIN depending on the optimiser's decision, based on the frag criteria. This will eliminate all the records that don't meet the frag criteria. So there will be no records to group for players who have never had such a spectacular win. Effectively creating an INNER JOIN .... FAIL.
SELECT Player.WeekRank, Player.Name, COUNT(Game.GameID) AS WhitewashCount
FROM Player
LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
AND Game.WinnerFrags > Game.TotalFrags/2
WHERE Player.WeekRank <= 10
GROUP BY Player.WeekRank, Player.Name
ORDER BY Player.WeekRank
Once we move the frag criteria into the JOIN the query will behave correctly, returning records for all players in the week's top ten, irrespective of whether they've achieved a whitewash.
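The difference is easy to see on a toy dataset. Here is a runnable sketch using Python's sqlite3 module (the player and game data are invented for illustration, and COUNT(Game.GameID) stands in for COUNT(Game.*), which most engines don't accept); the same WHERE-vs-ON placement produces different results:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Player (PlayerID INTEGER, WeekRank INTEGER, Name TEXT);
CREATE TABLE Game (GameID INTEGER, WinnerPlayerID INTEGER,
                   WinnerFrags INTEGER, TotalFrags INTEGER);
INSERT INTO Player VALUES (1, 1, 'Alice'), (2, 2, 'Bob');
-- Alice has one whitewash win (9 of 10 frags); Bob has one ordinary win.
INSERT INTO Game VALUES (10, 1, 9, 10), (11, 2, 3, 10);
""")

# Frag criterion in the WHERE clause: Bob's only game fails the filter,
# so Bob disappears entirely -- the LEFT JOIN degrades to an INNER JOIN.
where_version = sorted(cur.execute("""
    SELECT Player.Name, COUNT(Game.GameID)
    FROM Player
    LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
    WHERE Player.WeekRank <= 10
      AND Game.WinnerFrags > Game.TotalFrags / 2
    GROUP BY Player.Name
""").fetchall())

# Frag criterion in the ON clause: Bob survives with a count of 0.
on_version = sorted(cur.execute("""
    SELECT Player.Name, COUNT(Game.GameID)
    FROM Player
    LEFT JOIN Game ON Game.WinnerPlayerID = Player.PlayerID
       AND Game.WinnerFrags > Game.TotalFrags / 2
    WHERE Player.WeekRank <= 10
    GROUP BY Player.Name
""").fetchall())

print(where_version)  # [('Alice', 1)]
print(on_version)     # [('Alice', 1), ('Bob', 0)]
```

Bob only appears in the second result set, which is what the whitewash report needs.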
After all of that, the short answer is:
For INNER JOIN situations it probably doesn't make a difference to performance where you put the conditions. The queries are more readable if you separate the join and filtering conditions, though. And getting a condition in the wrong place can seriously mess up the results of a LEFT JOIN.

Related

outer join causing TABLE ACCESS (FULL) which causes massive performance issues

I have a large query. Part of it contains several joins in the WHERE clause, and according to the execution plan these joins are causing TABLE ACCESS (FULL), which makes the query run very slowly.
Here is the part of the query that seems to be causing the issue
WHERE ......
A.CN= B.CN(+) AND
A.CI= B.CI(+) AND
A.SO= B.SO(+) AND
A.CN= C.CN(+) AND
The execution plan shows
HASH JOIN (RIGHT OUTER)
Access Predicates: "A"."CN"="C"."CN"
The estimated size is 700 MB, which is 1/3 of this entire query's cost.
I have checked indexes and both tables have indexes on CN.
I'm just beginning to learn about performance and how things work, so I'm sorry if this is a dumb question :x
Looking for advice on how to improve performance.

Order of multiple conditions in where clause in oracle [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName = 'F' and LastName = 'L'
select * from PEOPLE where LastName = 'L' and FirstName = 'F'
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
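You can watch an optimizer do exactly this. Here is a small sketch using Python's sqlite3 module rather than SQL Server (the index name is invented): both orderings of the WHERE conditions produce the identical plan, using the index either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE PEOPLE (ID INTEGER, LastName TEXT, FirstName TEXT)")
cur.execute("CREATE INDEX idx_name ON PEOPLE (LastName, FirstName)")

def plan(sql):
    # EXPLAIN QUERY PLAN is SQLite's equivalent of an execution plan;
    # the fourth column of each row is the human-readable detail.
    return [row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql)]

first_last = plan("SELECT * FROM PEOPLE WHERE FirstName = 'F' AND LastName = 'L'")
last_first = plan("SELECT * FROM PEOPLE WHERE LastName = 'L' AND FirstName = 'F'")

# Both plans search PEOPLE using idx_name -- the order of the conditions
# in the WHERE clause makes no difference to the plan chosen.
print(first_last)
print(last_first)
```

The exact wording of the plan detail varies by SQLite version, but the two plans come out identical.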
The order of conditions in the WHERE clause should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here
No, all the RDBMSs first analyse the query and optimize it, reordering your WHERE clause.
Depending on which RDBMS you are using, you can display the result of that analysis (search for "explain plan" in Oracle, for instance).
M.
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which way to do it (which could differ every time), the DBMS would have to run a distinct-count query for each column and compare the numbers, and that would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL.
I guess you are confusing this with choosing the order of columns when creating an index, where you have to put the most selective column first, then the second most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is less than the cost threshold for parallelism.

Vertica query optimization

I want to optimize a query in vertica database. I have table like this
CREATE TABLE data (a INT, b INT, c INT);
and a lot of rows in it (billions)
I fetch some data using this query
SELECT b, c FROM data WHERE a = 1 AND b IN ( 1,2,3, ...)
but it runs slow. The query plan shows something like this
[Cost: 3M, Rows: 3B (NO STATISTICS)]
The same is shown when I perform explain on
SELECT b, c FROM data WHERE a = 1 AND b = 1
It looks like a scan of some part of the table. In other databases I could create an index to make such a query really fast, but what can I do in Vertica?
Vertica does not have a concept of indexes. You would want to create a query specific projection using the Database Designer if this is a query that you feel is run frequently enough. Each time you create a projection, the data is physically copied and stored on disk.
I would recommend reviewing projection concepts in the documentation.
If you see a NO STATISTICS message in the plan, you can run ANALYZE_STATISTICS on the object.
For further optimization, you might want to use a JOIN rather than IN. Consider using partitions if appropriate.
Creating good projections is the "secret-sauce" of how to make Vertica perform well. Projection design is a bit of an art-form, but there are 3 fundamental concepts that you need to keep in mind:
1) SEGMENTATION: For every row, this determines which node to store the data on, based on the segmentation key. This is important for two reasons: a) DATA SKEW -- if data is heavily skewed then one node will do too much work, slowing down the entire query. b) LOCAL JOINS - if you frequently join two large fact tables, then you want the data to be segmented the same way so that the joined records exist on the same nodes. This is extremely important.
2) ORDER BY: If you are performing frequent FILTER operations in the where clause, such as in your query WHERE a=1, then consider ordering the data by this key first. Ordering will also improve GROUP BY operations. In your case, you would order the projection by columns a then b. Ordering correctly allows Vertica to perform MERGE joins instead of HASH joins which will use less memory. If you are unsure how to order the columns, then generally aim for low to high cardinality which will also improve your compression ratio significantly.
3) PARTITIONING: By partitioning your data on a column which is frequently used in queries, such as transaction_date, you allow Vertica to perform partition pruning, which reads much less data. It also helps during insert operations, allowing an insert to affect only one small ROS container instead of the entire file.

Netezza/PureData - Bad distribution key chosen in HASH JOIN

I am using Netezza/PureData for a query. I have an INNER JOIN (which became a HASH JOIN) on two columns, A and B. A is a column with good distribution and B is a column with bad distribution. For some reason, my query plan always uses B instead of A as the distribution key for that JOIN, which causes immense performance issues.
GENERATE STATISTICS does help alleviate this issue, but due to performance constraints it is not feasible to run GENERATE STATISTICS before every query. I do it before a batch run, but not between each query within a batch.
In a nutshell, the source tables have good distributions, but when I join them the planner chooses a bad distribution key (one that is never used as a distribution column in the sources).
So my question is: what are some good ways to influence the choice of distribution key in a JOIN without running GENERATE STATISTICS? I've tried changing around the distribution columns of the source tables, but that didn't do much, even when I make sure all the skews are less than 0.5.
You could create a temp table and force the distribution so that they both align; this should expedite the join.
The workaround is to force exhaustive planner to be used.
set num_star_planner_rels = X; -- Set X to very high.
According to the IBM Netezza team, queries with more than 7 entities (number of tables) will use a greedy query planner called "Snowflake". At 7 or fewer entities, it will use the brute-force approach to find the best plan.
The trade-off is that an exhaustive search is very expensive for a large number of entities.

ORA-01795 - why is the maximum number of expressions limited to 1000

SO is full of work-arounds, but I'm wondering about the historical reasons behind the 1000 limit for "maximum number of expressions" in IN clause?
It might be because there is potential for abuse with tons of values, since every value in the list is transformed into an equivalent OR condition.
For example NAME IN ('JOHN', 'CHARLES'..) would be transformed into NAME = 'JOHN' OR NAME = 'CHARLES'
So, it might impact the performance..
But note Oracle still supports
SELECT ID FROM EMP WHERE NAME IN (SELECT NAME FROM ATTENDEES)
In this case, the optimizer doesn't convert it into multiple OR conditions, but makes a JOIN instead.
This restriction is not just for IN lists but applies to any expression list. The documentation says:
A comma-delimited list of expressions can contain no more than 1000 expressions.
Your question is WHY the limit is 1000. Why not 100, or 10000, or a million? I guess it relates to the limit on the number of columns in a table, which is 1000. Perhaps Oracle uses this limit internally so that an expression list can match the columns in a DML statement.
But for a good design, the limit of 1000 is itself big. Practically, you won't reach it.
And a quote from the famous AskTom site on a similar topic:
We'll spend more time parsing queries than actually executing them!
Update: my own thoughts
I think Oracle is quite old in DB technology; these limits were set once and they never had to think about them again. All expression lists have the 1000 limit. And a robust design rarely leads users to ask Oracle for an explanation. Tom's answer about parsing always makes me think that the purpose of this limit back in the 70s or 80s was more of a computational issue: the C-based algorithms might have needed some limit, and Oracle came up with 1000.
Update 2: From the application and framework point of view
As a DBA, I have seen many developers come to me with performance issues that are actually issues with the application framework generating the queries that fetch the data. The application lets users add filters, which eventually form the AND/OR logic within the IN list of the query. Oracle expands this internally, as a query rewrite in the optimization stage, into OR logic. The query becomes huge, which increases the time to PARSE it, and most of the time it suppresses index usage. So this is one case where a query with a huge IN list is generated via the application framework.
