I wonder what's usually faster:
Filter out duplicates and then do the select
or
Do the select directly with duplicates
I think it may be the first one but I don't know -
how to nicely and efficiently integrate the deletion of duplicates in my code?
DATA:
lt_itab TYPE TABLE OF string,
lt_range_itab TYPE RANGE OF string
.
* Populating itab with duplicates
* Can the following somehow become a neat one-liner? This is ugly!
APPEND '1' TO lt_itab.
APPEND '2' TO lt_itab.
APPEND '2' TO lt_itab.
APPEND '3' TO lt_itab.
APPEND '4' TO lt_itab.
APPEND '4' TO lt_itab.
*Populating range table from itab
*Should one remove the duplicates here for a performance boost in the upcoming select?
*If so - how?
*-------------------------------------------------------
lt_range_itab = VALUE #(
FOR <ls_itab> IN lt_itab
( sign = 'I'
option = 'EQ'
low = <ls_itab> )
).
*...or is such a select usually faster than the time it takes to remove the duplicates?
*-------------------------------------------------------
*SELECT *
* FROM anyTable
* INTO TABLE #DATA(lt_new_data)
* WHERE anyProperty NOT IN #lt_range_itab.
So many questions... :-)
Edit: I revised this answer after the discussions in the answers
What's usually faster?
Good databases will select fast even if there are some duplicates in the range.
Oracle's optimizer, for example, removes duplicates on its own. SAP HANA, in comparison, may get slower, but its dictionary-based architecture will usually keep it on a negligible level. So generally, I see no imperative to remove duplicates before each and every query.
However, things may go awry if the optimizer is sub-optimal and there is a large number of duplicates. So if you expect duplicates, it may be better to keep to the safe side and remove them before the query.
Also note that range tables have a length restriction. They are translated to a SQL statement with an IN clause, and SQL statement strings have a maximum number of characters. Duplicate removal thus may be a necessary strategy to get the query working at all.
Longer ranges can be converted to FOR ALL ENTRIES, which packetizes the query and allows for much longer ranges. However, that statement form leads to multiple roundtrips with the database, and will definitely suffer from duplicates in the query.
Can the following somehow become a neat one-liner?
DATA(lt_itab) = VALUE string_table( ( `1` ) ( `2` ) ( `2` ) ( `3` ) ( `4` ) ( `4` ) ).
Or right away as suggested by Sandra below:
SELECT ... WHERE anyProperty NOT IN ('1','2','3','4')
If so - how?
SORT lt_range_tab.
DELETE ADJACENT DUPLICATES FROM lt_range_tab.
Last but not least, note that an anyProperty IN #lt_range_tab may be considerably faster than the reversed NOT IN variant. Databases tend to keep positive indexes which respond best to positive queries. If you have the possibility, e.g. because you're filtering a field with fixed value list, it may be worthwhile to reverse the filter before sending it to the database.
The first one is faster of course, because the database operations with physical disks are much slower than in-memory operations.
Is the impact on performance noticeable is another question ; it depends on the number of duplicate selections and on the volume of data.
One well known example is the SELECT ... FOR ALL ENTRIES construct which can have a big impact on performance if the duplicates are not removed, because ABAP internally converts it into several SELECT, and so the same data may be selected several times (which is removed at ABAP side afterwards).
In brief, except when you are sure there is a small data volume, make sure there are no duplicates before a SELECT.
Related
Checking the query cost on a table with 1 million records results in full table scan while the same query in oracle with actual values results in significant lesser cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query is scanning the full table when bind variables are used:
The query cost reduces significantly with actual variables:
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now you understand this you will can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).
According to my understanding - and correct me if I'm wrong - "Normalization" is the process of removing the redundant data from the database-desing
However, when I was trying to learn about database optimizing/tuning for performance, I encountered that Mr. Rick James recommend against normalizing continuous values such as (INTS, FLOATS, DATETIME, ...)
"Normalize, but don't over-normalize." In particular, do not normalize
datetimes or floats or other "continuous" values.
source
Sure purists say normalize time. That is a big mistake. Generally,
"continuous" values should not be normalized because you generally
want to do range queries on them. If it is normalized, performance
will be orders of magnitude worse.
Normalization has several purposes; they don't really apply here:
Save space -- a timestamp is 4 bytes; a MEDIUMINT for normalizing is 3; not much savings
To allow for changing the common value (eg changing "International Business Machines" to "IBM" in one place) -- not relevent here; each
time was independently assigned, and you are not a Time Lord.
In the case of datetime, the normalization table could have extra columns like "day of week", "hour of day". Yeah, but performance still
sucks.
source
Do not normalize "continuous" values -- dates, floats, etc --
especially if you will do range queries.
source.
I tried to understand this point but I couldn't, can someone please explain this to me and give me an example of the worst case that applying this rule on will enhance the performance ?.
Note: I could have asked him in a comment or something, but I wanted to document and highlight this point alone because I believe this is very important note that affect almost my entire database performance
The Comments (so far) are discussing the misuse of the term "normalization". I accept that criticism. Is there a term for what is being discussed?
Let me elaborate on my 'claim' with this example... Some DBAs replace a DATE with a surrogate ID; this is likely to cause significant performance issues when a date range is used. Contrast these:
-- single table
SELECT ...
FROM t
WHERE x = ...
AND date BETWEEN ... AND ...; -- `date` is of datatype DATE/DATETIME/etc
-- extra table
SELECT ...
FROM t
JOIN Dates AS d ON t.date_id = d.date_id
WHERE t.x = ...
AND d.date BETWEEN ... AND ...; -- Range test is now in the other table
Moving the range test to a JOINed table causes the slowdown.
The first query is quite optimizable via
INDEX(x, date)
In the second query, the Optimizer will (for MySQL at least) pick one of the two tables to start with, then do a somewhat tedious back-and-forth to the other table to handle rest of the WHERE. (Other Engines use have other techniques, but there is still a significant cost.)
DATE is one of several datatypes where you are likely to have a "range" test. Hence my proclamations about it applying to any "continuous" datatypes (ints, dates, floats).
Even if you don't have a range test, there may be no performance benefit from the secondary table. I often see a 3-byte DATE being replaced by a 4-byte INT, thereby making the main table larger! A "composite" index almost always will lead to a more efficient query for the single-table approach.
Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName="F" and LastName="L"
select * from PEOPLE where LastName="L" and FirstName="F"
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here
No, all the RDBMs first start by analysing the query and optimize it by reordering your where clause.
Depending on which RDBM you are you using can display what is the result of the analyse (search for explain plan in oracle for instance)
M.
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong though. In order to find out which way to do it, which could differ every time, the DBMS would have to run a distinct count query for each column and compare the numbers, that would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in >the where clause, and records will get eliminated more efficiently. I don't think the optimizer is >smart enough to optimize the first sql.
I guess you are confusing this with selecting the order of columns while creating the indexes where you have to put the more selective columns first than second most selective and so on.
BTW, for the above two query SQL server optimizer will not do any optimization but will use Trivila plan as long as the total cost of the plan is less than parallelism threshold cost.
I've be told and read it everywhere (but no one dared to explain why) that when composing an index on multiple columns I should put the most selective column first, for performance reasons.
Why is that?
Is it a myth?
I should put the most selective column first
According to Tom, column selectivity has no performance impact for queries that use all the columns in the index (it does affect Oracle's ability to compress the index).
it is not the first thing, it is not the most important thing. sure, it is something to consider but it is relatively far down there in the grand scheme of things.
In certain strange, very peculiar and abnormal cases (like the above with really utterly skewed data), the selectivity could easily matter HOWEVER, they are
a) pretty rare
b) truly dependent on the values used at runtime, as all skewed queries are
so in general, look at the questions you have, try to minimize the indexes you need based on that.
The number of distinct values in a column in a concatenated index is not relevant when considering
the position in the index.
However, these considerations should come second when deciding on index column order. More importantly is to ensure that the index can be useful to many queries, so the column order has to reflect the use of those columns (or the lack thereof) in the where clauses of your queries (for the reason illustrated by AndreKR).
HOW YOU USE the index -- that is what is relevant when deciding.
All other things being equal, I would still put the most selective column first. It just feels right...
Update: Another quote from Tom (thanks to milan for finding it).
In Oracle 5 (yes, version 5!), there was an argument for placing the most selective columns first
in an index.
Since then, it is not true that putting the most discriminating entries first in the index
will make the index smaller or more efficient. It seems like it will, but it will not.
With index
key compression, there is a compelling argument to go the other way since it can make the index
smaller. However, it should be driven by how you use the index, as previously stated.
You can omit columns from right to left when using an index, i.e. when you have an index on col_a, col_b you can use it in WHERE col_a = x but you can not use it in WHERE col_b = x.
Imagine to have a telephone book that is sorted by the first names and then by the last names.
At least in Europe and US first names have a much lower selectivity than last names, so looking up the first name wouldn't narrow the result set much, so there would still be many pages to check for the correct last name.
The ordering of the columns in the index should be determined by your queries and not be any selectivity considerations. If you have an index on (a,b,c), and most of your single column queries are against column c, followed by a, then put them in the order of c,a,b in the index definition for the best efficiency. Oracle prefers to use the leading edge of the index for the query, but can use other columns in the index in a less efficient access path known as skip-scan.
The more selective is your index, the fastest is the research.
Simply imagine a phonebook: you can find someone mostly fast by lastname. But if you have a lot of people with the same lastname, you will last more time on looking for the person by looking at the firstname everytime.
So you have to give the most selective columns firstly to avoid as much as possible this problem.
Additionally, you should then make sure that your queries are using correctly these "selectivity criterias".
I have a query where I need to modify the selected data and I want to limit my results of that data. For instance:
SELECT table_id, radians( 25 ) AS rad FROM test_table WHERE rad < 5 ORDER BY rad ASC;
Where this gets hung up is the 'rad < 5', because according to codeigniter there is no 'rad' column. I've tried writing this as a custom query ($this->db->query(...)) but even that won't let me. I need to restrict my results based on this field. Oh, and the ORDER BY works perfect if I remove the WHERE filter. The results are order ASC by the rad field.
HELP!!!
With many DBMSes, we need to repeat the formula / expression in the where clause, i.e.
SELECT table_id, radians( 25 ) AS rad
FROM test_table
WHERE radians( 25 ) < 5
ORDER BY radians( 25 ) ASC
However, in this case, since the calculated column is a constant, the query itself doesn't make much sense. Was there maybe a missing part, as in say radians (25 * myColumn) or something like that ?
Edit (following info about true nature of formula etc.)
You seem disappointed, because the formula needs to be repeated... A few comments on that:
The fact that the formula needs to be explicitly spelled-out rather than aliased may make the query less readable, less fun to write etc. (more on this below), but the more important factor to consider is that the formula being used in the WHERE clause causes the DBMS to calculate this value for potentially all of the records in the underlying table!!!
This in turns hurts performance in several ways:
SQL may not be able use some indexes, and instead have to scan the table (or parts thereof)
if the formula is heavy, it both makes for slow response and for a less scalable server
The situation is not quite as bad if additional predicates in the WHERE clause allow SQL to filter out [a significant amount of] records that would otherwise be processed. Such additional search criteria may be driven by the application (for example in addition to this condition on radiant, the [unrelated] altitude of the location is required to be below 6,000 ft), or such criteria may be added "artificially" to help with the query (for example you may know of a rough heuristic which is insufficient to calculate the "radian" value within acceptable precision, but may yet be good enough to filter-out 70% of the records, only keeping these which have a chance of satisfying the exact range desired for the "radian".
Now a few tricks regarding the formula itself, in an attempt to make it faster:
remember that you may not need to run 100% of the textbook formula.
I'm not sure which part of the great circle math is relevant to this radian calculation, but speaking in generic terms, some formulas include an expensive step, such as a square root extraction, a call to a trig function etc. In some cases it may be possible to simplify the formula (which has to be run for many records/values), by applying the reverse step to the other side of the predicate (which, typically, only needs to be evaluated once). For example if say the search condition predicate is "WHERE SQRT((x1-x2)^2 + (y1-y2)^2) > 5". Since the calculation of distance involves finding the square root (of the sum of the squared differences), one may decide to remove the square root and instead compare the results of this modified formula with the square of the distance value origninally, i.e. "WHERE ((x1-x2)^2 + (y1-y2)^2) > (5^2)"
Depending on your SQL/DBMS system, it may be possible to implement the formula in a custom-defined function, which would make it both more efficient (because "pre-compiled", maybe written in a better language etc.) and shorter to reference, in the SQL query itself (event though it would require being listed twice, as said)
Depending on situation, it may also be possible to alter the database schema and underlying application, to have the formula (or parts thereof) but pre-computed, and indexed, saving the DBMS this lengthy resolution of the function-based predicate.