SO is full of workarounds, but I'm wondering about the historical reasons behind the 1000 limit for the "maximum number of expressions" in an IN clause?
It might be because there is potential for abuse with tons of values, and every value in the list will be transformed into an equivalent OR condition.
For example NAME IN ('JOHN', 'CHARLES'..) would be transformed into NAME = 'JOHN' OR NAME = 'CHARLES'
So, it might impact performance.
But note Oracle still supports
SELECT ID FROM EMP WHERE NAME IN (SELECT NAME FROM ATTENDEES)
In this case, the optimizer doesn't convert it into multiple OR conditions, but performs a JOIN instead.
This restriction is not only for IN lists, but applies to any expression list. The documentation says:
A comma-delimited list of expressions can contain no more than 1000 expressions.
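For illustration, a sketch of hitting the limit and the usual workaround (the EMP table and names are placeholders carried over from the example above; the elided values are only indicated in comments):

-- more than 1000 literals in one list raises
-- ORA-01795: maximum number of expressions in a list is 1000
-- the common workaround is to split the values across OR'ed IN lists,
-- each holding at most 1000 values:
SELECT ID FROM EMP
WHERE NAME IN ('JOHN', 'CHARLES' /* ... up to 1000 values ... */)
   OR NAME IN ('MARY', 'ALICE' /* ... the next chunk ... */)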
Your question is WHY the limit is 1000. Why not 100 or 10000 or a million? I guess it relates to the limit on the number of columns in a table, which is 1000. Perhaps Oracle uses this relation internally, so that an expression list can match the column list of a DML statement.
But, for a good design, the 1000 limit itself is big. Practically, you won't reach it.
And a quote from the famous AskTom site on a similar topic:
We'll spend more time parsing queries than actually executing them!
Update: My own thoughts
I think Oracle is quite old in DB technology; these limits were set once back then and they never had to think about them again. All expression lists have the 1000 limit. And a robust design never forces users to ask Oracle for an explanation. And Tom's answer about parsing always makes me think that the purpose of this limit, back in the 70s or 80s, was more of a computation issue. The C-based algorithms might have needed some limit, and Oracle came up with 1000.
Update 2: From the application and its framework's point of view
As a DBA, I have seen so many developers approach me with performance issues which are actually issues with the application framework generating the queries that fetch the data from the database. The application provides the functionality for users to add filters, which eventually form the AND, OR logic within the IN list of the query. Internally, Oracle expands it as a query rewrite in the optimization stage into OR logic. And the query becomes huge, thus increasing the time to PARSE it. Most of the time, it suppresses index usage. So, this is one of the cases where a query with a huge IN list is generated via the application framework.
Related
Checking the query cost on a table with 1 million records results in a full table scan, while the same query in Oracle with actual values results in a significantly lower cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query scans the full table when bind variables are used, while the query cost reduces significantly with actual values.
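For context, this kind of Oracle pagination query typically looks something like the following sketch (the table and column names are placeholders; :a1 and :a2 stand for the binds for the starting offset and the page size discussed below):

SELECT *
FROM (
    SELECT t.*, ROW_NUMBER() OVER (ORDER BY created_date DESC) AS rn
    FROM big_table t
)
WHERE rn BETWEEN :a1 AND :a1 + :a2 - 1;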
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now that you understand this, you can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
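For instance, a sketch of that hint applied to the pagination query shape above (CARDINALITY is an undocumented Oracle hint and the names remain placeholders, so benchmark before relying on it):

SELECT /*+ CARDINALITY(t 100) */ *
FROM (
    SELECT b.*, ROW_NUMBER() OVER (ORDER BY created_date DESC) AS rn
    FROM big_table b
) t
WHERE t.rn BETWEEN :a1 AND :a1 + 99;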
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).
Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName = 'F' and LastName = 'L'
select * from PEOPLE where LastName = 'L' and FirstName = 'F'
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have any of the following (see the sketch after this list):
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
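For reference, a minimal T-SQL sketch of those alternatives (the index names are made up, and you would create whichever one fits, not all four):

CREATE INDEX IX_People_Last_First ON PEOPLE (LastName, FirstName);
CREATE INDEX IX_People_First_Last ON PEOPLE (FirstName, LastName);
CREATE INDEX IX_People_LastName ON PEOPLE (LastName);
CREATE INDEX IX_People_FirstName ON PEOPLE (FirstName);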
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
The order of conditions in the WHERE clause should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
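If you do need one condition evaluated before the other, a common workaround is a CASE expression, which in most situations evaluates its branches in a guaranteed order (a sketch, not a universal guarantee):

select *
from INFORMATION_SCHEMA.TABLES
where case when ISNUMERIC(table_name) = 1
           then CAST(table_name as int) end <> 0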
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
No, all the RDBMSs first start by analysing the query and optimizing it, which may involve reordering your WHERE clause.
Depending on which RDBMS you are using, you can display the result of the analysis (search for "explain plan" in Oracle, for instance).
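For example, in Oracle (reusing the PEOPLE query from the question above):

EXPLAIN PLAN FOR
select * from PEOPLE where LastName = 'L' and FirstName = 'F';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);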
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which way to do it (which could differ every time), the DBMS would have to run a distinct count query for each column and compare the numbers; that would cost more than just shrugging and getting on with it.
Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL.
I guess you are confusing this with choosing the order of columns when creating an index, where you put the most selective column first, then the second most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is less than the parallelism threshold cost.
Anyone who's read the Parse documentation has stumbled upon this:
Caveat: Count queries are rate limited to a maximum of 160 requests per minute. They can also return inaccurate results for classes with more than 1,000 objects. Thus, it is preferable to architect your application to avoid this sort of count operation (by using counters, for example.)
Why's there such limitation and inaccuracy?
To quote the Parse Engineering Blog Post: Building Scalable Apps on Parse
Suppose you are building a product catalog. You might want to display
the count of products in each category on the top-level navigation
screen. If you run a count query for each of these UI elements, they
will not run efficiently on large data sets because MongoDB does not
use counting B-trees. Instead, we recommend that you use a separate
Parse Object to keep track of counts for each category. Whenever a
product gets added or deleted, you can increment or decrement the
counts in an afterSave or afterDelete Cloud Code handler.
To add on to this, here is another quote by Hector Ramos from the Parse Developers Google Group
Count queries have always been expensive once you throw some
constraints in. If you only care about the total size of the
collection, you can run a count query without any constraints and that
one should be pretty fast, as getting the total number of records is a
different problem than counting how many of these match an arbitrary
list of constraints. This is just the reality of working with database
systems.
The inaccuracy is not due to the 1000 request object limit. The count query will try to get the total number of records regardless of size, but since the operation may take a large amount of time to complete, it is possible that the database has changed during that window and the count value that is returned may no longer be valid.
The recommended way to handle counts is to essentially maintain your own index using before/after save hooks. However, this is also a non-ideal solution because save hooks can arbitrarily fail part way through and (worse) postSave hooks have no error propagation.
The limitation is simply to stop people using counts too much, they're just as runtime costly as full queries in effect.
The inaccuracy is because queries are limited to 1000 result objects (100 by default) and counts have the same hard limit.
You can run a recursive query to build up a count, but it's a crappy option. Hence the only really good option at this point in time (and as far as we can see in the future) is to keep an index of the things you're interested in counting and update the counts when anything changes. You would usually do that with save hooks in cloud code.
The Oracle IN clause has a limit of 1000 expressions for static data, but it accepts unlimited data from subqueries. Why?
It's a restriction on any expression list:
A comma-delimited list of expressions can contain no more than 1000 expressions.
Why 1000? Presumably the implementation needs some kind of limit, and that probably seemed like more than enough. There may well be, or certainly may have been when that limit was set decades ago, a performance reason for the limit as well, particularly as the IN is converted to multiple OR statements by the optimiser in this case (which you can see if you look at the execution plan).
I'd struggle to come up with a reasonable scenario that needed to get anywhere near that, with fixed values that couldn't be derived from other data anyway as a subquery.
I suspect it's somewhat related to the logical database limits which say you can't have more than 1000 columns in a table, for instance; since an expression list is used in an insert statement to list both the columns and the values being inserted, the expression list has to be able to match that, but maybe has no reason to exceed it.
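For instance, following that reading, both halves of an INSERT are capped by the same boundary (a trivial sketch reusing the EMP example from above):

-- the column list and the VALUES list each stay within the 1000 limit
INSERT INTO EMP (ID, NAME) VALUES (1, 'JOHN');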
Speculation of course... without seeing the internals of the software you're unlikely to get a definitive answer.
This is because IN has very poor performance with a large number of values in the list. It's just a shortcut for an OR clause, and at the database level the engine will change the IN into ORs.
You should also avoid subqueries inside an IN clause - better to use EXISTS.
Try using EXISTS rather than IN. You can also write the subqueries using EXISTS.
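A sketch of that rewrite, reusing the EMP/ATTENDEES example from earlier:

-- IN with a subquery:
SELECT ID FROM EMP WHERE NAME IN (SELECT NAME FROM ATTENDEES)

-- equivalent EXISTS form:
SELECT E.ID FROM EMP E
WHERE EXISTS (SELECT 1 FROM ATTENDEES A WHERE A.NAME = E.NAME)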
I have a query where I need to modify the selected data and I want to limit my results of that data. For instance:
SELECT table_id, radians( 25 ) AS rad FROM test_table WHERE rad < 5 ORDER BY rad ASC;
Where this gets hung up is the 'rad < 5', because according to CodeIgniter there is no 'rad' column. I've tried writing this as a custom query ($this->db->query(...)) but even that won't let me. I need to restrict my results based on this field. Oh, and the ORDER BY works perfectly if I remove the WHERE filter. The results are ordered ASC by the rad field.
HELP!!!
With many DBMSes, we need to repeat the formula / expression in the where clause, i.e.
SELECT table_id, radians( 25 ) AS rad
FROM test_table
WHERE radians( 25 ) < 5
ORDER BY radians( 25 ) ASC
However, in this case, since the calculated column is a constant, the query itself doesn't make much sense. Was there maybe a missing part, as in say radians (25 * myColumn) or something like that ?
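For illustration, assuming a hypothetical numeric column myColumn, the repeated-expression pattern would look like:

SELECT table_id, radians( 25 * myColumn ) AS rad
FROM test_table
WHERE radians( 25 * myColumn ) < 5
ORDER BY radians( 25 * myColumn ) ASC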
Edit (following info about true nature of formula etc.)
You seem disappointed, because the formula needs to be repeated... A few comments on that:
The fact that the formula needs to be explicitly spelled-out rather than aliased may make the query less readable, less fun to write etc. (more on this below), but the more important factor to consider is that the formula being used in the WHERE clause causes the DBMS to calculate this value for potentially all of the records in the underlying table!!!
This in turn hurts performance in several ways:
SQL may not be able to use some indexes, and instead has to scan the table (or parts thereof)
if the formula is heavy, it both makes for slow response and for a less scalable server
The situation is not quite as bad if additional predicates in the WHERE clause allow SQL to filter out [a significant amount of] records that would otherwise be processed. Such additional search criteria may be driven by the application (for example, in addition to this condition on the radian, the [unrelated] altitude of the location is required to be below 6,000 ft), or such criteria may be added "artificially" to help with the query (for example, you may know of a rough heuristic which is insufficient to calculate the "radian" value within acceptable precision, but may yet be good enough to filter out 70% of the records, only keeping those that have a chance of satisfying the exact range desired for the "radian").
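As a hedged illustration of such an "artificial" pre-filter, a cheap bounding-box test (on hypothetical x/y coordinate columns, ideally indexed) can eliminate most rows before the exact formula ever runs:

SELECT *
FROM points
WHERE x BETWEEN :x0 - 5 AND :x0 + 5 -- rough, index-friendly filter
  AND y BETWEEN :y0 - 5 AND :y0 + 5
  AND SQRT(POWER(x - :x0, 2) + POWER(y - :y0, 2)) <= 5 -- exact check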
Now a few tricks regarding the formula itself, in an attempt to make it faster:
remember that you may not need to run 100% of the textbook formula.
I'm not sure which part of the great circle math is relevant to this radian calculation, but speaking in generic terms, some formulas include an expensive step, such as a square root extraction, a call to a trig function, etc. In some cases it may be possible to simplify the formula (which has to be run for many records/values) by applying the reverse step to the other side of the predicate (which, typically, only needs to be evaluated once). For example, say the search condition predicate is "WHERE SQRT((x1-x2)^2 + (y1-y2)^2) > 5". Since the calculation of distance involves finding the square root (of the sum of the squared differences), one may decide to remove the square root and instead compare the result of this modified formula with the square of the original distance value, i.e. "WHERE ((x1-x2)^2 + (y1-y2)^2) > (5^2)".
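The same idea in runnable SQL (POWER replaces the ^ notation, which most SQL dialects don't support; the points table and its columns are hypothetical):

-- original: computes a square root for every row examined
SELECT * FROM points
WHERE SQRT(POWER(x1 - x2, 2) + POWER(y1 - y2, 2)) > 5;

-- rewritten: compares the squared distance against 5^2 = 25
SELECT * FROM points
WHERE POWER(x1 - x2, 2) + POWER(y1 - y2, 2) > 25;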
Depending on your SQL/DBMS system, it may be possible to implement the formula in a custom-defined function, which would make it both more efficient (because "pre-compiled", maybe written in a better language, etc.) and shorter to reference in the SQL query itself (even though it would require being listed twice, as said).
Depending on the situation, it may also be possible to alter the database schema and underlying application to have the formula (or parts thereof) pre-computed, and indexed, saving the DBMS this lengthy resolution of the function-based predicate.