Suppose an in-clause can accept a varying number of bind parameters. In this case, databases can have a hard time caching the query: each time a different number of bind parameters is passed, the query needs to be hard parsed. Enter "parameter padding". Parameter padding takes an in-clause and increases the number of binds up to the next 2^x (the next power of two).
Examples:
select count(*) from user where id in (1, 2, 3) becomes select count(*) from user where id in (1, 2, 3, 3)
select count(*) from user where id in (1, 2, 3, 4) remains select count(*) from user where id in (1, 2, 3, 4)
And now these two queries can share the same cached plan.
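With bind variables, the padded statement would look something like this (a sketch; the table and bind names are illustrative, not from the original):

-- Both a 3-value and a 4-value lookup are sent as the same 4-placeholder statement;
-- for 3 values, the last value is simply bound twice.
select count(*) from users where id in (:b1, :b2, :b3, :b4)
-- 3 values: bind (1, 2, 3, 3)
-- 4 values: bind (1, 2, 3, 4)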
Question: Is there a logical reason for 2^x binds? Why not 3^x or 5^x so that even fewer hard parses are required? This would be especially helpful in queries that contain multiple in-clauses with varying binds.
The specific database in question is Oracle 12c. Using stats for a query that has an in-clause of in (1, 2, 3, 3) shows that the duplicate values do not appear in the execution plan. Furthermore, using stats for a query requiring 30 binds runs just as efficiently when using an in-clause of the exact 30 values needed OR using an in-clause with 100 values where the last value appears 70 more times.
There's a tradeoff:
If you choose, say, 5^x, then for 7 parameters your IN-list will have 25 members, instead of just 8. The query will then take longer to run - the fact that the tail values are all equal won't help.
Note that your example of an explain plan for the IN-list of (1,2,3,3) is irrelevant. That has hard-coded values, not bind variables. The relevant example is (:bind1, :bind2, :bind3, :bind4); when the query is parsed, the optimizer can't assume that :bind3 will always equal :bind4 (for the obvious reason that that's not even true in general).
2^x is usually a good tradeoff between "how many hard parses to allow" and "how fast the queries will be". Otherwise you could just use a single query, with 1000 parameters (the max allowed) - why even have more than ONE such query?
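To make the tradeoff concrete, here is a quick Oracle calculation (illustrative only) of the padded IN-list size for 7 actual values under base-2 versus base-5 padding:

-- 7 values pad to the next power of the base: 8 placeholders with base 2,
-- but 25 placeholders with base 5.
select power(2, ceil(log(2, 7))) as padded_base2,
       power(5, ceil(log(5, 7))) as padded_base5
from dual;
-- PADDED_BASE2  PADDED_BASE5
-- ------------  ------------
--            8            25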
Related
Checking the query cost on a table with 1 million records shows a full table scan when bind variables are used, while the same query in Oracle with actual values has a significantly lower cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query scans the full table when bind variables are used, but the cost drops significantly when actual values are supplied instead.
This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows, so it assumes a very high cardinality. The reason this matters is that index reads perform very badly when retrieving large numbers of rows; full table scans are the best way of handling big slices of a table.
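For illustration, a typical shape for such a pagination query (the original statement isn't reproduced above, so the table, column and bind names here are assumptions):

-- Bind offsets: the optimizer cannot tell whether :a1..:a2 spans 11 rows or
-- 11,000, so it plans for a large slice and favours a full scan.
select *
from (
  select t.*, row_number() over (order by created_date desc) rn
  from big_table t
)
where rn between :a1 and :a2;

-- Literal offsets: the optimizer knows only 11 rows are needed and can walk
-- the index from its far end instead.
-- ... where rn between 1 and 11;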
Is this expected behaviour from Oracle ?
Now that you understand this, you can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
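A sketch of the hint applied to the same shape of query (the table, column and bind names are assumptions; the cardinality hint is undocumented but widely used):

-- Tell the optimizer to assume roughly 100 rows even though :a2 is a bind variable.
select /*+ cardinality (100) */ *
from (
  select t.*, row_number() over (order by created_date desc) rn
  from big_table t
)
where rn between :a1 and :a2;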
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).
Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName = 'F' and LastName = 'L'
select * from PEOPLE where LastName = 'L' and FirstName = 'F'
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
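A minimal sketch (SQL Server syntax assumed; index names are illustrative) you can use to check this yourself - both predicate orders should compile to the same plan once a suitable index exists:

CREATE TABLE PEOPLE (ID int PRIMARY KEY, LastName varchar(50), FirstName varchar(50));
CREATE INDEX IX_People_Last_First ON PEOPLE (LastName, FirstName);

-- Compare the actual execution plans of these two statements: they should be identical.
SELECT * FROM PEOPLE WHERE FirstName = 'F' AND LastName = 'L';
SELECT * FROM PEOPLE WHERE LastName = 'L' AND FirstName = 'F';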
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
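If you want to avoid relying on evaluation order at all, one common workaround (a sketch, assuming SQL Server 2012 or later) is TRY_CAST, which returns NULL instead of raising an error for non-numeric names:

select *
from INFORMATION_SCHEMA.TABLES
where TRY_CAST(table_name as int) <> 0   -- NULL for non-numeric names, so those rows are simply filtered out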
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
No, all RDBMSs first analyse the query and optimize it, reordering your WHERE clause if needed.
Depending on which RDBMS you are using, you can display the result of this analysis (search for EXPLAIN PLAN in Oracle, for instance).
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which way to do it (which could differ every time), the DBMS would have to run a distinct-count query for each column and compare the numbers; that would cost more than just shrugging and getting on with it.
Original OP statement:
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL.
I guess you are confusing this with choosing the order of columns when creating an index, where you should put the most selective column first, then the second most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is less than the cost threshold for parallelism.
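By contrast, index column order genuinely does matter; a small sketch (SQL Server syntax, illustrative index names) of the point above:

-- A filter on LastName alone can seek an index that leads with LastName,
-- but can only scan one that leads with FirstName.
CREATE INDEX IX_Last_First ON PEOPLE (LastName, FirstName);   -- supports a seek for this filter
CREATE INDEX IX_First_Last ON PEOPLE (FirstName, LastName);   -- does not (LastName is the second key)

SELECT ID FROM PEOPLE WHERE LastName = 'L';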
I have a huge table (more than 1 billion rows) in Impala. I need to sample ~100,000 rows several times. What is the best way to query sample rows?
As Jeff mentioned, what you've asked for exactly isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns the samples, comma-delimited as a single row. There is no way to change the number of samples yet. If there are fewer than 200,000 rows, all will be returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.
There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.
For example, sampling trivially from a table with 8 rows:
> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id) |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s
(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)
Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow its development.
In retrospect, knowing that TABLESAMPLE is unavailable, one could add a field RVAL (a random 32-bit integer, for instance) to each record, and sample repeatedly by adding "where RVAL > x and RVAL < y" for appropriate values of x and y. Non-overlapping intervals [x1, y1], [x2, y2], ... will be independent. You can also select using "where RVAL % 10000 = 1", "= 2", etc. for separate, independent subsets.
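A sketch of that workaround (table and column names are illustrative). Roughly 100,000 out of 1 billion rows is about 1 in 10,000, so:

-- Assumes an RVAL column populated once with a random 32-bit integer per row.
select * from huge_table where rval % 10000 = 1;   -- first sample
select * from huge_table where rval % 10000 = 2;   -- second, independent sample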
TABLESAMPLE, mentioned in other answers, is now available in newer versions of Impala (>= 2.9.0); see the documentation.
Here's an example of how you could use it to sample 1% of your data:
SELECT foo FROM huge_table TABLESAMPLE SYSTEM(1)
or
SELECT bar FROM huge_table TABLESAMPLE SYSTEM(1) WHERE name='john'
It looks like the percentage argument must be an integer, so the smallest sample you can take is 1%.
Keep in mind that the proportion of sampled data from the table is not guaranteed and may be greater than the specified percentage (in this case more than 1%). This is explained in greater detail in Impala's documentation.
If you are looking to sample over certain column(s), you can use the approach below.
Say you have global data and you want to pick 10% of it randomly to create your dataset. You can use any combination of columns too - like city, zip code and state.
select * from
(
  select
    row_number() over (partition by country order by random()) rn,
    count(*) over (partition by country) cntpartition,
    tab.*
  from dat.mytable tab
) rs
where rs.rn <= rs.cntpartition * 10 / 100   -- this is for 10% of the data
Link: Randomly sampling n rows in impala using random() or tablesample system()
I was wondering how this query is executed:
SELECT TOP 10 * FROM aSybaseTable
WHERE aCondition
The fact is that this query is taking too much time to return results.
So I was wondering if the query is smart enough to stop when the results reach 10 rows, or if it returns all the possible results and then prints only the first 10 rows.
Thanks in advance for your replies !
When using select top N the query is still executed fully, just the data page reads stop after the specified number of rows is affected. All the index page reads, and sorts still have to occur, so depending on the complexity of the where condition or subqueries, it can definitely still take time to execute. select top N is functionally similar to using set rowcount
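For illustration, a sketch of that equivalence (Sybase ASE syntax assumed; aSybaseTable and aCondition are the placeholders from the question):

select top 10 * from aSybaseTable where aCondition

set rowcount 10
select * from aSybaseTable where aCondition
set rowcount 0   -- reset so later statements are not capped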
Michael is right, but there is one special case here that really needs to be mentioned.
The query WILL execute faster, and only partially, if no ORDER BY or GROUP BY clauses are used.
But this case is rarely useful, since you will then get an arbitrary N rows which fulfil the condition, in the order they are physically located in the table/index.
I have some numbers in a fact table, and have generated a measure which uses the SUM aggregator to summarize the numbers. But the problem is that I only want to sum the numbers that are higher than, say, 10. I tried using a generic expression in the measure definition, and that works of course, but I need to be able to set that value dynamically, because it's not always 10, meaning users should be able to select it themselves.
More specifically, my current MDX looks like this:
WITH
SET [Email Measures] AS '{[Measures].[Number Of Answered Cases],
[Measures].[Max Expedition Time First In Case], [Measures].[Avg Expedition Times First In Case],
[Measures].[Number Of Incoming Email Requests], [Measures].[Avg Number Of Emails In Cases],
[Measures].[Avg Expedition Times Total],[Measures].[Number Of Answered Incoming Emails]}'
SET [Organizations] AS '{[Organization.Id].[860]}'
SET [Operators] AS '{[Operator.Id].[3379],[Operator.Id].[3181]}'
SET [Email Accounts] AS '{[Email Account.Id].[6]}'
MEMBER [Time.Date].[Date Period] AS Aggregate ({[Time.Date].[2008].[11].[11] :[Time.Date].[2009].[1].[2] })
MEMBER [Email.Type].[Email Types] AS Aggregate ({[Email.Type].[0]})
SELECT {[Email Measures]} ON columns,
[Operators] ON rows
FROM [Email_Fact]
WHERE ( [Time.Date].[Date Period] )
Now, the member in question is the calculated member [Avg Expedition Times Total]. This member takes in two measures, [Sum Expedition Times] and [Nr of Expedition Times], and divides one by the other to get the average; all of this presently works. However, I want [Sum Expedition Times] to only summarize values over or under a parameter of my/the user's choosing.
How do I filter the numbers [Sum Expedition Times] iterates through, rather than filtering on the sum that the measure gives me in the end?
You could move the member into the MDX query, instead of putting it in the cube. Then you get something like:
WITH
MEMBER [Measures].[Avg Expedition Times Total] AS
  SUM(
    FILTER([Your Dimension].Members,
           [Measures].[Measure you want to filter] > 10),
    [Measures].[Measure you want to sum])
I'm not sure exactly which dimensions and measure you want to filter and sum by, but I think this is a step in the right direction. If your users can't modify the MDX (or don't want to!) then creating multiple measures is a pretty solid solution too.
You need to have a dimension which has the distinct values of this measure (in case the number of distinct values is too high, then perhaps some sort of range). Then you join this dimension to the fact. The join should be simple; the measure column will become the key column as well. Now you just need to refer to the dimension member.