How does predicate pushdown work exactly?

How does predicate pushdown work exactly? - hadoop

Could anyone please explain with examples how exactly predicate pushdown works?

Say you want to execute a query
SELECT
SUM(price)
FROM sales
WHERE
purchase_date BETWEEN '2018-01-01' and '2018-01-31';
A very trivial implementation of a query engine is to iterate over all parquet/orc files, deserialize the price and purchase_date columns, apply the predicate on the purchase_date and sum the filtered rows.
Parquet (not sure about orc) maintains statistics on the columns in each file, so if the execution engine is smart enough, it can look at the min/max of the purchase_date within the statistics and determine if any rows is going to match. For example, if purchase_date.min=2014-05-05 and purchase_date.max=2014-05-06, it can deduce that the predicate will always evaluate to false.
In other words, it can skip parquet files by combining statistics and the filter predicate . This can lead to massive gain of performance because IO (file or memory) is usually the bottleneck. The gain is inversely proportional to the selectivity (the percentage of matching rows).
The term predicate push-down comes from the fact that you're "hinting" the scan operator with the predicate that is then going to be used to filter the rows of interest. Or, pushing the predicate to the scan.

Related

Bind variables results in full table scan in Oracle

Checking the query cost on a table with 1 million records results in full table scan while the same query in oracle with actual values results in significant lesser cost.
Is this expected behaviour from Oracle ?
Is there a way to tell Oracle not to scan the full table ?
The query is scanning the full table when bind variables are used:
The query cost reduces significantly with actual variables:

This is a pagination query. You want to retrieve a handful of records from the table, filtering on their position in the filtered set. Your projection includes all the columns of the table, so you need to query the table to get the whole row. The question is, why do the two query variants have different plans?
Let's consider the second query. You are passing hard values for the offsets, so the optimizer knows that you want the eleven most recent rows in the sorted set. The set is sorted by an indexed column. The most important element is that the optimizer knows you want 11 rows. 11 is a very small sliver of one million, so using an indexed read to get the required rows is an efficient way of doing things. The path starts at the far end of the index, reads the last eleven entries and retrieves the rows.
Now, your first query has bind variables for the starting and finishing offsets and also for the number of rows to be returned. This is crucial: the optimizer doesn't know whether you want to return eleven rows or eleven thousand rows. So it opts for a very high cardinality. The reason for this is that index reads perform very badly for retrieving large numbers of rows. Full table scans are the best way of handling big slices of our tables.
Is this expected behaviour from Oracle ?
Now you understand this you will can see that the answer to this question is yes. The optimizer makes the best decision it can with the information we give it. When we provide hard values it can be very clever. When we provide vague data it has to guess; sometimes its guesses aren't the ones we expected.
Bind variables are very useful for running the same query with different values when the expected result set is similar. But using bind variables to specify ranges means the result sets can potentially vary tremendously in size.
Is there a way to tell Oracle not to scan the full table ?
If you can fix the pagesize, thus removing the :a2 parameter, that would allow the optimizer to produce a much more accurate plan. Alternatively, if you need to vary the pagesize within a small range (say 10 - 100) then you could try a /*+ cardinality (100) */ hint in the query; provided the cardinality value is within the right order of magnitude it doesn't have to be the precise value.
As with all performance questions, the devil is in the specifics. So you need to benchmark various performance changes and choose the best fit for your particular use case(s).

Order of multiple conditions in where clause in oracle [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more unique, and FirstName is less unique.
If I do two searches:
select * from PEOPLE where FirstName="F" and LastName="L"
select * from PEOPLE where LastName="L" and FirstName="F"
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?

No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).

The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.

ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
copied from here

No, all the RDBMs first start by analysing the query and optimize it by reordering your where clause.
Depending on which RDBM you are you using can display what is the result of the analyse (search for explain plan in oracle for instance)
M.

It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong though. In order to find out which way to do it, which could differ every time, the DBMS would have to run a distinct count query for each column and compare the numbers, that would cost more than just shrugging and getting on with it.

Original OP statement
My belief is the second one is faster because the more unique criterion (LastName) comes first in >the where clause, and records will get eliminated more efficiently. I don't think the optimizer is >smart enough to optimize the first sql.
I guess you are confusing this with selecting the order of columns while creating the indexes where you have to put the more selective columns first than second most selective and so on.
BTW, for the above two query SQL server optimizer will not do any optimization but will use Trivila plan as long as the total cost of the plan is less than parallelism threshold cost.

When may it make sense to not have an index on a table and why?

For Oracle and being Relative to application tuning, when may it make sense to not have an index on a table and why?

There is a cost associated to having an index:
it takes up disk space
it slows down updates (index needs to be updated as well)
it makes query planning more complex (slightly slower, but more importantly increased potential for bad decisions)
These costs are supposed to be offset by the benefit of more efficient query processing (faster, fewer I/O).
If the index is not used enough to justify the cost, then having the index will be negative.

In particular, if your data distribution is low (think flags like 'Y' and 'N'), indexes won't help much. Think of it this way: if the number of distinct values in an index is low, the optimizer will probably choose not to use the index. An interesting aside is that if the column in the index is null, it might be much faster if your query criteria include actual values as nulls aren't indexed, which means that only the actual values (non null) are in that particular index,thereby not evaluating most of the rows in the table. In the "is null" case, it will never use an index - if you have a query with a "where" clause like "where mytable.mycolumn is null", abandon all indexes ye who enter here.

If a table has very little data (small number of rows) then it doesn't serve you to use an index. An index makes it quick to search on a specific attribute and if the application you are working with doesn't need a fast lookup then using an index does very little for you.

performance for sum oracle

I have to sum a huge number of data with aggregation and where clause, using this query
what I am doing is like this : I have three tables one contains terms the second contains user terms , and the third contains correlation factor between term and user term.
I want to calculate the similarity between the sentence that that user inserted with an already existing sentences, and take the results greater than .5 by summing the correlation factor between sentences' terms
The problem is that this query takes more than 15 min. because I have huge tables
any suggestions to improve performance please?
insert into PLAG_SENTENCE_SIMILARITY
SELECT plag_TERMS.SENTENCE_ID ,plag_User_TERMS.SENTENCE_ID,
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length),
plag_TERMs.isn,
plag_user_terms.isn
FROM plag_TERM_CORRELATIONS3,
plag_TERMS,
Plag_User_TERMS
WHERE ( Plag_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM1
AND Plag_User_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM2
AND Plag_User_Terms.ISN=123)
having
least( sum( plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_terms.sentence_length,
sum (plag_TERM_CORRELATIONS3.CORRELATION_FACTOR)/ plag_user_terms.sentence_length) >0.5
group by (plag_User_TERMS.SENTENCE_ID,plag_TERMS.SENTENCE_ID , plag_TERMs.isn, plag_terms.sentence_length,plag_user_terms.sentence_length, plag_user_terms.isn);
plag_terms contains more than 50 million records and plag_correlations3 contains 500000

If you have a sufficient amount of free disk space, then create a materialized view
over the join of the three tables
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
If you don't have a spare disk space, then just hint your query to
either use nested loops joins, since the number of rows processed seems to be quite low (assumed by the estimations in the execution plan)
or full-scan the plag_correlations3 table in parallel
Bottom line: Constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! Because Oracle optimizer is capable of using most of these informations to its advantage, as are the people who tune SQL queries.

multicolumn index column order

I've be told and read it everywhere (but no one dared to explain why) that when composing an index on multiple columns I should put the most selective column first, for performance reasons.
Why is that?
Is it a myth?

I should put the most selective column first
According to Tom, column selectivity has no performance impact for queries that use all the columns in the index (it does affect Oracle's ability to compress the index).
it is not the first thing, it is not the most important thing. sure, it is something to consider but it is relatively far down there in the grand scheme of things.
In certain strange, very peculiar and abnormal cases (like the above with really utterly skewed data), the selectivity could easily matter HOWEVER, they are
a) pretty rare
b) truly dependent on the values used at runtime, as all skewed queries are
so in general, look at the questions you have, try to minimize the indexes you need based on that.
The number of distinct values in a column in a concatenated index is not relevant when considering
the position in the index.
However, these considerations should come second when deciding on index column order. More importantly is to ensure that the index can be useful to many queries, so the column order has to reflect the use of those columns (or the lack thereof) in the where clauses of your queries (for the reason illustrated by AndreKR).
HOW YOU USE the index -- that is what is relevant when deciding.
All other things being equal, I would still put the most selective column first. It just feels right...
Update: Another quote from Tom (thanks to milan for finding it).
In Oracle 5 (yes, version 5!), there was an argument for placing the most selective columns first
in an index.
Since then, it is not true that putting the most discriminating entries first in the index
will make the index smaller or more efficient. It seems like it will, but it will not.
With index
key compression, there is a compelling argument to go the other way since it can make the index
smaller. However, it should be driven by how you use the index, as previously stated.

You can omit columns from right to left when using an index, i.e. when you have an index on col_a, col_b you can use it in WHERE col_a = x but you can not use it in WHERE col_b = x.
Imagine to have a telephone book that is sorted by the first names and then by the last names.
At least in Europe and US first names have a much lower selectivity than last names, so looking up the first name wouldn't narrow the result set much, so there would still be many pages to check for the correct last name.

The ordering of the columns in the index should be determined by your queries and not be any selectivity considerations. If you have an index on (a,b,c), and most of your single column queries are against column c, followed by a, then put them in the order of c,a,b in the index definition for the best efficiency. Oracle prefers to use the leading edge of the index for the query, but can use other columns in the index in a less efficient access path known as skip-scan.

The more selective is your index, the fastest is the research.
Simply imagine a phonebook: you can find someone mostly fast by lastname. But if you have a lot of people with the same lastname, you will last more time on looking for the person by looking at the firstname everytime.
So you have to give the most selective columns firstly to avoid as much as possible this problem.
Additionally, you should then make sure that your queries are using correctly these "selectivity criterias".

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio