chained comparison in DolphinDB queries - performance

I noticed that chained comparison in DolphinDB queries is considerably slower than expected.
For example, for a distributed table "quotes" with more than 2 billion rows, the query
timer select avg(bid) as bid, avg(ofr) as ofr from quotes where 2020.12.07<=date<=2020.12.11 group by date, minute(time) as minute
is much slower than
timer select avg(bid) as bid, avg(ofr) as ofr from quotes where date>=2020.12.07, date<=2020.12.11 group by date, minute(time) as minute
The second query is really fast though. Does anyone know how to write a proper chained comparison in DolphinDB?

According to the DolphinDB manual:
https://www.dolphindb.com/help/Queries.html
Filtering conditions with chained comparisons such as where 2020.12.07<=date<=2020.12.11 will scan all partitions instead of narrowing down to the relevant partitions. For optimal performance, use where date between 2020.12.07:2020.12.11 or separate inequality conditions instead.
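The effect is easy to reproduce outside DolphinDB. Here is a small SQLite sketch (table, index, and column names are invented for the demo); SQLite happens to accept the chained form too, but parses it as ((lo <= date) <= hi), which defeats both the filter and the index, while the explicit range form is satisfied by an index search:

```python
import sqlite3

# Stand-in for the behavior described above, using SQLite rather than DolphinDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (date TEXT, bid REAL, ofr REAL)")
conn.execute("CREATE INDEX idx_date ON quotes(date)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows: (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Chained comparison: parsed as (('2020.12.07' <= date) <= '2020.12.11'),
# so the planner sees an opaque expression and falls back to a full scan.
p_chained = plan("SELECT avg(bid) FROM quotes "
                 "WHERE '2020.12.07' <= date <= '2020.12.11'")

# Explicit range predicate: the planner can seek into the index.
p_range = plan("SELECT avg(bid) FROM quotes "
               "WHERE date >= '2020.12.07' AND date <= '2020.12.11'")

print(p_chained)  # full table scan
print(p_range)    # index search on idx_date
```

The mechanism differs (DolphinDB's chained comparison is semantically valid but not prunable, whereas SQLite's is outright wrong), but the performance symptom is the same: the range form lets the engine skip irrelevant data, the chained form does not.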

Related

What is the difference between the UNION and CONCATENATION operators in terms of performance?

Sometimes a CONCATENATION step can be seen in the explain plan.
What is the difference between the UNION and CONCATENATION operators in terms of performance tuning?
First up, UNION and CONCATENATION are subtly different.
CONCATENATION is equivalent to UNION-ALL. This combines the input tables and returns all the rows.
UNION combines the input tables, then returns only the distinct rows.
So UNION has an extra sort/distinct operation compared to CONCATENATION. How big this effect is depends on your data set.
You'll see CONCATENATION when the optimizer does an OR expansion. But note that from Oracle Database 12.2, this has changed:
CONCATENATION is replaced with UNION-ALL.
Each UNION-ALL branch can be subject to further query transformations, if applicable. This is not possible with
CONCATENATION.
Parallel queries can execute UNION-ALL branches concurrently. Again, this is not possible with CONCATENATION.
So UNION-ALL can come up with better plans for each operation below it, and run these at the same time (if using parallel). So in many cases it will be faster than CONCATENATION.
A UNION has to remove duplicates, which is expensive. CONCATENATION is the step that happens when you do a UNION ALL (which does not remove duplicates, hence is cheaper).
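The semantic difference is easy to demonstrate in any SQL engine; here is a minimal SQLite sketch (tables and data invented for the demo) showing UNION removing duplicates where UNION ALL simply concatenates:

```python
import sqlite3

# UNION deduplicates (paying for a sort/distinct step); UNION ALL simply
# concatenates the inputs and returns every row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);
    INSERT INTO b VALUES (2), (3);
""")

union_rows = conn.execute("SELECT x FROM a UNION SELECT x FROM b").fetchall()
all_rows = conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b").fetchall()

print(sorted(union_rows))  # [(1,), (2,), (3,)] -- duplicates removed
print(len(all_rows))       # 5 -- every input row kept
```

The extra distinct step is exactly the cost the answer above attributes to UNION.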

Order of multiple conditions in where clause in oracle [duplicate]

Let's say I have a table called PEOPLE having three columns, ID, LastName, and FirstName. None of these columns are indexed.
LastName is more selective, and FirstName is less selective.
If I do two searches:
select * from PEOPLE where FirstName = 'F' and LastName = 'L'
select * from PEOPLE where LastName = 'L' and FirstName = 'F'
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
Is my understanding correct?
No, that order doesn't matter (or at least: shouldn't matter).
Any decent query optimizer will look at all the parts of the WHERE clause and figure out the most efficient way to satisfy that query.
I know the SQL Server query optimizer will pick a suitable index - no matter which order you have your two conditions in. I assume other RDBMS will have similar strategies.
What does matter is whether or not you have a suitable index for this!
In the case of SQL Server, it will likely use an index if you have:
an index on (LastName, FirstName)
an index on (FirstName, LastName)
an index on just (LastName), or just (FirstName) (or both)
On the other hand - again for SQL Server - if you use SELECT * to grab all columns from a table, and the table is rather small, then there's a good chance the query optimizer will just do a table (or clustered index) scan instead of using an index (because the lookup into the full data page to get all other columns just gets too expensive very quickly).
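One way to see this for yourself is to compare the plans the optimizer produces for both predicate orders. A sketch in SQLite (SQL Server behaves analogously; table and index names are made up):

```python
import sqlite3

# Show that swapping the order of AND-ed predicates in the WHERE clause
# does not change the plan the optimizer chooses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, LastName TEXT, FirstName TEXT)")
conn.execute("CREATE INDEX idx_ln_fn ON people(LastName, FirstName)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows: (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

p1 = plan("SELECT * FROM people WHERE FirstName = 'F' AND LastName = 'L'")
p2 = plan("SELECT * FROM people WHERE LastName = 'L' AND FirstName = 'F'")

print(p1 == p2)  # True: identical plan either way, using idx_ln_fn
```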
The order of WHERE clauses should not make a difference in a database that conforms to the SQL standard. The order of evaluation is not guaranteed in most databases.
Do not think that SQL cares about the order. The following generates an error in SQL Server:
select *
from INFORMATION_SCHEMA.TABLES
where ISNUMERIC(table_name) = 1 and CAST(table_name as int) <> 0
If the first part of this clause were executed first, then only numeric table names would be cast as integers. However, it fails, providing a clear example that SQL Server (as with other databases) does not care about the order of clauses in the WHERE statement.
ANSI SQL Draft 2003 5WD-01-Framework-2003-09.pdf
6.3.3.3 Rule evaluation order
...
Where the precedence is not determined by the Formats or by parentheses, effective evaluation of expressions is generally performed from left to right. However, it is implementation-dependent whether expressions are actually evaluated left to right, particularly when operands or operators might cause conditions to be raised or if the results of the expressions can be determined without completely evaluating all parts of the expression.
No; all RDBMSs first analyze the query and optimize it, reordering your where clause as needed.
Depending on which RDBMS you are using, you can display the result of that analysis (search for "explain plan" in Oracle, for instance).
It's true as far as it goes, assuming the names aren't indexed.
Different data would make it wrong, though. To find out which order to use, which could differ every time, the DBMS would have to run a distinct-count query for each column and compare the numbers; that would cost more than just shrugging and getting on with it.
Original OP statement:
My belief is the second one is faster because the more unique criterion (LastName) comes first in the where clause, and records will get eliminated more efficiently. I don't think the optimizer is smart enough to optimize the first SQL query.
I guess you are confusing this with choosing the order of columns when creating an index, where you have to put the most selective column first, then the second most selective, and so on.
BTW, for the above two queries the SQL Server optimizer will not do any optimization but will use a trivial plan, as long as the total cost of the plan is less than the cost threshold for parallelism.

How to make simple GROUP BY use index?

I want to get hourly average temperatures for a given table containing a thermometer's temperature readings, with row structure thermometer_id, timestamp (float, Julian days), value (float), plus an ascending index on timestamp.
To get the whole day 4 days ago, I'm using this query:
SELECT
ROUND(AVG(value), 2), -- average temperature
COUNT(*) -- count of readings
FROM reads
WHERE
timestamp >= (julianday(date('now')) - 5) -- between 5 days
AND
timestamp < (julianday(date('now')) - 4) -- ...and 4 days ago
GROUP BY CAST(timestamp * 24 as int) -- make hours from floats, group by hours
It works correctly, but very slowly: for a 9 MB database with 355k rows it takes more than half a second to finish, which is confusingly long; it shouldn't take more than a few tens of ms. That's on reasonably fast hardware (no SSD, though), and I'm preparing it to run on a Raspberry Pi, which is much slower by comparison; on top of that, the table will grow by about 80k rows per day.
Explain explains the reason:
"USE TEMP B-TREE FOR GROUP BY"
I've tried adding day and hour columns with indexes just for the sake of quick access, but still, group by didn't use any of the indexes.
How can I tune this query or database to make this query faster?
If an index is used to optimize the GROUP BY, the timestamp search can no longer be optimized (except by using the skip-scan optimization, which your old SQLite might not have). And going through all rows in reads, only to throw most of them away because of a non-matching timestamp, would not be efficient.
If SQLite doesn't automatically do the right thing, even after running ANALYZE, you can try to force it to use a specific index:
CREATE INDEX rhv ON reads(hour, value);
SELECT ... FROM reads INDEXED BY rhv WHERE timestamp ... GROUP BY hour;
But this is unlikely to result in a query plan that is actually faster.
As @colonel-thirty-two commented, the problem was the cast and multiplication in GROUP BY CAST(timestamp * 24 as int). Such grouping completely bypasses the index, hence the slow query time. When I used the hour column for both the time comparison and the grouping, the query finished immediately.
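The "USE TEMP B-TREE FOR GROUP BY" symptom and its fix can be reproduced directly in SQLite (schema mirrors the question; the rhv index name comes from the answer above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads "
             "(thermometer_id INTEGER, timestamp REAL, value REAL, hour INTEGER)")
conn.execute("CREATE INDEX rhv ON reads(hour, value)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows: (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Grouping on an expression: no index matches it, so SQLite has to sort
# the rows into a temporary B-tree before it can aggregate.
p_expr = plan("SELECT AVG(value) FROM reads GROUP BY CAST(timestamp * 24 AS INT)")

# Grouping on the indexed column: rows already arrive in group order via
# the index, so no temporary B-tree is needed.
p_col = plan("SELECT AVG(value) FROM reads GROUP BY hour")

print(any("TEMP B-TREE" in d for d in p_expr))  # True
print(any("TEMP B-TREE" in d for d in p_col))   # False
```

This is why precomputing the hour into its own indexed column made the query finish immediately.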

performance for sum oracle

I have to sum a huge amount of data with aggregation and a where clause, using the query below.
What I am doing is this: I have three tables; one contains terms, the second contains user terms, and the third contains the correlation factor between a term and a user term.
I want to calculate the similarity between the sentence the user inserted and the already existing sentences, keeping results greater than 0.5, by summing the correlation factors between the sentences' terms.
The problem is that this query takes more than 15 minutes because the tables are huge.
Any suggestions to improve performance?
insert into PLAG_SENTENCE_SIMILARITY
SELECT plag_TERMS.SENTENCE_ID,
       plag_User_TERMS.SENTENCE_ID,
       least(sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_terms.sentence_length,
             sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_user_terms.sentence_length),
       plag_TERMS.isn,
       plag_user_terms.isn
FROM plag_TERM_CORRELATIONS3,
     plag_TERMS,
     Plag_User_TERMS
WHERE Plag_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM1
  AND Plag_User_TERMS.TERM_ROOT = Plag_TERM_CORRELATIONS3.TERM2
  AND Plag_User_Terms.ISN = 123
GROUP BY plag_User_TERMS.SENTENCE_ID, plag_TERMS.SENTENCE_ID, plag_TERMS.isn,
         plag_terms.sentence_length, plag_user_terms.sentence_length, plag_user_terms.isn
HAVING least(sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_terms.sentence_length,
             sum(plag_TERM_CORRELATIONS3.CORRELATION_FACTOR) / plag_user_terms.sentence_length) > 0.5;
plag_terms contains more than 50 million records and plag_correlations3 contains 500000
If you have a sufficient amount of free disk space, then create a materialized view
over the join of the three tables
fast-refreshable on commit (don't use the ANSI join syntax here, even if tempted to do so, or the mview won't be fast-refreshable ... a strange bug in Oracle)
with query rewrite enabled
properly physically organized for quick calculations
The query rewrite is optional. If you can modify the above insert-select, then you can just select from the materialized view instead of selecting from the join of the three tables.
As for the physical organization, consider
hash partitioning by Plag_User_Terms.ISN (with a sufficiently high number of partitions; don't hesitate to partition your table with e.g. 1024 partitions, if it seems reasonable) if you want to do a bulk calculation over all values of ISN
single-table hash clustering by Plag_User_Terms.ISN if you want to retain your calculation over a single ISN
If you don't have a spare disk space, then just hint your query to
either use nested loops joins, since the number of rows processed seems to be quite low (assumed by the estimations in the execution plan)
or full-scan the plag_correlations3 table in parallel
Bottom line: constrain your tables with foreign keys, check constraints, not-null constraints, unique constraints, everything! The Oracle optimizer is capable of using most of this information to its advantage, as are the people who tune SQL queries.

How Does SELECT TOP N work in Sybase ASE

I was wondering how this query is executed:
SELECT TOP 10 * FROM aSybaseTable
WHERE aCondition
The fact is that this query is taking too much time to return results.
So I was wondering if the query is smart enough to stop when the results reach 10 rows, or if it computes all the possible results and then returns only the first 10 rows.
Thanks in advance for your replies!
When using select top N, the query is still executed fully; just the data-page reads stop after the specified number of rows is reached. All the index-page reads and sorts still have to occur, so depending on the complexity of the where condition or subqueries, it can definitely still take time to execute. select top N is functionally similar to using set rowcount.
Michael is right, but there is one special case here that really needs to be mentioned.
The query WILL execute faster, stopping early, if no order by or group by clauses are used.
But this case is rarely useful, since you will then get an arbitrary N rows that fulfill the condition, in the order they are physically located in the table/index.
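The same early-stop-versus-sort trade-off shows up with LIMIT in SQLite, which makes for a convenient illustration (table and data invented; Sybase's TOP behaves analogously):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows: (id, parent, notused, detail)
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Without ORDER BY, the scan can stop as soon as 10 rows are produced.
p_plain = plan("SELECT * FROM t WHERE x % 7 = 0 LIMIT 10")

# With ORDER BY on an unindexed column, every qualifying row must be
# found and sorted before the first of the "top 10" is known.
p_sorted = plan("SELECT * FROM t WHERE x % 7 = 0 ORDER BY x DESC LIMIT 10")

print(any("TEMP B-TREE FOR ORDER BY" in d for d in p_sorted))  # True
print(any("TEMP B-TREE" in d for d in p_plain))                # False
```

Which is exactly why the unordered TOP N case returns quickly but gives you an arbitrary set of rows.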
