I have a table with about 1.5 million rows and three columns. Column 'timestamp' is of type REAL and indexed. I am accessing the SQLite database via PHP PDO.
The following three selects run in less than a millisecond:
select timestamp from trades
select timestamp + 1 from trades
select max(timestamp) from trades
The following select needs almost half a second:
select max(timestamp) + 1 from trades
Why is that?
EDIT:
Lasse has asked for a "explain query plan", I have run this within a PHP PDO query since I have no direct SQLite3 command line tool access at the moment. I guess it does not matter, here is the result:
explain query plan select max(timestamp) + 1 from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SCAN TABLE trades (~1000000 rows)
explain query plan select max(timestamp) from trades:
[selectid] => 0
[order] => 0
[from] => 0
[detail] => SEARCH TABLE trades USING COVERING INDEX tradesTimestampIdx (~1 rows)
The reason this query
select max(timestamp) + 1 from trades
takes so long is that the query engine must, for each record, compute the MAX value and then add one to it. Computing the MAX value involves doing a full table scan, and this must be repeated for each record because you are adding one to the value.
In the query
select timestamp + 1 from trades
you are doing a calculation for each record, but the engine only needs to scan the entire table once. And in this query
select max(timestamp) from trades
the engine does have to scan the entire table, however it also does so only once.
From the SQLite documentation:
Queries that contain a single MIN() or MAX() aggregate function whose argument is the left-most column of an index might be satisfied by doing a single index lookup rather than by scanning the entire table.
I emphasized might from the documentation, because it appears that a full table scan may be necessary for a query of the form SELECT MAX(x)+1 FROM table
if column x be not the left-most column of an index.
Related
I got a oracle query that uses an in-clause with eight given values, like:
select * from mytable a
where a.wf_type in ('value1', 'value2', 'value3', 'value4', 'value5', 'value6', 'value7', 'value8');
The table is not really big (about 3 million rows) and the query does a full table scan.
Therefore I added an index for the wf_type attribute.
But the index is not used by the query with the in-clause. If I change the query to one specific value like
select * from mytable a where a.wf_type = 'value1';
the index is used and the query runs fast.
How do I fasten the query with the in-clause? Is it possible by using an index or are there other ways?
I'm trying to speed up the following
create table tab2 parallel 24 nologging compress for query high as
select /*+ parallel(24) index(a ix_1) index(b ix_2)*/
a.usr
,a.dtnum
,a.company
,count(distinct b.usr) as num
,count(distinct case when b.checked_1 = 1 then b.usr end) as num_che_1
,count(distinct case when b.checked_2 = 1 then b.usr end) as num_che_2
from tab a
join tab b on a.company = b.company
and b.dtnum between a.dtnum-1 and a.dtnum-0.0000000001
group by a.usr, a.dtnum, a.company;
by using indexes
create index ix_1 on tab(usr, dtnum, company);
create index ix_2 on tab(usr, company, dtnum, checked_1, checked_2);
but the execution plan tells me that it's going to be an index full scan for both indexes, and the calculations are very long (1 day is not enough).
About the data. Table tab has over 3 mln records. None of the single columns are unique. The unique values here are pairs of (usr, dtnum), where dtnum is a date with time written as a number in the format yyyy,mmddhh24miss. Columns checked_1, checked_2 have values from set (null, 0, 1, 2). Company holds an id for a company.
Each pair can only have one value checked_1, checked_2 and company as it is unique. Each user can be in multple pairs with different dtnum.
Edit
#Roberto Hernandez: I've attached the picture with the execution plan. As for parallel 24, in our company we are told to create tables with options 'parallel [num] nologging compress for query high'. I'm using 24 but I'm no expert in this field.
#Sayan Malakshinov: http://sqlfiddle.com/#!4/40b6b/2 Here I've simplified by giving data with checked_1 = checked_2, but in real life this may not be true.
#scaisEdge:
For
create index my_id1 on tab (company, dtnum);
create index my_id2 on tab (company, dtnum, usr);
I get
For table tab Your join condition is based on columns
company, datun
so you index should be primarly based on these columns
create index my_id1 on tab (company, datum);
The indexes you are using are useless because don't contain in left most position columsn use ij join /where condition
Eventually you can add user right most potition for avoid the needs of table access and let the db engine retrive alla the inf inside the index values
create index my_id1 on tab (company, datum, user, checked_1, checked_2);
Indexes (bitmap or otherwise) are not that useful for this execution. If you look at the execution plan, the optimizer thinks the group-by is going to reduce the output to 1 row. This results in serialization (PX SELECTOR) So I would question the quality of your statistics. What you may need is to create a column group on the three group-by columns, to improve the cardinality estimate of the group by.
How to set range for limit clause in hive , I have tried the below query but failed with syntax error . Can someone please help
select * from table limit 1000,2000;
You can use Row_Number window function and set the range limit.
Below Query will result only the first 20 records from the table
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 0 and rowid <=20;
Using Between operator to specify range
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid between 0 and 20;
To fetch rows from 20 to 40 then increase the lower/upper bound values
hive> select * from
(
SELECT *,ROW_NUMBER() over (Order by id) as rowid FROM <tab_name>
)t
where rowid > 20 and rowid <=40;
The LIMIT clause is used to set a ceiling on the number of rows in the result set. You are getting a syntax error because of an incorrect usage of this HQL clause.
The query could be written as the following to return no more than 2000 rows:
SELECT * FROM table LIMIT 2000;
You could also write it like so to return no more than 1000 rows:
SELECT * FROM table LIMIT 1000;
However you cannot combine both into the same argument for LIMIT. The LIMIT argument must evaluate to a constant value.
I will try and expand on this information a bit to try and help solve your problem. If you are attempting to "paginate" your results the following may be of use.
FIRST I would recommend against leaning on HQL for pagination, in most situations that would be more efficiently implemented on the application logic side (query large result set, cache what you need, paginate with application logic). If you have no choice but to pull out ranges of rows you can get the desired effect through a combination of the LIMIT, ORDER BY, and OFFSET clauses.
LIMIT : This will limit your result set to a maximum number of row
ORDER BY: This will sort/order your result set based on one or more columns
OFFSET: This will start your result set at a certain row after the logical first entry in the table.
You may combine these three clauses to effectively query "pages" of your table. For example the following three queries show how to get the first 3 blocks of data from a table where each block contains 1000 rows and the target table's 'column1' is used to determine logical order.
SELECT title as "Page 1", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 0;
SELECT title as "Page 2", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 1000;
SELECT title as "Page 3", column1, column2, ... FROM table
ORDER BY column1 LIMIT 1000 OFFSET 2000;
Each query declares 'column1' as the sorting value with ORDER BY. The queries will return no more than 1000 rows due to the LIMIT clause. Each result set will start at a different row due to the OFFSET being incremented by the "page size" for each query.
I am not sure what you are trying to achieve, but ...
That will return the 1001 and the 2001 record in the query results set only if you are using hive a hive version greater than 2.0.0
hive --version
(https://issues.apache.org/jira/browse/HIVE-11531)
Limit in Hive gives 'n' number of records randomly. It's not to print a range of records.
You may use order by in conjunction with limit to get what you want
Have the following tables (Oracle 10g):
catalog (
id NUMBER PRIMARY KEY,
name VARCHAR2(255),
owner NUMBER,
root NUMBER REFERENCES catalog(id)
...
)
university (
id NUMBER PRIMARY KEY,
...
)
securitygroup (
id NUMBER PRIMARY KEY
...
)
catalog_securitygroup (
catalog REFERENCES catalog(id),
securitygroup REFERENCES securitygroup(id)
)
catalog_university (
catalog REFERENCES catalog(id),
university REFERENCES university(id)
)
Catalog: 500 000 rows, catalog_university: 500 000, catalog_securitygroup: 1 500 000.
I need to select any 50 rows from catalog with specified root ordered by name for current university and current securitygroup. There is a query:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root = 100
AND cs.catalog = c.id
AND cs.securitygroup = 200
AND cu.catalog = c.id
AND cu.university = 300
ORDER BY name
) cc
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
Where 100 - some catalog, 200 - some securitygroup, 300 - some university. This query return 50 rows from ~ 170 000 in 3 minutes.
But next query return this rows in 2 sec:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c
WHERE c.root = 100
ORDER BY name
) cc
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
I build next indexes: (catalog.id, catalog.name, catalog.owner), (catalog_securitygroup.catalog, catalog_securitygroup.index), (catalog_university.catalog, catalog_university.university).
Plan for first query (using PLSQL Developer):
http://habreffect.ru/66c/f25faa5f8/plan2.jpg
Plan for second query:
http://habreffect.ru/f91/86e780cc7/plan1.jpg
What are the ways to optimize the query I have?
The indexes that can be useful and should be considered deal with
WHERE c.root = 100
AND cs.catalog = c.id
AND cs.securitygroup = 200
AND cu.catalog = c.id
AND cu.university = 300
So the following fields can be interesting for indexes
c: id, root
cs: catalog, securitygroup
cu: catalog, university
So, try creating
(catalog_securitygroup.catalog, catalog_securitygroup.securitygroup)
and
(catalog_university.catalog, catalog_university.university)
EDIT:
I missed the ORDER BY - these fields should also be considered, so
(catalog.name, catalog.id)
might be beneficial (or some other composite index that could be used for sorting and the conditions - possibly (catalog.root, catalog.name, catalog.id))
EDIT2
Although another question is accepted I'll provide some more food for thought.
I have created some test data and run some benchmarks.
The test cases are minimal in terms of record width (in catalog_securitygroup and catalog_university the primary keys are (catalog, securitygroup) and (catalog, university)). Here is the number of records per table:
test=# SELECT (SELECT COUNT(*) FROM catalog), (SELECT COUNT(*) FROM catalog_securitygroup), (SELECT COUNT(*) FROM catalog_university);
?column? | ?column? | ?column?
----------+----------+----------
500000 | 1497501 | 500000
(1 row)
Database is postgres 8.4, default ubuntu install, hardware i5, 4GRAM
First I rewrote the query to
SELECT c.id, c.name, c.owner
FROM catalog c, catalog_securitygroup cs, catalog_university cu
WHERE c.root < 50
AND cs.catalog = c.id
AND cu.catalog = c.id
AND cs.securitygroup < 200
AND cu.university < 200
ORDER BY c.name
LIMIT 50 OFFSET 100
note: the conditions are turned into less then to maintain comparable number of intermediate rows (the above query would return 198,801 rows without the LIMIT clause)
If run as above, without any extra indexes (save for PKs and foreign keys) it runs in 556 ms on a cold database (this is actually indication that I oversimplified the sample data somehow - I would be happier if I had 2-4s here without resorting to less then operators)
This bring me to my point - any straight query that only joins and filters (certain number of tables) and returns only a certain number of the records should run under 1s on any decent database without need to use cursors or to denormalize data (one of these days I'll have to write a post on that).
Furthermore, if a query is returning only 50 rows and does simple equality joins and restrictive equality conditions it should run even much faster.
Now let's see if I add some indexes, the biggest potential in queries like this is usually the sort order, so let me try that:
CREATE INDEX test1 ON catalog (name, id);
This makes execution time on the query - 22ms on a cold database.
And that's the point - if you are trying to get only a page of data, you should only get a page of data and execution times of queries such as this on normalized data with proper indexes should take less then 100ms on decent hardware.
I hope I didn't oversimplify the case to the point of no comparison (as I stated before some simplification is present as I don't know the cardinality of relationships between catalog and the many-to-many tables).
So, the conclusion is
if I were you I would not stop tweaking indexes (and the SQL) until I get the performance of the query to go below 200ms as rule of the thumb.
only if I would find an objective explanation why it can't go below such value I would resort to denormalisation and/or cursors, etc...
First I assume that your University and SecurityGroup tables are rather small. You posted the size of the large tables but it's really the other sizes that are part of the problem
Your problem is from the fact that you can't join the smallest tables first. Your join order should be from small to large. But because your mapping tables don't include a securitygroup-to-university table, you can't join the smallest ones first. So you wind up starting with one or the other, to a big table, to another big table and then with that large intermediate result you have to go to a small table.
If you always have current_univ and current_secgrp and root as inputs you want to use them to filter as soon as possible. The only way to do that is to change your schema some. In fact, you can leave the existing tables in place if you have to but you'll be adding to the space with this suggestion.
You've normalized the data very well. That's great for speed of update... not so great for querying. We denormalize to speed querying (that's the whole reason for datawarehouses (ok that and history)). Build a single mapping table with the following columns.
Univ_id, SecGrp_ID, Root, catalog_id. Make it an index organized table of the first 3 columns as pk.
Now when you query that index with all three PK values, you'll finish that index scan with a complete list of allowable catalog Id, now it's just a single join to the cat table to get the cat item details and you're off an running.
The Oracle cost-based optimizer makes use of all the information that it has to decide what the best access paths are for the data and what the least costly methods are for getting that data. So below are some random points related to your question.
The first three tables that you've listed all have primary keys. Do the other tables (catalog_university and catalog_securitygroup) also have primary keys on them?? A primary key defines a column or set of columns that are non-null and unique and are very important in a relational database.
Oracle generally enforces a primary key by generating a unique index on the given columns. The Oracle optimizer is more likely to make use of a unique index if it available as it is more likely to be more selective.
If possible an index that contains unique values should be defined as unique (CREATE UNIQUE INDEX...) and this will provide the optimizer with more information.
The additional indexes that you have provided are no more selective than the existing indexes. For example, the index on (catalog.id, catalog.name, catalog.owner) is unique but is less useful than the existing primary key index on (catalog.id). If a query is written to select on the catalog.name column, it is possible to do and index skip scan but this starts being costly (and most not even be possible in this case).
Since you are trying to select based in the catalog.root column, it might be worth adding an index on that column. This would mean that it could quickly find the relevant rows from the catalog table. The timing for the second query could be a bit misleading. It might be taking 2 seconds to find 50 matching rows from catalog, but these could easily be the first 50 rows from the catalog table..... finding 50 that match all your conditions might take longer, and not just because you need to join to other tables to get them. I would always use create table as select without restricting on rownum when trying to performance tune. With a complex query I would generally care about how long it take to get all the rows back... and a simple select with rownum can be misleading
Everything about Oracle performance tuning is about providing the optimizer enough information and the right tools (indexes, constraints, etc) to do its job properly. For this reason it's important to get optimizer statistics using something like DBMS_STATS.GATHER_TABLE_STATS(). Indexes should have stats gathered automatically in Oracle 10g or later.
Somehow this grew into quite a long answer about the Oracle optimizer. Hopefully some of it answers your question. Here is a summary of what is said above:
Give the optimizer as much information as possible, e.g if index is unique then declare it as such.
Add indexes on your access paths
Find the correct times for queries without limiting by rowwnum. It will always be quicker to find the first 50 M&Ms in a jar than finding the first 50 red M&Ms
Gather optimizer stats
Add unique/primary keys on all tables where they exist.
The use of rownum is wrong and causes all the rows to be processed. It will process all the rows, assigned them all a row number, and then find those between 0 and 50. When you want to look for in the explain plan is COUNT STOPKEY rather than just count
The query below should be an improvement as it will only get the first 50 rows... but there is still the issue of the joins to look at too:
SELECT ccc.* FROM (
SELECT cc.*, ROWNUM AS n FROM (
SELECT c.id, c.name, c.owner
FROM catalog c
WHERE c.root = 100
ORDER BY name
) cc
where rownum <= 50
) ccc WHERE ccc.n > 0 AND ccc.n <= 50;
Also, assuming this for a web page or something similar, maybe there is a better way to handle this than just running the query again to get the data for the next page.
try to declare a cursor. I dont know oracle, but in SqlServer would look like this:
declare #result
table (
id numeric,
name varchar(255)
);
declare __dyn_select_cursor cursor LOCAL SCROLL DYNAMIC for
--Select
select distinct
c.id, c.name
From [catalog] c
inner join university u
on u.catalog = c.id
and u.university = 300
inner join catalog_securitygroup s
on s.catalog = c.id
and s.securitygroup = 200
Where
c.root = 100
Order by name
--Cursor
declare #id numeric;
declare #name varchar(255);
open __dyn_select_cursor;
fetch relative 1 from __dyn_select_cursor into #id,#name declare #maxrowscount int
set #maxrowscount = 50
while (##fetch_status = 0 and #maxrowscount <> 0)
begin
insert into #result values (#id, #name);
set #maxrowscount = #maxrowscount - 1;
fetch next from __dyn_select_cursor into #id, #name;
end
close __dyn_select_cursor;
deallocate __dyn_select_cursor;
--Select temp, final result
select
id,
name
from #result;
Database
Table1
Id
Table2Id
...
Table2
Id
StartTime
Duration //in hours
Query
select * from Table1 join Table2 on Table2Id = Table2.Id
where starttime < :starttime and starttime + Duration/24 > :endtime
This query is currently taking about 2 seconds to run which is too long. There is an index on the id columns and a function index on Start_time+duration/24 In Sql Developer the query plan shows no indexes being used. The query returns 475 rows for my test start and end times. Table2 has ~800k rows Table1 has ~200k rows
If the duration/24 calculation is removed from the query, replaced with a static value the query time is reduced by half. This does not retrieve the exact same data, but leads me to believe that the division is expensive.
I have also tested adding an endtime column to Table2 that is populated with (starttime + duration/24) The column was prepopulated via a single update, if it would be used in production I would populate it via an update trigger.
select * from Table1 join Table2 on Table2Id = Table2.Id
where starttime < :starttime and endtime > :endtime
This query will run in about 600ms and it uses an index for the join. It is less then ideal because of the additional column with redundant data.
Are there any methods of making this query faster?
Create a function index on both starttime and the expression starttime + Duration/24:
create index myindex on table2(starttime, starttime + Duration / 24);
A compound index on the entire predicate of your query should be selected, whereas individually indexed the optimizer is likely deciding that repeated table accesses by rowid based on a scan of one of those indexes is actually slower than a full table scan.
Also make sure that you're not doing an implicit conversion from varchar to date, by ensuring that you're passing DATEs in your bind variables.
Try lowering the optimizer_index_cost_adj system parameter. I believe the default is 100. Try setting that to 10 and see if your index is selected.
Consider partitioning the table by starttime.
You have two criteria with range predicates (greater than/less than). An index range scan can start at one point in the index and end at another.
For a compound index on starttime and "Starttime+duration/24", since the leading column is starttime and the predicate is "less than bind value", it will start at the left most edge of the index (earliest starttime) and range scan all rows up to the point where the starttime reaches the limit. For each of those matches, it can evaluate the calculated value for "Starttime+duration/24" on the index against the bind value and pass or reject the row. I'd suspect most of the data in the table is old, so most entries have an old starttime and you'd end up scanning most of the index.
For a compound index on "Starttime+duration/24" and starttime, since the leading column is the function and the predicate is "greater than bindvalue", it will start partway through the index and work its way to the end. For each of those matches, it can evaluate the starttime on the index against the bind value and pass or reject the row. If the enddate passed in is recent, I suspect this would actually involve a much smaller amount of the index being scanned.
Even without the starttime as a second column on the index, the existing function based index on "Starttime+duration/24" should still be useful and used. Check the explain plan to make sure the bindvalue is either a date or converted to a date. If it is converted, make sure the appropriate format mask is used (eg an entered value of '1/Jun/09' may be converted to year 0009, so Oracle will see the condition as very relaxed and would tend not to use the index - plus the result could be wrong).
"In Sql Developer the query plan shows no indexes being used. " If the index wasn't being used to find the table2 rows, I suspect the optimizer thought most/all of table2 would be returned [which it obviously isn't, by your numbers]. I'd guess that it though most of table1 would be returned, and thus neither of your predicates did a lot of filtering. As I said above, I think the "less than" predicate isn't selective, but the "greater than" should be. Look at the explain plan, especially the ROWS value, to see what Oracle thinks
PS.
Adjusting the value means the optimizer changes the basis for its estimates. If a journey planner says you'll take six hours for a trip because it assumes an average speed of 50, if you tell it to assume an average of 100 it will comes out with three hours. it won't actually affect the speed you travel at, or how long it takes to actually make the journey.
So you only want to change that value to make it more accurately reflect the actual value for your database (or session).
Oracle would not use indexes if the selectivity of the where clause is not very good. Index would be used if the number of rows returned would be some percentage of the total number of rows in the table (the percentage varies, since oracle will count the cost of reading the index as well as reading the tables).
Also, when the index columns are modified in where clause, the index would get disabled. For example, UPPERCASE(some_index_column), would disable the usage of the index on some_index_column. This is why starttime + Duration/24 > :endtime does not use the Index.
Can you try this
select * from Table1 join Table2 on Table1.Id = Table2.Table1Id
where starttime < :starttime and starttime > :endtime - Duration/24
This should allow the use of the Index and there is no need for an additional column.