Improve SQLite anti-join performance

See the update at the bottom of this question: the cause of the unexpected variance in query times noted below has been identified as a sqliteman quirk.
I have the following two tables in a SQLite DB. (The structure might seem pointless, I know, but bear with me.)
+-----------------------+
| source |
+-----------------------+
| item_id | time | data |
+-----------------------+
+----------------+
| target |
+----------------+
| item_id | time |
+----------------+
--Both tables have a multi column index on item_id and time
The source table contains around 500,000 rows. There will never be more than one matching record in the target table, and in practice it is likely that almost all source rows will have a matching target row.
I am attempting to perform a fairly standard anti-join to find all records in source without corresponding rows in target, but am finding it difficult to create a query with an acceptable execution time.
The query I am using is:
SELECT
source.item_id,
source.time,
source.data
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL;
Just the LEFT JOIN without the WHERE clause takes around 200ms to complete; with the WHERE clause this increases to 5000ms.
While I originally noticed the slow query from within my consuming application, the timings above were obtained by executing the statements directly from within sqliteman.
Is there a particular reason why this seemingly simple clause so dramatically increases execution time and is there some way I can restructure this query to improve it?
I have also tried the following with the same result. (I imagine the underlying query plan is the same)
SELECT
source.item_id,
source.time,
source.data
FROM source
WHERE NOT EXISTS (
SELECT 1 FROM target
WHERE target.item_id = source.item_id
AND target.time = source.time
);
Thanks very much!
Update
Terribly sorry, it turns out that these apparent results are actually due to a quirk with sqliteman.
It seems sqliteman arbitrarily limits the number of rows returned to 256, and loads more dynamically as you scroll through them. This makes a query over a large dataset appear much quicker than it actually is, so sqliteman is a poor choice for estimating query performance.
Nonetheless, is there any obvious way to improve the performance of this query, or am I simply hitting the limits of what SQLite is capable of?

This is the query plan of your query (either one):
0|0|0|SCAN TABLE source
0|1|1|SEARCH TABLE target USING COVERING INDEX ti (item_id=? AND time=?)
This is pretty much as efficient as possible:
Every row in source must be checked, by
searching for a matching row in target.
It might be possible to make one little improvement.
The source rows are probably not ordered, so the target search will do a lookup at a random position in the index.
If we can force the source scan to be in index order, the target lookups will be in order too, which makes it more likely for these index pages to already be in the cache.
SQLite will use the source index if we do not use any columns not in the index, i.e., if we drop the data column:
> EXPLAIN QUERY PLAN
SELECT source.item_id, source.time
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL;
0|0|0|SCAN TABLE source USING COVERING INDEX si
0|1|1|SEARCH TABLE target USING COVERING INDEX ti (item_id=? AND time=?)
This might not help much.
But if it helps, and if you want the other columns in source, you can do this by doing the join first, and then looking up the source rows by their rowid (the extra lookup should not hurt if you have very few results):
SELECT *
FROM source
WHERE rowid IN (SELECT source.rowid
FROM source
LEFT JOIN target USING (item_id, time)
WHERE target.item_id IS NULL)
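To check the rowid variant against the plain anti-join, something like the following sketch can be used. It relies on Python's standard sqlite3 module; the index names si and ti are taken from the plans above, while the sample data (1000 rows, every 100th one missing from target) is made up for illustration. Fetching every row matters here, because a tool that only loads the first page of results will under-report the query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (item_id INTEGER, time INTEGER, data TEXT);
    CREATE TABLE target (item_id INTEGER, time INTEGER);
    CREATE INDEX si ON source (item_id, time);
    CREATE INDEX ti ON target (item_id, time);
""")

# Hypothetical sample data: 1000 source rows; every 100th has no target row.
rows = [(i % 50, i, f"payload-{i}") for i in range(1000)]
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", rows)
conn.executemany(
    "INSERT INTO target VALUES (?, ?)",
    [(item_id, t) for item_id, t, _ in rows if t % 100 != 0],
)

direct = """
    SELECT source.item_id, source.time, source.data
    FROM source
    LEFT JOIN target USING (item_id, time)
    WHERE target.item_id IS NULL
"""
via_rowid = """
    SELECT item_id, time, data
    FROM source
    WHERE rowid IN (SELECT source.rowid
                    FROM source
                    LEFT JOIN target USING (item_id, time)
                    WHERE target.item_id IS NULL)
"""

# fetchall() forces the whole result set to be produced.
direct_rows = sorted(conn.execute(direct).fetchall())
rowid_rows = sorted(conn.execute(via_rowid).fetchall())

# Print the plan SQLite chooses for each form; the exact wording
# of the plan text depends on the SQLite version.
for label, sql in (("direct", direct), ("via rowid", via_rowid)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", step[-1])
```

Both queries should return the same rows; whether the second one is actually faster on the real 500,000-row table has to be measured there.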

Related

Oracle compressed/b-tree index how and when to use

I would like to add a compressed index to the Oracle Applications workflow table hr.pqh_ss_transaction_history in order to access specific types of workflows (process_name) and workflows for specific people (selected_person_id).
There are lots of repeating values in process_name although the data is skewed. I would however want to access the TFG_HR_NEW_HIRE_PLACE_JSP_PRC and TFG_HR_TERMINATION_JSP_PRC process types.
"PROCESS_NAME","CNT"
"HR_GENERIC_APPROVAL_PRC",40347
"HR_PERSONAL_INFO_JSP_PRC",39284
"TFG_HR_NEW_HIRE_PLACE_JSP_PRC",18117
"TFG_HREMPSTS_TERMS_CHG_JSP_PRC",14076
"TFG_HR_TERMINATION_JSP_PRC",8764
"HR_ADV_INDIVIDUAL_COMP_PRC",4907
"TFG_HR_SIT_NOAPP",3979
"TFG_YE_TAX_PROV",2663
"HR_TERMINATION_JSP_PRC",1310
"HR_CHANGE_PAY_JSP_PRC",953
"TFG_HR_SIT_EXIT_JSP_PRC",797
"HR_SIT_JSP_PRC",630
"HR_QUALIFICATION_JSP_PRC",282
"HR_CAED_JSP_PRC",250
"TFG_HR_EMP_TERM_JSP_PRC",211
"PER_DOR_JSP_PRC",174
"HR_AWARD_JSP_PRC",101
"TFG_HR_SIT_REP_MOT",32
"TFG_HR_SIT_NEWPOS_NIB_JSP_PRC",30
"TFG_HR_SIT_NEWPOS_INBU_JSP_PRC",28
"HR_NEW_HIRE_PLACE_JSP_PRC",22
"HR_NEWHIRE_JSP_PRC",6
selected_person_id would obviously be more selective. Unfortunately there are 3774 nulls for this column and the highest count after that is 73 for one person. A lot of people would only have 1 row. The total row count is 136963.
My query would be in this format:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, :p_person_id) = :p_person_id
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I am on Oracle 12c release 1.
I assume it would be a good idea to put a non-compressed b-tree index on selected_person_id since the values returned would fall in the less than 5% of the total rows scenario, but how do you handle the nulls in the column which would not go into the index when you select using nvl(psth.selected_person_id, :p_person_id) = :p_person_id? Is there a more efficient way to write the sql and how should you create this index?
For process_name I would like to use a compressed b-tree index. I am assuming that the statement is
CREATE INDEX idxname ON pqh_ss_transaction_history(process_name) COMPRESS
where there would be an implicit second column for rowid. Is it safe for it to use rowid here, since normally it is not advised to use rowid? Is the skewed data an issue (most of the time I would be selecting on the high volume side)? I don't understand how compressed indexes would be efficient. For b-tree indexes you would normally want to return 5% of the data, otherwise a full table scan is actually more efficient. How does the compressed index return so many rowids and then do lookup into the full table using those rowids, faster than a full table scan?
Or since the optimizer will only be able to use one of the two indexes should I rather create an uncompressed function based index with selected_person_id and process_name concatenated?
Perhaps you could create this index:
CREATE INDEX idxname ON pqh_ss_transaction_history
(process_name, NVL(selected_person_id,-1)) COMPRESS 1
Then change your query to:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id,-1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date

Adding Index To A Column Having Flag Values

I am a novice at tuning Oracle queries and need help.
If I have a sql query like:
select a.ID,a.name.....
from a,b,c
where a.id=b.id
and ....
and b.flag='Y';
will adding an index to the FLAG column of table b help to tune the query in any way? The FLAG column has only two values, Y and N.
With a standard btree index, the SQL engine can quickly find the row or rows in the index for the specified value because the index is a sorted, balanced tree, then use the physical address (the rowid) stored in the index to access the desired row in a second hop. It's like looking in the index of a book to find the page number. So that is:
Go to index with the key value you want to look up.
The index tells you the physical address in the table.
Go straight to that physical address.
That is nice and quick for something like a unique customer ID. It's still OK for something nonunique, like a customer ID in a table of orders, although the database has to go through the index entries and for each one go to the indicated address. That can still be faster than slogging through the entire table from top to bottom.
But for a column with only two distinct values, you can see that it is going to be more work going through all of the index entries for 'Y' for example, and for each one going to the indicated location in the table, than it would be to just forget the index and scan the whole table in one shot.
That's unless the values are unevenly distributed. If there are a million Y rows and ten N rows then an index will help you find those N rows fast but be no use for Y.
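One way to see which case applies is to measure how the flag values are distributed. The sketch below uses Python's sqlite3 module purely as a convenient stand-in (the same GROUP BY query works on Oracle); the table b and column flag come from the question, and the 10,000 / 10 split is invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, flag TEXT)")

# Hypothetical skewed data: 10,000 'Y' rows and 10 'N' rows.
conn.executemany(
    "INSERT INTO b (flag) VALUES (?)",
    [("Y",)] * 10_000 + [("N",)] * 10,
)

total = conn.execute("SELECT COUNT(*) FROM b").fetchone()[0]

# Fraction of the table each flag value matches.
selectivity = {
    flag: count / total
    for flag, count in conn.execute("SELECT flag, COUNT(*) FROM b GROUP BY flag")
}

for flag, fraction in sorted(selectivity.items()):
    print(f"flag={flag!r}: {fraction:.2%} of rows")
```

A value that matches only a tiny fraction of the table (here 'N') is the kind of lookup an index can speed up; a value that matches nearly every row (here 'Y') is not.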
Adding an index to a column with only 2 values normally isn't very useful, because Oracle might just as well do a full table scan.
From your query it looks like it would be more useful to have an index on id, because that would help with the join a.id=b.id.
If you really want to get into tuning then learn to use "explain plan", as that will give you some indication of how much work Oracle needs to do for a query. Add (or remove) an index, then rerun the explain plan.

Oracle database help optimizing LIKE searches

I am on Oracle 11g and we have these 3 core tables:
Customer - CUSTOMERID|DOB
CustomerName - CUSTOMERNAMEID|CustomerID|FNAME|LNAME
Address - ADDRESSID|CUSTOMERID|STREET|CITY|STATE|POSTALCODE
I have about 60 million rows on each of the tables and the data is a mix of US and Canadian population.
I have a front-end application that calls a web service and they do a last name and partial zip search. So my query basically has
where CUSTOMERNAME.LNAME = ? and ADDRESS.POSTALCODE LIKE '?%'
They typically provide the first 3 digits of the zip.
The address table has an index on all street/city/state/zip and another one on state and zip.
I did try adding an index exclusively for the zip and forced oracle to use that index on my query but that didn't make any difference.
For returning about 100 rows (I have pagination to only return 100 at a time) it takes about 30 seconds which isn't ideal. What can I do to make this better?
The problem is that the filters you are applying are not very selective and they apply to different tables. This is bad for an old-fashioned btree index. If the content is very static you could try bitmap indexes. More precisely, a function-based bitmap join index on the first three letters of the last name and a bitmap join index on the postal code column. This assumes that very few people whose last name starts with certain letters live in an area with a certain postal code.
CREATE BITMAP INDEX ix_customer_custname ON customer(SUBSTR(cn.lname,1,3))
FROM customer c, customername cn
WHERE c.customerid = cn.customerid;
CREATE BITMAP INDEX ix_customer_postalcode ON customer(SUBSTR(a.postalcode,1,3))
FROM customer c, address a
WHERE c.customerid = a.customerid;
If you are successful you should see the two bitmap indexes becoming AND connected. The execution time should drop to a couple of seconds. It will not be as fast as a btree index.
Remarks:
You may have to experiment a bit to find out whether it is more efficient to create one or two indexes and whether the functions are useful.
If you decide to do it function based you should include the exact same function calls in the where clause of your query. Otherwise the index will not be used.
DML operations will be considerably slower. This is only useful for tables with static data. Note that DML operations will block whole row "ranges". Concurrent DML operations will run into problems.
Response time will probably still be seconds, not instantaneous as with a BTREE index.
AFAIK this will work only on the enterprise edition. The syntax is untested because I do not have an enterprise db available at the moment.
If this is still not fast enough you can create a materialized view with customerid, last name and postal code and put a btree index on it. But that is kind of expensive, too.

Speeding up a postgres query (which works on 2 tables)

I am doing, in postgresql, something like this:
select A.first,
count(B.second) as count,
array_agg(A.second) as second,
array_agg(A.third) as third,
array_agg(B.kids) as kids
from A join B on A.first=B.second
group by A.first;
And it's taking forever (also because the tables are pretty big). Limiting the output to 10 rows and looking at it with explain analyze told me there's a huge nested loop which takes most of the time.
Is there any way in which I can write this query (which I'll then use in CREATE TABLE AS to create a new table) to speed it up, while conserving the same output, which is what I want?
Thanks!
Ensure the column being used as a foreign key is indexed:
create index b_second on b(second);
Without such an index, every row of a would cause a table scan of b, which would make your query crawl.

Postgres optimize UPDATE

I have to do a somewhat complicated data import. I need to run a number of UPDATEs, each of which currently updates over 3 million rows in one query. Each query takes about 30-45 seconds (some of them even 4-5 minutes). My question is whether I can speed this up. Where can I read about which kinds of indexes, on which columns, would improve these updates? I don't need an exact answer, so I am not showing the tables. I am looking for material to learn from.
Two things:
1) Post an EXPLAIN ANALYZE of your UPDATE query.
2) If your UPDATE does not need to be atomic, then you may want to consider breaking apart the number of rows affected by your UPDATE. To minimize the number of "lost rows" due to exceeding the Free Space Map, consider the following approach:
1. BEGIN
2. UPDATE ... LIMIT N; or some predicate that would limit the number of rows (e.g. WHERE username ilike 'a%';).
3. COMMIT
4. VACUUM table_being_updated
5. Repeat steps 1-4 until all rows are updated.
6. ANALYZE table_being_updated
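The loop described in these steps can be sketched roughly as follows. This is an illustration only: it uses Python's sqlite3 module so that it runs anywhere, the table t, its columns, and BATCH_SIZE are hypothetical, and batching by primary-key range is just one possible "predicate that limits the number of rows". Against PostgreSQL the same loop would go through a PostgreSQL driver, with the VACUUM issued between transactions.

```python
import sqlite3

BATCH_SIZE = 1000  # hypothetical batch size; tune for the real table

# isolation_level=None puts the connection in autocommit mode,
# so the explicit BEGIN/COMMIT below delimit each batch.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER)")
conn.executemany(
    "INSERT INTO t (id, a) VALUES (?, 0)",
    [(i,) for i in range(1, 5001)],
)

max_id = conn.execute("SELECT MAX(id) FROM t").fetchone()[0]
batches = 0

for start in range(1, max_id + 1, BATCH_SIZE):
    conn.execute("BEGIN")
    # Restrict each UPDATE to one slice of the key range.
    conn.execute(
        "UPDATE t SET a = 1 WHERE id >= ? AND id < ?",
        (start, start + BATCH_SIZE),
    )
    conn.execute("COMMIT")
    batches += 1
    # On PostgreSQL, "VACUUM table_being_updated" would run here,
    # outside any transaction, before the next batch starts.

remaining = conn.execute("SELECT COUNT(*) FROM t WHERE a <> 1").fetchone()[0]
print(f"{batches} batches committed, {remaining} rows left to update")
```

Each batch is its own transaction, so the update as a whole is not atomic; that is the trade-off the approach accepts.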
I suspect you're updating every row in your table and don't need all rows to be visible with the new value at the end of a single transaction, so breaking the UPDATE up into smaller transactions as described above should work well.
And yes, an INDEX on the relevant columns specified in the UPDATE's predicate will help dramatically. Again, post an EXPLAIN ANALYZE if you need further assistance.
If by a number of UPDATEs you mean one UPDATE command per updated row, then the problem is that all of the target table's indexes are updated and all constraints are checked for each updated row. If that is the case, try instead to update all rows with a single UPDATE:
update t
set a = t2.b
from t2
where t.id = t2.id
If the imported data is in a text file then insert it in a temp table first and update from there. See my answer here
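A rough sketch of the staging-table idea, again using Python's sqlite3 module as a runnable stand-in. The tables t and t2 and the columns a and b follow the UPDATE above; the CSV contents are invented. The set-based update is written as a correlated subquery here so that it also runs on SQLite builds without UPDATE ... FROM; on PostgreSQL the UPDATE ... FROM form shown above is the natural choice, and COPY is the usual way to load the file.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "old"), (2, "old"), (3, "old")])

# Hypothetical contents of the import file.
import_file = io.StringIO("id,b\n1,new-1\n3,new-3\n")

# Step 1: load the file into a temporary staging table.
conn.execute("CREATE TEMP TABLE t2 (id INTEGER PRIMARY KEY, b TEXT)")
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [(int(row["id"]), row["b"]) for row in csv.DictReader(import_file)],
)

# Step 2: one set-based UPDATE instead of one statement per row.
conn.execute("""
    UPDATE t
    SET a = (SELECT b FROM t2 WHERE t2.id = t.id)
    WHERE id IN (SELECT id FROM t2)
""")
conn.commit()

result = conn.execute("SELECT id, a FROM t ORDER BY id").fetchall()
print(result)
```

Rows without a match in the staging table (id 2 here) are left untouched.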

Resources