Why does the same select statement have different costs in Oracle? - oracle

Recently I used Oracle 11g database to do my homework. I had 12 tables, like trip_data_11 and trip_data_12.
They have same structure and the number of records is almost the same. I created the same indexes on each table.
So for trip_data_11 table:
create index pick_add_11 on trip_data_11(pickup_longitude,pickup_latitude);
create index drop_add_11 on trip_data_11(dropoff_longitude,dropoff_latitude);
The same operation to trip_data_12.
Then I used the following select statement to select the taxi numbers per day.
SELECT
COUNT(DISTINCT(td.medallion)) AS taxi_num
FROM
SYS.TRIP_DATA_11 td
WHERE
(td.pickup_longitude >= -74.2593 AND td.pickup_longitude <= -73.7011
AND td.pickup_latitude >= 40.4770 AND td.pickup_latitude <= 40.9171
)
AND
(td.dropoff_longitude >= -74.2593 AND td.dropoff_longitude <= -73.7011
AND td.dropoff_latitude >= 40.4770 AND td.dropoff_latitude <= 40.9171
)
AND
td.trip_distance > 0
AND
td.passenger_count > 0
GROUP BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}')
ORDER BY
regexp_substr(td.pickup_datetime,'\d{4}-\d{2}-\d{2}');
It costs 38sec。When I changed the table name to SYS.TRIP_DATA_12, the problem coming, it costs more than 2 hours.
What's more, it did not end. I don't know why.
Today I ask my classmate and he said: clear the cache. So I used the following statements to do it.
alter system flush shared_pool;
alter system flush buffer_cache;
alter system flush global context;
Now when I use the same select statement for SYS.TRIP_DATA_11 I get the same poor performance like SYS.TRIP_DATA_12. Why?

It seems like your classmate was having a good joke at your expense.
Clearly your query was only performing well because you had a warm buffer cache full of all the data you needed from TRIP_DATA_11. By flushing the caches you have zapped all that, and now you have the same bad performance for all tables.
Tuning queries is hard, because there are lots of possibilities. Please read the documentation on it.
To pick just one thing: you're searching ranges, which is problematic. How many rows fill -74.2593 to -73.7011 ? It might be a lot more than say -71.00 to -68.59 even though that's a broader range. Understanding your data - its volume, its distribution and its skew - is crucial.
As a first step learn how to use EXPLAIN PLAN. Find out more. To get better plans, gather statistics on your tables and their indexes, using DBMS_STATS package. Find out more.
One tip. Oracle only uses one index to access a table. So it will choose pick_add_11 or drop_add_11 but not both. It will then read all the matching records from the table and filter them by the other criteria. You may get much better performance from a index designed to service this query:
create index add_11 on trip_data_11
(pickup_longitude
, pickup_latitude
, dropoff_longitude
, dropoff_latitude
, trip_distance
, passenger_count )
;
The select statement will execute the entire filter against this index and only touch the table to get the MEDALLION values. (You could add medallion to the index too). Experiment with the column order. As latitude has a narrower range than longitude probably that should go first; maybe drop-off value should appear before pick-up. You want an index in which the greatest number of related records are clustered together.
Indexes like this can be an overhead, so we wouldn't want to maintain too many of them in real life. But they are a valuable technique for tuning expensive queries which are run frequently.
Oh, and #Justin's right: don't use SYS for doing application work. Even for a school assignment you should create a fresh schema and create your tables, etc in that.

Related

num_buckets (dba_tab_col_statistics)

I noticed when looking at dba_tab_col_statistics for my table (m_CURRENT) that the num_buckets value for column TNE is 75 on one database and 254 on another. DB is Oracle 10g.
This seems to be the main difference between the two tables. Is there a way to get both databases to match num_bucket values?
I have a delete statement that is fast on one database and very slow on another. I know there are several reasons a query plan can differ on two databases. After a lot of analysis, I suspect that getting the slow query database to have the same num_bucket setting could ensure my delete statement does a range_scan (fast) and not a fast_full_scan (slow in this case) on index TNE_idx.
how do you gather stats on both databases, do you have a regular stats gather script? as you can as a one off do this to gather histograms on that one column alone:
begin
dbms_stats.gather_table_stats(user,'M_CURRENT',
method_opt=>'for columns TNE size 254', cascade=>false,
granularity=>'ALL', degree=>8);
end;
/
the size parameter will set the buckets (if there's less than that number of distinct values, the result will be less).
The above doesn't specify estimate_pct so will sample a smaller population of values rather than 100%. if you want 100 percent then specify this in the estimate_pct parameter*), but if you have a regular script, this may get overwritten later on.
* you can check the current sample size by comparing sample_size on the dba_tab_col_statistics to the num_rows on dba_tables
I want to follow up, DazzaL's answer was very informative, but I was wrong, the buckets did not make a difference. I could have saved a lot of time by running
explain plan for delete blah blah blah
then
select * from table(dbms_xplan.display)
and focusing on the right step id.
I zeroed in on step id 5 where a index fast full scan was being used when I should have started on step 2 where the two database plans were different. I saw the fast database doing nested loop anti join and the slow database doing a merge join anti. Once I added the (undocumented?) hint
NL_AJ
with no parameters in the subquery, the desired fast index_range_scan was used.
As I write this I wonder if the bucket number matching was important, but I don't have the time to retest. It ain't broke, so I will not touch it!

ETL for processing history records

I am in sort of a DWH project (not quite, but still). And there is this issue we constantly run into which I was wondering if there would be a better solution. Follows
We receive some big files with records containing the all states a user have been into, like:
UID | State | Date
1 | Active | 20120518
2 | Inactive | 20120517
1 | Inactive | 20120517
...
And we are usually inly interested in the latest state of each user. So far so good, with just a little sorting and we could get the way we want it. Only problem is, these files are usually big.. like 20-60gb, sorting these guys sometimes is a pain since the logic for sorting isn't usually so straight forward.
What we do generally is load everything into our Oracle and use intermediary tables and materialized views to have it done. Still, sometimes performance bites us.
20-60gb might be big, but not that big. I mean, should be a somewhat more specialised way to deal with these records, shouldn't it?
I imagine two basic ways of seeing tackling the issue:
1) Programming outside the DBMS, scripts and compiled things. But maybe this is not very flexible unless some bigger amount of time is invested developing something. Also, I might have to busy myself administrating the box resources, whereas I wish not to worry with that.
2) Load everything into the DBMS (Oracle in my case) and use whatever tools it provide to sort and clip the data. This would be my case, though, I am not sure we are using all the tools or simply doing it the right way that would be for Oracle 10g.
Question is then:
You have a 60gb file with millions of historical records like the one above and your user want a table in DB with the last state for each user.
how would you guys do?
thanks!
There are two things you can do to speed up the process.
The first thing is to throw compute power at it. If you have Enterprise Edition and lots of cores you will get significant reductions in load time with parallel query.
The other thing is to avoid loading the records you don't want. This is why you mention pre-processing the file. I'm not sure there's much you can do there, unless you have access to a Hadoop cluster to run some map-reduce jobs on your file (well, reduce mainly, the structure you post is about as mapped as can be already).
But there is an alternative: external tables. External tables are tables which have their data in OS files rather then tablespaces. And they can be parallel enabled (providing your file meet certain criteria). Find out more.
So, you might have an external table like this
CREATE TABLE user_status_external (
uid NUMBER(6),
status VARCHAR2(10),
sdate DATE
ORGANIZATION EXTERNAL
(TYPE oracle_loader
DEFAULT DIRECTORY data_dir
ACCESS PARAMETERS
(
RECORDS DELIMITED BY newline
BADFILE 'usrsts.bad'
DISCARDFILE 'usrsts.dis'
LOGFILE 'usrsts.log'
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
(
uid INTEGER EXTERNAL(6),
status CHAR(10),
sdate date 'yyyymmdd' )
)
LOCATION ('usrsts.dmp')
)
PARALLEL
REJECT LIMIT UNLIMITED;
Note that you need read and write permissions on the DATA_DIR directory object.
Having created the external table you can load the only desired data into your target table with an insert statement:
insert into user_status (uid, status, last_status_date)
select sq.uid
, sq.status
, sq.sdate
from (
select /*+ parallel (et,4) */
et.uid
, et.status
, et.sdate
, row_number() over (partition by et.uid order by et.sdate desc) rn
from user_status_external et
) sq
where sq.rn = 1
Note that as with all performance advice, there are no guarantees. You need to benchmark things in your environment.
Another thing is the use of INSERT: I'm assuming these are all fresh USERIDs, as that is the scenario your post suggests. If you have a more complicated scenario then you probably want to look at MERGE or a different approach altogether.
One last thing: you seem to be assuming this is a common situation, which has some standard approaches. But most data warehouses load all the data they get. They may then filter it for various different uses, data marts, etc. But they almost always maintain a history in the actual warehouse of all the distinct records. So that's why you might not get an industry standard solution.
I'd go with something along the lines of what APC said as a first go. However, I think parallel tables can only load data in parallel if the data is in multiple files, so you might have to cut the files into several. How are the files generated? A 20 - 60GB file is a real pain to deal with - can you get the generation of the files changed so you get X 2GB files for example?
After getting all the records into the database, you might run into problems attempting to sort 60GB of data - it would be worth having a look at the sort stage of the query you are using to extract the latest status. In the past I helped large sorts by hash partitioning the data on one of the fields to be sorted, in this case user_id. Then Oracle only has to do X smaller sorts, each of which can proceed in parallel.
So, my thoughts would be:
Try and get many smaller files generated instead of 1 big one
Using External tables, see if it is feasible to extract the data you want directly from the external tables
If not, load the entire contents of the files into a hash partition table - at this stage make sure you do insert /*+ append nologging */ to avoid undo generation and redo generation. If your database has force_logging set to true, the nologging hint will have no effect.
Run the select on the staged data to extract only the rows you care about and then trash the staged data.
The nologging option is probably critical to you getting good performance - to load 60GB of data, you are going to generate at least 60GB of redo logs, so if that can be avoided, all the better. You would probably need to have a chat with your DBA about that!
Assuming you have lots of CPU available, it may also make sense to compress the data as you bulk load it into the staging table. Compression may well half the size of your data on disk if it has repeating fields - the disk IO saved when writing it usually more than beats any extra CPU consumed when loading it.
I may be oversimplifying the problem, but why not something like:
create materialized view my_view
tablespace my_tablespace
nologging
build immediate
refresh complete on demand
with primary key
as
select uid,state,date from
(
select /*+ parallel (t,4) */ uid, state, date, row_number() over (partition by uid order by date desc) rnum
from my_table t;
)
where rnum = 1;
Then refresh fully when you need to.
Edit: Any don't forget to rebuild stats and probably throw a unique index on uid.
I would write a program to iterate over each record and retain only those which are more recent than record previously seen. At the end, insert the data into the database.
How practical that is would depend on how many users we're talking about - you could end up having to think carefully about your intermediate storage.
In general, this becomes (in pseudo-code):
foreach row in file
if savedrow is null
save row
else
if row is more desirable than savedrow
save row
end
end
end
send saved rows to database
The point it, you need to define how one row is considered to be more desirable than another. In the simple case, for a given user, the current row's date is later than the last row we saved. At the end, you'd have a list of rows, one-per-user, each of which has the most recent date you saw.
You could general the script or program so that the framework is separate from the code that understands each data file.
It'll still take a while, mind :-)

(TSQL) INSERT doubling time of the query

I have a quite complex multi-join TSQL SELECT query that runs for about 8 seconds and returns about 300K records. Which is currently acceptable. But I need to reuse results of that query several times later, so I am inserting results of the query into a temp table. Table is created in advance with columns that match output of SELECT query. But as soon as I do INSERT INTO ... SELECT - execution time more than doubles to over 20 seconds! Execution plans shows that 46% of the query cost goes to "Table Insert" and 38% to Table Spool (Eager Spool).
Any idea why this is happening and how to speed it up?
Thanks!
The "Why" of it hard to say, we'd need a lot more information. (though my SWAG would be that it has to do with logging...)
However, the solution, 9 times out of 10 is to use SELECT INTO to make your temp table.
I would start by looking at standard tuning itmes. Is disk performing? Are there sufficient resources (IOs, RAM, CPU, etc)? Is there a bottleneck in the RDBMS? Does sound like the issue but what is happening with locking? Does other code give similar results? Is other code performant?
A few things I can suggest based on the information you have provided. If you don't care about dirty reads, you could always change the transaction isolation level (if you're using MS T-SQL)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
select ...
This may speed things up on your initial query as locks will not need to be done on the data you are querying from. If you're not using SQL server, do a google search for how to do the same thing with the technology you are using.
For the insert portion, you said you are inserting into a temp table. Does your database support adding primary keys or indexes on your temp table? If it does, have a dummy column in there that is an indexed column. Also, have you tried to use a regular database table with this? Depending on your set up, it is possible that using that will speed up your insert times.

Oracle Transaction - Count table

I have a table where I need to constrain by category and then find all overlapping dates against some date range. This takes about 2 seconds, which is unacceptable to do on every transaction which occurs at roughly 50/s. The alternative is to create some tally table -- then again, I don't know how great of an idea this is because things can get out of sync.
Date on Rent # Rented Category
9/5/2011 5 CATEGORY1
In Oracle (PL/SQL, if it matters), how can I maintain this performance, but ensure that concurrent transactions don't screw up the increment / decrement by making it one less or one more than it really is?
I have two types of transactions, kind of like a search and a rent. Only rents will be updating this tally table (and searches just reading from it). I don't mind if rents slow down, but do not want search performance impacted. Rents can occur as frequently as 5-10 / second.
Oracle uses locks to ensure data consistency during a query. The details of how they work can get complex, but the effect is that it guarantees that an update/insert to your tally table will only use the data in the main table as it was when the query began. If there is another update or insert to your main table while you are doing an update/insert to the tally table, it won't affect it.
You'll have to experiment with your data to see if keeping a summary/tally table helps or hurts you. It really depends on how quickly the main table is getting updated, how much time you spend updating your tally table vs how much time you save by being able to select on it, and how up-to-date you need selects to be.

Does the order of columns on a covered index in Sybase affect select performance?

We have a large table, with several indices (say, I1-I5).
The usage pattern is as follows:
Application A: all select queries 100% use indices I1-I4 (assume that they are designed well enough that they will never use I5).
Application B: has only one select query (fairly frequently run), which contains 6 fields and for which a fifth index I5 was created as a covered index.
The first 2 fields of the covered index are date, and a security ID.
The table contains rows for ~100 dates (in date order, enforced by a clustered index I1), and tens of thousands of security identifiers.
Question: dies the order of columns in the covered index affect the performance of the select query in Application B?
I.e., would the query performance change if we switched around the first two fields of the index (date and security ID)?
Would the query performance change if we switch around one of the last fields?
I am assuming that the logical IOs would remain un-affected by any order of fields in the covered index (though I'm not 100% sure).
But will there be other performance effects? (Optimizer speed, caching, etc...)
The question is version-generic, but if it matters, we use Sybase 12.
Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult.
It depends. If you have a WHERE clause such as the following, you will get better performance out an index on (security_ID, date_column) than the converse:
WHERE date_column BETWEEN DATE '2009-01-01' AND DATE '2009-08-31'
AND security_ID = 373239
If you have a WHERE clause such as the following, you will get better performance out of an index on (date_column, security_ID) than the converse:
WHERE date_column = DATE '2009-09-01'
AND security_ID > 499231
If you have a WHERE clause such as the following, it really won't matter very much which column appears first:
WHERE date_column = DATE '2009-09-13'
AND security_ID = 211930
We'd need to know about the selectivity and conditions on the other columns in the index to know if there are other ways of organizing your index to gain more performance.
Just like your question is version generic, my answer is DBMS-generic.
Unfortunately, the table is so huge that actually changing the index in practice and quantitatively confirming the effects of the change is extremely difficult.
The problem is not the size of the table. Millions of rows is nothing for Sybase.
The problem is an absence of a test system.

Resources