make the optimizer use all columns of an index - oracle

We have a few tables storing temporal data that have a natural primary key consisting of 3 columns. Example: maximum temperature for a given day. This is the composite primary key index (in this order):
id number(10): the id of the time series.
day date: the day for which this data was reported
kill_at timestamp: the last timestamp before this data was deleted or updated.
Simplified logic: when we make a forecast at 10:00am, the last entry found for this id/day combination has its kill_at changed to 9:59am, and the newly calculated value is stored with a kill_at timestamp of '31.12.2999'.
Typical queries on this table are:
1) where id=? and day=? and kill_at=?
2) where id=? and day between ? and ? and kill_at=?
3) where id=? and day between ? and ?
4) where id=?
There are plenty of time series that we do not forecast. That means we get one value when it's measured and it never changes. But there are some time series that we forecast 200-300 times. So for one id/day combination there are 200+ entries with different values for kill_at.
Currently the primary key (id, day, kill_at) is the only (unique) index on this table. But when I run query 2 (exact id and day range), the optimizer decides to use only the first column of the index.
ID  OPERATION         OPTIONS         OBJECT_NAME  OPTIMIZER  SEARCH_COLUMNS
0   SELECT STATEMENT                               ALL_ROWS   0
1   FILTER                                                    0
2   TABLE ACCESS      BY INDEX ROWID  DPD                     0
3   INDEX             RANGE SCAN      DPD_PK                  1
This really hurts us for those time series that have been updated 200+ times.
Now I was looking for a way to force the optimizer to use all 3 columns of our index, but I can't find a hint for that. Is there one?
Or are there any other suggestions on how to speed up my query? We are trying to reduce the peak durations. The average durations are of lesser concern.
What confuses me:
The above execution plan is what I see in dba_hist_sql_plan. It is the only execution plan for this statement. But when I let my client show the explain plan, search_columns is sometimes 1 and sometimes 3. It is never 3 when our application runs this statement.

We actually found the cause of this problem. We're using JPA/JDBC and the JDBC date types weren't modeled correctly. While the Oracle DATE type has second precision, somebody (I now hate him) made the "day" attribute in our entity of type java.sql.Timestamp (although it only holds a day without a time).
The effect is that Oracle will need to cast (use a function on) each entry in the table to make it a Timestamp before it can compare with the Timestamp query parameter. That way the index cannot be used properly.
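The effect can be reproduced in miniature. The following is only a rough sketch in Python using SQLite, not Oracle and not the original application: the `date(...)` wrapper stands in for the implicit conversion Oracle applies, and the `value` column is a hypothetical payload. It illustrates the general principle that an indexed column wrapped in a function can no longer serve as an index search key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dpd (
        id      INTEGER NOT NULL,
        day     TEXT    NOT NULL,   -- ISO date string, stand-in for Oracle DATE
        kill_at TEXT    NOT NULL,
        value   REAL,               -- hypothetical payload column
        PRIMARY KEY (id, day, kill_at)
    )
    """
)


def plan(sql, params):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Column compared directly: both id and day can be used as index search keys.
plan_plain = plan(
    "SELECT * FROM dpd WHERE id = ? AND day = ?", (1, "2016-01-01")
)

# Column wrapped in a function (analogous to comparing a DATE column with a
# TIMESTAMP bind in Oracle): only id remains usable as a search key.
plan_wrapped = plan(
    "SELECT * FROM dpd WHERE id = ? AND date(day) = ?", (1, "2016-01-01")
)

print(plan_plain)
print(plan_wrapped)
```

In the first plan the index search covers both `id` and `day`; in the second only `id` appears, which corresponds to SEARCH_COLUMNS dropping to 1 in the Oracle plan above.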

Related

Oracle: updating data in referenced partition scenario is taking longer time

I have a table partitioned on a column (rcrd_expry_ts) of date type. We update this rcrd_expry_ts weekly from another job. We noticed that the update query takes quite a long time (1 to 1.5 min) even for a few rows, and I think most of that time is spent moving data internally to a different partition. There can be a million rows eligible for the rcrd_expry_ts update in our weekly job.
CREATE TABLE tbl_parent
(
    "parentId" NUMBER NOT NULL ENABLE,
    "RCRD_DLT_TSTP" timestamp default timestamp '9999-01-01 00:00:00' NOT NULL
)
PARTITION BY RANGE ("RCRD_DLT_TSTP") INTERVAL (NUMTOYMINTERVAL('1','MONTH'))
(PARTITION "P1" VALUES LESS THAN (TO_DATE('2010-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')));
CREATE TABLE tbl_child
(
    "foreign_id" NUMBER NOT NULL ENABLE,
    "id" NUMBER NOT NULL ENABLE,
    constraint fk_id foreign key("foreign_id") references
        tbl_parent("parentId")
) partition by reference (fk_id);
I am updating RCRD_DLT_TSTP in the parent table from another job (using a simple update query), but I noticed that it takes around 1 to 1.5 min to execute, probably due to creating the partition and moving data into the corresponding partition. Is there any better way to achieve this in Oracle?
The table has a reference-partitioned child. So any rows moving partition in the parent will have to be cascaded to the child table too.
This means you could be moving substantially more rows than the "few rows" that change in the parent.
It's also worth checking if the update can identify the rows it needs to change faster too.
You can do this by getting the plan for the update statement like this:
update /*+ gather_plan_statistics */ <your update statement>;
select *
from table(dbms_xplan.display_cursor( format => 'ALLSTATS LAST' ));
This will give you the plan for the update with its run time stats. This will help in identifying if there are any indexes you can create to improve performance.
Is there any better way to achieve this in Oracle
This is a question that needs to be answered in the larger context. You may well be able to make this process faster by unpartitioning the table and using indexes to identify the rows to change.
But this affects all the other statements that access this table. To what extent do they benefit from partitioning? If the answer is substantially, is it worth making this process faster at the expense of these others? What trade-offs are you willing to make here?

When filtering a single table on only the primary key, why is the optimizer doing a full table scan?

I have a monster-sized non-partitioned table. I recently updated the statistics on it as well.
The primary key is on a char field called "ID".
SELECT * FROM "MYDATA" WHERE "ID" = '0000492319'
The plan says TABLE ACCESS (FULL) and has a filter predicate for the ID. This results in a query that takes 8 seconds to run.
If I give the optimizer a hint to use the primary key, the query takes 1.6 seconds to run.
It's bizarre to me that I should need to provide this hint. The indexed plan estimates a lower cost, and the optimizer should be aware of this.
Here is the filter predicate:
NLSSORT(INTERNAL_FUNCTION(ID),'nls_sort="JAPANESE_M"')=HEXTORAW('017...')
The database NLS_SORT is set to JAPANESE_M and the NLS_CHARACTERSET is JA16SJIS.
So nothing seems to be mismatched that would cause a special sort function to be called. It's a bit odd though.
One more piece of information: if I select only the "ID" column in my query, then the planner chooses INDEX (FAST FULL SCAN) automatically.
The problem only arises when I use select *.
Oracle Database Version: 10.2.0.5.0.

Recommended way to index a date field in postgres?

I have a few tables with about 17M rows that all have a date column I would like to be able to utilize frequently for searches. I am considering either just throwing an index on the column and see how things go or sorting the items by date as a one time operation and then inserting everything into a new table so that the primary key ascends as the date ascends.
Since these are both pretty time consuming I thought it might be worth it to ask here first for input.
The end goal is for me to load sql queries into pandas for some analysis if that is relevant here.
The index on a date column makes sense when you are going to search the table for a given date(s), e.g.:
select * from test
where the_date = '2016-01-01';
-- or
select * from test
where the_date between '2016-01-01' and '2016-01-31';
-- etc
For these queries it does not matter whether the sort order of the primary key and the date column are the same or not. Hence rewriting the data into a new table would be useless. Just create an index.
However, if you are going to use the index only in ORDER BY:
select * from test
order by the_date;
then a primary key integer index may be significantly (2-4 times) faster than an index on a date column.
Postgres supports clustered indexes to some extent, which is what you suggest by removing and reinserting the data.
In fact, removing and reinserting the data in the order you want will not change the time the query takes. Postgres does not know the order of the data.
If you know that the table's data does not change, then cluster the data based on the index you create.
This operation reorders the table based on the order in the index. It is very effective until you update the table. The syntax is:
CLUSTER tableName USING IndexName;
See the manual for details.
I also recommend you use
explain <query>;
to compare two queries, before and after an index. Or before and after clustering.
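A minimal sketch of that before/after comparison, written in Python against an in-memory SQLite database rather than Postgres (so the plan text differs from what Postgres `explain` prints). The table `test` and column `the_date` follow the example above; the index name `test_the_date_idx` and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER PRIMARY KEY, the_date TEXT NOT NULL)")

# Sample data: every day of January and February 2016 as ISO date strings.
days = ["2016-01-%02d" % d for d in range(1, 32)]
days += ["2016-02-%02d" % d for d in range(1, 29)]
conn.executemany("INSERT INTO test (the_date) VALUES (?)", [(d,) for d in days])

query = "SELECT * FROM test WHERE the_date BETWEEN '2016-01-01' AND '2016-01-31'"


def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = plan(query)  # no index yet: a full scan of the table
conn.execute("CREATE INDEX test_the_date_idx ON test (the_date)")
plan_after = plan(query)   # the range predicate can now use the index

matches = conn.execute("SELECT count(*) FROM (" + query + ")").fetchone()[0]

print(plan_before)
print(plan_after)
print(matches)
```

The same pattern applies in Postgres: run `explain` on the query, create the index, and run `explain` again to see whether the planner switched from a sequential scan to an index scan.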

Best way to identify a handful of records expected to have a flag set to TRUE

I have a table that I expect to get 7 million records a month on a pretty wide table. A small portion of these records are expected to be flagged as "problem" records.
What is the best way to implement the table to locate these records in an efficient way?
I'm new to Oracle, but is a materialized view a valid option? Are there such things in Oracle as indexed views, or is this potentially the same thing?
Most of the reporting is by month, so partitioning by month seems like an option, but a "problem" record may theoretically linger for several months. Otherwise, the reporting should be mostly for the current month. Would you expect that querying across all month partitions to locate any problem record would cause significant performance issues compared to using a single table?
Your general thoughts of where to start would be appreciated. I realize I need to read up and I'll do that but I wanted to get the community thought first to make sure I read the right stuff.
One more thought: The primary key is a GUID varchar2(36). In order of magnitude, how much of a performance hit would you expect this to be relative to using a NUMBER data type PK? This worries me but it is out of my control.
It depends what you mean by "flagged", but it sounds to me like you would benefit from a simple index, function based index, or an indexed virtual column.
In all cases you should be careful to ensure that all the index columns are NULL for rows that do not need to be flagged. This way your index will contain only the rows that are flagged (Oracle does not - by default - index rows in B-Tree indexes where all index column values are NULL).
Your primary key being a VARCHAR2 GUID should make no difference, at least with regard to the specific flagging of rows in this question, since indexes point to rows via Oracle's internal ROWIDs.
Indexes support partitioning, so if your data is already partitioned, your index could be set to match.
Simple column index method
If you can dictate how the flagging works, or the column already exists, then I would simply add an index to it like so:
CREATE INDEX my_table_problems_idx ON my_table (problem_flag)
/
Function-based index method
If the data model is fixed / there is no flag column, then you can create a function-based index assuming that you have all the information you need in the target table. For example:
CREATE INDEX my_table_problems_fnidx ON my_table (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
Now if you use the same logic in your SELECT statement, you should find that it uses the index to efficiently match rows.
SELECT *
FROM my_table
WHERE CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END IS NOT NULL
/
This is a bit clunky though, and it requires you to use the same logic in queries as the index definition. Not great. You could use a view to mask this, but you're still duplicating logic in at least two places.
Indexed virtual column
In my opinion, this is the best way to do it if you are computing the value dynamically (available from 11g onwards):
ALTER TABLE my_table
ADD virtual_problem_flag VARCHAR2(1) AS (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
CREATE INDEX my_table_problems_idx ON my_table (virtual_problem_flag)
/
Now you can just query the virtual column as if it were a real column, i.e.
SELECT *
FROM my_table
WHERE virtual_problem_flag = 'Y'
/
This will use the index and puts the function-based logic into a single place.
Create a new table with just the pks of the problem rows.

How can I speed up a diff between tables?

I am working on doing a diff between tables in postgresql, it takes a long time, as each table is ~13GB...
My current query is:
SELECT * FROM tableA EXCEPT SELECT * FROM tableB;
and
SELECT * FROM tableB EXCEPT SELECT * FROM tableA;
When I do a diff on the two (unindexed) tables it takes 1:40 hours (1 hour and 40 minutes). In order to get both the new and removed rows I need to run the query twice, bringing the total time to 3:20 hours.
I ran the Postgresql EXPLAIN query on it to see what it was doing. It looks like it is sorting the first table, then the second, then comparing them. That made me think that if I indexed the tables they would be presorted and the diff query would be much faster.
Indexing each table took 45 minutes. Once indexed, each diff took 1:35 hours.
Why do the indexes only shave 5 minutes off each diff? I would have expected a saving of more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice).
Since one of these tables will not be changing much, it will only need to be indexed once; the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index, plus 2x 1:35 for the diff, giving a total of 3:55 hours, almost 4 hours.
What am I doing wrong here? I can't see why my net diff time is larger with the index than without it.
This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries
EDIT:
Here is the schema for the two tables, they are identical except the table name.
CREATE TABLE bulk.blue
(
"partA" text NOT NULL,
"type" text NOT NULL,
"partB" text NOT NULL
)
WITH (
OIDS=FALSE
);
In the statements above you are not using the indexes.
You could do something like:
SELECT * FROM tableA a
FULL OUTER JOIN tableB b ON a.someID = b.someID
You could then use the same statement to show which tables had missing values
SELECT * FROM tableA a
FULL OUTER JOIN tableB b ON a.someID = b.someID
WHERE a.someID IS NULL OR b.someID IS NULL
This should give you the rows that were missing in table A OR table B
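As a rough, runnable illustration of the keyed-join idea, here is a sketch in Python using an in-memory SQLite database instead of Postgres. The tables `table_a` and `table_b`, the key column `some_id`, and the sample rows are hypothetical; two LEFT JOIN anti-joins combined with UNION ALL are used in place of FULL OUTER JOIN so that it runs on older SQLite versions. Note that it only compares keys, not the remaining columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table_a (some_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE table_b (some_id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO table_a VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO table_b VALUES (2, 'y'), (3, 'z'), (4, 'w');
    """
)

# One statement returns both directions of the diff, labelled by side.
diff_sql = """
    SELECT 'only_in_a' AS side, a.some_id AS some_id
    FROM table_a a
    LEFT JOIN table_b b ON a.some_id = b.some_id
    WHERE b.some_id IS NULL
    UNION ALL
    SELECT 'only_in_b' AS side, b.some_id AS some_id
    FROM table_b b
    LEFT JOIN table_a a ON a.some_id = b.some_id
    WHERE a.some_id IS NULL
    ORDER BY 1, 2
"""
diff = conn.execute(diff_sql).fetchall()
print(diff)
```

Because the join is on an indexed key, the database can look up matches through the index instead of sorting and comparing whole rows, and both the removed and the new rows come back from a single statement.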
Confirm your indexes are being used (they likely are not in such a generic EXCEPT statement). Because you are not joining on specific columns, the lack of an explicit join will likely prevent an optimized query:
http://www.postgresql.org/docs/9.0/static/indexes-examine.html
This will help you view the explain analyze more clearly:
http://explain.depesz.com
Also, make sure you run ANALYZE on the table after you create the index if you want it to perform well right away.
The queries as specified require a comparison of every column of the tables.
For example if tableA and tableB each have five columns then the query is having to compare tableA.col1 to tableB.col1, tableA.col2 to tableB.col2, . . . tableA.col5 to tableB.col5
If there are just a few columns that uniquely identify a record, instead of all the columns in the table, then joining the tables on those specific columns will improve your performance.
The above statement assumes that a primary key has not been created. If a primary key has been defined to indicate which columns uniquely identify a record, then I believe the EXCEPT statement would take that into consideration.
What kind of index did you apply? Indexes mainly help when the query can narrow down rows, for example through WHERE conditions or join keys. If you're doing a SELECT * with EXCEPT, every field of every row has to be read and compared, so the index is probably not doing anything except taking up space and adding a little extra work for the database engine.
Instead of SELECT *, you can try selecting your unique fields and creating an index on those unique fields.
You can also use an OUTER JOIN to show results from both tables that did not match on the unique fields.
You may also want to consider clustering your tables.
What version of Postgres are you running?
When was the last time you vacuumed?
Other than the above, 13GB is pretty large, so you'll want to check your config settings. It shouldn't take hours to run that, unless you don't have enough memory on your system.

Resources