Oracle compressed/b-tree index: how and when to use

I would like to add a compressed index to the Oracle Applications workflow table hr.pqh_ss_transaction_history in order to access specific types of workflows (process_name) and workflows for specific people (selected_person_id).
There are lots of repeating values in process_name, although the data is skewed. I would, however, want to access the TFG_HR_NEW_HIRE_PLACE_JSP_PRC and TFG_HR_TERMINATION_JSP_PRC process types.
"PROCESS_NAME","CNT"
"HR_GENERIC_APPROVAL_PRC",40347
"HR_PERSONAL_INFO_JSP_PRC",39284
"TFG_HR_NEW_HIRE_PLACE_JSP_PRC",18117
"TFG_HREMPSTS_TERMS_CHG_JSP_PRC",14076
"TFG_HR_TERMINATION_JSP_PRC",8764
"HR_ADV_INDIVIDUAL_COMP_PRC",4907
"TFG_HR_SIT_NOAPP",3979
"TFG_YE_TAX_PROV",2663
"HR_TERMINATION_JSP_PRC",1310
"HR_CHANGE_PAY_JSP_PRC",953
"TFG_HR_SIT_EXIT_JSP_PRC",797
"HR_SIT_JSP_PRC",630
"HR_QUALIFICATION_JSP_PRC",282
"HR_CAED_JSP_PRC",250
"TFG_HR_EMP_TERM_JSP_PRC",211
"PER_DOR_JSP_PRC",174
"HR_AWARD_JSP_PRC",101
"TFG_HR_SIT_REP_MOT",32
"TFG_HR_SIT_NEWPOS_NIB_JSP_PRC",30
"TFG_HR_SIT_NEWPOS_INBU_JSP_PRC",28
"HR_NEW_HIRE_PLACE_JSP_PRC",22
"HR_NEWHIRE_JSP_PRC",6
selected_person_id would obviously be more selective. Unfortunately there are 3774 nulls in this column, and the highest count after that is 73 rows for one person. A lot of people would have only 1 row. The total row count is 136963.
My query would be in this format:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, :p_person_id) = :p_person_id
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I am on Oracle 12c release 1.
I assume it would be a good idea to put a non-compressed b-tree index on selected_person_id, since the values returned would fall into the less-than-5%-of-total-rows scenario. But how do you handle the nulls in that column, which do not go into the index, when you select using nvl(psth.selected_person_id, :p_person_id) = :p_person_id? Is there a more efficient way to write the SQL, and how should this index be created?
For process_name I would like to use a compressed b-tree index. I am assuming that the statement is
CREATE INDEX idxname ON pqh_ss_transaction_history(process_name) COMPRESS
where there would be an implicit second column for rowid. Is it safe to rely on rowid here, since normally it is not advised to use rowid? Is the skewed data an issue (most of the time I would be selecting on the high-volume side)? I also don't understand how compressed indexes can be efficient. For b-tree indexes you normally want to return less than about 5% of the data, otherwise a full table scan is actually more efficient. How does a compressed index return so many rowids, and then look up the full table using those rowids, faster than a full table scan?
Or, since the optimizer will only be able to use one of the two indexes, should I rather create an uncompressed function-based index on selected_person_id and process_name concatenated?

Perhaps you could create this index:
CREATE INDEX idxname ON pqh_ss_transaction_history
(process_name, NVL(selected_person_id,-1)) COMPRESS 1
COMPRESS 1 deduplicates the heavily repeated process_name prefix, and wrapping selected_person_id in NVL stores the null rows in the index as -1, so they are no longer excluded from it.
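As a side note, if you want to sanity-check how much the prefix compression buys you on this data, Oracle can estimate it for you (a minimal sketch; idxname is just the placeholder name from above, and ANALYZE ... VALIDATE STRUCTURE briefly locks the index, so avoid it on a busy system):
ANALYZE INDEX idxname VALIDATE STRUCTURE;
-- OPT_CMPR_COUNT is the prefix column count Oracle considers optimal,
-- OPT_CMPR_PCTSAVE the estimated % space saving at that setting
SELECT opt_cmpr_count, opt_cmpr_pctsave FROM index_stats;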
Then change your query to:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id,-1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
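Since process_name is skewed, it is also worth making sure the optimizer knows that. A frequency histogram lets it cost the popular and the rare values differently (a hedged sketch; the parameter values are illustrative):
-- Gather table stats with a histogram on the skewed column
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'HR',
    tabname    => 'PQH_SS_TRANSACTION_HISTORY',
    method_opt => 'FOR COLUMNS PROCESS_NAME SIZE 254');
END;
/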

Related

Faster select when filtering with second or third sorted column

We have a time series table with the following definition
CREATE TABLE timeseries.mytable
(
`ts` DateTime('UTC'),
`src_ip` String,
`dst_ip` String,
`col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip,ts,src_ip)
SETTINGS index_granularity = 8192
SELECT count(*) FROM timeseries.mytable;
# Elapsed 0.004 sec. Has 383M records
SELECT count(*) FROM timeseries.mytable WHERE dst_ip = 'a.b.c.d';
# Elapsed: 0.085 sec.
SELECT count(*) FROM timeseries.mytable WHERE src_ip = 'a.b.c.d';
# Elapsed: 53.031 sec
As can be seen above, filtering the data using the first sorted column (dst_ip) is very quick.
How can I make the select using the third sorted column (src_ip) faster?
Some remarks:
the third query (WHERE src_ip = 'a.b.c.d') is slow because the index is not used and ClickHouse falls back to a full scan. There is no good way to make it faster short of redesigning the primary key, or, if the query only computes aggregates, adding an AggregatingMergeTree table
the use cases you provided look artificial, because counting rows over the whole dataset is not a key use case for time-series data. Why is the result not restricted by dst_ip and ts?
consider the ClickHouse AggregatingMergeTree approach when you need precomputed aggregate values (such as the count in your case)
designing the primary key requires understanding how ClickHouse uses it in query optimization (see Primary Keys and Indexes in Queries, More secrets of ClickHouse Query Performance)
the documentation recommends using a monotonic index
to choose the best index, run a series of tests to find the one that fits your concrete use cases
I would suggest the following primary keys:
/* Remove the date column (this makes all date-range queries with a range of less than a day much slower). */
ORDER BY (dst_ip, src_ip)
/* Define the date granularity. Instead of toStartOfHour, any interval finer than daily can be used (where daily is defined by the partition key). */
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
/* Move the date to the first position (this makes date-range queries without dst_ip faster, and gains the monotonic-index advantages). */
ORDER BY (toStartOfHour(ts), dst_ip, src_ip)
For each primary key you need to run tests to choose the most effective index_granularity value.
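For example, the second variant as a complete definition might look like this (a sketch only; mytable_v2 is a hypothetical name and the granularity value is something to tune in those tests):
CREATE TABLE timeseries.mytable_v2
(
`ts` DateTime('UTC'),
`src_ip` String,
`dst_ip` String,
`col_other` String
)
ENGINE = MergeTree()
PARTITION BY toDate(ts)
ORDER BY (dst_ip, toStartOfHour(ts), src_ip)
SETTINGS index_granularity = 8192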
As of 2022, the solution is to use a Data Skipping Index (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-data_skipping-indexes) for src_ip.
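A minimal sketch of that idea (the index name and the bloom_filter type are my assumptions; test the type and GRANULARITY against your own data):
-- Add a skipping index on src_ip, then build it for already-existing parts
ALTER TABLE timeseries.mytable ADD INDEX ix_src_ip src_ip TYPE bloom_filter GRANULARITY 4;
ALTER TABLE timeseries.mytable MATERIALIZE INDEX ix_src_ip;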
You should try testing with different orderings in the ORDER BY clause, depending on the value cardinality of your columns. In this case, try bringing src_ip before ts in the ORDER BY clause.
In the MergeTree engine, rows are sorted on the basis of the ORDER BY keys within each partition.
After that, you can decide the final arrangement of columns in the ORDER BY clause depending on how your application will query the data most of the time.
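To inform that decision, you could compare the column cardinalities first (a sketch; uniqExact is exact but heavier than uniq on large tables):
SELECT uniqExact(dst_ip) AS dst_cardinality, uniqExact(src_ip) AS src_cardinality
FROM timeseries.mytable;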
You can find a similar discussion here.

Adding Index To A Column Having Flag Values

I am a novice at tuning Oracle queries, so I need help.
If I have a sql query like:
select a.ID,a.name.....
from a,b,c
where a.id=b.id
and ....
and b.flag='Y';
then will adding an index to the FLAG column of table b help tune the query in any way? The FLAG column has only 2 values, Y and N.
With a standard btree index, the SQL engine can find the row or rows in the index for the specified value quickly thanks to its sorted, balanced structure, then use the physical address (the rowid) stored in the index to access the desired row in a second hop. It's like looking in the index of a book to find the page number. So that is:
Go to index with the key value you want to look up.
The index tells you the physical address in the table.
Go straight to that physical address.
That is nice and quick for something like a unique customer ID. It's still OK for something non-unique, like a customer ID in a table of orders, although the database has to go through the matching index entries and, for each one, go to the indicated address. That can still be faster than slogging through the entire table from top to bottom.
But for a column with only two distinct values, you can see that it is going to be more work going through all of the index entries for 'Y' for example, and for each one going to the indicated location in the table, than it would be to just forget the index and scan the whole table in one shot.
That's unless the values are unevenly distributed. If there are a million Y rows and ten N rows then an index will help you find those N rows fast but be no use for Y.
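If the data is skewed like that and you only ever look for the rare value, one well-known trick exploits the fact that entirely-NULL keys are not stored in a btree index, so the index stays tiny (a sketch with hypothetical names):
-- Index only the rare 'N' rows; 'Y' rows map to NULL and are not stored
CREATE INDEX ix_b_flag_n ON b (CASE WHEN flag = 'N' THEN flag END);
-- The query must use the exact same expression to pick up the index
SELECT count(*) FROM b WHERE CASE WHEN flag = 'N' THEN flag END = 'N';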
Adding an index to a column with only 2 values normally isn't very useful, because Oracle might just as well do a full table scan.
From your query it looks like it would be more useful to have an index on id, because that would help with the join a.id=b.id.
If you really want to get into tuning then learn to use "explain plan", as that will give you some indication of how much work Oracle needs to do for a query. Add (or remove) an index, then rerun the explain plan.
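For example (a minimal sketch, run in SQL*Plus or SQL Developer):
EXPLAIN PLAN FOR
select a.id, a.name
from a, b
where a.id = b.id
and b.flag = 'Y';
-- Shows the chosen plan: look for INDEX RANGE SCAN versus TABLE ACCESS FULL
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);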

Oracle database help optimizing LIKE searches

I am on Oracle 11g and we have these 3 core tables:
Customer - CUSTOMERID|DOB
CustomerName - CUSTOMERNAMEID|CustomerID|FNAME|LNAME
Address - ADDRESSID|CUSTOMERID|STREET|CITY|STATE|POSTALCODE
I have about 60 million rows on each of the tables and the data is a mix of US and Canadian population.
I have a front-end application that calls a web service and they do a last name and partial zip search. So my query basically has
where CUSTOMERNAME.LNAME = ? and ADDRESS.POSTALCODE LIKE '?%'
They typically provide the first 3 digits of the zip.
The address table has an index on all street/city/state/zip and another one on state and zip.
I did try adding an index exclusively for the zip and forced Oracle to use that index in my query, but that didn't make any difference.
For returning about 100 rows (I have pagination to only return 100 at a time) it takes about 30 seconds, which isn't ideal. What can I do to make this better?
The problem is that the filters you are applying are not very selective and they apply to different tables. This is bad for an old-fashioned btree index. If the content is fairly static you could try bitmap indexes. More precisely, a function-based bitmap join index on the first three letters of the last name, and a bitmap join index on the postal code column. This assumes that very few people whose last name starts with certain letters live in an area with a certain postal code.
CREATE BITMAP INDEX ix_customer_custname ON customer (SUBSTR(customername.lname,1,3))
FROM customer, customername
WHERE customer.customerid = customername.customerid;
CREATE BITMAP INDEX ix_customer_postalcode ON customer (SUBSTR(address.postalcode,1,3))
FROM customer, address
WHERE customer.customerid = address.customerid;
If you are successful you should see the two bitmap indexes becoming AND-connected in the plan. The execution time should drop to a couple of seconds, although it will not be as fast as a btree index.
Remarks:
You may have to experiment a bit to see whether one index or two is more efficient, and whether the functions are helpful.
If you make them function-based, you should include the exact same function calls in the where clause of your query, otherwise the index will not be used (see the sketch after these remarks).
DML operations will be considerably slower. This is only useful for tables with fairly static data. Note that DML operations will lock whole row "ranges", so concurrent DML operations will run into problems.
Response time will probably still be seconds, not instantaneous as with a btree index.
AFAIK this will work only on Enterprise Edition. The syntax is untested because I do not have an enterprise db available at the moment.
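For illustration, the matching query might look something like this (a sketch; the bind names are mine, and the expressions have to match the index definitions verbatim):
SELECT c.customerid
FROM customer c
JOIN customername cn ON cn.customerid = c.customerid
JOIN address a ON a.customerid = c.customerid
WHERE SUBSTR(cn.lname, 1, 3) = SUBSTR(:lname, 1, 3)
AND cn.lname = :lname
AND SUBSTR(a.postalcode, 1, 3) = :zip3;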
If this is still not fast enough, you can create a materialized view with customerid, last name and postal code and put a btree index on it. But that is kind of expensive, too.
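That could look roughly like this (a sketch only; the names are hypothetical and the refresh strategy is up to you):
CREATE MATERIALIZED VIEW mv_cust_search
BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND AS
SELECT cn.customerid, cn.lname, a.postalcode
FROM customername cn
JOIN address a ON a.customerid = cn.customerid;
-- The btree index that makes the lookup fast
CREATE INDEX ix_mv_cust_search ON mv_cust_search (lname, postalcode);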

Scan on DynamoDB table or query on a global secondary index or a local index (what's the best solution?)

I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of emails. It has two attributes, the first called "Daily Points" and the second "TimeSpendInTheApp". Now I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local index on points can be a costly operation, as I have to assign read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records, and on average it will have 50,000 records. What are my best options? Please suggest.
I am thinking my first option is to scan the whole table with a filter expression for records above a certain number of points (5000, for example). If this scan finds 50 or more records, simply sort the values and take the top 50. If the scan returns no or very few results, reduce the filter value (to 3000, for example) and scan again. If a filter value (2500, for example) returns too many records, like 5000 or more, raise it again. Is this even possible? I guess it would also need to handle pagination. Is it advisable to scan a table which has 50,000 records?
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as it doesn't solve the aggregation or sorting requirement.
I would export the data to Hive and run the queries there rather than writing code to determine the result, especially as this is a batch executed only once per day.
Something like below:-
Create Hive table:-
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:-
-- ORDER BY (not SORT BY) gives a total order; LIMIT returns the top 50
SELECT dailyPoints, userId FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT timeSpendInTheApp, userId FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;
Hive Reference

Oracle optimizer is not accepting index hint

When I run this merge statement the indexes are not used and the query runs very slowly. Please advise.
Index on stage_dim_accounts(rbc_code)
Index on map_rbc_etl(free_code_9)
MERGE INTO stage_dim_accounts t
USING map_rbc_etl s ON (t.rbc_code = s.free_code_9)
WHEN MATCHED THEN UPDATE
SET t.indx_no= s.indx_no
WHERE s.annexure= 'AXN-I'
AND (s.free_code_9 <> 'NA' AND s.free_code_9 <> '0')
AND t.rbc_code <> 'NA'
Thanks in advance
The optimizer is smart enough to know that your indexes are useless.
An index on free_code_9 might be useful if most of the values in that column were either '0' or 'NA'. As you haven't provided any information regarding data volumes or distribution, we can't tell. But you have other restriction criteria on map_rbc_etl, so the database needs to visit the table anyway. My guess is that the optimizer has chosen a full table scan on map_rbc_etl because that's quicker than a huge number of indexed reads.
This is because an indexed read is two operations: read the index, then read the row. So it only pays dividends if the percentage of rows read is tiny. Otherwise it is more efficient to read all the rows and winnow them in memory.
Here is the great "secret" of tuning: indexed reads are not always faster; full table scans are not always bad.
Similar logic applies to reading stage_dim_accounts. The indexed column is unlikely to be selective, unless the number of rows in map_rbc_etl is very small and matches only a small selection of rows in stage_dim_accounts. My previous comment on data metrics applies again.
Indexes to use:
on map_rbc_etl(free_code_9, annexure)
and on stage_dim_accounts(rbc_code);
though these may not be used, for the reasons in the previous answer.
Additional reasons an index may not be used:
1. The optimizer decides it would be more efficient not to use the index.
2. The column is accessed through a function call (for example in a view). Use a function-based index to cover this.
3. You perform a mathematical operation on the column in the query. Note you can look at the explain plan and create an index to match how it loads the rows.
4. You concatenate columns together in the where clause. A function-based index can overcome this.
5. You do not include the first column of a concatenated index in the where clause of your statement. Note that Oracle 9i and later can do index skip scanning and may still use the index.
6. You use an OR clause. In this case it is best to create one index for everything except the OR clause and one for each of the OR values; then all the indexes can be used appropriately.
If you don't know how to use function-based indexes, here is an example for an UPPER() call in the where clause:
CREATE INDEX index_name ON table_name (UPPER(colname));
Any Oracle SQL function (built-in, or user-defined provided it is declared DETERMINISTIC) can be used in the index.
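For instance, a user-defined function must be declared DETERMINISTIC before Oracle will accept it in an index (a sketch with hypothetical names):
-- Deterministic: the same input always yields the same output
CREATE OR REPLACE FUNCTION strip_spaces(p_text VARCHAR2)
RETURN VARCHAR2 DETERMINISTIC IS
BEGIN
  RETURN REPLACE(p_text, ' ');
END;
/
CREATE INDEX ix_t_stripped ON some_table (strip_spaces(some_col));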
