How to improve deletion times in Oracle for a self-referencing table - oracle

In our Oracle 11g database we have a table that has a primary key I_Node (int) and a column called I_Parent_Node (int) that references another record in the same table. The root node has I_Parent_Node = null. In this way we form a tree structure of nodes, leaves, branches, whatever you want to call them.
Frequently we need to delete an entire branch of nodes at once, meaning a node and all of its children. At times this is many, many records, say 50,000 or more. Since a cascade delete is not allowed on a self-referencing table, we are forced to delete one by one starting with the leaves and working our way back up the tree. We have experienced hours-long delete times.
We are considering doing a "marking for deletion" technique, where a separate program would clean out the nodes marked for deletion during off-peak hours, but I am interested in whether a database design change or some other Oracle construct could help out here. I am not trained in Oracle aside from what I've learned on the job, and the people who created the database did not have such large quantities in mind. I am open to database design changes since it is not yet a fixed design.

You may want to consider separating the hierarchy structure from the main table. Your main table would just have primary IDs (let's call the column "ID"), and your hierarchy table would have "ID, ParentID, TreeID". ParentID is that ID's parent node, and TreeID is the highest parent in the tree (level 1).
So, a level 1 node would look like:
ID, ParentID, TreeID
1, [null], 1
A level 2 node would look like:
ID, ParentID, TreeID
2, 1, 1
A level 3 node would look like:
ID, ParentID, TreeID
3, 2, 1
And so on.
You would use Oracle hierarchy queries (Connect by queries) to query or traverse the trees. This table will be very thin (not many columns, these 3 + some modified dates maybe), so updating these relationships should be much faster and scale better than messing with the main table.
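To make the TreeID column concrete, here is a minimal Python sketch that derives it from plain (ID, ParentID) pairs. The function name and the dictionary layout are assumptions made for illustration; they are not part of the schema described above.

```python
def tree_ids(parent_of):
    """Map every node ID to the ID of the root of its tree (its TreeID).

    parent_of maps ID -> ParentID, with None for a level 1 node.
    """
    roots = {}
    for start in parent_of:
        # Walk up until we reach a root or a node whose TreeID is already known.
        path = []
        node = start
        while node not in roots and parent_of[node] is not None:
            path.append(node)
            node = parent_of[node]
        top = roots.get(node, node)
        roots[node] = top
        for visited in path:
            roots[visited] = top
    return roots


# The three example rows from above: (1, null), (2, 1), (3, 2).
hierarchy = {1: None, 2: 1, 3: 2}
rows = [(node, hierarchy[node], tree) for node, tree in sorted(tree_ids(hierarchy).items())]
```

Each tuple in rows is one (ID, ParentID, TreeID) row of the hierarchy table.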

You should be able to do this with deferrable constraints and a hierarchical query.
If your foreign key constraint (on I_Parent_Node) is not already deferrable, drop it and recreate it with the keyword "DEFERRABLE".
Here's an example using the EMPLOYEES table from Oracle's examples (I modified the DEPARTMENTS table too so that this would execute, that's really not needed for an example though):
Drop & Recreate your foreign key if it's not currently deferrable:
alter table employees drop constraint emp_manager_fk;
alter table employees add constraint emp_manager_fk foreign key (manager_id) references employees(employee_id) deferrable;
In your transaction, defer your constraints, and delete using a hierarchical query:
set constraints all deferred;
delete
from employees e
where employee_id in (select employee_id
from employees
start with employee_id = 108
connect by prior employee_id = manager_id);
The "108" is the ID of my "parent" record.

I assume you've already done standard tuning, i.e. are the node and parent node ID columns suitably indexed?
(1) One approach to the problem is to use PL/SQL. Bulk collect the IDs to be deleted, using a hierarchical query that returns the leaf rows first, into an array; then do a bulk delete (FORALL) using the array.
(2) Another approach is a soft-delete - mark the rows as "deleted", but never actually delete them. You would need to modify your application (or use Oracle VPD to automatically omit the "deleted" rows from queries). This might work reasonably well if deleting a node is relatively rare; but if you're routinely deleting lots of nodes then this would clutter the table with a lot of old data.
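Approach (1) can be sketched outside Oracle to show the ordering idea. The following is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in: a recursive CTE plays the role of the hierarchical query, and executemany plays the role of the FORALL bulk delete. The table name nodes and the function delete_branch are assumptions; real PL/SQL would use BULK COLLECT and FORALL instead.

```python
import sqlite3


def delete_branch(conn, root_id):
    """Delete root_id and all of its descendants, deepest rows first."""
    # Collect the branch with its depth so that leaves sort before their parents.
    ids = conn.execute(
        """
        WITH RECURSIVE branch(i_node, depth) AS (
            SELECT i_node, 0 FROM nodes WHERE i_node = ?
            UNION ALL
            SELECT n.i_node, b.depth + 1
            FROM nodes n JOIN branch b ON n.i_parent_node = b.i_node
        )
        SELECT i_node FROM branch ORDER BY depth DESC
        """,
        (root_id,),
    ).fetchall()
    # One batched call instead of one round trip per row.
    conn.executemany("DELETE FROM nodes WHERE i_node = ?", ids)
    return len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute(
    """CREATE TABLE nodes (
           i_node INTEGER PRIMARY KEY,
           i_parent_node INTEGER REFERENCES nodes(i_node)
       )"""
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [(1, None), (2, 1), (3, 2), (4, 2), (5, 1)],
)
deleted = delete_branch(conn, 2)
remaining = [row[0] for row in conn.execute("SELECT i_node FROM nodes ORDER BY i_node")]
```

Because children are always deleted before their parent, the self-referencing foreign key is never violated, even with the constraint checked immediately.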

Related

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table. What is the best way to solve this when using Cassandra ?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this ?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The query SELECT * FROM posts ORDER BY creation; will result in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement "I need to get all posts regardless of the username, in order of time" is very hard to achieve in a distributed system. It requires you to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1 requires a full table scan: until you have fetched all records, the ordering cannot be established. The alternative is to use a Cassandra clustering column to order at insertion time, but that means all posts are stored in the same partition, and that partition will grow forever ...
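The three steps can be sketched in a few lines of Python. This is only an illustration of what the coordinator has to do, not how Cassandra is implemented; the function name latest_posts and the (creation, username, content) tuple layout are assumptions.

```python
import heapq
from itertools import islice


def latest_posts(partitions, n):
    """Merge per-user result sets (each sorted newest first) and keep the top n."""
    # Every partition has to be read before a global order exists.
    merged = heapq.merge(*partitions, key=lambda post: post[0], reverse=True)
    return list(islice(merged, n))


# Each partition is one user's posts, already ordered by the clustering key (DESC).
luke = [(5, "luke", "e"), (2, "luke", "b")]
leia = [(4, "leia", "d"), (1, "leia", "a")]
han = [(3, "han", "c")]
top3 = latest_posts([luke, leia, han], 3)
```

The merge itself is cheap; the cost is that the input list has to contain one entry per user in the whole cluster.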
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
Question 1:
Depending on your use case I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the partition key a year, year-month, or year-month-day depending on your use case (or finer time intervals).
The basic idea is that you bucket changes by whatever suits your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
);
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can insert the partition key (PK) columns creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update1');
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now(), 'fromanator', 'content update2');
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
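Generating the timeUUID in the application and deriving the partition key from it could look roughly like the following Python sketch. It relies only on the standard library (uuid.uuid1() produces a version-1, time-based UUID). Bucketing in UTC is an assumption, as are the helper names; how the values are bound to the INSERT depends on your driver and is not shown.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(u):
    """Recover the UTC creation time embedded in a version-1 UUID."""
    return datetime.fromtimestamp((u.time - UUID_EPOCH_OFFSET) / 1e7, tz=timezone.utc)


def partition_key(created_at):
    """Return the (creation_year, creation_month, creation_day) bucket, in UTC."""
    utc = created_at.astimezone(timezone.utc)
    return (utc.year, utc.month, utc.day)


# Generate the clustering value first, then derive the bucket from the same instant.
post_id = uuid.uuid1()
created_at = timeuuid_to_datetime(post_id)
params = partition_key(created_at) + (post_id, "fromanator", "content update1")

# A post written late in the evening, US Central, lands in the next UTC day.
late = datetime(2016, 4, 2, 23, 30, tzinfo=timezone(timedelta(hours=-6)))
late_key = partition_key(late)  # (2016, 4, 3)
```

Deriving both values from one UUID keeps the bucket and the clustering column consistent, which matters for posts created close to midnight.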
For a usage case using this table, let's say you wanted to see all of the changes today, your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 17:00-0600');
minTimeuuid() is another CQL function. It creates the smallest possible timeUUID for the given time, which guarantees that you get all of the changes from that time onward.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
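Querying several partition keys mostly comes down to enumerating the buckets that cover the time span and issuing one query per bucket. A minimal Python sketch, assuming the year-month-day bucketing used above (the function name day_buckets is made up for illustration):

```python
from datetime import date, timedelta


def day_buckets(start, end):
    """List the (year, month, day) partition keys covering start..end inclusive."""
    buckets = []
    current = start
    while current <= end:
        buckets.append((current.year, current.month, current.day))
        current += timedelta(days=1)
    return buckets


# Three partitions have to be queried to cover 31 March to 2 April 2016.
keys = day_buckets(date(2016, 3, 31), date(2016, 4, 2))
```

Each tuple supplies the creation_year, creation_month and creation_day values for one SELECT; the per-day results are already in clustering order, so they can simply be concatenated.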
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly, if you're not on Cassandra 3.x+ or don't want to use materialized views, you can use atomic batches to ensure data consistency across your several denormalized tables (that's what they were designed for). In your case it would be a BATCH statement with 3 inserts of the same data into the 3 different tables that support your query patterns.
The solution is to create additional tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you need a column to group the rows by, for example month and year: PRIMARY KEY((year, month), timestamp). Cassandra will then read faster because it does not need to scan the whole cluster to get the data, and it also reduces data transfer between nodes.
The same goes for SELECT * FROM posts WHERE username='luke' ORDER BY content;: you must create another table for this query too. All columns can be the same as in your first table but with a different primary key, because you cannot order by a column that is not a clustering column.

Oracle: Index organised table with null values

I have a table which is basically a tree structure with a column parent_id and id.
parent_id is null for root nodes.
There is also a self referential foreign key, so that every parent_id has a corresponding id.
This table is mainly read-only with mostly infrequent batch updates.
One of the most common queries from the application which accesses this table is select ... where parent_id = X. I thought this might be faster if this table was index organised on parent_id.
However, I'm not sure how to index organise this table if parent_id can be null. I'd rather not fudge things so that parent_id=0 is some special id, as I'd have to add dummy values to the table to ensure the foreign key constraints are satisfied, and it also changes the application logic.
Is there any way to index organise a table by possible null value columns?
Solution from question asker:
I found I could get the same benefits from index organisation just by adding the queried columns to the end of the parent_id index, i.e. instead of:
create index foo_idx on foo_tab(parent_id);
I do:
create index foo_idx on foo_tab(parent_id, col1, col2, col3);
Where col1, col2, col3 etc are frequently accessed columns.
I've only done this with indexes which are used to return multiple rows which benefit from the ordering and hence disk locality provided by the index, instead of having to jump around the table. Indexes which are generally used to return single rows I've left to reference the table, as there is only one row to read anyway so locality matters much less.
Like I mentioned, this is a mainly read table, and also space is not a huge concern, so I don't think the overhead to writes caused by these indexes is a big concern.
(I realise this won't index null parent_ids, but instead I've made another index on decode(parent_id, null, 1, null) which indexes nulls and only nulls).
I would try adding the index on the single column parent_id.
If all of the indexed columns of a row are null, then that row does not appear in the index.
So for the parent_id = X you cite above, this should use the index. However, if you're doing parent_id is null, then it won't use the index, and you'll be getting the same performance as you have now. This sounds like behaviour that would suit you.
I have used this in the past to improve the performance of queries. It works particularly well if the number of items in the index is small compared to the number of rows in the table. We had about 3% of our rows in this particular index, and it flew :-)
But, as always, you need to try it and measure the difference in performance. Your mileage may vary.

DB project - improving performance with relationships

I have two tables, let's call them TableA and TableB. One record in TableA is related to one or more records in TableB. Among those TableB records there is one special record for each TableA record (for example, the one with the lowest ID), and I want quick access to that special one. Data is not deleted from either table; it is a kind of history that is rarely cleared. What is the best way to do this in terms of performance?
I thought of:
1) two-way relationship, but it will affect insert performance
2) design next table, with primary key as FK_TableA (for TableA record exactly one is "special") and second column FK_TableB and then create view
3) design next table, with primary key as FK_TableA, FK_TableB, make FK_TableA unique and then create view
I'm open to any other ideas :)
4) I'd consider an indexed view to hide the JOIN and row restriction
This is similar to your options 2 and 3 but the DB engine will maintain it for you. With a new table you'll either compromise data integrity or have to manage the data via triggers

Is an Index Organized Table appropriate here?

I recently was reading about Oracle Index Organized Tables (IOTs) but am not sure I quite understand WHEN to use them. So I have a small table:
create table categories
(
id VARCHAR2(36),
group VARCHAR2(100),
category VARCHAR2(100)
);
create unique index (group, category, id) COMPRESS 2;
The id column is a foreign key from another table entries and my common query is:
select e.id, e.time, e.title from entries e, categories c where e.id=c.id AND e.group=? AND c.category=? ORDER by e.time
The entries table is indexed properly.
Both of these tables have millions (16M currently) of rows and currently this query really stinks (note: I have it wrapped in a pagination query also so I only get back the first 20, but for simplicity I omitted that).
Since I am basically indexing the entire table, does it make sense to create this table as an IOT?
EDIT by popular demand:
create table entries
(
id VARCHAR2(36),
time TIMESTAMP,
group VARCHAR2(100),
title VARCHAR2(500),
....
)
create index (group, time) compress 1;
I don't think my real question depends on this, though. Basically, if you have a table with few columns (3 in this example) and you are planning on putting a composite index on all three columns, is there any reason not to use an IOT?
IOTs are great for a number of purposes, including this case where you're gonna have an index on all (or most) of the columns anyway - but the benefit only materialises if you don't have the extra index - the idea is that the table itself is an index, so put the columns in the order that you want the index to be in. In your case, you're accessing category by id, so it makes sense for that to be the first column. So effectively you've got an index on (id, group, category). I don't know why you'd want an additional index on (group, category, id).
Your query:
SELECT e.id, e.time, e.title
FROM entries e, categories c
WHERE e.id=c.id AND e.group=? AND c.category=?
ORDER by e.time
You're joining the tables by ID, but you have no index on entries.id - so the query is probably doing a hash or sort merge join. I wouldn't mind seeing a plan for what your system is doing now to confirm.
If you're doing a pagination query (i.e. only interested in a small number of rows) you want to get the first rows back as quick as possible; for this to happen you'll probably want a nested loop on entries, e.g.:
NESTED LOOPS
ACCESS TABLE BY ROWID - ENTRIES
INDEX RANGE SCAN - (index on ENTRIES.group,time)
ACCESS TABLE BY ROWID - CATEGORIES
INDEX RANGE SCAN - (index on CATEGORIES.ID)
Since the join to CATEGORIES is on ID, you'll want an index on ID; if you make it an IOT, and make ID the leading column, that might be sufficient.
The performance of the plan I've shown above will be dependent on how many rows match the given "group" - i.e. how selective an average "group" is.
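To see why this plan suits pagination, and why its cost depends on selectivity, here is a rough Python model of the nested loop. The dictionaries stand in for the two indexes and are assumptions made for illustration; this is not what Oracle executes internally.

```python
def first_page(entries_by_group_time, categories_by_id, group, category, page_size):
    """Return the first page_size matching entries, probing row by row."""
    page = []
    probes = 0
    # Outer loop: range scan on (group, time); rows arrive already ordered by time.
    for entry in entries_by_group_time.get(group, []):
        probes += 1
        # Inner probe: lookup on CATEGORIES by id.
        if category in categories_by_id.get(entry["id"], ()):
            page.append(entry)
            if len(page) == page_size:
                break  # stop early, no sort of the full result is needed
    return page, probes


entries_by_group_time = {
    "news": [
        {"id": "a", "time": 1, "title": "first"},
        {"id": "b", "time": 2, "title": "second"},
        {"id": "c", "time": 3, "title": "third"},
        {"id": "d", "time": 4, "title": "fourth"},
    ]
}
categories_by_id = {"a": {"sport"}, "b": {"tech"}, "c": {"sport"}, "d": {"sport"}}
page, probes = first_page(entries_by_group_time, categories_by_id, "news", "sport", 2)
```

If few entries in the group match the category, the loop has to probe many rows before the page fills up, which is the selectivity concern mentioned above.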
Have you looked at dba-oracle.com, asktom.com, IOUG, another asktom.com?
There are penalties to pay for IOTs - e.g., poorer insert performance
Can you prototype it and compare performance?
Also, perhaps you might want to consider a hash cluster.
IOTs are a trade-off. You gain access performance at the cost of insert/update performance. We typically use them for reference data that is batch loaded daily and not updated during the day. This is not to say it's the only way to use them, just how we use them.
Few things here:
You mention pagination - have you considered the first_rows hint?
Is that the order your index is in, with group as the first field? If so I'd consider moving ID to be the first column since that index will not be used.
Foreign keys should have an index on the column. Consider adding an index on the foreign key (id column).
Are you sure it's not the ORDER BY causing slowness?
What version of Oracle are you using?
I ASSUME there is a primary key on table entries for field id, correct?
Why does the WHERE condition not include "c.group = e.group"?
Try to:
Remove the order by condition
Change the index definition from "create unique index (group, category, id)" to "create unique index (id, group, category)"
Reorganise table categories as an IOT on (group, category, id)
Reorganise table categories as an IOT on (id, group, category)
In each of the above cases, use EXPLAIN PLAN to review the cost.

Unindexed Foreign Key leads to TM Enqueue Contention

So we've been told that one source of TM Enq contention can be unindexed FK's. My question is which one.
I have an INSERT INTO Table_B that is recording TM Enq Wait.
It contains a PK that is the parent to other tables and it has columns that are FK constrained to other PKs.
So which FKs need to be indexed: that table's columns or its children's?
NB: I know that this isn't the only cause of TM contention. If unindexed FKs can't possibly be the cause here, can you explain why?
Not sure about Oracle TM Contention, but I'd say normally both sides of a foreign key relation are indexed. Otherwise, the database will have to do table scans.
The index on the parent record is used whenever you insert a new child record, to verify that the parent exists. Often this is a primary key as well, so of course has an index.
The index on the child record is used whenever you change or delete a parent record, to perform cascades (including refusing the update/delete).
The indices on both sides also give the database a good chance of doing fast (indexed) joins, no matter which side its optimizer prefers to come from.
EDIT: Having Googled TM contention, it sounds like you're probably missing the keys on the child records. But make sure to have them on both sides, really.
EDIT 2: Answering the comment,
If you have a OLTP table that has 13 FKs to lookup tables, I'm not
keen on 13 index updates in addition to the table, pk and any other
indexes. An index is important but for specific reasons. If you never
update the parent PK nor delete from the parent, the child index is
not so useful. Or is it?
Depends on the joins and queries you're running, then. E.g., if you run
a query like:
SELECT o.something
FROM oltp_tab o JOIN lookup l ON (o.lookup_no = l.lookup_no)
WHERE l.lookup_name = ?
then the query optimizer would probably like the index on the child
records.
Also, according to http://ashmasters.com/waits/enq-tm-contention/ you
pretty much need to have the indices if you change the parent tables at
all. Apparently you get them from having concurrent changes to the
parent and child tables, unless you have the index. So this is probably
what you're seeing (assuming you're not doing the obvious things, like
updating the referred to columns or deleting rows)
The parent (referenced) column of an enabled foreign key relationship has to be indexed because it has to have an enabled unique or primary key constraint on it.
What mode of TM Enqueue are you seeing?

Resources