Is an Index Organized Table appropriate here? - oracle

I recently was reading about Oracle Index Organized Tables (IOTs) but am not sure I quite understand WHEN to use them. So I have a small table:
create table categories
(
id VARCHAR2(36),
group VARCHAR2(100),
category VARCHAR2(100
)
create unique index (group, category, id) COMPRESS 2;
The id column is a foreign key from another table entries and my common query is:
select e.id, e.time, e.title from entries e, categories c where e.id=c.id AND e.group=? AND c.category=? ORDER by e.time
The entries table is indexed properly.
Both of these tables have millions (16M currently) of rows and currently this query really stinks (note: I have it wrapped in a pagination query also so I only get back the first 20, but for simplicity I omitted that).
Since I am basically indexing the entire table, does it make sense to create this table as an IOT?
EDIT by popular demand:
create table entries
(
id VARCHAR2(36),
time TIMESTAMP,
group VARCHAR2(100),
title VARCHAR2(500),
....
)
create index (group, time) compress 1;
My real question I dont think depends on this though. Basically if you have a table with few columns (3 in this example) and you are planning on putting a composite index on all three rows is there any reason not to use an IOT?

IOTs are great for a number of purposes, including this case where you're gonna have an index on all (or most) of the columns anyway - but the benefit only materialises if you don't have the extra index - the idea is that the table itself is an index, so put the columns in the order that you want the index to be in. In your case, you're accessing category by id, so it makes sense for that to be the first column. So effectively you've got an index on (id, group, category). I don't know why you'd want an additional index on (group, category, id).
Your query:
SELECT e.id, e.time, e.title
FROM entries e, categories c
WHERE e.id=c.id AND e.group=? AND c.category=?
ORDER by e.time
You're joining the tables by ID, but you have no index on entries.id - so the query is probably doing a hash or sort merge join. I wouldn't mind seeing a plan for what your system is doing now to confirm.
If you're doing a pagination query (i.e. only interested in a small number of rows) you want to get the first rows back as quick as possible; for this to happen you'll probably want a nested loop on entries, e.g.:
NESTED LOOPS
ACCESS TABLE BY ROWID - ENTRIES
INDEX RANGE SCAN - (index on ENTRIES.group,time)
ACCESS TABLE BY ROWID - CATEGORIES
INDEX RANGE SCAN - (index on CATEGORIES.ID)
Since the join to CATEGORIES is on ID, you'll want an index on ID; if you make it an IOT, and make ID the leading column, that might be sufficient.
The performance of the plan I've shown above will be dependent on how many rows match the given "group" - i.e. how selective an average "group" is.

Have you looked at dba-oracle.com, asktom.com, IOUG, another asktom.com?
There are penalties to pay for IOTs - e.g., poorer insert performance
Can you prototype it and compare performance?
Also, perhaps you might want to consider a hash cluster.

IOT's are a trade off. You are getting access performance for decreased insert/update performance. We typically use them for reference data that is batch loaded daily and not updated during the day. This is not to say it's the only way to use them, just how we use them.
Few things here:
You mention pagination - have you considered the first_rows hint?
Is that the order your index is in, with group as the first field? If so I'd consider moving ID to be the first column since that index will not be used.
foreign keys should have an index on the column. Consider addind an index on the foreign key (id column).
Are you sure it's not the ORDER BY causing slowness?

What version of Oracle are you using?
I ASSUME there is a primary key on table entries for field id, correct?
Why the WHERE condition does not include "c.group = e.group" ?
Try to:
Remove the order by condition
Change the index definition from "create unique index (group,
category, id)" to "create unique index (id, group, category)"
Reorganise table categories as an IOT on (group, category, id)
Reorganise table categories as an IOT on (id, group, category)
In each of the above case use EXPLAIN PLAN to review the cost

Related

Query a table in different ways or orderings in Cassandra

I've recently started to play around with Cassandra. My understanding is that in a Cassandra table you define 2 keys, which can be either single column or composites:
The Partitioning Key: determines how to distribute data across nodes
The Clustering Key: determines in which order the records of a same partitioning key (i.e. within a same node) are written. This is also the order in which the records will be read.
Data from a table will always be sorted in the same order, which is the order of the clustering key column(s). So a table must be designed for a specific query.
But what if I need to perform 2 different queries on the data from a table. What is the best way to solve this when using Cassandra ?
Example Scenario
Let's say I have a simple table containing posts that users have written :
CREATE TABLE posts (
username varchar,
creation timestamp,
content varchar,
PRIMARY KEY ((username), creation)
);
This table was "designed" to perform the following query, which works very well for me:
SELECT * FROM posts WHERE username='luke' [ORDER BY creation DESC];
Queries
But what if I need to get all posts regardless of the username, in order of time:
Query (1): SELECT * FROM posts ORDER BY creation;
Or get the posts in alphabetical order of the content:
Query (2): SELECT * FROM posts WHERE username='luke' ORDER BY content;
I know that it's not possible given the table I created, but what are the alternatives and best practices to solve this ?
Solution Ideas
Here are a few ideas spawned from my imagination (just to show that at least I tried):
Querying with the IN clause to select posts from many users. This could help in Query (1). When using the IN clause, you can fetch globally sorted results if you disable paging. But using the IN clause quickly leads to bad performance when the number of usernames grows.
Maintaining full copies of the table for each query, each copy using its own PRIMARY KEY adapted to the query it is trying to serve.
Having a main table with a UUID as partitioning key. Then creating smaller copies of the table for each query, which only contain the (key) columns useful for their own sort order, and the UUID for each row of the main table. The smaller tables would serve only as "sorting indexes" to query a list of UUID as result, which can then be fetched using the main table.
I'm new to NoSQL, I would just want to know what is the correct/durable/efficient way of doing this.
The SELECT * FROM posts ORDER BY creation; will results in a full cluster scan because you do not provide any partition key. And the ORDER BY clause in this query won't work anyway.
Your requirement I need to get all posts regardless of the username, in order of time is very hard to achieve in a distributed system, it supposes to:
fetch all user posts and move them to a single node (coordinator)
order them by date
take top N latest posts
Point 1. require a full table scan. Indeed as long as you don't fetch all records, the ordering can not be achieve. Unless you use Cassandra clustering column to order at insertion time. But in this case, it means that all posts are being stored in the same partition and this partition will grow forever ...
Query SELECT * FROM posts WHERE username='luke' ORDER BY content; is possible using a denormalized table or with the new materialized view feature (http://www.doanduyhai.com/blog/?p=1930)
Question 1:
Depending on your use case I bet you could model this with time buckets, depending on the range of times you're interested in.
You can do this by making the primary key a year,year-month, or year-month-day depending on your use case (or finer time intervals)
The basic idea is that you bucket changes for what suites your use case. For example:
If you often need to search these posts over months in the past, then you may want to use the year as the PK.
If you usually need to search the posts over several days in the past, then you may want to use a year-month as the PK.
If you usually need to search the post for yesterday or a couple of days, then you may want to use a year-month-day as your PK.
I'll give a fleshed out example with yyyy-mm-dd as the PK:
The table will now be:
CREATE TABLE posts_by_creation (
creation_year int,
creation_month int,
creation_day int,
creation timeuuid,
username text, -- using text instead of varchar, they're essentially the same
content text,
PRIMARY KEY ((creation_year,creation_month,creation_day), creation)
)
I changed creation to be a timeuuid to guarantee a unique row for each post creation event. If we used just a timestamp you could theoretically overwrite an existing post creation record in here.
Now we can then insert the Partition Key (PK): creation_year, creation_month, creation_day based on the current creation time:
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update1';
INSERT INTO posts_by_creation (creation_year, creation_month, creation_day, creation, username, content) VALUES (2016, 4, 2, now() , 'fromanator', 'content update2';
now() is a CQL function to generate a timeUUID, you would probably want to generate this in the application instead, and parse out the yyyy-mm-dd for the PK and then insert the timeUUID in the clustered column.
For a usage case using this table, let's say you wanted to see all of the changes today, your CQL would look like:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2;
Or if you wanted to find all of the changes today after 5pm central:
SELECT * FROM posts_by_creation WHERE creation_year = 2016 AND creation_month = 4 AND creation_day = 2 AND creation >= minTimeuuid('2016-04-02 5:00-0600') ;
minTimeuuid() is another cql function, it will create the smallest possible timeUUID for the given time, this will guarantee that you get all of the changes from that time.
Depending on the time spans you may need to query a few different partition keys, but it shouldn't be that hard to implement. Also you would want to change your creation column to a timeuuid for your other table.
Question 2:
You'll have to create another table or use materialized views to support this new query pattern, just like you thought.
Lastly if your not on Cassandra 3.x+ or don't want to use materialized views you can use Atomic batches to ensure data consistency across your several de-normalized tables (that's what it was designed for). So in your case it would be a BATCH statement with 3 inserts of the same data to 3 different tables that support your query patterns.
The solution is to create another tables to support your queries.
For SELECT * FROM posts ORDER BY creation;, you may need some special column for grouping it, maybe by month and year, e.g. PRIMARY KEY((year, month), timestamp) this way the cassandra will have a better performance on read because it doesn't need to scan the whole cluster to get all data, it will also save the data transfer between nodes too.
Same as SELECT * FROM posts WHERE username='luke' ORDER BY content;, you must create another table for this query too. All column may be same as your first table but with the different Primary Key, because you cannot order by the column that is not the clustering column.

Best way to identify a handful of records expected to have a flag set to TRUE

I have a table that I expect to get 7 million records a month on a pretty wide table. A small portion of these records are expected to be flagged as "problem" records.
What is the best way to implement the table to locate these records in an efficient way?
I'm new to Oracle, but is a materialized view an valid option? Are there such things in Oracle such as indexed views or is this potentially really the same thing?
Most of the reporting is by month, so partitioning by month seems like an option, but a "problem" record may be lingering for several months theorectically. Otherwise, the reporting shuold be mostly for the current month. Would you expect that querying across all month partitions to locate any problem record would cause significant performance issues compared to usinga single table?
Your general thoughts of where to start would be appreciated. I realize I need to read up and I'll do that but I wanted to get the community thought first to make sure I read the right stuff.
One more thought: The primary key is a GUID varchar2(36). In order of magnitude, how much of a performance hit would you expect this to be relative to using a NUMBER data type PK? This worries me but it is out of my control.
It depends what you mean by "flagged", but it sounds to me like you would benefit from a simple index, function based index, or an indexed virtual column.
In all cases you should be careful to ensure that all the index columns are NULL for rows that do not need to be flagged. This way your index will contain only the rows that are flagged (Oracle does not - by default - index rows in B-Tree indexes where all index column values are NULL).
Your primary key being a VARCHAR2 GUID should make no difference, at least with regards to the specific flagging of rows in this question, indexes will point to rows via Oracle internal ROWIDs.
Indexes support partitioning, so if your data is already partitioned, your index could be set to match.
Simple column index method
If you can dictate how the flagging works, or the column already exists, then I would simply add an index to it like so:
CREATE INDEX my_table_problems_idx ON my_table (problem_flag)
/
Function-based index method
If the data model is fixed / there is no flag column, then you can create a function-based index assuming that you have all the information you need in the target table. For example:
CREATE INDEX my_table_problems_fnidx ON my_table (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
Now if you use the same logic in your SELECT statement, you should find that it uses the index to efficiently match rows.
SELECT *
FROM my_table
WHERE CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END IS NOT NULL
/
This is a bit clunky though, and it requires you to use the same logic in queries as the index definition. Not great. You could use a view to mask this, but you're still duplicating logic in at least two places.
Indexed virtual column
In my opinion, this is the best way to do it if you are computing the value dynamically (available from 11g onwards):
ALTER TABLE my_table
ADD virtual_problem_flag VARCHAR2(1) AS (
CASE
WHEN amount > 100 THEN 'Y'
ELSE NULL
END
)
/
CREATE INDEX my_table_problems_idx ON my_table (virtual_problem_flag)
/
Now you can just query the virtual column as if it were a real column, i.e.
SELECT *
FROM my_table
WHERE virtual_problem_flag = 'Y'
/
This will use the index and puts the function-based logic into a single place.
Create a new table with just the pks of the problem rows.

oracle- index organized table

what is use-case of IOT (Index Organized Table) ?
Let say I have table like
id
Name
surname
i know the IOT but bit confuse about the use case of IOT
Your three columns don't make a good use case.
IOT are most useful when you often access many consecutive rows from a table. Then you define a primary key such that the required order is represented.
A good example could be time series data such as historical stock prices. In order to draw a chart of the stock price of a share, many rows are read with consecutive dates.
So the primary key would be stock ticker (or security ID) and the date. The additional columns could be the last price and the volume.
A regular table - even with an index on ticker and date - would be much slower because the actual rows would be distributed over the whole disk. This is because you cannot influence the order of the rows and because data is inserted day by day (and not ticker by ticker).
In an index-organized table, the data for the same ticker ends up on a few disk pages, and the required disk pages can be easily found.
Setup of the table:
CREATE TABLE MARKET_DATA
(
TICKER VARCHAR2(20 BYTE) NOT NULL ENABLE,
P_DATE DATE NOT NULL ENABLE,
LAST_PRICE NUMBER,
VOLUME NUMBER,
CONSTRAINT MARKET_DATA_PK PRIMARY KEY (TICKER, P_DATE) ENABLE
)
ORGANIZATION INDEX;
Typical query:
SELECT TICKER, P_DATE, LAST_PRICE, VOLUME
FROM MARKET_DATA
WHERE TICKER = 'MSFT'
AND P_DATE BETWEEN SYSDATE - 1825 AND SYSDATE
ORDER BY P_DATE;
Think of index organized tables as indexes. We all know the point of an index: to improve access speeds to particular rows of data. This is a performance optimisation of trick of building compound indexes on sub-sets of columns which can be used to satisfy commonly-run queries. If an index can completely satisy the columns in a query's projection the optimizer knows it doesn't have to read from the table at all.
IOTs are just this approach taken to its logical confusion: buidl the index and throw away the underlying table.
There are two criteria for deciding whether to implement a table as an IOT:
It should consists of a primary key (one or more columns) and at most one other column. (okay, perhaps two other columns at a stretch, but it's an warning flag).
The only access route for the table is the primary key (or its leading columns).
That second point is the one which catches most people out, and is the main reason why the use cases for IOT are pretty rare. Oracle don't recommend building other indexes on an IOT, so that means any access which doesn't drive from the primary key will be a Full Table Scan. That might not matter if the table is small and we don't need to access it through some other path very often, but it's a killer for most application tables.
It is also likely that a candidate table will have a relatively small number of rows, and is likely to be fairly static. But this is not a hard'n'fast rule; certainly a huge, volatile table which matched the two criteria listed above could still be considered for implementations as an IOT.
So what makes a good candidate dor index organization? Reference data. Most code lookup tables are like something this:
code number not null primary key
description not null varchar2(30)
Almost always we're only interested in getting the description for a given code. So building it as an IOT will save space and reduce the access time to get the description.

Oracle: Index organised table with null values

I have a table which is basically a tree structure with a column parent_id and id.
parent_id is null for root nodes.
There is also a self referential foreign key, so that every parent_id has a corresponding id.
This table is mainly read-only with mostly infrequent batch updates.
One of the most common queries from the application which accesses this table is select ... where parent_id = X. I thought this might be faster if this table was index organised on parent_id.
However, I'm not sure how to index organise this table if parent_id can be null. I'd rather not fudge things so that parent_id=0 is some special id, as I'd have to add dummy values to the table to ensure the foreign key constraints are satisfied, and it also changes the application logic.
Is there any way to index organise a table by possible null value columns?
Solution from question asker:
I found I could get the same benefits from index organisation just by adding the queried columns to the end of the parent_id index, i.e. instead of:
create index foo_idx on foo_tab(parent_id);
I do:
create index foo_idx on foo_tab(parent_id, col1, col2, col3);
Where col1, col2, col3 etc are frequently accessed columns.
I've only done this with indexes which are used to return multiple rows which benefit from the ordering and hence disk locality provided by the index, instead of having to jump around the table. Indexes which are generally used to return single rows I've left to reference the table, as there is only one row to read anyway so locality matters much less.
Like I mentioned, this is a mainly read table, and also space is not a huge concern, so I don't think the overhead to writes caused by these indexes is a big concern.
(I realise this won't index null parent_ids, but instead I've made another index on decode(parent_id, null, 1, null) which indexes nulls and only nulls).
I would try adding the index on the single column parent_id.
If all of the columns in your index are non-null, then this row does not appear in your index.
So for the parent_id = X you cite above, this should use the index. However, if you're doing parent_id is null, then it won't use the index, and you'll be getting the same performance as you have now. This sounds like behaviour that would suit you.
I have used this in the past to improve the performance of queries. It works particulalry well if the number of items in the index is small compared to the number of rows in the database. We had about 3% of our rows in this particular index, and it flew :-)
But, as always, you need to try it and measure the difference in performance. Your mileage may vary.

Is an index clustered or unclustered in Oracle?

How can I determine if an Oracle index is clustered or unclustered?
I've done
select FIELD from TABLE where rownum <100
where FIELD is the field on which is built the index. I have ordered tuples, but the result is wrong because the index is unclustered.
By default all indexes in Oracle are unclustered. The only clustered indexes in Oracle are the Index-Organized tables (IOT) primary key indexes.
You can determine if a table is an IOT by looking at the IOT_TYPE column in the ALL_TABLES view (its primary key could be determined by querying the ALL_CONSTRAINTS and ALL_CONS_COLUMNS views).
Here are some reasons why your query might return ordered rows:
Your table is index-organized and FIELD is the leading part of its primary key.
Your table is heap-organized but the rows are by chance ordered by FIELD, this happens sometimes on an incrementing identity column.
Case 2 will return sorted rows only by chance. The order of the inserts is not guaranteed, furthermore Oracle is free to reuse old blocks if some happen to have available space in the future, disrupting the fragile ordering.
Case 1 will most of the time return ordered rows, however you shouldn't rely on it since the order of the rows returned depends upon the algorithm of the access path which may change in the future (or if you change DB parameter, especially parallelism).
In both case if you want ordered rows you should supply an ORDER BY clause:
SELECT field
FROM (SELECT field
FROM TABLE
ORDER BY field)
WHERE rownum <= 100;
There is no concept of a "clustered index" in Oracle as in SQL Server and Sybase. There is an Index-Organized Table, which is similar but not the same.
"Clustered" indices, as implemented in Sybase, MS SQL Server and possibly others, where rows are physically stored in the order of the indexed column(s) don't exist as such in Oracle. "Cluster" has a different meaning in Oracle, relating, I believe, to the way blocks and tables are organized.
Oracle does have "Index Organized Tables", which are physically equivalent, but they're used much less frequently because the query optimizer works differently.
The closest I can get to an answer to the identification question is to try something like this:
SELECT IOT_TYPE FROM user_tables
WHERE table_name = '<your table name>'
My 10g instance reports IOT or null accordingly.
Index Organized Tables have to be organized on the primary key. Where the primary key is a sequence generated value this is often useless or even counter-productive (because simultaneous inserts get into conflict for the same block).
Single table clusters can be used to group data with the same column value in the same database block(s). But they are not ordered.

Resources