Given an address table with millions of entries, partitioned by state_cd:
CREATE TABLE ADDRESSES
(
ADDRESS_ID INT NOT NULL PRIMARY KEY,
STATE_CD VARCHAR2(2) NOT NULL,
CITY VARCHAR2(100) NOT NULL
)
PARTITION BY LIST(STATE_CD)
(PARTITION P_AL VALUES ('AL'),...., PARTITION P_WY VALUES ('WY'));
And a measurements table with millions of measurements per day, range-partitioned by date:
CREATE TABLE MEASUREMENTS
(
MEASUREMENT_ID INT NOT NULL PRIMARY KEY,
MEASUREMENT_DT DATE NOT NULL,
ADDRESS_ID INT NOT NULL,
MEASUREMENT NUMBER NOT NULL,
CONSTRAINT MEASUREMENTS_FK1 FOREIGN KEY (ADDRESS_ID)
REFERENCES ADDRESSES (ADDRESS_ID)
)
PARTITION BY RANGE (MEASUREMENT_DT)
(PARTITION P_20150101 VALUES LESS THAN (DATE '2015-01-02'),...);
Many queries would be greatly improved with partition-wise joining of MEASUREMENTS and ADDRESSES, e.g.:
SELECT TRUNC(MEASUREMENT_DT) MNTH,
CITY,
AVG(MEASUREMENT)
FROM MEASUREMENTS
JOIN ADDRESSES USING (ADDRESS_ID)
GROUP BY TRUNC(MEASUREMENT_DT), CITY;
However, adding STATE_CD to MEASUREMENTS is an unacceptable violation of normal form, and it introduces entirely new performance issues of its own: for example, ADDRESSES JOIN MEASUREMENTS USING (ADDRESS_ID, STATE_CD) really messes with the CBO's cardinality estimates.
The optimal solution would be sub-partitioning by reference, which Oracle does not support:
CREATE TABLE MEASUREMENTS ...
PARTITION BY RANGE (MEASUREMENT_DT)
SUBPARTITION BY REFERENCE (MEASUREMENTS_FK1);
This seems like a fairly straightforward application of reference partitioning, yet not only is the syntax unsupported, I also couldn't find much forum activity asking for such a feature. That is what prompted this post: usually, when I'm looking for a feature that no one else seems to want, it means I've overlooked an equivalent (or potentially even better) alternative.
Related
I'm trying to make sure I'm getting the benefit of selecting from a partition when using reference partitions.
In normal partitions, I know you have to include the column(s) on which the partition is defined in order for Oracle to know it can just search one specific partition.
My question is, when I'm selecting from a reference-partitioned table, do I just need to include the column on which the reference foreign key is defined? Or do I need to join and include the parent table's column on which the partition is actually defined?
create table alpha (
name varchar2(240) not null,
partition_no number(14) not null,
constraint alpha_pk
primary key (name),
constraint alpha_c01
check (partition_no > 0)
)
partition by range(partition_no)
interval (1)
(partition empty values less than (1))
;
create table beta (
name varchar2(240) not null,
alpha_name varchar2(240) not null,
some_data number not null,
constraint beta_pk
primary key (name),
constraint beta_f01
foreign key (alpha_name)
references alpha (name)
)
partition by reference (beta_f01)
;
Assume the tables in production will have much more data in them, with hundreds of millions of rows in the beta table, but merely thousands per partition.
Is this all I need?
select b.some_data
from beta b
where b.alpha_name = 'Blah'
;
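One way to verify this yourself is to look at the Pstart/Pstop columns of the execution plan (a sketch against the alpha/beta tables above; 'Blah' is just an example value):
explain plan for
select b.some_data
from beta b
where b.alpha_name = 'Blah';

select * from table(dbms_xplan.display);
-- If pruning works, Pstart/Pstop show a single partition number (or KEY
-- when the partition is resolved at run time) rather than the full range.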
Thanks if anyone can verify this for me, or can explain anything else I'm missing with regard to properly creating indexes on reference-partitioned tables.
[Edit] Removed part of the example where clause that shouldn't have been there. The example is meant to represent reading the reference-partitioned table with just the reference-partitioning foreign key column in the where clause.
Consider the following table
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);
Suppose we have 100 million rows of random data in this table.
Select age from company where id=2855265
Executed in less than a millisecond.
Select age from company where id<353
Returns fewer than 50 rows and executed in less than a millisecond.
Both queries use the index.
But the following query uses a full table scan and takes 3 seconds:
Select age from company where id<2855265
It returns fewer than 500 rows.
How can I speed up a query that selects on the primary key being less than some value?
Performance
The predicate id < 2855265 potentially returns a large percentage of rows in the table. Unless Postgres has information in table statistics to expect only around 500 rows, it might switch from an index scan to a bitmap index scan or even a sequential scan. Explanation:
Postgres not using index when index scan is much better option
We would need to see the output from EXPLAIN (ANALYZE, BUFFERS) for your queries.
When you repeat the query, do you get the same performance? There may be caching effects.
Either way, 3 seconds is way too slow for 500 rows. Postgres might be working with outdated or inexact table statistics. Or there may be issues with your server configuration (not enough resources). Or there can be several other, less common reasons, including hardware issues ...
If VACUUM ANALYZE did not help, VACUUM FULL ANALYZE might. It effectively rewrites the whole table and all indexes in pristine condition. Takes an exclusive lock on the table and might conflict with concurrent access!
I would also consider increasing the statistics target for the id column. Instructions:
Keep PostgreSQL from sometimes choosing a bad query plan
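Putting that together, a minimal diagnostic sequence could look like this (a sketch against your company table; the statistics target of 1000 is just an illustrative value):
-- show the actual plan, timing and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT age FROM company WHERE id < 2855265;

-- refresh table statistics (and clean up dead tuples)
VACUUM ANALYZE company;

-- optionally gather more detailed statistics for the id column
ALTER TABLE company ALTER COLUMN id SET STATISTICS 1000;
ANALYZE company;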
Table definition?
Whatever else you do, there seem to be various problems with your table definition:
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL, -- int is probably enough. "id" is a terrible column name
NAME TEXT NOT NULL, -- "name" is a terrible column name
AGE INT NOT NULL, -- typically bad idea to store age, store birthday instead
ADDRESS CHAR(50), -- never use char(n)!
SALARY REAL -- why would a company have a salary? never store money as real
);
You probably want something like this instead:
CREATE TABLE employee(
  employee_id serial PRIMARY KEY
, company_id int NOT NULL -- REFERENCES company(company_id)?
, birthday date NOT NULL
, employee_name text NOT NULL
, address varchar(50) -- or just text
, salary int -- store amount as *Cents*
);
Related:
How to implement a many-to-many relationship in PostgreSQL?
Any downsides of using data type "text" for storing strings?
You will need to run VACUUM ANALYZE company; to update the planner statistics.
We receive several million records per day of temperature metrics. Most of the pertinent metadata for these records is maintained in a single table partitioned by date (month). We are going to start receiving up to 20 individual codes associated with this data. The intent is to ultimately allow searching by these codes.
What would be the most effective way to store this data to minimize search response time? The database is Oracle 11gR2.
Some options I was taking into consideration:
(1) Create a separate table with main record id and code values. Something like
id code
-- ----
1 AA
1 BB
1 CC
2 AA
2 CC
2 DD
Concerns:
this would probably require a bitmap index on the code column, but the table is highly transactional, so a bitmap index is not an option
table would get huge over time with up to 20 codes per main record
(2) Create a separate table partitioned by code values.
Concerns:
partition maintenance for new codes
search performance
(3) Add an XMLType column to the existing table, format the codes for each record as XML, and create an XMLIndex on the column. Something like:
<C1>AA</C1>
<C2>BB</C2>
<C3>CC</C3>
Concerns:
query response time when searching on CODE probably would be poor
Any recommendations are welcome.
Thanks.
You need to benchmark different approaches. There's no way we can give you meaningful solutions without knowing much more about your scenario. How many different codes will there be in total? What's the average number of codes per reading? Will there be a noticeable skew in the distribution of codes? What access paths do you need to support for searching by code?
Then there's the matter of how you load data (batches? drip feed?). And what benefits you derive from using partitioning.
Anyway. Here's one more approach, which is an amalgamation of your (1) and (2).
Given that your main table is partitioned by month you should probably partition any child table with the same scheme. You can subpartition by code as well.
create table main_codes
( reading_dt date not null
, main_id number not null
, code varchar2(2)
, constraint main_codes_pk primary key (code, reading_dt, main_id) using index local
)
partition by range (reading_dt)
subpartition by list (code)
subpartition template
(
subpartition sp_aa values ( 'AA' ),
subpartition sp_bb values ( 'BB' ),
subpartition sp_cc values ( 'CC' ),
subpartition sp_dd values ( 'DD' )
)
(
partition p_2015JAN values less than (date '2015-02-01' ),
partition p_2015FEB values less than (date '2015-03-01' ),
partition p_2015MAR values less than (date '2015-04-01' )
)
/
You'll probably want a foreign key back to the main table too:
alter table main_codes
add constraint code_main_fk foreign key (reading_dt, main_id)
references main_table (reading_dt, main_id)
/
create index code_main_idx on main_codes (reading_dt, main_id) local
/
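For illustration, a search by code might then look something like this (a sketch only; it assumes main_table exposes the reading_dt and main_id columns referenced by the foreign key). The predicates on code and reading_dt let the optimizer prune down to one list subpartition within the relevant monthly partition:
select m.*
from main_codes c
join main_table m
on m.reading_dt = c.reading_dt
and m.main_id = c.main_id
where c.code = 'AA'
and c.reading_dt >= date '2015-02-01'
and c.reading_dt < date '2015-03-01'
/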
Depending on the number of codes you have, creating the subpartition template could be tedious. This is why Nature gave us cut'n'paste.
But whatever you do, don't go down the XML path.
I am in the middle of designing a table which include two columns valid_from and valid_to to track historical changes. For example, my table structure is like below:
create table currency_data
(
currency_code varchar(16) not null,
currency_desc varchar(16) not null,
valid_from date not null,
valid_to date,
d_insert_date date,
d_last_update date,
constraint pk_currency_data primary key (currency_code, valid_from)
)
The idea is to leave valid_to blank to start with; if the currency_desc changes in the future, I will set valid_to to the date on which the old description stops being valid and create a new row with a new valid_from. But how can I ensure that there will never be an overlap between these two rows? For example, the query below should only ever yield one row.
select currency_desc
from currency_data
where currency_code = 'USD'
and trunc(sysdate) between valid_from and nvl(valid_to, sysdate)
Is there a better way to achieve this, other than making sure all developers/end users are aware of this rule? Many thanks.
There is a set of implementation approaches known as slowly changing dimensions (SCD) for handling this kind of storage.
What you are currently implementing is SCD Type 2; however, there are other types as well.
Regarding your possible interval overlap issue - there is no simple way to enforce this kind of table-level (rather than row-level) consistency with standard constraints, so a robust approach would be to restrict direct DML against this table and wrap it in a standardized PL/SQL API which enforces your rules prior to insert/update and which every developer must use.
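A minimal sketch of such an API is below (the procedure name is illustrative, and closing the old interval on the previous day is an assumption so that the inclusive BETWEEN query above still returns exactly one row):
create or replace procedure set_currency_desc
( p_currency_code in currency_data.currency_code%type
, p_new_desc      in currency_data.currency_desc%type
) as
begin
  -- close the currently open interval, if any, as of yesterday
  update currency_data
     set valid_to      = trunc(sysdate) - 1,
         d_last_update = sysdate
   where currency_code = p_currency_code
     and valid_to is null;

  -- open a new interval starting today
  insert into currency_data
    (currency_code, currency_desc, valid_from, valid_to, d_insert_date, d_last_update)
  values
    (p_currency_code, p_new_desc, trunc(sysdate), null, sysdate, sysdate);
end set_currency_desc;
/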
What is the use case of an IOT (Index Organized Table)?
Let's say I have a table like:
id
name
surname
I know what an IOT is, but I'm a bit confused about when to use one.
Your three columns don't make a good use case.
IOT are most useful when you often access many consecutive rows from a table. Then you define a primary key such that the required order is represented.
A good example could be time series data such as historical stock prices. In order to draw a chart of the stock price of a share, many rows are read with consecutive dates.
So the primary key would be stock ticker (or security ID) and the date. The additional columns could be the last price and the volume.
A regular table - even with an index on ticker and date - would be much slower because the actual rows would be distributed over the whole disk. This is because you cannot influence the order of the rows and because data is inserted day by day (and not ticker by ticker).
In an index-organized table, the data for the same ticker ends up on a few disk pages, and the required disk pages can be easily found.
Setup of the table:
CREATE TABLE MARKET_DATA
(
TICKER VARCHAR2(20 BYTE) NOT NULL ENABLE,
P_DATE DATE NOT NULL ENABLE,
LAST_PRICE NUMBER,
VOLUME NUMBER,
CONSTRAINT MARKET_DATA_PK PRIMARY KEY (TICKER, P_DATE) ENABLE
)
ORGANIZATION INDEX;
Typical query:
SELECT TICKER, P_DATE, LAST_PRICE, VOLUME
FROM MARKET_DATA
WHERE TICKER = 'MSFT'
AND P_DATE BETWEEN SYSDATE - 1825 AND SYSDATE
ORDER BY P_DATE;
Think of index-organized tables as indexes. We all know the point of an index: to improve access speed to particular rows of data. A common performance-optimisation trick is to build compound indexes on sub-sets of columns which can be used to satisfy commonly-run queries. If an index can completely satisfy the columns in a query's projection, the optimizer knows it doesn't have to read from the table at all.
IOTs are just this approach taken to its logical conclusion: build the index and throw away the underlying table.
There are two criteria for deciding whether to implement a table as an IOT:
It should consist of a primary key (one or more columns) and at most one other column (okay, perhaps two other columns at a stretch, but that's a warning flag).
The only access route for the table is the primary key (or its leading columns).
That second point is the one which catches most people out, and is the main reason why the use cases for IOT are pretty rare. Oracle don't recommend building other indexes on an IOT, so that means any access which doesn't drive from the primary key will be a Full Table Scan. That might not matter if the table is small and we don't need to access it through some other path very often, but it's a killer for most application tables.
It is also likely that a candidate table will have a relatively small number of rows, and is likely to be fairly static. But this is not a hard'n'fast rule; certainly a huge, volatile table which matched the two criteria listed above could still be considered for implementation as an IOT.
So what makes a good candidate for index organization? Reference data. Most code lookup tables look something like this:
code number not null primary key
description varchar2(30) not null
Almost always we're only interested in getting the description for a given code. So building it as an IOT will save space and reduce the access time to get the description.
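For example, a lookup table along those lines could be built as an IOT like this (table and column names are illustrative):
create table status_codes
( code        number       not null
, description varchar2(30) not null
, constraint status_codes_pk primary key (code)
)
organization index;

-- the typical lookup is satisfied entirely from the index structure
select description
from status_codes
where code = 42;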