I am in the middle of designing a table which includes two columns, valid_from and valid_to, to track historical changes. For example, my table structure is like below:
create table currency_data
(
currency_code varchar(16) not null,
currency_desc varchar(16) not null,
valid_from date not null,
valid_to date,
d_insert_date date,
d_last_update date,
constraint pk_currency_data primary key (currency_code, valid_from)
)
The idea is to leave valid_to blank to start with; if the currency_desc changes in the future, I will set valid_to to the date on which the old description stops being valid and create a new row with a new valid_from. But how can I ensure that there will never be an overlap between these two rows? For example, the query below should only ever yield one row.
select currency_desc
from currency_data
where currency_code = 'USD'
and trunc(sysdate) between valid_from and nvl(valid_to, sysdate)
Is there a better way to achieve this, other than making sure all developers/end users are aware of this rule? Many thanks.
There is a set of implementation approaches known as slowly changing dimensions (SCD) for handling this kind of storage.
What you are currently implementing is SCD Type II; however, there are more.
Regarding your possible interval overlap issue: there is no simple way to enforce table-level (as opposed to row-level) consistency with standard constraints, so a robust approach would be to restrict direct DML against this table and wrap it in a standardized PL/SQL API which enforces your rules prior to insert/update and which every developer will use.
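For illustration, here is a minimal sketch of what such an API could look like for the table above; the procedure name and the convention of closing the old row the day before the new one starts are assumptions, not a definitive implementation:
create or replace procedure set_currency_desc (
    p_currency_code in currency_data.currency_code%type,
    p_currency_desc in currency_data.currency_desc%type,
    p_valid_from    in date default trunc(sysdate)
) as
begin
    -- close the currently open row (valid_to is null) the day before the new row starts
    update currency_data
       set valid_to      = p_valid_from - 1,
           d_last_update = sysdate
     where currency_code = p_currency_code
       and valid_to is null
       and valid_from < p_valid_from;

    -- insert the new version; the primary key (currency_code, valid_from)
    -- already rejects two rows starting on the same date
    insert into currency_data
        (currency_code, currency_desc, valid_from, valid_to, d_insert_date, d_last_update)
    values
        (p_currency_code, p_currency_desc, p_valid_from, null, trunc(sysdate), sysdate);
end set_currency_desc;
/
With direct INSERT/UPDATE privileges revoked from application users, every change has to go through the procedure, so closing the old interval can never be forgotten.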
Related
I have a pretty bad headache about a primary key based on these three columns:
ID DATE_START DATE_END
They describe the version of an object. A version of an object is the combination of ID, DATE_START and DATE_END, and there are no breaks inside the timeline.
There are a few states of the object:
State 1 (latest version):
ID DATE_START DATE_END
1 01-01-2022 00:00:00 31-12-9999 00:00:00
State 2 (insert new version)
ID DATE_START DATE_END
1 01-01-2022 00:00:00 01-02-2022 23:59:59
1 02-02-2022 00:00:00 31-12-9999 00:00:00
So you can see that each time a user creates a new version, we have to update DATE_END on the previous one, like this:
DATE_END = DATE_START - 1 Second
I'm almost sure that updating part of a PK is bad practice. I'm inclined to drop DATE_END from the PK altogether, given that there are no breaks inside the timeline.
But my colleagues are trying to convince me that we should keep DATE_END, because searching for the version valid on some date is pretty slow when we have to compare the DATE_START of each row, compared with checking DATE_START and DATE_END within one row.
Could someone explain to me who's right and what the best solution is in this case?
You could store only the start date and calculate the end date; or you could store the end date and have to UPDATE the last record.
The first option incurs a greater relative cost when you SELECT the output and a lower cost on INSERT;
the second option incurs a greater relative cost when you INSERT (and UPDATE) and a lower cost on SELECT.
So if all you are doing is INSERTing and SELECTing then you can look at which operation you do more and prioritise the lower-cost option for that.
However, if the primary key is also the target of referential constraints then the cost of modifying an existing primary key is far higher, as you will also have to modify all the columns that reference it through those constraints, and that will quickly become a nightmare.
What you should do is divorce the primary key used in referential constraints from the date columns that are going to be updated and use a surrogate primary key:
CREATE TABLE table_name (
  key        NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  id         NUMBER NOT NULL,
  start_date DATE   NOT NULL,
  end_date   DATE   NOT NULL,
  CONSTRAINT table_name__id__sd__ed__unique UNIQUE (id, start_date, end_date)
);
or:
CREATE TABLE table_name (
  key        NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  id         NUMBER NOT NULL,
  start_date DATE   NOT NULL,
  CONSTRAINT table_name__id__sd__unique UNIQUE (id, start_date)
);
Then you do not need to modify any referential constraints and you can evaluate whether you want to have greater costs during INSERTion or SELECTion.
Whichever of those two you prioritise, that decision is a business decision and there is no "best" choice as what might work in one situation will not work for another situation.
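As a sketch of the "calculate the end date" option against the second table (which stores only start_date), the next version's start date can be pulled in with LEAD; the far-future default value is just an illustration:
SELECT id,
       start_date,
       LEAD(start_date, 1, DATE '9999-12-31')
           OVER (PARTITION BY id ORDER BY start_date) AS end_date
FROM   table_name;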
I'd say that modifying a primary key really is bad practice. Mind you, primary keys are often referenced by foreign keys; if you tried to update such a primary key (one referenced by other foreign key(s)), you wouldn't be able to do it because Oracle would complain that child records exist. What would you do then? Disable the foreign key constraint, update the primary key value, update the foreign key(s), re-enable the foreign key constraints, and hope that nobody/nothing did something to violate referential integrity in the meantime (while the foreign keys were disabled).
So, why wouldn't you rather use a surrogate primary key? In modern Oracle database versions, use an identity column. If your Oracle database doesn't support that, use a sequence. Never modify its value; reference it from other tables and maintain referential integrity. Then update those end dates if you must, as it won't break anything.
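A minimal sketch of the sequence-based variant for older versions, reusing the table_name/key names from the DDL above (the sequence and trigger names are illustrative):
CREATE SEQUENCE table_name_seq;

CREATE OR REPLACE TRIGGER table_name_bi
BEFORE INSERT ON table_name
FOR EACH ROW
BEGIN
  -- assign the surrogate key from the sequence on every insert
  SELECT table_name_seq.NEXTVAL INTO :NEW.key FROM dual;
END;
/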
I have a table partitioned on a column (rcrd_expry_ts) of date type. We update rcrd_expry_ts weekly from another job. We noticed that the update query is taking quite a long time (1 to 1.5 minutes) even for a few rows, and I think the extra time is spent actually moving data internally to a different partition. There can be a million rows eligible for the weekly job to update rcrd_expry_ts.
CREATE TABLE tbl_parent
(
    "parentId" NUMBER NOT NULL ENABLE,
    "RCRD_DLT_TSTP" timestamp default timestamp '9999-01-01 00:00:00' NOT NULL,
    constraint pk_parent primary key ("parentId")  -- required so tbl_child can reference it
)
PARTITION BY RANGE ("RCRD_DLT_TSTP") INTERVAL (NUMTOYMINTERVAL('1','MONTH'))
(PARTITION "P1" VALUES LESS THAN (TO_DATE('2010-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS')))
ENABLE ROW MOVEMENT;  -- needed so updates to RCRD_DLT_TSTP can move rows between partitions

CREATE TABLE tbl_child
(
    "foreign_id" NUMBER NOT NULL ENABLE,
    "id" NUMBER NOT NULL ENABLE,
    constraint fk_id foreign key ("foreign_id") references tbl_parent ("parentId")
)
partition by reference (fk_id)
ENABLE ROW MOVEMENT;  -- child rows follow the parent row's partition
I am updating RCRD_DLT_TSTP in the parent table from another job (using a simple update query), but I noticed that it takes around 1 to 1.5 minutes to execute, probably due to creating partitions and moving data into the corresponding partition. Is there any better way to achieve this in Oracle?
The table has a referenced partitioned child. So any rows moving partition in the parent will have to be cascaded to the child table too.
This means you could be moving substantially more rows than the "few rows" that change in the parent.
It's also worth checking whether the update can identify the rows it needs to change more quickly.
You can do this by getting the plan for the update statement like this:
update /*+ gather_plan_statistics */ <rest of your update statement>;
select *
from table(dbms_xplan.display_cursor( format => 'ALLSTATS LAST' ));
This will give you the plan for the update with its run time stats. This will help in identifying if there are any indexes you can create to improve performance.
Is there any better way to achieve this in Oracle
This is a question that needs to be answered in the larger context. You may well be able to make this process faster by unpartitioning the table and using indexes to identify the rows to change.
But this affects all the other statements that access this table. To what extent do they benefit from partitioning? If the answer is substantially, is it worth making this process faster at the expense of these others? What trade-offs are you willing to make here?
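For example, if the table were left unpartitioned and the weekly job filters on RCRD_DLT_TSTP, an index along these lines (name illustrative; a sketch rather than a recommendation) could let the job locate the rows to change without a full scan:
CREATE INDEX ix_tbl_parent_rcrd_dlt_tstp ON tbl_parent ("RCRD_DLT_TSTP");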
Currently in our on-prem Hadoop environment we are using a Hive table with transactional properties. However, as we are moving to AWS we don't have that feature yet, so I want to understand how to handle SCD Type 2 without updates.
For example, for the following record:
With Updates
In a table with transactional properties enabled, when I get an update for a record, I go ahead and change the end_date to the current date and create a new record with effective_date as the current date and end_date as 12/31/9999, as shown in the table above. That makes it easy to find my active record (where end_date = "12/31/9999").
However, if I can't update the past record, I end up with two records with the same end_date, as shown in the table below.
My questions are:
If I can't update the end_date of the past record,
How do I get the historical duration of stay?
How do I get the active record?
without updates
First of all, convert all dates to the 'yyyy-MM-dd' format so that they sort correctly and analytic functions work. Then you can use lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date). For id=1 and effective_date = 2019-01-01 it should give you '2020-08-15', and you can assign this value as the end_date of the '2019-01-01' record. If there is no record with a bigger effective_date, the default '9999-01-01' is assigned. After this transformation, the active record is the one having end_date = '9999-01-01'.
Supposing the dates are already converted to yyyy-MM-dd, this is how you can rewrite your table (after the insert):
insert overwrite table your_table
select name, id, location, effective_date,
lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date) as end_date
from your_table
Or, without doing the insert first, you can UNION ALL the existing records with the new records in a subquery, then calculate lead.
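A sketch of that variant, with the same columns as above and an assumed staging table new_records (the name is illustrative):
insert overwrite table your_table
select name, id, location, effective_date,
       lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date) as end_date
from
(
    select name, id, location, effective_date from your_table
    union all
    select name, id, location, effective_date from new_records
) s;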
Actually, SCD2 is not recommended for historical data rewriting because of the non-equi join implementation in Hive. It is implemented as a cross join plus a filter (or as a duplicating join on dim.id = fact.id, which duplicates rows, plus a filter where fact.date <= dim.end_date and fact.date >= dim.effective_date, which should leave one record). This join is very expensive if the dimension and the fact table are big, because of the duplication before filtering.
Consider the following table
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL,
NAME TEXT NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR(50),
SALARY REAL
);
If we have 100 million rows of random data in this table:
Select age from company where id=2855265
Executed in less than a millisecond
Select age from company where id<353
Returns fewer than 50 rows; executed in less than a millisecond
Both queries use the index.
But the following query uses a full table scan and executes in 3 seconds:
Select age from company where id<2855265
Returns fewer than 500 rows.
How can I speed up a query that selects by primary key less than a variable?
Performance
The predicate id < 2855265 potentially returns a large percentage of rows in the table. Unless Postgres has information in table statistics to expect only around 500 rows, it might switch from an index scan to a bitmap index scan or even a sequential scan. Explanation:
Postgres not using index when index scan is much better option
We would need to see the output from EXPLAIN (ANALYZE, BUFFERS) for your queries.
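For the slow query above, that would be:
EXPLAIN (ANALYZE, BUFFERS)
SELECT age FROM company WHERE id < 2855265;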
When you repeat the query, do you get the same performance? There may be caching effects.
Either way, 3 seconds is way too slow for 500 rows. Postgres might be working with outdated or inexact table statistics. Or there may be issues with your server configuration (not enough resources). Or there can be several other, less common reasons, including hardware issues ...
If VACUUM ANALYZE did not help, VACUUM FULL ANALYZE might. It effectively rewrites the whole table and all indexes in pristine condition. Takes an exclusive lock on the table and might conflict with concurrent access!
I would also consider increasing the statistics target for the id column. Instructions:
Keep PostgreSQL from sometimes choosing a bad query plan
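For example (the statistics target of 1000 is just an illustration):
ALTER TABLE company ALTER COLUMN id SET STATISTICS 1000;
ANALYZE company;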
Table definition?
Whatever else you do, there seem to be various problems with your table definition:
CREATE TABLE COMPANY(
ID BIGINT PRIMARY KEY NOT NULL, -- int is probably enough. "id" is a terrible column name
NAME TEXT NOT NULL, -- "name" is a terrible column name
AGE INT NOT NULL, -- typically bad idea to store age, store birthday instead
ADDRESS CHAR(50), -- never use char(n)!
SALARY REAL -- why would a company have a salary? never store money as real
);
You probably want something like this instead:
CREATE TABLE employee(
employee_id serial PRIMARY KEY
, company_id int NOT NULL -- REFERENCES company(company_id)?
, birthday date NOT NULL
, employee_name text NOT NULL
, address varchar(50) -- or just text
, salary int -- store amount as *Cents*
);
Related:
How to implement a many-to-many relationship in PostgreSQL?
Any downsides of using data type "text" for storing strings?
You will need to run VACUUM ANALYZE company; to update the planner statistics.
If I have table1, table2, table3 .. table50 that store different information about a product,
what would be an efficient way of keeping track of incremental changes, so that if I want to go back and pull how that particular product looked on a given date, it would be very fast and accurate?
I would want to track the changes in a way that they can be retrieved very fast while also avoiding too much redundancy.
1. If you are on Oracle 11g, Oracle Flashback technology is the feature that lets you do this.
http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28424/adfns_flashback.htm#BJFGJHCJ.
2. In older versions, you can use the DBMS_WM package and enable versioning for the tables that you need. However, there are certain restrictions on the kinds of tables you can enable versioning for.
http://download.oracle.com/docs/cd/B10501_01/appdev.920/a96628/long_ref.htm#80312
3. The other implementations I have seen so far have their own version of some of the DBMS_WM procedures. Basically, you have a structure like:
SQL> desc scott_emp;
 Name       Null?    Type
 ---------- -------- -------------
 EMPNO      NOT NULL NUMBER(4)
 ENAME      NOT NULL VARCHAR2(10)
 JOB        NOT NULL VARCHAR2(9)
 MGR                 NUMBER(4)
 HIREDATE   NOT NULL DATE
 SAL        NOT NULL NUMBER(7,2)
 COMM                NUMBER(7,2)
 DEPTNO     NOT NULL NUMBER(2)
 EFF_DATE            DATE
 END_DATE            DATE
where the final two columns are used to see for what time period a record was "logically active" in the database. The implementation is done using triggers, where each INSERT/UPDATE is converted to "expire the current row (update) + insert a new row", and each DELETE is converted to "expire the current row".
The last approach might solve your purpose if you only want to track changes to some columns (e.g. if dept and salary changes are all you care about).
Please do not choose a model like this. (Do not Store each column change as a separate row)
http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:1769392200346820632