Why is my date dimension table useless? (Confusion over PostgreSQL storage...) - performance

I have looked over this about 4 times and am still perplexed by these results.
Take a look at the following (which I originally posted here).
Date dimension table --
-- Some output omitted
DROP TABLE IF EXISTS dim_calendar CASCADE;
CREATE TABLE dim_calendar (
id SMALLSERIAL PRIMARY KEY,
day_id DATE NOT NULL,
year SMALLINT NOT NULL, -- 2000 to 2024
month SMALLINT NOT NULL, -- 1 to 12
day SMALLINT NOT NULL, -- 1 to 31
quarter SMALLINT NOT NULL, -- 1 to 4
day_of_week SMALLINT NOT NULL, -- 0 (Sunday) to 6 (Saturday)
day_of_year SMALLINT NOT NULL, -- 1 to 366
week_of_year SMALLINT NOT NULL, -- 1 to 53
CONSTRAINT con_month CHECK (month >= 1 AND month <= 12),
CONSTRAINT con_day_of_year CHECK (day_of_year >= 1 AND day_of_year <= 366), -- 366 allows for leap years
CONSTRAINT con_week_of_year CHECK (week_of_year >= 1 AND week_of_year <= 53),
UNIQUE(day_id)
);
INSERT INTO dim_calendar (day_id, year, month, day, quarter, day_of_week, day_of_year, week_of_year) (
SELECT ts,
EXTRACT(YEAR FROM ts),
EXTRACT(MONTH FROM ts),
EXTRACT(DAY FROM ts),
EXTRACT(QUARTER FROM ts),
EXTRACT(DOW FROM ts),
EXTRACT(DOY FROM ts),
EXTRACT(WEEK FROM ts)
FROM generate_series('2000-01-01'::timestamp, '2024-01-01', '1 day'::interval) AS t(ts)
);
/* ==> [ INSERT 0 8767 ] */
Tables for testing --
DROP TABLE IF EXISTS just_dates CASCADE;
DROP TABLE IF EXISTS just_date_ids CASCADE;
CREATE TABLE just_dates AS
SELECT a_date AS some_date
FROM some_table;
/* ==> [ SELECT 769411 ] */
CREATE TABLE just_date_ids AS
SELECT d.id
FROM just_dates jd
INNER JOIN dim_calendar d
ON d.day_id = jd.some_date;
/* ==> [ SELECT 769411 ] */
ALTER TABLE just_date_ids ADD CONSTRAINT jdfk FOREIGN KEY (id) REFERENCES dim_calendar (id);
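For context, the payoff of the dimension table is the join back from the fact ids. A minimal sketch of the kind of usage I have in mind (the aggregate is illustrative):
-- Aggregate fact rows by calendar attributes via the dimension
SELECT d.year, d.quarter, count(*) AS n
FROM just_date_ids j
JOIN dim_calendar d ON d.id = j.id
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;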
Confusion --
pocket=# SELECT pg_size_pretty(pg_relation_size('dim_calendar'));
pg_size_pretty
----------------
448 kB
(1 row)
pocket=# SELECT pg_size_pretty(pg_relation_size('just_dates'));
pg_size_pretty
----------------
27 MB
(1 row)
pocket=# SELECT pg_size_pretty(pg_relation_size('just_date_ids'));
pg_size_pretty
----------------
27 MB
(1 row)
Why is a table consisting of a bunch of smallints the same size as a table consisting of a bunch of dates? And I should mention that before, when dim_calendar.id was a normal SERIAL, it gave the same 27MB result.
Also, and more importantly -- WHY does a table with 769411 records with a single smallint field have a size of 27MB, which is > 32 bytes/record???
P.S. Yes, I will have billions (or at a minimum hundreds of millions) of records, and am trying to add performance and space optimizations wherever possible.
EDIT
This might have something to do with it, so throwing it out there --
pocket=# select count(id) from just_date_ids group by id;
count
--------
409752
359659
(2 rows)

In tables with one or two columns, the biggest part of the size is always the Tuple Header.
Have a look at http://www.postgresql.org/docs/current/interactive/storage-page-layout.html, which explains how the data is stored. I'm quoting the part of that page most relevant to your question:
All table rows are structured in the same way. There is a fixed-size header (occupying 23 bytes on most machines), followed by an optional null bitmap, an optional object ID field, and the user data.
This mostly explains the question
WHY does a table with 769411 records with a single smallint field have a size of 27MB, which is > 32bytes/record???
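Working the numbers (a rough sketch; the exact figures depend on platform alignment): each heap tuple carries the 23-byte header padded to 24 bytes, the 2-byte smallint is then padded up to the next 8-byte MAXALIGN boundary, and each row also costs a 4-byte line pointer in the page header:
-- (24-byte padded header + 8 bytes aligned data + 4-byte line pointer) per row
SELECT pg_size_pretty((769411 * (24 + 8 + 4))::bigint); -- ~26 MB, close to the reported 27 MB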
The other part of your question has to do with the byte alignment of Postgres data. Smallints are aligned at 2-byte offsets, but ints (and dates of course... a date is an int4 after all) are aligned at 4-byte offsets. So the order in which the table columns are declared plays a significant role.
Having a table with smallint, date, smallint needs 12 bytes for user data (not counting the overhead), while declaring smallint, smallint, date will need only 8 bytes. See a great (and surprisingly not accepted) answer here: Calculating and saving space in PostgreSQL
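You can see the alignment effect directly if you want to test it (the table names here are made up for the demo; pg_column_size on a whole-row reference includes the tuple header):
CREATE TABLE demo_padded (a smallint, b date, c smallint); -- smallint, date, smallint
CREATE TABLE demo_compact (a smallint, c smallint, b date); -- smallint, smallint, date
INSERT INTO demo_padded VALUES (1, '2020-01-01', 2);
INSERT INTO demo_compact VALUES (1, 2, '2020-01-01');
-- The padded layout reports a larger row size because of the alignment filler:
SELECT (SELECT pg_column_size(p.*) FROM demo_padded p) AS padded_row,
       (SELECT pg_column_size(c.*) FROM demo_compact c) AS compact_row;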

Related

Oracle -- Datatype of column which can store value "13:45"

We need to store a value "13:45" in the column "Start_Time" of an Oracle table.
Value can be read as 45 minutes past 13:00 hours
Which datatype should be used when creating the table? Also, once queried, we would like to see only the value "13:45".
I would make it easier:
create table t_time_only (
  time_col varchar2(5),
  time_as_interval INTERVAL DAY TO SECOND invisible
    generated always as (to_dsinterval('0 '||time_col||':0')),
  constraint check_time
    check ( VALIDATE_CONVERSION(time_col as date, 'hh24:mi') = 1 )
);
Check constraint allows you to validate input strings:
SQL> insert into t_time_only values('25:00');
insert into t_time_only values('25:00')
*
ERROR at line 1:
ORA-02290: check constraint (CHECK_TIME) violated
And the invisible generated (virtual) column allows you to do simple arithmetic operations:
SQL> insert into t_time_only values('15:30');
1 row created.
SQL> select trunc(sysdate) + time_as_interval as res from t_time_only;
RES
-------------------
2020-09-21 15:30:00
Your best option is to store the data in a DATE column. If you are going to do any comparisons against the times (querying, sorting, etc.), you will want to make sure that all of the times use the same day. It doesn't matter which day, as long as they are all the same.
CREATE TABLE test_time
(
time_col DATE
);
INSERT INTO test_time
VALUES (TO_DATE ('13:45', 'HH24:MI'));
INSERT INTO test_time
VALUES (TO_DATE ('8:45', 'HH24:MI'));
Test Query
SELECT time_col,
       TO_CHAR (time_col, 'HH24:MI') AS just_time,
       24 * (time_col - LAG (time_col) OVER (ORDER BY time_col)) AS difference_in_hours
FROM test_time
ORDER BY time_col;
Test Results
TIME_COL JUST_TIME DIFFERENCE_IN_HOURS
____________ ____________ ______________________
01-SEP-20 08:45
01-SEP-20 13:45 5
Table Definition using INTERVAL
create table tab
(tm INTERVAL DAY(1) to SECOND(0));
Input value as literal
insert into tab (tm) values (INTERVAL '13:25' HOUR TO MINUTE );
Input value dynamically
insert into tab (tm) values ( (NUMTODSINTERVAL(13, 'hour') + NUMTODSINTERVAL(26, 'minute')) );
Output
you may either EXTRACT the hour and minute
EXTRACT(HOUR FROM tm) int_hour,
EXTRACT(MINUTE FROM tm) int_minute
or use formatted output with a trick by adding some fixed DATE
to_char(DATE'2000-01-01'+tm,'hh24:mi') int_format
which gives
13:25
13:26
Please see this answer for other formatting options: HH24:MI
The INTERVAL definition used may store seconds as well - if this is not acceptable, add a CHECK constraint, e.g. as follows (adjust as required):
tm INTERVAL DAY(1) to SECOND(0)
constraint "wrong interval" check (tm <= INTERVAL '23:59' HOUR TO MINUTE and EXTRACT(SECOND FROM tm) = 0 )
This rejects the following as invalid input
insert into tab (tm) values (INTERVAL '13:25:30' HOUR TO SECOND );
-- ORA-02290: check constraint (X.wrong interval) violated

Sysdate+days as default value in table column - Oracle

I'm working on my table which is supposed to store data about rented cars.
And there are 3 important columns:
RENT_DATE DATE DEFAULT TO_DATE (SYSDATE, 'DD-MM-YYYY'),
DAYS NUMBER NOT NULL,
RETURN_DATE DATE DEFAULT TO_DATE(SYSDATE+DAYS, 'DD-MM-YYYY')
My problem is that RETURN_DATE column is giving me error:
00984. 00000 - "column not allowed here"
What I want is for RENT_DATE to be set to the current date automatically when a record is added.
The DAYS column stores for how many days someone is renting the car.
And the last column should store the date when the car should be returned.
Thank you for any kind of help.
This doesn't make sense:
DEFAULT TO_DATE (SYSDATE, 'DD-MM-YYYY')
SYSDATE is already a date. TO_DATE requires a char, so this takes a date, Oracle implicitly turns the date into a char, and then TO_DATE converts it back to a date. This is risky and unreliable, because it uses a hardcoded date format to operate on a date that was implicitly turned into a string using the system default format, which might one day not be DD-MM-YYYY (you're building a latent bug into your software).
If you want a date without a time on it use TRUNC(SYSDATE)
The other part doesn't make sense either: you're storing the number of days rented for and also the return date, when one is a function of the other. Storing redundant data becomes a headache because you have to keep the values in sync. My person class stores my birthdate, and I calculate how old I am; I don't store my age too and then update my table every day/year etc.
Work out which of the two will be more beneficial for you to store, store it, then calculate the other whenever you want it. Personally I would store the return date, as it's absolute rather than open to interpretation: is that working days or calendar days? What about public holidays? If the start date is Jan 1 and the rental is for 10 days, is the car brought back on the 10th or the 11th?
If you're desperate to have both columns in your DB, consider using a view or a virtual (function-based) column to calculate one from the other, so they stay in sync; a view sketch follows.
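A minimal sketch of the view variant (it assumes a rentals table holding just the two date columns; the names here are made up):
CREATE VIEW rentals_v AS
SELECT rent_date,
       return_date,
       TRUNC(return_date - rent_date) + 1 AS days -- same +1 convention as the table below
FROM rentals;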
All in, you could look at this:
create table X(
  RENT_DATE DATE DEFAULT TRUNC(SYSDATE) NOT NULL,
  RETURN_DATE DATE NOT NULL,
  DAYS AS (TRUNC(RETURN_DATE - RENT_DATE) + 1)
);
I put the days as +1 because, to me, a car taken on the 1st and returned on the 2nd is 2 days, but you might want to be more precise - if it's taken on the 1st and returned before 10am on the 2nd then it's one day, otherwise it's 2, etc.
Use a virtual column:
CREATE TABLE table_name (
RENT_DATE DATE
DEFAULT TRUNC( SYSDATE )
CONSTRAINT table_name__rent_date__nn NOT NULL
CONSTRAINT table_name__rent_date_chk CHECK ( rent_date = TRUNC( rent_date ) ),
DAYS NUMBER
DEFAULT 7
CONSTRAINT table_name__days__nn NOT NULL,
RETURN_DATE DATE
GENERATED ALWAYS AS ( RENT_DATE + DAYS ) VIRTUAL
);
Then you can insert values:
INSERT INTO table_name ( rent_date, days ) VALUES ( DEFAULT, DEFAULT );
INSERT INTO table_name ( rent_date, days ) VALUES ( DATE '2020-01-01', 1 );
And:
SELECT * FROM table_name;
Outputs:
RENT_DATE | DAYS | RETURN_DATE
:------------------ | ---: | :------------------
2020-09-12T00:00:00 | 7 | 2020-09-19T00:00:00
2020-01-01T00:00:00 | 1 | 2020-01-02T00:00:00
db<>fiddle here

plsql - how do we reset a sequence in an Oracle 11g database at two particular times in a single day?

Let me explain you the scenario of hospital management tool.
Every hospital has n doctors, n admins, n security staff, etc. in the respective departments. Every hospital has out-patient consultations in the morning, from approximately 8:00 am to 10:00 am; from 10:00 am to 5:00 pm the doctors undertake operations and treatments for in-patients in the intensive care unit (ICU). After 5:00 pm the doctors hold out-patient consultations again, from 6:00 pm to 8:00 pm.
Now, let me explain the same in technical terminology to you.
When an out-patient comes and asks for a token number for a particular doctor, the admin selects that department in the UI and then the particular doctor, as per the patient's problem. For this, I'm maintaining one table per doctor in the database, named after the doctor.
example :
1)Neurology Department
i) Dr. Sarath Kumar
ii) Dr. anil kumar
2)Cardiology Department
i) Dr. Madhu
ii) Dr. Anji Reddy
3)Orthopedics Department
i) Dr. Murali
ii) Dr. Sirisha
etc...
Creation of a doctor table :
create table sarath_kumar(
  token_no number(3) not null primary key, -- datatype added; the original omitted it
  patient_name char(50) not null,
  patient_age number(3) not null,
  patient_phonenumber number(12) not null unique,
  patient_email varchar2(50) not null unique,
  patient_gender char(1) not null,
  patient_location varchar2(50) not null,
  patient_dateofappointment date not null,
  CONSTRAINT sk_token_no CHECK (token_no <= 20)
);
Note:
generally the admin doesn't know which token number each doctor is currently on.
We have the same table structure for every doctor, named after them. The first column in each doctor's table has to be generated automatically starting from 1; to do this I created a sequence and a trigger that fires before each insertion by the admin.
let's take morning session of out-patients consultation from 8:00 am to 10:00 am. Each doctor will only have a 20 patients for consultation.
Sequence Creation :
create sequence appointment_sequence
start with 1
increment by 1
minvalue 1
maxvalue 20
cache 5
;
Trigger Creation :
create or replace trigger appointment_sequence
before insert on sarath_kumar
for each row
begin
  :new.token_no := appointment_sequence.NEXTVAL;
end;
/
what i need from you is :
After a doctor reaches 20 patients during the consultation, i.e. the token number has hit its maximum between 8:00 am and 10:00 am, then if anyone asks for an appointment with that particular doctor, the admin shouldn't be able to provide any kind of appointment for that doctor and should ask the patient to come to the evening consultation, which runs from 6:00 pm to 8:00 pm.
I need a procedure or function by which the doctor's table gets truncated and the sequence gets reset back to minvalue at 10:00 am and again in the evening after 8:00 pm.
First of all, you should have a single patient_appoint table instead of a separate table per doctor, and just store the doctor's ID in that table.
create table patient_appoint(
  token_no number(3) not null primary key, -- datatype added; the original omitted it
  doctor_id number not null,
  patient_name char(50) not null,
  patient_age number(3) not null,
  patient_phonenumber number(12) not null unique,
  patient_email varchar2(50) not null unique,
  patient_gender char(1) not null,
  patient_location varchar2(50) not null,
  patient_dateofappointment date not null,
  CONSTRAINT sk_token_no CHECK (token_no <= 20)
);
For resetting the sequence to 1, use the CYCLE property of the sequence. Use the code below to create it:
create sequence appointment_sequence
start with 1
increment by 1
minvalue 1
maxvalue 20
cycle
cache 5
;
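With CYCLE, once NEXTVAL returns MAXVALUE the sequence wraps back to MINVALUE on the next call, so no manual reset is needed:
SELECT appointment_sequence.NEXTVAL FROM dual; -- ... 19, 20, 1, 2, ...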
For restricting appointments to only 20 per day, you may use the trigger below:
CREATE OR REPLACE TRIGGER TR_PATIENT_APPOINT
AFTER INSERT ON PATIENT_APPOINT
DECLARE
  v_count NUMBER;
BEGIN
  SELECT COUNT(*)
    INTO v_count
    FROM PATIENT_APPOINT
   WHERE TRUNC(patient_dateofappointment) = TRUNC(SYSDATE);
  IF (v_count > 20) THEN
    raise_application_error(-20000, 'Maximum 20 appointments allowed per day.');
  END IF;
END TR_PATIENT_APPOINT;
/
As others have pointed out, or at least hinted at, this will be a maintenance nightmare, with each doctor having their own table and their own sequence. Consider what happens when a patient cancels: you don't get that sequence value back, so that doctor can only see 19 patients. And that is an easy situation to handle. There is an easier way: don't use sequences.
If you break it down, each patient is essentially given a 6-minute time slot (120 min / 20 slots). With this you generate a skeleton schedule for each doctor that does not have patient information initially. Admins then fill in patient information when needed, and can actually view the available times for each doctor. The following shows how to generate such a schedule. (Note it assumes you have normalized your doctor table (one table containing all doctors) and created a patient table (one table containing all patients).)
--- prior Setup
create table doctors(doc_id integer, name varchar2(50), ..., constraint doc_pk primary key (doc_id));
create table patients(pat_id integer, name varchar2(50), ..., constraint pat_pk primary key (pat_id));
create sequence ops_seq; -- populates ops_id below; the original left the PK unpopulated
--- Daily Out-patient Schedule.
create table out_patient_schedule (
  ops_id integer
  , doc_id integer not null
  , pat_id integer
  , apt_schedule date
  , constraint ops_pk primary key (ops_id)
  , constraint ops2doc_fk foreign key (doc_id) references doctors(doc_id)
  , constraint ops2pat_fk foreign key (pat_id) references patients(pat_id)
);
--- Generate skeleton schedule
create or replace procedure gen_outpatient_skeleton_schedule
as
begin
  insert into out_patient_schedule(ops_id, doc_id, apt_schedule)
  with apt_times as
  ( -- trunc(sysdate), not trunc(sysdate,'day'): the latter snaps to the start of the week
    select trunc(sysdate) + 8/24 + (120/20)*(level-1)/(60*24) apt_time
      from dual connect by level <= 20
    union all
    select trunc(sysdate) + 18/24 + (120/20)*(level-1)/(60*24)
      from dual connect by level <= 20
  )
  select ops_seq.nextval, doc_id, apt_time from doctors, apt_times;
end gen_outpatient_skeleton_schedule;
/
Now create an Oracle Job, or an entry in whatever job scheduler you have, that executes the above procedure between midnight and 8:00.
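With DBMS_SCHEDULER that could look something like this (the job name and the 6:00 run time are illustrative; any daily time before 8:00 works):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'GEN_OPS_SKELETON_JOB', -- illustrative name
    job_type        => 'STORED_PROCEDURE',
    job_action      => 'GEN_OUTPATIENT_SKELETON_SCHEDULE',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=DAILY; BYHOUR=6; BYMINUTE=0; BYSECOND=0',
    enabled         => TRUE);
END;
/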
There is a race condition you need to handle (two admins grabbing the same free slot), but doing so would be much easier than trying it with sequences.
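One hedged way to handle it is to book a slot with a single UPDATE that re-checks the slot is still free, so a concurrent booking simply updates zero rows (column names as in the schema above; the :p_* binds are placeholders):
UPDATE out_patient_schedule
   SET pat_id = :p_pat_id
 WHERE ops_id = (SELECT MIN(ops_id)
                   FROM out_patient_schedule
                  WHERE doc_id = :p_doc_id
                    AND pat_id IS NULL
                    AND apt_schedule > SYSDATE)
   AND pat_id IS NULL; -- 0 rows updated means the slot was taken concurrently; pick the next one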
Good Luck either way.

Benchmark: bigint vs int on PostgreSQL

I want to increase my database performance. In a project, all tables went from int to bigint, which I think is a bad choice, not only regarding storage, since int requires 4 bytes and bigint requires 8 bytes, but also regarding performance.
So I created a small table with 10 million entries, with a script in Python:
import uuid

rows = 10000000
output = 'insert_description_bigint.sql'
f = open(output, 'w')
set_schema = "SET search_path = norma;\n"
f.write(set_schema)
for i in range(1, rows + 1):  # range(1, rows) would generate one row too few
    random_string = uuid.uuid4()
    query = "insert into description_bigint (description_id, description) values (%d, '%s');\n"
    f.write(query % (i, random_string))
f.close()
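To load the generated file from psql, wrapping it in one transaction speeds up millions of single-row inserts considerably (a sketch; run it against your own database):
BEGIN;
\i insert_description_bigint.sql
COMMIT;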
And this is how I created my two tables:
-- BIGINT
DROP TABLE IF EXISTS description_bigint;
CREATE TABLE description_bigint
(
description_id BIGINT PRIMARY KEY NOT NULL,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
select count(1) from description_bigint;
select * from description_bigint;
select * from description_bigint where description_id = 9999999;
-- INT
DROP TABLE IF EXISTS description_int;
CREATE TABLE description_int
(
description_id INT PRIMARY KEY NOT NULL,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
After inserting all this data, I ran a query against both tables to measure the difference between them. And to my surprise they both have the same performance:
select * from description_bigint; -- 11m55s
select * from description_int; -- 11m55s
Am I doing something wrong with my benchmark? Shouldn't int be faster than bigint? Especially since the primary key is by definition an index, which means creating an index for bigint should be slower than creating an index for int with the same amount of data, right?
I know this alone is just a small thing and won't make a huge impact on my database's performance, but I want to ensure that we are using best practices and stay focused on performance here.
On a 64-bit system the two tables are nearly identical. The column description_id in description_int covers 8 bytes (4 for the integer and 4 of alignment padding). Try this test:
select
  pg_relation_size('description_int')/10000000 as table_int,
  pg_relation_size('description_bigint')/10000000 as table_bigint,
  pg_relation_size('description_int_pkey')/10000000 as index_int,
  pg_relation_size('description_bigint_pkey')/10000000 as index_bigint;
The average row size of both tables is virtually the same. This is because the integer column occupies 8 bytes (4 bytes for the value and 4 bytes of alignment padding), exactly like bigint (8 bytes for the value, without a filler). The same applies to index entries. This is a special case, however. If we add one more integer column to the first table:
CREATE TABLE two_integers
(
description_id INT PRIMARY KEY NOT NULL,
one_more_int INT,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
the average row size should remain the same, because the first 8 bytes will be used for the two integers (without filler).
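If you want to double-check the per-row figures yourself, pg_column_size on a whole-row reference reports the size of a single stored tuple, header included:
select pg_column_size(t.*) as int_row_bytes from description_int t limit 1;
select pg_column_size(t.*) as bigint_row_bytes from description_bigint t limit 1;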
Find more details in Calculating and saving space in PostgreSQL.

ON DELETE CASCADE is very slow

I am using Postgres 8.4. My system is Windows 7 32-bit, with 4 GB RAM and a 2.5 GHz CPU.
I have a database in Postgres with 10 tables t1, t2, t3, t4, t5.....t10.
t1 has a primary key, a sequence id, which is referenced as a foreign key by all the other tables.
Data is inserted into the database (i.e. into all the tables); apart from t1, every table has nearly 50,000 rows of data, while t1 has one row whose primary key is referenced from all the other tables. Then I insert a 2nd row into t1 and again 50,000 rows with this new reference into the other tables.
The issue is when I want to delete all the data entries that are present in other tables:
delete from t1 where column1='1'
This query takes nearly 10 min to execute.
I also created indexes and tried again, but the performance is not improving at all.
what can be done?
I have mentioned a sample schema below
CREATE TABLE t1
(
c1 numeric(9,0) NOT NULL,
c2 character varying(256) NOT NULL,
c3ver numeric(4,0) NOT NULL,
dmlastupdatedate timestamp with time zone NOT NULL,
CONSTRAINT t1_pkey PRIMARY KEY (c1),
CONSTRAINT t1_c1_c2_key UNIQUE (c2)
);
CREATE TABLE t2
(
c1 character varying(100),
c2 character varying(100),
c3 numeric(9,0) NOT NULL,
c4 numeric(9,0) NOT NULL,
tver numeric(4,0) NOT NULL,
dmlastupdatedate timestamp with time zone NOT NULL,
CONSTRAINT t2_pkey PRIMARY KEY (c3),
CONSTRAINT t2_fk FOREIGN KEY (c4)
REFERENCES t1 (c1) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT t2_c3_c4_key UNIQUE (c3, c4)
);
CREATE INDEX t2_index ON t2 USING btree (c4);
Let me know if there is anything wrong with the schema.
With bigger tables and more than just two or three values, you need an index on the referenced column (t1.c1) as well as the referencing columns (t2.c4, ...).
But if your description is accurate, that can not be the cause of the performance problem in your scenario. Since you have only 2 distinct values in t1, there is just no use for an index. A sequential scan will be faster.
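For reference, in the general case (many distinct values in t1) the fix is simply an index on every referencing column, along these lines (a sketch; it assumes t3 through t10 reference t1 through a c4 column just like t2 does):
CREATE INDEX t3_c4_idx ON t3 (c4);
CREATE INDEX t4_c4_idx ON t4 (c4);
-- ... and so on for t5 through t10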
Anyway, I re-enacted what you describe in Postgres 9.1.9
CREATE TABLE t1
( c1 numeric(9,0) PRIMARY KEY,
c2 character varying(256) NOT NULL,
c3ver numeric(4,0) NOT NULL,
dmlastupdatedate timestamptz NOT NULL,
CONSTRAINT t1_uni_key UNIQUE (c2)
);
CREATE TABLE t2
( c1 character varying(100),
c2 character varying(100),
c3 numeric(9,0) PRIMARY KEY,
c4 numeric(9,0) NOT NULL,
tver numeric(4,0) NOT NULL,
dmlastupdatedate timestamptz NOT NULL,
CONSTRAINT t2_uni_key UNIQUE (c3, c4),
CONSTRAINT t2_c4_fk FOREIGN KEY (c4)
REFERENCES t1(c1) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
);
INSERT INTO t1 VALUES
(1,'OZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf', 234, now())
,(2,'agdsOZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf', 4564, now());
INSERT INTO t2
SELECT'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
,'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, g, 2, 456, now()
from generate_series (1, 50000) g;
INSERT INTO t2
SELECT'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
,'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, g, 2, 789, now()
from generate_series (50001, 100000) g;
ANALYZE t1;
ANALYZE t2;
EXPLAIN ANALYZE DELETE FROM t1 WHERE c1 = 1;
Total runtime: 53.745 ms
DELETE FROM t1 WHERE c1 = 1;
58 ms execution time.
Ergo, there is nothing fundamentally wrong with your schema layout.
Minor enhancements:
You have a couple of columns defined as numeric(9,0) or numeric(4,0). Unless you have a good reason for that, you are probably a lot better off using plain integer. It is smaller and faster overall. You can always add a check constraint if you really need to enforce a maximum.
I would also use text instead of varchar(n).
And reorder the columns (at table creation time). As a rule of thumb, place fixed-length NOT NULL columns first: put timestamp and integer first and numeric or text last. A reordered version of t2 is sketched below. More here.
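Applied to t2, such a reordering might look like this (a sketch under those rules, with the types swapped to integer and text as suggested; it assumes t1.c1 is migrated to integer as well):
CREATE TABLE t2_reordered
( c3               integer PRIMARY KEY      -- fixed-length NOT NULL columns first
, c4               integer NOT NULL REFERENCES t1 (c1) ON DELETE CASCADE
, tver             integer NOT NULL
, dmlastupdatedate timestamptz NOT NULL
, c1               text                     -- variable-length columns last
, c2               text
);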
