Benchmark: bigint vs int on PostgreSQL - performance

I want to improve my database performance. In a project, all tables went from int to bigint, which I think is a bad choice, not only regarding storage (int requires 4 bytes, bigint requires 8 bytes) but also regarding performance.
So I created a small table with 10 million entries, with a script in Python:
import uuid

rows = 10000000
output = 'insert_description_bigint.sql'

f = open(output, 'w')
set_schema = "SET search_path = norma;\n"
f.write(set_schema)
# generate one INSERT statement per row, keyed 1..rows
for i in range(1, rows + 1):
    random_string = uuid.uuid4()
    query = "insert into description_bigint (description_id, description) values (%d, '%s'); \n"
    f.write(query % (i, random_string))
f.close()
And this is how I created my two tables:
-- BIGINT
DROP TABLE IF EXISTS description_bigint;
CREATE TABLE description_bigint
(
description_id BIGINT PRIMARY KEY NOT NULL,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
select count(1) from description_bigint;
select * from description_bigint;
select * from description_bigint where description_id = 9999999;
-- INT
DROP TABLE IF EXISTS description_int;
CREATE TABLE description_int
(
description_id INT PRIMARY KEY NOT NULL,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
After inserting all this data, I run a query on both tables to measure the difference between them. And to my surprise, they both have the same performance:
select * from description_bigint; -- 11m55s
select * from description_int; -- 11m55s
Am I doing something wrong with my benchmark? Shouldn't int be faster than bigint? Especially since the primary key is by definition an index, which means creating an index for bigint should be slower than creating an index for int for the same amount of data, right?
I know this alone is not something that will make a huge impact on my database's performance, but I want to make sure we are following best practices and staying focused on performance here.

On a 64-bit system the two tables are nearly identical. The column description_id in description_int occupies 8 bytes (4 for the integer value and 4 of alignment padding). Try this test:
select
pg_relation_size('description_int')/10000000 as table_int,
pg_relation_size('description_bigint')/10000000 as table_bigint,
pg_relation_size('description_int_pkey')/10000000 as index_int,
pg_relation_size('description_bigint_pkey')/10000000 as index_bigint;
The average row size of both tables is virtually the same. This is because the integer column occupies 8 bytes (4 bytes for a value and 4 bytes of alignment) exactly like bigint (8 bytes for a value without a filler). The same applies to index entries. This is a special case, however. If we add one more integer column to the first table:
CREATE TABLE two_integers
(
description_id INT PRIMARY KEY NOT NULL,
one_more_int INT,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
the average row size should remain the same because the first 8 bytes will be used for two integers (without filler).
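As a quick sanity check of the alignment argument, pg_column_size() can be used on ad-hoc row values (a sketch; the reported sizes include a small row header, so only the relative comparison matters):
select
pg_column_size(row(1::int, 2::int)) as two_ints,
pg_column_size(row(1::bigint)) as one_bigint,
pg_column_size(row(1::int, 1::bigint)) as int_then_bigint,
pg_column_size(row(1::bigint, 1::int)) as bigint_then_int;
On a typical 64-bit build, two_ints and one_bigint should come out equal, and int_then_bigint should be 4 bytes larger than bigint_then_int because of the padding inserted before the 8-byte-aligned bigint.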
Find more details in Calculating and saving space in PostgreSQL.

Related

sqlite painfully slow querying even with index

I have a very trivial database with a single table:
CREATE TABLE records (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symbol VARCHAR(20) NOT NULL,
time_ts INTEGER NOT NULL,
open_ts INTEGER NOT NULL,
close_ts INTEGER NOT NULL,
open_price REAL NOT NULL,
high_price REAL NOT NULL,
low_price REAL NOT NULL,
close_price REAL NOT NULL,
trades_count INTEGER NOT NULL,
volume_amount REAL NOT NULL,
quote_asset_volume REAL NOT NULL,
taker_buy_base_asset_volume REAL NOT NULL,
taker_buy_quote_asset_volume REAL NOT NULL)
and an index:
CREATE INDEX symbol_index ON records (symbol)
The size of the database is 12.63 GB.
I am running this query:
SELECT
symbol,
MAX(close_ts) max_close_ts,
MIN(close_ts) min_close_ts
FROM records
GROUP BY symbol
And it takes about a minute to execute.
As you can see, an index is created on the symbol column. However, even with this, the querying is painfully slow.
Even a query like:
select count(id) from records;
Takes about 77 seconds to execute. The total number of rows in the table is 115_944_904.
I expect the record count to double in the future. Is there anything I can do to make the queries run faster? Even with indexes on the primary key and on the symbol column I am getting quite bad performance.
Have I hit a limit of any kind?
You could create a covering index to avoid accessing the table (an index named symbol_index already exists, so either drop it first or give the new one a different name):
CREATE INDEX symbol_index ON records (symbol,close_ts)
SELECT
symbol,
MAX(close_ts) max_close_ts,
MIN(close_ts) min_close_ts
FROM records
GROUP BY symbol;
You should use EXPLAIN QUERY PLAN and look at the scan statistics; it will also show whether your query is actually using the index you created. You should also create covering indices (https://www.sqlite.org/queryplanner.html#covidx) for better performance.
Example:
EXPLAIN QUERY PLAN
select count(id) from records;
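For the GROUP BY query, once a covering index on (symbol, close_ts) exists, the plan should report the index as covering, roughly along these lines (exact wording differs between SQLite versions):
EXPLAIN QUERY PLAN
SELECT symbol, MAX(close_ts) max_close_ts, MIN(close_ts) min_close_ts
FROM records
GROUP BY symbol;
-- expected to show something like: SCAN records USING COVERING INDEX symbol_index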
As Lukasz Szozda said, the index
CREATE INDEX symbol_index ON records (symbol,close_ts)
should make your query faster, because you have two aggregate functions per group.
This way, the DBMS will skip all the intermediate rows for each distinct entry in the symbol column.
The benefit will be proportional to the number of symbols: the fewer distinct values the symbol column has, the greater the speed-up of the query will be.
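If the number of distinct symbols is small, one idea worth trying along the same lines is to walk the distinct symbols with a recursive CTE and let each MIN/MAX be a single index seek (a sketch against the schema above, assuming the (symbol, close_ts) index exists; check the plan with EXPLAIN QUERY PLAN):
WITH RECURSIVE syms(symbol) AS (
  SELECT MIN(symbol) FROM records
  UNION ALL
  SELECT (SELECT MIN(symbol) FROM records WHERE symbol > syms.symbol)
  FROM syms
  WHERE syms.symbol IS NOT NULL
)
SELECT
  symbol,
  (SELECT MAX(close_ts) FROM records r WHERE r.symbol = syms.symbol) AS max_close_ts,
  (SELECT MIN(close_ts) FROM records r WHERE r.symbol = syms.symbol) AS min_close_ts
FROM syms
WHERE symbol IS NOT NULL;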

Insert huge data into Oracle DB

My setup: Oracle DB 12.1c, a Spring application with Hibernate.
table:
create table war
(
id int generated by default as identity not null constraint wars_pkey primary key,
t1_id int references t1 (id) on delete cascade not null,
t2_id int references t2 (id) on delete cascade not null,
day timestamp not null,
diff int not null
);
I would like to insert 10,000 records into the table. With repository.saveAll(<data>) it takes 70 s; using JpaTemplate.batchUpdate(<insert statements>) it takes 68 s. When I create a new temporary table without constraints it takes 65 s.
What is the best/fastest way to insert this number of records into an Oracle DB? Unfortunately, CSV is not an option.
My solution was to redesign our model: we used int_array for storing diff, which was almost 10 times faster than the first solution.
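For reference, another lever that usually helps here is sending rows in batches instead of one statement per row: with Hibernate that means enabling JDBC batching (hibernate.jdbc.batch_size together with ordered inserts), and the same idea can be sketched in plain SQL as a multi-row insert (values below are made up and assume matching rows exist in t1 and t2):
INSERT INTO war (t1_id, t2_id, day, diff)
SELECT 1, 1, TIMESTAMP '2020-01-01 00:00:00', 5 FROM dual UNION ALL
SELECT 1, 2, TIMESTAMP '2020-01-01 00:00:00', 7 FROM dual UNION ALL
SELECT 2, 1, TIMESTAMP '2020-01-01 00:00:00', 3 FROM dual;
The id column is omitted so the identity default fills it in.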

Accelerate SQLite Query

I'm currently learning SQLite (called from Python).
Following up on my previous question (Reorganising Data in SQLLIte), I want to store multiple time series (training data) in my database.
I have defined the following fields:
CREATE TABLE VARLIST
(
VarID INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
)
CREATE TABLE DATAPOINTS
(
DataID INTEGER PRIMARY KEY,
timeID INTEGER,
VarID INTEGER,
value REAL
)
CREATE TABLE TIMESTAMPS
(
timeID INTEGER PRIMARY KEY AUTOINCREMENT,
TRAININGS_ID INT,
TRAINING_TIME_SECONDS FLOAT
)
VARLIST has 8 entries, TIMESTAMPS 1e5 entries and DATAPOINTS around 5e6.
When I now want to extract data for a given TRAININGS_ID and VarID, I try the following:
SELECT
(SELECT TIMESTAMPS.TRAINING_TIME_SECONDS
FROM TIMESTAMPS
WHERE t.timeID = timeID) AS TRAINING_TIME_SECONDS,
(SELECT value
FROM DATAPOINTS
WHERE DATAPOINTS.timeID = t.timeID and DATAPOINTS.VarID = 2) as value
FROM
(SELECT timeID
FROM TIMESTAMPS
WHERE TRAININGS_ID = 96) as t;
The command EXPLAIN QUERY PLAN delivers:
0|0|0|SCAN TABLE TIMESTAMPS
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SCAN TABLE DATAPOINTS
This basically works.
But there are two problems:
Minor problem: if there is a timeID for which no data for the requested VarID is available, I get a line with the value None.
I would prefer this line to be skipped.
Big problem: the search is incredibly slow (approx 5 minutes using http://sqlitebrowser.org/).
How do I best improve the performance?
Are there better ways to formulate the SELECT command, or should I modify the database structure itself?
OK, based on the hints I got, I could dramatically accelerate the search by applying indexes as follows:
CREATE INDEX IF NOT EXISTS DP_Index on DATAPOINTS (VarID,timeID,DataID);
CREATE INDEX IF NOT EXISTS TS_Index on TIMESTAMPS(TRAININGS_ID,timeID);
The EXPLAIN QUERY PLAN output now reads as:
0|0|0|SEARCH TABLE TIMESTAMPS USING COVERING INDEX TS_Index (TRAININGS_ID=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE DATAPOINTS USING INDEX DP_Index (VarID=? AND timeID=?)
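As a side note, the two correlated scalar subqueries can also be written as a plain join, which uses the same indexes and additionally drops the rows where no value exists for the requested VarID (the None rows mentioned above, assuming at most one value per (timeID, VarID)):
SELECT t.TRAINING_TIME_SECONDS, d.value
FROM TIMESTAMPS t
JOIN DATAPOINTS d ON d.timeID = t.timeID AND d.VarID = 2
WHERE t.TRAININGS_ID = 96;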
Thanks for your comments.

Oracle: constraint check number range collision

Is there any way to check that number ranges do not intersect, using a constraint? Example:
CREATE TABLE "AGE_CATEGORIES" (
"AGE_CATEGORY_ID" CHAR(2 BYTE) NOT NULL PRIMARY KEY,
"NAME" NVARCHAR2(32) NOT NULL,
"RANGE_FROM" NUMBER(*,0) NOT NULL,
"RANGE_TO" NUMBER(*,0) NOT NULL,
CONSTRAINT "UK_AGE_CATEGORIES_NAME" UNIQUE ("NAME"),
CONSTRAINT "CHK_AGE_CATEGORIES_RANGE_COLLISION" CHECK (
???
) ENABLE
);
The question marks in the code above mean something like:
(SELECT COUNT("AGE_CATEGORY_ID")
FROM "AGE_CATEGORIES" AC
WHERE "RANGE_FROM" < AC."RANGE_TO"
AND "RANGE_TO" > AC."RANGE_FROM") = 0
So I need to check that a new age category does not intersect any other interval stored in this table. Is that possible?
It can be done, but involves creating materialized views with constraints - see my blog post. However this approach would need to be carefully considered as it could be a performance hit. In reality this sort of logic is not checked via constraints, only via procedural code in APIs or triggers.
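Roughly, the materialized-view trick has the shape sketched below (illustrative only: the MV and constraint names are made up, it assumes the existing data is already free of overlaps, and fast-refresh-on-commit materialized views come with prerequisites and restrictions that are not shown here):
CREATE MATERIALIZED VIEW LOG ON age_categories WITH ROWID;

CREATE MATERIALIZED VIEW age_category_overlaps
REFRESH FAST ON COMMIT
AS
SELECT a.ROWID AS a_rid, b.ROWID AS b_rid
FROM age_categories a, age_categories b
WHERE a.age_category_id < b.age_category_id
AND a.range_from <= b.range_to
AND b.range_from <= a.range_to;

-- the view must stay empty; any transaction that creates an overlap fails at commit
ALTER TABLE age_category_overlaps ADD CONSTRAINT chk_no_range_overlap CHECK (1 = 0);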

Very slow update on a relatively small table in PostgreSQL

Well, I have the following table (info from pgAdmin):
CREATE TABLE comments_lemms
(
comment_id integer,
freq integer,
lemm_id integer,
bm25 real
)
WITH (
OIDS=FALSE
);
ALTER TABLE comments_lemms OWNER TO postgres;
-- Index: comments_lemms_comment_id_idx
-- DROP INDEX comments_lemms_comment_id_idx;
CREATE INDEX comments_lemms_comment_id_idx
ON comments_lemms
USING btree
(comment_id);
-- Index: comments_lemms_lemm_id_idx
-- DROP INDEX comments_lemms_lemm_id_idx;
CREATE INDEX comments_lemms_lemm_id_idx
ON comments_lemms
USING btree
(lemm_id);
And one more table:
CREATE TABLE comments
(
id serial NOT NULL,
nid integer,
userid integer,
timest timestamp without time zone,
lemm_length integer,
CONSTRAINT comments_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE comments OWNER TO postgres;
-- Index: comments_id_idx
-- DROP INDEX comments_id_idx;
CREATE INDEX comments_id_idx
ON comments
USING btree
(id);
-- Index: comments_nid_idx
-- DROP INDEX comments_nid_idx;
CREATE INDEX comments_nid_idx
ON comments
USING btree
(nid);
In comments_lemms there are 8 million entries, in comments 270 thousand.
I'm performing the following SQL query:
update comments_lemms set bm25=(select lemm_length from comments where id=comment_id limit 1)
It runs for more than 20 minutes and I stop it because pgAdmin looks like it's about to crash.
Is there any way to modify this query, or the indexes, or anything else in my database, to speed things up a bit? I have to run some similar queries in the future, and it's quite painful to wait more than 30 minutes for each one.
In comments_lemms there are 8 million entries, in comments 270 thousand. I'm performing the following SQL query:
update comments_lemms set bm25=(select lemm_length from comments where id=comment_id limit 1)
In other words, you're making it go through 8M entries, and for each row you're doing a nested loop with an index lookup. PG won't rewrite/optimize it because of the limit 1 clause.
Try this instead:
update comments_lemms set bm25 = comments.lemm_length
from comments
where comments.id = comments_lemms.comment_id;
It should do two seq scans and hash or merge join them together, then proceed with the update in one go.
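If you want to confirm that before running the full update, you can look at the plan first (a sketch; the exact plan depends on your settings and statistics):
EXPLAIN
update comments_lemms set bm25 = comments.lemm_length
from comments
where comments.id = comments_lemms.comment_id;
The output should show a single join (for example a Hash Join) between scans of the two tables, rather than a per-row subplan.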
