sqlite painfully slow querying even with index - performance

I have a very trivial database with a single table:
CREATE TABLE records (
id INTEGER PRIMARY KEY AUTOINCREMENT,
symbol VARCHAR(20) NOT NULL,
time_ts INTEGER NOT NULL,
open_ts INTEGER NOT NULL,
close_ts INTEGER NOT NULL,
open_price REAL NOT NULL,
high_price REAL NOT NULL,
low_price REAL NOT NULL,
close_price REAL NOT NULL,
trades_count INTEGER NOT NULL,
volume_amount REAL NOT NULL,
quote_asset_volume REAL NOT NULL,
taker_buy_base_asset_volume REAL NOT NULL,
taker_buy_quote_asset_volume REAL NOT NULL)
and an index:
CREATE INDEX symbol_index ON records (symbol)
The size of the database is 12.63 GB.
I am running this query:
SELECT
symbol,
MAX(close_ts) max_close_ts,
MIN(close_ts) min_close_ts
FROM records
GROUP BY symbol
It takes about a minute to execute.
As you can see, an index exists on the symbol column. However, even with it, the query is painfully slow.
Even a query like:
select count(id) from records;
Takes about 77 seconds to execute. The total number of rows in the table is 115,944,904.
I expect the record count to roughly double in the future. Is there anything I can do to make these queries faster? Even with indexes on the primary key and on the symbol column I am getting quite bad performance.
Have I hit some kind of limit?

You could create a covering index to avoid accessing the table (drop or rename the existing symbol_index first, since that name is already taken):
CREATE INDEX symbol_index ON records (symbol,close_ts)
SELECT
symbol,
MAX(close_ts) max_close_ts,
MIN(close_ts) min_close_ts
FROM records
GROUP BY symbol;

You should use EXPLAIN QUERY PLAN and look at the scan details; it will show whether your query is actually using the index you created. You should also create covering indices (https://www.sqlite.org/queryplanner.html#covidx) for better performance.
Example:
EXPLAIN QUERY PLAN
select count(id) from records;
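As an illustration (the exact wording of the plan output varies between SQLite versions), you can run the same check on the grouped query once the covering index from the other answer exists; the plan should then mention a covering index instead of a full table scan:
EXPLAIN QUERY PLAN
SELECT symbol, MAX(close_ts) max_close_ts, MIN(close_ts) min_close_ts
FROM records
GROUP BY symbol;
-- expect a line along the lines of "SCAN ... USING COVERING INDEX ..."
-- rather than a plain "SCAN TABLE records"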

As Lukasz Szozda said, the index
CREATE INDEX symbol_index ON records (symbol,close_ts)
should make your query faster, because you have two aggregates over each group.
This way, the DBMS can skip all the intermediate rows for each distinct entry in the symbol column.
The gain is proportional to the number of symbols: the fewer distinct symbol values you have, the greater the speed increase of the query.
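To make that skipping explicit, here is a minimal sketch of an equivalent formulation, assuming the (symbol, close_ts) index above is in place; each correlated subquery can be answered by a single seek to one end of that symbol's index range:
SELECT s.symbol,
       (SELECT MAX(close_ts) FROM records WHERE symbol = s.symbol) AS max_close_ts,
       (SELECT MIN(close_ts) FROM records WHERE symbol = s.symbol) AS min_close_ts
FROM (SELECT DISTINCT symbol FROM records) AS s;
Whether this or the plain GROUP BY ends up faster depends on the SQLite version, so compare both with EXPLAIN QUERY PLAN.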

Related

Find the best way to traverse an Oracle table

I have an Oracle table. The table's DDL is as follows (it does not have a primary key):
create table CLIENT_ACCOUNT
(
CLIENT_ID VARCHAR2(18) default ' ' not null,
ACCOUNT_ID VARCHAR2(18) default ' ' not null,
......
)
create unique index UK_ACCOUNT
on CLIENT_ACCOUNT (CLIENT_ID, ACCOUNT_ID)
The data volume is huge, maybe 100M records. I want to traverse the whole table's data in batches.
Currently I try to use the table's index to batch through it, but I am running into Oracle syntax problems.
-- I want to use this SQL, but it raises a syntax error.
-- The idea is to use the B-tree index to locate the start position, but it does not work.
select * from CLIENT_ACCOUNT
WHERE (CLIENT_ID, ACCOUNT_ID) > (1,2)
AND ROWNUM < 1000
ORDER BY CLIENT_ID, ACCOUNT_ID
Is there a fast way to batch through the table data?
Wild guess:
select * from CLIENT_ACCOUNT
WHERE CLIENT_ID > '1'
and ACCOUNT_ID > '2'
AND ROWNUM < 1000;
It would at least compile, although whether it correctly implements your business logic is a different matter. Note that I have cast your filter criteria to strings. This is because your columns have a string datatype and you are defaulting them to spaces, so there's a high probability those columns contain non-numeric values.
If this doesn't solve your problem, please edit your question with more details; sample input data and expected output is always helpful in these situations.
Your data model seems odd.
Your columns are defined as varchar2. So why is your criteria numeric?
Also, why do you default the key columns to space? It would be better to leave unpopulated values as null. (To be clear, NULL is not a good thing in an indexed column, it's just better than a space.)
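If the intent is keyset (seek) pagination over the unique index, a common pattern is the one sketched below; the column names come from the question, and the bind values :last_client / :last_account are hypothetical placeholders for the last row of the previous batch. Note that ROWNUM is applied before ORDER BY, so the ordering has to happen in an inner query:
SELECT *
FROM (
    SELECT ca.*
    FROM CLIENT_ACCOUNT ca
    WHERE ca.CLIENT_ID > :last_client
       OR (ca.CLIENT_ID = :last_client AND ca.ACCOUNT_ID > :last_account)
    ORDER BY ca.CLIENT_ID, ca.ACCOUNT_ID
)
WHERE ROWNUM < 1000;
The composite predicate keeps the seek on the leading index column while still resuming exactly after the last processed key.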

WITH Clause performance issue in Oracle 11g

Table myfirst3 has 4 columns and 1.2 million records.
Table mtl_object_genealogy has over 10 million records.
Running the code below takes a very long time. How can I tune this code, using WITH options?
WITH level1 as (
SELECT mln_parent.lot_number,
mln_parent.inventory_item_id,
gen.lot_num ,--fg_lot,
gen.segment1,
gen.rcv_date
FROM mtl_lot_numbers mln_parent,
(SELECT MOG1.parent_object_id,
p.segment1,
p.lot_num,
p.rcv_date
FROM mtl_object_genealogy MOG1 ,
myfirst3 p
START WITH MOG1.object_id = p.gen_object_id
AND (MOG1.end_date_active IS NULL OR MOG1.end_date_active > SYSDATE)
CONNECT BY nocycle PRIOR MOG1.parent_object_id = MOG1.object_id
AND (MOG1.end_date_active IS NULL OR MOG1.end_date_active > SYSDATE)
UNION all
SELECT p1.gen_object_id,
p1.segment1,
p1.lot_num,
p1.rcv_date
FROM myfirst3 p1 ) gen
WHERE mln_parent.gen_object_id = gen.parent_object_id )
select /*+ NO_CPU_COSTING */ *
from level1;
Execution plan: (not included here)
The table definitions:
CREATE TABLE APPS.MYFIRST3
(
TO_ORGANIZATION_ID NUMBER,
LOT_NUM VARCHAR2(80 BYTE),
ITEM_ID NUMBER,
FROM_ORGANIZATION_ID NUMBER,
GEN_OBJECT_ID NUMBER,
SEGMENT1 VARCHAR2(40 BYTE),
RCV_DATE DATE
);
CREATE TABLE INV.MTL_OBJECT_GENEALOGY
(
OBJECT_ID NUMBER NOT NULL,
OBJECT_TYPE NUMBER NOT NULL,
PARENT_OBJECT_ID NUMBER NOT NULL,
START_DATE_ACTIVE DATE NOT NULL,
END_DATE_ACTIVE DATE,
GENEALOGY_ORIGIN NUMBER,
ORIGIN_TXN_ID NUMBER,
GENEALOGY_TYPE NUMBER
);
CREATE INDEX INV.MTL_OBJECT_GENEALOGY_N1 ON INV.MTL_OBJECT_GENEALOGY(OBJECT_ID);
CREATE INDEX INV.MTL_OBJECT_GENEALOGY_N2 ON INV.MTL_OBJECT_GENEALOGY(PARENT_OBJECT_ID);
Your explain plan shows some very big numbers. The optimizer reckons the final result set will be about 3,227,000,000,000 rows. Just returning that many rows will take some time.
All table accesses are Full Table Scans. As you have big tables that will eat time too.
As for improvements, it's pretty hard for us to understand the logic of your query. This is your data model, your business rules, your data. You haven't explained anything, so all we can do is guess.
Why are you using the WITH clause? You only use the level1 result set once, so just use a regular FROM clause.
Why are you using UNION ALL? That operation just duplicates the records retrieved from myfirst3 (all those values are already included as rows where MOG1.object_id = p.gen_object_id).
The MERGE JOIN CARTESIAN operation is interesting. Oracle uses it to implement transitive closure. It is an expensive operation but that's because treewalking a hierarchy is an expensive thing to do. It is unfortunate for you that you are generating all the parent-child relationships for a table with 27 million records. That's bad.
The full table scans aren't the problem. There are no filters on myfirst3, so obviously the database has to get all its records. If there were one parent for each myfirst3 record, that would be about 10% of the contents of mtl_object_genealogy, so a full table scan would be efficient; but you're rolling up the entire hierarchy, so you're effectively looking at a much greater chunk of the table.
Your indexes are irrelevant in the face of such numbers. What might help is a composite index on mtl_object_genealogy(OBJECT_ID, PARENT_OBJECT_ID, END_DATE_ACTIVE).
You want all the levels of PARENT_OBJECT_ID for the records in myfirst3. If you run this query often and mtl_object_genealogy is a slowly changing table you should consider materializing the transitive closure into a table which just has records for all the permutations of leaf records and parents.
To sum up:
Ditch the WITH clause
Drop the UNION ALL
Tune the tree-walk with a composite index (or by materializing it); a sketch of both follows.
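A minimal sketch of those last two points, assuming the DDL from the question; the closure table name MOG_CLOSURE is made up for illustration:
-- Composite index to support the tree-walk
CREATE INDEX INV.MTL_OBJECT_GENEALOGY_N3
  ON INV.MTL_OBJECT_GENEALOGY (OBJECT_ID, PARENT_OBJECT_ID, END_DATE_ACTIVE);

-- Materialize the transitive closure: every object paired with each of its
-- ancestors, so the hierarchy is walked once rather than on every query.
CREATE TABLE MOG_CLOSURE AS
SELECT CONNECT_BY_ROOT OBJECT_ID AS OBJECT_ID,
       PARENT_OBJECT_ID
FROM   INV.MTL_OBJECT_GENEALOGY
CONNECT BY NOCYCLE PRIOR PARENT_OBJECT_ID = OBJECT_ID
       AND (END_DATE_ACTIVE IS NULL OR END_DATE_ACTIVE > SYSDATE);
With a closure table like this, the query becomes a plain join from myfirst3 to MOG_CLOSURE instead of a per-row tree-walk.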

Benchmark: bigint vs int on PostgreSQL

I want to increase my database performance. In a project, all tables went from int to bigint, which I think is a bad choice not only regarding storage (int requires 4 bytes, bigint requires 8 bytes) but also regarding performance.
So I created a small table with 10 million entries, using a Python script:
import uuid

rows = 10000000
output = 'insert_description_bigint.sql'
set_schema = "SET search_path = norma;\n"
with open(output, 'w') as f:
    f.write(set_schema)
    for i in range(1, rows):
        random_string = uuid.uuid4()
        query = "insert into description_bigint (description_id, description) values (%d, '%s');\n"
        f.write(query % (i, random_string))
And this is how I created my two tables:
-- BIGINT
DROP TABLE IF EXISTS description_bigint;
CREATE TABLE description_bigint
(
description_id BIGINT PRIMARY KEY NOT NULL,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
select count(1) from description_bigint;
select * from description_bigint;
select * from description_bigint where description_id = 9999999;
-- INT
DROP TABLE IF EXISTS description_int;
CREATE TABLE description_int
(
description_id INT PRIMARY KEY NOT NULL,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
After inserting all this data, I ran a query against both tables to measure the difference between them. And to my surprise, they both have the same performance:
select * from description_bigint; -- 11m55s
select * from description_int; -- 11m55s
Am I doing something wrong with my benchmark? Shouldn't int be faster than bigint? Especially since the primary key is by definition an index, which means creating an index for bigint should be slower than creating an index for int with the same amount of data, right?
I know this small thing alone won't make a huge impact on my database's performance, but I want to ensure that we are using best practices and keeping performance in mind here.
On a 64-bit system the two tables are nearly identical. The column description_id in description_int occupies 8 bytes (4 for the integer and 4 as alignment padding). Try this test:
select
pg_relation_size('description_int')/10000000 as table_int,
pg_relation_size('description_bigint')/10000000 as table_bigint,
pg_relation_size('description_int_pkey')/10000000 as index_int,
pg_relation_size('description_bigint_pkey')/10000000 as index_bigint;
The average row size of both tables is virtually the same. This is because the integer column occupies 8 bytes (4 bytes for a value and 4 bytes of alignment) exactly like bigint (8 bytes for a value without a filler). The same applies to index entries. This is a special case, however. If we add one more integer column to the first table:
CREATE TABLE two_integers
(
description_id INT PRIMARY KEY NOT NULL,
one_more_int INT,
description VARCHAR(200),
constraint description_id_positive CHECK (description_id >= 0)
);
the average row size should remain the same because the first 8 bytes will be used for two integers (without filler).
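If you want to see the effect per row rather than per table, a quick sanity check (assuming the tables from the question are already populated) is pg_column_size() on a whole row:
-- Row sizes should come out virtually identical for the int and bigint tables
SELECT pg_column_size(t.*) AS int_row_bytes    FROM description_int    t LIMIT 1;
SELECT pg_column_size(t.*) AS bigint_row_bytes FROM description_bigint t LIMIT 1;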
Find more details in Calculating and saving space in PostgreSQL.

Accelerate SQLite Query

I'm currently learning SQLite (called from Python).
Following my previous question (Reorganising Data in SQLite), I want to store multiple time series (training data) in my database.
I have defined the following fields:
CREATE TABLE VARLIST
(
VarID INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL
)
CREATE TABLE DATAPOINTS
(
DataID INTEGER PRIMARY KEY,
timeID INTEGER,
VarID INTEGER,
value REAL
)
CREATE TABLE TIMESTAMPS
(
timeID INTEGER PRIMARY KEY AUTOINCREMENT,
TRAININGS_ID INT,
TRAINING_TIME_SECONDS FLOAT
)
VARLIST has 8 entries, TIMESTAMPS 1e5 entries and DATAPOINTS around 5e6.
When I now want to extract data for a given TRAININGS_ID and VarID, I try it like this:
SELECT
(SELECT TIMESTAMPS.TRAINING_TIME_SECONDS
FROM TIMESTAMPS
WHERE t.timeID = timeID) AS TRAINING_TIME_SECONDS,
(SELECT value
FROM DATAPOINTS
WHERE DATAPOINTS.timeID = t.timeID and DATAPOINTS.VarID = 2) as value
FROM
(SELECT timeID
FROM TIMESTAMPS
WHERE TRAININGS_ID = 96) as t;
The command EXPLAIN QUERY PLAN delivers:
0|0|0|SCAN TABLE TIMESTAMPS
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SCAN TABLE DATAPOINTS
This basically works.
But there are two problems:
Minor problem: if there is a timeID for which no data for the requested VarID is available, I get a line with the value None.
I would prefer such lines to be skipped.
Big problem: the search is incredibly slow (approx 5 minutes using http://sqlitebrowser.org/).
How do I best improve the performance?
Are there better ways to formulate the SELECT command, or should I modify the database structure itself?
OK, based on the hints I got, I could dramatically accelerate the search by applying indexes as follows:
CREATE INDEX IF NOT EXISTS DP_Index on DATAPOINTS (VarID,timeID,DataID);
CREATE INDEX IF NOT EXISTS TS_Index on TIMESTAMPS(TRAININGS_ID,timeID);
The EXPLAIN QUERY PLAN output now reads as:
0|0|0|SEARCH TABLE TIMESTAMPS USING COVERING INDEX TS_Index (TRAININGS_ID=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE TIMESTAMPS USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE DATAPOINTS USING INDEX DP_Index (VarID=? AND timeID=?)
Thanks for your comments.
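For what it's worth, here is a minimal sketch of a join formulation of the same query (using the IDs from the question). With the indexes above it should use the same index searches, and because it is an inner join it also skips rows where no value exists for the requested VarID, which addresses the minor problem:
SELECT ts.TRAINING_TIME_SECONDS,
       dp.value
FROM   TIMESTAMPS ts
JOIN   DATAPOINTS dp ON dp.timeID = ts.timeID
                    AND dp.VarID  = 2
WHERE  ts.TRAININGS_ID = 96;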

MySQL efficient "select id else insert" query

I have a MySQL table consisting of:
CREATE TABLE `url_list` (
`id` int(10) unsigned NOT NULL auto_increment,
`crc32` int(10) unsigned NOT NULL,
`url` varchar(512) NOT NULL,
PRIMARY KEY (`id`),
KEY `crc32` (`crc32`)
);
When inserting data into a related table I need to lookup the primary key from this table, and using the crc32 really speeds that up whilst allowing a small index. The URLs do need to be unique, but I'd like to avoid having more index than actual data.
If the value isn't present I need to insert it, but constructs such as INSERT IGNORE or ON DUPLICATE KEY either require me to put a unique index on the huge varchar, or don't take advantage of my index.
How can I "SELECT id else INSERT", whilst preserving the lookup speed for the 80-90% of hits that are already in the table?
I would recommend ditching the id column and the crc32 because they're not necessary.
You can use an MD5() hash to provide a fixed-length, virtually unique value computed from the lengthy URL data, and then use that hash as the primary key.
CREATE TABLE `url_list` (
`url_hash` BINARY(16) NOT NULL PRIMARY KEY,
`url` VARCHAR(512) NOT NULL
);
DELIMITER !!
CREATE TRIGGER `url_ins` BEFORE INSERT ON `url_list`
FOR EACH ROW
BEGIN
SET NEW.`url_hash` = UNHEX( MD5( NEW.`url` ) );
END!!
DELIMITER ;
Then you can use INSERT..ON DUPLICATE KEY UPDATE because unlike crc32, the hash should have a very low chance of collision.
Edit: see http://en.wikipedia.org/wiki/Birthday_attack. If you log 1 million distinct URLs per day for 2,000 years, the MD5 hashes of these URLs are still less likely to include a collision than your hard disk is to have an uncorrectable bit error.
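A minimal usage sketch under that schema (the URL literal is just an example): the trigger fills in url_hash, so the insert only needs the URL, and because the hash is the primary key you can compute the lookup key for the related table directly from the URL:
-- Insert if new; the BEFORE INSERT trigger computes url_hash.
INSERT INTO `url_list` (`url`)
VALUES ('http://example.com/some/long/path')
ON DUPLICATE KEY UPDATE `url` = `url`;

-- The key to store in the related table is simply the hash of the URL.
SELECT UNHEX(MD5('http://example.com/some/long/path')) AS url_hash;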
This website offers a solution to a similar problem.
