Very slow multi-table join in SQLite

It's not clear to me why this is such a slow query:
SELECT count(*) FROM PanelsMeta
INNER JOIN Publishers ON PanelsMeta.publisherid = Publishers.id
INNER JOIN Geographies ON Geographies.geo = Publishers.geo;
Using EXPLAIN QUERY PLAN, I can see that the lookups are indexed:
QUERY PLAN
|--SCAN TABLE PanelsMeta USING COVERING INDEX PanPubId
|--SEARCH TABLE Publishers USING INTEGER PRIMARY KEY (rowid=?)
`--SEARCH TABLE Geographies USING COVERING INDEX geos (geo=?)
The tables are of the following sizes:
sqlite> select count(*) from Publishers;
55
sqlite> select count(*) from PanelsMeta;
2948875
sqlite> select count(*) from Geographies;
37323
What am I doing wrong?
Variations I have tried produce similar query plans and are also tens of minutes slow:
SELECT count(*) FROM Geographies
LEFT JOIN Publishers ON Publishers.geo = Geographies.geo
LEFT JOIN PanelsMeta ON PanelsMeta.publisherid = Publishers.id;
# QUERY PLAN
# |--SCAN TABLE Geographies USING COVERING INDEX geos
# |--SEARCH TABLE Publishers USING COVERING INDEX PubGeo (geo=?)
# `--SEARCH TABLE PanelsMeta USING COVERING INDEX PanPubId (publisherid=?)
SELECT count(*) FROM Publishers
LEFT JOIN PanelsMeta ON PanelsMeta.publisherid = Publishers.id
LEFT JOIN Geographies ON Geographies.geo = Publishers.geo;
# QUERY PLAN
# |--SCAN TABLE Publishers USING COVERING INDEX PubGeo
# |--SEARCH TABLE PanelsMeta USING COVERING INDEX PanPubId (publisherid=?)
# `--SEARCH TABLE Geographies USING COVERING INDEX geos (geo=?)
Update
Schema information is below:
CREATE TABLE PanelsMeta(
id INTEGER PRIMARY KEY AUTOINCREMENT,
f1 TEXT,
f2 TEXT,
f3 TEXT,
f4 DATETIME,
f5 DATETIME,
f6 TEXT,
f7 TEXT,
publisherid INTEGER,
FOREIGN KEY(publisherid) REFERENCES Publishers(id) ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE INDEX ids ON PanelsMeta (id);
CREATE INDEX pp1 ON PanelsMeta (publisherid);
CREATE INDEX pp2 ON PanelsMeta (f1);
CREATE INDEX pp3 ON PanelsMeta (f1,publisherid);
and
CREATE TABLE Publishers(
id INTEGER PRIMARY KEY AUTOINCREMENT,
geo TEXT,
f3 TEXT NOT NULL,
f4 TEXT NOT NULL,
f5 TEXT,
f6 TEXT
);
CREATE INDEX zf3 ON Publishers (f3);
CREATE INDEX zgeo ON Publishers (Geo);
CREATE INDEX zf6 ON Publishers (f6);
CREATE INDEX zid ON Publishers (id);
CREATE INDEX zf3g ON Publishers (f3,geo);
CREATE INDEX zf3gf6 ON Publishers (f3,geo,f6);
and
CREATE TABLE Geographies(
id INTEGER PRIMARY KEY AUTOINCREMENT,
geo TEXT NOT NULL,
f3 TEXT NOT NULL,
f4 TEXT,
f5 DATETIME,
f6 TEXT,
f7 TEXT,
f8 JSON DEFAULT '{}',
f9 TEXT
);
CREATE INDEX g ON Geographies (geo);
CREATE INDEX gf3 ON Geographies (f3);

I had the same problem when I tried to INNER JOIN 6 tables with 1-100 rows each. Each table had only one column.
However, my full dataset is 18 GB and about 11 million rows.
I solved the problem by putting all the data in one table and then using a WHERE ... IN clause. It's strange, but it's way faster (about 1 second instead of a couple of minutes).
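A minimal sketch of that workaround applied to the tables from the question (my illustration, not from the commenter; the panel_geo staging table name is made up):

-- Materialize the join once, then filter with WHERE ... IN
-- instead of joining at query time.
CREATE TABLE panel_geo AS
SELECT pm.id AS panel_id, pub.geo
FROM PanelsMeta pm
JOIN Publishers pub ON pub.id = pm.publisherid;

SELECT count(*)
FROM panel_geo
WHERE geo IN (SELECT geo FROM Geographies);

-- Note: WHERE ... IN counts each panel row at most once, even if its geo
-- matches several Geographies rows. The original INNER JOIN counts it once
-- per match, which is probably where the minutes were going.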

Related

Find column names with unique constraints applied, for a table in Vertica database

In a Vertica database, I want to know the columns of a certain table where the UNIQUE constraint is applied.
Example:
CREATE TABLE dim1 ( c1 INTEGER,
c2 INTEGER,
c3 INTEGER,
UNIQUE (c1, c2)
);
I want to run a query where I enter the name of the table, "dim1", and the result would be "c1,c2".
For more info regarding UNIQUE (last line in the link): https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/AdministratorsGuide/Constraints/UniqueConstraints.htm
That's pretty easy to do by querying the system catalog, specifically V_CATALOG.CONSTRAINT_COLUMNS:
select column_name from V_CATALOG.CONSTRAINT_COLUMNS
where table_name = 'dim1' and constraint_type = 'u';
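If you want the result collapsed into a single string like "c1,c2", and your Vertica version supports the LISTAGG aggregate (an assumption; it is not available in all versions, so check your release's documentation), something along these lines should work:

select listagg(column_name) as unique_columns
from v_catalog.constraint_columns
where table_name = 'dim1' and constraint_type = 'u';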

Expensive subquery tuning with SQLite

I'm working on a small media/file management utility using SQLite for its persistent storage needs. I have a table of files:
CREATE TABLE file
( file_id INTEGER PRIMARY KEY AUTOINCREMENT
, file_sha1 BINARY(20)
, file_name TEXT NOT NULL UNIQUE
, file_size INTEGER NOT NULL
, file_mime TEXT NOT NULL
, file_add_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
);
And also a table of albums
CREATE TABLE album
( album_id INTEGER PRIMARY KEY AUTOINCREMENT
, album_name TEXT
, album_poster INTEGER
, album_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL
, FOREIGN KEY (album_poster) REFERENCES file(file_id)
);
to which files can be assigned
CREATE TABLE album_file
( album_id INTEGER NOT NULL
, file_id INTEGER NOT NULL
, PRIMARY KEY (album_id, file_id)
, FOREIGN KEY (album_id) REFERENCES album(album_id)
, FOREIGN KEY (file_id) REFERENCES file(file_id)
);
CREATE INDEX file_to_album ON album_file(file_id, album_id);
Part of the functionality is to list albums, exposing
the album id,
the album's name,
a poster image for that album, and
the number of files in the album
which currently uses this query:
SELECT a.album_id, a.album_name,
COALESCE(
a.album_poster,
(SELECT file_id FROM file
NATURAL JOIN album_file af
WHERE af.album_id = a.album_id
ORDER BY file.file_name LIMIT 1)),
(SELECT COUNT(file_id) AS file_count
FROM album_file WHERE album_id = a.album_id)
FROM album a
ORDER BY album_name ASC
The only "tricky" part of that query is that the album_poster column may be null, in which case COALESCE statement is used to just return the first file in the album as the "default poster".
With currently ~260000 files, ~2600 albums and ~250000 entries in the album_file table, this query takes over 10 seconds which makes for a not-so-great user experience. Here's the query plan:
0|0|0|SCAN TABLE album AS a
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|1|SEARCH TABLE album_file AS af USING COVERING INDEX album_to_file (album_id=?)
1|1|0|SEARCH TABLE file USING INTEGER PRIMARY KEY (rowid=?)
1|0|0|USE TEMP B-TREE FOR ORDER BY
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 2
2|0|0|SEARCH TABLE album_file USING COVERING INDEX album_to_file (album_id=?)
Replacing the COALESCE statement with just a.album_poster, sacrificing the auto-poster functionality, brings the query time down to a few milliseconds:
0|0|0|SCAN TABLE album AS a
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SEARCH TABLE album_file USING COVERING INDEX album_to_file (album_id=?)
0|0|0|USE TEMP B-TREE FOR ORDER BY
What I don't understand is that limiting the album listing to 1 or 1000 rows makes no difference. It seems SQLite runs the expensive subquery for the "default" poster on all albums, only to throw away most of the results when finally cutting the result set down to the LIMIT specified with the query.
Is there something I can do to make the original query substantially faster, especially given that I'm usually only querying a small subset (using LIMIT) of all rows for display?
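One thing worth trying (my sketch, not an answer from the thread; the page size of 30 is made up): apply the LIMIT to album in a derived table first, so the correlated subqueries only run for the rows that will actually be displayed:

SELECT a.album_id, a.album_name,
       COALESCE(
         a.album_poster,
         (SELECT af.file_id
          FROM album_file af
          JOIN file f ON f.file_id = af.file_id
          WHERE af.album_id = a.album_id
          ORDER BY f.file_name LIMIT 1)),
       (SELECT COUNT(*) FROM album_file WHERE album_id = a.album_id)
-- Limit the albums before the per-album subqueries are evaluated:
FROM (SELECT album_id, album_name, album_poster
      FROM album
      ORDER BY album_name ASC
      LIMIT 30) AS a
ORDER BY a.album_name ASC;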

Why does SQLite not use an index for queries on my many-to-many relation table?

It's been a while since I've written code, and I never used SQLite before, but many-to-many relationships used to be so fundamental, there must be a way to make them fast...
This is an abstracted version of my database:
CREATE TABLE a (_id INTEGER PRIMARY KEY, a1 TEXT NOT NULL);
CREATE TABLE b (_id INTEGER PRIMARY KEY, fk INTEGER NOT NULL REFERENCES a(_id));
CREATE TABLE d (_id INTEGER PRIMARY KEY, d1 TEXT NOT NULL);
CREATE TABLE c (_id INTEGER PRIMARY KEY, fk INTEGER NOT NULL REFERENCES d(_id));
CREATE TABLE b2c (fk_b NOT NULL REFERENCES b(_id), fk_c NOT NULL REFERENCES c(_id), CONSTRAINT PK_b2c_desc PRIMARY KEY (fk_b, fk_c DESC), CONSTRAINT PK_b2c_asc UNIQUE (fk_b, fk_c ASC));
CREATE INDEX a_a1 on a(a1);
CREATE INDEX a_id_and_a1 on a(_id, a1);
CREATE INDEX b_fk on b(fk);
CREATE INDEX b_id_and_fk on b(_id, fk);
CREATE INDEX c_id_and_fk on c(_id, fk);
CREATE INDEX c_fk on c(fk);
CREATE INDEX d_id_and_d1 on d(_id, d1);
CREATE INDEX d_d1 on d(d1);
I have put in every index I could think of, just to make sure (more than is reasonable, but that's not a problem, since the data is read-only). And yet on this query
SELECT count(*)
FROM a, b, b2c, c, d
WHERE a.a1 = "A"
AND a._id = b.fk
AND b._id = b2c.fk_b
AND c._id = b2c.fk_c
AND d._id = c.fk
AND d.d1 ="D";
the relation table b2c does not use any indexes:
0|0|2|SCAN TABLE b2c
0|1|1|SEARCH TABLE b USING INTEGER PRIMARY KEY (rowid=?)
0|2|0|SEARCH TABLE a USING INTEGER PRIMARY KEY (rowid=?)
0|3|3|SEARCH TABLE c USING INTEGER PRIMARY KEY (rowid=?)
0|4|4|SEARCH TABLE d USING INTEGER PRIMARY KEY (rowid=?)
The query is about two orders of magnitude too slow to be usable. Is there any way to make SQLite use an index on b2c?
Thanks!
In a nested loop join, the outermost table does not use an index for the join (because the database just goes through all rows anyway).
To be able to use an index for a join, the index and the other column must have the same affinity, which usually means that both columns must have the same type.
Change the types of the b2c columns to INTEGER.
If the lookups on a1 or d1 are very selective, using a or d as the outermost table might make sense, and would then allow an index to be used for the filter.
Try running ANALYZE.
If that does not help, you can force the join order with CROSS JOIN or INDEXED BY.
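A sketch combining those two suggestions (my illustration, not from the original answer):

-- Give the b2c columns INTEGER affinity, so they match the INTEGER
-- PRIMARY KEY columns they join against and the PK index becomes usable:
CREATE TABLE b2c (
  fk_b INTEGER NOT NULL REFERENCES b(_id),
  fk_c INTEGER NOT NULL REFERENCES c(_id),
  PRIMARY KEY (fk_b, fk_c)
);

-- If a.a1 = 'A' is very selective, CROSS JOIN fixes the join order in
-- SQLite (the planner will not reorder the tables), making a outermost:
SELECT count(*)
FROM a CROSS JOIN b CROSS JOIN b2c CROSS JOIN c CROSS JOIN d
WHERE a.a1 = 'A'
  AND a._id = b.fk
  AND b._id = b2c.fk_b
  AND c._id = b2c.fk_c
  AND d._id = c.fk
  AND d.d1 = 'D';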

T-SQL - wrong query execution plan behaviour

One of our queries degraded after generating load on the DB.
Our query is a join between 3 tables:
The Base table, which contains 10 M rows.
The EventPerson table, which contains 5,000 rows.
The EventPerson788 table, which is empty.
It seems that the optimizer scans the index on EventPerson instead of seeking. This is a script to replicate the issue:
--Create Tables
CREATE TABLE [dbo].[BASE](
[ID] [bigint] NOT NULL,
[IsActive] BIT
PRIMARY KEY CLUSTERED ([ID] ASC)
)ON [PRIMARY]
GO
CREATE TABLE [dbo].[EventPerson](
[DUID] [bigint] NOT NULL,
[PersonInvolvedID] [bigint] NULL,
PRIMARY KEY CLUSTERED ([DUID] ASC)
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [EventPerson_IDX] ON [dbo].[EventPerson]
(
[PersonInvolvedID] ASC
)
CREATE TABLE [dbo].[EventPerson788](
[EntryID] [bigint] NOT NULL,
[LinkedSuspectID] [bigint] NULL,
[sourceid] [bigint] NULL,
PRIMARY KEY CLUSTERED ([EntryID] ASC)
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[EventPerson788] WITH CHECK
ADD CONSTRAINT [FK7A34153D3720F84A]
FOREIGN KEY([sourceid]) REFERENCES [dbo].[EventPerson] ([DUID])
GO
ALTER TABLE [dbo].[EventPerson788] CHECK CONSTRAINT [FK7A34153D3720F84A]
GO
CREATE NONCLUSTERED INDEX [EventPerson788_IDX]
ON [dbo].[EventPerson788] ([LinkedSuspectID] ASC)
GO
--POPULATE BASE TABLE
DECLARE @I BIGINT=1
WHILE (@I<10000000)
BEGIN
begin transaction
INSERT INTO BASE(ID) VALUES(@I)
SET @I+=1
if (@I%10000=0 )
begin
commit;
end;
END
go
--POPULATE EventPerson TABLE
DECLARE @I BIGINT=1
WHILE (@I<5000)
BEGIN
BEGIN TRANSACTION
INSERT INTO EventPerson(DUID,PersonInvolvedID) VALUES(@I,(SELECT TOP 1 ID FROM BASE ORDER BY NEWID()))
SET @I+=1
IF(@I%10000=0 )
COMMIT TRANSACTION ;
END
GO
This is the query:
select
count(EventPerson.DUID)
from
EventPerson
inner loop join
Base on EventPerson.DUID = base.ID
left outer join
EventPerson788 on EventPerson.DUID = EventPerson788.sourceid
where
(EventPerson.PersonInvolvedID = 37909 or
EventPerson788.LinkedSuspectID = 37909)
AND BASE.IsActive = 1
Do you have any idea why the optimizer decides to use an index scan instead of an index seek?
Workarounds already tried:
Analyzed the tables and rebuilt statistics.
Rebuilt the indexes.
Tried the FORCESEEK hint.
None of the above persuaded the optimizer to do an index seek on EventPerson or on the Base table.
Thanks for your help.
The scan is there because of the or condition and the outer join against EventPerson788.
Either it will return rows from EventPerson when EventPerson.PersonInvolvedID = 37909, or when there exist rows in EventPerson788 where EventPerson788.LinkedSuspectID = 37909. The last part means that every row in EventPerson has to be checked against the join.
The fact that EventPerson788 is empty cannot be used by the query optimizer, since the query plan is saved to be reused later, when there might be matching rows in EventPerson788.
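One way around that (my addition, not part of the original answer) is OPTION (RECOMPILE), which compiles a fresh plan on every execution, so the optimizer can take the current, empty state of EventPerson788 into account, at the cost of a recompilation each time:

-- Same query as above, with a per-execution plan:
select count(EventPerson.DUID)
from EventPerson
inner loop join Base on EventPerson.DUID = base.ID
left outer join EventPerson788 on EventPerson.DUID = EventPerson788.sourceid
where (EventPerson.PersonInvolvedID = 37909
       or EventPerson788.LinkedSuspectID = 37909)
  and BASE.IsActive = 1
option (recompile);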
Update:
You can rewrite your query using a union all instead of or to get a seek in EventPerson.
select count(EventPerson.DUID)
from
(
select EventPerson.DUID
from EventPerson
where EventPerson.PersonInvolvedID = 1556 and
not exists (select *
from EventPerson788
where EventPerson788.LinkedSuspectID = 1556)
union all
select EventPerson788.sourceid
from EventPerson788
where EventPerson788.LinkedSuspectID = 1556
) as EventPerson
inner join BASE
on EventPerson.DUID=base.ID
where
BASE.IsActive=1
Well, you're asking SQL Server to count the rows of the EventPerson table - so why do you expect a seek to be better than a scan here?
For a COUNT, the SQL Server optimizer will almost always use a scan - it needs to count the rows, after all - all of them... it will do a clustered index scan, if no other non-nullable columns are indexed.
If you have an index on a small, non-nullable column (e.g. on a ID INT or something like that), it would probably do a scan on that index instead (less data to read to count all rows).
But in general: seek is great for selecting one or a few rows - but it sucks if you're dealing with all rows (like for a count)
You can easily observe this behavior if you're using the AdventureWorks sample database.
When doing a COUNT(*) on the Sales.SalesOrderDetail table (which has over 120,000 rows) like this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
then you'll get an index scan on IX_SalesOrderDetail_ProductID - it just doesn't pay off to do seeks on over 120000 entries!
However, if you do the same operation on a smaller set of data, like this:
SELECT COUNT(*) FROM Sales.SalesOrderDetail
WHERE ProductID = 897
then you get back 2 rows out of all of them - and SQL Server will now use an index seek on that same index.

ON DELETE CASCADE is very slow

I am using Postgres 8.4 on Windows 7 32-bit, with 4 GB RAM and a 2.5 GHz CPU.
I have a database in Postgres with 10 tables: t1, t2, t3, t4, t5, ..., t10.
t1 has as its primary key a sequence id, which is referenced as a foreign key by all the other tables.
Data is inserted into all tables: apart from t1, every table has nearly 50,000 rows, while t1 has a single row whose primary key is referenced from all the other tables. Then I insert a second row into t1 and again 50,000 rows with this new reference into the other tables.
The issue is when I want to delete all the data entries that are present in other tables:
delete from t1 where column1='1'
This query takes nearly 10 minutes to execute.
I also created indexes and tried again, but performance did not improve at all.
What can be done?
I have mentioned a sample schema below
CREATE TABLE t1
(
c1 numeric(9,0) NOT NULL,
c2 character varying(256) NOT NULL,
c3ver numeric(4,0) NOT NULL,
dmlastupdatedate timestamp with time zone NOT NULL,
CONSTRAINT t1_pkey PRIMARY KEY (c1),
CONSTRAINT t1_c1_c2_key UNIQUE (c2)
);
CREATE TABLE t2
(
c1 character varying(100),
c2 character varying(100),
c3 numeric(9,0) NOT NULL,
c4 numeric(9,0) NOT NULL,
tver numeric(4,0) NOT NULL,
dmlastupdatedate timestamp with time zone NOT NULL,
CONSTRAINT t2_pkey PRIMARY KEY (c3),
CONSTRAINT t2_fk FOREIGN KEY (c4)
REFERENCES t1 (c1) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT t2_c3_c4_key UNIQUE (c3, c4)
);
CREATE INDEX t2_index ON t2 USING btree (c4);
Let me know if there is anything wrong with the schema.
With bigger tables and more than just two or three values, you need an index on the referenced column (t1.c1) as well as on the referencing columns (t2.c4, ...).
But if your description is accurate, that cannot be the cause of the performance problem in your scenario. Since you have only 2 distinct values in t1, there is just no use for an index. A sequential scan will be faster.
Anyway, I re-enacted what you describe in Postgres 9.1.9:
CREATE TABLE t1
( c1 numeric(9,0) PRIMARY KEY,
c2 character varying(256) NOT NULL,
c3ver numeric(4,0) NOT NULL,
dmlastupdatedate timestamptz NOT NULL,
CONSTRAINT t1_uni_key UNIQUE (c2)
);
CREATE TEMP TABLE t2
( c1 character varying(100),
c2 character varying(100),
c3 numeric(9,0) PRIMARY KEY,
c4 numeric(9,0) NOT NULL,
tver numeric(4,0) NOT NULL,
dmlastupdatedate timestamptz NOT NULL,
CONSTRAINT t2_uni_key UNIQUE (c3, c4),
CONSTRAINT t2_c4_fk FOREIGN KEY (c4)
REFERENCES t1(c1) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
);
INSERT INTO t1 VALUES
(1,'OZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf', 234, now())
,(2,'agdsOZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf', 4564, now());
INSERT INTO t2
SELECT 'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, 'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, g, 2, 456, now()
from generate_series (1, 50000) g;
INSERT INTO t2
SELECT 'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, 'shOahaZGPIGp7tgp97tßp97tß97tgP?)/GP)7gf'
, g, 2, 789, now()
from generate_series (50001, 100000) g;
ANALYZE t1;
ANALYZE t2;
EXPLAIN ANALYZE DELETE FROM t1 WHERE c1 = 1;
Total runtime: 53.745 ms
DELETE FROM t1 WHERE c1 = 1;
58 ms execution time.
Ergo, there is nothing fundamentally wrong with your schema layout.
Minor enhancements:
You have a couple of columns defined as numeric(9,0) or numeric(4,0). Unless you have a good reason to do that, you are probably a lot better off using plain integer. It is smaller and faster overall. You can always add a CHECK constraint if you really need to enforce a maximum.
I would also use text instead of varchar(n).
And reorder the columns (at table creation time). As a rule of thumb, place fixed-length NOT NULL columns first: timestamp and integer columns first, numeric or text columns last.
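Putting those suggestions together, t1 might look like this (my sketch, not from the answer; the CHECK bound is just an example standing in for numeric(4,0)):

CREATE TABLE t1
( c1 integer PRIMARY KEY                  -- was numeric(9,0)
, dmlastupdatedate timestamptz NOT NULL   -- fixed-width NOT NULL columns first
, c3ver integer NOT NULL
    CHECK (c3ver BETWEEN 0 AND 9999)      -- enforce the old numeric(4,0) range
, c2 text NOT NULL UNIQUE                 -- variable-length column last
);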
