INNER JOIN Optimization

INNER JOIN Optimization - caching

I have two tables to join. TABLE_A (contains column 'a') and TABLE_BC (contains columns 'b' and 'c').
There is a condition on TABLE_BC. The two tables are joined by 'rowid'.
Something like:
SELECT a, b, c FROM main.TABLE_A INNER JOIN main.TABLE_BC WHERE (b > 10.0 AND c < 10.0) ON main.TABLE_A.rowid = main.TABLE_BC.rowid ORDER BY a;
Alternatively:
SELECT a, b, c FROM main.TABLE_A AS s1 INNER JOIN (SELECT rowid, b, c FROM main.TABLE_BC WHERE (b > 10.0 AND c < 10.0)) AS s2 ON s1.rowid = s2.rowid ORDER BY a;
I need to do this a couple of time with different TABLE_A, but TABLE_BC does not change... I could therefore speed things up by creating a temporary in-memory database (mem) for the constant part of the query.
CREATE TABLE mem.cache AS SELECT rowid, b, c FROM main.TABLE_BC WHERE (b > 10.0 AND c < 10.0);
followed by (many)
SELECT a, b, c FROM main.TABLE_A INNER JOIN mem.cache ON main.TABLE_A.rowid = mem.cache.rowid ORDER BY a;
I get the same result set from all the queries above, but the last option is by far the fastest one.
The problem is that I would like to avoid splitting the query into two parts. I would expect SQLite to do the same thing for me automatically (at least in the second scenario), but it does not seem to happen... Why is that?
Thanks.

SQLite is pretty light on optimization. The general rule of thumb: SmallTable Inner Join BigTable is faster than the reverse.
That being said I wonder if your first query would run faster in the following form:
SELECT a, b, c
FROM main.TABLE_A
INNER JOIN main.TABLE_BC ON main.TABLE_A.rowid = main.TABLE_BC.rowid
WHERE (b > 10.0 AND c < 10.0)
ORDER BY a;

Answer from the SQLite User Mailing List:
In short, because SQLite cannot read your mind.
To understand the answer compare speeds of executing one query (with
one TABLE_A) and creating an in-memory database, creating a table in
it and using that table in one query (with the same TABLE_A). I bet
the first option (straightforward query without in-memory database)
will be much faster. So SQLite selects the fastest way to execute your
query. It cannot predict what the future queries will be to understand
how to execute the whole set of queries faster. You can do that and
you should split your query in two parts.
Pavel

Related

Natural join vs Cartesian product with equality condition [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Explicit vs implicit SQL joins
I want to know the difference in performance of
select * from A,B,C where A.x = B.y and B.y = C.z
and
select * from A INNER JOIN B on A.x = B.y INNER JOIN C on B.y = C.z
Basically i want to know if inner join performs better than cartesian product?
Also, in inner join is cartesian product carried out internally?

First of All these two Operations are for Two different purposes , While Cartesian Product provides you a result made by joining each row from one table to each row in another table. while An inner join (sometimes called a simple join ) is a
join of two or more tables that returns only those rows that satisfy the join condition. Now coming to what You have Written here :
In case of Cartesian product First A table comprising of A,B,C is created and after that on the basis of what ever condition is given,we Get result. But as you see it's heavy process. On the other hand Inner join only chooses those result which are really fulfilling the given condition .Hence it's a better solution for achieving end results. First one is abuse of SQL language.

How to effeciently select data from two tables?

I have two tables: A, B.
A has prisoner_id and prisoner_name columns.
B has all other info about prisoners included prisoner_name column.
First I select all of the data that I need from B:
WITH prisoner_datas AS
(SELECT prisoner_name, ... FROM B WHERE ...)
Then I want to know all of the id of my prisoner_datas. To do this I need to combine information by prisoner_name column, because it's common for both tables
I did the following
SELECT A.prisoner_id, prisoner_datas.prisoner_name, prisoner_datas. ...,
FROM A, prisoner_datas
WHERE A.prisoner_name = prisoner_datas.prisoner_name
But it works very slow. How can I improve performance?

Add an index on the prisoner_name join column in the B table. Then the following join should have some performance improvement:
SELECT
A.prisoner_id,
B.prisoner_name,
B.prisoner_datas.id -- and other columns if needed
FROM A
INNER JOIN B
ON A.prisoner_name = B.prisoner_name
Note here that I used an explicit join syntax here. It isn't required, and the query plan might not change, but it makes the query easier to read. I don't think the CTE will change much, but the lack of an index on the join column should be important here.

Worse query plan with a JOIN after ANALYZE

I see that running ANALYZE results in significantly poor performance on a particular JOIN I'm making between two tables.
Suppose the following schema:
CREATE TABLE a ( id INTEGER PRIMARY KEY, name TEXT );
CREATE TABLE b ( a NOT NULL REFERENCES a, value INTEGER, PRIMARY KEY(a, b) );
CREATE VIEW ab AS SELECT a.name, b.text, MAX(b.value)
FROM a
JOIN b ON b.a = a.id;
GROUP BY a.id
ORDER BY a.name
Table a is approximately 10K rows, table b is approximately 48K rows (~5 rows per row in table a).
Before ANALYZE
Now when I run the following query:
SELECT * FROM ab;
The query plan looks as follows:
1|0|0|SCAN TABLE b
1|1|1|SEARCH TABLE a USING INTEGER PRIMARY KEY (rowid=?)
This is a good plan, b is larger and I want it to be in the outer loop, making use of the index in table a. It finishes well within a second.
After ANALYZE
When I execute the same query again, the query plan results in two table scans:
1|0|1|SCAN TABLE a
1|1|0|SCAN TABLE b
This is far for optimal. For some reason the query planner thinks that an outer loop of 10K rows and an inner loop of 48K rows is a better fit. This takes about 1.5 minute to complete.
Should I adapt the index in table b to make it work after ANALYZE? Anything else to change to the indexing/schema?
I just try to understand the problem here. I worked around it using a CROSS JOIN, but that feels dirty and I don't really understand why the planner would go with a plan that is orders of magnitude slower than the un-analyzed plan. It seems to be related to GROUP BY, since the query planner puts table b in the outer loop without it (but that renders the query useless for what I want).

Accidentally found the answer by adjusting the GROUP BY clause in the view definition. Instead of joining on a.id, I group on b.a instead, although they have the same values.
CREATE VIEW ab AS SELECT a.name, b.text, MAX(b.value)
FROM a
JOIN b ON b.a = a.id;
GROUP BY b.a -- <== changed this from a.id to b.a
ORDER BY a.name
I'm still not entirely sure what the difference is, since it groups the same data.

Multiple table join in hive

I have migrated Teradata tables' data into hive .
Now I have to build summary tables on top of imported data. Summary table needs to be built from five source tables
If I go with joins I'll need to join five tables is it possible in hive ? or should I break the query in five parts?
what should be advisable approach for this problem?
Please suggest

Five way joins in hive are of course possible and also (naturally) likely slow to very slow.
You should consider co-partitioning the tables on
identical partition columns
identical number of partitions
Other options include hints. For example consider if one of the tables were large and the others small. You may then be able to use streamtble hint
Assuming a is large:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1) join d on (d.key = c.key) join e on (e.key = d.key)
Adapted from : https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
:
All five tables are joined in a single map/reduce job and the values
for a particular value of the key for tables b, c,d, and e are
buffered in the memory in the reducers. Then for each row retrieved
from a, the join is computed with the buffered rows. If the
STREAMTABLE hint is omitted, Hive streams the rightmost table in the
join.
Another hint is the mapjoin that is useful to cache small tables in memory.
Assuming a is large and b,c,d,e are small enough to fit in memory of each mapper:
SELECT /*+ MAPJOIN(b,c,d,e) */ a.val, b.val, c.val, d.val, e.val
FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
join d on (d.key = c.key) join e on (e.key = d.key)

Yes, you can join multiple tables in a single query. This allows many opportunities for Hive to make optimizations that couldn't be done if you broke it into separate queries.

Entity Framework "Joins" resulting in returning entire table from SQL

We are writing entity lambda expression query like this. But when we checked in profile. There were almost all the tables which were used in join returning entire table to the .net linq queries.
We have few transaction tables which has thousands of records. which is causing performance issue.
Please let us know if we can avoid table returning entire rows to .net
var result = (from f in f
join a in this.Context.a on f.primeryKey equals a.primeryKey
join d in this.Context.d on f.secondid equals d.secondid
join t in this.Context.t on d.thirdId equals t.thirdId
where t.isfoo && pfIds.Contains(a.fourthId.HasValue ? a.fourthId.Value : -1)
select f).Distinct().ToList();

Well, no real answer, for that I don't have enough info, but a few remarks to improve your query.
First remark: Don't do Contains and HasValue, because Linq won't SQL-ize these operations. I'm also not quite sure about the this.Context. stuff.
Second: NULL won't join in smart joins.
Third: Instead of selecting f, you'd typically select only a few fields of f that you need.

You'll need to rewrite your query. EF really needs to get all lines to utilize operator ? in order to evaluate value in a.fourthId column. I believe that
var result = (from f in f
join a in this.Context.a on f.primeryKey equals a.primeryKey
join d in this.Context.d on f.secondid equals d.secondid
join t in this.Context.t on d.thirdId equals t.thirdId
where t.isfoo && pfIds.Contains(a.fourthId)
select f).Distinct().ToList();
would meet your needs without necessary overhead, that evaluation seems to be superfluous.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

INNER JOIN Optimization - caching

Related

Natural join vs Cartesian product with equality condition [duplicate]

How to effeciently select data from two tables?

Worse query plan with a JOIN after ANALYZE

Multiple table join in hive

Entity Framework "Joins" resulting in returning entire table from SQL

Categories

Resources