Can Mapside Join and Reduce side join have different O/P - hadoop

The below code is present in PROD and runs daily, I am trying to optimize it.
I see that set hive.auto.convert.join=FALSE; is making it to do an Reduce side join which runs for 2.5 hours and produces an row count of 2324381 records.
If i set hive.auto.convert.join=TRUE; then it does an Map side join and runs only for 20 minutes and produces an row count of 5766529 records.
I need to know why the row counts differ and is this correct ? is it okay the row counts differ ? i was under the impression that the O/P or the query should remain the same irrespective of which join is happening.
The source data remains the same in both the case and every other condition is the same expect for the hive setting i am changing.
INSERT OVERWRITE TABLE krish
SELECT
s.svcrqst_id
s.svcrqst_lupdusr_id,
s.svcrqst_lstupd_dts as svcrqst_lupdt,
f.crsr_lupdt,
s.svcrqst_crt_dts,
s.svcrqst_asrqst_ind,
s.svcrtyp_cd,
s.svrstyp_cd,
s.asdplnsp_psuniq_id as psuniq_id,
s.svcrqst_rtnorig_in,
s.cmpltyp_cd,
s.catsrsn_cd,
s.apealvl_cd,
s.cnstnty_cd,
s.svcrqst_vwasof_dt,
f.crsr_master_claim_index,
t.svcrqct_cds,
r.sum_reason_cd,
r.sum_reason
from
table1 s
left outer join
(
select distinct
lpad(trim(i_srtp_sr_sbtyp_cd), 3, '0') as i_srtp_sr_sbtyp_cd,
lpad(trim(i_srtp_sr_typ_cd), 3, '0') as i_srtp_sr_typ_cd,
sum_reason_cd,
sum_reason
from table2
) r
on lpad(trim(s.svcrtyp_cd), 3, '0')=r.i_srtp_sr_typ_cd
and lpad(trim(s.svrstyp_cd), 3, '0')=r.i_srtp_sr_sbtyp_cd
left outer join table3 f
on trim(s.svcrqst_id)=trim(f.crsr_sr_id)
left outer join table4 t
on t.svcrqst_id=s.svcrqst_id
where
( year(s.svcrqst_lstupd_dts)=${hiveconf:YEAR} and month(s.svcrqst_lstupd_dts)=${hiveconf:MONTH} and day(s.svcrqst_lstupd_dts)=${hiveconf:DAY} )
or
( year(f.crsr_lupdt)=${hiveconf:YEAR} and month(f.crsr_lupdt)=${hiveconf:MONTH} and day(f.crsr_lupdt)=${hiveconf:DAY} )
;

After doing some more research with my data, I created all my source tables with same column partitioned and bucketed and reran my HQL.
This time the number of rows for both map side join and reduce side join came with same counts.
I think on the previous query since the data was not partitioned the map side and reduce side joins have different output.

Related

Hash Join with Partition restriction from third table

my current problem is in 11g, but I am also interested in how this might be solved smarter in later versions.
I want to join two tables. Table A has 10 million rows, Table B is huge and has a billion of records across about a thousand partitions. One partition has around 10 million records. I am not joining on the partition key. For most rows of Table A, one or more rows in Table B will be found.
Example:
select * from table_a a
inner join table_b b on a.ref = b.ref
The above will return about 50 million rows, whereas the results come from about 30 partitions of table b. I am assuming a hash join is the correct join here, hashing table a and FTSing/index-scanning table b.
So, 970 partitions were scanned for no reason. And, I have a third query that could tell oracle which 30 partitions to check for the join.
Example of third query:
select partition_id from table_c
This query gives exactly the 30 partitions for the query above.
To my question:
In PL/SQL one can solve this by
select the 30 partition_ids into a variable (be it just a select listagg(partition_id,',') ... into v_partitions from table_c
Execute my query like so:
execute immediate 'select * from table_a a
inner join table_b b on a.ref = b.ref
where b.partition_id in ('||v_partitions||')' into ...
Let's say this completes in 10 minutes.
Now, how can I do this in the same amount of time with pure SQL?
Just simply writing
select * from table_a a
inner join table_b b on a.ref = b.ref
where b.partition_id in (select partition_id from table_c)
does not do the trick it seems, or I might be aiming at the wrong plan.
The plan I think I want is
hash join
table a
nested loop
table c
partition pruning here
table b
But, this does not come back in 10 minutes.
So, how to do this in SQL and what execution plan to aim at? One variation I have not tried yet that might be the solution is
nested loop
table c
hash join
table a
partition pruning here (pushed predicate from the join to c)
table b
Another feeling I have is that the solution might lie in joining table a to table c (not sure on what though) and then joining this result to table b.
I am not asking you to type everything out for me. Just a general concept of how to do this (getting partition restriction from a query) in SQL - what plan should I aim at?
thank you very much! Peter
I'm not an expert at this, but I think Oracle generally does the joins first, then applies the where conditions. So you might get the plan you want by moving the partition pruning up into a join condition:
select * from table_a a
inner join table_b b on a.ref = b.ref
and b.partition_id in (select partition_id from table_c);
I've also seen people try to do this sort of thing with an inline view:
select * from table_a a
inner join (select * from table_b
where partition_id in (select partition_id from table_c)) b
on a.ref = b.ref;
thank you all for your discussions with me on this one. In my case this was solved (not by me) by adding a join-path between table_c and table_a and by overloading the join conditions as below. In my case this was possible by adding column partition_id to table_a:
select * from
table_c c
JOIN table_a a ON (a.partition_id = c.partition_id)
JOIN table_b b ON (b.partition_id = c.partition_id and b.partition_id = a.partition_id and b.ref = a.ref)
And this is the plan you want:
leading(c,b,a) use_nl(c,b) swap_join_inputs(a) use_hash(a)
So you get:
hash join
table a
nested loop
table c
partition list iterator
table b

how to get the count(id) from the multiple tables using join query

I want to display count based on the id from multiple tables. for two tables it is working fine but for three tables it is not displaying data
this is my query for three tables it is not working
select r.req_id
, r.no_of_positions
, count(j.cand_id) as no_of_closure
, count(cis.cand_id)
from requirement r
join joined_candidates j
on r.req_id=j.req_id
join candidate_interview_schedule cis
on cis.req_id=r.req_id
where cis.interview_status='Interview Scheduled'
group by r.req_id, r.no_of_positions;
Changed to left joins incase value doens't exist in a table
Changed count to use an window function so counts are not artificially inflated by joins
moved where clause to join criteria as on a left join, it would negate the null values, making it operate like a inner join.
..MAYBE...
SELECt r.req_id
, r.no_of_positions
, count(j.cand_id) over (partition by J.cand_ID) as no_of_closure
, count(cis.cand_id) over (partition by cis.cand_id) as no_of_CIS_CNT
FROM requirement r
LEFT join joined_candidates j
on r.req_id=j.req_id
LEFT join candidate_interview_schedule cis
on cis.req_id=r.req_id
and cis.interview_status='Interview Scheduled'
GROUP BY r.req_id, r.no_of_positions;
or perhaps... (if I can assume j.cand_ID and cis.cand_ID are unique) also to eliminate artificial count increase due to 1:M joins
SELECt r.req_id
, r.no_of_positions
, count(distinct j.cand_id) as no_of_closure
, count(distinct cis.cand_id) as no_of_CIS_CNT
FROM requirement r
LEFT join joined_candidates j
on r.req_id=j.req_id
LEFT join candidate_interview_schedule cis
on cis.req_id=r.req_id
and cis.interview_status='Interview Scheduled'
GROUP BY r.req_id, r.no_of_positions;

Worse query plan with a JOIN after ANALYZE

I see that running ANALYZE results in significantly poor performance on a particular JOIN I'm making between two tables.
Suppose the following schema:
CREATE TABLE a ( id INTEGER PRIMARY KEY, name TEXT );
CREATE TABLE b ( a NOT NULL REFERENCES a, value INTEGER, PRIMARY KEY(a, b) );
CREATE VIEW ab AS SELECT a.name, b.text, MAX(b.value)
FROM a
JOIN b ON b.a = a.id;
GROUP BY a.id
ORDER BY a.name
Table a is approximately 10K rows, table b is approximately 48K rows (~5 rows per row in table a).
Before ANALYZE
Now when I run the following query:
SELECT * FROM ab;
The query plan looks as follows:
1|0|0|SCAN TABLE b
1|1|1|SEARCH TABLE a USING INTEGER PRIMARY KEY (rowid=?)
This is a good plan, b is larger and I want it to be in the outer loop, making use of the index in table a. It finishes well within a second.
After ANALYZE
When I execute the same query again, the query plan results in two table scans:
1|0|1|SCAN TABLE a
1|1|0|SCAN TABLE b
This is far for optimal. For some reason the query planner thinks that an outer loop of 10K rows and an inner loop of 48K rows is a better fit. This takes about 1.5 minute to complete.
Should I adapt the index in table b to make it work after ANALYZE? Anything else to change to the indexing/schema?
I just try to understand the problem here. I worked around it using a CROSS JOIN, but that feels dirty and I don't really understand why the planner would go with a plan that is orders of magnitude slower than the un-analyzed plan. It seems to be related to GROUP BY, since the query planner puts table b in the outer loop without it (but that renders the query useless for what I want).
Accidentally found the answer by adjusting the GROUP BY clause in the view definition. Instead of joining on a.id, I group on b.a instead, although they have the same values.
CREATE VIEW ab AS SELECT a.name, b.text, MAX(b.value)
FROM a
JOIN b ON b.a = a.id;
GROUP BY b.a -- <== changed this from a.id to b.a
ORDER BY a.name
I'm still not entirely sure what the difference is, since it groups the same data.

Joins are not giving the right answers

1)select count(*) from LCL_RKM_AuditForm; **O/P : 868**
2)select count(*) from RKM_KnowledgeArticleManager; **O/P : 8511**
3)select count(*) from
LCL_RKM_AuditForm A
**right** outer join
RKM_KnowledgeArticleManager B
on A.ARTICLE_ID=B.DocID; **O/P : 9216**
4)select count(*) from
LCL_RKM_AuditForm A
**left** outer join
RKM_KnowledgeArticleManager B
on A.ARTICLE_ID=B.DocID; **O/P : 1973**
5)select count(*) from
LCL_RKM_AuditForm A,RKM_KnowledgeArticleManager B
**where** A.ARTICLE_ID=B.DocID; **O/P : 1973**
My understanding is that.,.
Left outer join will Displays all the values in A table and common values in B table.
Right outer join will Displays all the values in B table and common values in A table.
What does that Common Values refers to ? If its the left outer join which means it should give only 868 results right ? And if its right outer join which means it should give only 8511 results right ?
5th statement i have used WHERE clause which means it should give me only 868 entries right ?
Please help me on this.
Your expected results appear to be based on the false assumption that there is a one-to-one mapping between rows in the two tables.
For a standard inner join (as in your last query), every matching combination of rows from the two tables is returned. Since you are getting more results than there are rows in the first table, it must be true that a given row in the first table may have multiple matching rows in the second table.
For instance, if there is one row in table A with ArticleID = 1, and two rows in Table B with DocID = 1, then a join of the two tables on these fields will produce 2 rows.
When you change to an outer join, you will get at least the same number of rows as the inner join, and potentially more. An outer join will return the same rows as the corresponding inner join; plus, for any row in the "inner" table that does not have any match in the "outer" table, it will return that one row, with NULL values for columns from the second table.
Your LEFT OUTER JOIN returns the same number of rows as the inner join; this implies that every row in table A has at least one matching row in table B.
Your RIGHT OUTER JOIN returns many more rows. This implies that there are many rows in table B that have no matching row in table A.

PL SQL - Join 2 tables and return max from right table

Trying to retrive the MAX doc in the right table.
SELECT F43.PDDOCO,
F43.PDSFXO,
F43.PDLNID,
F43.PDAREC/100 As Received,
F431.PRAREC/100,
max(F431.PRDOC)
FROM PRODDTA.F43121 F431
LEFT OUTER JOIN PRODDTA.F4311 F43
ON
F43.PDKCOO=F431.PRKCOO
AND F43.PDDOCO=F431.PRDOCO
AND F43.PDDCTO=F431.PRDCTO
AND F43.PDSFXO=F431.PRSFXO
AND F43.PDLNID=F431.PRLNID
WHERE F431.PRDOCO = 401531
and F431.PRMATC = 2
and F43.PDLNTY = 'DC'
Group by
F43.PDDOCO,
F43.PDSFXO,
F43.PDLNID,
F43.PDAREC,
F431.PRAREC/100
This query is still returning the two rows in the right table. Fairly new to SQL and struggling with the statement. Any help would be appreciated.
Without seeing your data it is difficult to tell where the problem might so I will offer a few suggestions that could help.
First, you are joining with a LEFT JOIN on the PRODDTA.F4311 but you have in the WHERE clause a filter for that table. You should move the F43.PDLNTY = 'DC' to the JOIN condition. This is causing the query to act like an INNER JOIN.
Second, you can try using a subquery to get the MAX(PRDOC) value. Then you can limit the columns that you are grouping on which could eliminate the duplicates. The query would them be similar to the following:
SELECT F43.PDDOCO,
F43.PDSFXO,
F43.PDLNID,
F43.PDAREC/100 As Received,
F431.PRAREC/100,
F431.PRDOC
FROM PRODDTA.F43121 F431
INNER JOIN
(
-- subquery to get the max
-- then group by the distinct columns
SELECT PDKCOO, max(PRDOC) MaxPRDOC
FROM PRODDTA.F43121
WHERE PRDOCO = 401531
and PRMATC = 2
GROUP BY PDKCOO
) f2
-- join the subquery result back to the PRODDTA.F43121 table
on F431.PRDOC = f2.MaxPRDOC
AND F431.PDKCOO = f2.PDKCOO
LEFT OUTER JOIN PRODDTA.F4311 F43
ON F43.PDKCOO=F431.PRKCOO
AND F43.PDDOCO=F431.PRDOCO
AND F43.PDDCTO=F431.PRDCTO
AND F43.PDSFXO=F431.PRSFXO
AND F43.PDLNID=F431.PRLNID
AND F43.PDLNTY = 'DC' -- move this filter to the join instead of the WHERE
WHERE F431.PRDOCO = 401531
and F431.PRMATC = 2
If you provide your table structures and some sample data, it will be easier to determine the issue.

Resources