How to optimize big table read and outer join in Pig - Hadoop

I am joining a big table with 3 other tables,
A = join smallTable by (f1,f2) RIGHT OUTER , massiveTable by (f1,f2) ;
B = join AnotherSmall by (f3) RIGHT OUTER , A by (f3) ;
C = join AnotherSmall by (f4) , B by (f4) ;
The small tables may not fit in memory, and this plan forces the billion-record big table to be read three times, which is very time consuming. Is there any way the rereading can be avoided and the process made more efficient?
Thanks in advance.

If you design your big table in HBase with three column families, i.e. keeping f1 and f2 in one family, f3 in another and f4 in a third, you should be able to avoid the unnecessary reads.
Also, if you think about it, you don't re-read the whole table but rather read a different part of each record every time: first f1 and f2, then f3, and finally f4.
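As a rough sketch of that idea (the HBase table layout and column family names here are assumptions, not from the question), each Pig load can be restricted to just the family a given join needs via HBaseStorage, for example:
massive_f12 = LOAD 'hbase://massiveTable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf12:f1 cf12:f2', '-loadKey true')
    AS (rowkey:chararray, f1:chararray, f2:chararray);
-- only the cf12 column family is scanned here
A = join smallTable by (f1,f2) RIGHT OUTER, massive_f12 by (f1,f2);
The later steps could load 'cf3:f3' and 'cf4:f4' the same way (joining back on the row key), so each scan reads only a slice of each row instead of the whole record.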

Related

Force partition pruning on Oracle

I have a query similar to this
select *
from small_table A
inner join huge_table B on A.DATE =B.DATE
The huge_table is partitioned by DATE, and the PK is DATE, some_id and some_other_id (so the join is not done via the PK index).
small_table just contains a few dates.
The total runtime of the SQL is 48 minutes.
For some reason the explain plan gives me a "PARTITION RANGE (ALL)" with high cardinality numbers. It looks like it accesses the full table, not just the partitions indicated by small_table.DATE.
If I put the SQL inside a loop and do
for o in (select date from small_table)
loop
  select *
  from small_table A
  inner join huge_table B on A.DATE = B.DATE
  where B.DATE = o.DATE;
end loop;
The full loop takes only 2 minutes 40 seconds.
Is there any way to force partition pruning on Oracle 12c?
Additional info:
small_table has 37 records for 13 different dates. huge_table has 8,000 million records with 179 dates/partitions. The SQL needs one field from small_table, but I can tweak the SQL to not use it.
Update:
With the use_nl hint, the cardinality shown in the execution plan is now more accurate and the execution time drops from 48 minutes to 4 minutes.
select /*+ use_nl(B) */ *
from small_table A
inner join huge_table B on A.DATE =B.DATE
This seems like the problem:
"small_table have 37 registries for 13 different dates. huge_table has 8.000 millions of registries with 179 dates/partitions....
The SQL need one field from small_table, but I can tweak the SQL to not use it "
According to the SQL you posted, you're joining the two tables on just their DATE columns with no additional conditions. If that's really the case, you are generating a cross join in which each partition of huge_table is joined to small_table 2-3 times. So your result set may be much larger than you're expecting, which means more database effort, which means more time.
The other thing to notice is that the ratio of small_table rows to huge_table partitions is about 1:4; the optimizer doesn't know that there are really only thirteen distinct huge_table partitions in play.
Optimization ought to be a science and this is more guesswork than anything, but try this:
select B.*
from ( select /*+ cardinality(t 13) */ distinct t.date
       from small_table t ) A
inner join huge_table B
        on A.DATE = B.DATE
This should communicate to the optimizer that only a small percentage of the huge_table partitions are required, which may make it choose partition pruning. Also it removes that Cartesian product, which should improve performance too. Obviously you will need to apply that tweak you mentioned, to remove the need to query anything else from small_table.
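To check whether pruning actually happens, one option (a sketch on top of the query above, not part of the original suggestion) is to look at the Pstart/Pstop columns of the execution plan:
explain plan for
select B.*
from ( select /*+ cardinality(t 13) */ distinct t.date
       from small_table t ) A
inner join huge_table B
        on A.DATE = B.DATE;

select * from table(dbms_xplan.display);
-- Pstart/Pstop values of KEY/KEY (or specific partition numbers) instead of 1/179
-- mean the partitions are being pruned rather than all scanned.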

Hive - optimize multiple table joins

I need to join multiple tables in a single query and then overwrite another table.
Focus/Driver table: FACT (huge, bucketed on ID)
Join Table 1: T1 (big, but smaller than FACT, bucketed on ID and joined with FACT on FACT.ID)
Join Table 2: T2 (big, but smaller than T1 and joined with FACT on FACT.ID2)
Join Table 3: T3 (reference table, small enough to fit into memory, joined to FACT)
Join Table 4: T4 (reference table, small enough to fit into memory, joined to FACT)
Join Table 5: T5 (reference table, small enough to fit into memory, joined to FACT)
Now, I want to know what sequence of joins will achieve the best performance.
My thoughts and questions:
I want to join FACT with T1 first since both are bucketed. But is it a good idea to join the two big tables first, given that this huge joined dataset will then be joined with the smaller ones (which means more data to be moved between mappers and reducers)? Or should we join with the smaller tables first? But if we join FACT with the smaller tables first, I will not be able to perform a bucket join with T1, right (since the joined dataset will no longer be bucketed)?
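For reference, here is a rough HiveQL sketch of the kind of statement being discussed (table aliases and column names are assumptions), relying on automatic map-join conversion for the small reference tables and allowing a bucket map join for the FACT/T1 join:
set hive.auto.convert.join=true;       -- T3/T4/T5 are small enough to be converted to map joins
set hive.optimize.bucketmapjoin=true;  -- allow a bucket map join when both sides are bucketed on the join key

insert overwrite table target_table
select f.*, t1.attr1, t2.attr2, t3.attr3, t4.attr4, t5.attr5
from fact f
join t1 on f.id  = t1.id      -- big-to-big, both bucketed on ID
join t2 on f.id2 = t2.id
join t3 on f.key3 = t3.key
join t4 on f.key4 = t4.key
join t5 on f.key5 = t5.key;
Whether the FACT/T1 join actually executes as a bucket map join still depends on the bucket counts and on T1's buckets fitting in memory per mapper, so treat this only as a sketch of the query shape.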

Left Join Vs Inner Join in Hive -- internals and performance on multiple joins, map joins

Does anyone know if there is a difference in performance for left join vs inner join in Hive, with Map Join enabled via hive.auto.convert.join=True?
The reason I ask, per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-JoinOptimization
Outer joins offer more challenges. Since a map-join operator can only
stream one table, the streamed table needs to be the one from which
all of the rows are required. For the left outer join, this is the
table on the left side of the join; for the right outer join, the
table on the right side, etc. This means that even though an inner
join can be converted to a map-join, an outer join cannot be
converted. An outer join can only be converted if the table(s) apart
from the one that needs to be streamed can be fit in the size
configuration.
It seems like this is saying either (a) an outer join can't be converted to a map join at all, or (b) it can only be converted if the table(s) other than the one that needs to be streamed, i.e. the "left joined" lookup table(s), fit within the size configuration. Does anyone know which one it is?
Also, is there a difference in performance between INNER JOIN and LEFT JOIN in general in Hive, as there is in SQL? Does that difference become more pronounced (or start to exist in the first place) when several left joins are involved? The reason I ask is that I'm considering adding several dummy entries to some left-joined lookup tables to convert my joins to inner joins... intuitively it seems like it might make a difference performance-wise, but I can't find any documentation or discussion either way. Curious if anyone has experience with this.
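For context, here is a minimal sketch (table and column names are assumptions) of the case the wiki text describes: with auto conversion enabled, a LEFT OUTER JOIN against a small lookup table can still become a map join, because the big table on the left is the one that gets streamed:
set hive.auto.convert.join=true;                -- let Hive convert qualifying joins to map joins
set hive.mapjoin.smalltable.filesize=25000000;  -- size threshold (default ~25 MB) below which a table counts as "small"

select f.*, d.description
from fact_table f               -- streamed: every row of f must appear in the output
left outer join dim_lookup d    -- small enough to be built into an in-memory hash table
  on f.dim_id = d.dim_id;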

Pig Inner Join produces a Job with 1 hanging reducer

I have a Pig script that has an inner join between 2 different data sources. This join happens to be the first MapReduce-causing operation; the only operations before it are filters and foreachs. When this join is executed, everything goes through the map phase perfectly and fast, but when it comes to the reduce phase all the reducers but one finish quickly. That one just sits there at the reduce part of the phase, chugging over data at a very, very slow pace, to the point that it can take an hour or more just waiting on that one reducer to complete. I have tried increasing the number of reducers as well as switching to a skewed join, but nothing seems to help.
Any ideas for things to look into?
I also ran an explain to see if I could spot anything, but that just shows a simple single-job flow with nothing unusual.
Likely what is happening is that a single key has a huge number of instances on both sides, and the join is exploding.
For example, if you join these two relations on their first field:
(x,4)     (x,'f')
(x,5)     (x,'g')
(x,6)     (x,'h')
(y,7)     (x,'i')
you will get 12 pairs for x (the 3 x rows on the left times the 4 x rows on the right)! So you can imagine that if you have 1,000 of one key in one data set and 2,000 of the same key in the other data set, you will get 2 million pairs just from that one key. The single reducer unfortunately has to take the brunt of this explosion.
Adding reducers or using a skew join isn't going to help here, because at the end of the day, a single reducer needs to handle this one big explosion of pairs.
Here are a few things to check:
It sounds like only a single join key is causing this issue, since only one reducer is getting hammered. The common culprit is NULL. Can the join column in either of these relations be NULL? If so, it'll get a huge explosion! Try filtering out NULL on the foreign key of both relations before running through the join (see the sketch at the end of this answer) and see if there is a difference. Or, instead of NULL... perhaps you have some sort of default value or a single value that shows up a lot.
Try to figure out how many of each key there actually are, and figure out what the explosion will look like. Something like (warning: I'm not actually testing this code, hopefully it works):
A1 = LOAD ... -- load dataset 1
B1 = GROUP A1 BY fkey1;
C1 = FOREACH B1 GENERATE group AS fkey1, COUNT_STAR(A1) AS cnt1;
A2 = LOAD ... -- load dataset 2
B2 = GROUP A2 BY fkey2;
C2 = FOREACH B2 GENERATE group AS fkey2, COUNT_STAR(A2) AS cnt2;
D = JOIN C1 BY fkey1, C2 BY fkey2; -- join the per-key counts
E = FOREACH D GENERATE C1::fkey1 AS fkey1, (cnt1 * cnt2) AS cnt; -- multiply out the counts
F = ORDER E BY cnt DESC; -- largest explosions first
STORE F INTO ...
Similarly, it may have nothing to do with an explosion. One of your relations might just have a single key a ton of times. For example, in the word count example, the reducer that ends up with the word "the" is going to have a lot more counting to do than the one that gets "zebra". I don't think this is the case here since only one of your reducers is getting hammered, which is why I think #1 is probably the case.
If the counts show some huge number for one of the keys, that's why, and you also know which key is causing the issue.
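As a minimal sketch of the NULL filtering suggested in #1 (relation and field names are illustrative, matching the counting script above):
A1_clean = FILTER A1 BY fkey1 IS NOT NULL;   -- drop rows whose join key is NULL
A2_clean = FILTER A2 BY fkey2 IS NOT NULL;
J = JOIN A1_clean BY fkey1, A2_clean BY fkey2;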

Oracle always uses HASH JOIN even when both tables are huge?

My understanding is that a HASH JOIN only makes sense when one of the 2 tables is small enough to fit into memory as a hash table.
But when I gave Oracle a query with both tables having several hundred million rows, Oracle still came up with a hash join explain plan. Even when I tried to trick it with OPT_ESTIMATE(rows = ....) hints, it always decides to use a HASH JOIN instead of a merge sort join.
So I wonder: how is a HASH JOIN possible when both tables are very large?
thanks
Yang
Hash joins obviously work best when everything can fit in memory. But that does not mean they are not still the best join method when the table can't fit in memory. I think the only other realistic join method is a merge sort join.
If the hash table can't fit in memory, then the sorts for the merge sort join won't fit in memory either. And the merge join needs to sort both tables. In my experience, hashing is always faster than sorting, for joining and for grouping.
But there are some exceptions. From the Oracle® Database Performance Tuning Guide, The Query Optimizer:
Hash joins generally perform better than sort merge joins. However,
sort merge joins can perform better than hash joins if both of the
following conditions exist:
The row sources are sorted already.
A sort operation does not have to be done.
Test
Instead of creating hundreds of millions of rows, it's easier to force Oracle to only use a very small amount of memory.
Charting the results shows that hash joins outperform merge joins, even when the tables are too large to fit in (artificially limited) memory.
Notes
For performance tuning it's usually better to use bytes than number of rows. But the "real" size of the table is a difficult thing to measure, which is why the chart displays rows. The sizes go approximately from 0.375 MB up to 14 MB. To double-check that these queries are really writing to disk you can run them with /*+ gather_plan_statistics */ and then query v$sql_plan_statistics_all.
I only tested hash joins vs merge sort joins. I didn't fully test nested loops because that join method is always incredibly slow with large amounts of data. As a sanity check, I did compare it once with the last data size, and it took at least several minutes before I killed it.
I also tested with different _area_sizes, ordered and unordered data, and different distinctness of the join column (more matches is more CPU-bound, less matches is more IO bound), and got relatively similar results.
However, the results were different when the amount of memory was ridiculously small. With only 32K sort|hash_area_size, merge sort join was significantly faster. But if you have so little memory you probably have more significant problems to worry about.
There are still many other variables to consider, such as parallelism, hardware, bloom filters, etc. People have probably written books on this subject, I haven't tested even a small fraction of the possibilities. But hopefully this is enough to confirm the general consensus that hash joins are best for large data.
Code
Below are the scripts I used:
--Drop objects if they already exist
drop table test_10k_rows purge;
drop table test1 purge;
drop table test2 purge;
--Create a small table to hold rows to be added.
--("connect by" would run out of memory later when _area_sizes are small.)
--VARIABLE: More or less distinct values can change results. Changing
--"level" to something like "mod(level,100)" will result in more joins, which
--seems to favor hash joins even more.
create table test_10k_rows(a number, b number, c number, d number, e number);
insert /*+ append */ into test_10k_rows
select level a, 12345 b, 12345 c, 12345 d, 12345 e
from dual connect by level <= 10000;
commit;
--Restrict memory size to simulate running out of memory.
alter session set workarea_size_policy=manual;
--1 MB for hashing and sorting
--VARIABLE: Changing this may change the results. Setting it very low,
--such as 32K, will make merge sort joins faster.
alter session set hash_area_size = 1048576;
alter session set sort_area_size = 1048576;
--Tables to be joined
create table test1(a number, b number, c number, d number, e number);
create table test2(a number, b number, c number, d number, e number);
--Type to hold results
create or replace type number_table is table of number;
set serveroutput on;
--
--Compare hash and merge joins for different data sizes.
--
declare
v_hash_seconds number_table := number_table();
v_average_hash_seconds number;
v_merge_seconds number_table := number_table();
v_average_merge_seconds number;
v_size_in_mb number;
v_rows number;
v_begin_time number;
v_throwaway number;
--Increase the size of the table this many times
c_number_of_steps number := 40;
--Join the tables this many times
c_number_of_tests number := 5;
begin
--Clear existing data
execute immediate 'truncate table test1';
execute immediate 'truncate table test2';
--Print headings. Use tabs for easy import into spreadsheet.
dbms_output.put_line('Rows'||chr(9)||'Size in MB'
||chr(9)||'Hash'||chr(9)||'Merge');
--Run the test for many different steps
for i in 1 .. c_number_of_steps loop
v_hash_seconds.delete;
v_merge_seconds.delete;
--Add about 0.375 MB of data (roughly - depends on lots of factors)
--The order by will store the data randomly.
insert /*+ append */ into test1
select * from test_10k_rows order by dbms_random.value;
insert /*+ append */ into test2
select * from test_10k_rows order by dbms_random.value;
commit;
--Get the new size
--(Sizes may not increment uniformly)
select bytes/1024/1024 into v_size_in_mb
from user_segments where segment_name = 'TEST1';
--Get the rows. (select from both tables so they are equally cached)
select count(*) into v_rows from test1;
select count(*) into v_rows from test2;
--Perform the joins several times
for i in 1 .. c_number_of_tests loop
--Hash join
v_begin_time := dbms_utility.get_time;
select /*+ use_hash(test1 test2) */ count(*) into v_throwaway
from test1 join test2 on test1.a = test2.a;
v_hash_seconds.extend;
v_hash_seconds(i) := (dbms_utility.get_time - v_begin_time) / 100;
--Merge join
v_begin_time := dbms_utility.get_time;
select /*+ use_merge(test1 test2) */ count(*) into v_throwaway
from test1 join test2 on test1.a = test2.a;
v_merge_seconds.extend;
v_merge_seconds(i) := (dbms_utility.get_time - v_begin_time) / 100;
end loop;
--Get average times. Throw out first and last result.
select ( sum(column_value) - max(column_value) - min(column_value) )
/ (count(*) - 2)
into v_average_hash_seconds
from table(v_hash_seconds);
select ( sum(column_value) - max(column_value) - min(column_value) )
/ (count(*) - 2)
into v_average_merge_seconds
from table(v_merge_seconds);
--Display size and times
dbms_output.put_line(v_rows||chr(9)||v_size_in_mb||chr(9)
||v_average_hash_seconds||chr(9)||v_average_merge_seconds);
end loop;
end;
/
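As a concrete version of the double-check mentioned in the notes (a sketch reusing the test1/test2 tables from the script above), run one of the test queries with gather_plan_statistics and then read the actual memory and temp usage from the cursor's plan statistics, which is what v$sql_plan_statistics_all exposes:
select /*+ gather_plan_statistics use_hash(test1 test2) */ count(*)
from test1 join test2 on test1.a = test2.a;

-- 'ALLSTATS LAST' adds columns such as OMem, 1Mem and Used-Mem to the plan output,
-- so you can see whether the hash join ran optimal, one-pass or multi-pass.
select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));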
So I wonder: how is a HASH JOIN possible when both tables are very large?
It would be done in multiple passes: the driven table is read and hashed in chunks, the leading table is scanned several times.
This means that with limited memory a hash join scales at O(N^2) while a merge join scales at O(N) (with no sorting needed, of course), and on really large tables a merge join outperforms a hash join. However, the tables should be really large so that the benefits of a single read outweigh the drawbacks of non-sequential access, and you would need all the data from them (usually aggregated).
Given the RAM sizes on modern servers, we are talking about really large reports on really large databases which take hours to build, not something you would really see in everyday life.
A MERGE JOIN may also be useful when the output recordset is limited with rownum < N. But this means that the joined inputs should already be sorted, which means they should both be indexed, which means NESTED LOOPS is available too, and that is what is usually chosen by the optimizer, since it is more efficient when the join condition is selective.
With their current implementations, MERGE JOIN always scans and NESTED LOOPS always seeks, while a smarter combination of both methods (backed up by statistics) would be preferable.
You may want to read this article in my blog:
Things SQL needs: MERGE JOIN that would seek
A hash join does not have to fit the whole table into memory, but only the rows which match the where conditions of that table (or even only a hash + the rowid - I'm not sure about that).
So when Oracle decides that the selectivity of the part of the where conditions affecting one of the tables is good enough (i.e. few rows will have to be hashed), it might prefer a hash join even for very large tables.
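As a hypothetical illustration of that point (table names, columns and the date filter are made up, and it assumes the optimizer builds the hash table on the filtered input):
select /*+ use_hash(o s) */ o.order_id, s.ship_date
from orders o                              -- hundreds of millions of rows
join shipments s on s.order_id = o.order_id
where s.ship_date >= date '2024-01-01'     -- only rows passing this filter need to be hashed
Even though both tables are huge, the hash table only has to hold the shipments rows that survive the filter, which is why a hash join can still be reasonable.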
