Hive - optimize multiple table joins

I need to join multiple tables in a single query and then overwrite another table.
Focus/Driver table: FACT (huge, bucketed on ID)
Join Table 1: T1 (big, but smaller than FACT, bucketed on ID and joined with FACT on FACT.ID)
Join Table 2: T2 (big, but smaller than T1 and joined with FACT on FACT.ID2)
Join Table 3: T3 (reference table, small enough to fit into memory, joined to FACT)
Join Table 4: T4 (reference table, small enough to fit into memory, joined to FACT)
Join Table 5: T5 (reference table, small enough to fit into memory, joined to FACT)
I would like to know the order in which these tables should be joined to get the best performance.
My thoughts and questions:
I want to join FACT with T1 first, since both are bucketed on ID. But is it a good idea to join the two big tables first? The huge joined dataset would then be joined with the smaller tables, which means more data moved between mappers and reducers. Or should I join the smaller tables first? If I join FACT with the smaller tables first, though, I will not be able to perform a bucket join with T1, right (since the joined dataset will not be bucketed)?
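
One way to explore the ordering question is to let Hive convert the small-table joins to map joins and then compare plans with EXPLAIN. The sketch below is only a starting point under assumptions, not a tuned answer: the target table TARGET, the selected column, and the join keys K3, K4 and K5 are hypothetical names, since the question does not give them. A bucket map join also assumes that the bucket counts of FACT and T1 are multiples of one another.

```sql
-- Let Hive turn joins against small tables into map-side joins automatically.
SET hive.auto.convert.join=true;
-- Allow a bucket map join when both sides are bucketed on the join key.
SET hive.optimize.bucketmapjoin=true;

-- Inspect the plan before running the real job.
-- TARGET and the K3/K4/K5 columns are hypothetical placeholders.
EXPLAIN
INSERT OVERWRITE TABLE TARGET
SELECT f.ID
FROM FACT f
JOIN T1 t1 ON f.ID  = t1.ID    -- both sides bucketed on ID
JOIN T2 t2 ON f.ID2 = t2.ID2   -- big table, joined on a different key
JOIN T3 t3 ON f.K3  = t3.K3    -- small reference tables, map join candidates
JOIN T4 t4 ON f.K4  = t4.K4
JOIN T5 t5 ON f.K5  = t5.K5;
```

Map joins load the small table into memory in each mapper, so they add no shuffle of their own. Comparing the EXPLAIN output for different join orders shows which joins become Map Join Operator stages and how many shuffle stages remain.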

Related

Force partition pruning on Oracle

I have a query similar to this
select *
from small_table A
inner join huge_table B on A.DATE =B.DATE
The huge_table is partitioned by DATE, and the PK is DATE, some_id and some_other_id (so the join is not done via the PK index).
small_table contains just a few dates.
The query takes 48 minutes in total.
For some reason the explain plan gives me a "PARTITION RANGE (ALL)" with high cardinality numbers. It looks like it accesses the full table, not just the partitions indicated by small_table.DATE.
If I put the SQL inside a loop and do
for o in (select date from small_table)
loop
select *
from small_table A
inner join huge_table B on A.DATE =B.DATE
where B.DATE=O.DATE
end loop;
The full loop takes only 2 minutes 40 seconds.
Is there any way to force partition pruning on Oracle 12c?
Additional info:
small_table has 37 records for 13 different dates. huge_table has 8,000 million records with 179 dates/partitions. The SQL needs one field from small_table, but I can tweak the SQL to not use it.
Update:
With the use_nl hint, the cardinality shown in the execution plan is now more accurate, and the execution time drops from 48 minutes to 4 minutes.
select /*+ use_nl(B) */ *
from small_table A
inner join huge_table B on A.DATE =B.DATE

This seems like the problem:
"small_table have 37 registries for 13 different dates. huge_table has 8.000 millions of registries with 179 dates/partitions....
The SQL need one field from small_table, but I can tweak the SQL to not use it "
According to the SQL you posted, you're joining the two tables on just their DATE columns with no additional conditions. If that's really the case, you are generating a partial Cartesian product: with 37 rows spread over 13 dates, every row in a matching huge_table partition is joined to small_table 2-3 times. So your result set may be much larger than you're expecting, which means more database effort, which means more time.
The other thing to notice is that the ratio of small_table rows to huge_table partitions is about 1:5 (37 rows to 179 partitions); the optimizer doesn't know that there are really only thirteen distinct huge_table partitions in play.
Optimization ought to be a science, and this is more guesswork than anything, but try this:
select B.*
from ( select /*+ cardinality(t 13) */
distinct t.date
from small_table t ) A
inner join huge_table B
on A.DATE =B.DATE
This should communicate to the optimizer that only a small percentage of the huge_table partitions are required, which may make it choose partition pruning. Also it removes that Cartesian product, which should improve performance too. Obviously you will need to apply that tweak you mentioned, to remove the need to query anything else from small_table.
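
To check whether pruning actually happens, the plan can be inspected before running the query. This is a minimal sketch using the standard EXPLAIN PLAN statement and DBMS_XPLAN.DISPLAY. DATE is the placeholder column name from the question (it is a reserved word in Oracle, so substitute the real column name).

```sql
-- Generate the plan for the rewritten query without executing it.
-- DATE is the question's placeholder column name; use the real one.
EXPLAIN PLAN FOR
select B.*
from ( select /*+ cardinality(t 13) */
              distinct t.date
       from small_table t ) A
inner join huge_table B
        on A.DATE = B.DATE;

-- Show the plan, including the Pstart/Pstop partition columns.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

In the output, a PARTITION RANGE ALL step with Pstart 1 and Pstop 179 means every partition is scanned. KEY values in Pstart/Pstop (for example under PARTITION RANGE ITERATOR or PARTITION RANGE SUBQUERY) indicate that the partitions are chosen at run time.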

what is skewed column in Oracle

I found a bottleneck in my query, which selects data from a single table but takes a long time. I use a non-unique index on the two columns that appear in the WHERE clause.
select name ,isComplete from Student where year='2015' and isComplete='F'
I then came across the concept of a skewed column on the internet. What is it?
If you have an idea, please help.
How do I resolve the problem of a skewed column?
And how does a skewed column affect the performance of a query?

Skewed columns are columns in which the data is not evenly distributed among the rows.
For example, suppose:
You have a table order_lines with 100,000,000 rows
The table has a column named customer_id
You have 1,000,000 distinct customers
Some (very large) customers can have hundreds of thousands or millions of order lines.
In the above example, the data in order_lines.customer_id is skewed. On average, you'd expect each distinct customer_id to have 100 order lines (100 million rows divided by 1 million distinct customers). But some large customers have many, many more than 100 order lines.
This hurts performance because Oracle bases its execution plan on statistics. So, statistically speaking, Oracle thinks it can access order_lines based on a non-unique index on customer_id and get only 100 records back, which it might then join to another table or whatever using a NESTED LOOP operation.
But, then when it actually gets 1,000,000 order lines for a particular customer, the index access and nested loop join are hideously slow. It would have been far better for Oracle to do a full table scan and hash join to the other table.
So, when there is skewed data, the optimal access plan depends on which particular customer you are selecting!
Oracle lets you avoid this problem by optionally gathering "histograms" on columns, so Oracle knows which values have lots of rows and which have only a few. That gives the Oracle optimizer the information it needs to generate the best plan in most cases.
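
As a rough sketch of the idea above, using the example's order_lines table and customer_id column: first look at how uneven the distribution is, then gather statistics with a histogram on that column through DBMS_STATS. The bucket count of 254 and the use of the current schema (USER) are assumptions made for illustration.

```sql
-- 1. Check for skew: the largest customers by number of order lines
--    (FETCH FIRST is Oracle 12c row-limiting syntax).
SELECT customer_id, COUNT(*) AS line_count
FROM order_lines
GROUP BY customer_id
ORDER BY line_count DESC
FETCH FIRST 10 ROWS ONLY;

-- 2. Gather statistics with a histogram on the skewed column.
--    SIZE 254 requests up to 254 histogram buckets.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'ORDER_LINES',
    method_opt => 'FOR COLUMNS CUSTOMER_ID SIZE 254'
  );
END;
/

-- 3. Confirm that a histogram now exists on the column.
SELECT column_name, num_distinct, histogram
FROM user_tab_col_statistics
WHERE table_name = 'ORDER_LINES'
  AND column_name = 'CUSTOMER_ID';
```

If the last query reports HISTOGRAM as NONE, the optimizer is still assuming an even distribution for that column.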

Whether the optimizer chooses a full table scan or an index scan depends on how skewed the column is.
A skewed column is simply one whose values are unevenly spread, for example a gender column that contains 60 male and 40 female values.

Left Join Vs Inner Join in Hive -- internals and performance on multiple joins, map joins

Does anyone know if there is a difference in performance for left join vs inner join in Hive, with Map Join enabled via hive.auto.convert.join=True?
The reason I ask, per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization#LanguageManualJoinOptimization-JoinOptimization
Outer joins offer more challenges. Since a map-join operator can only
stream one table, the streamed table needs to be the one from which
all of the rows are required. For the left outer join, this is the
table on the left side of the join; for the right outer join, the
table on the right side, etc. This means that even though an inner
join can be converted to a map-join, an outer join cannot be
converted. An outer join can only be converted if the table(s) apart
from the one that needs to be streamed can be fit in the size
configuration.
This seems to say both (a) that an outer join can't be converted to a map join at all and (b) that it can be converted only if the table that doesn't need to be streamed is the "left join" table(s). Does anyone know which one it is?
Also, is there a difference in performance for INNER JOIN vs LEFT JOIN in general, in Hive, as there is in SQL? Does that difference become more magnified (and/or start to exist in the first place) when several left joins are involved? The reason I ask is that I'm considering adding several dummy entries to some left-joined lookup tables to convert my joins to inner joins. Intuitively it seems like it might make a difference performance-wise, but I can't find any documentation or discussion either way. Curious if anyone has experience with this.
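
One way to test this directly is to compare the plans Hive produces for the two join types. The sketch below assumes hypothetical tables big_fact and small_lookup (with hypothetical columns), where small_lookup is below the map-join size threshold (hive.mapjoin.smalltable.filesize).

```sql
SET hive.auto.convert.join=true;

-- LEFT JOIN: big_fact is the side whose rows are all required, so it is
-- the streamed table; small_lookup is the candidate to hold in memory.
EXPLAIN
SELECT f.id, l.label
FROM big_fact f
LEFT JOIN small_lookup l ON f.lookup_id = l.id;

-- The same query as an inner join, for comparison of the two plans.
EXPLAIN
SELECT f.id, l.label
FROM big_fact f
JOIN small_lookup l ON f.lookup_id = l.id;
```

If both plans contain a Map Join Operator, the conversion was applied in both cases, which would match reading (b) of the quoted passage.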

oracle partitioning on columns frequently used in joins and where conditions

The customer table contains 9.5 million records. The customer_id column is the primary key. The database is Oracle.
Questions:
1) Should the table contain main partitions or sub-partitions? How do I decide?
Also, I don't think indexing columnA or columnB will help here because of the type of data.
TableA.columnA (varchar) has more than 80% of the records for columnA values 5,6,7. The columnA has values from 1 to 7 only.
TableA.columnB (varchar) has 90% of the records for columnB value = 102. The columnB has values from 1 to 999.
Moreover, the typical queries are (in no particular order):
Query1: where tableA.columnA = values
Query2: where tableA.columnB = values
Query3: where tableA.columnA = values AND/OR tableA.columnB = values
2) When we create sub-partitions, what happens if the query only contains a where clause for sub-partition column? Does the query execution go directly to sub-partition or through main partition?
3) The join contains tableA.partitioned_column = tableB.indexed_column
(eg. customer_Table.branch_code = branch_table.branch_code)
Does partitioning help in the case of JOIN? Will it improve performance?

1) It's very difficult to answer without knowing the table structure, the way it's usually used, etc. But generally, for big tables, partitioning is very often a necessity.
2) If you do not specify the partition, Oracle will have to look through all partitions to find where the subpartition is (which is not very slow), and then use partition pruning on the subpartition. It will still be significantly faster than not having subpartitions at all. But the best situation is to refer to both the partition and the subpartition in the WHERE clause.
3) I'm 99% sure it will help, because Oracle can use partition pruning to get the needed rows from tableA at once. You will be 100% sure if you check the query plan. But the best situation is when both columns are partition keys.
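
To make point 2 concrete, here is a hedged sketch of a composite (list-list) partitioned table. The column sizes, the partition and subpartition names, and the choice of columnA as the partition key and columnB as the subpartition key are all assumptions for illustration, not a recommendation for this particular table.

```sql
-- Hypothetical layout: list partitions on columnA, list subpartitions on columnB.
CREATE TABLE tableA (
  customer_id NUMBER PRIMARY KEY,
  columnA     VARCHAR2(1),
  columnB     VARCHAR2(3)
)
PARTITION BY LIST (columnA)
SUBPARTITION BY LIST (columnB)
SUBPARTITION TEMPLATE (
  SUBPARTITION sp_102   VALUES ('102'),
  SUBPARTITION sp_other VALUES (DEFAULT)
)
(
  PARTITION p_low VALUES ('1', '2', '3', '4'),
  PARTITION p_5   VALUES ('5'),
  PARTITION p_6   VALUES ('6'),
  PARTITION p_7   VALUES ('7')
);

-- Filter on the subpartition key only: every partition should be visited,
-- but only the matching subpartition inside each one.
EXPLAIN PLAN FOR
SELECT * FROM tableA WHERE columnB = '102';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

The Pstart/Pstop columns of the plan output show which partitions and subpartitions are actually accessed.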

If 80-90% of the rows have the same values in these columns and those are the most frequently queried values, then partitioning will help only somewhat. You would be pruning 10-20% of the data during these queries, so you probably want to find another way for Oracle to home in on the data your query needs (dates, perhaps?).
The value distribution in your two columns also brings up the point of statistics and making sure they are being gathered properly (with histograms to describe the skew in these columns).
As @psur points out, without knowing the details of your system it's hard to give concrete suggestions.

Skewed tables in Hive

I am learning Hive and came across skewed tables. Please help me understand them.
What are skewed tables in Hive?
How do we create skewed tables?
How do they affect performance?

What are skewed tables in Hive?
A skewed table is a special type of table in which the values that appear very often (heavy skew) are split out into separate files, and the rest of the values go to some other file.
How do we create skewed tables?
create table <T> (schema) skewed by (keys) on ('value1', 'value2') [STORED as DIRECTORIES];
Example :
create table T (c1 string, c2 string) skewed by (c1) on ('x1')
How does it affect performance?
By specifying the skewed values Hive will split those out into separate files automatically and take this fact into account during queries so that it can skip (or include) whole files if possible thus enhancing the performance.
EDIT :
x1 is actually the value on which column c1 is skewed. You can have multiple such values for multiple columns. For example,
create table T (c1 string, c2 string) skewed by (c1) on ('x1', 'x2', 'x3')
The advantage of such a setup is that the values that appear more frequently than others get split out into separate files (or separate directories if we use the STORED AS DIRECTORIES clause). This information is used by the execution engine during query execution to make processing more efficient.
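
A short sketch of the variants described above, in HiveQL. The table name T_multi and the values 'y1' and 'y2' are hypothetical and added only for illustration.

```sql
-- Single skewed column, with the heavy values stored in their own directories.
CREATE TABLE T (c1 STRING, c2 STRING)
SKEWED BY (c1) ON ('x1', 'x2', 'x3')
STORED AS DIRECTORIES;

-- Skew defined on a combination of two columns (T_multi is a hypothetical name).
CREATE TABLE T_multi (c1 STRING, c2 STRING)
SKEWED BY (c1, c2) ON (('x1', 'y1'), ('x2', 'y2'));

-- Show the table metadata, which includes the skewed columns and values.
DESCRIBE FORMATTED T;
```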

In a skewed table, a separate partition is created for each column value that has many records, and the rest of the data goes into another partition. As a result, the number of partitions, mappers and intermediate files is reduced.
For example, out of 100 patients, 90 have high BP and the other 10 have fever, cold, cancer, etc. One partition is created for the 90 patients and one for the other 10.
I hope this will answer your question.
