Issue with joining tables in Hadoop, where the driver table has 10M records and the left-joined child tables have 1M records each

Facing an issue with joining 3 tables in Hadoop, where the leftmost (driver) table has 10M records and each right table has 1M records. The right tables are left-joined to the driver table.
SELECT DISTINCT Table1.cid, Table2.gdtyp, Table3.ager, Table3.edcd
FROM (SELECT DISTINCT(cid)
      FROM Table1
      WHERE Table1.orgcd = 'T002'
        AND (Table1.cacttrdt >= 19980101 AND Table1.cacttrdt <= 20171120)
      LIMIT 2) Table1
LEFT JOIN Table2 Table2 ON (Table2.cid = Table1.cid)
LEFT JOIN Table3 Table3 ON (Table3.cid = Table1.cid)
The above query gets stuck during the MapReduce stage.
I have set auto convert join (hive.auto.convert.join) to false.

Vectorized query execution improves the performance of operations like joins, scans, aggregations, and filters by processing them in batches of 1024 rows at a time instead of one row at a time.
Introduced in Hive 0.13, this feature significantly improves query execution time and is easily enabled with two parameter settings:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
Also, use Tez as the execution engine instead of MapReduce.
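Assuming Tez is installed and configured on the cluster, switching the engine for the current session is a single setting:
set hive.execution.engine=tez;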

Related

Oracle SQL - Does the JOIN order in a FROM clause impact performance optimization?

A long while ago I was once told during a SQL course that the JOIN order in a FROM clause of a query can impact the performance of the query. So for example if I had the following
SELECT * FROM
TABLE_1 INNER JOIN --5000 rows
TABLE_2 ON TABLE_1.COL1=TABLE_2.COL1 INNER JOIN --200 rows
TABLE_3 ON TABLE_2.COL1=TABLE_3.COL1 --50 rows
.....
This should be reordered to the following
SELECT * FROM
TABLE_3 INNER JOIN --50 rows
TABLE_2 ON TABLE_2.COL1=TABLE_3.COL1 INNER JOIN --200 rows
TABLE_1 ON TABLE_1.COL1=TABLE_2.COL1 --5000 rows
.....
So the leading/driving table is the one with the fewest rows (hypothetically). I have read, though, that unless a HINT is used to force the order, the cost-based optimizer within Oracle will just rearrange the JOINs as it sees fit.
Just curious, does the JOIN order without using HINTS matter in a SQL statement?
Exactly: a long time ago there was an impact, when the RBO (rule-based optimizer) was used.
In modern Oracle releases, the CBO (cost-based optimizer) chooses the best execution plan and does that dirty job for you, so no, you don't have to reorder tables any more.
does the JOIN order without using HINTS matter in a SQL statement?
No. That's basic optimisation for the database; the optimizer will decide what is the best strategy to join the tables, regardless of the order in which they appear in the from clause.
The Oracle optimizer component called the "Query Transformer" rewrites your query automatically, when needed, based on the available statistics, if it finds that the query can be usefully transformed.
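If you do want to pin a particular join order, for testing or as a last resort, the LEADING (or ORDERED) hint will do it; a minimal sketch against the example tables above, with no claim that it actually helps here:
SELECT /*+ LEADING(t3 t2 t1) */ *
FROM TABLE_3 t3
JOIN TABLE_2 t2 ON t2.COL1 = t3.COL1
JOIN TABLE_1 t1 ON t1.COL1 = t2.COL1;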

Partition pruning issue

I’m joining 2 tables. Pruning is happening on table 1 but not on table 2 even though there is an outer join.
Example:
select *
from table1 t1, table2 t2
where t1.sk in (select sk from filter_table)
and t2.sk(+) = t1.sk
When I checked the plan, I noticed that T1 gets a KEY partition scan, but T2 scans all the partitions (~4500), so the query takes more than 4 hours just to pull 50 records.
Is there any way to force the pruning on table 2 as well?
I am using Oracle 11g.
Without more data it is hard to say for sure what the problem is. I have rewritten the query for clarity, and with a simple test schema I get pruning for both tables with Oracle 12c (I don't have 11g handy): the first with KEY and the second with a Bloom filter (:BF0000 in the plan).
select t1.*, t2.*
from filter_table ft
join table1 t1 on t1.sk = ft.sk
left outer join table2 t2 on t2.sk = ft.sk;
Be sure to gather statistics for all three tables! Often when the optimizer seems to be stupid it is because the statistics are missing or not up to date.
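For example, a quick way to refresh them with DBMS_STATS (a minimal sketch; the owner defaults to the current schema here):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE1');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE2');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'FILTER_TABLE');
END;
/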

Update delta records in hive table

I have a table with history data which is more than a TB in size, and I receive delta (updated) records on a daily basis, a few GB at a time, stored in a delta table. Now I want to compare the delta records with the history records and update the history table with the latest data from the delta table.
What is the best approach to do this in Hive, since I would be dealing with millions of rows? I have searched the web and found the approach below.
http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive
But I don't think it would be the best approach in terms of performance.
In recent Hive (0.14+), you can do updates. You need to keep the table in ORC format and bucketed by the key you search on.
Oh, and I need to add this link for more information:
Hive Transactions
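As a minimal sketch of what such a table could look like, assuming Hive 0.14+ with the transaction manager available (the table name, columns, and bucket count below are illustrative, not from the question):
-- session settings commonly needed for ACID operations
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.enforce.bucketing=true;
-- ORC, bucketed on the lookup key, and marked transactional
create table history_acid (
  id   bigint,
  col1 string,
  col2 string
)
clustered by (id) into 16 buckets
stored as orc
tblproperties ('transactional'='true');
-- apply one delta record
update history_acid set col1 = 'new_value' where id = 42;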
In addition:
Do you have a good partitioning key, so that the updates only have to touch the latest partitions? If so, it can be good to do the following:
get the data from the required partitions into a temp table (T1)
let's say T2 is the new table with the update records; it needs to be partitioned the same way as T1
join T1 and T2 on the key(s) and take only the rows present in T1 and not in T2; let's say this table is T3
union T2 and T3 to create table T4
drop the previously taken partitions from T1
insert T4 into T1
Remember, the operations may not be atomic, so while steps 5 and 6 are running, any query on T1 can see intermediate results. A rough HiveQL sketch of these steps is shown below.
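Here is that sketch, assuming a history table partitioned by a date column dt, a delta table with the same column layout, and a join key id (all of these names are illustrative assumptions):
-- step 1: copy the affected partitions of the history table into a temp table
create table t1 as
select * from history where dt in ('2017-11-19', '2017-11-20');
-- step 2: the delta table plays the role of T2
-- step 3: keep only the history rows whose key does not appear in the delta
create table t3 as
select t1.*
from t1
left join delta t2 on t1.id = t2.id
where t2.id is null;
-- step 4: union the delta with the untouched history rows
create table t4 as
select * from (
  select * from delta
  union all
  select * from t3
) u;
-- steps 5 and 6: rebuild the affected partitions of the history table
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
alter table history drop if exists partition (dt='2017-11-19'), partition (dt='2017-11-20');
insert into table history partition (dt)
select * from t4;  -- assumes dt is the last column, as dynamic partitioning requires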

Joins on two large tables using UDF in Hive - performance is too slow

I have two tables in Hive. One has around 2 million records and the other has 14 million records. I am joining these two tables, and I am also applying a UDF in the WHERE clause. The JOIN operation is taking too much time.
I have tried running the query many times, but it runs for around 2 hours with the reducers stuck at 70%, and after that I get the exception "java.io.IOException: No space left on device" and the job gets killed.
I have tried to set the parameters as below:
set mapreduce.task.io.sort.mb=256;
set mapreduce.task.io.sort.factor=100;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.child.java.opts=-Xmx1024m;
My Query -
insert overwrite table output
select col1, col2, name1, name2, col3, col4, t.zip, t.state
from table1 m
join table2 t on (t.state = m.state and t.zip = m.zip)
where matchStrings(concat(name1, '|', name2)) >= 0.9;
The above query takes 8 mappers and 2 reducers.
Can someone please suggest what I should do to improve performance?
That exception probably indicates that you do not have enough space in the cluster for the temporary files created by the query you are running. You should try adding more disk space to the cluster, or reducing the number of rows that are joined by using a subquery to first filter the rows from each table.
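As a rough sketch of the second suggestion, the rows and columns from each side can be trimmed in subqueries before the join runs, so the shuffle and temporary files stay smaller (the column lists and extra predicates here are illustrative assumptions, not taken from the original query):
insert overwrite table output
select m.col1, m.col2, m.name1, m.name2, m.col3, m.col4, t.zip, t.state
from (
  -- keep only the columns and rows actually needed from table1
  select col1, col2, name1, name2, col3, col4, state, zip
  from table1
  where state is not null and zip is not null
) m
join (
  -- same idea for table2
  select state, zip
  from table2
  where state is not null and zip is not null
) t on (t.state = m.state and t.zip = m.zip)
where matchStrings(concat(m.name1, '|', m.name2)) >= 0.9;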

Wrong index is chosen by Oracle

I have a problem with indexing in Oracle. I will try to explain my problem with an example, as follows.
I have a table TABLE1 with columns A,B,C,D
another table TABLE2 with columns A,B,C,E,F,H
I have created Indexes for TABLE1
IX_1 A
IX_2 A,B
IX_3 A,C
IX_4 A,B,C
I have created Indexes for TABLE2
IY_1 A,B,C
IY_2 A
When I run a query similar to this:
SELECT * FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
When I check the explain plan, I see it is not using IX_1 or IY_2.
It is using IX_4 and IY_1.
Why is it not picking the right index?
EDITED:
Can anyone help me understand the difference between INDEX RANGE SCAN, INDEX UNIQUE SCAN, and INDEX SKIP SCAN?
I guess a SKIP SCAN is when Oracle skips a leading column of a composite index;
about the others I have no idea!
The main benefit of indexes is that you can select a few rows from a table without scanning the entire table.
If you ask for too many rows (let's say 30%, but it depends on many things), the engine will prefer to scan the entire table for those rows.
That's because reading a row through an index carries an overhead: reading some index blocks, and after that reading table blocks.
In your case, in order to join tables T1 and T2, Oracle needs all the rows from those tables. Reading the (full) index would be a useless operation, adding unnecessary cost.
UPDATE: A step forward: if you run:
SELECT T1.B, T2.B FROM TABLE1 T1,TABLE2 T2
WHERE T1.A=T2.A
Oracle will probably use the indexes (IX_2 and IY_1), because it does not need to read anything from the tables: the values of T1.B and T2.B are already present in those indexes.
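To see which access path is actually chosen, you can check the plan yourself; a minimal sketch using EXPLAIN PLAN and DBMS_XPLAN:
EXPLAIN PLAN FOR
SELECT T1.B, T2.B
FROM TABLE1 T1, TABLE2 T2
WHERE T1.A = T2.A;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);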
