Re-writing a join query - hadoop

I have a question concerning Hive. Let me explain to you the scenario :
I am using a Hive action on Oozie; I have a query which is doing
succesive LEFT JOIN on different tables;
Total number of rows to be inserted is about 35 million;
First, the job was crashing due to lack of memory, so I set "set hive.auto.convert.join=false" the query was perfectly executed but it took 4 hours to be done;
I tried to rewrite the order of LEFT JOINs putting large tables at the end, but same result, about 4 hours to be executed;
Here is what the query look like:
INSERT OVERWRITE TABLE final_table
SELECT
T1.Id,
T1.some_field_name,
T1.another_filed_name,
T2.also_another_filed_name,
FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table
So, knowing the structure of the query is there a way to rewrite it so that I can avoid too many JOINs ?
Thanks in advance
PS: Even vectorization gave me the same timing

Too long for a comment, will be deleted later.
(1) Your current query won't compile.
(2) You are not selecting anything from T3 and T4, which makes no sense.
(3) Changing the order of tables is not likely to have any impact with cost based optimizer.
(4) Basically I would suggest to collect statistics on the tables, specifically on the id columns, but in your case I got a feeling that id is not unique in more than 1 table.
Add to your post the result of the following query:
select *
, case when cnt_1 = 0 then 1 else cnt_1 end
* case when cnt_2 = 0 then 1 else cnt_2 end
* case when cnt_3 = 0 then 1 else cnt_3 end
* case when cnt_4 = 0 then 1 else cnt_4 end as product
from (select id
,count(case when tab = 1 then 1 end) as cnt_1
,count(case when tab = 2 then 1 end) as cnt_2
,count(case when tab = 3 then 1 end) as cnt_3
,count(case when tab = 4 then 1 end) as cnt_4
from ( select 1 as tab,id from table1
union all select 2 as tab,id from table2
union all select 3 as tab,id from table3
union all select 4 as tab,id from table4
) t
group by id
having greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
) t
order by product desc
limit 10
;

Related

select statement should return count as zero if no row return using group by clause

I have a table student_info, it has column "status", status can be P (present), A (absent), S (ill), T ( transfer), L (left).
I am looking for expected output as below.
status count(*)
P 12
S 1
A 2
T 0
L 0
But output is coming like as below:
Status Count(*)
P 12
S 1
A 2
we need rows against status T and L as well with count zero though no record exist in DB.
#mkuligowski's approach is close, but you need an outer join between the CTE providing all of the possible status values, and then you need to count the entries that actually match:
-- CTE to generate all possible status values
with stored_statuses (status) as (
select 'A' from dual
union all select 'L' from dual
union all select 'P' from dual
union all select 'S' from dual
union all select 'T' from dual
)
select ss.status, count(si.status)
from stored_statuses ss
left join student_info si on si.status = ss.status
group by ss.status;
STATUS COUNT(SI.STATUS)
------ ----------------
P 12
A 2
T 0
S 1
L 0
The CTE acts as a dummy table holding the five statuses you want to count. That is then outer joined to your real table - the outer join means the rows from the CTE are still included even if there is no match - and then the rows that are matched in your table are counted. That allows the zero counts to be included.
You could also do this with a collection:
select ss.status, count(si.status)
from (
select column_value as status from table(sys.odcivarchar2list('A','L','P','S','T'))
) ss
left join student_info si on si.status = ss.status
group by ss.status;
It would be preferable to have a physical table which holds those values (and their descriptions); you could also then have a primary/foreign key relationship to enforce the allowed values in your existing table.
If all the status values actually appear in your table, but you have a filter which happens to exclude all rows for some of them, then you could get the list of all (used) values from the table itself instead of hard-coding it.
If your initial query was something like this, with a completely made-up filter:
select si.status, count(*)
from student_info si
where si.some_condition = 'true'
group by si.status;
then you could use a subquery to get all the distinct values from the unfiltered table, outer join from that to the same table, and apply the filter as part of the outer join condition:
select ss.status, count(si.status)
from (
select distinct status from student_info
) ss
left join student_info si on si.status = ss.status
and si.some_condition = 'true'
group by ss.status;
It can't stay as a where clause (at least here, where it's applying to the right-hand-side of the outer join) because that would override the outer join and effectively turn it back into an inner join.
You should store somewhere your statuses (pherhaps in another table). Otherwise, you list them using subquery:
with stored_statuses as (
select 'P' code, 'present' description from dual
union all
select 'A' code, 'absent' description from dual
union all
select 'S' code, 'ill' description from dual
union all
select 'T' code, 'transfer' description from dual
union all
select 'L' code, 'left' description from dual
)
select ss.code, count(*) from student_info si
left join stored_statuses ss on ss.code = si.status
group by ss.code

Need help creating a view on top of a JOIN query that needs to return only the latest value

I need help with my SQL Query I have Two tables that i need to join using a LEFT OUTER JOIN, then i need to create a database view over that particular view. If i run a query on the join to look for name A i need to get that A's latest brand "AP".
Table 1
ID name address
-----------------------
1 A ATL
2 B ATL
TABLE 2
ID PER_ID brand DATEE
--------------------------------------------
1 1 MS 5/19/17:1:00pm
2 1 XB 5/19/17:1:05pm
3 1 AP 5/19/17:2:00pm
4 2 RO 5/19/17:3:00pm
5 2 WE 5/19/17:4:00pm
I tried query a which returns correct result but i get problem 1 when i try to build the database view on top of the join. I tried query b but when i query my view in oracle sql developer i still get all the results but not the latest.
query a:
SELECT * from table_1
left outer join table_2 on table_1.ID = Table_2.PER_ID
AND table_2.DATE = (SELECT MAX(DATE) from table_2 z where z.PER_ID = table_2.PER_ID)
Problem 1
Error report -
ORA-01799: a column may not be outer-joined to a subquery
01799. 00000 - "a column may not be outer-joined to a subquery"
*Cause: <expression>(+) <relop> (<subquery>) is not allowed.
*Action: Either remove the (+) or make a view out of the subquery.
In V6 and before, the (+) was just ignored in this case.
Query 2:
SELECT * from table_1
left outer join(SELECT PER_ID,brand, max(DATEE) from table_2 group by brand,PER_ID) t2 on table_1.ID = t2.PER_ID
Use row_number():
select t1.id, t1.name, t1.address, t2.id as t2_id, t2.brand, t2.datee
from table_1 t1 left outer join
(select t2.*,
row_number() over (partition by per_id order by date desc) as seqnum
from table_2 t2
) t2
on t1.ID = t2.PER_ID and t2.seqnum = 1;
When defining a view, you should be in the habit of listing the columns explicitly.

data not retrieved in the same order

I have a query which returns list of SIMs. Each SIM is linked to a Customer. The SIMs are in T_SIM table and Customers are in T_CUSTOMER table. There can be more than one SIM linked to a single Customer. When returning the SIMs it returns the Customer details also.
The T_SIM table will have a foreigh key to T_CUSTOMER table.
The issue is:
First run the query by requesting top 100 records by doing order by CUSTOMER_CODE in ascending order.
Now run the same query by requesting top 1000 records by doing order by CUSTOMER_CODE in ascending order.
Here in point #2, in the results of 1000 records the first 100 records are not same as in point #1 result. The records got shuffled. The order is not consistent.
To resolve this I have used ROWID along with order by CUSTOMER_CODE.
But the solution is not accepted by the client.
Could you please suggest any other alternative to resolve the issue. The data type of CUSTOMER_CODE is VARCHAR2
Below is the query:
SELECT TT.SIM_ID,
TT.IMSI,
TT.MSISDN,
TT.SECONDARY_MSISDN,
TT.CUSTOMER_ID,
TT.SIM_STATE,
TCU.CUSTOMER_CODE
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 = 1
AND TT.SIM_ID IN
(SELECT SIM_ID
FROM
(SELECT *
FROM
(SELECT Z.*,
ROWNUM RNUM
FROM
(SELECT TT.SIM_ID
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 =1
ORDER BY TCU.CUSTOMER_CODE ASC
) Z
WHERE ROWNUM <= 1000
)
WHERE RNUM >= 0
)
)
ORDER BY TCU.CUSTOMER_CODE ASC
The result in both the cases is done order by CUSTOMER_CODE but the SIMS belonging to them are not coming in the same order.
The problem is that first you are limiting number of rows when selecting from t_sim (so these are selected randomly) , and just then you are ordering your output.
So what you should do, is to remove ROWNUM<1000 from inner query and
put it on the very top level like this:
select * from
( TT.SIM_ID,
TT.IMSI,
TT.MSISDN,
TT.SECONDARY_MSISDN,
TT.CUSTOMER_ID,
TT.SIM_STATE,
TCU.CUSTOMER_CODE
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 = 1
AND TT.SIM_ID IN
(SELECT SIM_ID
FROM
(SELECT *
FROM
(SELECT Z.*,
ROWNUM RNUM
FROM
(SELECT TT.SIM_ID
FROM T_SIM TT
LEFT OUTER JOIN T_CUSTOMER TCU
ON TT.CUSTOMER_ID = TCU.CUSTOMER_ID
WHERE 1 =1
ORDER BY TCU.CUSTOMER_CODE ASC
) Z
)
WHERE RNUM >= 0
)
)
ORDER BY TCU.CUSTOMER_CODE ASC
) where rownum<1000
Because first you want to make complete ordered result set and just then display 1000 top records of sim cards ordered by customer_code.

Selecting rows that has exactly same data as other table

I have these tables:
Products, Articles, Product_Articles
Lets say, product_ids are: p1 , p2 article_ids are: a1 , a2 , a3
product_articles is:
(p1,a1)
(p1,a2)
(p2,a1)
(p2,a1)
(p2,a2)
(p2,a3)
How to query for product_id, which has only a1,a2, nothing less, nothing more?
UPDATED Try
SELECT p.*
FROM products p JOIN
(
SELECT product_id
FROM product_articles
GROUP BY product_id
HAVING COUNT(*) = SUM(CASE WHEN article_id IN (1, 2) THEN 1 ELSE 0 END)
AND SUM(CASE WHEN article_id IN (1, 2) THEN 1 ELSE 0 END) = 2
) q ON p.product_id = q.product_id
or
SELECT p.*
FROM products p JOIN
(
SELECT product_id, COUNT(*) a_count
FROM product_articles
WHERE article_id IN (1, 2)
GROUP BY product_id
HAVING COUNT(*) = 2
) a ON p.product_id = a.product_id JOIN
(
SELECT product_id, COUNT(*) total_count
FROM product_articles
GROUP BY product_id
) b ON p.product_id = b.product_id
WHERE a.a_count = b.total_count
Here is SQLFiddle demo for both queries
This is an example of a "set-within-sets" subquery. I advocate using aggregation with a having clause for the logic, because this is the most general way to express the relationships.
The idea is that you can count the appearance of the articles within a product (in this case) in a way similar to using a where statement. The code is a bit more complex, but it offers flexibility. In your case, this would be:
select pa.product_id
from product_articles pa
group by pa.product_id
having sum(case when pa.article_id = 'a1' then 1 else 0 end) > 0 and
sum(case when pa.article_id = 'a2' then 1 else 0 end) > 0 and
sum(case when pa.article_id not in ('a1', 'a2') then 1 else 0 end) = 0;
The first two clauses count the appearance of the two articles, making sure that there is at least one occurrence of each. The last counts the number of rows without those two articles, making sure there are none.
You can see how this easily generalizes to more articles. Or to queries where you have "a1" and "a2" but not "a3". Or where you have three of four of specific articles, and so on.
I believe this can be done entirely using relational joins, as follows:
SELECT DISTINCT pa1.PRODUCT_ID
FROM PRODUCT_ARTICLES pa1
INNER JOIN PRODUCT_ARTICLES pa2
ON (pa2.PRODUCT_ID = pa1.PRODUCT_ID)
LEFT OUTER JOIN (SELECT *
FROM PRODUCT_ARTICLES
WHERE ARTICLE_ID NOT IN (1, 2)) pa3
ON (pa3.PRODUCT_ID = pa1.PRODUCT_ID)
WHERE pa1.ARTICLE_ID = 1 AND
pa2.ARTICLE_ID = 2 AND
pa3.PRODUCT_ID IS NULL
SQLFiddle here.
The inner join looks for products associated with the articles we care about (articles 1 and 2 - produces product 1 and 2). The left outer looks for products associated with articles we don't care about (anything article except 1 and 2) and then only accepts products which don't have any unwanted articles (i.e. pa3.PRODUCT_ID IS NULL, indicating that no row from pa3 was joined in).

Sybase expert help: groupby aggregate performance problems

Hey I have the following tables and SQL:
T1: ID, col2,col3 - PK(ID) - 23mil rows
T2: ID, col2,col3 - PK(ID) - 23mil rows
T3: ID, name,value - PK(ID,name) -66mil rows
1) The below sql returns back the 10k row resultset very fast, no problems.
select top 10000 T1.col2, T2.col2, T3.name, T4.value
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
2) The below sql took FOREVER.
select top 10000 T1.col2, T2.col2,
ABC = min(case when T3.name='ABC ' then T3.value end)
XYZ = min(case when T3.name='XYZ ' then T3.value end)
from T1, T2, T3
where T1.ID = T2.ID and T1.ID *= T3.ID and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE'
group by T1.col2, T2.col2,
The only difference in the showplan between those 2 queries are the below for query 2). I dont understand it 100%, is it selecting the ENTIRE resultset WITHOUT top 10000 into the temp table then doing a group by on it? is that why it's slow?
STEP 1
The type of query is SELECT (into Worktable1).
GROUP BY
Evaluate Grouped MINIMUM AGGREGATE.
FROM TABLE ...etc..
TO TABLE
Worktable1.
STEP 2
The type of query is SELECT.
FROM TABLE
Worktable1.
Nested iteration.
Table Scan.
Forward scan.
Positioning at start of table.
Using I/O Size 16 Kbytes for data pages.
With MRU Buffer Replacement Strategy for data pages.
My question is
1) Why is query 2) so slow
2) How do I fix while keeping the query logic the same and preferably limit it to just 1 select SQL as before.
thank you
Although possibly a generic answer, I'd say to put a index on the columns you're grouping by.
Edit / Revise: Here's my theory after re-looking at the issue. The SELECT statement in a query is always the last line executed. This makes sense as it is the statement that retrieves the values you want from the dataset specified below. In your query, the whole dataset (millions of records) will be evaluated for the MIN value expression that you specified. There will be two seperate functions called on the entire dataset, since you have specified two MIN columns in the select statement. After the dataset is filtered and the MIN columns have been determined, the top 10000 rows will then be selected.
In a nutshell, you're doing two mathematical function on millions of records. This will take a significant amount of time, especially with no indexes.
The solution for you would be to use a derived table. I haven't compiled the code below, but it's something close to what you would use. It will only take the min values of the 10,000 records rather than the whole dataset.
I.e.
Select my_derived_table.t1col2, my_derived_table.t2col2,
ABC = min(case when my_derived_table.t3name ='ABC ' then my_derived_table.t3value end),
XYZ = min(case when my_derived_table.t3name='XYZ ' then my_derived_table.t3value end)
FROM
(Select top 10000 T1.col2 as t1col2,
T2.col2 as t2col2,
t3.name as t3name,
t3.value as t3.value
from T1, T2, T3
where T1.ID = T2.ID
and T1.ID *= T3.ID
and T3.name in ('ABC','XYZ')
and T2.col1 = 'SOMEVALUE') my_derived_table
group by my_derived_table.t1col2, my_derived_table.t2col2

Resources