Need guidance on re-writing this query - hadoop

Current scenario: we have a query that runs on our prod cluster.
It selects just 3 fields from a nested join between one table and another, much larger table, and then performs a GROUP BY at the end. It runs for 2 hours on production, and the join hits one huge table.
Query:
INSERT OVERWRITE TABLE mstr_wrk.final_acct_data
SELECT
a0,
a1,
a2,
a3
FROM
(
SELECT
t1.a0 as a0
FROM
(
SELECT
t1.a0 as a0
FROM
(
SELECT
CAST(t1.acct_id AS STRING) as a0
FROM
mstr_wrk.cust_xref t1
)
t1
GROUP BY
t1.a0
)
t1
)
tab1
RIGHT OUTER JOIN
(
SELECT
a0,
a1,
a2
FROM
(
SELECT
(
CASE
WHEN
1 = t1.a1
THEN
t1.a0
ELSE
CAST(NULL AS TIMESTAMP)
END
) as a0, UDFcalldate('TRUNC', UDFcalldate('ADD_TO_DATE',
(
CASE
WHEN
1 = t1.a1
THEN
t1.a0
ELSE
CAST(NULL AS TIMESTAMP)
END
)
, 'D', - 1), 'DD') as a1
FROM
(
SELECT
MAX(t1.a0) as a0,
MAX(t1.a1) as a1
FROM
(
SELECT
load_audit.run_ts as a0,
1 as a1
FROM
mstr_wrk.load_audit
WHERE
val_name = 'card_stg'
)
t1
)
t1
)
tab4
JOIN
(
SELECT
CAST(t1.acct_cd AS STRING) as a0,
CAST(t1.h_acct_cd AS STRING) as a1,
CAST(t1.acct_num AS STRING) as a2,
CAST(t1.load_dt AS TIMESTAMP) as a3,
t1.ts as a4
FROM
mstr_work.acct_crd t1
)
tab3
WHERE
(
tab4.a0 < tab3.a4
)
AND
(
tab4.a1 <= tab3.a3
)
)
tab2
ON (tab1.a0 = tab2.a1)
WHERE
1 =
(
CASE
WHEN
tab1.a0 IS NULL
THEN
1
ELSE
0
END
)
GROUP BY
tab2.a0, tab2.a1, tab2.a2
What I tried:
I enabled CBO and vectorization along with PPD (predicate pushdown), but with no luck.
I don't see any small table here, so I can't try a map-side join.
One of the joins, which looks like a cross join, could be rewritten as an inner join.
Is there any way I can use a CTE here?
Request:
Kindly advise how I can fix this.
Is there a better way to rewrite this query?
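One way to approach the CTE question is sketched below in HiveQL. It is a sketch under assumptions, not a tested replacement. The CTE names (cutoff, recent_cards, known_accts) are made up for illustration. It assumes that tab2's a0, a1 and a2 are meant to be tab3's acct_cd, h_acct_cd and acct_num (the original is ambiguous, because tab4 and tab3 both expose a0 and a1). The INSERT OVERWRITE is left out because the outer SELECT asks for an a3 that tab2 does not expose. The CASE WHEN 1 = t1.a1 wrapper is dropped because MAX(run_ts) is already NULL when no audit row matches. A CTE mainly makes the logic readable; whether it runs faster has to be checked with EXPLAIN on the real data.

```sql
-- Sketch only: CTE names (cutoff, recent_cards, known_accts) are hypothetical.
WITH cutoff AS (
    -- Single row: the latest card_stg run timestamp (NULL if no audit row exists).
    SELECT MAX(run_ts) AS max_run_ts
    FROM mstr_wrk.load_audit
    WHERE val_name = 'card_stg'
),
recent_cards AS (
    -- Rows of the large table that are newer than the cutoff.
    SELECT CAST(c.acct_cd   AS STRING) AS acct_cd,
           CAST(c.h_acct_cd AS STRING) AS h_acct_cd,
           CAST(c.acct_num  AS STRING) AS acct_num
    FROM mstr_work.acct_crd c
    CROSS JOIN cutoff k
    WHERE k.max_run_ts < c.ts
      AND UDFcalldate('TRUNC', UDFcalldate('ADD_TO_DATE', k.max_run_ts, 'D', -1), 'DD')
          <= CAST(c.load_dt AS TIMESTAMP)
),
known_accts AS (
    -- Distinct account ids, replacing the three nested subqueries over cust_xref.
    SELECT DISTINCT CAST(acct_id AS STRING) AS acct_id
    FROM mstr_wrk.cust_xref
)
-- Anti-join: keep card rows whose h_acct_cd has no match in cust_xref.
SELECT r.acct_cd, r.h_acct_cd, r.acct_num
FROM recent_cards r
LEFT OUTER JOIN known_accts k
    ON r.h_acct_cd = k.acct_id
WHERE k.acct_id IS NULL
GROUP BY r.acct_cd, r.h_acct_cd, r.acct_num;
```

If acct_crd happens to be partitioned on load_dt (an assumption, the question does not say), computing the cutoff in a separate step and passing it in as a literal may allow partition pruning on the large table.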

Related

Pivot On More Than One Column From Source Table

I have a query where pivoting on one column and getting the total notionals works, but I also need the total notional for another column (type_).
I have tried creating another pivot to show the total for the type column from source table.
SELECT * FROM (SELECT venue, type_, notional FROM ABC ) PIVOT ( SUM(notional) FOR (venue) IN ('A' as A1 , 'B' as B1) ) PIVOT ( SUM(notional) FOR (type_) IN ('Prime') );
I expect to see one pivot which would show the total for Prime, A1 and B1.

I think you shouldn't use pivot for this problem, but simple aggregate functions:
SELECT SUM(CASE WHEN venue = 'A' THEN notional END) AS venueA
, SUM(CASE WHEN venue = 'B' THEN notional END) AS venueB
, SUM(CASE WHEN type_ = 'Prime' THEN notional END) AS Prime
FROM ABC
If it's really only one type you can do the following:
SELECT * FROM
(SELECT venue
, SUM(CASE WHEN type_ = 'Prime' THEN notional END) over() prime
, notional
FROM ABC )
PIVOT ( SUM(notional) FOR (venue) IN ('A' as A1 , 'B' as B1) )
If there is more than one type, you have to run several PIVOT queries and then JOIN them afterwards:
SELECT * FROM
(SELECT * FROM (SELECT venue, notional FROM ABC )
PIVOT ( SUM(notional) FOR (venue) IN ('A' as A1 , 'B' as B1) ))
CROSS JOIN
(SELECT * FROM (SELECT type_, notional FROM ABC )
PIVOT ( SUM(notional) FOR (type_) IN ('Prime' AS Prime)))

Oracle Optimize Query at View

I have a query in a view that makes the cost very high:
CREATE OR REPLACE FORCE EDITIONABLE VIEW "TMI_ISD_WV_LINK_REV4" ("P_DATA_NO", "DATA_KR_NO", "R_DATA_NO", "R_WF_ST", "SYSTEM_MATTER_ID") AS
SELECT wvl."P_DATA_NO"
,wvl."DATA_KR_NO"
,wvl."R_DATA_NO"
,wvl."R_WF_ST"
,(
select system_matter_id from (select system_matter_id from TMI_ISD_ALL_WORKFLOW wfl WHERE wfl.DATA_NO IN (wvl.R_DATA_NO,wvl.P_DATA_NO) order by wfl.create_date_TIME desc )
where rownum <= 1--fetch first 1 row only
)SYSTEM_MATTER_ID
FROM TMI_ISD_WV_LINK_REV3 wvl;
Most of the cost is taken by this view.
TMI_ISD_WV_LINK_REV3 and TMI_ISD_ALL_WORKFLOW are also views:
CREATE OR REPLACE FORCE EDITIONABLE VIEW "TMI_ISD_ALL_WORKFLOW" ("KS_CD", "DATA_NO", "SYSTEM_MATTER_ID", "USER_DATA_ID", "MATTER_NUMBER", "FLOW_ID", "KI_WF_ST", "PROCESS_DATE", "APPLY_AUTH_USER_CODE", "CREATE_DATE_TIME", "CREATER_ID", "UPDATE_DATE_TIME", "UPDATER_ID") AS
SELECT "KS_CD","DATA_NO","SYSTEM_MATTER_ID","USER_DATA_ID","MATTER_NUMBER","FLOW_ID","KI_WF_ST","PROCESS_DATE","APPLY_AUTH_USER_CODE","CREATE_DATE_TIME","CREATER_ID","UPDATE_DATE_TIME","UPDATER_ID"
FROM NWA7_WF_HZ_KGH_KSH_SS
UNION ALL
SELECT "KS_CD","DATA_NO","SYSTEM_MATTER_ID","USER_DATA_ID","MATTER_NUMBER","FLOW_ID","KI_WF_ST","PROCESS_DATE","APPLY_AUTH_USER_CODE","CREATE_DATE_TIME","CREATER_ID","UPDATE_DATE_TIME","UPDATER_ID"
FROM NWA7_WF_HZ_KGH_KSH_YT
union all
SELECT "KS_CD","DATA_NO","SYSTEM_MATTER_ID","USER_DATA_ID","MATTER_NUMBER","FLOW_ID","KI_WF_ST","PROCESS_DATE","APPLY_AUTH_USER_CODE","CREATE_DATE_TIME","CREATER_ID","UPDATE_DATE_TIME","UPDATER_ID"
FROM NWA7_WF_HZ_SCRH_SS
UNION ALL
SELECT "KS_CD","DATA_NO","SYSTEM_MATTER_ID","USER_DATA_ID","MATTER_NUMBER","FLOW_ID","KI_WF_ST","PROCESS_DATE","APPLY_AUTH_USER_CODE","CREATE_DATE_TIME","CREATER_ID","UPDATE_DATE_TIME","UPDATER_ID"
FROM NWA7_WF_HZ_SC_YT
UNION ALL
SELECT "KS_CD","DATA_NO","SYSTEM_MATTER_ID","USER_DATA_ID","MATTER_NUMBER","FLOW_ID","KI_WF_ST","PROCESS_DATE","APPLY_AUTH_USER_CODE","CREATE_DATE_TIME","CREATER_ID","UPDATE_DATE_TIME","UPDATER_ID"
FROM NWA7_WF_HZ_SKH_YT
UNION ALL
SELECT "KS_CD","DATA_NO","SYSTEM_MATTER_ID","USER_DATA_ID","MATTER_NUMBER","FLOW_ID","KI_WF_ST","PROCESS_DATE","APPLY_AUTH_USER_CODE","CREATE_DATE_TIME","CREATER_ID","UPDATE_DATE_TIME","UPDATER_ID"
FROM NWA7_WF_HZ_SKH_SS;
TMI_ISD_WV_LINK_REV3 :
CREATE OR REPLACE FORCE EDITIONABLE VIEW "TMI_ISD_WV_LINK_REV3" ("P_DATA_NO", "DATA_KR_NO", "R_DATA_NO", "R_WF_ST", "SYSTEM_MATTER_ID") AS
SELECT J1.data_no AS P_DATA_NO
,J1.data_kr_no
,J2.data_no AS R_DATA_NO
,J2.KI_WF_ST AS R_WF_ST
,J3.SYSTEM_MATTER_ID
FROM NWJ2_T_KGH_KSH_YT_HZ J1
LEFT JOIN (
SELECT a.KS_CD
,a.DATA_KR_NO
,a.DATA_NO
,b.KI_WF_ST
FROM NWJ2_T_KGH_KSH_SS_HZ a
JOIN NWA8_T_KS_SK_KGH_KSH_SS b ON a.KS_CD = b.KS_CD
AND a.DATA_NO = b.DATA_NO
WHERE a.KGH_KSH_SS_YK_KBN = '1'
) J2 ON J1.KS_CD = J2.KS_CD
AND J1.DATA_KR_NO = J2.DATA_KR_NO
LEFT JOIN (
SELECT KS_CD,SYSTEM_MATTER_ID,DATA_NO
FROM NWA7_WF_HZ_KGH_KSH_SS
UNION ALL
SELECT KS_CD,SYSTEM_MATTER_ID,DATA_NO
FROM NWA7_WF_HZ_KGH_KSH_YT
) J3 ON J1.KS_CD = J3.KS_CD
AND J1.DATA_NO = J3.DATA_NO
WHERE J1.KGH_KSH_YT_YK_KBN = '1'
UNION ALL
SELECT B1.data_no AS P_DATA_NO
,B1.data_kr_no
,B2.data_no AS R_DATA_NO
,B2.KI_WF_ST AS R_WF_ST
,B3.SYSTEM_MATTER_ID
FROM NWB2_T_SC_YT_HZ B1
LEFT JOIN (
SELECT a.KS_CD
,a.DATA_KR_NO
,a.DATA_NO
,b.KI_WF_ST
FROM NWB2_T_SC_RH_SS_HZ a
JOIN NWA8_T_KS_SK_SCRH_SS b ON a.KS_CD = b.KS_CD
AND a.DATA_NO = b.DATA_NO
WHERE a.SC_RH_SS_YK_KBN = '1'
) B2 ON B1.DATA_KR_NO = B2.DATA_KR_NO
AND B1.KS_CD = B2.KS_CD
LEFT JOIN (
SELECT KS_CD,SYSTEM_MATTER_ID,DATA_NO
FROM NWA7_WF_HZ_SCRH_SS
UNION ALL
SELECT KS_CD,SYSTEM_MATTER_ID,DATA_NO
FROM NWA7_WF_HZ_SC_YT
) B3 ON B1.KS_CD = B3.KS_CD
AND B1.DATA_NO = B3.DATA_NO
WHERE B1.SC_YT_YK_KBN = '1'
UNION ALL
SELECT C1.data_no AS P_DATA_NO
,C1.DATA_KR_NO
,C2.data_no AS R_DATA_NO
,C2.KI_WF_ST AS R_WF_ST
,C3.SYSTEM_MATTER_ID
FROM NWC2_T_SKH_YT_HZ C1
LEFT JOIN (
SELECT a.KS_CD
,a.DATA_KR_NO
,a.DATA_NO
,b.KI_WF_ST
FROM NWC2_T_SKH_SS_HZ a
JOIN NWA8_T_KS_SK_SKH_SS b ON a.KS_CD = b.KS_CD
AND a.DATA_NO = b.DATA_NO
WHERE a.SKH_SS_YK_KBN = '1'
) C2 ON C1.DATA_KR_NO = C2.DATA_KR_NO
AND C1.KS_CD = C2.KS_CD
LEFT JOIN (
SELECT KS_CD,SYSTEM_MATTER_ID,DATA_NO
FROM NWA7_WF_HZ_SKH_YT
UNION ALL
SELECT KS_CD,SYSTEM_MATTER_ID,DATA_NO
FROM NWA7_WF_HZ_SKH_SS
) C3 ON C1.KS_CD = C3.KS_CD
AND C1.DATA_NO = C3.DATA_NO
WHERE C1.SKH_YT_YK_KBN = '1';
As far as I know, I cannot add an index to a view in Oracle, so I am guessing the culprit is this subquery in TMI_ISD_WV_LINK_REV4:
(
select system_matter_id from (select system_matter_id from TMI_ISD_ALL_WORKFLOW wfl WHERE wfl.DATA_NO IN (wvl.R_DATA_NO,wvl.P_DATA_NO) order by wfl.create_date_TIME desc )
where rownum <= 1--fetch first 1 row only
)SYSTEM_MATTER_ID

Performance has actually been improved by using a materialized view, but the underlying data is updated roughly every 3 minutes.
Thank you all.
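A hedged sketch of two things that could be tried for the scalar subquery in TMI_ISD_WV_LINK_REV4. Indexes cannot be created on a view, but they can be created on the base tables behind TMI_ISD_ALL_WORKFLOW, so the first statement indexes one of them (the index name is made up, it assumes the NWA7_WF_HZ_* objects are tables, and the same would be repeated for the other five). The second statement replaces the ORDER BY plus ROWNUM pattern with Oracle's KEEP (DENSE_RANK LAST ...) aggregate, which removes one level of nesting. When several rows share the latest CREATE_DATE_TIME, the original returns an arbitrary one of them, while this version returns the largest SYSTEM_MATTER_ID among them. Whether either change lowers the cost has to be checked against the actual execution plan.

```sql
-- Hypothetical index name; repeat for each base table of TMI_ISD_ALL_WORKFLOW.
CREATE INDEX ix_wf_kgh_ksh_ss_data_no
    ON NWA7_WF_HZ_KGH_KSH_SS (DATA_NO, CREATE_DATE_TIME);

-- Same columns as TMI_ISD_WV_LINK_REV4, with the latest row picked by an aggregate.
SELECT wvl.P_DATA_NO,
       wvl.DATA_KR_NO,
       wvl.R_DATA_NO,
       wvl.R_WF_ST,
       (
           SELECT MAX(wfl.SYSTEM_MATTER_ID)
                      KEEP (DENSE_RANK LAST ORDER BY wfl.CREATE_DATE_TIME)
           FROM TMI_ISD_ALL_WORKFLOW wfl
           WHERE wfl.DATA_NO IN (wvl.R_DATA_NO, wvl.P_DATA_NO)
       ) AS SYSTEM_MATTER_ID
FROM TMI_ISD_WV_LINK_REV3 wvl;
```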

Hive - OR condition with left outer join

I have gone through the questions on SO for similar cases. Although the error may be common, I am looking for a solution to this specific case. Please refrain from marking the question as a duplicate unless you find exactly the same scenario with an accepted solution.
I have two tables
Main table:
c1 c2 c3 c4 c5
1 2 3 4 A
Other table
c1 c2 c3 c4 c5
1 8 5 6 B
8 2 8 9 C
8 7 3 9 C
8 7 9 4 C
5 6 7 8 D
Now, from the other table, I should be able to pick only the records that are unique across all columns, e.g. only the last row (5, 6, 7, 8, D).
Row 1 of the other table is rejected because its c1 value (1) is the same as the c1 value (1) in the main table; row 2 is rejected because the c2 values of the other and main tables match; and so on.
In a nutshell, in the output of the query no column of the other table may have the same value as the corresponding column of the main table.
I tried the query below:
select t1.* from otherTable t1
LEFT OUTER JOIN mainTable t2
ON ( t1.c1 = t2.c1 OR t1.c2 = t2.c2 OR t1.c3 = t2.c3 OR t1.c4 = t2.c4 )
Where t2.c5 is null;
However, Hive throws the exception below:
OR not supported in JOIN currently
I understand the Hive limitation, and many times I have used UNION (ALL | DISTINCT) with an inner join to work around it, but I am not able to use the same strategy here.
Please help.
EDIT 1: I have a Hive version restriction and can only use version 1.2.0.

You can do a Cartesian product join (an inner join with no conditions):
select t1.* from otherTable t1
,mainTable t2
WHERE t1.c1 != t2.c1 AND t1.c2 != t2.c2
AND t1.c3 != t2.c3 AND t1.c4 != t2.c4 AND t1.c5 != t2.c5;
Assuming you have one row in mainTable, this query should be as efficient as one that uses an OUTER JOIN.
Another option is to break your proposed query into 5 different LEFT OUTER JOIN sub-queries:
select t1.* from (
select t1.* from (
select t1.* from (
select t1.* from (
select t1.* from otherTable t1
LEFT OUTER JOIN (select distinct c1 from mainTable) t2
ON ( t1.c1 = t2.c1) Where t2.c1 is null ) t1
LEFT OUTER JOIN (select distinct c2 from mainTable) t2
ON ( t1.c2 = t2.c2) Where t2.c2 is null ) t1
LEFT OUTER JOIN (select distinct c3 from mainTable) t2
ON ( t1.c3 = t2.c3) Where t2.c3 is null ) t1
LEFT OUTER JOIN (select distinct c4 from mainTable) t2
ON ( t1.c4 = t2.c4) Where t2.c4 is null ) t1
LEFT OUTER JOIN (select distinct c5 from mainTable) t2
ON ( t1.c5 = t2.c5) Where t2.c5 is null
;
Here, for each column, I first get the distinct values from mainTable and join them with what is left of otherTable. The downside is that I pass over mainTable 5 times, once for each column. If the values in mainTable are unique, you can remove the DISTINCT from the subqueries.
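The Cartesian product query above assumes a single row in mainTable. A possible variant for several rows is sketched below, under the assumptions that duplicate rows in otherTable may be collapsed and that mainTable is not empty. It cross joins the two tables and then keeps only the otherTable rows for which no mainTable row matched on any column. It compares c1 to c4, as in the join condition from the question.

```sql
-- Sketch: count, per otherTable row, how many mainTable rows match on any column,
-- and keep the rows where that count is zero.
SELECT t1.c1, t1.c2, t1.c3, t1.c4, t1.c5
FROM otherTable t1
CROSS JOIN mainTable t2
GROUP BY t1.c1, t1.c2, t1.c3, t1.c4, t1.c5
HAVING SUM(CASE
               WHEN t1.c1 = t2.c1 OR t1.c2 = t2.c2
                 OR t1.c3 = t2.c3 OR t1.c4 = t2.c4
               THEN 1
               ELSE 0
           END) = 0;
```

With the sample data from the question, only the row (5, 6, 7, 8, D) has no matching column and is returned. Comparisons involving NULL fall into the ELSE branch, so NULL values never count as a match.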

CROSS APPLY too slow for running total - TSQL

Please see my code below as it is running too slowly with the CROSS APPLY.
How can I remove the CROSS APPLY and add something else that will run faster?
Please note I am using SQL Server 2008 R2.
;WITH MyCTE AS
(
SELECT
R.NetWinCURRENCYValue AS NetWin
,dD.[Date] AS TheDay
FROM
dimPlayer AS P
JOIN
dbo.factRevenue AS R ON P.playerKey = R.playerKey
JOIN
dbo.vw_Date AS dD ON Dd.dateKey = R.dateKey
WHERE
P.CustomerID = 12345)
SELECT
A.TheDay AS [Date]
,ISNULL(A.NetWin, 0) AS NetWin
,rt.runningTotal AS CumulativeNetWin
FROM MyCTE AS A
CROSS APPLY (SELECT SUM(NetWin) AS runningTotal
FROM MyCTE WHERE TheDay <= A.TheDay) AS rt
ORDER BY A.TheDay

CREATE TABLE #temp (NetWin money, TheDay datetime)
insert into #temp
SELECT
R.NetWinCURRENCYValue AS NetWin
,dD.[Date] AS TheDay
FROM
dimPlayer AS P
JOIN
dbo.factRevenue AS R ON P.playerKey = R.playerKey
JOIN
dbo.vw_Date AS dD ON Dd.dateKey = R.dateKey
WHERE
P.CustomerID = 12345;
SELECT
A.TheDay AS [Date]
,ISNULL(A.NetWin, 0) AS NetWin
,SUM(B.NetWin) AS CumulativeNetWin
FROM #temp AS A
JOIN #temp AS B
ON A.TheDay >= B.TheDay
GROUP BY A.TheDay, ISNULL(A.NetWin, 0);

The answer at https://stackoverflow.com/a/13744550/613130 suggests using a recursive CTE.
;WITH MyCTE AS
(
SELECT
R.NetWinCURRENCYValue AS NetWin
,dD.[Date] AS TheDay
,ROW_NUMBER() OVER (ORDER BY dD.[Date]) AS RN
FROM dimPlayer AS P
JOIN dbo.factRevenue AS R ON P.playerKey = R.playerKey
JOIN dbo.vw_Date AS dD ON Dd.dateKey = R.dateKey
WHERE P.CustomerID = 12345
)
, MyCTERec AS
(
SELECT C.TheDay AS [Date]
,ISNULL(C.NetWin, 0) AS NetWin
,ISNULL(C.NetWin, 0) AS CumulativeNetWin
,C.RN
FROM MyCTE AS C
WHERE C.RN = 1
UNION ALL
SELECT C.TheDay AS [Date]
,ISNULL(C.NetWin, 0) AS NetWin
,P.CumulativeNetWin + ISNULL(C.NetWin, 0) AS CumulativeNetWin
,C.RN
FROM MyCTERec P
INNER JOIN MyCTE AS C ON C.RN = P.RN + 1
)
SELECT *
FROM MyCTERec
ORDER BY RN
OPTION (MAXRECURSION 0)
Note that this query only works if you have exactly one record per day. If you have multiple records in a day, the results will differ from those of the other query.

If you want a really fast calculation, put the data into a temporary table with a sequential primary key and then calculate the running total:
create table #Temp (
ID bigint identity(1, 1) primary key,
[Date] date,
NetWin decimal(29, 10)
)
insert into #Temp ([Date], NetWin)
select
dD.[Date],
sum(R.NetWinCURRENCYValue) as NetWin
from dbo.dimPlayer as P
inner join dbo.factRevenue as R on P.playerKey = R.playerKey
inner join dbo.vw_Date as dD on Dd.dateKey = R.dateKey
where P.CustomerID = 12345
group by dD.[Date]
order by dD.[Date]
;with cte as (
select T.ID, T.[Date], T.NetWin, T.NetWin as CumulativeNetWin
from #Temp as T
where T.ID = 1
union all
select T.ID, T.[Date], T.NetWin, T.NetWin + C.CumulativeNetWin as CumulativeNetWin
from cte as C
inner join #Temp as T on T.ID = C.ID + 1
)
select C.[Date], C.NetWin, C.CumulativeNetWin
from cte as C
order by C.[Date]
option (maxrecursion 0)
I assume that you could have duplicate dates in the input but don't want duplicates in the output, so I grouped the data before putting it into the table.
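For readers who are not tied to SQL Server 2008 R2: from SQL Server 2012 onwards, SUM accepts an ORDER BY inside OVER, which gives the running total without a self-join, CROSS APPLY or recursion. This does not run on 2008 R2 and is shown only as a sketch for later versions, using the tables from the question and grouping by date first:

```sql
-- Requires SQL Server 2012 or later; not available on 2008 R2.
SELECT
    dD.[Date],
    SUM(R.NetWinCURRENCYValue) AS NetWin,
    -- Windowed SUM over the per-day totals, accumulated in date order.
    SUM(SUM(R.NetWinCURRENCYValue)) OVER (
        ORDER BY dD.[Date]
        ROWS UNBOUNDED PRECEDING
    ) AS CumulativeNetWin
FROM dbo.dimPlayer AS P
INNER JOIN dbo.factRevenue AS R ON P.playerKey = R.playerKey
INNER JOIN dbo.vw_Date AS dD ON dD.dateKey = R.dateKey
WHERE P.CustomerID = 12345
GROUP BY dD.[Date]
ORDER BY dD.[Date];
```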

Selecting all rows after a row with specific values without repeating the same subquery

I have a table t1 and t2 which I join and order to form data set set1.
Two columns c1 and c2 form a unique identifier for the rows in set1.
I want to get all values from set1 after the first row with a specific c1 and c2.
I have a query like the one below which works, but it repeats the same subquery twice, which seems superfluous and overly complex even for Oracle:
SELECT * FROM
(
SELECT row_number() OVER (ORDER BY c1, c3) myOrder, c1, c2, c3
FROM t1, t2
WHERE condition
ORDER BY conditions
) sub1,
(
SELECT sub1_again.myOrder FROM
(
SELECT row_number() OVER (ORDER BY c1, c3) myOrder, c2, c3
FROM t1, t2
WHERE condition
ORDER BY conditions
) sub1_again
WHERE sub1_again.c2 = 'foo' AND sub1_again.c3 = 'bar'
) sub2
WHERE sub1.myOrder >= sub2.myOrder
ORDER BY sub1.myOrder
It seems like SQL would have a simple way to do this, but I am not sure what to search for.
Is there a cleaner way to do this?

SELECT * FROM (
SELECT row_number() OVER (ORDER BY c1, c3) myOrder, c2, c3
,CASE WHEN c2 = 'foo' AND c3 = 'bar'
THEN row_number() OVER (ORDER BY c1, c3)
END target_rn
FROM t1, t2
WHERE condition
ORDER BY conditions
) WHERE myOrder > target_rn;

I think there is something missing in the accepted solution. However, it helped me to come up with this:
SELECT * FROM (
SELECT row_number() OVER (ORDER BY c1, c3) myOrder, c1, c2, c3,
max(case when c2 = 'foo' AND c3 = 'bar' then 1 else 0 end) over (order by c1, c3) rowFound
FROM t1, t2
WHERE condition
)
WHERE rowFound > 0
ORDER BY conditions
Basically the case is selecting which row is the one to start from, and the max "drags" the value from that row onwards. The last WHERE does the filtering.
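Another way to avoid writing the subquery twice is Oracle's WITH clause (subquery factoring), which names the joined set once and lets it be referenced twice. A minimal sketch, keeping the question's placeholder condition and its filter on c2 and c3; the names set1, s and x are chosen here for illustration:

```sql
-- "condition" is the placeholder from the question, not real syntax.
WITH set1 AS (
    SELECT ROW_NUMBER() OVER (ORDER BY c1, c3) AS myOrder, c1, c2, c3
    FROM t1, t2
    WHERE condition
)
SELECT s.*
FROM set1 s
-- Position of the first row that matches the target values.
WHERE s.myOrder >= (
          SELECT MIN(x.myOrder)
          FROM set1 x
          WHERE x.c2 = 'foo' AND x.c3 = 'bar'
      )
ORDER BY s.myOrder;
```

If no row matches the target values, the subquery yields NULL and the query returns no rows.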

Please try to be more specific. If I understood correctly, you just have to add a WHERE clause, like: SELECT * FROM ... WHERE element
