How to efficiently evaluate rows and sub records in single PL/SQL - oracle

I'm struggling writing PL/SQL (I'm new to PL/SQL) and I'm not sure how to structure SQL and loops for something like this.
I have 128 lines of SQL to create something like the following cursor:
ID Course Grade Attend? Date
123 MATH091 B Y 5/15
123 BIOL101 F N 3/10
123 ENGL201 W Y 1/2
456 MATH091 A Y 5/16
456 CHEM101 C Y 5/16
456 POLS301 NULL NULL NULL
With each ID, I need to several comparisons across the courses (e.g. which has the latest date, or were all courses attended). These comparisons need to be done in a certain order so that when they hit one that is true, they are flagged with a code and excluded from subsequent comparisons.
For example:
All courses attended? If true, output as attended and remove from next steps.
Find and store the latest date with a passing grade.
Find and store the latest date with a non-passing grade.
Is the later date after a course with a null grade? If true, output as coming back and remove from next steps.
Etc.
Each condition can be easily written in a SQL, but I don't know/understand the appropriate structure to loop through this process.
Is there syntax that can accomplish this easily?
We're on Oracle 11g and we do not have permissions to write to a temporary table.

I don't think you need PL/SQL for this. Except for the "unknown" requirement "etc." this can all be done in a single SQL statement:
Something like:
select id, course, grade, attended, attendance_date,
count(distinct case when attended = 'Y' then course end) over (partition by id) courses_attended,
count(distinct course) over () as total_courses,
case
when count(distinct case when attended = 'Y' then course end) over (partition by id) = count(distinct course) over () then 'yes'
else 'no'
end as all_courses_attended,
max(case when attended <> 'F' then attendance_date else null end) over (partition by id) as latest_passing_date,
max(case when attended = 'F' then attendance_date else null end) over (partition by id) as latest_non_passing_date
from attendees
order by id;
Btw: the attended column is not necessary if you have an attendance_date. If that date is not NULL than obviously the student attended the course. Otherwise she/he didn't.
Of course I have no idea what the "etc." steps should do though....
SQLFiddle example: http://sqlfiddle.com/#!4/e7c95/1

Related

duplicates with connect by in oracle

select RP.COUNTRYID,RP.PRDCODE,
RP.REPID,
RP.CHANNELID,
RP.CUSTOMERID,
RP.DIVISION,
RP.WWCOGS_GAUSS,
RP.WWCOGS_SAP,
RP.WWCOGS_BusLine,
RP.CURRID ,
ADD_MONTHS(to_date('01-01-'||rp.year,'DD-MM-YYYY'),level-1) KFDATE
from
(SELECT CP.COUNTRYID,
CP.PRDCODE,
CP.REPID,
CP.CHANNELID,
CP.CUSTOMERID,
IP.DIVISION,
CP.WWCOGS_GAUSS,
CP.WWCOGS_SAP,
CP.WWCOGS_BusLine,
CP.year,
CP.CURRID from
(select distinct IBC.COUNTRYID ,
decode(ls.dmdunit,null,mwa.ProductName,ls.dmdunit) PRDCODE,
ReportingUnit REPID,
'99' CHANNELID,
IBC.COUNTRYID CUSTOMERID ,
(
CASE
WHEN MWA.COGSSourceCF LIKE '%Gauss%'
THEN MWA.COGSPriceCF * MWA.ExchRate
ELSE NULL
END) WWCOGS_GAUSS,
(
CASE
WHEN MWA.COGSSourceCF LIKE '%SAP%'
THEN MWA.COGSPriceCF * MWA.ExchRate
ELSE NULL
END) WWCOGS_SAP,
(
CASE
WHEN MWA.COGSSourceCF NOT LIKE '%Gauss%'
AND MWA.COGSSourceCF NOT LIKE '%SAP%'
THEN MWA.COGSPriceCF * MWA.ExchRate
ELSE NULL
END) WWCOGS_BusLine,
mwa.year,
IRC.CURRID
from BAM.M_WWCOGS_AREA MWA,
MICSTAG.M_IBP_REPUNIT_CURRENCY IRC,
micstag.M_IBP_BDREPORTINGCOUNTRY IBC
,MICSTAG.M_LOCALPRODUCT_STAG LS
WHERE MWA.ReportingUnit=IRC.REPID
--and mwa.productname='FR21030390085'
--and MWA.GaussCountry='BE BELGIUM'
AND IBC.COUNTRYPLANNINGGROUP =MWA.GaussCountry
and IBC.businessdivision=31
and MWA.COGSPriceCF <>0
and ls.REPORTINGUNITID(+)=mwa.ReportingUnit
and ls.IBPLOCALPRDID(+)= mwa.ProductName ) CP, micstag.M_IBP_PRODUCT IP
where CP.PRDCODE=IP.PRDID) RP
CONNECT BY level <= 12 ;
the above query is getting unwanted duplicate, if i use distinct the query is running forever.
req. duplicate the records based on year in result set of rp
consider value of year is 2019 than 12 records should came from 1-jan-2019 to 1-dec-2019.
more than one value of year are possible
The CONNECT BY LEVEL <= 12 trick works nicely with dual because that table returns one row. Things are messier when they base result set returns more than one row, because the CONNECT BY generates a product. This is why you're getting duplicates.
What you need to do is specify some additional criteria for the connection. Ideally you will have a primary key in the projection - I'm going to assume it's REPID, so if it's something different you'll need to tweak this. Anyway, you'll need something like this:
) RP
CONNECT BY level <= 12
and rp.repid = prior rp.repid
and prior sys_guid() is not null
The prior sys_guid() bit prevents ORA-01436: CONNECT BY loop in user data.

Function returning Last record

I don't often use ORACLE PL/SQL by the way but i need to understand what if anything in this function created by someone else
in the company before me is wrong as for it is not returning the latest record i've been told. I found out in some other forum issues that they
suggested to use the max(dateColumn) instead of "row_numer = 1" for example but not quite sure how to and where to incorporate that.
-- Knowing that --
We use Oracle version 12,
CustomObjectTypeA is an custom Oracle OBJECT TYPE defined by some old employee not longer in here,
V_OtherView is of Table_Mnd type beeing defined by some old employee not longer in here,
V_ABC_123 is a view created by some old employee not longer in here as well.
CREATE OR REPLACE FUNCTION F_TABLE_APPROVED (NUMBER_F_UPD number, NUMBER_F_GET VARcHAR2)
RETURN Table_Mnd
IS
V_OtherView Table_Mnd
BEGIN
SELECT CustomObjectTypeA (FromT.NUMBER_F,
FromT.OP_CODE,
FromT.CATG_CODE,
FromT.CATG_NAME,
FromT.CATG_SORT,
FromT.ORG_CODE,
FromT.ORG_NAME
FromT.DATA_ENTRY_VALID,
FromT.NUMBER_RECEIVED,
FromT.YEAR_1,
FromT.YEAR_2)
BULK COLLECT INTO V_OtherView
FROM (SELECT NUMBER_F,
OP_CODE,
CATG_CODE,
CATG_NAME,
CATG_SORT,
ORG_CODE,
ORG_NAME
DATA_ENTRY_VALID,
NUMBER_RECEIVED,
YEAR_1,
YEAR_2,
ROW_NUMBER() OVER (PARTITION BY BY ORG_CODE ORDER BY NUMBER_RECEIVED DESC, LOAD_DATE DESC) AS ROW_NUMBER
FROM V_ABC_123
WHERE NUMBER_F = NUMBER_F_UPD AND DATA_ENTRY_VALID <> 'OnGoing'
AND LOAD_DATE >= (SELECT sysdate-10 FROM dual)
AND LOAD_DATE <= (SELECT DISTINCT LOAD_DATE
FROM V_ABC_123
WHERE NUMBER_RECEIVED = NUMBER_F_GET)) FromT
WHERE FromT.ROW_NUMBER=1;
RETURN V_OtherView;
END F_TABLE_APPROVED;
The important bits of the query are:
SELECT ...
FROM (select ...,
ROW_NUMBER()
OVER (PARTITION BY ORG_CODE
ORDER BY NUMBER_RECEIVED DESC,
LOAD_DATE DESC) AS ROW_NUMBER
...) FromT
WHERE FromT.ROW_NUMBER = 1;
The "ROW_NUMBER" column is computed according to the following window clause:
PARTITION BY ORG_CODE
ORDER BY NUMBER_RECEIVED DESC, LOAD_DATE DESC
Which means that for each ORG_CODE, it will sort all the records by NUMBER_RECEVED,LOAD_DATE in descending order. Note that if the columns are Oracle DATEs, they will only be accurate to the nearest second; so if there are multiple records with date/times in the exact same 1-second interval, this sort order will not be guaranteed unique. The logic of ROW_NUMBER will therefore pick one of them arbitrarily (i.e. whichever record happens to be emitted first) and assign it the value "1", and this will be deemed the "latest". Subsequent executions of the same SQL could (in theory) return a different record.
The suspicious part is NUMBER_RECEIVED which sounds like it's a number, not a date? Sorting by this means that the records with the highest NUMBER_RECEIVED will be preferred. Was this intentional?
I'm not sure why the PARTITION is there, this would cause the query to return one "latest" record for each value of ORG_CODE that it finds. I can only assume this was intentional.
The problem is that the query can only determine the "latest record" as well as it can based on the data provided to it. In this case, it's possible the data is simply not granular enough to be able to decide which record is the actual "latest" record.

Group by subquery

I have following query
Select id, name, add1 || ' ' ||add2 address,
case
when subId =1 then 'Maths'
else 'Science'
End,
nvl(col1, col2) sampleCol
From Student_tbl
Where department = 'Student'
I want to group by this query by address as default
I tried
Group by add1 ,add2 ,id, name, subId, col1, col2
and
Group by add1 || ' ' ||add2,id, name,
case
when subId =1 then 'Maths'
else 'Science'
End,
nvl(col1, col2)
Both group by returns same result. I am unsure which query is right.
Anybody help me on this?
Always try to implement all the columns (with same format) that you mentioned in the select statement in "Group By" except aggregated columns. In your case I would prefer the second approach.
SELECT id
,NAME
,add1 || ' ' || add2 address
,CASE
WHEN subId = 1
THEN 'Maths'
ELSE 'Science'
END
,nvl(col1, col2) sampleCol
FROM Student_tbl
WHERE department = 'Student'
GROUP BY id
,NAME
,add1 || ' ' || add2
,CASE
WHEN subId = 1
THEN 'Maths'
ELSE 'Science'
END
,nvl(col1, col2)
I can't see any aggregated columns in your Select. If your select does not require aggregation then you can simply get rid off group by. You can implement distinct in case of any duplicate records in your result set.
The two queries will not necessarily give the same result. Which is correct depends on your requirement. Here is an example, using just the new ADDRESS column which you get by aggregating the input columns ADD1 and ADD2.
Suppose in one row you have ADD1 = 123 Main Street, Portland, and ADD2 = Oregon. Then in the output, ADDRESS = 123 Main Street, Portland, Oregon
In another row you have ADD1 = 123 Main Street, and ADD2 = Portland, Oregon. For this row, the resulting ADDRESS is the same.
If you group by ADDRESS the two output rows will land in the same group, but if you group by ADD1, ADD2 they will be in different groups. In this example it is likely you want to group by ADDRESS, but in other, similar-structure cases that wouldn't be what you want or need.
After your last comment I THINK to finally understand what you are expecting from us.
Both queries are correct basically because for both approaches you've listed in your SELECT clause all the fields present in your GROUP BY clause
What you are doing is a bit strange here because the GROUP BY contains the id, I guess this is the unique identifier of each row so you are finally not grouping anything. You'll get as much as rows as your table contains.
The reason why it returns the same reults is purely data based. There might be scenarios where the 2 queries returns different results.
In your case, if it returns the same results, it would mean that col1 is never NULL

HIVE equivalent of FIRST and LAST

I have a table with 3 columns:
table1: ID, CODE, RESULT, RESULT2, RESULT3
I have this SAS code:
data table1
set table1;
BY ID, CODE;
IF FIRST.CODE and RESULT='A' THEN OUTPUT;
ELSE IF LAST.CODE and RESULT NE 'A' THEN OUTPUT;
RUN;
So we are grouping the data by ID and CODE, and then writing to the dataset if certain conditions are met. I want to write a hive query to replicate this. This is what I have:
proc sql;
create table temp as
select *, row_number() over (partition by ID, CODE) as rowNum
from table1;
create table temp2 as
select a.ID, a.CODE, a.RESULT, a.RESULT2, a.RESULT3
from temp a
inner join (select ID, CODE, max(rowNum) as maxRowNum
from temp
group by ID, CODE) b
on a.ID=b.ID and a.CODE=b.CODE
where (a.rowNum=1 and a.RESULT='A') or (a.rowNum=b.maxRowNum and a.RESULT NE 'A');
quit;
There are two issues I see with this.
1) The row that is first or last in each BY group is entirely dependant on the order of rows in table1 in SAS, we aren't ordering by anything. I don't think row order is preserved when translating to a hive query.
2) The SAS code is taking the first row in each BY GROUP or the last, not both. I think that my HIVE query is taking both, resulting in more rows than I want.
Any suggestions or insight on how to improve my query is appreciated. Is it even possible to replicate this SAS code in HIVE?
The SAS code has a by statement (BY ID CODE;), which tells SAS that the set dataset is sorted at those levels. So, not a random selection for first. and last..
That said, we can replicate this in HIVE by using the first_value and last_value window functions.
FIRST.CODE should replicate to
first_value(code) over (partition by Id order by code)fcode
Similarly, LAST.CODE would be
last_value(code) over (partition by Id order by code)lcode
Once you have the fcode and lcode columns, use case when statements for the result column criteria. Like,
case when (code=fcode and result='A') or (code=lcode and result<>'A')
then 1 else 0 end as op_flag
Then the fetch the table with where op_flag = 1
SAMPLE
select id, code, result from (
select *,
first_value(code) over (partition by id order by code)fcode,
last_value(code) over (partition by id order by code)lcode
from footab) f
where (code=fcode and result='A') or (code=lcode and result<>'A')
Regarding point 1) the BY group processing requires the input data to be sorted or indexed on BY variables, so though the code contains no ordering, the source data is processed in order. If the input data was not indexed/sorted, SAS will throw error.
Regarding this, possible differences are on rows with same values of BY variables, especially if the RESULT is different.
In SAS, I would pre-sort data by ID, CODE, RESULT, then use BY ID CODE in order to not be influenced by order of rows.
Regarding 2) FIRST and LAST can be both true in SAS. Since your condition for first and last on RESULT is different, I guess this is not a source of differences.
I guess you could add another field as
row_number() over (partition by ID, CODE desc) as rowNumDesc
to detect last row with rowNumDesc = 1 (so that you skip the join).
EDIT:
I think the two programs above both include random selection of rows for groups with same values of ID and CODE variables, especially with same values of RESULT. But you should get same number of rows from both. If not, just debug it.
However the random aspect in SAS code/storage is based on physical order of rows, while the ROW_NUMBERs randomness within a group will be influenced by the implementation of the function in the engine.

Return Boolean value when table has data in the specified range

I need a query to return boolean when there's table has data in the given range.
Assume table
Customer
[User ID, Name, Date, Products_Purchased]
I'm trying to do:
select case when exists(
select Date, count(*)
from Customer
where date between '2015-08-03' and '2015-08-05'
)
then cast(1 as BIT)
else case(0 as BIT)end;
This is throwing an error near "select Date".
However, weird part is the inner query is running perfectly fine.
Im wondering if im missing out something here !
What about something more straightforward e.g.
select case when count(*) >0 then 1 else 0 end as HIT
from ... where ...
That way you don't have to bother about Hive assuming that EXISTS implies a correlated sub-query, automagically translated into a MapJoin, i.e. a Java HashMap shuffled to the 2nd line of Mappers jobs, etc. Not exactly your use case.
Then it's not useful to compute the exact count, so the query could be refined as
select case when count(*) >0 then 1 else 0 end as HIT
from
(select ... from ... where ... limit 1) X
[Edit] There is no "bit" datatype in Hive. But the default "int" should be OK if you just want a return flag (zero / non-zero)

Resources