Oracle MERGE INTO with text similarity - oracle

I have two tables, from_country and to_country. I want to insert new records into to_country and update its existing records from from_country.
Definition and data
--
CREATE TABLE from_country
(
country_code varchar2(255) not null
);
--
CREATE TABLE to_country
(
country_code varchar2(255) not null
);
-- Meaning match
INSERT INTO from_country
(country_code)
VALUES
('United States of America');
-- Match 100%
INSERT INTO from_country
(country_code)
VALUES
('UGANDA');
-- Meaning match, but with domain knowledge
INSERT INTO from_country
(country_code)
VALUES
('CON CORRECT');
-- Brand new country
INSERT INTO from_country
(country_code)
VALUES
('NEW');
--
INSERT INTO to_country
(country_code)
VALUES
('USA');
-- Match 100%
INSERT INTO to_country
(country_code)
VALUES
('UGANDA');
-- Meaning match, but with domain knowledge
INSERT INTO to_country
(country_code)
VALUES
('CON');
I need to run MERGE INTO to bring the data from from_country into to_country.
Here is my first attempt, but it only matches on equality, which is not good enough. I need something smarter that can match on meaning.
If anyone knows how to do this, please provide your solution.
merge into
to_country to_t
using
from_country from_t
on
(to_t.country_code = from_t.country_code)
when not matched then insert (
country_code
)
values (
from_t.country_code
);
So in a nutshell, here is what I want
from_table:
United States of America
UGANDA
CON CORRECT
NEW
to_table:
USA
UGANDA
CON
After oracle merge into
the new to_country table:
United States of America
UGANDA
CON CORRECT
NEW
sql fiddle: http://sqlfiddle.com/#!4/f512d
Please note that this is my simplified example. I have larger data set.

Since the match is not guaranteed unique, you have to write a query that will return only one match using some decision.
Here is a simplified case which uses a naive match and then just picks one value when there is more than one match:
merge into to_country t
using (
select * from (
select t.rowid as trowid
,f.country_code as fcode
,t.country_code as tcode
,case when t.country_code is null then 1 else
row_number()
over (partition by t.country_code
order by f.country_code)
end as match_no
from from_country f
left join to_country t
on f.country_code like t.country_code || '%'
) where match_no = 1
) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
Result in to_country:
USA
UGANDA
CON CORRECT
United States of America
Now that that's taken care of, you just need to make the match algorithm smarter. This is where you need to look at the whole dataset to see what sort of errors there are - i.e. typos, etc.
You could try some of the functions in Oracle's supplied UTL_MATCH package for this purpose: https://docs.oracle.com/cd/E18283_01/appdev.112/e16760/u_match.htm - such as EDIT_DISTANCE or JARO_WINKLER.
Here is an example using the Jaro Winkler algorithm:
merge into to_country t
using (
select * from (
select t.rowid as trowid
,f.country_code as fcode
,t.country_code as tcode
,case when t.country_code is null then 1
else row_number() over (
partition by t.country_code
order by utl_match.jaro_winkler_similarity(f.country_code,t.country_code) desc)
end as match_no
from from_country f
left join to_country t
on utl_match.jaro_winkler_similarity(f.country_code,t.country_code) > 70
) where match_no = 1
) s
on (s.trowid = t.rowid)
when matched then update set country_code = s.fcode
when not matched then insert (country_code) values (s.fcode);
SQL Fiddle: http://sqlfiddle.com/#!4/f512d/23
Note that I've picked a cutoff of > 70. The value is somewhat arbitrary: it is set just high enough to exclude UGANDA vs. USA, which has a Jaro Winkler similarity of 70.
This results in the following:
United States of America
USA
UGANDA
CON NEW
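With a fuzzy join, more than one source row can clear the cutoff for the same target row, and the merge keeps only the best-scoring one. In this data both CON NEW and CON CORRECT match CON, so CON CORRECT is silently dropped. A minimal sketch of a check that lists such dropped candidates for review, assuming the same tables and the same > 70 cutoff as the merge above:

```sql
-- List source rows that clear the cutoff but lose to a better-scoring
-- candidate for the same target row (match_no > 1).
select fcode, tcode, jw
from (
  select f.country_code as fcode
        ,t.country_code as tcode
        ,utl_match.jaro_winkler_similarity(f.country_code, t.country_code) as jw
        ,row_number() over (
           partition by t.country_code
           order by utl_match.jaro_winkler_similarity(f.country_code, t.country_code) desc
         ) as match_no
  from from_country f
  join to_country t
    on utl_match.jaro_winkler_similarity(f.country_code, t.country_code) > 70
)
where match_no > 1
order by tcode, jw desc;
```

Rows returned here are neither merged nor inserted, so they need a manual decision or a stricter matching rule.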
To see how these algorithms fare, run something like this:
select f.country_code as fcode
,t.country_code as tcode
,utl_match.edit_distance_similarity(f.country_code,t.country_code) as ed
,utl_match.jaro_winkler_similarity(f.country_code,t.country_code) as jw
from from_country f
cross join to_country t
order by 2, 4 desc;
FCODE                    TCODE   ED  JW
======================== ====== === ===
CON NEW                  CON     43  86
CON CORRECT              CON     28  83
UGANDA                   CON     17  50
United States of America CON      0   0
UGANDA                   UGANDA 100 100
United States of America UGANDA   9  46
CON NEW                  UGANDA  15  43
CON CORRECT              UGANDA   0  41
UGANDA                   USA     34  70
United States of America USA     13  62
CON CORRECT              USA      0   0
CON NEW                  USA      0   0
SQL Fiddle: http://sqlfiddle.com/#!4/f512d/22

Related

Find sum of economy, region

The table name is world:
country Economy
USA 2000
CHINA 1500
INDIA 1600
DUBAI 1000
Nepal 500
Pakistan 700
Show an Oracle query that retrieves this output from the table:
output: Region sum(economy)
USA 2000
Asia 5300
You're missing a table which says which country belongs to which region:
create table region
(id_region number constraint pk_reg primary key,
name varchar2(20)
);
create table country
(id_country number constraint pk_cou primary key,
id_region number constraint fk_coureg references region (id_region),
name varchar2(20),
economy number -- needed by the query below
);
Then you'd run:
select r.name as region,
sum(c.economy) as sum_economy
from region r join country c on c.id_region = r.id_region
group by r.name;
If you insist on doing it wrong & hardcode regions, here you are:
select case when country = 'USA' then 'USA'
else 'Asia'
end as region,
sum(economy) as sum_economy
from your_table
group by
case when country = 'USA' then 'USA'
else 'Asia'
end;
Note that this "solution" is simply wrong and I suggest you do it properly, as previously described.
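If the existing world table has to stay as it is, a small lookup table gives the same result without hardcoding regions in the query. The sketch below is only an illustration: country_region is a hypothetical table name, world is assumed to have the columns country and economy, and the country names must match the spelling and case stored in world. The region assignments follow the expected output in the question.

```sql
-- Hypothetical lookup table: one row per country, naming its region
create table country_region
(country varchar2(20) constraint pk_cr primary key,
 region  varchar2(20) not null
);

insert into country_region (country, region) values ('USA', 'USA');
insert into country_region (country, region) values ('CHINA', 'Asia');
insert into country_region (country, region) values ('INDIA', 'Asia');
insert into country_region (country, region) values ('DUBAI', 'Asia');
insert into country_region (country, region) values ('Nepal', 'Asia');
insert into country_region (country, region) values ('Pakistan', 'Asia');

-- Sum the economy per region by joining on the country name
select cr.region,
       sum(w.economy) as sum_economy
from world w
join country_region cr on cr.country = w.country
group by cr.region
order by cr.region desc;
```

Adding a new country then means inserting one row into the lookup table instead of editing the CASE expression.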

Select same column for different values on a different column

I did search the forum before posting this and found some topics which were close to the same issue but I still had questions so am posting it here.
EMP_ID SEQ_NR NAME
874830 3 JOHN
874830 4 JOE
874830 21 MIKE
874830 22 BILL
874830 23 ROBERT
874830 24 STEVE
874830 25 JERRY
My output should look like this.
EMP_ID SEQ3NAME SEQ4NAME SEQ21NAME SEQ22NAME SEQ23NAME SEQ24NAME SEQ25NAME
874830 JOHN JOE MIKE BILL ROBERT STEVE JERRY
SELECT A.EMP_ID
,A.NAME SEQ3NAME
,B.NAME SEQ4NAME
FROM AC_XXXX_CONTACT A
INNER JOIN AC_XXXX_CONTACT B ON A.EMP_ID = B.EMP_ID
WHERE A.SEQ_NR = '03' AND B.SEQ_NR = '04'
AND B.EMP_ID = '874830';
The above query helped me get the below results.
EMP_ID SEQ3NAME SEQ4NAME
874830 JOHN JOE
My question is: to get all the fields (i.e. up to seq nr = 25), do I have to join the table 5 more times?
Is there a better way to get the results?
I'm querying against an Oracle DB.
Thanks for your help.
New Requirement
New Input
STU-ID SEM CRS-NBR
12345 1 100
12345 1 110
12345 2 200
New Output
stu-id crs1 crs2
12345 100 200
12345 110
Not tested since you didn't provide test data (from table AC_XXXX):
(using Oracle 11 PIVOT clause)
select *
from ( select emp_id, seq_nr, name
from ac_xxxx
where emp_id = '874830' )
pivot ( max(name) for seq_nr in (3 as seq3name, 4 as seq4name, 21 as seq21name,
22 as seq22name, 23 as seq23name, 24 as seq24name, 25 as seq25name)
)
;
For Oracle 10 or earlier, pivoting was done "by hand", like so:
select max(emp_id) as emp_id, -- Corrected based on comment from OP
max(case when seq_nr = 3 then name end) as seq3name,
max(case when seq_nr = 4 then name end) as seq4name,
-- etc. (similar expressions for the other seq_nr)
from ac_xxxx
where emp_id = '874830'
;
Or, emp_id doesn't need to be within max() if we add group by emp_id - which then will work even without the WHERE clause, for a different but related question.
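The GROUP BY variant mentioned above could look like the sketch below. It assumes the same ac_xxxx table and column names used in this answer. Because there is no WHERE clause, it returns one row per emp_id.

```sql
-- Manual pivot: one output row per emp_id, one column per seq_nr of interest
select emp_id,
       max(case when seq_nr = 3  then name end) as seq3name,
       max(case when seq_nr = 4  then name end) as seq4name,
       max(case when seq_nr = 21 then name end) as seq21name,
       max(case when seq_nr = 22 then name end) as seq22name,
       max(case when seq_nr = 23 then name end) as seq23name,
       max(case when seq_nr = 24 then name end) as seq24name,
       max(case when seq_nr = 25 then name end) as seq25name
from ac_xxxx
group by emp_id
order by emp_id;
```

Each CASE expression yields the name only for its own seq_nr and NULL otherwise, so MAX collapses the group to the single non-null value.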

Logic for change in one column value in pl/sql

I have an assignments table asg_tab with effective start date and effective end date columns, which track the dates on which each change was made.
asg_tab
eff_start_date eff_End_date PER_ASG_ATTRIBUTE2 job name pos name
01-Jan-2015 03-feb-2015 Ck Bonus Retail Mgr
04-Feb-2015 20-Feb-2015 UK Bonus Sales Mgr
21-Feb-2015 28-Nov-2015 UK Bonus Sales Snr. Mgr
Now I have to calculate the number of days for which PER_ASG_ATTRIBUTE2 is UK Bonus. For example in the above case it will be days between 04-Feb-2015 to 28-Nov-2015.
I have written the logic below, which is fetching values from cursor.
cursor cur_asg
is
select
eff_start_date,
eff_End_date,
PER_ASG_ATTRIBUTE2,
job_name,
pos_name
from
asg_tab
Logic I have built :
START_DT ='01-Jan-2015'
END_DT ='31-Dec-2015'
IF PER_ASG_ATTRIBUTE2 LIKE '%UK Bonus%' THEN
(
l_new_ATTR = PER_ASG_ATTRIBUTE2
l_effective_date = i.eff_start_date
IF (l_new_ATTR <> l_old_ATTR) AND (l_effective_date >= START_DT ) AND (l_effective_date =< END_DT)
THEN
(
l_days=eff_end_date -eff_start_date
)
l_old_ATTR = l_new_ATTR
)
The problem comes from this condition: IF (l_new_ATTR <> l_old_ATTR) AND (l_effective_date >= START_DT ) AND (l_effective_date =< END_DT)
It picks up the 2nd row, where PER_ASG_ATTRIBUTE2 changed from Ck Bonus to UK Bonus, but the 3rd row was generated only because pos name changed.
Because PER_ASG_ATTRIBUTE2 is still UK Bonus in that row, it does not pass the IF condition and its days are not counted.
What more can I add to this condition?
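One possible direction, offered only as a sketch: since every row whose PER_ASG_ATTRIBUTE2 is UK Bonus should contribute its days, the change check (l_new_ATTR <> l_old_ATTR) may not be needed at all, and the total can be computed in a single query. The sketch assumes that both eff_start_date and eff_end_date are inclusive (hence the + 1), that the rows for an assignment do not overlap, and that the reporting period is the START_DT/END_DT range shown above.

```sql
-- Sum the days of every UK Bonus row, clipped to the reporting period
select sum(  least(eff_end_date,      date '2015-12-31')
           - greatest(eff_start_date, date '2015-01-01')
           + 1) as uk_bonus_days
from asg_tab
where per_asg_attribute2 like '%UK Bonus%'
  and eff_start_date <= date '2015-12-31'   -- row overlaps the period
  and eff_end_date   >= date '2015-01-01';
```

For the three sample rows this adds 17 days (04-Feb to 20-Feb) and 281 days (21-Feb to 28-Nov), i.e. 298 days, the same as counting 04-Feb-2015 to 28-Nov-2015 with both endpoints included.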

Oracle count distinct record within subquery

I have 3 tables
SUBJECTS
CODE, SUBJECT_NAME , SESSION
100, MATHS , AM
101, MATHS - INTRO , AM
102, MATHS - ADVANCED , AM
200, ENGLISH , AM
201, ENGLISH - INTRO , AM
202, ENGLISH - BEGINNER, AM
203, ENGLISH - ADVANCED, AM
STUDENTS_SUBJECTS
ID, SUBJECT_CODE
2, 101
2, 102
1, 201
1, 203
3, 101
3, 102
STUDENTS
ID,PARENT_ID, STUDENT_NAME, CLASS_LEADER, INACTIVE, EXPERT
1 , 2 , ELSA , no , N , N
2 , 4 , STEVE , no , N , N
3 , 5 , MIKE , no , N , N
My query goes like
SELECT t1.CODE,
t1.SUBJECT_NAME,
SUM (CASE WHEN ( (t2.CLASS_LEADER = 'no'
OR t2.CLASS_LEADER IS NULL)
AND t2.EXPERT IS NULL)
THEN 1 ELSE 0 END) AS "Average Student"
FROM subjects t1
LEFT OUTER JOIN (
select a.STUDENT_ID, a.PARENT_ID, a.STUDENT_NAME,
a.CLASS_LEADER, c.SUBJECT_CODE, a.INACTIVE, a.EXPERT
FROM students a
INNER JOIN students_subjects c
ON (a.STUDENT_ID = c.ID )
where (INACTIVE is null)
GROUP BY a.STUDENT_ID, a.PARENT_ID, a.STUDENT_NAME, a.CLASS_LEADER, c.SUBJECT_CODE, a.INACTIVE, a.EXPERT
) t2
ON substr(trim(t2.SUBJECT_CODE),1,2)= substr(trim(t1.CODE),1,2)
WHERE (t1.SESSION='AM')
GROUP BY t1.CODE, T1.SUBJECT_NAME
ORDER BY T1.CODE
What I would like to get is the number of students who signed up for a morning-session class under each major subject, without duplicates. For example, a student who signed up for both Maths - Intro and Maths - Advanced should be counted only once under the Maths subject.
If I run the subquery separately, without subject_code in the select list and the group by, I get the correct value. However, I'm not sure how to return the correct value once it is joined in the query.
REPORT
CODE, SUBJECT_NAME, AVERAGE_STUDENT
100 MATHS 2
200 ENGLISH 1
Thank you.
First, some recommendations:
1) add a column MAIN_SUBJECT_CODE to the table SUBJECTS (as already commented)
2) the column ID in the table STUDENTS_SUBJECTS is a foreign key pointing to the table STUDENTS, so a better name would be STUDENT_ID
3) use a single convention to store Boolean values; do not mix 'no' and 'N'
First the query of all student subscriptions
Note that I added the missing column main_subject_code and adjusted the average student definition to get some result.
SELECT su.CODE,
substr(trim(su.CODE),1,2)||'0' main_subject_code,
su.SUBJECT_NAME,
st.STUDENT_NAME,
CASE WHEN ( (st.CLASS_LEADER = 'no'
OR st.CLASS_LEADER IS NULL)
AND st.EXPERT = 'N' /*IS NULL*/)
THEN 1 ELSE 0 END AS "Average Student"
FROM subjects su
INNER JOIN students_subjects ss
ON su.code = ss.SUBJECT_CODE
INNER JOIN STUDENTS st
ON ss.ID /* STUDENT_ID */ = st.ID
;
CODE MAIN_SUBJECT_CODE SUBJECT_NAME STUDENT_NAME Average Student
101 100 MATHS - INTRO MIKE 1
101 100 MATHS - INTRO STEVE 1
102 100 MATHS - ADVANCED MIKE 1
102 100 MATHS - ADVANCED STEVE 1
201 200 ENGLISH - INTRO ELSA 1
203 200 ENGLISH - ADVANCED ELSA 1
The rest is simple - group on main subject and add the title of it
with subsr as (
SELECT su.CODE,
substr(trim(su.CODE),1,2)||'0' main_subject_code,
su.SUBJECT_NAME,
st.STUDENT_NAME,
CASE WHEN ( (st.CLASS_LEADER = 'no'
OR st.CLASS_LEADER IS NULL)
AND st.EXPERT = 'N' /*IS NULL*/)
THEN 1 ELSE 0 END AS "Average Student"
FROM subjects su
INNER JOIN students_subjects ss
ON su.code = ss.SUBJECT_CODE
INNER JOIN STUDENTS st
ON ss.ID /* STUDENT_ID */ = st.ID
)
select
main_subject_code,
(select SUBJECT_NAME from SUBJECTS where CODE = main_subject_code) main_subject_name,
sum("Average Student") "Average Student"
from subsr
group by main_subject_code
order by main_subject_code;
MAIN_SUBJECT_CODE MAIN_SUBJECT_NAME Average Student
----------------- ------------------------- ---------------
100 MATHS 4
200 ENGLISH 2
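The totals above count one row per subscription, so a student enrolled in two sub-subjects of the same main subject is counted twice. If each student should be counted once per main subject, as the question asks, a COUNT(DISTINCT ...) over the student id is one way to do it. This is a sketch built on the same joins and the same adjusted "average student" definition as above. Like the queries above, it does not filter on the session column.

```sql
-- Count each qualifying student once per main subject
select substr(trim(su.code),1,2)||'0' as main_subject_code,
       count(distinct case when (st.class_leader = 'no'
                                 or st.class_leader is null)
                            and st.expert = 'N'
                           then st.id end) as "Average Student"
from subjects su
inner join students_subjects ss
        on su.code = ss.subject_code
inner join students st
        on ss.id /* STUDENT_ID */ = st.id
group by substr(trim(su.code),1,2)||'0'
order by main_subject_code;
```

The CASE expression returns the student id only for qualifying students and NULL otherwise. COUNT(DISTINCT ...) ignores the NULLs and collapses repeated ids, which gives 2 for code 100 and 1 for code 200 on the sample rows.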
Your posted query contains a lot of extraneous logic which doesn't seem relevant to your apparent task. So I'm ignoring it and focusing on simply getting "the number of students who signed up for the class for morning session under each major subject without the duplicates".
select major
, count(*)
from (
select distinct subj.major
, ss.id as student_id
from
( select code,
regexp_replace(subject_name, '^([A-Z]+)(.*)', '\1') major
from subjects
where session = 'AM'
) subj
join students_subjects ss
on ss.subject_code = subj.code
)
group by major
order by major
/
The subquery on SUBJECTS uses a regex function to extract the leading element of the subject name as the major. It works for the posted sample data but might fail for more complicated names. Regex shouldn't be necessary: a proper data model would separate the MAJOR subject from its subsidiaries.

Can I do an insert in the update branch of MERGE (implementing SCD type 2)

I have a source table and a target table. I want to do a merge such that every change results in an insert into the target table. Each current record should carry a flag set to 'Y'. When something in a record changes, the flag of the existing row should be changed to 'N' and a new row for that record should be inserted into the target, so that the updated information is reflected. Basically I want to implement SCD type 2. My input data is:
student_id name city state mobile
1 suraj bhopal m.p. 9874561230
2 ravi pune mh 9874563210
3 amit patna bihar 9632587410
4 rao banglore kr 9236547890
5 neel chennai tn 8301456987
and when my input changes:
student_id name city state mobile
1 suraj indore m.p. 9874561230
And my output should be like-
surr_key student_id name  city   state mobile     insert_Date end_date   Flag
1        1          suraj bhopal m.p.  9874561230 31/06/2015  1/09/2015  N
2        1          suraj indore m.p.  9874561230 2/09/2015   31/12/9999 Y
Can anyone help me how can I do that?
You can do this with a trigger. You can create a BEFORE INSERT trigger on your target table which will update the flag column of your source table.
Or you can have an AFTER UPDATE trigger on the source table which will insert a record into your target table.
Hope this helps
Regards,
So this should be the outline of your procedure steps. I used different columns in source and target for simplification.
Source (tu_student) - STUDENT_ID, NAME, CITY
Target (tu_student_tgt) - SKEY, STUDENT_ID, NAME, CITY, INSERT_DATE, END_DATE, IS_ACTIVE
The basic idea here is:
1) Find the new records in the source which are missing from the target and insert them. Set insert_date to sysdate, end_date to 31-DEC-9999 and is_active to 1.
2) Find the records which were updated (like your Bhopal -> Indore case). Each of them needs 2 operations in the target:
a) Update the existing record in the target: set end_date to sysdate and is_active to 0.
b) Insert a new record with the new values into the target. Set insert_date to sysdate, end_date to 31-DEC-9999 and is_active to 1.
-- Create a new Oracle sequence first (test_utsav_seq in this example)
---Step 1 - Find new inserts (records present in source but not in target)
insert into tu_student_tgt
(skey, student_id, name, city, insert_date, end_date, is_active)
select
test_utsav_seq.nextval,
s.student_id,
s.name,
s.city,
sysdate,
date '9999-12-31', -- date literal avoids an NLS-dependent string conversion
1
from tu_student s
left outer join
tu_student_tgt t
on s.student_id=t.student_id
where t.student_id is null;
----Step 2 - Find the skeys which need to be updated because the data differs between source and target. Get the active records from the target and compare them with the source data. If a mismatch is found, we need to
-- a update this record in the target and mark it as inactive.
-- b insert a new record for the same student_id with the new data and mark it active.
-- part 2a - find updates.
--these records need an update. Save these skeys and use them one by one while updating.
select t.skey
from tu_student s inner join
tu_student_tgt t
on s.student_id=t.student_id
where t.is_active = 1 and
(s.name!=t.name or
s.city!=t.city);
--2 b ) Find the ids which need to be inserted because they changed in the source. Run this before the update in 2a and save the ids: once the above records are marked inactive, the is_active = 1 filter no longer finds them.
select s.student_id
from tu_student s inner join
tu_student_tgt t
on s.student_id=t.student_id
where t.is_active = 1 and
(s.name!=t.name or
s.city!=t.city);
---2a - Implement update
-- Now use the skeys from step 2a in a loop and run an update statement like the one below. Bind :skey to each key which needs to be updated.
-- Only end_date and is_active change; the old name and city stay as they are so that the history is preserved.
update tu_student_tgt t
set t.end_date = sysdate
, t.is_active = 0
where t.skey = :skey; -- id from 2a step
---2b Implement insert, using the student_ids found in step 2b
--Insert these student ids like in step 1
insert into tu_student_tgt
(skey, student_id, name, city, insert_date, end_date, is_active)
select
test_utsav_seq.nextval,
s.student_id,
s.name,
s.city,
sysdate,
date '9999-12-31',
1
from tu_student s
where s.student_id = :student_id; -- ID from 2b step - repeat for the other ids
I cannot give you a simple example of SCD-2. If you understand SCD-2, you should understand this implementation.
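The same steps can also be written as two set-based statements instead of a loop over saved keys. The sketch below is one possible arrangement, not the only one. It assumes the tu_student and tu_student_tgt tables and the test_utsav_seq sequence described above, that name and city are never NULL (the != comparison does not detect changes to or from NULL), and that both statements run in the same transaction.

```sql
-- Step A: close the active version of every student whose data changed
update tu_student_tgt t
   set t.end_date  = sysdate,
       t.is_active = 0
 where t.is_active = 1
   and exists (select 1
                 from tu_student s
                where s.student_id = t.student_id
                  and (s.name != t.name or s.city != t.city));

-- Step B: insert a new active version for every student without one.
-- This covers brand-new students and the students closed in step A.
insert into tu_student_tgt
       (skey, student_id, name, city, insert_date, end_date, is_active)
select test_utsav_seq.nextval,
       s.student_id,
       s.name,
       s.city,
       sysdate,
       date '9999-12-31',
       1
  from tu_student s
 where not exists (select 1
                     from tu_student_tgt t
                    where t.student_id = s.student_id
                      and t.is_active  = 1);

commit;
```

Running the update first is what lets a single insert handle both cases: after step A, a changed student has no active row and so looks the same as a new one.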
