Duplicate removal from a table using Informatica - etl

I have a scenario to be implemented in Informatica where I need to remove duplicate records from a table based on the PK. I need to keep the first occurrence of each PK value and remove the others (in case of duplicate PKs).
For example, if my source has 1,1,1,2,3,3,4,5,4, I want to see my target data as 1,2,3,4,5. I have to read data from the same table and load it into the same table; no new table can be introduced. Please help me with your inputs.
Thanks in Advance!

I suppose you want the first occurrence because there are other (data) columns in addition to the key values you listed. So you want
1,b
1,c
1,a
2,d
3,c
3,d
4,e
5,f
4,b
Turned into
1,b
2,d
3,c
4,e
5,f
??
In that case try this mapping layout:
SRC -> SQ -> SRT -> AGG -> TGT
(with a Sequence Generator, SEQ, also feeding into the SRT)
Where the Sorter is set to sort on KEY, sequence_port (descending),
and the Aggregator is set to group by KEY; the sequence port should not be connected onward out of the Sorter (it is only needed for the sort order).
Hope you can follow me :)
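If it helps to see the intent in SQL terms, here is a rough sketch of what that layout computes - for every KEY, keep the row with the lowest sequence number (i.e. the first occurrence). The table and column names (rows_with_sequence, key_col, data_col, seq_no) are placeholders, not Informatica objects:
SELECT key_col, data_col
FROM (
  SELECT key_col,
         data_col,
         -- seq_no stands in for the value supplied by the Sequence Generator
         ROW_NUMBER() OVER (PARTITION BY key_col ORDER BY seq_no) AS rn
  FROM rows_with_sequence
)
WHERE rn = 1;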

There are multiple ways to do this; the simplest would be to do it in the SQL override.
Assuming the example quoted above, the SQL would be like this. The general idea is to assign a row number per primary key (so if you have 3 rows with the same PK they get row numbers 1, 2, 3 before the numbering resets for the next PK). Note that because the ROW_NUMBER ordering is on the primary key itself, which duplicate survives as row 1 is arbitrary; order by another column if you need a specific one.
SQL:
select * from (
  select primary_key, column2,
         row_number() over (partition by primary_key order by primary_key) as distinct_key
  from source_table  -- your source/target table
) where distinct_key = 1
Before:
1,b
1,c
1,a
2,d
3,c
3,d
Intermediate query:
1,c,1
1,a,2
1,b,3
2,d,1
3,c,1
3,d,2
Output:
1,c
2,d
3,c

I am able to achieve this by following the steps below.
1. Passing sorted data (keys are EMP_ID, MOBILE, DEPTID) to an Expression transformation.
2. Creating the following variable ports in the Expression and getting the counts.
V_CURR_EMP_ID = EMP_ID
V_CURR_MOBILE = MOBILE
V_CURR_DEPTID = DEPTID
V_COUNT =
IIF(V_CURR_EMP_ID=V_PREV_EMP_ID AND V_CURR_MOBILE=V_PREV_MOBILE AND V_CURR_DEPTID=V_PREV_DEPTID ,V_COUNT+1,1)
V_PREV_EMP_ID = EMP_ID
V_PREV_MOBILE = MOBILE
V_PREV_DEPTID = DEPTID
O_COUNT =V_COUNT
3. In the next transformation, which is a Filter, I take only the records that have a count greater than 1 and delete them using an Update Strategy (DD_DELETE).
Here is the mapping flow.
SQ->SRTR->EXP->FIL->UPD->TGT
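For reference, the same "keep one row per EMP_ID/MOBILE/DEPTID and delete the rest" idea sketched in plain SQL - this assumes an Oracle-style ROWID and a hypothetical EMPLOYEES table, and is not the Informatica logic itself:
DELETE FROM employees
WHERE ROWID IN (
  SELECT rid
  FROM (
    SELECT ROWID AS rid,
           -- rn = 1 marks the single row kept per key combination
           ROW_NUMBER() OVER (PARTITION BY emp_id, mobile, deptid ORDER BY emp_id) AS rn
    FROM employees
  )
  WHERE rn > 1
);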
Also, when I tried to delete them using an Aggregator, it deleted only the first occurrence of the duplicates but not all of them.
Thanks again for your inputs!

Related

Sum of only Distinct values in a Column in DAX

I have a table [Table 1] with three columns:
OrganizationName, FieldName, Acres, with data as follows:
organizationname fieldname Acres
ABC |F1 |0.96
ABC |F1 |0.96
ABC |F1 |0.64
I want to calculate the sum of the distinct values of Acres
(e.g. 0.96 + 0.64) in DAX.
One of the problems with doing what you want is that many measures rely on filters and not actual table expressions. So, getting a distinct list of values and then filtering the table by those values just gives you the whole table back.
The iterator functions are handy and operate on table expressions, so try SUMX
TotalDistinctAcreage = SUMX(DISTINCT(Table1[Acres]),[Acres])
This will generate a table that is one column containing only the distinct values for Acres, and then add them up. Note that this is only looking at the Acres column, so if different fields and organizations had the same acreage -- then that acreage would still only be counted once in this sum.
If instead you want to add up the acreage simply on distinct rows, then just make a small change:
TotalAcreageOnDistinctRows = SUMX(DISTINCT(Table1),[Acres])
Hope it helps.
Ok, you added these requirements:
Thank You. :) However, I want to add Distinct values of Acres for a
Particular Fieldname. Is this possible? – Pooja 3 hours ago
The easiest way really is just to go ahead and slice or filter the original measure that I gave you. But if you have to apply the filter context in DAX, you can do it like this:
Measure =
SUMX(
    FILTER(
        SUMMARIZE( Table1, [FieldName], [Acres] ),
        [FieldName] = "<put the name of your specific field here>"
    ),
    [Acres]
)

Can I use FOR ALL ENTRIES with GROUP BY?

Currently the code looks something like this:
LOOP AT lt_orders ASSIGNING <fs_order>.
SELECT COUNT(*) AS cnt
FROM order_items
INTO <fs_order>-cnt
WHERE order_id = <fs_order>-order_id.
ENDLOOP.
It is the slowest part of the report. I want to speed it up.
How can I use FOR ALL ENTRIES with GROUP BY?
Check the documentation. You can't use GROUP BY. Maybe in this case, you could try selecting your items with FAE outside of the loop, then count them using a parallel cursor:
REPORT.

TYPES: BEGIN OF ty_result,
         vbeln TYPE vbeln,
         cnt   TYPE i.
TYPES: END OF ty_result.

DATA: lt_headers TYPE SORTED TABLE OF ty_result WITH UNIQUE KEY vbeln,
      lv_tabix   TYPE sy-tabix VALUE 1.

"get the headers
SELECT vbeln FROM vbak UP TO 100 ROWS INTO CORRESPONDING FIELDS OF TABLE lt_headers.

"get corresponding items
SELECT vbeln, posnr FROM vbap FOR ALL ENTRIES IN @lt_headers
  WHERE vbeln EQ @lt_headers-vbeln
  ORDER BY vbeln, posnr
  INTO TABLE @DATA(lt_items).

LOOP AT lt_headers ASSIGNING FIELD-SYMBOL(<h>).
  LOOP AT lt_items FROM lv_tabix ASSIGNING FIELD-SYMBOL(<i>).
    IF <i>-vbeln NE <h>-vbeln.
      lv_tabix = sy-tabix.
      EXIT.
    ELSE.
      <h>-cnt = <h>-cnt + 1.
    ENDIF.
  ENDLOOP.
ENDLOOP.

BREAK-POINT.
Or join header/item with a distinct count on the item id (whichever column that would be in your table).
You should be able to do something like
SELECT COUNT(order_item_id) AS cnt, order_id
FROM order_items
INTO CORRESPONDING FIELDS OF TABLE lt_count
GROUP BY order_id.
This assumes that order_item_id is a key in the order_items table, and that lt_count has two fields: cnt of type int8 and order_id of the same type as your other order_id fields.
PS: then you can loop over lt_count and move the counts to lt_orders, or the other way around. To speed up the loop, sort one of the tables and use READ ... BINARY SEARCH.
I did this with table KNB1 (customer master at company code level), where we have customers that are created in several company codes.
Please note, because of FOR ALL ENTRIES you have to SELECT the full key.
TYPES: BEGIN OF ty_knb1,
         kunnr TYPE knb1-kunnr,
         count TYPE i,
       END OF ty_knb1.
TYPES: BEGIN OF ty_knb1_fae,
         kunnr TYPE knb1-kunnr,
       END OF ty_knb1_fae.

DATA: lt_knb1_fae TYPE STANDARD TABLE OF ty_knb1_fae.
DATA: lt_knb1     TYPE HASHED TABLE OF ty_knb1
                  WITH UNIQUE KEY kunnr.
DATA: ls_knb1     TYPE ty_knb1.
DATA: ls_knb1_db  TYPE knb1.

START-OF-SELECTION.

  lt_knb1_fae = VALUE #( ( kunnr = ... ) ). "add at least one customer which is created in several company codes

  ls_knb1-count = 1.

  SELECT kunnr bukrs
    INTO CORRESPONDING FIELDS OF ls_knb1_db
    FROM knb1
    FOR ALL ENTRIES IN lt_knb1_fae
    WHERE kunnr EQ lt_knb1_fae-kunnr.
    ls_knb1-kunnr = ls_knb1_db-kunnr.
    COLLECT ls_knb1 INTO lt_knb1.
  ENDSELECT.
Create a range table for your lt_orders, like lt_orders_range.
Then do SELECT order_id, COUNT( * ) ... WHERE order_id IN lt_orders_range GROUP BY order_id (sketched below).
If creating a range table sounds like too much effort, keep in mind that you save a lot of runtime by running just one SELECT for all orders instead of a single SELECT for each order ID.
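A rough plain-SQL sketch of the single statement described above (the ABAP range-table and INTO plumbing are omitted, the column names follow the question, and the literal IDs are placeholders):
SELECT order_id, COUNT( * ) AS cnt
FROM order_items
WHERE order_id IN (4711, 4712, 4713)  -- in ABAP: WHERE order_id IN lt_orders_range
GROUP BY order_id;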
Not directly, only through a CDS view
While all of the answers provide a faster solution than the one in the question, the fastest way is not mentioned.
If you have at least Netweaver 7.4, EHP 5 (and you should, it was released in 2014), you can use CDS views, even if you are not on HANA.
It still cannot be done directly, as OpenSQL does not allow FOR ALL ENTRIES with GROUP BY, and CDS views cannot handle FOR ALL ENTRIES. However, you can create one of each.
CDS:
@AbapCatalog.sqlViewName: 'zorder_i_fae'
DEFINE VIEW zorder_items_fae AS SELECT FROM order_items {
  order_id,
  count( * ) AS cnt
}
GROUP BY order_id
OpenSQL:
SELECT *
  FROM zorder_items_fae
  INTO TABLE @DATA(lt_order_cnt)
  FOR ALL ENTRIES IN @lt_orders
  WHERE order_id = @lt_orders-order_id.
Speed
If lt_orders contains more than about 30% of all possible order_id values from table ORDER_ITEMS, the answer from iPirat is faster. (While using more memory, obviously)
However, if you need only a couple hundred order_id values out of millions, this solution is about 10 times faster than any other answer, and 100 times faster than the original.

Delete duplicate rows from a BigQuery table

I have a table with >1M rows of data and 20+ columns.
Within my table (tableX) I have identified duplicate records (~80k) in one particular column (troubleColumn).
If possible I would like to retain the original table name and remove the duplicate records from my problematic column; otherwise, I could create a new table (tableXfinal) with the same schema but without the duplicates.
I am not proficient in SQL or any other programming language so please excuse my ignorance.
DELETE FROM Accidents.CleanedFilledCombined
WHERE Fixed_Accident_Index IN (
  SELECT Fixed_Accident_Index
  FROM Accidents.CleanedFilledCombined
  GROUP BY Fixed_Accident_Index
  HAVING COUNT(Fixed_Accident_Index) > 1
);
You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).
A query that should work is here:
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1
UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:
https://stackoverflow.com/a/57900778/132438
An alternative to Jordan's answer - this one scales better when there are too many duplicates:
#standardSQL
SELECT event.* FROM (
SELECT ARRAY_AGG(
t ORDER BY t.created_at DESC LIMIT 1
)[OFFSET(0)] event
FROM `githubarchive.month.201706` t
# GROUP BY the id you are de-duplicating by
GROUP BY actor.id
)
Or a shorter version (takes any row, instead of the newest one):
SELECT k.*
FROM (
SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
FROM `fh-bigquery.reddit_comments.2017_01` x
GROUP BY id
)
To de-duplicate rows on an existing table:
CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
FROM `deleting.deduplicating_table` row
GROUP BY id
)
Not sure why nobody mentioned DISTINCT query.
Here is the way to clean duplicate rows:
CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table
If your schema doesn’t have any RECORD (nested) fields - the variation of Jordan’s answer below will work well enough for writing over the same table or a new one, etc.
SELECT <list of original fields>
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) AS pos
  FROM Accidents.CleanedFilledCombined
)
WHERE pos = 1
In the more generic case - with a complex schema containing RECORD/nested fields, etc. - the above approach can be a challenge.
I would propose trying the Tabledata: insertAll API with rows[].insertId set to the respective Fixed_Accident_Index for each row.
In that case duplicate rows will be eliminated by BigQuery.
Of course, this involves some client-side coding - so it might not be relevant for this particular question.
I haven't tried this approach myself either, but I feel it might be interesting to try :o)
If you have a large partitioned table and only have duplicates within a certain partition range, you don't want to scan or process the whole table. Use the MERGE SQL below with predicates on the partition range:
-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrogate_key as unique id
-- -------------------------------------------
DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles");
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");
MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )
) AS INTERNAL_SOURCE
ON FALSE
WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partition range
  THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a
Easier answer, without a subselect
SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY Fixed_Accident_Index) row_number
FROM Accidents.CleanedFilledCombined
WHERE TRUE
QUALIFY row_number = 1
The WHERE TRUE is necessary because QUALIFY requires a WHERE, GROUP BY, or HAVING clause.
Felipe's answer is the best approach for most cases. Here is a more elegant way to accomplish the same:
CREATE OR REPLACE TABLE Accidents.CleanedFilledCombined
AS
SELECT
Fixed_Accident_Index,
ARRAY_AGG(x LIMIT 1)[SAFE_OFFSET(0)].* EXCEPT(Fixed_Accident_Index)
FROM Accidents.CleanedFilledCombined AS x
GROUP BY Fixed_Accident_Index;
To be safe, make sure you back up the original table before you run this ^^
I don't recommend using the ROW_NUMBER() OVER() approach if possible, since you may run into BigQuery memory limits and get unexpected errors.
Update the BigQuery schema with a new table column bq_uuid, making it NULLABLE and of type STRING.

Create duplicate rows by running the same command 5 times, for example:
insert into beginner-290513.917834811114.messages (id, type, flow, updated_at) Values(19999,"hello", "inbound", '2021-06-08T12:09:03.693646')
Check if duplicate entries exist:
select * from beginner-290513.917834811114.messages where id = 19999
Use the GENERATE_UUID function to generate a uuid corresponding to each message:

UPDATE beginner-290513.917834811114.messages
SET bq_uuid = GENERATE_UUID()
where id>0
Clean up the duplicate entries:
DELETE FROM beginner-290513.917834811114.messages
WHERE bq_uuid IN (
  SELECT bq_uuid
  FROM (
    SELECT bq_uuid,
           ROW_NUMBER() OVER (PARTITION BY updated_at ORDER BY bq_uuid) AS row_num
    FROM beginner-290513.917834811114.messages
  ) t
  WHERE t.row_num > 1
);

How to sort rows in "SELECT ... FOR ALL ENTRIES ...", ORDER BY is not accepted

I am selecting from a table that has multiple records with the same REQUEST_ID but different VERSION_NO. I want to sort descending so I can take the highest number (the latest record).
This is what I have...
IF it_temp2[] IS NOT INITIAL.
SELECT request_id
version_no
status
item_list_id
mod_timestamp
FROM ptreq_header INTO TABLE it_abs3
FOR ALL ENTRIES IN it_temp2
WHERE item_list_id EQ it_temp2-itemid.
ENDIF.
So version_no is one of the SELECT fields, but I want to sort by that field (descending) and only take the first row.
I was doing some research and read that SORT * BY * won't work with FOR ALL ENTRIES, but that's just my understanding from reading up.
Please let me know how I can make this work. Thanks.
You can simply sort the itab after the SELECT and delete all adjacent duplicates afterwards, if wanted:
SORT it_abs3 BY request_id [ASCENDING] version_no DESCENDING.
DELETE ADJACENT DUPLICATES FROM it_abs3 COMPARING request_id.
Depending on the amount of expected garbage (lines to be deleted) in the itab, an SQL approach may be better. See Used_By_Already's answer.
If you are using the term "latest" to indicate "the most recent entry", then the field mod_timestamp appears to be relevant and you could use it this way to choose only the most recent records for each request_id.
SELECT
    h.request_id
  , h.version_no
  , h.status
  , h.item_list_id
  , h.mod_timestamp
FROM ptreq_header h
INNER JOIN (
    SELECT
        request_id
      , MAX(mod_timestamp) AS latest
    FROM ptreq_header
    GROUP BY
        request_id
    ) l
  ON h.request_id = l.request_id
  AND h.mod_timestamp = l.latest
If you want the largest version_no, then instead of MAX(mod_timestamp) use MAX(version_no)
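Presumably that variant keeps the same shape, with the join condition adjusted to match (a sketch of the same query as above):
SELECT
    h.request_id
  , h.version_no
  , h.status
  , h.item_list_id
  , h.mod_timestamp
FROM ptreq_header h
INNER JOIN (
    SELECT
        request_id
      , MAX(version_no) AS latest_version
    FROM ptreq_header
    GROUP BY
        request_id
    ) l
  ON h.request_id = l.request_id
  AND h.version_no = l.latest_version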
Just declare it_abs3 as a sorted table with a key consisting of the columns you want to sort by.
You can also sort the table after the query.
SORT it_abs3 BY ...

How to create select SQL statement that would produce "merged" dataset from two tables (Oracle DBMS)?

Here's my original question:
merging two data sets
Unfortunately I omitted some intricacies that I'd like to elaborate on here.
So I have two tables, events_source_1 and events_source_2. I have to produce a resultant dataset from those tables (which I'd then insert into a third table, but that's irrelevant here).
events_source_1 contains historic event data and I have to get the most recent event; for that I'm doing the following:
select event_type,b,c,max(event_date),null next_event_date
from events_source_1
group by event_type,b,c,event_date,null
events_source_2 contains the future event data and I have to do the following:
select event_type,b,c,null event_date, next_event_date
from events_source_2
where b>sysdate;
How do I put in an outer join statement to fill the void (i.e. when the same event_type,b,c is found in events_source_2, next_event_date will be filled with the first date found)?
Greatly appreciate your help in advance.
Hope I got your question right. This should return the latest event_date of events_source_1 per event_type, b, c and add the lowest event_date of event_source_2.
Select es1.event_type, es1.b, es1.c,
       Max(es1.event_date),
       Min(es2.event_date) As next_event_date
From events_source_1 es1
Left Join events_source_2 es2 On ( es2.event_type = es1.event_type
                                   And es2.b = es1.b
                                   And es2.c = es1.c
                                 )
Group By es1.event_type, es1.b, es1.c
You could just turn the table where you need to select a max using GROUP BY into a virtual table, and then do the full outer join as I provided in the answer to the prior question.
Add something like this to the top of the query:
with past_source as (
  select event_type, b, c, max(event_date) as event_date
  from events_source_1
  group by event_type, b, c
)
Then you can use past_source as if it were an actual table, and continue your select right after the closing parens on the with clause shown.
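For illustration, a hedged sketch of how that continued statement might look - the full outer join mirrors the prior answer, and the exact output columns are assumptions:
with past_source as (
  select event_type, b, c, max(event_date) as event_date
  from events_source_1
  group by event_type, b, c
)
select ps.event_type, ps.b, ps.c, ps.event_date, es2.next_event_date
from past_source ps
full outer join events_source_2 es2
  on  es2.event_type = ps.event_type
  and es2.b = ps.b
  and es2.c = ps.c;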
I ended up doing a two-step process: the 1st step populates the data from event table 1, and the 2nd step MERGEs the data between the target (the dataset from the 1st step) and the other source. Please forgive me, but I had to obfuscate table names and omit some columns in the code below for legal reasons. Here's the SQL:
INSERT INTO EVENTS_TARGET (VEHICLE_ID,EVENT_TYPE_ID,CLIENT_ID,EVENT_DATE,CREATED_DATE)
select VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID,
max(EVENT_INITIATED_DATE) EVENT_DATE, sysdate CREATED_DATE
FROM events_source_1
GROUP BY VEHICLE_ID, EVENT_TYPE_ID, DEALER_ID, sysdate;
Here's the second step:
MERGE INTO EVENTS_TARGET tgt
USING (
  SELECT ee.VEHICLE_ID VEHICLE_ID,
         ee.POTENTIAL_EVENT_TYPE_ID POTENTIAL_EVENT_TYPE_ID,
         ee.CLIENT_ID CLIENT_ID,
         ee.POTENTIAL_EVENT_DATE POTENTIAL_EVENT_DATE
  FROM EVENTS_SOURCE_2 ee
  WHERE ee.POTENTIAL_EVENT_DATE > SYSDATE
) src
ON (tgt.vehicle_id = src.VEHICLE_ID AND tgt.client_id = src.CLIENT_ID AND tgt.EVENT_TYPE_ID = src.POTENTIAL_EVENT_TYPE_ID)
WHEN MATCHED THEN
  UPDATE SET tgt.NEXT_EVENT_DATE = src.POTENTIAL_EVENT_DATE
WHEN NOT MATCHED THEN
  INSERT (tgt.VEHICLE_ID, tgt.EVENT_TYPE_ID, tgt.CLIENT_ID, tgt.NEXT_EVENT_DATE, tgt.CREATED_DATE)
  VALUES (src.VEHICLE_ID, src.POTENTIAL_EVENT_TYPE_ID, src.CLIENT_ID, src.POTENTIAL_EVENT_DATE, SYSDATE);
