How to use Nifi to write CSV files from DB data with all data sorted and no same value in different files? - oracle

I have a data table in an Oracle database in this format to keep all the transactions in my system:

Customer ID    Transaction ID
-----------    --------------
001            trans_id_01
001            trans_id_02
002            trans_id_03
003            trans_id_04
As you can see, each customer ID can generate many transactions in this table.
Now I need to export each day's data into CSV files with Apache NiFi.
The requirement is to have around 10k transactions in each file (this is not fixed, it can be a bit more or less), with rows sorted by Customer ID. That should be simple, and I have done it with this processor:
But there is an additional requirement: each Customer ID must be kept in a single file. There should be no case where customer ID 005 has some transactions in file no. 1 and another transaction in file no. 2.
If I had to write this logic in pure code, I think I could query the DB with pagination and write some code to compare the trailing data at the end of each page with the next page before writing each file. But when it comes to implementing this with NiFi, I still have no idea how to do it.

I'd try ExecuteSQLRecord with a custom SELECT that gets you exactly what you want from Oracle, and then use PartitionRecord configured to use the customer ID as the partition column. That will break up the record set.
I don't know how Oracle does it, but this would be the way I'd do it in Postgres:
SELECT CUSTOMER_ID, ARRAY_AGG(TRANSACTION_ID) FROM TRANSACTIONS GROUP BY CUSTOMER_ID
That would create: 001, {trans_id_01, trans_id_02...} and ensure that each result entry from the database has precisely one customer per line and all of their transactions enumerated in a single list.
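If it helps, Oracle's rough equivalent is LISTAGG; a minimal sketch against the TRANSACTIONS table from the question (the output column alias is my own) would be:
-- One row per customer, with all of that customer's transaction IDs in a single comma-separated string.
SELECT CUSTOMER_ID,
       LISTAGG(TRANSACTION_ID, ',') WITHIN GROUP (ORDER BY TRANSACTION_ID) AS TRANSACTION_IDS
FROM TRANSACTIONS
GROUP BY CUSTOMER_ID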

I found a solution based on the loop-flow idea from: https://gist.github.com/ijokarumawak/01c4fd2d9291d3e74ec424a581659ca8
So I created a loop flow like in the image below:
This queries about 40k records each time with this SQL query:
SELECT * FROM TRANSACTIONS WHERE <some filtering> AND CUSTOMER_ID > '${max}' ORDER BY CUSTOMER_ID FETCH FIRST 40000 ROWS WITH TIES
The WITH TIES clause pulls in the tied rows, ensuring that all records with the same CUSTOMER_ID end up in the same file. Each successful FlowFile goes to the right side to write the data into a CSV file, while the success relationship also goes downward to extract data for the next iteration. ${max} is the largest CUSTOMER_ID of the current result set, retrieved with a QueryRecord processor using the query below:
SELECT CUSTOMER_ID FROM FLOWFILE ORDER BY CUSTOMER_ID DESC FETCH FIRST 1 ROW ONLY
The flow then goes on to the next iteration of the loop until there is no data left for the current criteria.

Related

Deduplication in Oracle

Situation:
Table 'A' receives data from an Oracle GoldenGate feed as New/Updated/Duplicate (N/U/D) records, which either create a new record or rewrite an existing one depending on that characteristic. Every entry in the table has an UpdatedTimeStamp column containing the insertion timestamp.
Scope:
To write a stored procedure in Oracle that pulls the data for a time period based on the UpdatedTimeStamp column and publishes an XML using DBMS_XMLGEN.
How can I ensure that a duplicate entered in the table is not processed again?
FYI: I am currently filtering via a new table that I created, named 'A-stg', which has the old data inserted incrementally.
As far as I understood the question, there are a few ways to avoid duplicates.
The most obvious is to use DISTINCT, e.g.
select distinct data_column from your_table
Another is to use the timestamp column and get only the last (or the first?) value, e.g.
select data_column, max(timestamp_column)
from your_table
group by data_column
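If you need the whole row back rather than just the key and its latest timestamp, a common variant (just a sketch reusing the same placeholder names, with other_column standing in for the remaining columns) is:
-- Keep only the most recent version of each data_column value.
select data_column, other_column, timestamp_column
from (
    select t.*,
           row_number() over (partition by data_column
                              order by timestamp_column desc) as rn
    from your_table t
)
where rn = 1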

PLSQL Daily record of changes on table, then select from day

Oracle PL/SQL question: one table should be archived day by day. The table has about 50,000 records, but only a few records change during a day. A second table (the destination/history table) has one additional field, import_date. Right now two days means 100,000 records; it should be 50,000 plus a few records with information about the changes made during the day.
I need a simple solution to copy data from the source table to the destination like a "LOG" - only changes are copied/registered. But I should still be able to check the dataset of the source table as of a given day.
Is there a mechanism like MERGE or something similar for this?
Normally you'd have a day_table and a master_table. All records are loaded from the day_table into the master and only the master is manipulated, with the day table used to store the raw data.
You could add a new column to the master, such as date_modified, and have the app update this field when a record changes, or a flag used to indicate that it has changed.
Another way to do this is to have an active/latest flag. Instead of changing the record, it is duplicated, with a flag set to indicate whether it is the newer or the old record. This might be easier for comparison,
e.g. select * from master_table where record = 'abcd'
This would show 2 rows - the original loaded at 1pm and the modified active one changed at 2pm.
There's no need to have another table; you could then base a view on this flag,
e.g. CREATE VIEW changed_records_view AS SELECT * FROM master_table WHERE flag = 'Y'
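For reference, the duplicate-with-flag idea above might look roughly like this (table and column names are assumptions, not from the original answer):
-- Retire the currently active version instead of updating it in place...
update master_table
   set flag = 'N'
 where record = 'abcd'
   and flag = 'Y';

-- ...then insert the new version as the active row.
insert into master_table (record, some_value, flag, date_modified)
values ('abcd', 'new value', 'Y', sysdate);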
I once faced a similar issue; please find the solution below.
Tables we had:
A master table that always has records in it and keeps growing.
One backup table to store all the master records on a daily basis.
Solution:
From morning to evening, records are inserted and updated in the master table. The way to find the new records was the timestamp: whenever a record is inserted or updated, the corresponding timestamp is set and kept.
At night, we created a job schedule to run a procedure (Create_Job - please check the Oracle documentation for further learning) exactly at 10:00 pm to bulk collect all the records available in the master table based on today's date and insert them into the backup table.
The scenario I have explained should help you. Please also check out the concept of job scheduling. Thank you.
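A minimal sketch of that nightly copy, assuming a master_table with an updated_ts column and a backup_table with the extra import_date column (all names, including the wrapper procedure copy_master_to_backup, are placeholders):
-- Copy every row touched today into the backup/history table.
insert into backup_table (id, data_col, import_date)
select id, data_col, sysdate
from master_table
where updated_ts >= trunc(sysdate);

-- Schedule a wrapper procedure around that insert to run every night at 10 pm.
begin
  dbms_scheduler.create_job(
    job_name        => 'nightly_backup_job',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'begin copy_master_to_backup; commit; end;',
    start_date      => systimestamp,
    repeat_interval => 'FREQ=DAILY; BYHOUR=22',
    enabled         => true);
end;
/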

change data capture multiple tables for incremental load - ETL

I am building a staging area that gets data from Informatica CDC. For example, let's say I am replicating two tables for incremental load. I have to delete the processed data from the staging tables after each load. I join these two tables to populate my target dimension. The problem is that a change can happen on only one source and not the other in a particular load.
Example:
Employee
---------
ID NAME
1 PETER
EmployeeSal
------------
EMPID SAL
1 2000
If the above is replicated in my first load, I join the two tables and load them; that's fine.
Now let's say Peter's salary is updated from 2000 to 3000. As I delete my staging tables after each load, I will have the following for the current load:
Employee
---------
ID NAME
EmployeeSal
-----------
EMPID SAL
1 3000
Here is my problem: I have to populate the whole row of the dimension, which is Type 2.
I have to join back to the source to get the other attributes of the employee table (this is just a lame example; in reality it might be 10 tables and hundreds of thousands of changes). Is it recommended to go back to the source?
I could join the target table into this mix and populate the missing attributes.
Is this even recommended, given that I have to do a lot of CASE statements, null handling, etc. if a particular staging table has no change for a dimension record? My question is: is it even common for the target table to be joined in an ETL transformation?
Going back to the source system defeats the purpose of creating the staging area in the first place; it is usually not recommended.
However, querying the target table to get previous information is quite common. But it is true that you have to do a lot of checks.
Another option is to maintain an SCD Type 1 copy in your staging area. Maintain insert and update timestamps in staging, which you can use to pick up only the changed records while loading the dimension.
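To illustrate that with the example above (staging and dimension table names are my own assumptions), when only EmployeeSal changed in this load, the missing Employee attributes could be pulled from the current dimension row:
-- Take NAME from staging if it changed in this load, otherwise from the
-- current (active) row of the target dimension.
SELECT s.EMPID,
       COALESCE(e.NAME, d.NAME) AS NAME,
       s.SAL
FROM STG_EMPLOYEESAL s
LEFT JOIN STG_EMPLOYEE e ON e.ID = s.EMPID
LEFT JOIN DIM_EMPLOYEE d ON d.EMPID = s.EMPID
                        AND d.CURRENT_FLAG = 'Y'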
I have encountered the same problem with Order Header and Details - if a detail changes, I need to be able to inner join to the Header to update my flattened Order Fact table.
I resolved it this way: after I have staged all the changed records, I look up any missing Headers for the changed Details (and vice versa) using an SQL Task that loads an object with an array of Order IDs, and I use a for-each loop to load the missing Headers into staging.
My inner joins from staging to the data warehouse now work as expected.
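A set-based sketch of that backfill step (staging and source table names are assumptions; the answer above does it with an SQL Task plus a for-each loop, but the idea is the same):
-- Stage the Header rows that are referenced by changed Details
-- but were not part of this load's change set.
INSERT INTO STG_ORDER_HEADER
SELECT h.*
FROM SRC_ORDER_HEADER h
WHERE h.ORDER_ID IN (SELECT d.ORDER_ID FROM STG_ORDER_DETAIL d)
  AND NOT EXISTS (SELECT 1
                  FROM STG_ORDER_HEADER s
                  WHERE s.ORDER_ID = h.ORDER_ID)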

Showing HBase data in JSP taking several minutes. Please advise

I am accessing HBase data from JSP using Hive queries. Since HBase can store huge amounts of data, like terabytes, the Hive query (which is converted into map-reduce tasks) can take several minutes when there is that much data. So will the JSP page wait, say, 10 minutes to display the data? What should the strategy be? Is this the correct approach? If not, what is the best approach to show huge HBase data on a JSP page?
Hive, or any Hadoop map-reduce system for that matter, is designed for offline batch processing. Submitting Hive queries from a JSP and waiting an arbitrary amount of time for the data to be ready and shown on the front end is a definite no-no. If the cluster is super busy, your jobs might not even be scheduled within the specified time.
What exactly do you want to show from HBase on the front end?
If it is a set of rows from a table and you know what the rows are (meaning you have the row key, or your application can compute it at run time), just fetch those rows and display them.
If you have to do SQL-like operations (joins, selects, etc.), then I guess you realize HBase is a NoSQL system, and you are supposed to do these operations in the application and then fetch the appropriate rows using the row key.
For example: if you have two HBase tables, say Dept (dept Id as row key and a string column (employees) with a comma-separated list of empIds) and Employee (emp Id as row key and columns Name, Age, Salary), then to find the employee with the highest salary in a dept you have to:
a. Fetch the row from the Dept table (using the dept Id).
b. Iterate over the list of empIds from the employees column.
c. In each iteration, fetch the row from the Employee table (by empId row key) and keep track of the max salary.
Yes, HBase can handle TBs of data, but you'll almost never have to show that much data on the front end using JSP. I'm guessing you'll most likely be interested in only a portion of the data, even though the backing HBase table is much bigger.

Informatica issue - transformation

I want to know how to get the latest updated record via Informatica. Suppose I have 10 records in a temporary table: 3 records for Account1, 3 for Account2 and 4 for Account3. Now, for each of these 3 accounts, I need to fetch only the record with the maximum date value (the latest date) and insert it into another temporary table. Which transformations or Informatica logic should I use to do this? Please help.
If the date column comes from the input with a unique date, use an Aggregator transformation based on it and take the maximum date.
If no date column is present, you can assign the system timestamp, but you cannot take a maximum date from that; you have to go for some other logic, such as the rowid and rownum features.
If the source is a DB, we can do it in the Source Qualifier (SQ) itself: group by the pk field, select this pk field and max(date), then join this output with the original source on pk and date.
For example:
select * from src_table
join (select pk, max(date) as maxdate
      from src_table
      group by pk) aggr_table
  on src_table.pk = aggr_table.pk
 and src_table.date = aggr_table.maxdate
The same can be implemented inside Informatica using an Aggregator and a Joiner, but since the Aggregator's source is the SQ and its output is joined back with the SQ, one Sorter will be required between the Aggregator and the Joiner.
You can use an Aggregator transformation. Use a Sorter transformation first, sorting by account and date ascending. After that, use an Aggregator transformation grouping by account. You don't need to add any condition or aggregate function, as the Aggregator will return the last record of every group.
