Informatica issue - transformation - informatica-powercenter

I want to know how to get the latest updated record via Informatica. Suppose I have 10 records in a temporary table: 3 records for Account1, 3 for Account2, and 4 for Account3. Out of these 3 accounts, I need to fetch only the records that have the maximum (latest) date value and insert them into another temporary table. Which transformations or Informatica logic should I use? Please help.

If the date column comes from the input with a unique date per record, use an Aggregator transformation and, based on that column, take the maximum date.
If no date column is present, you can assign the system timestamp, but you cannot take a maximum date from that, since it only reflects load time rather than the record's actual update time. You have to go for some other logic, such as ROWID/ROWNUM-based features.

If the source is a database, we can do it in the Source Qualifier itself: group the data by the primary-key field, select that key along with max(date), and then join this output back to the original source on the key and the date.
For example:
select src_table.*
from src_table
join ( select pk, max(date) as maxdate
       from src_table
       group by pk ) aggr_table
  on src_table.pk = aggr_table.pk
 and src_table.date = aggr_table.maxdate
The same can be implemented inside Informatica using an Aggregator and a Joiner. But since the Aggregator's source is the Source Qualifier and its output is joined back with the same Source Qualifier, one Sorter will be required between the Aggregator and the Joiner.

You can use an Aggregator transformation. Use a Sorter transformation first, sorting by account and date ascending. After that, add an Aggregator transformation grouped on account. You don't need to add any condition or aggregate function, because the Aggregator will return the last record of every group.
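For reference, if the source is relational, the same latest-record-per-account pick can also be sketched with an analytic function in the Source Qualifier override. This is only a sketch; the table and column names (account_stg, account_id, updated_date) are assumptions, not from the original post:
-- keep only the newest row per account
select *
from ( select s.*,
              row_number() over (partition by account_id
                                 order by updated_date desc) as rn
       from account_stg s )
where rn = 1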

Related

How to use NiFi to write CSV files from DB data with all data sorted and no value split across different files?

I have a data table in an Oracle database in this format to keep all the transactions in my system:
Customer ID | Transaction ID
001         | trans_id_01
001         | trans_id_02
002         | trans_id_03
003         | trans_id_04
As you see, each customer ID can generate many transactions in this table.
Now I need to export each day's data into CSV files with Apache NiFi.
But the requirement is that I need around 10k transactions in each file (this is not fixed; it can be a bit more or less), with rows sorted by Customer ID. That part is simple, and I have already done it with a processor.
But there's an additional requirement to ensure each Customer ID stays in the same file. There should be no case where customer ID 005 has some transactions in file no. 1 and another transaction in file no. 2.
If I had to write this logic in pure code, I think I could do a DB query with pagination and write some code to compare the trailing data at the end of each page with the next page before writing each file. But when it comes to implementing this with NiFi, I still have no idea how to do it.
I think you could try ExecuteSQLRecord with a custom select that gets exactly what you want from Oracle, and then use PartitionRecord configured to use the customer ID as the partition column. That will break up the record set.
I don't know how Oracle does it, but this would be the way I'd do it in Postgres:
SELECT CUSTOMER_ID, ARRAY_AGG(TRANSACTION_ID) FROM TRANSACTIONS GROUP BY CUSTOMER_ID
That would create: 001, {trans_id_01, trans_id_02...} and ensure that each result entry from the database has precisely one customer per line and all of their transactions enumerated in a single list.
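A rough Oracle counterpart of that aggregation, offered only as a sketch against the table shown above, would use LISTAGG (note that LISTAGG output is limited to the VARCHAR2 size, so a customer with very many transactions could overflow it):
select customer_id,
       listagg(transaction_id, ',') within group (order by transaction_id) as transaction_ids
from transactions
group by customer_id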
I found a solution by borrowing the loop-flow idea from: https://gist.github.com/ijokarumawak/01c4fd2d9291d3e74ec424a581659ca8
So I created a loop flow along those lines.
It queries about 40k records each time with this SQL query:
SELECT * FROM TRANSACTIONS WHERE <some filtering> AND CUSTOMER_ID > '${max}' ORDER BY CUSTOMER_ID FETCH FIRST 40000 ROWS WITH TIES
The WITH TIES keyword pulls in the tied rows, which ensures that all records with the same CUSTOMER_ID end up in the same file. Each success FlowFile goes to the right side of the flow to write the data into a CSV file, while the success relationship also goes downward to extract data for another iteration. ${max} is the biggest CUSTOMER_ID of the current result set, retrieved with a QueryRecord processor using the query below:
SELECT CUSTOMER_ID FROM FLOWFILE ORDER BY CUSTOMER_ID DESC FETCH FIRST 1 ROW ONLY
It then goes on to the next iteration of the loop until there is no data left for the current criteria.

Exago BI - Can you join to the same table multiple times?

Let's say I have a date table.
Let's say I also have another table that I want to join to the date table based on the sales date, and then join again based on the shipped date; the goal is to join on the date values in order to get the month name, year, etc. from the date table.
Using Exago BI, is it possible to do such a join, or would I have to create a view and just put the data in there manually?
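For context, in plain SQL the intended double join would look roughly like the sketch below; the table and column names (sales, date_dim, sales_date, shipped_date, date_value, month_name, calendar_year) are assumptions for illustration only:
select s.*,
       sd.month_name  as sales_month,   sd.calendar_year  as sales_year,
       shd.month_name as shipped_month, shd.calendar_year as shipped_year
from sales s
join date_dim sd  on sd.date_value  = s.sales_date
join date_dim shd on shd.date_value = s.shipped_date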

hive: how to handle scd type 2 without update

Currently, in our on-prem Hadoop environment, we are using a Hive table with transactional properties. However, as we are moving to AWS, we don't have that feature yet, and so I want to understand how to handle SCD Type 2 without updates.
For example, for the following record:
With Updates
In a table with transactional properties enabled, when I get an update for a record, I change the end_date of the old record to the current date and create a new record with effective_date as the current date and end_date as 12/31/9999, as shown in the table above. That makes it easy to find my active record (where end_date = "12/31/9999").
However, if I can't update the past record, I end up with two records with the same end_date, as shown in the table below.
Without Updates
My questions are, if I can't update the end_date of the past record:
How do I get the historical duration of stay?
How do I get the active record?
First of all, convert all dates to the 'yyyy-MM-dd' format, so they are all sortable and analytic functions will work. Then you can use lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date). For id=1 and effective_date = 2019-01-01 it should give you '2020-08-15', and you can assign this value as the end_date for the '2019-01-01' record. If there is no record with a bigger effective_date, the default '9999-01-01' will be assigned. After this transformation, the active record is the one with end_date = '9999-01-01'.
Supposing the dates are already converted to yyyy-MM-dd, this is how you can rewrite your table (after the insert):
insert overwrite table your_table
select name, id, location, effective_date,
       lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date) as end_date
  from your_table
Or, without doing the insert first, you can UNION ALL the existing records with the new records in a subquery and then calculate the lead, as sketched below.
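A minimal sketch of that variant, assuming the incoming rows land in a staging table called new_records with the same columns (the staging table name is an assumption):
insert overwrite table your_table
select name, id, location, effective_date,
       lead(effective_date, 1, '9999-01-01') over(partition by id order by effective_date) as end_date
from ( select name, id, location, effective_date from your_table
       union all
       select name, id, location, effective_date from new_records ) s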
Actually, SCD2 is not recommended for historical data rewriting because of the non-equi join implementation in Hive. It is implemented as a cross join plus a filter, or as a duplicating join on dim.id = fact.id (which duplicates rows) plus where fact.date <= dim.end_date and fact.date >= dim.effective_date (which should filter things back down to one record). This join is very expensive if the dimension and fact are big, because of the duplication before filtering.
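For illustration, the non-equi join being described looks roughly like this; the fact table name and its columns are assumptions:
select f.*, d.location
from fact_events f
join your_table d
  on d.id = f.id                      -- duplicating join: one fact row per dimension version
where f.event_date >= d.effective_date
  and f.event_date <= d.end_date      -- filter back down to the matching version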

Deduplication in Oracle

Situation:
Table 'A' receives data from an Oracle GoldenGate feed and gets the data as a New/Updated/Duplicate feed that either creates a new record or rewrites the old one based on its characteristic (N/U/D). Every entry in the table has an UpdatedTimeStamp column containing the insertion timestamp.
Scope:
To write a stored procedure in Oracle that pulls the data for a time period based on the UpdatedTimeStamp column and publishes XML using DBMS_XMLGEN.
How can I ensure that a duplicate entered in the table is not processed again?
FYI: I am currently filtering via a new table that I created, named 'A-stg', which has the old data inserted incrementally.
As far as I understood the question, there are a few ways to avoid duplicates.
The most obvious is to use DISTINCT, e.g.
select distinct data_column from your_table
Another one is to use timestamp column and get only the last (or the first?) value, e.g.
select data_column, max(timestamp_column)
from your_table
group by data_column
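If you need whole rows rather than just the deduplicated column, a hedged variation on the same idea uses ROW_NUMBER; the key column name business_key and the bind variables for the time window are assumptions:
select *
from ( select a.*,
              row_number() over (partition by business_key
                                 order by UpdatedTimeStamp desc) as rn
       from A a
       where UpdatedTimeStamp between :from_ts and :to_ts )
where rn = 1   -- keep only the most recent version of each business_key in the window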

Is it possible to use SYSDATE as a column alias?

I want to run a report daily and store the report's date as one of the column headers. Is this possible?
Example output (Counting the activities of employees for that day):
SELECT EMPLOYEE_NAME AS EMPLOYEE, COUNT(ACTIVITY) AS "Activity_On_SYSDATE" FROM EMPLOYEE_ACCESS GROUP BY EMPLOYEE_NAME;
Employee Activity_On_17042016
Jane 5
Martha 8
Sam 11
You are looking to do a reporting job with a data storing tool. The database (and SQL) is for storing and retrieving data, not for creating reports. There are special tools for creating reports.
In database design, it is very unhealthy to encode actual data in table or column name. Neither a table name nor a column name should have, as part of the name (and of the way they are used), an employee id, a date, or any other bit of actual data. Actual data should only be in fields, which in turn are in columns in different tables.
From what you describe, your base table should have columns for employee, activity and date. Then on any given day, if you want the count for the "current" day, you can query with
select employee, count(activity) ct
from table_name
where activity_date = SYSDATE
group by employee
If you want, you can also include the "activity_date" column in the output, that will show for which date the report was run.
Note that I assumed the column name for "date" is "activity_date." And in the output I used "ct" for a column alias, not "count." DATE and COUNT are reserved words, like SYSDATE, and you should NOT use them as table or column names. You could use them as aliases, as long as you don't need to refer to these aliases anywhere else in the SQL, but it is still a very bad idea. Imagine you ever need to refer to a column (by name or by alias) and the name or alias is SYSDATE. What would a where clause like this mean?
where sysdate = sysdate
Do you see the problem?
Also, I can't tell from your question: were you thinking of storing these reports back in the database? To what end? It is better to store just one query and run it whenever needed, making the "activity_date" for which you want the counts an input parameter, so you can run the query for any date, at any time in the future (a sketch follows below). There is no need to store the actual daily reports in the database, as long as the base table is properly maintained.
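A minimal sketch of that parameterized version, assuming a bind variable named :report_date and the activity_date column discussed above:
select employee_name as employee,
       count(activity) as activity_count
  from employee_access
 where trunc(activity_date) = :report_date   -- trunc() guards against a time-of-day component; drop it if activity_date is a pure date
 group by employee_name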
Good luck!
