Data Transformation for Large data in a file - intersystems-ensemble

I am new to Ensemble and have a question regarding Data Transformations.
I have 2 schemas as follows,
PatientID,
Patient Name,
Patient Address (combination of door number, Street, District, State)
and another schema as,
PatientID,
Patient Name,
Door Number
Street
District
State
Now there is an incoming text file with thousands of records in the first schema ('|' separated), as below:
1001|John|220,W Maude Ave,Suisun City, CA
There are thousands of records like this in the input file.
My requirement is to convert this to the second schema (i.e. to separate the address) and store it in a file like:
1001|John|220|W Maude Ave|Suisun City|CA
One solution I implemented was to loop through each line in the file and replace the commas in the address with '|'.
My question is whether we can do this through DTL. If the answer is yes, how do we loop through thousands of records using DTL?
Also, will DTL be time consuming, since we need to load the schema and then do the transformations?
Please help.

You can use DTL with any class that inherits from Ens.VirtualDocument or %XML.Adaptor. Ensemble essentially uses the class dictionary to represent the schema, so for basic classes there is no problem: if you extend %XML.Adaptor, Ensemble can represent it. In the case of virtual documents, the DocType has to be set on the object.
In order to do the loop there is a <foreach> action in DTL.

Yes, DTLs can parse thousands of records. You can do the following:
1) Create a record map to parse the incoming file that has schema 1
2) Define an intermediate object that maps schema 2 fields to object properties
3) Create a DTL whose source object is the record map object from 1 above and whose target is the object from 2 above.
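Outside of Ensemble, the per-record transformation that the DTL mappings amount to is just a split on the address field. A minimal sketch in plain Node/TypeScript (not DTL; the file names patients_in.txt and patients_out.txt are hypothetical), using the '|'-delimited layout shown in the question:

```typescript
import * as fs from "fs";

// Split the comma-separated address of each '|'-delimited record
// into separate '|'-delimited fields (schema 1 -> schema 2).
function transformLine(line: string): string {
  const [patientId, patientName, address] = line.split("|");
  // Address is "door number, street, district, state"; trim stray spaces.
  const addressParts = address.split(",").map((part) => part.trim());
  return [patientId, patientName, ...addressParts].join("|");
}

const input = fs.readFileSync("patients_in.txt", "utf8");
const output = input
  .split(/\r?\n/)
  .filter((line) => line.length > 0)
  .map(transformLine)
  .join("\n");
fs.writeFileSync("patients_out.txt", output);
// "1001|John|220,W Maude Ave,Suisun City, CA" becomes
// "1001|John|220|W Maude Ave|Suisun City|CA"
```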

Related

What happens when two updates for the same record come in one file while loading into the DB using Informatica

Suppose I have a table xyz:
id name add city act_flg start_dtm end_dtm
1 amit abc,z pune Y 21012018 null
and this table is loaded from a file using Informatica with SCD2.
Suppose there is one file that contains two records with id=2, i.e.
2 vipul abc,z mumbai
2 vipul asdf bangalore
So how will these be loaded into the DB?
It depends on how you're doing the SCD type 2. If you are using a lookup with a static cache, both records will be added with the end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This will ensure one record is expired with an end date and only one active record (i.e. end date is null) exists per id.
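Not Informatica, but a sketch of the row-level outcome the dynamic-cache approach gives you when the two id=2 records are read in order; the applyScd2 helper, its types and the load date are made up for illustration:

```typescript
interface SourceRec { id: number; name: string; add: string; city: string; }
interface Scd2Row extends SourceRec {
  act_flg: "Y" | "N";
  start_dtm: string;
  end_dtm: string | null;
}

// Apply same-id records in file order; each new version expires the previous
// active one, so only the last record per id stays active (end_dtm = null).
function applyScd2(target: Scd2Row[], incoming: SourceRec[], loadDtm: string): Scd2Row[] {
  for (const rec of incoming) {
    const active = target.find((r) => r.id === rec.id && r.end_dtm === null);
    if (active) {
      active.act_flg = "N"; // expire the currently active version
      active.end_dtm = loadDtm;
    }
    target.push({ ...rec, act_flg: "Y", start_dtm: loadDtm, end_dtm: null });
  }
  return target;
}

// Both id=2 records from the file, latest read last:
const rows = applyScd2([], [
  { id: 2, name: "vipul", add: "abc,z", city: "mumbai" },
  { id: 2, name: "vipul", add: "asdf", city: "bangalore" },
], "22012018");
// -> mumbai row expired (end_dtm set), bangalore row active (end_dtm null)
```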
Hmm, one of two possibilities depending on what you mean... If you mean that you're pulling data from different source systems which sometimes have the same ids on those systems, then it's easy... just stamp both the natural key (i.e. the id) and a source system value on the dimension record along with the arbitrary surrogate key which is unique to your target table... (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracing real-time changes to a single record in the source system and writing these changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate them based on the timestamp of the change and just pick the most recent one, or to create two records, one with its expiry datetime set and the other still open (which is the standard SCD approach... again, read Kimball).

Is it a good idea to use Hashing (ORA_HASH) to define uniqueness in relational tables?

I have an XML file containing Client id and addresses which I need to load into relational tables in Oracle database.
<Clients>
<Client id="100">
<name>ABC Corportation</name>
<Address>
<Addr1>1 Pine Street</Addr1>
<City>Chennai</City>
<State>Tamil Nadu</State>
<Country>India</Country>
<Postalcode>6000000</Postalcode>
</Address>
<Address>
<Addr1>1 Apple Street</Addr1>
<City>Coimbatore</City>
<State>Tamil Nadu</State>
<Country>India</Country>
<Postalcode>6000101</Postalcode>
</Address>
</Client>
<Client id="101">
....
....
</Client>
I have 2 relational tables defined as below-
Client
CLIENT_ID (Unique Key)
CLIENT_NAME
Client_Location
CLIENT_ID
ADDR1
CITY
STATE
COUNTRY
POSTAL_CODE
Updates to client addresses at source will be sent in the XML file every day. The ETL is designed in such a way that it requires a unique key on the table, based on which it will identify the change coming in the XML as an INSERT or UPDATE and accordingly sync the table to the XML. Identifying DELETEs is not really necessary.
Question: What should be defined as the unique key for Client_Location to process the incremental changes coming every day in the XML file? There is no identifier for an address in the XML file. I was thinking about creating an additional hashing column (using the ORA_HASH function) based on the 3 columns (STATE, COUNTRY, POSTAL_CODE). The unique key for the table would be (CLIENT_ID, <hash column>), which the ETL will use. The idea is that it is not common for STATE/COUNTRY/POSTAL_CODE to change in an address. Of course, this is a big assumption which I'm making. I would like to implement the below:
1) If there is any small change to ADDR1, I want the ETL to pick it up as a "valid" update at source and sync it to the table.
2) If there is a small change in STATE/COUNTRY/POSTAL_CODE (e.g. a typo correction or a case change like India to INDIA), then I don't want this to be picked up as a change, because it would lead to an INSERT (the hash value, which is part of the unique key, would change) and in turn duplicate rows in the table.
Does the idea of using a hashing column to define uniqueness make sense? Is there a better way to deal with this?
Is there a way to tweak ORA_HASH to produce results expected in #2 above?
If the client can have only one location, reuse CLIENT_ID as the primary key.
If more locations are possible, add a SEQUENCE key (sequence number 1..N) to the CLIENT_ID as the PK.
The simplest way to distinguish and identify the locations is to use the feature of XML that the order of elements is well defined and has meaning. So the first Address element (Pine Street) becomes sequence 1, the second becomes 2, and so on.
Please check the FOR ORDINALITY clause of XMLTABLE to see how to get this identification while parsing the XML.
You may also add a TIMESTAMP (as a simple attribute, not a key) to keep the timestamp of the change, and a STATUS column to identify deleted locations.
A HASH may be useful to quickly test for a change if you have tons of columns, but for 5 columns it is probably overkill (as you may simply compare the column values). I would not recommend using a HASH as part of a key, as it has no advantage over the proposed solution.
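If you do keep a hash purely as a change-detection column rather than as part of the key, point 2 in the question (case changes or extra whitespace should not register as a change) is handled by normalizing the inputs before hashing. A sketch in Node/TypeScript rather than ORA_HASH; the same normalization idea applies to whatever expression you feed the hash in SQL:

```typescript
import { createHash } from "crypto";

// Normalize before hashing so cosmetic differences (case, extra spaces)
// do not change the hash, while real value changes still do.
function addressChangeHash(state: string, country: string, postalCode: string): string {
  const normalize = (s: string) => s.trim().replace(/\s+/g, " ").toUpperCase();
  return createHash("sha1")
    .update([state, country, postalCode].map(normalize).join("|"))
    .digest("hex");
}

// "India" vs "INDIA" now hash identically, so no spurious INSERT:
console.log(
  addressChangeHash("Tamil Nadu", "India", "6000101") ===
  addressChangeHash("Tamil Nadu", "INDIA", "6000101")
); // true
```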

Queries in DynamoDB

I have an application written in Node.js that needs to find ONE row based on a city name (this could just be the table's name; different cities will be categorized as different tables) and a field named "currentJobLoads", which is a number. For example, a user might want to find ONE row with the city name "Chicago" and the lowest currentJobLoads. How can I achieve this in DynamoDB without scan operations (since a scan would be slower and can only read so much data before it gets terminated)? Any suggestions would be highly appreciated.
You didn't specify what your current partition key and sort key for the table are, but I'm guessing the currentJobLoads field isn't one of them. So you would need to create a Global Secondary Index on the currentJobLoads field, at which point you will be able to run query operations against that field.
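For example, assuming a single table with city as an attribute and a GSI whose partition key is city and whose sort key is currentJobLoads (the table name and index name below are made up), querying the index in ascending order with Limit 1 returns the row with the lowest currentJobLoads. A sketch with the Node.js AWS SDK v2:

```typescript
import { DynamoDB } from "aws-sdk";

const docClient = new DynamoDB.DocumentClient();

// Query the GSI ascending on the sort key (currentJobLoads) and take only
// the first item, i.e. the row with the lowest currentJobLoads for the city.
async function lowestJobLoad(city: string) {
  const result = await docClient
    .query({
      TableName: "Jobs",                       // hypothetical table name
      IndexName: "city-currentJobLoads-index", // hypothetical GSI name
      KeyConditionExpression: "city = :c",
      ExpressionAttributeValues: { ":c": city },
      ScanIndexForward: true, // ascending by currentJobLoads
      Limit: 1,
    })
    .promise();
  return result.Items?.[0];
}

lowestJobLoad("Chicago").then(console.log);
```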

How to create a new record dynamically using Informatica PowerCenter

I have employees' leave-related data and payment-related information.
E.g. employee E1 has taken maternity leave this year. She needs to be paid for 6 months, and if she is on leave for a greater duration (like 8 months), I need to create two records for her.
One for the allowed duration and the other for extended duration.
Employee LeaveStartDAte LeaveEndDate Total_days_taken Total_days_allowed LeaveType
e1 1Jan2013 31Aug2013 242 186 ML
Target expected :
Employee LeaveStartDAte LeaveEndDate Leavetype
e1 1Jan2013 30June2013 ML
e1 1July 2013 31Aug2013 Extended ML
How can I create the second record dynamically in an Informatica mapping?
Generally speaking, we use a Java transformation in Informatica to dynamically create new rows. However, for scenarios like the one you described, where you only need to create one extra row based on some condition, you can achieve this by adding two target instances and populating the second target instance conditionally (using a router or filter transformation).
You can do something like this:
Create two sets of ports for LeaveStartDate, LeaveEndDate and LeaveType in an expression, and calculate their values accordingly. For example:
LeaveStartDate1 -> source LeaveStartDate
LeaveStartDate2 -> LeaveStartDate + Total_days_allowed + 1
Now connect first set of ports directly to a target instance. Connect the second set of ports to another target instance through a filter. The filter condition would be something like Total_days_taken > Total_days_allowed. You can also do this using a router, if you like.
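Outside of the mapping, the expression-plus-filter logic looks roughly like the sketch below (plain TypeScript, not Informatica; the field names and day arithmetic are illustrative and mirror the port formulas above, so adjust the inclusive/exclusive boundaries to your rules):

```typescript
interface LeaveIn {
  employee: string;
  leaveStartDate: Date;
  leaveEndDate: Date;
  totalDaysTaken: number;
  totalDaysAllowed: number;
  leaveType: string;
}
interface LeaveOut {
  employee: string;
  leaveStartDate: Date;
  leaveEndDate: Date;
  leaveType: string;
}

const addDays = (d: Date, n: number) => new Date(d.getTime() + n * 86_400_000);

// The first output row always goes out; the second one only passes the
// "filter" condition Total_days_taken > Total_days_allowed.
function splitLeave(rec: LeaveIn): LeaveOut[] {
  const rows: LeaveOut[] = [{
    employee: rec.employee,
    leaveStartDate: rec.leaveStartDate,
    leaveEndDate: addDays(rec.leaveStartDate, rec.totalDaysAllowed),         // LeaveEndDate1
    leaveType: rec.leaveType,
  }];
  if (rec.totalDaysTaken > rec.totalDaysAllowed) {
    rows.push({
      employee: rec.employee,
      leaveStartDate: addDays(rec.leaveStartDate, rec.totalDaysAllowed + 1), // LeaveStartDate2
      leaveEndDate: rec.leaveEndDate,
      leaveType: "Extended " + rec.leaveType,
    });
  }
  return rows;
}
```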
You can also use two pipelines in a mapping - one to load the records for insert and the second to combine the insert with the update.

CloverETL: Compare two records

I have two files, A and B. The records in both files share the same format, and the first n characters of a record are its unique identifier. The records are fixed-length and consist of m fields (field1, field2, field3, ... fieldm). File B contains new records plus records from file A that have changed. How can I use CloverETL to determine which fields have changed in a record that appears in both files?
Also, how can I gather metrics on the frequency of changes for individual fields? For example, I would like to know how many records had changes in fieldm.
This is a typical example of the Slowly Changing Dimension problem. A solution with CloverETL is described on their blog: Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 1 and Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2.
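The per-field diff and the change-frequency metrics are also easy to prototype outside CloverETL (the same comparison could sit in a Reformat step inside a graph). A sketch assuming hypothetical file names, a 10-character key and made-up fixed field widths:

```typescript
import * as fs from "fs";

// Hypothetical layout: key length and fixed field widths; adjust to yours.
const KEY_LEN = 10;
const FIELD_WIDTHS = [10, 20, 20, 8]; // widths of field1..field4 (example)

function splitFields(record: string): string[] {
  const fields: string[] = [];
  let pos = 0;
  for (const width of FIELD_WIDTHS) {
    fields.push(record.slice(pos, pos + width));
    pos += width;
  }
  return fields;
}

// Index a file's records by their key (the first KEY_LEN characters).
const byKey = (file: string) =>
  new Map(
    fs.readFileSync(file, "utf8")
      .split(/\r?\n/)
      .filter((line) => line.length > 0)
      .map((line) => [line.slice(0, KEY_LEN), line] as const)
  );

const oldRecs = byKey("fileA.txt");
const newRecs = byKey("fileB.txt");

// Count, per field index, how many common records changed in that field.
const changeCounts = new Array(FIELD_WIDTHS.length).fill(0);
for (const [key, newRec] of newRecs) {
  const oldRec = oldRecs.get(key);
  if (!oldRec) continue; // brand-new record, not a change
  const oldFields = splitFields(oldRec);
  const newFields = splitFields(newRec);
  FIELD_WIDTHS.forEach((_, i) => {
    if (oldFields[i] !== newFields[i]) changeCounts[i] += 1;
  });
}
console.log(changeCounts); // e.g. changeCounts[3] = records that changed in field4
```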
