Comparing two large files with different record order using a Unix shell script - shell

I have two text files, each 3.5GB in size, that I want to compare using a Unix script. The files contain around 5 million records each.
The layout of the files is as below:
<sysdate>
<Agent Name 1>
<Agent Address 1>
<Agent Address 2>
<Agent Address 3>
...
<Agent Name 2>
<Agent Address 1>
<Agent Address 2>
<Agent Address 3>
...
<Total number of records present>
Sample file:
<sysdate>
Sachin Tendulkar 11051973 M
AddrID1 AddrLn11 AddrLn12 City1 State1 Country1 Phn1 OffcAddr11 OffcAddr12 St1 Cntry1
AddrID2 AddrLn21 AddrLn22 City2 State2 Country2 Phn2 OffcAddr21 OffcAddr22 St2 Cntry2
...
Sourav Ganguly 04221975 M
AddrID1 AddrLn11 AddrLn12 City1 State1 Country1 Phn1 OffcAddr11 OffcAddr12 St1 Cntry1
AddrID2 AddrLn21 AddrLn22 City2 State2 Country2 Phn2 OffcAddr21 OffcAddr22 St2 Cntry2
...
<Total number of records present>
The order of the agent addresses in the two files is different. I need to find the records that are present in one file but not in the other, as well as the records that mismatch. I initially tried sorting the files with the Unix sort command, but it failed due to a server space issue. An ETL (Informatica) approach can also be considered.
Any help would be appreciated.

You can use awk and start writing to a new file each time you match an agent name, giving that file the name of the agent (perhaps in subdirectories keyed on the first three characters). Then compare the directory trees produced from the two input files (diff -r).
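A rough sketch of that idea, assuming (hypothetically) that address lines start with "AddrID" and agent header lines do not; file and tree names are placeholders:

#!/bin/sh
# Split one big file into one small file per agent, bucketed by the
# first three characters of the agent name, then diff the two trees.
# The trailing record-count line should be stripped beforehand (e.g. sed '$d').
split_by_agent() {
    awk -v dir="$2" '
        FNR == 1 { next }                        # skip the <sysdate> header
        !/^AddrID/ {                             # agent header: start a new file
            if (out != "") close(out)
            name = $0
            gsub(/[^A-Za-z0-9]/, "_", name)      # filesystem-safe file name
            pfx = dir "/" substr(name, 1, 3)
            system("mkdir -p \"" pfx "\"")
            out = pfx "/" name
        }
        out != "" { print > out }
    ' "$1"
    # Address order differs between the files, so normalise every agent file.
    find "$2" -type f -exec sort -o {} {} \;
}

split_by_agent file1.txt tree1
split_by_agent file2.txt tree2
diff -r tree1 tree2

diff -r then reports agents present in only one tree as missing files, and agents with mismatched addresses as differing files.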
Another solution is to import all records into two different tables and use SQL to compare:
select name from table1 where name not in (select name from table2);
select name from table2 where name not in (select name from table1);
select name from table1
inner join table2 on table1.name=table2.name
where table1.address1 <> table2.address1
or table1.address2 <> table2.address2
...

In Informatica, load both files.
Compute the MD5 of each row by concatenating all the columns. For example:
MD5(COL1||COL2||COL3)
Then compare the MD5 values from the two files using a Joiner transformation; this way you can find the matching and non-matching rows.
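Outside Informatica, the same row-hash idea can be sketched in the shell (illustrative only; the file names are placeholders, and hashing a whole line is equivalent to concatenating all its columns):

# Hash every line of each file, then compare the hash sets.
# (A line-at-a-time loop is slow at 3.5 GB; this only illustrates the idea.)
md5_lines() { while IFS= read -r line; do printf '%s\n' "$line" | md5sum; done < "$1"; }
md5_lines file1.txt | sort > hashes1
md5_lines file2.txt | sort > hashes2
comm -23 hashes1 hashes2   # hashes of rows only in file1
comm -13 hashes1 hashes2   # hashes of rows only in file2

In practice you would carry the row key alongside each hash so the non-matching rows can be identified, which is exactly what the Joiner does in the mapping.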

First of all, please post an example of the second file.
Why can't you sort the data with the Sorter transformation?
My approach would be to concatenate the first three columns (name, address1, address2) and use that as the key, then use a Joiner transformation to match the data.
You can also do a Union transformation followed by an Aggregator transformation to count how many times the key you created is found (see the sketch after this answer):
If the count is equal to 2, it means the data is in both files.
If the count is equal to 1, it means the data is in just one file.
Please share more info about the problem so the answer can be more specific.
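The count-based idea is also easy to sketch in the shell, assuming each record has already been flattened onto a single key line (file names are placeholders):

# Records appearing once are in only one file; records appearing twice are in both.
sort file1.keys file2.keys | uniq -u   # present in just one file
sort file1.keys file2.keys | uniq -d   # present in both files
# If sort runs out of disk space again, point its temp files at a bigger disk:
# sort -T /path/with/space ...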

Try to restructure your data first.
Prepend the agent name and the other agent-level fields to every address line related to that agent. You can achieve this with simple expression logic, such as a variable/counter approach (see the awk sketch below). This makes your flat files compare-friendly, so they can be compared easily either in UNIX or in Informatica.
Let me know if you are interested in this solution; I shall help you further.
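For example, a minimal awk sketch of that restructuring (assuming, as above, that address lines start with "AddrID"):

# Prepend the current agent header to every address line, giving one
# self-contained, sortable line per address.
awk '
    FNR == 1 { next }                  # skip the <sysdate> line
    !/^AddrID/ { agent = $0; next }    # remember the current agent header
    { print agent "|" $0 }
' file1.txt | sort > file1.flat

Doing the same for the second file gives two flat files that can be compared directly with diff or comm.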

Related

How to extract multiple values as multiple column data from a filename in Informatica PowerCenter?

I am very new to Informatica PowerCenter and have just started learning; looking for help. My requirement is: I have to extract data from a flat file (CSV file) and store the data in an Oracle table. Some of the column values of the target table should come from the file name itself.
For example:
My Target Table is like below:
USER_ID Program_Code Program_Desc Visit Date Term
EACRP00127 ER Special Visits 08/02/2015 Aug 2015
My input filename is: Aug 2015 ER Special Visits EACRP00127.csv
From this file name I have to extract "Aug 2015" as Term, "ER Special Visits" as Program_Desc and "EACRP00127" as Program_Code, along with some other fields from the CSV file itself.
I have found one solution using "Currently Processed Filename", but with this I am able to get only one single value from the file name. How can I extract 3 values from the file name and store them in the target table? Looking for someone to shed some light on a solution. Thank you.
Using an Expression transformation you can create three output ports from the Currently Processed Filename column.
You get the file name from the Source Qualifier via the CurrentlyProcessedFileName field; then you can take substrings of it to get what you want:
input/output = CurrentlyProcessedFileName
o_Term = SUBSTR(CurrentlyProcessedFileName, 1, 8)
o_Program_Desc = SUBSTR(CurrentlyProcessedFileName, 10, 17)
o_Program_Code = SUBSTR(CurrentlyProcessedFileName, 28, 10)
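A quick shell check of those offsets (note that in a real mapping the CurrentlyProcessedFileName port holds the full path, so the directory part may have to be stripped off first):

fname='Aug 2015 ER Special Visits EACRP00127.csv'
echo "${fname:0:8}"     # Aug 2015           (SUBSTR start 1, length 8)
echo "${fname:9:17}"    # ER Special Visits  (SUBSTR start 10, length 17)
echo "${fname:27:10}"   # EACRP00127         (SUBSTR start 28, length 10)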

How to take one column from each of two TXT files and create a new TXT with the two columns?

I have two text files with only one column each.
I need to take the column from each of the text files and create a new text file with the two columns with tabs.
These columns have no relation (ID) but are in order with each other.
I could do that in Excel, but there are more than 200 thousand lines and Excel won't accept that many.
How can I do it in Pentaho?
Take two Text file input steps and read both files.
After that, add an Add constants step to each stream, creating the same column with some value; make sure the constant value is the same in both streams.
Use a Stream lookup/Merge join step and merge the streams on the constant values.
Then generate the file.
You can read both files with Text file input and add a "row number" field in each stream, which gives you two streams of 2 fields each. Then you can Merge join both streams on the row number, and finally use a Select values step to clean up the output so that only the two relevant fields are kept. Then use Text file output to write it.
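For reference, outside Pentaho the same line-number join is a shell one-liner:

# paste glues the files together line by line, tab-separated by default.
paste file1.txt file2.txt > combined.txt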

What happens when two updates for the same record come in one file while loading into the DB using Informatica

Suppose I have a table xyz:
id name add city act_flg start_dtm end_dtm
1 amit abc,z pune Y 21012018 null
and this table is loaded from a file using Informatica with SCD2.
Suppose there is one file that contains two records with id=2, i.e.:
2 vipul abc,z mumbai
2 vipul asdf bangalore
So how will these be loaded into the DB?
It depends on how you're doing the SCD type 2. If you are using a lookup with a static cache, both records will be inserted with the end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This will ensure one record is expired with an end date and only one active record (i.e. end date is null) exists per id.
Hmm, one of two possibilities, depending on what you mean... If you mean that you're pulling data from different source systems which sometimes have the same ids, then it's easy: just stamp both the natural key (i.e. the id) and a source-system value on the dimension row, along with the arbitrary surrogate key which is unique to your target table (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracing real-time changes to a single record in the source system and writing these changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate the changes based on their timestamps and just pick the most recent one, or to create 2 records, one with its expiry datetime set and the other still open (which is the standard SCD approach... again, read Kimball).
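If you do go the "keep only the most recent change per id" route, the pre-processing can be sketched outside the tool as well (hypothetical: assumes the id is the first field and the file is in arrival order, so the last occurrence is the latest):

# Keep only the last record seen for each id (output order is arbitrary).
awk '{ last[$1] = $0 } END { for (id in last) print last[id] }' input.txt > deduped.txt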

Pentaho Data Integration (DI) Get Last File in a Directory of a SFTP Server

I am doing a transformation in Pentaho Data Integration and I have a list of files in a directory of my SFTP server. These files are named in FILE_YYYYMMDDHHIISS.txt format; my directory looks like this:
mydirectory
FILE_20130701090000.txt
FILE_20130701170000.txt
FILE_20130702090000.txt
FILE_20130702170000.txt
FILE_20130703090000.txt
FILE_20130703170000.txt
My problem is that I need to get the last file in this list according to its creation date, to pass it to another transformation step...
How can I do this in Pentaho Data Integration?
In fact this is quite simple, because your file names can be sorted textually, and the max in the sorted list will be your most recent file.
Since the list of files is likely short, you can use a Memory Group by step. A grouping step needs a separate column by which to aggregate; if you only have one column and you want to find the max over the entire set, you can add a grouping column with an Add constants step, configured to add a column with, say, an integer 1 in every row.
Configure your Memory Group by step to group on the column of 1s and use the file name column as the subject, then simply select the Maximum grouping type. This will produce a single row with your grouping column and an aggregate column containing your max file name.
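For comparison, the same trick outside PDI: because the timestamp is zero-padded YYYYMMDDHHIISS, lexical order equals chronological order, so the textual max is the newest file:

ls mydirectory/FILE_*.txt | sort | tail -n 1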

CloverETL: Compare two records

I have two files, A and B. The records in both files share the same format, and the first n characters of a record form its unique identifier. The records are fixed-length and consist of m fields (field1, field2, field3, ... fieldm). File B contains new records plus records from file A that have changed. How can I use CloverETL to determine which fields have changed in a record that appears in both files?
Also, how can I gather metrics on the frequency of changes to individual fields? For example, I would like to know how many records had changes in fieldm.
This is a typical example of the Slowly Changing Dimension problem. A solution with CloverETL is described on their blog: Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 1 and Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2.
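If you only need the per-field change counts, the comparison itself is also easy to express outside CloverETL. A minimal awk sketch, where the key length and the field widths are hypothetical examples to be adjusted to the real layout:

# Count, per field, how many records changed between file A and file B.
# Assumes file A fits in memory and records are fixed-width.
awk '
BEGIN {
    keylen = 10                              # example: first 10 chars are the id
    nf = split("10 20 20 15", w, " ")        # example field widths
}
NR == FNR { a[substr($0, 1, keylen)] = $0; next }   # load file A, keyed by id
{
    key = substr($0, 1, keylen)
    if (!(key in a)) next                    # brand-new record, nothing to compare
    old = a[key]; pos = 1
    for (i = 1; i <= nf; i++) {
        if (substr(old, pos, w[i]) != substr($0, pos, w[i])) changed[i]++
        pos += w[i]
    }
}
END { for (i = 1; i <= nf; i++) print "field" i ": " changed[i]+0 " changes" }
' fileA fileB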
