SSIS Lookup transformation not finding matches - caching

I have a Lookup transformation that does not seem to be finding obvious matches. I have an input file that has 43 records that includes the same CustomerID which is set as an 8 byte-Signed Integer. I am using the Lookup to see if the CustomerID already exist in my destination table. In the destination table the CustomerID is defined as BigInt.
For testing, I truncated the Lookup(destination) table. I have tried all three Cache settings with the same results.
When I run the SSIS package, all 43 records are sent through the No Match Output side. I would think that only the 1st record should go that direction and all others would be considered as a match since they have the same CustomerID. Additionally, if I run the job a second time(without truncating the destination) then they are all flagged as Matched.
It seems as if the cache is not being looked at in the Lookup. Ultimately I want the NO Match records to be written to the Destination table and the Matched records to have further processing.
Any ideas?

Lookup transformation is working as expected. I am not sure what's your understanding of look up is, so I'll go point by point.
For testing, I truncated the Lookup(destination) table. I have tried
all three Cache settings with the same results.
When I run the SSIS package, all 43 records are sent through the No
Match Output side
Above behavior is expected. After truncate, lookup is essentially trying to find those 43 records within your truncated destination table. Since it can't find any, it is flagging them as new records ie No match output side.
if I run the job a second time(without truncating the destination)
then they are all flagged as Matched
In this case, all those 43 records from file are matched within destination table, hence lookup refers them as duplicates and thus they are flagged as Matched output
I am using the Lookup to see if the CustomerID already exist in my
destination table
To achieve this, all you need to do is send Matched output to some staging table which can be periodically truncated(as they are duplicate). and all the No match output can be send to your destination table.
You can post screenshot of your lookup as well in case you want further help.

The lookup can't be used this way. SSIS dataflows execute in a transaction. So while the package is running, no rows have been written to the destination until the entire dataflow runs. So regardless of the Cache setting, the new rows being sent to your destination table are not going to be considered by the Lookup while it's running. Then when you run it again, the rows will be considered. This is expected behavior.

Related

Oracle 11g Replace Physical Columns with Virtual Columns

After googling for a while , I am posting this question here since I was not able to find such a problem posted anywhere.
Our application has a table with 274 columns(No LOB or Long Raw columns) and over a period of 8 years the table started to have chained rows so any full table scan is impacting the performance.
When we dig deeper we found out that approximately 50 columns are not used anywhere in application and so could be dropped right away. But the challenge here is the application has to undergo many code changes to achieve this and we have exposed the underlying data as a service that is being consumed by other applications as well. So we cannot choose the code change as an option for now.
Another option we thought was, whether I can make these 50 columns as Virtual column set to NULL always, then we only we need to make changes to table loading procs and rest all will be as is. But I need experts' advice whether adding virtual columns to the table will not construct chained rows again. Will this solution work for the given problem statement?
Thanks
Rammy
Oracle only allows 255 columns per block. For tables with more than 255 columns it splits rows into multiple blocks. Find out more.
You table has 274 columns so you have chained rows because of the inherent table structure rather than the amount of space the data takes up. Making fifty columns all null won't change that.
So, if you want to eliminate the chained rows you really need to drop the rows. Of course you don't want to change all that application code. So what you can try is:
rename the table
drop the columns you don't want any more
create a view using the original table name and include NULL columns in the view's projection to match the original table structure.

Stop Hbase update operation if it have same value

I have a table in Hbase named 'xyz' . When I do an update operation on this table , it updates a table even though it is same record .
How can I control second record to not be added.
Eg:
create 'ns:xyz',{NAME=>'cf1',VERSIONS => 5}
put 'ns:xyz','1','cf1:name','NewYork'
put 'ns:xyz','1','cf1:name','NewYork'
Above put statements are giving 2 records with different timestamp if I check all versions. I am expecting that it should not add 2nd record because it have same value
HBase isn't going to look through the entire row and work out if it's the same as the data you're adding. That would be an expensive operation, and HBase prides itself on its fast insert speeds.
If you're really eager to do this (and I'd ask if you really want to do this), you should perform a GET first to see if the data is already present in the table.
You could also write a Coprocessor to do this every time you PUT data, but again the performance would be undesirable.
As mentioned by #Ben Watson, HBase is best known for it's performance in write since it doesn't need to check for the existence of a value as multiple versions will be maintained by default.
One hack what you can do is, you can use custom versioning. As show in the below screenshot, you have two versions already for a row key. Now if you are going to insert the same record with the same timestamp. HBase would be overwriting the same record with just the value.
NOTE: It is left to your application to get the same timestamp for a particular value.

Informatica 9.5.1, huge table (scd1)

I have a table(in oracle) size about 860 million records (850gb) on top we are getting about 2 -3 million records as source (flatfile).
we are doing a lookup on target if record already exist it will update if it is a new record it will insert(scd1).
The transformations we using are unconnectedlookup, sorter, filter and router, update strategy transformations, it was fine all this time, but as the table is huge and growing huge, it is taking for ever to insert and update, last night it took 19 hrs to 2.4 million records (2.1 millions were new so inserted and the rest are updates).
Today I got about 1.9 millions to go through i am not sure how long it will take any suggestions or help how can we handle this ?
1) Use just a connected lookup to oracle table, after SQ matching on primary key and filter out nulls (records missing in Oracle table) or not null (updates). Dont check for other columns for update. Skip sorter and filter. Just use update strategy.
2) Or use joiner and make flat file pipeline as master. Then check for nulls to find insert or updates.
3) Check if your target table dont have any trigger etc on it. If yes then check its logic and implement it in ETL.
Since you are dealing with 850mil data, you have two major bottlenecks - target lookup and writing into target.
You can think of this strategy -
Mapping 1 - Create a new mapping to load flat file data into a temp table TMP1.
Mapping 2 - Modify existing mapping. Just modify lookup query and join TMP1 and target (860mil)table in SQL Override. This will reduce time, I/O, lookup cache.
Also, please make sure you have an index on key columns in target. And you drop-create all other index while loading. Skipping sorter will help but adding joiner will not help much.
Regards,
Koushik
How many inserts vs updates do you have?
With just a few updates, try using Update else Insert target
property.
If there are many updates and few inserts, perform update
just if a key is found, without checking if anything has changed
If there are many source rows matching what you already have (i.e. an update that doesn't change anything) try to eliminate them. But don't compare all columns - use a hash instead. Just create an additional computed column that will contain a MD5 calculated on all columns. Then all you need to do is compare one column instead of all to detect a change.
1) Try using a merge statement if source and targets are in same database.
2) We can also use sql loader connection to improve the performance.
Clearly the bottleneck is in the target lookup and target load (update to be specific).
Try the following to tune the existing code:
1) Try to remove any unwanted lookup ports if you have in the lookup transformation. Keep only the fields that are used in the lookup condition as you are using it just to check if the record exists.
2) Try adding an index to the target table for the fields you are using for the update
3) Increase the commit interval of the session to a higher value.
4) Partial Pushdown optimization:
You can pushdown some of the processing to database which might be faster instead of doing it in Informatica
Create a staging table to hold the incoming data for that load.
Create a mapping to load the incoming file to the staging table. Truncate it before the start of the load to clear the records of the previous run.
In the SQL override of the existing mapping do a left join between the staging table and target table to find insert/updates. This will be faster than the Informatica lookup and eliminates the time taken to build the Informatica lookup cache.
5) Using MD5 to eliminate unwanted updates
For using MD5 you need to add a new field in the target table and do a mapping to update the existing records one time.
Then in your existing mapping add a step to compute MD5 for the incoming column.
If the record is identified for update then check if the MD5 computed for the incoming column is same as that of the target column. If the checksum also matches then don't update the record. Only if the check sum is different update the record. By this way you will filter out the unwanted updates. If there is no lookup match then insert the record.
Advantages: You are reducing the unwanted updates.
Disadvantages: You have to do an one time process to populate MD5 values for the existing records in the table.
If none of this works check with your database administrator to see if there is any issue in the database side that might slow down the load.

Avoid duplication in data pulled from Cassandra

Background: I am getting information from varius logfiles and a Cassandra table. The logfiles are fine, but fetching from the table gives me duplicates inside elasticsearch, as I can't get the rows added since sql_last_run.
How do I avoid duplication of rows?
One way to avoid this is to create your own document IDs by computing the SHA or MD5 of the raw log line.
That way the same log line, even if read repeatedly, will always produce the same ID and you won't get any duplicate documents anymore.
Another solution is to create another column in your tables with a unique GUID and use that value as the document IDs.

Informatica Data Quality - Match Analysis

In our Duplicate analysis requirement the input data has 1418 records out of which 1380 records are duplicate records.
On using the Match Analysis (used Key Generator, Matcher, Associator, Consolidator) in IDQ integrated with PowerCenter except for 8 records all duplicates were eliminated.
On executing the workflow by excluding these records, duplicates appear in other records for which duplicate didnt occur in the previous run.
Can anyone tell why this mismatch occurs?
Looks like your Consolidator transformation is not getting correct association ids and hence inserting multiple records resulting in duplicates.
please try the below steps:
1) Try to create a workflow in IDQ itself by deploying the mapping which you developed in IDQ.
2) Also keep a check on the business keys of the records which make a primary key through which you are identifying the dups in source.

Resources