Two mappings with exactly the same session attributes behave completely differently after a slight change in logic - informatica-powercenter

I am an Informatica developer.
I have a mapping in Informatica as described below.
Original mapping:
Pipeline 1: AS400 (DB2 SQ) -> EXP -> RTR -> AGG1 -> MPLT -> TGT1 (SQL Server)
            (RTR also feeds) AGG2 -> TGT2 (SQL Server)
            (RTR also feeds) TGT3 (SQL Server)
Pipeline 2: (EXP also feeds) AGG3 -> EXP -> TGT4 (Flat File)
Most of the records pass through Pipeline 1, and I was asked to optimize the flow. These were my suggestions:
In Pipeline 1, remove AGG1 and AGG2 and push the aggregation logic down to the database. Since the flow is incremental and the incremental records are loaded into a temporary table, I expected the performance to be better.
Remove the target TGT3, as it is not required.
This is how my optimized mapping looks now:
Optimized mapping (what I expected):
Pipeline 1: AS400 (DB2 SQ) -> EXP -> RTR -> MPLT -> TGT1 (SQL Server)
            (RTR also feeds) TGT2 (SQL Server)
Pipeline 2: (EXP also feeds) AGG3 -> EXP -> TGT4 (Flat File)
Just to investigate source-side performance, I changed the session properties of all the targets to write to flat files instead, to check whether I could optimize my source in any way.
But to my surprise, when I executed both sessions (in separate workflows, one after the other), the SQ throughput of the optimized session was much lower than that of the original session.
Everything else in the optimized solution is exactly the same, as I made a copy of the original mapping/session before removing two of the Aggregators and one of the targets.
Please note: the environment I am developing in has version control enabled; could that have anything to do with it?
I have cross-checked this multiple times and am unable to find an answer.

You can identify it better if you go through the session log in detail. Also run the query in the source DB and check how long it takes. You can tune the source side by using pushdown optimization (i.e. source-side pushdown optimization), but before that, check with your source DB that everything is fine and the query is not taking much time. You can also modify and optimize the query and compare the performance.
If that still does not work out, you can go for session partitions on the Source Qualifier and check the performance.
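To make the source-side check concrete, here is a minimal JDBC sketch of timing the Source Qualifier query directly against the source DB, outside Informatica. The connection URL, library, and query are placeholders, not the mapping's actual SQL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SourceQueryTiming {
    public static void main(String[] args) throws Exception {
        // Hypothetical DB2-for-i connection string and query: run the same SQL the
        // Source Qualifier issues (or the aggregated version of it) and time it,
        // independent of Informatica, to see whether the source itself is slow.
        String url = "jdbc:as400://as400-host/LIBRARY";
        String sql = "SELECT CUST_ID, SUM(AMOUNT) FROM INCR_STAGE GROUP BY CUST_ID";

        long start = System.nanoTime();
        long rows = 0;
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                rows++;                       // just drain the result set
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(rows + " rows in " + elapsedMs + " ms");
    }
}
```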

Related

Spark Performance Tuning Question - Resetting all caches for performance testing

I'm currently working on performance and memory tuning for a Spark process. As part of this, I'm performing multiple runs of different versions of the code and trying to compare their results side by side.
I've got a few questions to ask, so I'll post each separately so they can be addressed separately.
Currently, it looks like getOrCreate() is re-using the Spark Context each run. This is causing me two problems:
Caching from one run may be affecting the results of future runs.
All of the tasks are bundled into a single 'job', and I have to guess at which tasks correspond to which test run.
I'd like to ensure that I'm properly resetting all caches in each run to ensure that my results are comparable. I'd also ideally like some way of having each run show up as a separate job in the local job history server so that it's easier for me to compare.
I'm currently relying on spark.catalog.clearCache(), but I'm not sure it covers everything I need. I'd also like a way to ensure that the tasks for each run are clearly grouped in some fashion for comparison, so I can see where I'm losing time and, ideally, the total memory used by each run as well (this is one of the things I'm currently trying to improve).
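For reference, a minimal sketch of the isolation pattern described above, assuming the Java API and a local master (the class and names are illustrative, not the actual test harness): each run gets its own session, tags its jobs with a job group so they show up together in the Spark UI, and clears the cache before stopping.

```java
import org.apache.spark.sql.SparkSession;

public class RunIsolation {
    public static void main(String[] args) {
        // Hypothetical harness: each test run gets its own SparkSession so caching
        // from one run cannot bleed into the next.
        for (int run = 1; run <= 2; run++) {
            SparkSession spark = SparkSession.builder()
                    .appName("perf-test-run-" + run)
                    .master("local[*]")
                    .getOrCreate();

            // Tag everything this run submits so its jobs/tasks are grouped in the UI.
            spark.sparkContext().setJobGroup("run-" + run, "test run " + run, false);

            // ... execute the code under test here ...

            // Drop cached tables/DataFrames, then stop so the next iteration starts cold.
            spark.catalog().clearCache();
            spark.stop();
        }
    }
}
```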

Data warehouses and atomicity rarely coexist

This came up in the context of a humorous tweet about wrapping a data warehouse overnight load in a transaction and how this would bloat the log file and eat up disk space. I'm not trying to disprove it but rather to understand it better, as to me it seems to imply that a partial load (due to an error) should be allowed to complete, which would mean that the DW would not accurately reflect the source system(s).
The only way I can understand it is if the incomplete records were loaded into an intermediate staging layer in the DW but not processed further until they were completed by a subsequent overnight load.
I tried to research it further but without success so would be really grateful for any advice.
When an error happens during the loading of a DW you could:
stop the load and roll back (either to the start of the load of a single target object, of a group of objects, or of the whole DW)
stop the load and leave the DW as it is at that point
log the error and continue the load (either of the failing target or of other objects in the DW)
Which option you choose is entirely dependent on your particular circumstances, and you might have many different strategies in use at different points in your ETL pipeline, depending on the number of errors (a sketch of the first and third options follows the examples below). For example:
The error may allow you to continue to load other dims/facts without affecting them
Your business might prefer a fact table to be loaded minus one erroring record rather than missing a complete day’s data until the error is fixed
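As a rough illustration of the first and third options (not taken from any particular DW), here is a JDBC-style sketch with hypothetical connection details and table names: each target object is loaded in its own transaction, a failure rolls back only that object's partial load, and the load carries on with the remaining objects.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class WarehouseLoad {
    public static void main(String[] args) throws SQLException {
        // Hypothetical DW connection and load steps; each target object gets its own transaction.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret")) {
            conn.setAutoCommit(false);

            String[] loadSteps = {
                    "INSERT INTO dim_customer (customer_id, name) SELECT customer_id, name FROM stg_customer",
                    "INSERT INTO fact_sales (sale_id, customer_id, amount) SELECT sale_id, customer_id, amount FROM stg_sales"
            };

            for (String step : loadSteps) {
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate(step);
                    conn.commit();                  // this object's load is complete
                } catch (SQLException e) {
                    conn.rollback();                // option 1: undo this object's partial load
                    System.err.println("Load step failed, continuing: " + e.getMessage());
                    // option 3: log the error and carry on with the remaining objects
                }
            }
        }
    }
}
```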

An issue of partial insertion of data into the target when job fails

We have a 17-record data set in one of the source tables, with erroneous data in the 14th record, which causes the job to fail. Because the commit size in the tMysqlOutput component is set to 10, only 10 records are inserted into the target before the job fails. In the next execution, after correcting the erroneous record, the job fetches all 17 records and completes successfully, which leaves duplicates in the target.
What we tried:
To overcome this, we tried the tMysqlRollback component together with the tMysqlConnection and tMysqlCommit components.
Q1: Is there any option to use tMysqlRollback without the tMysqlConnection and tMysqlCommit components?
We explored the tMysqlRollback and tMysqlCommit components in the documentation:
https://help.talend.com/reader/QgrwjIQJDI2TJ1pa2caRQA/7cjWwNfCqPnCvCSyETEpIQ
But we are still looking for clues on how to design the above process in an efficient manner.
Q2: We'd also like to know about RAM usage and disk space consumption from a performance perspective.
Any help would be much appreciated.
No, the only way to do transactions in Talend is to open a connection using tMysqlConnection, then either commit using tMysqlCommit or roll back using tMysqlRollback.
Without knowing what you're doing in your job (lookups, transformations, etc.), it's hard to advise you on RAM consumption and performance. But if you only have a source and a target, RAM consumption should be minimal (make sure you enable stream on the tMysqlInput component). If you have another database as your source, then RAM consumption depends on how that database's driver is configured (JDBC drivers usually accept a parameter telling them to fetch only a certain number of records at a time).
Lookups and components that process data in memory (tSortRow, tUniqRow, tAggregateRow, etc.) are what cause memory issues, but it's possible to tweak their usage (using disk, among other methods).
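For context, the tMysqlConnection / tMysqlCommit / tMysqlRollback trio corresponds roughly to a plain JDBC transaction. A sketch of that underlying pattern (hypothetical tables and credentials), including the driver-level fetch hint mentioned above: one transaction covers the whole load, so a failure on the 14th record leaves nothing behind and a re-run cannot create duplicates.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MysqlLoadWithRollback {
    public static void main(String[] args) throws SQLException {
        // Hypothetical tables/credentials. One transaction covers all rows, so a
        // failure on row 14 rolls everything back and a re-run cannot duplicate rows.
        try (Connection src = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "etl_user", "secret");
             Connection tgt = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "etl_user", "secret")) {

            tgt.setAutoCommit(false);                       // what tMysqlConnection enables

            try (Statement read = src.createStatement();
                 PreparedStatement write = tgt.prepareStatement(
                         "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {

                read.setFetchSize(Integer.MIN_VALUE);       // MySQL Connector/J hint to stream rows
                try (ResultSet rs = read.executeQuery("SELECT id, payload FROM source_table")) {
                    while (rs.next()) {
                        write.setInt(1, rs.getInt("id"));
                        write.setString(2, rs.getString("payload"));
                        write.executeUpdate();
                    }
                }
                tgt.commit();                               // tMysqlCommit
            } catch (SQLException e) {
                tgt.rollback();                             // tMysqlRollback
                throw e;
            }
        }
    }
}
```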

5GB file to read

I have a design question. I have a 3-4 GB data file, ordered by time stamp. I am trying to figure out what the best way is to deal with this file.
I was thinking of reading this whole file into memory, then transmitting this data to different machines and then running my analysis on those machines.
Would it be wise to upload this into a database before running my analysis?
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might become too slow.
Any ideas?
Update:
I want to process the records one by one. Basically, I am trying to run a model on timestamped data, but I have various models, so I want to distribute the work so that the whole process runs overnight every day. I want to make sure that I can easily increase the number of models without decreasing system performance, which is why I am planning to distribute the data to all the machines running the models (each machine will run a single model).
You can also access the file on the hard disk itself and read a small chunk at a time. Java has something called RandomAccessFile for this, but the same concept is available in other languages as well.
Whether you want to load the data into a database and do the analysis there should be governed purely by the requirements. If you can read the file and keep processing it as you go, there is no need to store it in a database. But if your analysis requires data from many different areas of the file, then a database would be a good idea.
You do not need the whole file in memory, just the data you need for the analysis. You can read every line and store only the needed parts of the line, plus the offset where the line starts in the file, so you can find it later if you need more data from that line.
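A minimal Java sketch of that idea (the file path is hypothetical): read the file line by line, keep only what the analysis needs, and remember the byte offset of each line so a later pass can seek straight back to it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class IndexedScan {
    public static void main(String[] args) throws IOException {
        List<Long> lineOffsets = new ArrayList<>();

        try (RandomAccessFile raf = new RandomAccessFile("/data/big-file.dat", "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                lineOffsets.add(offset);          // where this line starts
                // ... keep only the fields needed for the analysis here ...
                offset = raf.getFilePointer();    // start of the next line
            }

            // Later, jump straight back to, say, line 1,000,000 if more detail is needed.
            if (lineOffsets.size() > 1_000_000) {
                raf.seek(lineOffsets.get(1_000_000));
                System.out.println(raf.readLine());
            }
        }
    }
}
```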
Would it be wise to upload this into a database before running my analysis?
Yes.
I plan to run my analysis on different machines, so doing it through a database would be easier, but if I increase the number of machines running the analysis, the database might become too slow.
Don't worry about it; it will be fine. Just introduce a marker so the rows processed by each machine are identified.
I'm not sure I fully understand all of your requirements, but if you need to persist the data (refer to it more than once), then a DB is the way to go. If you just need to process portions of these output files and trust the results, you can do it on the fly without storing any contents.
Only store the data you need, not everything in the files.
Depending on the analysis needed, this sounds like a textbook case for using MapReduce with Hadoop. It will support your requirement of adding more machines in the future. Have a look at the Hadoop wiki: http://wiki.apache.org/hadoop/
Start with the overview, get the standalone setup working on a single machine, and try doing a simple analysis on your file (e.g. start with a "grep" or something). There is some assembly required but once you have things configured I think it could be the right path for you.
I had a similar problem recently, and just as lalit mentioned, I used a RandomAccessFile reader against the file on the hard disk.
In my case I only needed read access to the file, so I launched a bunch of threads, each starting at a different point in the file. That got the job done and really improved my throughput, since each thread could spend a good amount of time blocked doing some processing while the other threads kept reading the file.
A program like the one I described should be very easy to write; just try it and see if the performance is what you need.
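A rough sketch of that layout (file path and slice handling are illustrative, not the original program): each thread opens its own RandomAccessFile, seeks to its slice of the file, and processes it independently.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class ParallelReader {
    public static void main(String[] args) throws Exception {
        final String path = "/data/big-file.dat";   // hypothetical path
        final int threadCount = 4;

        long length;
        try (RandomAccessFile probe = new RandomAccessFile(path, "r")) {
            length = probe.length();
        }
        final long sliceSize = length / threadCount;

        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final long start = i * sliceSize;
            final long end = (i == threadCount - 1) ? length : start + sliceSize;
            workers[i] = new Thread(() -> {
                // Each thread gets its own file handle, so seeks don't interfere.
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(start);
                    if (start > 0) raf.readLine();   // skip the partial line at the boundary
                    String line;
                    while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                        // ... run the model / processing on this line ...
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
    }
}
```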

Looking for alternatives to the database project

I have a fairly large database project which contains nine databases, one of which has a fairly large schema.
This project takes a long time to build and I'm about to pull my hair out. We'd like to keep our database under source control, but I'm having a hard time getting the other devs to use the project and build the database project before checking in, just because it takes so long to build.
It is seriously crippling our work, so I'm looking for alternatives. Maybe something can be done with Redgate's SQL Compare? I think the only drawback there is that it doesn't validate syntax? Anyone's thoughts/suggestions would be much appreciated.
Please consider trying SQL Source Control, which is a product designed to work alongside SQL Compare as part of a database development lifecycle. It's in Beta at the moment, but it's feature complete and it's very close to its full release.
http://www.red-gate.com/products/SQL_Source_Control/index.htm
We'd be interested to know how this performs on a commit in comparison to the time it takes for Visual Studio to build your current Database Project. Do you actually need to build the project so often in VS that it's a problem? How large is your schema and how long is an average build?
Keeping Dev/live db in sync:
There are probably a whole host of ways of doing this; I'm sure other users will expand further (including software solutions).
In my case I use a two-fold approach:
(a) Run scripts to get the differences between databases (stored procs, tables, fields, etc.); see the sketch at the end of this answer.
(b) Keep a strict log of DB changes (NOT data changes).
For (b), I have over time built up a semi-structured log like this:
Client_Details [Alter][Table][New Field]
{
EnforcePasswordChange;
}
Users [Alter][Table][New Field]
{
PasswordLastUpdated;
}
P_User_GetUserPasswordEnforcement [New][Stored Procedure]
P_User_UpdateNewPassword [New][Stored Procedure]
P_User_GetCurrentPassword [New][Stored Procedure]
P_Doc_BulkDeArchive [New][Stored Procedure]
Ignore the tabbing; the markdown has messed it up.
But you get the general gist.
I find that 99% of the time the log is all I need.
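For option (a), a bare-bones sketch of what such a diff script can look like (connection strings and the comparison scope are hypothetical): pull the column catalogue of the dev and live databases from INFORMATION_SCHEMA and print whatever exists on only one side.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

public class SchemaDiff {
    public static void main(String[] args) throws SQLException {
        Set<String> dev = columns("jdbc:sqlserver://dev-host;databaseName=App", "user", "pw");
        Set<String> live = columns("jdbc:sqlserver://live-host;databaseName=App", "user", "pw");

        for (String c : dev) {
            if (!live.contains(c)) System.out.println("Only in dev:  " + c);
        }
        for (String c : live) {
            if (!dev.contains(c)) System.out.println("Only in live: " + c);
        }
    }

    // Returns entries like "Users.PasswordLastUpdated datetime".
    private static Set<String> columns(String url, String user, String pw) throws SQLException {
        Set<String> result = new HashSet<>();
        try (Connection conn = DriverManager.getConnection(url, user, pw);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS")) {
            while (rs.next()) {
                result.add(rs.getString(1) + "." + rs.getString(2) + " " + rs.getString(3));
            }
        }
        return result;
    }
}
```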
