Count inserts, deletes and updates in a PowerCenter session - informatica-powercenter

Is there a way in PowerCenter 9.1 to get the number of inserts, deletes and updates after an execution of a session? I can see the data on the log but I would like to see it in a more ordered fashion in a table.

The only way I know requires building the mapping appropriately. You need to have 3 separate instances of the target and use a router to redirect the rows to either TARGET_insert or TARGET_update or TARGET_delete. Workflow Monitor will then show a separate row for the inserted, updated and deleted rows.

There are a few ways:
1. Use $tgtsuccessrows / $TgtFailedRows and assign them to a workflow variable.
2. Use an Expression transformation with a variable port to keep track of inserts, updates and deletes.
3. Query OPB_SESSLOG in a second stream to get the row counts inside the same session.
Not sure if PowerCenter 9.1 offers a solution to this problem.

You can design your mapping to populate an audit table that tracks the number of inserts, updates and deletes.
You can download a sample implementation from the Informatica Marketplace block titled "PC Mapping : Custom Audit Table":
https://community.informatica.com/solutions/mapping_custom_audit_table

There are multiple ways. For example, create an Assignment task and attach it right after your session. Once the session completes its run, the Assignment task passes the session stats (such as $session.status, $session.rowcount etc.) from the session to workflow variables defined at the workflow level. Then create a worklet that contains a mapping, and pass the stats captured at the workflow level to the worklet and from the worklet to the mapping. Once the stats are available in the mapping, read them (using a SQL or EXP transformation) and write them to the AUDIT table. Attach this combination of Assignment task and worklet after each session and it will capture the stats of each session after the session completes its run.
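To make the audit-table idea concrete outside PowerCenter, here is a minimal sketch using Python's sqlite3 module. The session_audit table, its columns, the session name and the record_session_stats helper are all assumptions made for illustration; they are not PowerCenter objects, and a real implementation would receive the counts from the session rather than from a Python list.

```python
import sqlite3

def create_audit_table(conn):
    # Hypothetical audit table layout: one row per session run.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS session_audit (
               session_name TEXT NOT NULL,
               inserted INTEGER NOT NULL,
               updated INTEGER NOT NULL,
               deleted INTEGER NOT NULL,
               recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )

def record_session_stats(conn, session_name, row_ops):
    # Tally the operation applied to each row, then store one audit row.
    counts = {"insert": 0, "update": 0, "delete": 0}
    for op in row_ops:
        counts[op] += 1
    conn.execute(
        "INSERT INTO session_audit (session_name, inserted, updated, deleted) "
        "VALUES (?, ?, ?, ?)",
        (session_name, counts["insert"], counts["update"], counts["delete"]),
    )
    conn.commit()
    return counts

conn = sqlite3.connect(":memory:")
create_audit_table(conn)
audit_counts = record_session_stats(
    conn, "s_load_customers", ["insert", "update", "insert", "delete"]
)
audit_row = conn.execute(
    "SELECT session_name, inserted, updated, deleted FROM session_audit"
).fetchone()
```

Once the counts sit in a table like this, they can be queried and ordered per session with ordinary SQL.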

Related

How to execute a processor only when another processor is not executing?

I am inserting/updating data into a table. The database system does not provide an "Upsert" functionality. Thus I am using a staging table for the insert followed by a merge into the "final" table and finally I am truncating the staging table.
This leads to a race condition. If new data is inserted into the staging table between the merge+truncate this data is lost.
How can I make sure this does not happen?
I have tried to model this via Wait/Notify, but this is not a clean solution either. The queue for the "Put Data into staging table" PutDatabaseRecord processor could be filled and "MergeVertica for Insert/Update" ExecuteSQL could still execute.
I would use a MonitorActivity processor with a 60 or 30 sec threshold and use the Inactive output with Continually Send Messages set to "false".
Connect the success relationship of the SQL inserts into staging to your MonitorActivity. This way, if no activity is seen in the last X seconds, it will emit a flowfile that starts your merge process.
Download the template from https://codeshare.io/aJNNkn
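The behaviour this answer relies on (fire once after a quiet period, then stay silent until activity resumes) can be sketched in plain Python. This is not NiFi code: the InactivityMonitor class, its method names and the injected clock are assumptions made for illustration.

```python
import time

class InactivityMonitor:
    """Signals once when no activity has been seen for threshold_seconds."""

    def __init__(self, threshold_seconds, clock=time.monotonic):
        self.threshold = threshold_seconds
        self.clock = clock
        self.last_activity = clock()
        self.notified = False

    def record_activity(self):
        # Called for every successful insert into the staging table.
        self.last_activity = self.clock()
        self.notified = False

    def should_trigger(self):
        # True only the first time the quiet period is exceeded,
        # similar in spirit to not sending messages continually.
        idle = self.clock() - self.last_activity
        if idle >= self.threshold and not self.notified:
            self.notified = True
            return True
        return False

# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
monitor = InactivityMonitor(30, clock=lambda: fake_now[0])
monitor.record_activity()
fake_now[0] = 10.0
after_10s = monitor.should_trigger()
fake_now[0] = 45.0
after_45s = monitor.should_trigger()
repeat_check = monitor.should_trigger()
```

Note that an inactivity trigger narrows the window for the race rather than closing it: an insert can still arrive after the merge has started.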

How to order ETL tasks in Sql Server Data Tools (Integration Services)?

I'm a newbie in ETL processing. I am trying to populate a data mart through ETL and have hit a bump. I have 4 ETL tasks(Each task filling a particular table in the Mart) and the problem is that I need to perform them in a particular order so as to avoid constraint violations like Foreign Key constraints. How can I achieve this? Any help is really appreciated.
This is a snap of my current ETL:
Create a separate Data Flow Task for each table you're populating in the Control Flow, and then simply connect them together in the order you need them to run in. You should be able to just copy/paste the components from your current Data Flow to the new ones you create.
The connections between Tasks in the Control Flow are called Precedence Constraints, and if you double-click on one you'll see that they give you a number of options on how to control the flow of your ETL package. For now though, you'll probably be fine leaving it on the defaults - this will mean that each Data Flow Task will wait for the previous one to finish successfully. If one fails, the next one won't start and the package will fail.
If you want some tables to load in parallel, but then have some later tables wait for all of those to be finished, I would suggest adding a Sequence Container and putting the ones that need to load in parallel into it. Then connect from the Sequence Container to your next Data Flow Task(s) - or even from one Sequence Container to another. For instance, you might want one Sequence Container holding all of your Dimension loading processes, followed by another Sequence Container holding all of your Fact loading processes.
A common pattern goes a step further than using separate Data Flow Tasks. If you create a separate package for every table you're populating, you can then create a parent package, and use the Execute Package Task to call each of the child packages in the correct order. This is fantastic for reusability, and makes it easy for you to manually populate a single table when needed. It's also really nice when you're testing, as you don't need to keep disabling some Tasks or re-running the entire load when you want to test a single table. I'd suggest adopting this pattern early on so you don't have a lot of re-work to do later.
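Working out which loads must come first can be done before wiring anything up in the designer. The sketch below uses Python's graphlib (standard library, Python 3.9+) to derive a load order from foreign key dependencies. The table names are hypothetical, and this only illustrates the ordering logic; it does not interact with SSIS.

```python
from graphlib import TopologicalSorter

# Hypothetical data mart: each table maps to the tables it references
# through foreign keys, which therefore have to be loaded first.
dependencies = {
    "DimCustomer": set(),
    "DimProduct": set(),
    "DimDate": set(),
    "FactSales": {"DimCustomer", "DimProduct", "DimDate"},
}

# One valid sequential order: referenced tables come before FactSales.
load_order = list(TopologicalSorter(dependencies).static_order())

# Group tables into stages whose members could load in parallel.
sorter = TopologicalSorter(dependencies)
sorter.prepare()
stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)
```

Each stage corresponds roughly to one Sequence Container: everything in a stage can run in parallel, and the next stage waits for the whole stage to finish.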

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with, using a three-value status column:
0 = Not indexed
1 = Indexed
2 = Updated
There will be 2 jobs...
Job 1 will select the top X records where status = 0 and put them on a queue like RabbitMQ.
Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and put them on RabbitMQ. Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
Of course we may need an intermediate status for "queued" so that neither job picks up the same record again, but the same job should not run if its previous run hasn't completed. The chances of a queued record being updated are slim to none, since updates only happen at the end of the day, usually the next day.
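A minimal sketch of this status-column flow, using sqlite3 and a plain dict in place of Elasticsearch and the queue. The records table, its columns and the helper names are assumptions for illustration, and the status values follow the way the jobs use them (1 after indexing, 2 after an update).

```python
import sqlite3

NOT_INDEXED, INDEXED, UPDATED = 0, 1, 2

def select_batch(conn, status, batch_size):
    # What each job does: pick up to batch_size rows in the given status.
    return conn.execute(
        "SELECT id, body FROM records WHERE status = ? ORDER BY id LIMIT ?",
        (status, batch_size),
    ).fetchall()

def index_batch(conn, es_index, rows):
    # Stand-in consumer: write rows to the dict acting as the index,
    # then mark the same rows as indexed in the database.
    for row_id, body in rows:
        es_index[row_id] = body
    conn.executemany(
        "UPDATE records SET status = ? WHERE id = ?",
        [(INDEXED, row_id) for row_id, _ in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, "
    "status INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO records (id, body) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)
es_index = {}

# Job 1: index new records (batch of 2, so record 3 waits for the next run).
index_batch(conn, es_index, select_batch(conn, NOT_INDEXED, 2))
# The stored procedure that edits a record also flags it as updated.
conn.execute("UPDATE records SET body = 'a2', status = ? WHERE id = 1", (UPDATED,))
# Job 2: re-index updated records.
index_batch(conn, es_index, select_batch(conn, UPDATED, 2))

statuses = dict(conn.execute("SELECT id, status FROM records").fetchall())
```

The sketch runs the select and the indexing back to back, so it does not need the intermediate "queued" status; with a real queue between the two steps, that extra status is what keeps a second run from selecting the same rows.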
I know there are rivers (but they are being deprecated and are probably not as flexible as ETL).
I would like to bulk insert records from my SQL server to Elasticsearch.
I could write a scheduled batch job of some sort; whether it is ETL or any other tool doesn't matter.
A query such as select from table where id > lastIdInsertedToElasticSearch would load the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in the SQL server? What would be a good pattern to track updated records in the SQL server and then push the updated records in ES? I know ES has document versions when putting the same Id. But can't seem to be able to visualize a pattern.
So IMHO, batch inserts are good for building or re-building the index. So for the first time, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better for the same codebase to update the documents in Elasticsearch, maybe not directly but by notifying some other service or with the help of queues, so that it doesn't waste time when responding to requests (if that's the kind of setup you have).
We have a pretty similar use case of Elasticsearch. We provide search inside our app, which performs search across different categories of data. Some of this data is actually created by the users of our app through our app - so we handle this easily. Our app writes that data to our SQL data store and pushes the same data in RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. So the corresponding rows in our SQL datastore and documents in Elasticsearch share the ID which enables us to update the document.
Another case is where a few types of data that we search on come from a 3rd party service which exposes the data over its HTTP API. The data creation is in our control but we don't have an automated mechanism of updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of this. We have had to tune the cron's schedule because we also have a limited API query quota. But in this case, our data is not really updated that much per day, so this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

Windows Azure Run Once Routine

I'm trying to initialize my data in my Azure Data Tables but I only want this to happen once on the server at startup (i.e. via the WebRole Role Entry OnStart routine). The problem is if I have multiple instances starting up at the same time then potentially either one of those instances can add records to the same table at the same time hence duplicating the data at runtime.
Is there something like an overarching routine for all instances? An application object that I can put a value into and check in each of the instances to see whether the tables have been created or not? A singleton of some sort that Azure exposes?
Cheers
Rob
No, but you could use a Blob lease as a mutex. You could also use a table lock in SQL Azure, if you're using that.
You could also use a Queue, and drop a message in there and then just one role would pick up the message and process it.
You could create a new single instance role that does this job on role start.
To be really paranoid about this and address the event of failure in the middle of writing the data, you can do something even more complex.
A queue message is a great way to ensure transactional capabilities as long as the work you are doing can be idempotent.
Each instance adds a message to a queue.
Each instance polls the queue and, on receiving a message:
    Reads the locking row from the table.
    If the ‘create data state’ value is ‘unclaimed’:
        Attempts to update the row with an ‘in process’ value and a timeout expiration timestamp based on the amount of time needed to create the data.
        If the update is successful, the instance owns the task of creating the data:
            Create the data.
            Update the ‘create data state’ to ‘committed’.
            Delete the message.
        Else, if the update is unsuccessful, the instance does not own the task:
            Just delete the message.
    Else if the ‘create data state’ value is ‘in process’, check whether the current time is past the expiration timestamp:
        That would imply that the ‘in process’ attempt failed.
        So try all over again to set the state to ‘in process’ and delete the incompletely written rows.
        Then try recreating the data, updating the state and deleting the message.
    Else if the ‘create data state’ value is ‘committed’:
        Just delete the queue message, since the work has been done.
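The steps above can be sketched as a small state machine. Everything here is a hypothetical stand-in: LockRow is an in-memory substitute for the locking row, and its compare_and_set method imitates a conditional update (in Azure Table storage this would typically rely on the entity's ETag). The handle_message function returns what happened and whether the caller should delete the queue message.

```python
import threading

class LockRow:
    """In-memory stand-in for the locking row in the table."""

    def __init__(self):
        self._guard = threading.Lock()
        self.state = "unclaimed"
        self.expires_at = 0.0
        self.version = 0

    def read(self):
        with self._guard:
            return self.state, self.expires_at, self.version

    def compare_and_set(self, expected_version, state, expires_at=0.0):
        # Succeeds only if nobody changed the row since it was read.
        with self._guard:
            if self.version != expected_version:
                return False
            self.state, self.expires_at = state, expires_at
            self.version += 1
            return True

def handle_message(row, now, create_data, cleanup, timeout=60.0):
    """Process one queue message; returns (outcome, delete_message)."""
    state, expires_at, version = row.read()
    if state == "committed":
        return "already done", True
    if state == "in process" and now < expires_at:
        # The steps above do not cover this case; this sketch keeps the
        # message so that it can be retried after the timeout.
        return "owned by another instance", False
    # Unclaimed, or a previous attempt timed out: try to claim the task.
    if not row.compare_and_set(version, "in process", now + timeout):
        return "lost the race", True
    if state == "in process":
        cleanup()  # delete rows left behind by the failed attempt
    create_data()
    row.compare_and_set(version + 1, "committed")
    return "created", True

created = []
row = LockRow()
first = handle_message(row, 0.0, lambda: created.append("seed"), lambda: None)
second = handle_message(row, 1.0, lambda: created.append("seed"), lambda: None)
```

Because the claim is a conditional update, two instances that read ‘unclaimed’ at the same moment cannot both succeed; the loser simply deletes its message.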

Referencing object's identity before submitting changes in LINQ

Is there a way of knowing the ID of the identity column of a record inserted via InsertOnSubmit beforehand, e.g. before calling the data source's SubmitChanges?
Imagine I'm populating some kind of hierarchy in the database, but I wouldn't want to submit changes on each recursive call of each child node (e.g. if I had Directories table and Files table and am recreating my filesystem structure in the database).
I'd like to create a Directory object, set its name and attributes, then InsertOnSubmit it into the DataContext.Directories collection, and then reference Directory.ID in its child Files. Currently I need to call SubmitChanges to insert the directory into the database so that the database mapping fills in its ID column. But this creates a lot of transactions and accesses to the database, and I imagine that if I did the inserting in a batch, the performance would be better.
What I'd like to do is to somehow use Directory.ID before committing changes, create all my File and Directory objects in advance and then do one big submit that puts everything into the database. I'm also open to solving this problem via a stored procedure; I assume the performance would be even better if all operations were done directly in the database.
One way to get around this is to not use an identity column. Instead build an IdService that you can use in the code to get a new Id each time a Directory object is created.
You can implement the IdService by having a table that stores the last id used. When the service starts up have it grab that number. The service can then increment away while Directory objects are created and then update the table with the new last id used at the end of the run.
Alternatively, and a bit safer, when the service starts up have it grab the last id used and then update the last id used in the table by adding 1000 (for example). Then let it increment away. If it uses 1000 ids then have it grab the next 1000 and update the last id used table. Worst case is you waste some ids, but if you use a bigint you aren't ever going to care.
Since the Directory id is now controlled in code you can use it with child objects like Files prior to writing to the database.
Simply putting a lock around id acquisition makes this safe to use across multiple threads. I've been using this in a situation like yours. We're generating a ton of objects in memory across multiple threads and saving them in batches.
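A minimal sketch of the block-reserving variant, written in Python rather than C# for brevity. The IdService class shape, the block size and the dict standing in for the "last id used" table are assumptions for illustration; with a real table, reserving a block would need to be an atomic update in the database.

```python
import threading

class IdService:
    """Hands out ids from blocks reserved in a backing store."""

    def __init__(self, store, block_size=1000):
        self._store = store          # stand-in for the last-id table
        self._block_size = block_size
        self._lock = threading.Lock()
        self._next = 1
        self._limit = 0              # last id of the current block

    def _reserve_block(self):
        # Read the last id used and advance it by a whole block up front.
        last_used = self._store["last_id"]
        self._store["last_id"] = last_used + self._block_size
        self._next = last_used + 1
        self._limit = last_used + self._block_size

    def next_id(self):
        # The lock makes id acquisition safe across threads.
        with self._lock:
            if self._next > self._limit:
                self._reserve_block()
            value = self._next
            self._next += 1
            return value

store = {"last_id": 0}
service = IdService(store, block_size=100)
issued = []

def worker():
    for _ in range(250):
        issued.append(service.next_id())

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each Directory object can take its id from next_id() at construction time, so child File objects can reference it before anything is written to the database.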
This blog post will give you a good start on saving batches in Linq to SQL.
Not sure off the top of my head whether there is a way to run a straight SQL query in LINQ, but this query will return the current identity value of the specified table.
USE [database];
GO
DBCC CHECKIDENT ("schema.table", NORESEED);
GO
