How to skip the target write when the record count is zero in Informatica Cloud (IICS)?

I have a use case that requires an incremental data ingestion pipeline from a database to AWS S3. I have created the pipeline and it works fine, except for the scenario where no incremental data is found.
When the record count is zero, it still writes a header-only Parquet file. I want to skip the target write when there are no incremental records.
How can I implement this in IICS?
I have already tried a Router transformation with the condition that the target is written only if the record count is > 0, but it still does not work.

First of all, the target file is created before any data is read from the source. This ensures that the process has write access to the target location. So even if there is no data to store, an empty file is created.
The possible workarounds are:
Have a command task check the number of lines in the output file and delete it if it contains only a header. This requires the file to be created locally, verified, and then uploaded to S3, e.g. using a Mass Ingestion task, with all steps invoked sequentially via a taskflow.
Have a session first check whether any data is available, and run the data extraction only if there is.
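The first workaround (verify the local file, then upload or discard it) could be sketched roughly as below. This is only an illustration, not IICS functionality: it assumes a delimited text file with a single header line (for Parquet the row count would have to come from the file metadata instead), and `upload` is a hypothetical callable standing in for whatever step actually pushes the file to S3.

```python
import os
from typing import Callable


def has_data_rows(path: str) -> bool:
    """Return True if the text file has at least one non-empty line after the header."""
    with open(path, "r", encoding="utf-8") as handle:
        next(handle, None)  # skip the header line, if there is one
        return any(line.strip() for line in handle)


def upload_or_discard(path: str, upload: Callable[[str], None]) -> bool:
    """Upload the file if it has data rows; otherwise delete the local copy.

    `upload` is a hypothetical placeholder for the real S3 upload step.
    Returns True if the file was uploaded.
    """
    if has_data_rows(path):
        upload(path)
        return True
    os.remove(path)
    return False
```

In a taskflow, a check like this would sit between the mapping task that writes the local file and the task that uploads it.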

Related

CSV Blob Sink - Skip Writing File when 0 Rows Present

This is a relatively simple problem with (I'm hoping) a similarly-simple solution.
In my ADF ETLs, any time there's a known and expected yet unrecoverable row-based error, I don't want my full ETL to fail. Instead, I'd rather pipe those rows off to a log, which I can then pick up at the end of the ETL for manual inspection. To do this, I use conditional splits.
Most of the time, there shouldn't be any rows like this. When this is the case, I don't want my blob sink to write a file. However, the current behavior writes a file no matter what -- it's just that the file only contains the table header.
Is there a way to skip writing anything to a blob sink when there are no input rows?
Edit: Somehow I forgot to specify -- I'm specifically referring to a Mapping Data Flow with a blob sink.
You can use a Lookup activity (with "First row only" unchecked) to fetch your table data first. Then use an If Condition activity to check the count of the Lookup activity's output. If the count is > 0, execute the next activity (or data flow).

Build pipeline from Oracle DB to AWS DynamoDB

I have an Oracle instance running on a standalone EC2 VM, and I want to do two things.
1) Copy the data from one of my Oracle tables into a cloud directory that can be read by DynamoDB. This will only be done once.
2) Then daily I want to append any changes to that source table into the DynamoDB table as another row that will share an id so I can visualize how that row is changing over time.
Ideally I'd like a solution as easy as piping the results of a SQL query into a program that dumps the data into a cloud file system (S3, HDFS?). I will then want to convert that data into a format that can be read by DynamoDB.
So I need these things:
1) A transport device. I want to be able to type something like this on the command line:
sqlplus ... "SQL Query" | transport --output_path --output_type etc etc
2) For the path I need a cloud file system. S3 looks like the obvious choice since I want a turnkey solution here.
3) This last part is a nice to have because I can always use a temp directory to hold my raw text and convert it in another step.
I assume the "cloud directory" or "cloud file system" you are referring to is S3? I don't see how it could be anything else in this context, but you are using very vague terms.
Triggering the DynamoDB insert whenever you copy a new file to S3 is pretty simple: just have S3 trigger a Lambda function that processes the data and inserts it into DynamoDB. I'm not clear on how you are going to get the data into S3, though. If you are just running a cron job to periodically query Oracle and dump some data to a file, which you then copy to S3, then that should work.
Be aware that you can't append to a file on S3; you would need to write the entire file each time you push new data to S3. If you want to stream the data somehow, then using Kinesis instead of S3 might be a better option.
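As a rough sketch of the Lambda side of this, the function below pulls the bucket and key out of each record in the S3 notification event. The event layout used here is the standard S3 notification shape; the handler name and the DynamoDB write are placeholders, since the table layout is not described in the question.

```python
from typing import Dict, List, Tuple
from urllib.parse import unquote_plus


def objects_from_s3_event(event: Dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for every record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Lambda entry point; the DynamoDB write itself is left as a placeholder."""
    for bucket, key in objects_from_s3_event(event):
        # A real function would read the object and insert its rows into DynamoDB here.
        print(f"would load s3://{bucket}/{key} into DynamoDB")
```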

Determine deltas between new extract and data extracted by previous run and Generate three separate CSV feed files based on the deltas

Hi, I have a requirement that I need to develop in Informatica.
The requirements are:
1) Determine deltas between the new extract and the data extracted by the previous run
2) Generate three separate CSV feed files based on the deltas
Could you please let me know how to compute these deltas by comparing the data from the previous run with the data from the new run,
and how to write the delta data into a .csv file that Informatica creates automatically for every run?
Instead of writing the data into a target table, it should write the data into these automatically created .csv or .txt files.
Does Informatica create .csv or .txt files automatically and save the data in them for every run? If so, could you please let me know how?
The information you are seeking is widely available on the Internet and can be found with a little research. However, let me try to chip in:
If the structure of the file remains the same between two runs, create two staging tables, one for the previous run and one for the current run. Do a MINUS between the two to capture the delta. After capturing the delta, move the current data to previous, and truncate current each time before you load into it.
Use a flat file target instead of a table.
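To make the staging-table idea concrete, here is a small sketch using SQLite from Python rather than Informatica. The table names (`current_stage`, `previous_stage`) and columns (`id`, `payload`) are made up for illustration, and SQLite's `EXCEPT` plays the role of Oracle's `MINUS`.

```python
import csv
import sqlite3


def capture_delta(conn: sqlite3.Connection, out_path: str) -> int:
    """Write rows present in current_stage but not in previous_stage to a CSV file.

    Table and column names are hypothetical. Returns the number of delta rows.
    """
    rows = conn.execute(
        "SELECT id, payload FROM current_stage "
        "EXCEPT SELECT id, payload FROM previous_stage "
        "ORDER BY 1"
    ).fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "payload"])
        writer.writerows(rows)
    # Rotate: current becomes previous, and current is emptied for the next load.
    with conn:
        conn.execute("DELETE FROM previous_stage")
        conn.execute("INSERT INTO previous_stage SELECT id, payload FROM current_stage")
        conn.execute("DELETE FROM current_stage")
    return len(rows)
```

This captures only new or changed rows; rows that disappeared from the source would need the same query with the two tables swapped, run before the rotation.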

How to keep a state in Hadoop jobs?

I'm working on a Hadoop program that is scheduled to run once a day. It takes a bunch of JSON documents, and each document has a timestamp that shows when the document was added. My program should only process documents that were added since its last run. So I need to keep a state, namely a timestamp showing the last time my Hadoop job ran. I was thinking of storing this state in SQL Server and querying it in the driver program of my job. Is this a good solution, or is there a better one?
P.S. My Hadoop job is running on HDInsight. Given that, is it still possible to query SQL Server from my driver program?
We solved this problem for our workflows running in AWS (Amazon Web Services), for data stored in S3.
Our setup:
Data store: AWS S3
Data ingestion mechanism: Flume
Workflow management: Oozie
Storage for file status: MySQL
Problem:
We were ingesting data into Amazon S3 using Flume. All the ingested data was in the same folder. (S3 is a key/value store and has no concept of a folder. Here "folder" means that all the data had the same key prefix. For example, for /tmp/1.txt, /tmp/2.txt, etc., the key prefix is /tmp/.)
We had an ETL workflow that was scheduled to run once an hour. But since all the data was ingested into the same folder, we had to distinguish between processed and unprocessed files.
For example, suppose the data ingested during the first hour is:
/tmp/1.txt
/tmp/2.txt
When the workflow runs for the first time, it should process the data from "1.txt" and "2.txt" and mark them as processed.
If the data ingested during the second hour is:
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
then the data in the folder after two hours will be:
/tmp/1.txt
/tmp/2.txt
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Since "1.txt" and "2.txt" were already processed and marked as processed, the second run should process only "3.txt", "4.txt" and "5.txt".
Solution:
We developed a library (let's call it FileManager) for managing the list of processed files. We plugged this library into the Oozie workflow as a Java action. This was the first step in the workflow.
This library also took care of ignoring files that were still being written by Flume. While Flume was writing data into a file, the file had a "_current" suffix. Such files were ignored until they had been written completely.
The ingested files were generated with a timestamp as a suffix, for example "hourly_feed.1234567". So the file names were in ascending order of creation.
To get the list of unprocessed files, we used S3's ability to list keys starting from a marker. (For example, if you have 10,000 files in a folder and you specify the name of the 5,000th file as the marker, S3 returns files 5,001 to 10,000.)
We had the following three states for each file:
SUCCESS - Files which were successfully processed
ERROR - Files which were picked up for processing, but there was an error in processing these files. Hence, these files need to be picked up again for processing
IN_PROGRESS - Files which have been picked up for processing and are currently being processed by a job
For each file, we stored the following details in the MySQL DB:
File Name
Last Modified Time - We used this to handle some corner cases
Status of the file (IN_PROGRESS, SUCCESS, ERROR)
The FileManager exposed the following interfaces:
GetLatestFiles: Return the list of latest Un-Processed files
UpdateFileStatus: After processing the files, update the status of the files
The following steps were used to identify the files that had not yet been processed:
1) Query the database (MySQL) for the last file with a status of SUCCESS (query: order by created desc).
2) If the first step returns a file, query S3 with the marker set to that last successfully processed file. This returns all the files ingested after it.
3) Also query the DB for any files in ERROR status. These files need to be re-processed, because the previous workflow did not process them successfully.
4) Return the list of files obtained from Steps 2 and 3 (before returning them, mark their status as IN_PROGRESS).
5) After the job completes successfully, update the status of all the processed files to SUCCESS. If there was an error in processing the files, update the status of all the files to ERROR (so that they can be picked up for processing next time).
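The selection logic described above can be sketched in memory as follows. This is not the original library: a sorted Python list stands in for the S3 listing, a dict stands in for the MySQL table, and the "last SUCCESS file" is taken to be the largest file name, which relies on the names being in ascending creation order. Skipping files already marked IN_PROGRESS is an added assumption.

```python
from bisect import bisect_right
from typing import Dict, List, Optional

SUCCESS, ERROR, IN_PROGRESS = "SUCCESS", "ERROR", "IN_PROGRESS"


class FileManager:
    """In-memory stand-in: `keys` plays the role of S3, `status` the role of MySQL."""

    def __init__(self, keys: List[str]) -> None:
        self.keys = sorted(keys)           # S3 lists keys in ascending order
        self.status: Dict[str, str] = {}   # file name -> status

    def _list_after(self, marker: Optional[str]) -> List[str]:
        # Emulates an S3 listing with a marker: only keys that sort after `marker`.
        if marker is None:
            return list(self.keys)
        return self.keys[bisect_right(self.keys, marker):]

    def get_latest_files(self) -> List[str]:
        # Last successfully processed file (largest name, by assumption).
        done = [k for k, s in self.status.items() if s == SUCCESS]
        marker = max(done) if done else None
        # Files ingested after the marker, skipping ones Flume is still writing
        # and ones that already have a status (assumption: avoids double pick-up).
        new_files = [k for k in self._list_after(marker)
                     if k not in self.status and not k.endswith("_current")]
        # Files that failed earlier and need re-processing.
        failed = [k for k, s in self.status.items() if s == ERROR]
        batch = sorted(set(new_files) | set(failed))
        for key in batch:
            self.status[key] = IN_PROGRESS
        return batch

    def update_file_status(self, files: List[str], ok: bool) -> None:
        # Record the outcome of the run for every file in the batch.
        for key in files:
            self.status[key] = SUCCESS if ok else ERROR
```

A run would call `get_latest_files()`, process the returned batch, and then call `update_file_status(batch, ok=...)` with the outcome.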
We used Oozie for workflow management. The Oozie workflow had the following steps:
Step 1: Fetch next set of files to be processed, mark each of their state as IN_PROGRESS and pass them to the next stage
Step 2: Process the files
Step 3: Update the status of the processing (SUCCESS or ERROR)
De-duplication:
When you implement such a library, records can be duplicated (in some corner cases, the same file may be picked up twice for processing). We implemented de-duplication logic to remove duplicate records.
You can name each result document using its date-time; your program can then decide which documents to process based on their names.
Having the driver program check the last-run timestamp is a good approach, but for storing the last-run timestamp you can use a temporary file in HDFS.
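A minimal sketch of the state-file approach is shown below. It uses the local file system in place of HDFS, and the `timestamp` field name and numeric timestamps are assumptions; the documents in the question may store the value differently.

```python
import os
from typing import Dict, List


def read_last_run(state_path: str) -> float:
    """Return the stored last-run timestamp, or 0.0 if no state file exists yet."""
    if not os.path.exists(state_path):
        return 0.0
    with open(state_path, "r", encoding="utf-8") as handle:
        return float(handle.read().strip())


def write_last_run(state_path: str, timestamp: float) -> None:
    """Persist the timestamp, swapping the new state file into place in one step."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(repr(timestamp))
    os.replace(tmp_path, state_path)


def select_new_documents(docs: List[Dict], state_path: str) -> List[Dict]:
    """Return documents newer than the last run and advance the stored timestamp."""
    last_run = read_last_run(state_path)
    fresh = [doc for doc in docs if doc["timestamp"] > last_run]
    if fresh:
        write_last_run(state_path, max(doc["timestamp"] for doc in fresh))
    return fresh
```

Advancing the state to the newest document timestamp, rather than to the wall-clock time of the run, avoids skipping documents that arrive while the job is running. In a real job the state should be written only after processing succeeds.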

Spring XD Batch Job Incremental

I created a batch job to load data from a CSV file into a JDBC database using the filejdbc module, and it worked properly. But when I scheduled the batch to run every 5 minutes, it did not do an incremental load; it loaded all the data again. Is there any feature for scheduling the batch with incremental load?
Is the solution to run the batch once and then create a stream to do the incremental load? Will the stream load all the data again, or will it continue from a certain point?
Please explain how I can achieve incremental load using Spring XD.
Thanks,
Moha.
I suppose what is missing is the concept of 'state': the filejdbc module does not seem to know where the last import stopped. I do something similar, but I use a custom batch job and a meta store to keep track of where the last load stopped; that is where the next incremental load picks up.
Since you're using a module that came with spring-xd itself, you may not have this flexibility, but you may have two options:
a- Your destination table can define unique fields that prevent duplicates. That way, even if the module tries to load ALL the data again, only new rows get inserted. This assumes that the module uses 'insert ignore' (or something similar) and not just a basic insert, which would throw an error/exception. This, I must say, will become non-optimal pretty quickly.
b- If it's an option, write a module that deletes the file after it has been uploaded into the db. You can construct a complex stream that first does your data load and then deletes the file.
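Option a can be illustrated with SQLite from Python, outside Spring XD. The `people` table and its columns are invented for the example, and `INSERT OR IGNORE` is SQLite's spelling of the idea (MySQL uses `INSERT IGNORE`); whether the filejdbc module can be made to issue such a statement is not something this sketch shows.

```python
import sqlite3
from typing import Iterable, Tuple


def load_rows(conn: sqlite3.Connection, rows: Iterable[Tuple[int, str]]) -> int:
    """Insert rows, skipping any whose primary key already exists.

    The `people` table is hypothetical. Returns the number of rows actually inserted.
    """
    before = conn.total_changes
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO people (id, name) VALUES (?, ?)", rows
        )
    return conn.total_changes - before
```

Reloading the whole file every five minutes still rereads every row, which is why this approach stops scaling as the file grows.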
