I have a use case to create an incremental data ingestion pipeline from a database to AWS S3. I have built the pipeline and it works fine, except for one scenario: when no incremental data is found.
With a zero record count it still writes a header-only (Parquet) file. I want to skip the target write when there are no incremental records.
How can I implement this in IICS?
I have already tried a Router transformation with the condition "write to target only if record count > 0", but it is still not working.
First of all: the target file gets created even before any data is read from the source. This is to ensure the process has write access to the target location. So even if there is no data to store, an empty file will get created.
The possible ways out here are to:
Have a command task check the number of lines in the output file and delete it if there is just a header (a sketch of this check follows the list). This requires the file to be created locally, verified, and uploaded to S3 afterwards, e.g. using a Mass Ingestion task, all invoked sequentially via a taskflow.
Have a session that first checks whether any data is available, and only then runs the data extraction.
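For the first option, the check itself is trivial. Below is a minimal sketch of the logic in Java, assuming the locally staged output is a delimited flat file with a single header line; the path is a placeholder, and in practice the command task would run an equivalent shell one-liner instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch: delete the staged output if it only contains a header line.
public class HeaderOnlyFileCheck {
    public static void main(String[] args) throws IOException {
        // Placeholder path; pass the real staging file as the first argument.
        Path output = Path.of(args.length > 0 ? args[0] : "/data/staging/incremental_out.csv");
        long lineCount;
        try (var lines = Files.lines(output)) {
            lineCount = lines.count();      // a header-only file has exactly one line
        }
        if (lineCount <= 1) {
            Files.deleteIfExists(output);   // nothing to upload, drop the file
            System.out.println("No incremental records found - deleted " + output);
        }
    }
}
```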
I am a bit puzzled here. I need to implement a task similar to the following scenario with Spring Batch:
Read Person from the repository ==> I can use RepositoryItemReader
(a) Generate a CSV file (FlatFileItemWriter) and (b) save the CSV file in the DB with the generation date (I can use RepositoryItemWriter)
But I am struggling to understand how I can pass the CSV file generated in 2a to the DB save in 2b.
Consider that the CSV file has roughly 1000+ Person records processed for a single day.
Is it possible to merge 2a and 2b? I thought about CompositeItemWriter, but since we are combining 1000+ persons into the CSV file, it won't work.
Using a CompositeItemWriter won't work, as you would be trying to write an incomplete file to the database for each chunk.
I would not merge 2a and 2b. Make each step do one thing (and do it well):
Step 1 (chunk-oriented tasklet): read persons and generate the file
Step 2 (simple tasklet): save the file in the database
Step 1 can use the job execution context to pass the name of the generated file to step 2. You can find an example in the Passing Data To Future Steps section. Moreover, with this setup, step 2 will not run if step 1 fails (which makes sense to me).
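A minimal configuration sketch of this setup, assuming Spring Batch 5 style builders: Person, the reader and writer beans, CsvArchive and CsvArchiveRepository are placeholders, and the file path is hard-coded for brevity (it should match the resource configured on the FlatFileItemWriter).

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDate;

import org.springframework.batch.core.*;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.data.RepositoryItemReader;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class PersonExportJobConfig {

    private static final String FILE_KEY = "csvFilePath";
    private static final String CSV_PATH = "/tmp/persons.csv"; // must match the writer's resource (placeholder)

    // Step 1: chunk-oriented step that reads persons and writes the CSV file.
    @Bean
    Step generateCsvStep(JobRepository jobRepository, PlatformTransactionManager txManager,
                         RepositoryItemReader<Person> reader, FlatFileItemWriter<Person> writer) {
        return new StepBuilder("generateCsvStep", jobRepository)
                .<Person, Person>chunk(100, txManager)
                .reader(reader)
                .writer(writer)
                // record the generated file path in the step execution context
                .listener(new StepExecutionListener() {
                    @Override
                    public void beforeStep(StepExecution stepExecution) {
                        stepExecution.getExecutionContext().putString(FILE_KEY, CSV_PATH);
                    }
                })
                .listener(promotionListener())
                .build();
    }

    // Promotes csvFilePath from the step execution context to the job execution context.
    @Bean
    ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] { FILE_KEY });
        return listener;
    }

    // Step 2: simple tasklet that reads the file path from the job context
    // and stores the file content in the database (CsvArchiveRepository is a placeholder).
    @Bean
    Step saveCsvStep(JobRepository jobRepository, PlatformTransactionManager txManager,
                     CsvArchiveRepository archiveRepository) {
        Tasklet tasklet = (contribution, chunkContext) -> {
            String path = (String) chunkContext.getStepContext()
                    .getJobExecutionContext().get(FILE_KEY);
            byte[] content = Files.readAllBytes(Paths.get(path));
            archiveRepository.save(new CsvArchive(LocalDate.now(), content));
            return RepeatStatus.FINISHED;
        };
        return new StepBuilder("saveCsvStep", jobRepository).tasklet(tasklet, txManager).build();
    }

    @Bean
    Job exportJob(JobRepository jobRepository, Step generateCsvStep, Step saveCsvStep) {
        return new JobBuilder("exportJob", jobRepository)
                .start(generateCsvStep)
                .next(saveCsvStep) // step 2 only runs if step 1 completed successfully
                .build();
    }
}
```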
Hi, I have a requirement which I need to develop in Informatica.
The requirement is:
1) Determine deltas between the new extract and the data extracted by the previous run
2) Generate three separate CSV feed files based on the deltas
Could you please let me know how to do this delta comparison between the data from the previous run and the new run?
And how do I write this delta data into .csv files which Informatica creates automatically for every run?
Instead of writing the data into a target table, it should write the data into these automatically created .csv or .txt files.
Does Informatica create .csv or .txt files automatically and save the data in them for every run? If so, could you please let me know how?
The information you are seeking is widely available on the Internet and can be found with a little research. However, let me try to chip in.
If the structure of the file remains the same between two runs, create two staging tables: one for the previous run and one for the current run. Do a MINUS between the two to capture the delta. Move current to previous after the delta is captured, and truncate current before every load into it.
Use a flat file target instead of a table.
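Outside of Informatica, the staging-table delta boils down to a couple of SQL statements. Here is a hedged JDBC sketch to illustrate the idea; the table names stg_current/stg_previous, the column list, the connection URL and the naive CSV writing are all assumptions. Inside Informatica the MINUS would typically be a SQL override on the source qualifier and the CSV would simply be a flat file target.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of the staging-table delta capture described above (all names are placeholders).
public class DeltaCapture {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/db", "user", "pwd");
             Statement st = con.createStatement();
             PrintWriter csv = new PrintWriter("delta_" + System.currentTimeMillis() + ".csv")) {

            // 1) Delta = rows in the current extract that were not in the previous extract.
            //    MINUS is Oracle syntax; use EXCEPT on other databases.
            ResultSet rs = st.executeQuery(
                    "SELECT id, name, updated_at FROM stg_current " +
                    "MINUS " +
                    "SELECT id, name, updated_at FROM stg_previous");
            csv.println("id,name,updated_at");   // naive CSV writing, no escaping
            while (rs.next()) {
                csv.println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3));
            }

            // 2) Move current to previous and empty current for the next run.
            st.executeUpdate("TRUNCATE TABLE stg_previous");
            st.executeUpdate("INSERT INTO stg_previous SELECT * FROM stg_current");
            st.executeUpdate("TRUNCATE TABLE stg_current");
        }
    }
}
```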
I'm working on a Hadoop program which is scheduled to run once a day. It takes a bunch of JSON documents, each of which has a timestamp showing when the document was added. My program should only process the documents added since its last run, so I need to keep state: a timestamp recording the last time my Hadoop job ran. I was thinking of storing this state in SQL Server and querying it in the driver program of my job. Is this a good solution, or might there be a better one?
P.S. My Hadoop job is running on HDInsight. Having said that, is it still possible to query SQL Server from my driver program?
We solved this problem for our workflows running in AWS (Amazon Web Services), for data stored in S3.
Our setup:
Data store: AWS S3
Data ingestion mechanism: Flume
Workflow management: Oozie
Storage for file status: MySQL
Problem:
We were ingesting data into Amazon S3 using Flume. All the ingested data was in the same folder (S3 is a key/value store and has no concept of folders; here "folder" means all the data had the same key prefix, e.g. /tmp/1.txt, /tmp/2.txt, where /tmp/ is the key prefix).
We had an ETL workflow which was scheduled to run once an hour. But since all the data was ingested into the same folder, we had to distinguish between the processed and unprocessed files.
For example, for the 1st hour, the data ingested is:
/tmp/1.txt
/tmp/2.txt
When the workflow starts for the first time, it should process data from "1.txt" and "2.txt" and mark them as Processed.
If, for the second hour, the data ingested is:
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Then, the total data in the folder after 2 hours will be:
/tmp/1.txt
/tmp/2.txt
/tmp/3.txt
/tmp/4.txt
/tmp/5.txt
Since, "1.txt" and "2.txt" were already processed and marked as Processed, during the second run, the job should just process "3.txt", "4.txt" and "5.txt".
Solution:
We developed a library (let's call it FileManager) for managing the list of processed files. We plugged this library into the Oozie workflow as a Java action. This was the first step in the workflow.
This library also took care of ignoring the files that were currently being written by Flume. While Flume is writing data into a file, the file has a "_current" suffix, so those files were ignored for processing until they were completely written.
The ingested files were generated with a timestamp as a suffix, e.g. "hourly_feed.1234567", so the file names were in ascending order of creation.
To get the list of unprocessed files, we used S3's ability to list objects starting from a marker (for example, if you have 10,000 files in a folder and specify the marker as the name of the 5,000th file, S3 will return files 5,001 to 10,000).
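Here is a sketch of that marker-based listing with the AWS SDK for Java v1; the bucket, prefix, and the "_current" filter follow the description above, and the names are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Sketch: list keys added after the last successfully processed key.
public class S3MarkerListing {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public List<String> listUnprocessed(String bucket, String prefix, String lastProcessedKey) {
        List<String> keys = new ArrayList<>();
        ListObjectsRequest request = new ListObjectsRequest()
                .withBucketName(bucket)
                .withPrefix(prefix)            // e.g. "tmp/"
                .withMarker(lastProcessedKey); // listing starts after this key
        ObjectListing listing = s3.listObjects(request);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                String key = summary.getKey();
                if (!key.endsWith("_current")) {   // skip files Flume is still writing
                    keys.add(key);
                }
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing); // fetch the next page of results
        }
        return keys;
    }
}
```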
We had the following 3 states for each of the files:
SUCCESS - Files which were successfully processed
ERROR - Files which were picked up for processing, but there was an error in processing these files. Hence, these files need to be picked up again for processing
IN_PROGRESS - Files which have been picked up for processing and are currently being processed by a job
For each file, we stored the following details in the MySQL DB:
File Name
Last Modified Time - We used this to handle some corner cases
Status of the file (IN_PROGRESS, SUCCESS, ERROR)
The FileManager exposed the following interfaces:
GetLatestFiles: Return the list of latest unprocessed files
UpdateFileStatus: After processing the files, update the status of the files
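In Java terms, the library's surface looked roughly like this (a sketch; the exact signatures are assumptions):

```java
import java.util.List;

// File processing states tracked in MySQL.
enum FileStatus { SUCCESS, ERROR, IN_PROGRESS }

// Sketch of the FileManager surface described above.
interface FileManager {

    // Return the latest unprocessed files (new files after the last SUCCESS, plus ERROR files).
    List<String> getLatestFiles(String bucket, String prefix);

    // After processing, record the outcome for each file.
    void updateFileStatus(List<String> fileKeys, FileStatus status);
}
```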
The following steps were used to identify the files which were not yet processed:
Query the database (MySQL) for the last file with a status of SUCCESS (query: order by created desc).
If the first step returns a file, query S3 with the marker set to that last successfully processed file. This will return all the files ingested after it.
Also query the DB to check whether there are any files in ERROR status. These files need to be re-processed, because the previous workflow did not process them successfully.
Return the list of files obtained from steps 2 and 3 (before returning them, mark their status as IN_PROGRESS).
After the job completes successfully, update the status of all the processed files to SUCCESS. If there was an error in processing the files, update their status to ERROR (so that they can be picked up for processing next time).
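The database side of those steps is plain SQL. Below is a sketch with JDBC, assuming a file_status table with columns file_name, last_modified and status, and a unique key on file_name; the schema and queries are assumptions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Sketch of the MySQL queries behind GetLatestFiles / UpdateFileStatus.
class FileStatusDao {

    private final Connection con;

    FileStatusDao(Connection con) {
        this.con = con;
    }

    // Step 1: last successfully processed file (used as the S3 marker).
    // File names carry a timestamp suffix, so they sort in creation order.
    String lastSuccessfulFile() throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT file_name FROM file_status WHERE status = 'SUCCESS' " +
                "ORDER BY file_name DESC LIMIT 1");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString(1) : null;
        }
    }

    // Step 3: files that failed last time and must be re-processed.
    List<String> errorFiles() throws Exception {
        List<String> files = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT file_name FROM file_status WHERE status = 'ERROR'");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                files.add(rs.getString(1));
            }
        }
        return files;
    }

    // Steps 4/5: mark a batch of files with a new status (IN_PROGRESS, SUCCESS, ERROR).
    void markFiles(List<String> fileNames, String status) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO file_status (file_name, status) VALUES (?, ?) " +
                "ON DUPLICATE KEY UPDATE status = VALUES(status)")) {
            for (String name : fileNames) {
                ps.setString(1, name);
                ps.setString(2, status);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```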
We used Oozie for workflow management. The Oozie workflow had the following steps:
Step 1: Fetch the next set of files to be processed, mark each of them as IN_PROGRESS, and pass them to the next stage
Step 2: Process the files
Step 3: Update the status of the processing (SUCCESS or ERROR)
De-duplication:
When you implement such a library, there is a possibility of duplicate records (in some corner cases, the same file may be picked up twice for processing). We implemented de-duplication logic to remove the duplicate records.
You can rename the result document using a date-time, so that your program can process documents based on their names.
Having the driver program check the last-run timestamp is a good approach, but for storing the last-run timestamp you can use a small file on HDFS instead.
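For the HDFS-file approach, here is a sketch using the Hadoop FileSystem API; the state file path is a placeholder.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: keep the job's last-run timestamp in a small HDFS file instead of SQL Server.
public class LastRunState {

    private static final Path STATE_FILE = new Path("/app/state/last_run_ts"); // placeholder

    // Read the previous run's timestamp (epoch millis); 0 if the job has never run.
    public static long readLastRun(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(STATE_FILE)) {
            return 0L;
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(STATE_FILE), StandardCharsets.UTF_8))) {
            return Long.parseLong(reader.readLine().trim());
        }
    }

    // Overwrite the state file with the current run's timestamp.
    public static void writeLastRun(Configuration conf, long timestamp) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(STATE_FILE, true)) { // true = overwrite
            out.write(Long.toString(timestamp).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```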
I have a requirement to write multiple files using Spring Batch. The first file will be written based on the data from a database table. The second file will contain just the number of records written to the first file. How can I create the second file? I am not sure whether org.springframework.batch.item.file.MultiResourceItemWriter is an option for me, as I think it splits the data itself across multiple files, writing chunks of data into the multiple files. Correct me if I am wrong here.
Please do suggest some options with sample code if possible.
You have a couple of options:
You can use a CompositeItemWriter, which calls a collection of item writers in a defined order, so you can define one item writer that writes records based on the data from the DB and a second one that counts the records and writes the count to another file.
You can write the data to a file in the first step, finish the whole file and save it somewhere; if the record count is all you need, save it to the step/job execution context (see Common Batch Patterns, section 11.8 Passing Data to Future Steps) and read it in a new Tasklet that saves the count to the new file.
If you want to go with option 1, which I think is the right choice, you can check this example of a batch job configuration with CompositeItemWriter.
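Here is a sketch of option 2, assuming Spring Batch 5 style builders: a StepExecutionListener registered on the first step copies its write count into the job execution context, and a tasklet step writes the count to the second file (the output path is a placeholder).

```java
import java.nio.file.Files;
import java.nio.file.Paths;

import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class RecordCountFileConfig {

    // Register this listener on the step that writes the data file: after that step,
    // it stores the write count in the job execution context for later steps.
    @Bean
    StepExecutionListener writeCountListener() {
        return new StepExecutionListener() {
            @Override
            public ExitStatus afterStep(StepExecution stepExecution) {
                stepExecution.getJobExecution().getExecutionContext()
                        .putLong("recordCount", stepExecution.getWriteCount());
                return stepExecution.getExitStatus();
            }
        };
    }

    // Second step: tasklet that writes the count to the second file (placeholder path).
    @Bean
    Step writeCountFileStep(JobRepository jobRepository, PlatformTransactionManager txManager) {
        Tasklet tasklet = (contribution, chunkContext) -> {
            long count = (long) chunkContext.getStepContext()
                    .getJobExecutionContext().get("recordCount");
            Files.writeString(Paths.get("/tmp/record_count.txt"), Long.toString(count));
            return RepeatStatus.FINISHED;
        };
        return new StepBuilder("writeCountFileStep", jobRepository)
                .tasklet(tasklet, txManager)
                .build();
    }
}
```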