Determine deltas between new extract and data extracted by previous run, and generate three separate CSV feed files based on the deltas - etl

Hi, I have a requirement that I need to develop in Informatica.
The requirement is:
1) Determine deltas between the new extract and the data extracted by the previous run.
2) Generate three separate CSV feed files based on the deltas.
Could you please let me know how to do this delta comparison between the data from the previous run and the new run?
Also, how do I write this delta data into a .csv file that is created automatically by Informatica for every run?
Instead of writing the data into a target table, it should write the data into these automatically created .csv or .txt files.
Does Informatica create .csv or .txt files automatically and save the data in them for every run? If so, could you please let me know how?

The information you are seeking is widely available on the Internet and can be found with a little research. However, let me try to chip in.
If the structure of the file remains the same between two runs, create two staging tables, one for the previous extract and one for the current extract. Do a MINUS between the two to capture the delta. After the delta is captured, move the current data to the previous table, and truncate the current table every time before you load into it.
Use a flat file target instead of a table.
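Purely to illustrate the delta logic (the connection details, staging table names and columns below are assumptions, not your actual schema), the same idea in plain JDBC looks roughly like this:

    import java.io.PrintWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DeltaToCsv {
        public static void main(String[] args) throws Exception {
            // Assumed connection details, staging table names and columns.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//dbhost:1521/ORCL", "etl_user", "etl_pwd");
                 Statement stmt = conn.createStatement();
                 // New/changed rows: everything in the current extract that was not in the previous one.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT cust_id, cust_name, status FROM stg_current " +
                         "MINUS " +
                         "SELECT cust_id, cust_name, status FROM stg_previous");
                 PrintWriter csv = new PrintWriter("delta_feed.csv")) {

                csv.println("cust_id,cust_name,status");   // header row
                while (rs.next()) {
                    csv.printf("%s,%s,%s%n",
                            rs.getString("cust_id"), rs.getString("cust_name"), rs.getString("status"));
                }
            }
            // After capturing the delta: INSERT INTO stg_previous SELECT * FROM stg_current,
            // then TRUNCATE stg_current before the next load.
        }
    }

In the actual mapping, the same MINUS can sit in a Source Qualifier SQL override (or you can compare against the previous extract with a Lookup), and a Router can split the delta rows across three flat file targets so the .csv files are produced on every run.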

Related

update Parquet file format

My requirement is to read that and generate another set of Parquet data into another ADLS folder.
Do I need to load this into Spark DataFrames and perform upserts?
Parquet is like any other file format: you have to overwrite the files to perform inserts, updates, and deletes. It does not have ACID properties like a database.
1 - We can use set operations with the Spark DataFrames to accomplish what you want. However, this compares at both the row and column level, which is not as convenient as ANSI SQL.
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-setops.html
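For what it's worth, a row-level delta with the set operators might look roughly like this (the ADLS paths are placeholders; exceptAll keeps duplicate rows while except de-duplicates):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ParquetDelta {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("parquet-delta").getOrCreate();

            // Assumed ADLS paths for the previous and current Parquet snapshots.
            Dataset<Row> previous = spark.read().parquet("abfss://container@account.dfs.core.windows.net/prev/");
            Dataset<Row> current  = spark.read().parquet("abfss://container@account.dfs.core.windows.net/curr/");

            // Rows that are new or changed since the previous snapshot (compares every column).
            Dataset<Row> inserted = current.exceptAll(previous);
            // Rows that disappeared or changed (candidates for updates/deletes).
            Dataset<Row> removed  = previous.exceptAll(current);

            inserted.write().mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/delta_out/inserted/");
            removed.write().mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/delta_out/removed/");
            spark.stop();
        }
    }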
2 - We can save the data in the target directory in Delta format. Most people are using Delta since it has ACID properties like a database. Please see the MERGE statement; it allows for updates and inserts.
https://docs.delta.io/latest/delta-update.html
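A minimal MERGE sketch with the Delta Lake Scala/Java API (delta-core on the classpath; the paths and the "id" key column are assumptions):

    import io.delta.tables.DeltaTable;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DeltaMergeExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("delta-merge").getOrCreate();

            // Assumed paths: incoming Parquet files and the target Delta table.
            Dataset<Row> updates = spark.read().parquet("abfss://container@account.dfs.core.windows.net/incoming/");
            DeltaTable target = DeltaTable.forPath(spark, "abfss://container@account.dfs.core.windows.net/delta/customers/");

            // Upsert: update matching rows, insert the rest ("id" is an assumed business key).
            target.as("t")
                  .merge(updates.as("s"), "t.id = s.id")
                  .whenMatched().updateAll()
                  .whenNotMatched().insertAll()
                  .execute();
        }
    }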
Additionally, we can perform soft deletes by reversing the match.
The nice thing about a Delta table is that we can partition by date for a daily file load. We can then use time travel to see what happened yesterday versus today.
https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html
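A small time-travel sketch, assuming the same Delta table path as above and a made-up timestamp:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DeltaTimeTravel {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("delta-time-travel").getOrCreate();

            // Yesterday's state of the table, by timestamp (or use "versionAsOf" with a version number).
            Dataset<Row> yesterday = spark.read().format("delta")
                    .option("timestampAsOf", "2024-01-01")   // assumed date
                    .load("abfss://container@account.dfs.core.windows.net/delta/customers/");

            // Compare against the current version to see what changed.
            Dataset<Row> today = spark.read().format("delta")
                    .load("abfss://container@account.dfs.core.windows.net/delta/customers/");
            today.exceptAll(yesterday).show();
        }
    }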
3 - If you do not care about history and soft deletes, the easiest way to accomplish this task is to archive the old files in the target directory, then copy over the new files from the source directory to the target directory.
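A bare-bones sketch of option 3 for a local or mounted filesystem (the paths and the *.parquet glob are assumptions; on ADLS you would do the equivalent with dbutils.fs or the storage SDK):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class ArchiveAndReplace {
        public static void main(String[] args) throws IOException {
            Path source  = Paths.get("/mnt/source");    // assumed mount points
            Path target  = Paths.get("/mnt/target");
            Path archive = Paths.get("/mnt/archive/" + System.currentTimeMillis());
            Files.createDirectories(archive);

            // 1. Move the old target files into a timestamped archive folder.
            try (DirectoryStream<Path> old = Files.newDirectoryStream(target, "*.parquet")) {
                for (Path f : old) {
                    Files.move(f, archive.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                }
            }
            // 2. Copy the new files from the source directory to the target directory.
            try (DirectoryStream<Path> fresh = Files.newDirectoryStream(source, "*.parquet")) {
                for (Path f : fresh) {
                    Files.copy(f, target.resolve(f.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }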

Append DataSets (.ds) using UNIX

I'm currently working on IBM DataStage and here's my problem:
I have to take n datasets that will be in a folder and append them into one Data Set (.ds).
Since I don't know how many datasets I will have, or their full names, I can't use a DataStage job to deal with them. All I know is that they will have the same metadata (because they will be generated by the same job).
I think I have to use a shell command to append them, but I'm not a UNIX guy.
Thank you to everyone who reads this far.
You can use the same job. Specify Append mode (rather than Override) for the target Data Set; each time you run the job data will be added to the same Data Set. Be careful not to inadvertently create duplicates by processing the same source data twice. Use parameters to specify the source.

writing multiple files (different content) using spring batch

I have a requirement to write multiple files using Spring Batch. The first file will be written based on the data from a database table. The second file will contain just the number of records written to the first file. How can I create the second file? I am not sure whether org.springframework.batch.item.file.MultiResourceItemWriter is an option for me, as I think it splits the same data into chunks across multiple files rather than writing files with different content. Correct me if I am wrong here.
Please do suggest some options with sample code if possible.
You have a couple of options:
You can use CompositeItemWriter, which calls a collection of item writers in a defined order, so you can define one item writer that writes records based on the data from the DB and a second one that counts the records and writes the count to another file.
You can write the data to a file in the first step and finish the whole file there; if the record count is all you need, save it to the step's ExecutionContext (see the common batch patterns documentation, section 11.8 Passing Data to Future Steps), then read the counter in a new Tasklet and write it to the second file.
If you want to go with option 1, which I think is the right choice, you can check this example of a batch job configuration with CompositeItemWriter.
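Not the linked example, but a minimal sketch of option 1 (Spring Batch 5 API; the Customer type, bean names, field names and file paths are assumptions): a CompositeItemWriter delegating to a FlatFileItemWriter for the data file and to a small custom writer that only counts items and writes the total when the stream closes.

    import java.io.FileNotFoundException;
    import java.io.PrintWriter;
    import java.util.List;

    import org.springframework.batch.item.Chunk;
    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.batch.item.ItemStream;
    import org.springframework.batch.item.ItemStreamException;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.file.FlatFileItemWriter;
    import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
    import org.springframework.batch.item.support.CompositeItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.FileSystemResource;

    @Configuration
    public class ReportWriterConfig {

        // Assumed domain type produced by the reader.
        public static class Customer {
            private final long id;
            private final String name;
            public Customer(long id, String name) { this.id = id; this.name = name; }
            public long getId() { return id; }
            public String getName() { return name; }
        }

        // Writes the actual records read from the database to the first file.
        @Bean
        public FlatFileItemWriter<Customer> dataFileWriter() {
            return new FlatFileItemWriterBuilder<Customer>()
                    .name("dataFileWriter")
                    .resource(new FileSystemResource("output/customers.csv"))   // assumed path
                    .delimited().delimiter(",")
                    .names(new String[] {"id", "name"})                         // assumed fields
                    .build();
        }

        // Counts every item it sees and writes the total to the second file on close().
        public static class RecordCountWriter implements ItemWriter<Customer>, ItemStream {
            private long count = 0;

            @Override
            public void write(Chunk<? extends Customer> chunk) {
                count += chunk.getItems().size();
            }

            @Override public void open(ExecutionContext ctx) throws ItemStreamException { }
            @Override public void update(ExecutionContext ctx) throws ItemStreamException { }

            @Override
            public void close() throws ItemStreamException {
                try (PrintWriter out = new PrintWriter("output/customers.count")) {  // assumed path
                    out.println(count);
                } catch (FileNotFoundException e) {
                    throw new ItemStreamException("Could not write count file", e);
                }
            }
        }

        // Calls both delegates, in order, for every chunk.
        @Bean
        public CompositeItemWriter<Customer> compositeWriter(FlatFileItemWriter<Customer> dataFileWriter) {
            CompositeItemWriter<Customer> writer = new CompositeItemWriter<>();
            writer.setDelegates(List.of(dataFileWriter, new RecordCountWriter()));
            return writer;
        }
    }

Register the composite writer as the step's item writer; if the stream callbacks do not reach the delegates in your version, register the writers as streams on the step explicitly so the files are opened and closed with the step.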

IBM BigSheets Issue

I am getting some errors loading my files into BigSheets, both directly from HDFS (files that are the output of Pig scripts) and raw data that is on the local hard disk.
I have observed that whenever I load the files and issue a row count to see if all the data has been loaded into BigSheets, I see fewer rows being loaded.
I have checked that the files are consistent and have proper delimiters (tab or comma separated fields).
The size of my file is around 2 GB and I have used either the *.csv or *.tsv format.
Also, in some cases when I have tried to load a file directly from Windows, the file sometimes loads successfully with the row count matching the actual number of lines in the data, and sometimes with a lower row count.
Even when a fresh file is used for the first time it sometimes gives the correct result, but if I do the same operation the next time, some rows are missing.
Kindly share your experience with BigSheets and solutions to any such problems where the entire data is not being loaded. Thanks in advance.
The data that you originally load into BigSheets is only a subset (a sample shown in the workbook). You have to run the sheet to apply it to the full dataset.
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.analyze.doc/doc/t0057547.html?lang=en

how to work on specific part of csv file uploaded into HDFS?

How do I work on a specific part of a CSV file uploaded into HDFS?
I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that you are referring to column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv
Now you can configure the job for the input path and field positions. You may consider using the default input format (TextInputFormat) so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
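A minimal mapper sketch along those lines, using the default TextInputFormat (the delimiter and column positions are assumptions):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only the columns we care about from each CSV line of the exported table.
    // The job's input path would point at /user/userName/dbName/tables/table1.csv.
    public class SelectedColumnsMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed: comma-delimited export with no embedded commas/quotes.
            String[] fields = value.toString().split(",");
            if (fields.length > 3) {
                // Keep, say, the first and fourth columns of the table.
                context.write(new Text(fields[0] + "," + fields[3]), NullWritable.get());
            }
        }
    }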
Cascading allows you to get started very quickly with MapReduce. It has a framework that allows you to set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
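For reference, that "C = A + B" example sketched with the Cascading 2.x Hadoop platform (paths and field names are assumptions):

    import java.util.Properties;

    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.expression.ExpressionFunction;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class SumColumns {
        public static void main(String[] args) {
            // Source tap over the CSV in HDFS; declares the incoming columns (assumed names).
            Tap source = new Hfs(new TextDelimited(new Fields("A", "B"), ","), "/user/userName/dbName/tables/table1.csv");
            Tap sink   = new Hfs(new TextDelimited(new Fields("A", "B", "C"), ","), "/user/userName/output/table1_sum");

            // Pipeline: for each tuple, compute C = A + B from the selected fields.
            Pipe pipe = new Each(new Pipe("sum"),
                    new Fields("A", "B"),
                    new ExpressionFunction(new Fields("C"), "A + B", Integer.class),
                    Fields.ALL);

            new HadoopFlowConnector(new Properties()).connect(source, sink, pipe).complete();
        }
    }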
Use BigTable, meaning convert your database into one big table.
