Incremental data processing for files in Talend ETL

How can you manage incremental data processing when you don't have a database or anything else to log the previous execution timestamp?
Can we use the tAddCRCRow component? But how will it know whether the data has already been processed, especially when both the source and the destination are nothing but collections of files?
Thank you.

You have to use your target file as a lookup and identify the existing values; this will resolve your issue.
In the case of files, you have to use multiple files as the lookup, or create a separate table that holds the unique values from all the files and use it as the lookup.
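Outside of Talend, the lookup-against-the-target idea boils down to something like the following Python sketch. The file names, the key column and the plain-CSV layout are all assumptions made for illustration; in Talend itself this corresponds to reading the target file as the lookup input of a tMap and letting only the non-matching rows through.

```python
import csv
import os

# Hypothetical file names and key column, for illustration only.
SOURCE_FILE = "source.csv"
TARGET_FILE = "target.csv"
KEY_COLUMN = "id"   # assumed unique business key

# Build a lookup of the keys already present in the target file.
target_exists = os.path.exists(TARGET_FILE)
existing_keys = set()
if target_exists:
    with open(TARGET_FILE, newline="") as f:
        for row in csv.DictReader(f):
            existing_keys.add(row[KEY_COLUMN])

# Append only the source rows whose key is not in the target yet.
with open(SOURCE_FILE, newline="") as src, open(TARGET_FILE, "a", newline="") as tgt:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(tgt, fieldnames=reader.fieldnames)
    if not target_exists:
        writer.writeheader()
    for row in reader:
        if row[KEY_COLUMN] not in existing_keys:
            writer.writerow(row)
```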

Related

Append DataSets (.ds) using UNIX

I'm currently working on IBM DataStage and here's my problem:
I have to take n datasets that will be sitting in a folder and append them into one DataSet (.ds).
Since I don't know how many datasets I will have, nor their full names, I can't use a DataStage job to deal with them. All I know is that they will have the same metadata (because they will be generated by the same job).
I think I have to use a shell command to append them, but I'm not a UNIX guy.
Thanks to everyone who reads this far.
You can use the same job. Specify Append mode (rather than Override) for the target Data Set; each time you run the job data will be added to the same Data Set. Be careful not to inadvertently create duplicates by processing the same source data twice. Use parameters to specify the source.
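DataStage Data Sets are not plain text files, so a shell cat will not work on them; but the "append each new file exactly once" bookkeeping the answer warns about looks roughly like this in plain Python (directory, file names and the .txt extension are assumptions for illustration only):

```python
import glob
import os

# Hypothetical paths, for illustration only.
SOURCE_DIR = "/data/incoming"
TARGET_FILE = "/data/combined.txt"
PROCESSED_LOG = "/data/processed_files.log"

# Source files that were already appended in earlier runs.
already_done = set()
if os.path.exists(PROCESSED_LOG):
    with open(PROCESSED_LOG) as log:
        already_done = {line.strip() for line in log}

# Append each new file exactly once, then record it in the log
# so a re-run cannot create duplicates.
with open(TARGET_FILE, "a") as target, open(PROCESSED_LOG, "a") as log:
    for path in sorted(glob.glob(os.path.join(SOURCE_DIR, "*.txt"))):
        if path in already_done:
            continue
        with open(path) as src:
            target.writelines(src)
        log.write(path + "\n")
```

In the DataStage job itself, the same effect comes from the Append update policy plus a job parameter for the source, as described above.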

Combining different data flows and creating a .txt file by sorting the output

I have a requirement: I am trying to combine several data flows with Talend in order to create a .txt file. In my case the input flows are DB tables. I am able to create the output file "prova.txt", but in this file some fields of the 2nd and 3rd tables are missing and I don't know why. I checked with tLogRow and the problem seems to be in tHashInput_1. In the three tHashOutput components the rows are logged correctly with all fields.
Below is my job:
Components tHashOutput_2, tHashOutput_3 and tHashInput_1 are linked to tHashOutput_1.
Am I doing something wrong? Could anyone help me?
Thank you in advance!
Assuming the schema is the same for all tHashOutput components, I attached an image for your problem.
Give all the tFileOutputDelimited components the same file name and the same schema, and use the append option; this will append the data from all three tables into the same file.
An alternative is to use tUnite, again assuming the schema is the same for all tHashOutput components.
Example: using tUnite
Regards!
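Outside Talend, the union-then-write logic of both approaches reduces to the sketch below. The table names, columns, delimiter and the use of sqlite3 as a stand-in for the real database connection are all assumptions for illustration:

```python
import csv
import sqlite3  # stand-in for the real DB connection; an assumption

# Hypothetical queries; in the real job these are the three DB-table inputs.
QUERIES = [
    "SELECT id, name, amount FROM table_a",
    "SELECT id, name, amount FROM table_b",
    "SELECT id, name, amount FROM table_c",
]

conn = sqlite3.connect("example.db")
rows = []
for query in QUERIES:
    rows.extend(conn.execute(query).fetchall())   # union of the three flows

# Sort the combined output (here by the first column) and write one .txt file.
rows.sort(key=lambda r: r[0])
with open("prova.txt", "w", newline="") as out:
    writer = csv.writer(out, delimiter=";")
    writer.writerow(["id", "name", "amount"])     # same schema for all flows
    writer.writerows(rows)
```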

How to save a text file to Hive using the table of contents as the schema

I have many project reports in text format (Word and PDF). These files contain data that I want to extract, such as references, keywords, and names mentioned.
I want to process these files with Apache Spark and save the results to Hive, using the power of DataFrames (with the table of contents as the schema). Is that possible?
Could you share any ideas about how to process these files?
As far as I understand, you will need to parse the files using Tika and manually create custom schemas as described here.
Let me know if this helps. Cheers.
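A minimal PySpark sketch of that approach, assuming the tika-python package is installed, Spark is built with Hive support, and the fields you derive from the table of contents boil down to a few string columns (the field names, file paths and table name are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
from tika import parser   # tika-python wrapper around Apache Tika

spark = (SparkSession.builder
         .appName("reports-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Schema written by hand from the reports' table of contents;
# the field names are assumptions for the sake of the example.
schema = StructType([
    StructField("file_name", StringType()),
    StructField("references", StringType()),
    StructField("keywords", StringType()),
    StructField("names_mentioned", StringType()),
])

def extract(path):
    """Parse one Word/PDF report with Tika; real section parsing is stubbed out."""
    parsed = parser.from_file(path)        # dict with 'metadata' and 'content'
    text = parsed.get("content") or ""
    # Placeholders: in practice you would pull references, keywords and
    # mentioned names out of `text` here.
    return (path, text[:200], "", "")

report_paths = ["/reports/project1.pdf", "/reports/project2.docx"]  # assumed paths
rows = [extract(p) for p in report_paths]

df = spark.createDataFrame(rows, schema)
df.write.mode("append").saveAsTable("project_reports")   # Hive-managed table
```

For a large number of reports the parsing itself can be distributed, for example with spark.sparkContext.parallelize(report_paths).map(extract), instead of looping on the driver.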

Schema verification/validation before loading data into HDFS/Hive

I am a newbie to the Hadoop ecosystem and I need some suggestions from big data experts on achieving schema verification/validation before loading huge datasets into HDFS.
The scenario is:
I have a huge dataset with a given schema (around 200 column headers). This dataset is going to be stored in Hive tables/HDFS. Before loading the data into the Hive table/HDFS I want to perform schema-level verification/validation on the supplied data, to avoid any unwanted errors/exceptions while loading it into HDFS. For example, if somebody tries to pass a data file with fewer or more columns, the load should fail at this first level of verification.
What could be the best possible approach for achieving the same?
Regards,
Bhupesh
Since you have files, you can add them into HDFS and run MapReduce on top of that. There you have a hold on each row, so you can verify the number of columns, their types, and any other validations.
When I referred to JSON/XML, there is a slight overhead in making MapReduce identify records in that format. With respect to validation, however, there is schema validation that you can enforce, and you can also restrict a field to specific values using the schema. So once the schema is ready, do your parsing (e.g. XML to Java) and then store the data at another, final HDFS location for further use (like HBase). When you are sure that the data is validated, you can create Hive tables on top of it.
Use the utility below to create temp tables each time, based on the schema you receive in the CSV file in the staging directory, then apply some conditions to identify whether you have valid columns or not. Finally, load into the original table.
https://github.com/enahwe/Csv2Hive
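As a first-level gate, the column check can also be done with a few lines of Python before a file is ever pushed to HDFS. Here is a minimal sketch, assuming the expected column names sit in a reference file and the data arrives as comma-delimited text (file names and delimiter are assumptions):

```python
import csv
import sys

EXPECTED_SCHEMA_FILE = "expected_columns.txt"   # one column name per line (assumed)
DATA_FILE = sys.argv[1] if len(sys.argv) > 1 else "incoming.csv"

with open(EXPECTED_SCHEMA_FILE) as f:
    expected = [line.strip() for line in f if line.strip()]

with open(DATA_FILE, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    # First-level check: column count and names must match before any load.
    if len(header) != len(expected):
        sys.exit(f"Reject {DATA_FILE}: {len(header)} columns, expected {len(expected)}")
    if header != expected:
        sys.exit(f"Reject {DATA_FILE}: header does not match the expected schema")
    # Optional row-level check: every data row must have the same column count.
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected):
            sys.exit(f"Reject {DATA_FILE}: bad column count on line {line_no}")

print(f"{DATA_FILE} passed schema validation; safe to load into HDFS/Hive")
```

Run it in the staging directory before hdfs dfs -put; only files that pass the check move on to the Hive load.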

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of the file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume that you are referring to the column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job for the input path and the field positions. You may consider using the default input format, so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
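If you try this with plain MapReduce, Hadoop Streaming is a quick way to experiment; the mapper below is a minimal Python sketch in which the selected field positions are assumptions:

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper: keep only selected columns of each CSV line.

The default input format feeds the mapper one line at a time on stdin;
WANTED_FIELDS lists the 0-based column positions to keep (assumed here).
"""
import csv
import sys

WANTED_FIELDS = [0, 2, 5]   # e.g. id, name, amount -- adjust to your table

for row in csv.reader(sys.stdin):
    try:
        selected = [row[i] for i in WANTED_FIELDS]
    except IndexError:
        continue            # skip short or malformed lines
    print("\t".join(selected))
```

It would be submitted with the hadoop-streaming jar, passing the CSV path with -input, the script with -mapper (and -files), and a result directory with -output.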
Cascading allows you to get started very quickly with MapReduce. It is a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C, by selecting them as Fields.
Use BigTable; that means converting your database into one big table.
