Combining different data flows and creating a .txt file by sorting output - etl

I have a requirement: I am trying to combine several data flows with Talend in order to create a .txt file. In my case the input flows are DB tables. I am able to create the output file "prova.txt", but in this file some fields of the 2nd and 3rd tables are missing and I don't know why. I checked with tLogRow and the problem seems to be in tHashInput_1. In the 3 tHashOutput components the rows are logged correctly with all fields.
Below is my job:
Components tHashOutput_2, tHashOutput_3, tHashInput_1 are linked to tHashOutput_1.
Am I doing something wrong? Could anyone help me?
Thank you in advance!

Assuming the schema is the same for all tHashOutput components, I attached an image for your problem.
Here, give all tFileOutputDelimited components the same file name and the same schema, and enable the append option. This will append the data from all 3 tables into the same file.

An alternative is to use tUnite, again assuming the schema is the same for all tHashOutput components.
Example: using tUnite, as sketched below.
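If it helps to see the intent outside of the Talend designer, here is a rough Python sketch of the same union-then-append idea. The file name prova.txt comes from the question; the row contents and the semicolon delimiter are made-up placeholders for whatever the three DB tables actually hold.

# Rough illustration only: this is what three appending tFileOutputDelimited
# components (or tUnite feeding a single output) effectively produce.
table_1 = [["a", 1, 2], ["b", 3, 4]]
table_2 = [["c", 5, 6]]
table_3 = [["d", 7, 8]]

with open("prova.txt", "w", encoding="utf-8") as out:
    # All three flows must share the same schema for the union to make sense.
    for table in (table_1, table_2, table_3):
        for row in table:
            out.write(";".join(str(field) for field in row) + "\n")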
Regards!

Related

Incremental data processing for files in Talend

How can you manage your incremental data processing when you don't have a database or anything else to log the previous execution timestamp?
Can we use the tAddCRCRow component? But how would it know whether the data has already been processed, especially when both the source and the destination are nothing but collections of files?
Thank you.
You have to use your target file as a lookup and identify the existing values. This will help you to resolve your issue.
In the case of files, you have to use multiple files as a lookup, or create a separate table which holds the unique values of all the files and use it as a lookup.
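As a rough illustration of the lookup idea outside of Talend, here is a hedged Python sketch. The file names, the semicolon delimiter and the assumption that the first column is a unique key are all invented for the example.

# Sketch only: treat the target file as a lookup of already-processed keys
# and append only the rows from the source that are not yet in the target.
def load_keys(path):
    keys = set()
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                keys.add(line.split(";")[0])
    except FileNotFoundError:
        pass  # first run: nothing has been processed yet
    return keys

processed = load_keys("target.csv")

with open("source.csv", encoding="utf-8") as src, \
     open("target.csv", "a", encoding="utf-8") as tgt:
    for line in src:
        if line.split(";")[0] not in processed:
            tgt.write(line)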

how to save a text file to hive using the table of contents as schema

I have many project reports in text formats (Word and PDF). These files contain data that I want to extract, such as references, keywords, names mentioned, …
I want to process these files with Apache Spark and save the result to Hive, using the power of DataFrames (with the table of contents as the schema). Is that possible?
Could you share any ideas about how to process these files?
As far as I understand, you will need to parse the files using Tika and manually create custom schemas as described here.
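A minimal PySpark sketch of that approach, assuming the tika Python bindings are installed and a Hive-enabled SparkSession; the file list, the table name project_reports and the two schema fields are placeholders for whatever your table of contents actually gives you.

# Sketch: extract text with Tika, build rows against a hand-written schema,
# and save the result as a Hive table.
from tika import parser                      # Apache Tika Python bindings
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

schema = StructType([
    StructField("file_name", StringType()),
    StructField("raw_text", StringType()),
])

def extract(path):
    parsed = parser.from_file(path)          # dict with a 'content' entry
    return Row(file_name=path, raw_text=parsed.get("content") or "")

files = ["report1.pdf", "report2.docx"]      # your Word/PDF reports
rows = [extract(p) for p in files]

df = spark.createDataFrame(rows, schema)
# From here you would parse raw_text into the real columns
# (references, keywords, names, ...) before writing.
df.write.mode("overwrite").saveAsTable("project_reports")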
Let me know if this helps. Cheers.

how to create a query in hive for particular data

I am using Hive to load a data file and run Hadoop MapReduce on it, but I am stuck at the CREATE TABLE query. I have data like this: 59.7*, 58.9*, where * is just a character. I want to make two columns to store 59.7 and 58.9. Can anyone help with that? Thanks.
You can use RegexSerDe to do that. You can visit this page if you need an example.
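For instance, here is a hedged sketch of such a table definition, submitted through a Hive-enabled SparkSession (the same DDL can be run from the Hive CLI or beeline). The table name, column names, location and regex are assumptions based on the sample values 59.7*, 58.9*, with * being a single literal character.

# Sketch: a Hive table backed by RegexSerDe that splits "59.7*, 58.9*"
# into two string columns. Adjust names, location and regex to your data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS readings (value1 STRING, value2 STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        "input.regex" = "([0-9.]+)., ?([0-9.]+)."
    )
    STORED AS TEXTFILE
    LOCATION '/user/hive/warehouse/readings'
""")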

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a csv file which, unfortunately, contains multiple data tables. Actually, it's not really a pure csv file.
It contains a header field with some metadata and then the actual csv data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to load such data using Kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.
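For example, here is a rough preprocessing sketch in Python (an awk script would do equally well). The //------------- separator and the "Table N" naming are taken from the sample above; the output file names header.csv and table_N.csv are made up.

# Sketch: split the report into its component files, one per section,
# so each piece can be loaded with a normal CSV input step afterwards.
import re

def split_report(path):
    with open(path, encoding="utf-8") as f:
        content = f.read()

    # Sections are separated by the '//-------------' marker lines.
    sections = [s.strip() for s in content.split("//-------------") if s.strip()]

    for section in sections:
        first_line = section.splitlines()[0]
        match = re.match(r"Table (\d+)", first_line)
        name = f"table_{match.group(1)}.csv" if match else "header.csv"
        with open(name, "w", encoding="utf-8") as out:
            out.write(section + "\n")

split_report("report.csv")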

how to work on a specific part of a csv file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into CSV files and then upload them into HDFS, how can I work on a specific part (table) of a file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual csv files, one per table, and stored in HDFS. I presume that you are referring to the column data within the table(s) when you mention 'specific part (table)'. If so, place the individual csv files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job for the input path and the field positions. You may consider using the default InputFormat, so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data.
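As an illustration, here is a minimal Hadoop Streaming style mapper in Python that keeps only selected fields from each line; the comma delimiter and the choice of columns 0 and 2 are assumptions for the example. You would submit it with the Hadoop Streaming jar, pointing its input at the table's path (e.g. /user/userName/dbName/tables/table1.csv).

#!/usr/bin/env python3
# Sketch of a streaming mapper: the default TextInputFormat feeds one CSV
# line per input record on stdin; we emit only the fields we care about.
import sys

WANTED_FIELDS = [0, 2]   # hypothetical: keep the 1st and 3rd columns

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) <= max(WANTED_FIELDS):
        continue                      # skip malformed or short lines
    print("\t".join(fields[i] for i in WANTED_FIELDS))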
Cascading allows you to get started very quickly with MapReduce. It is a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use a BigTable approach, which means converting your database into one big table.
