Kettle: load CSV file which contains multiple data tables - etl

I'm trying to import data from a CSV file which, unfortunately, contains multiple data tables. Actually, it's not really a pure CSV file.
It contains a header section with some metadata, and the actual CSV data parts are then separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to load and process such data using Kettle?
Is there a way to split this file into the header and CSV data parts and then process each of them as a separate input?
Thanks in advance for any hints and tips.
Best,
Haes.

I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing the data into a CSV input step. You could still do this within your job, though, by calling out to the shell first and running a command there, for example an awk script that splits the file into its component parts, and then loading those files via the normal Kettle pattern.
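For illustration, here is a small Python sketch of the kind of preprocessing that awk suggestion is aiming at. It assumes the layout shown in the example above (a metadata header, then sections introduced by a "//----" separator line followed by a "Table <n>" line); the script name and output file names are made up and should be adapted to whatever the downstream transformations expect.

#!/usr/bin/env python
# split_report.py -- split the multi-table report into one CSV file per section.
# Assumes the input layout shown above; output file names are illustrative only.
import sys

def split_report(path, prefix="report"):
    out = open(prefix + "_header.csv", "w")   # the metadata header comes first
    table_nr = 0
    skip_next = False                         # used to drop the 'Table <n>' line
    with open(path) as src:
        for line in src:
            if line.startswith("//----"):     # section separator
                out.close()
                table_nr += 1
                out = open("%s_table_%d.csv" % (prefix, table_nr), "w")
                skip_next = True              # the next line is 'Table <n>;;;;'
                continue
            if skip_next:
                skip_next = False
                continue
            out.write(line)
    out.close()

if __name__ == "__main__":
    split_report(sys.argv[1])

Each resulting report_table_<n>.csv then has a single header row and can be read with an ordinary CSV file input step; the script itself can be run from a Shell job entry before the transformations, so everything still lives in one Kettle job.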

Related

How do I upload a CSV file into Oracle

When trying to load a CSV file into an Oracle table through ODI, ODI is not able to fetch the data from the file. The CSV file format is the issue here, with the data appearing on a single line. But when we open the CSV file in Excel and then save it as CSV again, the format changes, the data gets arranged properly, and we are then able to import it through ODI.
The problem is that we need to import the original CSV file, whatever format it is in. Is there a way to do that?
SQL*Loader is the first thing that comes to my mind. I use it a lot.
SQL Developer will be a better option if you don't want to work with command-line utilities.
Try using external tables: you can configure how the CSV should be read in the EXTERNAL TABLE definition.
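On the question's observation that re-saving from Excel "fixes" the file: if the original uses carriage-return-only (or otherwise unexpected) line endings, which is a common reason a CSV appears as one single line, you could normalize the file before ODI, SQL*Loader, or an external table reads it. This is only a sketch under that assumption, and the file names are hypothetical:

# normalize_csv.py -- rewrite a CSV so that every record ends with '\n'.
# Assumption: the source uses CR-only or CRLF line endings; Python's
# universal-newline handling converts all of them to '\n' on read.
def normalize_line_endings(src_path, dst_path):
    with open(src_path, "r", newline=None) as src, \
         open(dst_path, "w", newline="\n") as dst:
        for line in src:
            dst.write(line)

normalize_line_endings("original.csv", "normalized.csv")  # hypothetical file names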

How to load multiple Excel files into different tables based on xls metadata using SSIS?

I have multiple Excel files with two types of metadata. Now I have to push the data into two different tables, based on the metadata of the Excel files, using SSIS.
There are many, many different ways to do this. You'd need to share a lot more information on how your data is structured to really give a great answer, but here's the general strategy I'd suggest.
In the control flow tab, have a separate data flow for each Excel file. The data flows will all work the same, with the exception of having a different Excel source in each data flow, so it will be enough to get the first version working and then copy and paste for the other files.
In the data flow, use a conditional split transformation to read the metadata coming from Excel and send the row to the correct table.
If you really want to be fancy, however, you could create a child package that includes all your data flow logic. Using the Execute Package Task you can pass the Excel file name to the child package for each Excel file you need to import. This way you consolidate your logic in one package and can still import from multiple Excel files in parallel.

How to work on a specific part of a CSV file uploaded into HDFS?

I'm new to Hadoop and I have a question: if I export a relational database into a CSV file and then upload it into HDFS, how do I work on a specific part (table) of that file using MapReduce?
Thanks in advance.
I assume that the RDBMS tables are exported to individual CSV files, one per table, and stored in HDFS. I presume you are referring to the column data within the table(s) when you mention 'specific part (table)'. If so, place the individual CSV files into separate file paths, say /user/userName/dbName/tables/table1.csv.
Now you can configure the job with the input path and the field positions. You may consider using the default input format so that your mapper gets one line at a time as input. Based on the configuration/properties, you can read the specific fields and process the data, as sketched below.
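As a minimal illustration of reading specific fields with the default line-at-a-time input, here is a Hadoop Streaming mapper in Python; the delimiter, the column indexes, and the paths in the comments are assumptions for the sake of the example (a native Java mapper over TextInputFormat follows the same pattern):

#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper that emits selected columns of each CSV line.
# Illustrative invocation (paths and columns are made up):
#   hadoop jar hadoop-streaming.jar \
#       -input /user/userName/dbName/tables/table1.csv \
#       -output /user/userName/out/table1_cols \
#       -mapper mapper.py -file mapper.py
import sys

DELIMITER = ","          # adjust to the delimiter the export actually uses
WANTED_COLUMNS = (0, 2)  # illustrative: keep the 1st and 3rd fields

for line in sys.stdin:
    fields = line.rstrip("\n").split(DELIMITER)
    if len(fields) <= max(WANTED_COLUMNS):
        continue                                    # skip short or malformed records
    print(DELIMITER.join(fields[i] for i in WANTED_COLUMNS))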
Cascading allows you to get started very quickly with MapReduce. It provides a framework that lets you set up Taps to access sources (your CSV file) and process them inside a pipeline, for example to add column A to column B and place the sum into column C by selecting them as Fields.
Use the BigTable approach, meaning convert your database into one big table.

JMeter export graph to CSV exports to multiple columns, write results to file exports to one column

I was annoyed with JMeter writing data results to CSV as one column. So when the CSV file was opened in Excel all values would be added to one single column (which requires annoying manual copy/paste work to get to graphs). I then noticed that if I choose Export to CSV on a Listener graph, it actually exports the CSV file as separate columns in Excel, which is great.
Is it possible to have the "Write results to file" write data into separate columns by default as it does with the graph "Export to CSV"? Thanks!
I suppose you have at least two options:
Simple Data Writer, the one you are using at the moment.
In the jmeter.properties file (JMETER_HOME\bin\jmeter.properties), uncomment and set jmeter.save.saveservice.default_delimiter=; to use ';' instead of ',' (the default) as the separator in the CSV files you create via "Write results to file". This will put the values into separate columns when the file is opened in Excel.
# For use with Comma-separated value (CSV) files or other formats
# where the fields' values are separated by specified delimiters.
jmeter.save.saveservice.default_delimiter=;
Flexible File Writer from the jmeter-plugins pack implements the same functionality and looks to be more customizable.
The idea is the same as above: use ';' to separate the values written into the file:
Write file header: endTimeMillis;responseTime;latency;sentBytes;receivedBytes;isSuccessful;responseCode
Record each sample as: endTimeMillis|;|responseTime|;|latency|;|sentBytes|;|receivedBytes|;|isSuccessful|;|responseCode|\r\n
Hope this helps.

Suggested Hadoop File Format for Tabular Data

My application needs to process a couple of TB worth of tabular data. At the moment, the data is saved as several huge comma-separated CSV files. I can control how the files are provided to my M/R job, and I am wondering which file format is preferred to make the job run faster. For instance, is there any point in saving the input data as sequence files instead of the text files I am using now? Will that make my M/R job run noticeably faster?
From the perspective of "file format", I don't think using SequenceFile will be a great improvement over a text file for CSV data. If the CSV data were a single (key, value) pair, using SequenceFile over a text file would have made sense.
However, I am intrigued by the use of RCFile (Record Columnar File), which should lend itself well to CSV-like data. I have used it with Hive tables and achieved significant improvements in execution time for Hive queries. I assume that was due to execution efficiency in M/R, since Hive queries get translated to M/R programs.
Ref: http://www.ixwebhosting.mobi/2011/10/06/4823.html
