How to load a huge CSV file in orient db - etl

I want to load a huge CSV file in my orient Db database.there is some checklist for the database ,that should our db follows.
1- there would be a single csv file and this CSV file will have millions of records and more then 20 Columns.
2- From this csv i have to create multiple Classes and each class will have different Properties (is it possible with Orient db).
3- i have to create index too
Please help for this.how should i create Etl config file for this
Thanks in advance.
please let me know if any input required from my side.

Related

How to read and perform batch processing using spring batch annotation config

I have 2 different file with different data. The file contains 10K record per day.
Ex:
Productname price date
T shirt,500,051221
Pant,1000,051221
Productname price date
T shirt,800,061221
Pant,1800,061221
I want to create final output file by checking price difference by todays and yesterdays file.
Ex:
Productname price
T shirt,300
Pant,800
By using spring batch I have to do this.
I have tried with batch configuration by creating two different step. but its only able to read the data. but unable to
do the processing. because here I need the data of both file for processing. but in my case its reading one step after another.
Could anyone help me on this with some sample code.
I would suggest to save FlatFile data into the database for yesterday's and today's date (may be two separate tables or in a same table if you can identify difference two records easily). Read this stored data using JdbcCursorItemReader or PagingItemReader and perform calculation/logic/massaging of data at the processor level and create a new FlatFile or save into DB as per convenience. OOTB Spring Batch does not provide facility to read data and perform calculation.
Suggestion - Read data from both the FlatFile keep it in cache and read from the cache and do the further processing.

read csv file data and store it in database using spring framework

I need help, I want the code to read the data which is in a csv file and the store that data into database. I have tried reading the csv file with known rows and cols. But the challenge here is that I want to create an utility where I don't know the number of cols and rows that are in the csv file so how would I do it? Please help.
Have you explored Spring Batch? You can write your own implementation of LineTokenizer for the columns which are going to change dynamically.

How do I uploading csv file into Oracle

When trying to load csv file into Oracle table through ODI, ODI is not able to fetch the data from csv file. The csv file format is an issue here with the data in a single line. But when we are opening the csv file through excel and then saving it as csv the format is changing and the data is getting arranged properly and then we are able to import it through ODI.
Problem is we need to import the original csv file whatever format it is. Is there a possibility of doing the same?
SQL Loader will be the first thing that has came to my mind. I use this a lot.
SQL Developer will be a better option if you dont want to work with command line utilities.
Try using External Tables...you can configure how the CSV should be read in the EXTERNAL TABLE configuration

How to skip file headers in impala external table?

I have file on HDFS with 78 GB size
I need to create an Impala External table over it to perform some grouping and aggregation on data available
Problem
The file contain headers.
Question
Is there any way to skip headers from file while reading the file and do querying on the rest of data.
Although i have a way to solve the problem by copying file to local then remove the headers and then copy the updated file to HDFS again but that is not feasible as the file size is too large
Please suggest if anyone have any idea...
Any suggestions will be appreciated....
Thanks in advance
UPDATE or DELETE row operations are not available in Hive/Impala. So you should simulate DELETE as
Load data file into a temporary Hive/Impala table
Use INSERT INTO or CREATE TABLE AS on temp table to create require table
A straightforward approach would be to run the HDFS data through Pig to filter out the headers and generate a new HDFS dataset formatted so that Impala could read it cleanly.
A more arcane approach would depend on the format of the HDFS data. For example, if both header and data lines are tab-delimited, then you could read everything using a schema with all STRING fields and then filter or partition out the headers before doing aggregations.

Kettle: load CSV file which contains multiple data tables

I'm trying to import data from a csv file which, unfortunately, contains multiple data tables. Actually, it's not really a pure csv file.
It contains a header field with some metadata and then the actual csv data parts are separated by:
//-------------
Table <table_nr>;;;;
An example file looks as follows:
Summary;;
Reporting Date;29/05/2013;12:36:18
Report Name;xyz
Reporting Period From;20/05/2013;00:00:00
Reporting Period To;26/05/2013;23:59:59
//-------------
Table 1;;;;
header1;header2;header3;header4;header5
string_aw;0;0;0;0
string_ax;1;1;1;0
string_ay;1;2;0;1
string_az;0;0;0;0
TOTAL;2;3;1;1
//-------------
Table 2;;;
header1;header2;header3;header4
string_bv;2;2;2
string_bw;3;2;3
string_bx;1;1;1
string_by;1;1;1
string_bz;0;0;0
What would be the best way to process load such data using kettle?
Is there a way to split this file into the header and csv data parts and then process each of them as separate inputs?
Thanks in advance for any hints and tips.
Best,
Haes.
I don't think there are any steps that will really help you with data in such a format. You probably need to do some preprocessing before bringing your data into a CSV step. You could still do this in your job, though, by calling out to the shell and executing a command there first, like maybe an awk script to split up the file into its component files and then load those files via the normal Kettle pattern.

Resources