Multiple files through SQL*Loader - Oracle

I have two requirements for loading data into Oracle tables through the SQL*Loader utility -
Requirement 1
Two .csv files with the same header defined in both files. Skip the header in both and load the combined data into the table.
What is the command to load the data while skipping the headers, and to stop the process if either of the files has errors?
Requirement 2
Two files with the attributes spread across both files, i.e.,
Table Primary key - ID,Name
Cols in First file - ID,Name,Attr1
Cols in Second File - ID,Name,Attr2
Columns in the Oracle table where both files' data will be loaded
ID,Name,Attr1,Attr2
What is the best way to load the attributes from both files in this case?
How should data integrity scenarios be handled? I.e., notify, or do not load the attributes from the 2nd file if the corresponding records in the 1st file are bad records.
Thanks in Advance.

Order the files by id,name:
sort file1.txt > file1.csv
sort file2.txt > file2.csv
If you need to discard the header, use grep -v:
grep -v "header id.." file1.txt | sort > file1.csv
grep -v "header id.." file2.txt | sort > file2.csv
then merge the files using awk:
awk 'BEGIN {FS=","}{getline line < "file1.csv"; print line","$3}' file2.csv > inputSqlLoader.csv
Then load the resulting file with SQL*Loader. Use the SKIP=1 option in SQL*Loader to discard the header, if needed.
To improve performance you might use:
paste -d, file{1..2}.csv | awk 'BEGIN {FS=","}{print $1","$2","$3","$6;}' > inputSqlLoader.csv
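For Requirement 1 specifically, a minimal sketch of the SQL*Loader side could look like the one below. Table, column, file and credential names are placeholders, not anything from the question; ERRORS=0 makes SQL*Loader abort as soon as a record is rejected, and the headers are stripped up front because a single SKIP=1 would only remove the first line of the combined file:
# strip the header from each file, then concatenate (file names are placeholders)
tail -n +2 input1.csv  > combined.dat
tail -n +2 input2.csv >> combined.dat

# control file: adjust the table name and column list to your actual schema
cat > load_combined.ctl <<'EOF'
LOAD DATA
INFILE 'combined.dat'
APPEND
INTO TABLE my_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(ID, NAME, ATTR1)
EOF

# errors=0: the load stops at the first bad record (the record lands in the .bad file)
sqlldr userid=scott/tiger control=load_combined.ctl log=load_combined.log bad=load_combined.bad errors=0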

The way you described it, SQL*Loader is not the tool of choice. External tables, on the other hand, might be.
Why? Because SQL*Loader will load the 2nd file regardless of errors found in the 1st file. Also, you can't load from two files and "merge" data into a single record in the target table. (OK, external tables' background still is SQL*Loader, but that's not the point here.)
However, if each of those CSV files represents an external table, then you can access them using SQL or - maybe even better - PL/SQL. As it is a procedural extension to SQL, you'd create a procedure which "loads" (that would be INSERT) the 1st file's contents into the target table. You'll be able to check whether there were any errors and then proceed to the 2nd file, using either UPDATE or MERGE to set the attr2 column's value.
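A hedged sketch of that external-table route, for Requirement 2 (every object name below is a placeholder; it assumes DATA_DIR is an Oracle DIRECTORY object pointing at the folder that holds the CSV files):
# placeholder credentials; run everything in one sqlplus session
sqlplus -s scott/tiger <<'SQL'
WHENEVER SQLERROR EXIT SQL.SQLCODE ROLLBACK

CREATE TABLE file1_ext (
  id    NUMBER,
  name  VARCHAR2(100),
  attr1 VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY ','
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('file1.csv')
)
REJECT LIMIT 0;

-- file2_ext is created the same way, with attr2 in place of attr1.

-- "Load" file 1; with REJECT LIMIT 0 a single bad record raises an error,
-- WHENEVER SQLERROR then stops the script, so file 2 is never applied.
INSERT INTO target_table (id, name, attr1)
SELECT id, name, attr1 FROM file1_ext;

-- Bring in attr2 from file 2 only for the rows that made it in above.
MERGE INTO target_table t
USING file2_ext s
ON (t.id = s.id AND t.name = s.name)
WHEN MATCHED THEN UPDATE SET t.attr2 = s.attr2;

COMMIT;
SQL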

Related

How to display the latest line based on the file's name or the line's position in bash

I have a tricky question about how to keep the latest log data, as my server reposted it two times.
This is the result after I grep from my folder (I have tons of data; this is just to keep it simple):
...
20150630-201427.csv:20150630,CFIIASU,233,96.21786,0.44644,
20150630-201427.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150630-201427.csv:20150630,CFIIASU_CN,68,102.19569,0.10692
20150630-201427.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150630-201427.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150630-201427.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
...
The data actually came from many csv files; I only picked two csv files to make the example. Here are some explanations:
the example came from two files, 20150630-201427.csv and 20150701-151654.csv, and it has 5 columns, which correspond to date, dataname, data_column1, data_column2, data_column3.
these lines have the same data date 20150630 and the same datanames CFIIASU, CFIIASU_AU, etc., but the numbers in the fourth and fifth columns (which are data_column2 and data_column3) are different.
How could I keep only the data from 20150701-151654.csv, based on the file's name and the data date, and apply that to my whole data set?
To make it clearer: I'd like to keep the lines of "the latest csv", and the latest csv corresponds to the file's name, which in this example is 20150701. But when it comes to my whole data set, I need to handle so many 20xxxxxx.csv files that I can't check them one by one.
For the example I made, this should end up like this:
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
Thanks in advance.
Your question isn't clear but it sounds like this might be what you're trying to do (print all lines from the last csv mentioned in the input file):
$ tac file | awk -F':' 'NR>1 && $1!=prev{exit} {print; prev=$1}' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
or maybe this (print the last line seen for every 20150630,CFIIASU etc. pair in the input file):
$ tac file | awk -F'[:,]' '!seen[$2,$3]++' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
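If it helps, since the YYYYMMDD-HHMMSS file names sort chronologically, another way to get the first result is to pick the newest file-name prefix and keep only its lines. A small sketch, where "file" is the combined grep output shown above:
latest=$(cut -d':' -f1 file | sort | tail -n 1)    # newest file name, e.g. 20150701-151654.csv
awk -F':' -v latest="$latest" '$1 == latest' file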

Merging CSVs into one sees exponentially bigger size

I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db. So my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into a SQL db.
However, when I run the following bash command (to merge all files keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv has a size of 38 GB, at which point the process stops because I have no space left on the device.
So my question is: why would the merged file be more than 50 times bigger than expected? And what can I do to put the data in a sqlite3 db with a reasonable size?
I guess my first question is: if you know how to do a for loop, why do you need to merge all the files into a single CSV file? Can't you just load them one after the other?
But your problem is an infinite loop. Your wildcard (*.csv) includes the file you're writing to. You could put your output file in a different directory or make sure your file glob does not include the output file (for f in file-*.csv maybe).
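A hedged version of that fix (the /tmp path and the file-chunk*.csv glob are assumptions based on the names in your snippet):
head -n1 file-chunk0001.csv > /tmp/merged.csv    # keep a single header
for f in file-chunk*.csv; do
  tail -n +2 "$f" >> /tmp/merged.csv             # append data rows only
done
Because the output now lives outside the glob, the loop can no longer read back what it is writing. And as hinted above, you could also skip the merge entirely and .import each CSV into sqlite3 one at a time.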

Running a number of hive queries and writing output to file

I'm trying to make use of the DESCRIBE function via Hive to output the column descriptions of each of the tables out to individual files. I've discovered the -f option so I can just read from a file and write the output back out:
hive -f nameOfSqlQueryFile.sql > out.txt
However, if I open the output file, it throws all the descriptions back to back and it's unclear where one description starts for a table and where it ends.
So, I've tried making a batch file that uses -e to describe each of the tables individually and output to a file:
#!/bin/bash
nameArr=( $(hive -e 'show tables;') )
count=0
for i in "${nameArr[@]}"
do
  echo "Working on table($count): $i"
  hive -e "describe $i" > "${i}_.txt"
  count=$((count+1))
done
However, because this needs to reconnect for each query, it's remarkably slow, taking hours to process several hundred queries.
Does anyone have an idea of how else I might run each of these DESCRIBE functions, and ideally output to separate files?
You can probably use one of these, depending on how you process the output:
Just use the OK line as a separator and search for it using a script.
Use DESCRIBE EXTENDED which adds a line at the end with info on the table, including its location, which can be used to extract the table name (using sed, for example)
If you're just using the output file as a manual reference, insert a SQL statement that prints a separator of your choice between each table, e.g.:
DESCRIBE table;
SELECT '-----------------' FROM table LIMIT 1;
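A hedged sketch of option 2, i.e. one Hive session plus a post-processing split. The exact wording of the Detailed Table Information line (and whether it really carries "tableName:" followed by the name) varies by Hive version, so treat the awk parsing as an assumption to verify on your install:
# one connection: generate all the DESCRIBE EXTENDED statements, run them in one go
hive -e 'show tables;' | awk 'NF { print "DESCRIBE EXTENDED " $1 ";" }' > describe_all.hql
hive -f describe_all.hql > all_describes.txt

# each description ends with a "Detailed Table Information" line that embeds
# "tableName:" plus the table name; use it to cut the stream into per-table files
awk '{ buf = buf $0 ORS }
     /Detailed Table Information/ {
       match($0, /tableName:[^,)]*/)
       name = substr($0, RSTART + 10, RLENGTH - 10)
       printf "%s", buf > (name "_.txt")
       close(name "_.txt")
       buf = ""
     }' all_describes.txt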

Grep and substitute?

I have to parse ASCII files, output the relevant data to a comma-delimited file and load it into a database table.
The specs for the file format have been recently updated and one section is causing problems. This is the original layout for that section.
CSVHeaderAttr:PUIS,IdleImmediate,POH,Temp,WorstTemp
CSVValuesAttr:NO,NO,9814,31,56
I parse it with grep thusly
CSVAttributes=$(grep ^CSVValuesAttr: ${filename}|cut -d':' -f2)
[ -z "$CSVAttributes" ] && CSVAttributes="NA"
It works great, but now the section has new fields and they are named differently:
CSVHeaderAttr:PUIS,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
CSVValuesAttr:NO,YES,YES,23861,31,51
Right now, I am grepping the files based on their layout (there is a field in the header which tells me the version of the layout) into two different comma-delimited files and loading them into two different tables. I would like to output both sections to the same file so the data scientist only has one table to use in his analysis.
Is there a way to use grep to produce an output like this and substitute empty fields with NA?
For one file type:
CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
CSVValuesAttr:NO,NO,NA,NA,9814,31,56
For the other file type:
CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
CSVValuesAttr:NO,NA,YES,YES,23861,31,51
Thanks for your input.
sed -n '/CSVHeaderAttr:/ c\
CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
/CSVValuesAttr:/ {
/\([^,]*,\)\{5\}/ s/\([^,]*,\)/&NA,/
t p
s/\(\([^,]*,\)\{2\}\)/\1NA,NA,/
# t p
: p
p
}' AllYourFiles > ConcatFile
This sed script tests how many columns a CSVValuesAttr line has (with /\([^,]*,\)\{5\}/) to decide which layout it came from, then inserts NA in the appropriate positions; CSVHeaderAttr lines are replaced with the new, unified header.
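If the sed feels too cryptic, here is an equivalent awk sketch (it assumes exactly the two fixed layouts from the question and, like the sed above, drops any other lines):
awk -F',' -v OFS=',' '
  /^CSVHeaderAttr:/ {
    print "CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp"
    next
  }
  /^CSVValuesAttr:/ {
    if (NF == 6)                           # new layout: Supported/Enabled columns present
      print $1, "NA", $2, $3, $4, $5, $6
    else                                   # old layout: single IdleImmediate column
      print $1, $2, "NA", "NA", $3, $4, $5
  }
' AllYourFiles > ConcatFile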

How to find the number of columns within a row key in hbase

How to find the number of columns within a row key in hbase (since a row can have many columns)
I don't think there's a direct way to do that as each row can have a different number of columns and they may be spread over several files.
If you don't want to bring the whole row to the client to perform the count there, you can write an endpoint coprocessor (HBase's version of a stored procedure, if you like) to perform the calculation on the region server side and only return the result. You can read a little about coprocessors here.
There is a simple way:
Use hbase shell to scan through the table and write the output to an intermediate text file. Because hbase shell output splits each column of a row onto a new line, we can just count the lines in the text file (minus the first 6 lines, which are hbase shell standard output, and the last 2 lines).
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow'}" | hbase shell > row.txt
wc -l row.txt
Make sure to select the appropriate row keys, as the borders are not inclusive.
If you are only interested into specific columns (families), apply the filters in the hbase shell command above (e.g. FamilyFilter, ColumnRangeFilter, ...).
Thanks to #user3375803; actually you don't have to use an external txt file. Since I cannot comment on your answer, I leave my answer below:
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow'}" | hbase shell | wc -l | awk '{print $1-8}'
