Oracle SQL Loader split data into different tables - oracle

I have a Data file that looks like this:
1 2 3 4 5 6
FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980
FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980
FirstName3 | LastName3 | 2344327 | Address1 | PhoneNumber1 | 1/1/1980
FirstName4 | LastName4 | 5998943 | Address1 | PhoneNumber1 | 1/1/1980
FirstName5 | LastName5 | 9854531 | Address1 | PhoneNumber1 | 1/1/1980
My DB has 2 tables, one for PERSON and one for ADDRESS, so I need to store columns 1, 2, 3 and 6 in PERSON and columns 4 and 5 in ADDRESS. The examples in the SQL*Loader documentation cover this case, but only for fixed-width columns, and my data file is pipe-delimited (splitting it into 2 different data files is not an option).
Does anyone know how to do this?
As always, help will be deeply appreciated.

Another option may be to set up the file as an external table and then run inserts selecting the columns you want from the external table.

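-- skip=1 assumes the data file starts with a header line
options (skip=1)
load data
infile "data file path"
insert
-- column names and the date mask are placeholders; use the real PERSON and ADDRESS definitions
into table person
fields terminated by '|'
trailing nullcols
(first_name, last_name, person_id,
 address_skip filler, phone_skip filler,
 birth_date date "MM/DD/YYYY")
into table address
fields terminated by '|'
trailing nullcols
-- position(1) restarts field scanning at the beginning of the record
(f1 filler position(1), f2 filler, f3 filler,
 address, phone_number)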

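Even if SQL*Loader doesn't support this (I'm not sure), nothing stops you from pre-processing the file with, say, awk and then loading the results. Setting OFS keeps the pipe delimiter in the output; without it the printed fields would be concatenated. For example:
awk -F '|' -v OFS='|' '{print $1, $2, $3, $6}' 1.dat > person.dat
awk -F '|' -v OFS='|' '{print $4, $5}' 1.dat > address.dat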
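If awk is not available, the same pre-processing can be sketched in Python. This is only an illustration: the helper name split_columns is made up, and trimming the spaces around each pipe is an assumption about how the data should be cleaned.

```python
def split_columns(lines, keep, sep="|"):
    """Return each record reduced to the 1-based column numbers in `keep`."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Assumption: surrounding spaces are not part of the values.
        fields = [f.strip() for f in line.split(sep)]
        out.append(sep.join(fields[i - 1] for i in keep))
    return out

rows = [
    "FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980",
    "FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980",
]
person_rows = split_columns(rows, keep=(1, 2, 3, 6))
address_rows = split_columns(rows, keep=(4, 5))
print(person_rows[0])
print(address_rows[0])
```

Each resulting list can then be written to its own file and loaded with a separate, single-table control file.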

Related

Upsert a csv file from a second file using bash

I have a main csv file with records (file1). I then have a second "delta" csv file (file2). I would like to update the main file with the records from the delta file using bash. Existing records should get the new value (replace the row) and new records should be appended.
Example file1
unique_id|value
'1'|'old'
'2'|'old'
'3'|'old'
Example file2
unique_id|value
'1'|'new'
'4'|'new'
Desired outcome
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
awk -F '|' '
! ($1 in rows){ ids[id_count++] = $1 }
{ rows[$1] = $0 }
END {
for(i=0; i<id_count; i++)
print rows[ids[i]]
}
' old.csv new.csv
Output:
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
Similar approach using perl
perl -F'\|' -lane '
$id = $F[0];
push @ids, $id unless exists $rows{$id};
$rows{$id} = $_;
END { print $rows{$_} for @ids }
' old.csv new.csv
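For comparison, here is a minimal Python sketch of the same idea: remember each id the first time it is seen and let later rows overwrite earlier ones. The function name upsert is hypothetical, and the sketch relies on dicts preserving insertion order (Python 3.7+).

```python
def upsert(old_lines, new_lines, sep="|"):
    """Merge two lists of delimited rows, keyed on the first field."""
    rows = {}
    for line in list(old_lines) + list(new_lines):
        key = line.split(sep, 1)[0]
        # Re-assigning an existing key replaces the row but keeps its position.
        rows[key] = line
    return list(rows.values())

old = ["unique_id|value", "'1'|'old'", "'2'|'old'", "'3'|'old'"]
new = ["unique_id|value", "'1'|'new'", "'4'|'new'"]
merged = upsert(old, new)
print("\n".join(merged))
```

The header line is handled like any other row: both files carry the same header, so it is kept once, in first position.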
You could also use an actual database e.g. sqlite
sqlite> create table old (unique_id text primary key, value text);
sqlite> create table new (unique_id text primary key, value text);
# skip headers
sqlite> .sep '|'
sqlite> .import --skip 1 new.csv new
sqlite> .import --skip 1 old.csv old
sqlite> select * from old;
'1'|'old'
'2'|'old'
'3'|'old'
sqlite> insert into old
select * from new where true
on conflict(unique_id)
do update set value=excluded.value;
sqlite> select * from old;
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
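The same upsert can be driven from Python's standard sqlite3 module. This is a sketch under a few assumptions: an in-memory database, values stored without the surrounding quotes, and a bundled SQLite of version 3.24 or later (the first release with ON CONFLICT ... DO UPDATE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table old (unique_id text primary key, value text)")
conn.executemany(
    "insert into old values (?, ?)",
    [("1", "old"), ("2", "old"), ("3", "old")],
)

# Rows from the delta file: update on key collision, insert otherwise.
delta = [("1", "new"), ("4", "new")]
conn.executemany(
    "insert into old (unique_id, value) values (?, ?) "
    "on conflict(unique_id) do update set value = excluded.value",
    delta,
)

result = conn.execute(
    "select unique_id, value from old order by unique_id"
).fetchall()
print(result)
```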
I immediately thought of join, but you cannot specify "take this column if there's a match, otherwise use another column, and have either output end up in a single column".
For command-line processing of CSV files, I really like GoCSV. It has its own CSV-aware join command—which is also limited like join (above)—and it has other commands that we can chain together to produce the desired output.
GoCSV uses a streaming/buffered reader/writer for as many subcommands as it can. Every command but join operates in this buffered-in-buffered-out fashion, but join needs to read both sides in total to match. Still, GoCSV is compiled and just really, really fast.
All GoCSV commands read the delimiter to use from the GOCSV_DELIMITER environment variable, so your first order of business is to export that for your pipe delimiter:
export GOCSV_DELIMITER='|'
Joining is easy, just specify the columns from either file to use as the key. I'm also going to rename the columns now so that we're set up for the conditional logic in the next step. If your columns vary from file to file, you'll want to rename each set of columns first, before you join.
I'm telling gocsv join to pick the first columns from both files, -c 1,1 and use an outer join to keep both left and right sides, regardless of match:
gocsv join -c 1,1 -outer file1.csv file2.csv \
| gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right'
| id_left | val_left | id_right | val_right |
|---------|----------|----------|-----------|
| 1 | old | 1 | new |
| 2 | old | | |
| 3 | old | | |
| | | 4 | new |
There's no way to change a value in an existing column based on another column's value, but we can add new columns and use a templating language to define the logic we need.
The following syntax creates two new columns, id_final and val_final. For each column, if the right-side field (id_right or val_right) has a value, that value is used; otherwise the left-side field is used. This, combined with the outer join of "left then right" from before, gives us the effect of the right side updating/overwriting the left side if the IDs matched:
... \
| gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
| gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}'
| id_left | val_left | id_right | val_right | id_final | val_final |
|---------|----------|----------|-----------|----------|-----------|
| 1 | old | 1 | new | 1 | new |
| 2 | old | | | 2 | old |
| 3 | old | | | 3 | old |
| | | 4 | new | 4 | new |
Finally, we can select just the "final" fields and rename them back to their original names:
... \
| gocsv select -c 'id_final','val_final' \
| gocsv rename -c 1,2 -names 'unique_id','value'
| unique_id | value |
|-----------|-------|
| 1 | new |
| 2 | old |
| 3 | old |
| 4 | new |
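To make the logic of the pipeline explicit, here is a small Python sketch of the same two steps: a full outer join on the id, followed by "take the right value if present, otherwise the left". It does not use GoCSV; the function name full_outer_join and the use of empty strings for missing sides are assumptions made for illustration.

```python
def full_outer_join(left, right):
    """Return rows of (id_left, val_left, id_right, val_right); '' marks a missing side."""
    joined = []
    for key, val in left.items():
        if key in right:
            joined.append((key, val, key, right[key]))
        else:
            joined.append((key, val, "", ""))
    for key, val in right.items():
        if key not in left:
            joined.append(("", "", key, val))
    return joined

left = {"1": "old", "2": "old", "3": "old"}
right = {"1": "new", "4": "new"}
joined = full_outer_join(left, right)

# Same rule as the templates: prefer the right side, fall back to the left.
final = [(id_r or id_l, val_r or val_l) for id_l, val_l, id_r, val_r in joined]
print(final)
```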
GoCSV has pre-built binaries for modern platforms.
I use Miller and run
mlr --csv --fs "|" join --ul --ur -j unique_id --lp "l#" --rp "r#" -f 01.csv \
then put 'if(is_not_null(${r#value})){$value=${r#value}}else{$value=$value}' \
then cut -x -r -f '#' 02.csv
and I have
unique_id|value
'1'|'new'
'4'|'new'
'2'|'old'
'3'|'old'
I run a full join and use an if condition to check whether there is a value on the right. If there is, I use it.

Loading data using sqlloader

I am trying to load data from a flat file into a table, but the flat file contains blank lines (a bare LF), and for every such line one empty row is inserted into the table.
I tried functions such as trim and replace, but none of them work here. Could anyone please help?
sample flat file
test.txt
ID NAME
1 abc
2 def
(linefeed)
3 hij
(linefeed)
(linefeed)
4 klm
control file
test.ctl
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET WE8ISO8859P1 length semantics char
TRUNCATE
INTO TABLE test
TRAILING NULLCOLS
( ID char(4) ,
NAME char(18))
command used
sqlldr CONTROL=test.ctl log=test.log data=test.txt USERID=appdata/app@orcl direct=true
table data
| ID| NAME |
| --| -- |
| 1 | abc|
| 2 | def|
| NULL| NULL|
| 3 | hij|
| NULL| NULL|
| NULL| NULL|
| 4| klm|
While loading data into the table, I need to prevent these empty rows from being inserted.
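One way to avoid the empty rows is to strip the blank lines from the flat file before running sqlldr. A minimal Python sketch, assuming a line counts as blank when it contains only whitespace (drop_blank_lines is a made-up helper name):

```python
def drop_blank_lines(lines):
    """Keep only lines that contain something other than whitespace."""
    return [line for line in lines if line.strip()]

# In-memory copy of the sample file from the question.
raw = ["ID NAME\n", "1 abc\n", "2 def\n", "\n", "3 hij\n", "\n", "\n", "4 klm\n"]
cleaned = drop_blank_lines(raw)
print("".join(cleaned), end="")
```

In practice the lines would be read from test.txt and the cleaned result written to a new file, which is then passed to sqlldr through the data= parameter.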
Notepad++ is a free text editor, which contains following command in the Edit menu:

How do we get the 1000 tables description using hive?

I have 1000 tables and need to run describe <table name>; on each of them. Instead of running them one by one, can you please give me one command that fetches the descriptions of N tables in a single shot?
You can write a shell script and call it with a parameter. For example, the following script takes a schema name, builds the list of tables in that schema, runs DESCRIBE EXTENDED on each table, extracts the location, and prints it for the first 1000 tables in the schema ordered by name. You can modify it and use it as a single command:
#!/bin/bash
#Create table list for a schema (script parameter)
HIVE_SCHEMA=$1
echo Processing Hive schema $HIVE_SCHEMA...
tablelist=tables_$HIVE_SCHEMA
hive -e " set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" 1> $tablelist
#number of tables
tableNum_limit=1000
#For each table do:
for table in $(cat $tablelist|sort|head -n "$tableNum_limit") #add proper sorting
do
echo Processing table $table ...
#Call DESCRIBE
out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
#Get location for example
table_location=$(echo "${out}" | egrep -o 'location:[^,]+' | sed 's/location://')
echo Table location: $table_location
#Do something else here
done
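The extraction step in the loop (egrep plus sed) can also be written as a single regular expression. The Python sketch below mirrors the pattern location:[^,]+ used above. The sample string is a made-up fragment for illustration only, not real Hive output, and extract_location is a hypothetical helper name.

```python
import re

def extract_location(describe_output):
    """Return the text after 'location:' up to the next comma, or None."""
    match = re.search(r"location:([^,]+)", describe_output)
    return match.group(1) if match else None

# Made-up fragment shaped like the key:value text the script greps through.
sample = "tableName:my_tbl, location:hdfs://example/warehouse/my_tbl, inputFormat:x"
print(extract_location(sample))
print(extract_location("no location here"))
```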
Query the metastore
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;
select d.name as db_name
      ,t.tbl_name
      ,c.integer_idx + 1 as col_position
      ,c.column_name
      ,c.type_name
from DBS as d
join TBLS as t on t.db_id = d.db_id
join SDS as s on s.sd_id = t.sd_id
join COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name
        ,t.tbl_name
        ,c.integer_idx
;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 | 1 | i | int |
| my_db_2 | my_tbl_2 | 1 | c1 | string |
| my_db_2 | my_tbl_2 | 2 | c2 | date |
| my_db_2 | my_tbl_2 | 3 | c3 | decimal(12,2) |
| my_db_3 | my_tbl_3 | 1 | x | array<int> |
| my_db_3 | my_tbl_3 | 2 | y | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+

how to count number of words in each column delimited by "|" separator using hive?

input data is
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
i need output like
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers varying use-cases, including -
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
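The arithmetic behind the length()/replace() trick is that removing every delimiter shortens the string by exactly the number of delimiters, and a list with n delimiters has n + 1 items. A small Python sketch of the same calculation, with the extra check for missing or empty values added as an assumption about the desired behaviour:

```python
def count_genres(genres, sep="|"):
    """Count delimited items as 1 + number of delimiters; 0 for None or ''."""
    if not genres:
        return 0
    return 1 + len(genres) - len(genres.replace(sep, ""))

print(count_genres("Adventure|Animation|Children's"))
print(count_genres("Horror"))
print(count_genres(None))
```

Unlike the split-based query, this version counts empty tokens too, so Adventure||Animation is reported as 3.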

MySQL blob dump to tab delimited files

I am migrating a MySQL 5.1 database in Amazon's EC2, and I am having issues with tables that use the longblob datatype, which we use for image storage. Basically, after the migration, the data in the longblob column is a different size, apparently because the character encoding is different.
First of all, here is an example of before and after the migration:
Old:
x??]]??}?_ѕ??d??i|w?%?????q$??+?
New:
x��]]����_ѕ��d��i|w�%�����q$��+�
I checked the character set variables on both machines and they are identical. I also checked the 'show create table' output and it is identical as well. The clients both connect the same way (no SET NAMES, no character set specified).
Here is the mysqldump command I used (I tried it without --hex-blob as well):
mysqldump --hex-blob --default-character-set=utf8 --tab=. DB_NAME
Here is how I loaded the data:
mysql DB_NAME --default-character-set=utf8 -e "LOAD DATA INFILE 'EXAMPLE.txt' INTO TABLE EXAMPLE;"
Here are the MySQL character set variables (identical):
Old:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
New:
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | latin1 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
I'm not sure what else to try to be able to run mysqldump and have the blob data be identical on both machines. Any tips would be greatly appreciated.
The issue seems to be a bug in mysql (http://bugs.mysql.com/bug.php?id=27724). The solution is to not use mysqldump, but to write your own SELECT INTO OUTFILE script for the tables that have blob data. Here is an example:
SELECT
COALESCE(column1, @nullval),
COALESCE(column2, @nullval),
COALESCE(HEX(column3), @nullval),
COALESCE(column4, @nullval),
COALESCE(column5, @nullval)
FROM table
INTO OUTFILE '/mnt/dump/table.txt'
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
To load the data:
SET NAMES utf8;
LOAD DATA INFILE '/mnt/dump/table.txt'
INTO TABLE table
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
(column1, column2, @column3, column4, column5)
SET column3 = UNHEX(@column3);
This loads the blob data correctly.
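The reason HEX()/UNHEX() helps is that the hex form contains only the characters 0-9 and A-F, so it passes through a tab-delimited text file unchanged whatever the character set, and never collides with the field or line terminators. A Python sketch of the round trip, using made-up sample bytes that deliberately include a tab and a newline:

```python
# Sample binary value (made up); 0x09 is a tab and 0x0A a newline,
# both of which would corrupt a raw tab-delimited dump.
blob = bytes([0x78, 0x9C, 0x00, 0x09, 0x0A, 0xFF])

# Export side: the equivalent of HEX(column3).
hex_text = blob.hex().upper()
line = "1\t" + hex_text + "\n"

# Import side: split the record, then the equivalent of UNHEX(@column3).
row_id, hex_field = line.rstrip("\n").split("\t")
restored = bytes.fromhex(hex_field)

print(hex_text)
print(restored == blob)
```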
