How Vertica handles semi-structured data even if loaded from different file formats - vertica

My understanding of semi-structured data handling in Vertica is that if the data looks, say, like this (in JSON):
{
"f1":1,
"f2":"hello",
"f3":false,
"f4":2
}
then a flex table is created with two columns, __identity__ and __raw__. __identity__ will have 4 fields (I suppose the integers 1, 2, 3, 4) and __raw__ will hold the raw representation of the data (1, hello, false and 2).
I can also load data from a CSV file into the same flex table, e.g. 2, hello2, true, 3. How does Vertica decide which field maps to which column (e.g. both f1 and f4 are int)?

Well, nothing beats having a Vertica SQL prompt ready (and the privilege to create a database object ...) to try and find out.
With JSON, the field names are in the structure: key-value pairs.
With CSV, the first line of the data file needs to have the column names - which I add below ...
-- connecting with VSQL,
$ vsql -h localhost -d sbx -U dbadmin -w pwd
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
sbx=> -- create the flex table
sbx=> CREATE FLEX TABLE flx();
CREATE TABLE
sbx=> -- load the flex table from stdin - data handed in-line - using your input
sbx=> COPY flx FROM stdin PARSER fjsonparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {
>> "f1":1,
>> "f2":"hello",
>> "f3":false,
>> "f4":2
>> }
>> \.
sbx=> -- test the load ...
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+-------+-------+----
1 | hello | false | 2
sbx=> -- load the CSV file - note that we need the header line,
sbx=> -- which I add, so the values end up under the same field names
sbx=> COPY flx FROM stdin PARSER fcsvparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> f1,f2,f3,f4
>> 2, hello2, true, 3
>> \.
sbx=> -- check the contents now
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+--------+-------+----
1 | hello | false | 2
2 | hello2 | true | 3
sbx=> -- resulting table definition in the catalog ...
sbx=> \d flx
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
---------+-------+--------------+------------------------+--------+---------+----------+-------------+-------------
dbadmin | flx | __identity__ | int | 8 | | t | f |
dbadmin | flx | __raw__ | long varbinary(130000) | 130000 | | t | f |
(2 rows)
sbx=> -- check the contents of __identity__ and (after visualising) __raw__
sbx=> SELECT __identity__,REPLACE(MAPTOSTRING(__raw__),CHR(10),' ') FROM flx;
__identity__ | REPLACE
--------------+------------------------------------------------------------------------
1 | { "f1": "1", "f2": "hello", "f3": "false", "f4": "2" }
2 | { "f1": "2", "f2": "hello2", "f3": "true", "f4": "3" }
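In other words, neither parser relies on column position or type at load time: fjsonparser keys every value by its JSON field name, fcsvparser keys every value by the column name in the header row, and both store the result as key/value pairs (a VMap) in __raw__, while __identity__ is just a per-row identifier. As a quick check from the shell, you can pull individual keys straight out of the VMap - a minimal sketch, reusing the flx table and connection settings from above; MAPLOOKUP returns NULL for a key that is not present in a row:
$ vsql -h localhost -d sbx -U dbadmin -w pwd -c "
    SELECT __identity__,
           MAPLOOKUP(__raw__, 'f1') AS f1,   -- extract a single key from the VMap
           MAPLOOKUP(__raw__, 'f3') AS f3
    FROM   flx;"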

Related

Upsert a csv file from a second file using bash

I have a main csv file with records (file1). I then have a second "delta" csv file (file2). I would like to update the main file with the records from the delta file using bash. Existing records should get the new value (replace the row) and new records should be appended.
Example file1
unique_id|value
'1'|'old'
'2'|'old'
'3'|'old'
Example file2
unique_id|value
'1'|'new'
'4'|'new'
Desired outcome
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
awk -F '|' '
  # remember the order in which new ids first appear
  ! ($1 in rows) { ids[id_count++] = $1 }
  # last assignment wins, so rows from new.csv overwrite rows from old.csv
  { rows[$1] = $0 }
  END {
    for (i = 0; i < id_count; i++)
      print rows[ids[i]]
  }
' old.csv new.csv
Output:
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
Similar approach using perl
perl -F'\|' -lane '
  $id = $F[0];
  push @ids, $id unless exists $rows{$id};
  $rows{$id} = $_;
  END { print $rows{$_} for @ids }
' old.csv new.csv
You could also use an actual database, e.g. sqlite:
sqlite> create table old (unique_id text primary key, value text);
sqlite> create table new (unique_id text primary key, value text);
# skip headers
sqlite> .sep '|'
sqlite> .import --skip 1 new.csv new
sqlite> .import --skip 1 old.csv old
sqlite> select * from old;
'1'|'old'
'2'|'old'
'3'|'old'
sqlite> insert into old
select * from new where true
on conflict(unique_id)
do update set value=excluded.value;
sqlite> select * from old;
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
I immediately thought of join, but you cannot specify "take this column if there's a match, otherwise use another column, and have either output end up in a single column".
For command-line processing of CSV files, I really like GoCSV. It has its own CSV-aware join command—which is also limited like join (above)—and it has other commands that we can chain together to produce the desired output.
GoCSV uses a streaming/buffered reader/writer for as many subcommands as it can. Every command but join operates in this buffered-in-buffered-out fashion, but join needs to read both sides in total to match. Still, GoCSV is compiled and just really, really fast.
All GoCSV commands read the delimiter to use from the GOCSV_DELIMITER environment variable, so your first order of business is to export that for your pipe delimiter:
export GOCSV_DELIMITER='|'
Joining is easy: just specify the column from each file to use as the key. I'm also going to rename the columns now so that we're set up for the conditional logic in the next step. If your columns vary from file to file, you'll want to rename each set of columns first, before you join.
I'm telling gocsv join to use the first column of each file as the key, -c 1,1, and to do an outer join so both left and right sides are kept regardless of match:
gocsv join -c 1,1 -outer file1.csv file2.csv \
| gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right'
| id_left | val_left | id_right | val_right |
|---------|----------|----------|-----------|
| 1 | old | 1 | new |
| 2 | old | | |
| 3 | old | | |
| | | 4 | new |
There's no way to change a value in an existing column based on another column's value, but we can add new columns and use a templating language to define the logic we need.
The following syntax creates two new columns, id_final and val_final. For each, if the right-hand column (id_right or val_right) has a value, that value is used; otherwise the left-hand column is used. This, combined with the outer join of "left then right" from before, gives us the effect of the right side updating/overwriting the left side where the IDs matched:
... \
| gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
| gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}'
| id_left | val_left | id_right | val_right | id_final | val_final |
|---------|----------|----------|-----------|----------|-----------|
| 1 | old | 1 | new | 1 | new |
| 2 | old | | | 2 | old |
| 3 | old | | | 3 | old |
| | | 4 | new | 4 | new |
Finally, we can select just the "final" fields and rename them back to their original names:
... \
| gocsv select -c 'id_final','val_final' \
| gocsv rename -c 1,2 -names 'unique_id','value'
| unique_id | value |
|-----------|-------|
| 1 | new |
| 2 | old |
| 3 | old |
| 4 | new |
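Putting all of the above together as one pipeline (just the commands already shown, chained end to end; file names as in the question):
export GOCSV_DELIMITER='|'
gocsv join -c 1,1 -outer file1.csv file2.csv \
  | gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right' \
  | gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
  | gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}' \
  | gocsv select -c 'id_final','val_final' \
  | gocsv rename -c 1,2 -names 'unique_id','value'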
GoCSV has pre-built binaries for modern platforms.
I use Miller and run
mlr --csv --fs "|" join --ul --ur -j unique_id --lp "l#" --rp "r#" -f 01.csv \
  then put 'if(is_not_null(${r#value})){$value=${r#value}}else{$value=$value}' \
  then cut -x -r -f '#' 02.csv
and I have
unique_id|value
'1'|'new'
'4'|'new'
'2'|'old'
'3'|'old'
I run a full (outer) join, then use an if condition to check whether there is a value coming from the right-hand file; if there is, I use it.

Loading data using sqlloader

I am trying to load data from a flat file into a table, but the flat file has extra line feeds (LF), so for every blank line an empty row gets inserted into the table.
I tried commands like trim and replace, but none of them work here. Could anyone please help?
sample flat file
test.txt
ID NAME
1 abc
2 def
(linefeed)
3 hij
(linefeed)
(linefeed)
4 klm
control file
test.ctl
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET WE8ISO8859P1 length semantics char
TRUNCATE
INTO TABLE test
TRAILING NULLCOLS
( ID char(4) ,
NAME char(18))
command used
sqlldr CONTROL=test.ctl log=test.log data=test.txt USERID=appdata/app@orcl direct=true
table data
| ID| NAME |
| --| -- |
| 1 | abc|
| 2 | def|
| NULL| NULL|
| 3 | hij|
| NULL| NULL|
| NULL| NULL|
| 4| klm|
While loading the data into the table, I need to prevent these empty rows from being inserted.
Notepad++ is a free text editor whose Edit menu contains a command to remove empty lines (Edit > Line Operations > Remove Empty Lines); run it on the flat file and save it before loading.
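If you would rather not touch the file in an editor at all, you can strip the blank lines on the command line before handing the file to SQL*Loader - a minimal sketch using the file and control file names from the question (test_clean.txt is just a name I picked for the filtered copy):
# drop lines that are empty or contain only whitespace, then load the cleaned copy
grep -v '^[[:space:]]*$' test.txt > test_clean.txt
sqlldr CONTROL=test.ctl log=test.log data=test_clean.txt USERID=appdata/app@orcl direct=true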

How do we get the 1000 tables description using hive?

I have 1000 tables and need to check describe <table name>; for each one. Instead of running them one by one, can you please give me a single command to fetch the descriptions of N tables in one shot?
You can make a shell script and call it with a parameter. For example, the following script receives a schema name, prepares the list of tables in that schema, calls the DESCRIBE EXTENDED command for each, extracts the location, and prints the table location for the first 1000 tables in the schema ordered by name. You can modify it and use it as a single command:
#!/bin/bash
#Create the table list for a schema (script parameter)
HIVE_SCHEMA=$1
echo Processing Hive schema $HIVE_SCHEMA...
tablelist=tables_$HIVE_SCHEMA
hive -e "set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" 1> $tablelist
#number of tables
tableNum_limit=1000
#For each table do:
for table in $(cat $tablelist | sort | head -n "$tableNum_limit") #add proper sorting
do
  echo Processing table $table ...
  #Call DESCRIBE
  out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
  #Get the location, for example
  table_location=$(echo "${out}" | egrep -o 'location:[^,]+' | sed 's/location://')
  echo Table location: $table_location
  #Do something else here
done
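Assuming you save the script as, say, describe_tables.sh (the file name is just an example) and make it executable, you call it with the schema name as its only argument:
chmod +x describe_tables.sh
./describe_tables.sh my_schema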
Query the metastore
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;

select  d.name            as db_name
       ,t.tbl_name
       ,c.integer_idx + 1 as col_position
       ,c.column_name
       ,c.type_name
from       DBS        as d
      join TBLS       as t  on t.db_id = d.db_id
      join SDS        as s  on s.sd_id = t.sd_id
      join COLUMNS_V2 as c  on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name
        ,t.tbl_name
        ,c.integer_idx
;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 | 1 | i | int |
| my_db_2 | my_tbl_2 | 1 | c1 | string |
| my_db_2 | my_tbl_2 | 2 | c2 | date |
| my_db_2 | my_tbl_2 | 3 | c3 | decimal(12,2) |
| my_db_3 | my_tbl_3 | 1 | x | array<int> |
| my_db_3 | my_tbl_3 | 2 | y | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+
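If you want the same listing without an interactive MySQL session, you can run the query through the mysql command-line client and dump the result to a file - a sketch; the host, user and metastore database name depend on your installation:
mysql -h metastore_host -u hive_user -p -B metastore \
      -e "select d.name, t.tbl_name, c.integer_idx + 1, c.column_name, c.type_name
          from DBS d
          join TBLS t on t.db_id = d.db_id
          join SDS s on s.sd_id = t.sd_id
          join COLUMNS_V2 c on c.cd_id = s.cd_id
          order by d.name, t.tbl_name, c.integer_idx" > all_columns.tsv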

Dropping hive partition based on certain condition in runtime

I have a table in hive built using the following command:
create table t1 (x int, y int, s string) partitioned by (wk int) stored as sequencefile;
The table has the data below:
select * from t1;
+-------+-------+-------+--------+--+
| t1.x | t1.y | t1.s | t1.wk |
+-------+-------+-------+--------+--+
| 1 | 2 | abc | 10 |
| 4 | 5 | xyz | 11 |
| 7 | 8 | pqr | 12 |
+-------+-------+-------+--------+--+
Now the requirement is to drop the oldest partition when the partition count is >= 2.
Can this be handled in HQL or through a shell script, and how?
Note that I will be passing the database name as a variable, e.g. hive -e "use $dbname; show partitions t1".
If your partitions are ordered by date, you could write a shell script that uses hive -e 'SHOW PARTITIONS t1' to get all partitions. In your example it will return:
wk=10
wk=11
wk=12
Then you can issue hive -e 'ALTER TABLE t1 DROP PARTITION (wk=10)' to remove the first (oldest) partition.
So something like:
db=mydb
# nothing to do if there are fewer than 2 partitions
if (( `hive -e "use $db; SHOW PARTITIONS t1" | grep wk | wc -l` < 2 )); then
  exit
fi
# the first partition listed is the one to drop
partition=`hive -e "use $db; SHOW PARTITIONS t1" | grep wk | head -1`
hive -e "use $db; ALTER TABLE t1 DROP PARTITION ($partition)"
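One caveat: head -1 takes whatever SHOW PARTITIONS lists first, which is usually lexical order, and that stops matching numeric order once wk reaches two digits (wk=10 sorts before wk=9). If the partition key is numeric, sorting on the value after the = sign is safer, something like:
# sort numerically on the value after "wk=" so wk=9 counts as older than wk=10
partition=`hive -e "use $db; SHOW PARTITIONS t1" | grep wk | sort -t'=' -k2 -n | head -1`
hive -e "use $db; ALTER TABLE t1 DROP PARTITION ($partition)"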

vsql/vertica, how to copy text input file's name into destination table

I have to copy an input text file (text_file.txt) into a table (table_a). I also need to include the input file's name in the table.
my code is:
\set t_pwd `pwd`
\set input_file '\'':t_pwd'/text_file.txt\''
copy table_a
( column1
,column2
,column3
,FileName :input_file
)
from :input_file
The last line does not copy the input text file's name into the table.
How can I copy the input text file's name into the table (without manually typing the file name)?
Solution 1
This might not be the perfect solution for your job, but I think it will do the job:
You can get the file name, store it in a variable (TBL below), and then append that value to the end of each line in the file you are about to load into Vertica.
Depending on the size of the file, this can be quite time- and CPU-consuming.
export TBL=`ls -1 | grep '\.txt'`
sed -i -e "s/$/,$TBL/" "$TBL"
Example:
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10
[dbadmin@bih001 ~]$ export TBL=`ls -1 | grep load`
[dbadmin@bih001 ~]$ sed -i -e "s/$/|$TBL/" "$TBL"
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10|load_data1
Solution 2
You can use a DEFAULT CONSTRAINT, see example:
1. Create your table with a DEFAULT CONSTRAINT
[dbadmin@bih001 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> create table TBL (id int ,CSV_FILE_NAME varchar(200) default 'TBL');
CREATE TABLE
dbadmin=> \dt
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
See the DEFAULT CONSTRAINT: it has the 'TBL' default value
dbadmin=> \d TBL
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+---------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | 'TBL' | f | f |
(2 rows)
2. Now set up your COPY variables
- insert some data and alter the DEFAULT value to your current :input_file value.
dbadmin=> \set t_pwd `pwd`
dbadmin=> \set CSV_FILE `ls -1 | grep load*`
dbadmin=> \set input_file '\'':t_pwd'/':CSV_FILE'\''
dbadmin=>
dbadmin=>
dbadmin=> insert into TBL values(1);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+---------------
1 | TBL
(1 row)
dbadmin=> ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
ALTER TABLE
dbadmin=> \dt TBL;
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
dbadmin=> \d TBL;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+----------------------------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | '/home/dbadmin/load_data1' | f | f |
(2 rows)
dbadmin=> insert into TBL values(2);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+--------------------------
1 | TBL
2 | /home/dbadmin/load_data1
(2 rows)
Now you can implement this in your copy script.
Example:
\set t_pwd `pwd`
\set CSV_FILE `ls -1 | grep load*`
\set input_file '\'':t_pwd'/':CSV_FILE'\''
ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
copy TBL from :input_file DELIMITER '|' DIRECT;
Solution 3
Use the LOAD_STREAMS table
Example:
When loading a table give it a stream name - this way you can identify the file name / stream name:
COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';
Here is how you can query your load_streams table:
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, unsorted_row_count, sorted_row_count,
sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
Makes sense? Hope this helped!
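To tie a load back to the file it came from, you can then filter on the stream name you supplied in the COPY - a small sketch reusing the stream name and columns shown above:
vsql -c "SELECT stream_name, table_name, accepted_row_count, rejected_row_count
         FROM load_streams
         WHERE stream_name = 'My stream name';"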
If you do not need to do it purely from inside vsql, it might be possible to cheat a bit and move the logic outside Vertica, into bash for example:
FILE=text_file.txt
(
  while IFS= read -r LINE; do
    echo "$LINE|$FILE"
  done < "$FILE"
) | vsql -c 'copy table_a (...) FROM STDIN'
That way you basically COPY FROM STDIN, adding the filename to each line before it even reaches Vertica.
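The shell while-read loop gets slow on large files; the same idea can be done with awk feeding vsql - a sketch, using the column names from the question, with the delimiter assumed to be '|':
FILE=text_file.txt
awk -v f="$FILE" '{ print $0 "|" f }' "$FILE" \
  | vsql -c "COPY table_a (column1, column2, column3, FileName) FROM STDIN DELIMITER '|'"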
