Upsert a csv file from a second file using bash

I have a main csv file with records (file1). I then have a second "delta" csv file (file2). I would like to update the main file with the records from the delta file using bash. Existing records should get the new value (replace the row) and new records should be appended.
Example file1
unique_id|value
'1'|'old'
'2'|'old'
'3'|'old'
Example file2
unique_id|value
'1'|'new'
'4'|'new'
Desired outcome
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'

awk -F '|' '
    # remember the order in which each unique_id was first seen
    !($1 in rows) { ids[id_count++] = $1 }
    # later files overwrite earlier ones, so new.csv wins for matching ids
    { rows[$1] = $0 }
    END {
        for (i = 0; i < id_count; i++)
            print rows[ids[i]]
    }
' old.csv new.csv
Output:
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
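To fold the result back into the main file, one option (a sketch reusing the command above and the same file names) is to write to a temporary file and only replace the original if awk succeeds:
# Sketch: merge into a temp file, then move it over the main file on success.
awk -F '|' '
    !($1 in rows) { ids[id_count++] = $1 }
    { rows[$1] = $0 }
    END { for (i = 0; i < id_count; i++) print rows[ids[i]] }
' old.csv new.csv > old.csv.tmp && mv old.csv.tmp old.csv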
Similar approach using perl
perl -F'\|' -lane '
    $id = $F[0];
    push @ids, $id unless exists $rows{$id};
    $rows{$id} = $_;
    END { print $rows{$_} for @ids }
' old.csv new.csv
You could also use an actual database, e.g. SQLite:
sqlite> create table old (unique_id text primary key, value text);
sqlite> create table new (unique_id text primary key, value text);
# skip headers
sqlite> .sep '|'
sqlite> .import --skip 1 new.csv new
sqlite> .import --skip 1 old.csv old
sqlite> select * from old;
'1'|'old'
'2'|'old'
'3'|'old'
sqlite> insert into old
select * from new where true
on conflict(unique_id)
do update set value=excluded.value;
sqlite> select * from old;
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
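If you'd rather script this than type at the sqlite> prompt, the same upsert can be driven from bash. A sketch, assuming SQLite 3.32+ (for .import --skip) and the file names used above; the merged rows, including the header, go to stdout:
# Sketch: run the upsert non-interactively against a transient in-memory database.
sqlite3 <<'SQL'
.separator "|"
CREATE TABLE old (unique_id TEXT PRIMARY KEY, value TEXT);
CREATE TABLE new (unique_id TEXT PRIMARY KEY, value TEXT);
.import --skip 1 old.csv old
.import --skip 1 new.csv new
INSERT INTO old SELECT * FROM new WHERE true
  ON CONFLICT(unique_id) DO UPDATE SET value = excluded.value;
.headers on
SELECT * FROM old;
SQL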

I immediately thought of join, but you cannot specify "take this column if there's a match, otherwise use another column, and have either output end up in a single column".
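For what it's worth, here is roughly what plain join gives you (a sketch assuming GNU join and the file names file1.csv/file2.csv): the old and new values survive side by side, and collapsing them into one column is exactly what join can't do.
# Sketch: outer join on the first column after stripping headers and sorting.
join -t '|' -a 1 -a 2 -e '' -o 0,1.2,2.2 \
     <(tail -n +2 file1.csv | sort -t '|' -k1,1) \
     <(tail -n +2 file2.csv | sort -t '|' -k1,1)
# '1'|'old'|'new'
# '2'|'old'|
# '3'|'old'|
# '4'||'new'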
For command-line processing of CSV files, I really like GoCSV. It has its own CSV-aware join command—which is also limited like join (above)—and it has other commands that we can chain together to produce the desired output.
GoCSV uses a streaming/buffered reader/writer for as many subcommands as it can; every command but join operates in this buffered-in, buffered-out fashion, since join has to read both sides in full before it can match. Still, GoCSV is compiled and just really, really fast.
All GoCSV commands read the delimiter to use from the GOCSV_DELIMITER environment variable, so your first order of business is to export that for your pipe delimiter:
export GOCSV_DELIMITER='|'
Joining is easy: just specify the column from each file to use as the key. I'm also going to rename the columns now so that we're set up for the conditional logic in the next step. If your columns vary from file to file, you'll want to rename each set of columns first, before you join.
I'm telling gocsv join to pick the first column from both files (-c 1,1) and to use an outer join, keeping both left and right sides regardless of match:
gocsv join -c 1,1 -outer file1.csv file2.csv \
| gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right'
| id_left | val_left | id_right | val_right |
|---------|----------|----------|-----------|
| 1 | old | 1 | new |
| 2 | old | | |
| 3 | old | | |
| | | 4 | new |
There's no way to change a value in an existing column based on another column's value, but we can add new columns and use a templating language to define the logic we need.
The following syntax creates two new columns, id_final and val_final. For each, if the right-hand column (id_right or val_right) has a value, that value is used; otherwise the left-hand value is used. This, combined with the outer join of "left then right" from before, gives us the effect of the right side updating/overwriting the left side where the IDs matched:
... \
| gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
| gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}'
| id_left | val_left | id_right | val_right | id_final | val_final |
|---------|----------|----------|-----------|----------|-----------|
| 1 | old | 1 | new | 1 | new |
| 2 | old | | | 2 | old |
| 3 | old | | | 3 | old |
| | | 4 | new | 4 | new |
Finally, we can select just the "final" fields and rename them back to their original names:
... \
| gocsv select -c 'id_final','val_final' \
| gocsv rename -c 1,2 -names 'unique_id','value'
| unique_id | value |
|-----------|-------|
| 1 | new |
| 2 | old |
| 3 | old |
| 4 | new |
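Putting it all together, the whole pipeline looks something like this (a sketch that simply chains the steps shown above):
# Sketch: the full GoCSV pipeline, writing the merged CSV to stdout.
export GOCSV_DELIMITER='|'
gocsv join -c 1,1 -outer file1.csv file2.csv \
  | gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right' \
  | gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
  | gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}' \
  | gocsv select -c 'id_final','val_final' \
  | gocsv rename -c 1,2 -names 'unique_id','value'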
GoCSV has pre-built binaries for modern platforms.

I use Miller and run
mlr --csv --fs "|" join --ul --ur -j unique_id --lp "l#" --rp "r#" -f 01.csv \
then put 'if(is_not_null(${r#value})){$value=${r#value}}else{$value=$value}' \
then cut -x -r -f '#' 02.csv
and I have
unique_id|value
'1'|'new'
'4'|'new'
'2'|'old'
'3'|'old'
I run a full (outer) join and use an if condition to check whether there is a value on the right; if there is, I use it.
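If you also want the rows back in unique_id order and written out to a file, a variant could look like this (a sketch, assuming Miller 6 and the same 01.csv/02.csv file names; the sort is lexical, which is fine for the quoted single-digit ids in the example):
# Sketch: same join/put/cut chain, plus a final sort on unique_id.
mlr --csv --fs '|' \
    join --ul --ur -j unique_id --lp 'l#' --rp 'r#' -f 01.csv \
    then put 'if (is_not_null(${r#value})) { $value = ${r#value} }' \
    then cut -x -r -f '#' \
    then sort -f unique_id \
    02.csv > merged.csv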

Related

How Vertica handles semi-structured data even if loaded from different file formats

My understanding of semi-structured data handling in Vertica is that if the data looks, say, like this (in JSON)
{
"f1":1,
"f2":"hello",
"f3":false,
"f4":2
}
then a flex table is created with two columns, __identity__ and __raw__. __identity__ will have 4 fields (I suppose the integers 1, 2, 3, 4) and __raw__ will be the raw representation of the data (1, hello, false and 2).
I can also load data from a CSV file into the same flex table, e.g. 2, hello2, true, 3. How does Vertica decide which field maps to which column (e.g. that both f1 and f4 are int)?
Well, nothing beats having a Vertica SQL prompt ready (and the privilege to create a database object ...) to try and find out.
With JSON, the field names are in the structure: key-value pairs.
With CSV, the first line of the data file needs to have the column names - which I add below ...
-- connecting with VSQL,
$ vsql -h localhost -d sbx -U dbadmin -w pwd
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
sbx=> -- create the flex table
sbx=> CREATE FLEX TABLE flx();
CREATE TABLE
sbx=> -- load the flex table from stdin - data handed in-line - using your input
sbx=> COPY flx FROM stdin PARSER fjsonparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> {
>> "f1":1,
>> "f2":"hello",
>> "f3":false,
>> "f4":2
>> }
>> \.
-- test the load ...
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+-------+-------+----
1 | hello | false | 2
sbx=> -- load the CSV file - note that we need the header line,
sbx=> -- which I add, so the same values land in the same fields
sbx=> COPY flx FROM stdin PARSER fcsvparser();
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> f1,f2,f3,f4
>> 2, hello2, true, 3
>> \.
sbx=>-- check the contents now
sbx=> SELECT f1,f2,f3,f4 FROM flx;
f1 | f2 | f3 | f4
----+--------+-------+----
1 | hello | false | 2
2 | hello2 | true | 3
sbx=>-- resulting table definition in catalog ...
sbx=> \d flx
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
---------+-------+--------------+------------------------+--------+---------+----------+-------------+-------------
dbadmin | flx | __identity__ | int | 8 | | t | f |
dbadmin | flx | __raw__ | long varbinary(130000) | 130000 | | t | f |
(2 rows)
sbx=> -- check the contents of __identity__ and (after visualising) __raw__
sbx=> SELECT __identity__,REPLACE(MAPTOSTRING(__raw__),CHR(10),' ') FROM flx;
__identity__ | REPLACE
--------------+------------------------------------------------------------------------
1 | { "f1": "1", "f2": "hello", "f3": "false", "f4": "2" }
2 | { "f1": "2", "f2": "hello2", "f3": "true", "f4": "3" }

update values defined in specific section of config file using bash

I am looking to update one config file using bash.
The config file has multiple sections, like
[SECTION1]
action.email.useNSSubject = 1
dispatch.earliest_time = 1578927600
dispatch.latest_time = 1579016736
search = | inputlookup KPI_MASTER_LIST.csv | search TYPE="MTE_GENERIC" \
| table ALERT Order\
| map maxsearches=21 search="| savedsearch "$$ALERT$$" host_token=$host_token$ SERVICE_EARLIEST_TIME=$SERVICE_EARLIEST_TIME$ time_token.earliest=$time_token.earliest$ time_token.latest=$time_token.latest$ | appendcols [ | makeresults | eval Order="$$Order$$" | fillnull count ] | table ALERT count Order "\
| sort Order \
[SECTION2]
action.email.useNSSubject = 1
alert.track = 0
dispatch.earliest_time = 153437300
dispatch.latest_time = 1549013433
display.general.timeRangePicker.show = 0
search = | inputlookup KPI_MASTER_LIST.csv | search TYPE="MTE_GENERIC" \
| table ALERT Order\
| map maxsearches=21 search="| savedsearch "$$ALERT$$" host_token=$host_token$ SERVICE_EARLIEST_TIME=$SERVICE_EARLIEST_TIME$ time_token.earliest=$time_token.earliest$ time_token.latest=$time_token.latest$ | appendcols [ | makeresults | eval Order="$$Order$$" | fillnull count ] | table ALERT count Order "\
| sort Order \
I am looking to update the values of "dispatch.earliest_time" and "dispatch.latest_time" in a specific section only (not every occurrence in the file).
You could use a sed address range bounded by the section headers, and restrict the substitution of the value to that range:
sed '/\[SECTION_NAME\]/,/^\[/ s/\(dispatch\.earliest_time *= *\).*/\1new_value_here/'
You can find thorough documentation about sed here
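For example, a sketch (GNU sed, a hypothetical config.conf file name, and made-up timestamps) that updates both keys only inside [SECTION2]; drop -i to preview the output before editing in place:
# Sketch: substitute only between the [SECTION2] header and the next section header.
sed -i \
    -e '/^\[SECTION2\]/,/^\[/ s/^\(dispatch\.earliest_time *= *\).*/\11600000000/' \
    -e '/^\[SECTION2\]/,/^\[/ s/^\(dispatch\.latest_time *= *\).*/\11600086400/' \
    config.conf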

How do we get the table descriptions for 1000 tables using Hive?

I have 1000 tables and need to check describe <table name>; for each one. Instead of running them one by one, can you please give me one command to fetch "N" tables in a single shot?
You can make a shell script and call it with a parameter. For example, the following script receives a schema name, builds the list of tables in that schema, calls the DESCRIBE EXTENDED command for each one, extracts the location, and prints it for the first 1000 tables in the schema ordered by name. You can modify it and use it as a single command:
#!/bin/bash
# Create the table list for a schema (script parameter)
HIVE_SCHEMA=$1
echo "Processing Hive schema $HIVE_SCHEMA..."
tablelist=tables_$HIVE_SCHEMA
hive -e "set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" 1> "$tablelist"
# Number of tables to process
tableNum_limit=1000
# For each table do:
for table in $(sort "$tablelist" | head -n "$tableNum_limit")  # adjust the sorting as needed
do
    echo "Processing table $table ..."
    # Call DESCRIBE EXTENDED
    out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
    # Extract the location, for example
    table_location=$(echo "${out}" | egrep -o 'location:[^,]+' | sed 's/location://')
    echo "Table location: $table_location"
    # Do something else here
done
Query the metastore
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;

select d.name as db_name
      ,t.tbl_name
      ,c.integer_idx + 1 as col_position
      ,c.column_name
      ,c.type_name
from DBS as d
join TBLS as t       on t.db_id = d.db_id
join SDS as s        on s.sd_id = t.sd_id
join COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name
        ,t.tbl_name
        ,c.integer_idx;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 | 1 | i | int |
| my_db_2 | my_tbl_2 | 1 | c1 | string |
| my_db_2 | my_tbl_2 | 2 | c2 | date |
| my_db_2 | my_tbl_2 | 3 | c3 | decimal(12,2) |
| my_db_3 | my_tbl_3 | 1 | x | array<int> |
| my_db_3 | my_tbl_3 | 2 | y | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+

How to count the number of words in each column delimited by the "|" separator using Hive?

input data is
+----------------------+--------------------------------+
| movie_name | Genres |
+----------------------+--------------------------------+
| digimon | Adventure|Animation|Children's |
| Slumber_Party_Massac | Horror |
+----------------------+--------------------------------+
I need output like
+----------------------+--------------------------------+-----------------+
| movie_name | Genres | count_of_genres |
+----------------------+--------------------------------+-----------------+
| digimon | Adventure|Animation|Children's | 3 |
| Slumber_Party_Massac | Horror | 1 |
+----------------------+--------------------------------+-----------------+
select *
,size(split(coalesce(Genres,''),'[^|\\s]+'))-1 as count_of_genres
from mytable
This solution covers varying use-cases, including:
NULL values
Empty strings
Empty tokens (e.g. Adventure||Animation or Adventure| |Animation)
This is a really, really bad way to store data. You should have a separate MovieGenres table with one row per movie and per genre.
One method is to use length() and replace():
select t.*,
(1 + length(genres) - length(replace(genres, '|', ''))) as num_genres
from t;
This assumes that each movie has at least one genre. If not, you need to test for that as well.
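If empty or NULL genre strings are possible, one way to guard the length()/replace() approach (a sketch, run through the hive CLI against the table t used above) is a CASE expression:
# Sketch: count genres, but return 0 when the genres string is NULL or empty.
hive -e "
  SELECT t.*,
         CASE
           WHEN genres IS NULL OR genres = '' THEN 0
           ELSE 1 + length(genres) - length(replace(genres, '|', ''))
         END AS num_genres
  FROM t;
"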

Oracle SQL Loader split data into different tables

I have a Data file that looks like this:
1 2 3 4 5 6
FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980
FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980
FirstName3 | LastName3 | 2344327 | Address1 | PhoneNumber1 | 1/1/1980
FirstName4 | LastName4 | 5998943 | Address1 | PhoneNumber1 | 1/1/1980
FirstName5 | LastName5 | 9854531 | Address1 | PhoneNumber1 | 1/1/1980
My DB has 2 tables, one for PERSON and one for ADDRESS, so I need to store columns 1, 2, 3 and 6 in PERSON and columns 4 and 5 in ADDRESS. All examples provided in the SQL*Loader documentation address this case, but only for fixed-size columns, and my data file is pipe delimited (and splitting it into 2 different data files is not an option).
Does someone know how to do this?
As always help will be deeply appreciated.
Another option may be to set up the file as an external table and then run inserts selecting the columns you want from the external table.
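A rough sketch of that approach, assuming a directory object named data_dir that points at the file's location, hypothetical column names and types, and sqlplus access (adjust the date mask and the SKIP clause to your actual file):
# Sketch: expose the pipe-delimited file as an external table, then split it
# into PERSON and ADDRESS with ordinary INSERT ... SELECT statements.
sqlplus -s user/password <<'SQL'
CREATE TABLE staging_ext (
  first_name  VARCHAR2(50),
  last_name   VARCHAR2(50),
  person_id   NUMBER,
  address     VARCHAR2(200),
  phone       VARCHAR2(30),
  birth_date  DATE
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
    (first_name, last_name, person_id, address, phone,
     birth_date CHAR(10) DATE_FORMAT DATE MASK "mm/dd/yyyy")
  )
  LOCATION ('data.dat')
);

INSERT INTO person (first_name, last_name, person_id, birth_date)
  SELECT first_name, last_name, person_id, birth_date FROM staging_ext;

INSERT INTO address (person_id, address, phone)
  SELECT person_id, address, phone FROM staging_ext;

COMMIT;
SQL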
options (skip=1)
load data
infile 'csv file path'
insert
into table person
fields terminated by '|' optionally enclosed by '"'
trailing nullcols
-- hypothetical column names; columns 4 and 5 are read but discarded (FILLER)
(first_name, last_name, person_id, address filler, phone filler, birth_date date "mm/dd/yyyy")
into table address
fields terminated by '|' optionally enclosed by '"'
trailing nullcols
-- position(1) restarts field scanning at the beginning of the record
(first_name filler position(1), last_name filler, person_id filler, address, phone)
Even if SQL*Loader doesn't support this (I'm not sure), nothing stops you from pre-processing the file with, say, awk and then loading. For example:
awk -F '|' -v OFS='|' '{print $1, $2, $3, $6}' 1.dat > person.dat
awk -F '|' -v OFS='|' '{print $4, $5}' 1.dat > address.dat
