How do we get the descriptions of 1000 tables using Hive? - hadoop

I have 1000 tables and need to run describe <table name>; on each of them one by one. Instead of running them individually, is there a single command that fetches the descriptions of "N" tables in one shot?

You can write a shell script and call it with a parameter. For example, the following script receives a schema name, builds the list of tables in that schema, calls the DESCRIBE EXTENDED command for each, extracts the location, and prints the table location for the first 1000 tables in the schema, ordered by name. You can modify it and use it as a single command:
#!/bin/bash
# Create the table list for a schema (script parameter)
HIVE_SCHEMA=$1
echo "Processing Hive schema $HIVE_SCHEMA..."
tablelist=tables_$HIVE_SCHEMA
hive -e "set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" 1> "$tablelist"

# Number of tables to process
tableNum_limit=1000

# For each table (sorted by name, limited to the first N) do:
for table in $(sort "$tablelist" | head -n "$tableNum_limit")
do
    echo "Processing table $table ..."
    # Call DESCRIBE EXTENDED in silent mode
    out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
    # Extract the location, for example
    table_location=$(echo "${out}" | egrep -o 'location:[^,]+' | sed 's/location://')
    echo "Table location: $table_location"
    # Do something else here
done
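Assuming you save the script under a name of your own choosing (describe_tables.sh below is only an illustration) and make it executable, you can then run the whole thing as a single command:
chmod +x describe_tables.sh
./describe_tables.sh my_schema    # my_schema is a placeholder for your Hive schema name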

Query the metastore
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;

select d.name as db_name
      ,t.tbl_name
      ,c.integer_idx + 1 as col_position
      ,c.column_name
      ,c.type_name
from DBS as d
join TBLS as t on t.db_id = d.db_id
join SDS as s on s.sd_id = t.sd_id
join COLUMNS_V2 as c on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name
        ,t.tbl_name
        ,c.integer_idx
;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name                 |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 |            1 | i           | int                       |
| my_db_2 | my_tbl_2 |            1 | c1          | string                    |
| my_db_2 | my_tbl_2 |            2 | c2          | date                      |
| my_db_2 | my_tbl_2 |            3 | c3          | decimal(12,2)             |
| my_db_3 | my_tbl_3 |            1 | x           | array<int>                |
| my_db_3 | my_tbl_3 |            2 | y           | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+
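If you prefer a one-shot command over an interactive MySQL session, the same query can be run through the mysql command-line client and dumped to a file. This is only a sketch: it assumes the metastore database is named metastore and that a user (hiveuser here) has read access to it; adjust the credentials and the LIKE pattern to your schemas.
mysql -u hiveuser -p -D metastore -B -e "
select d.name as db_name, t.tbl_name, c.integer_idx + 1 as col_position,
       c.column_name, c.type_name
from DBS d
join TBLS t on t.db_id = d.db_id
join SDS s on s.sd_id = t.sd_id
join COLUMNS_V2 c on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name, t.tbl_name, c.integer_idx;
" > table_columns.tsv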

Related

Upsert a csv file from a second file using bash

I have a main csv file with records (file1). I then have a second "delta" csv file (file2). I would like to update the main file with the records from the delta file using bash. Existing records should get the new value (replace the row) and new records should be appended.
Example file1
unique_id|value
'1'|'old'
'2'|'old'
'3'|'old'
Example file2
unique_id|value
'1'|'new'
'4'|'new'
Desired outcome
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
awk -F '|' '
    ! ($1 in rows) { ids[id_count++] = $1 }   # remember the first-seen order of each id
    { rows[$1] = $0 }                         # later files overwrite earlier rows
    END {
        for (i = 0; i < id_count; i++)
            print rows[ids[i]]
    }
' old.csv new.csv
Output:
unique_id|value
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
Similar approach using perl
perl -F'\|' -lane '
    $id = $F[0];
    push @ids, $id unless exists $rows{$id};   # remember the first-seen order
    $rows{$id} = $_;                           # later files overwrite earlier rows
    END { print $rows{$_} for @ids }
' old.csv new.csv
You could also use an actual database, e.g. sqlite:
sqlite> create table old (unique_id text primary key, value text);
sqlite> create table new (unique_id text primary key, value text);
# skip headers
sqlite> .sep '|'
sqlite> .import --skip 1 new.csv new
sqlite> .import --skip 1 old.csv old
sqlite> select * from old;
'1'|'old'
'2'|'old'
'3'|'old'
sqlite> insert into old
select * from new where true
on conflict(unique_id)
do update set value=excluded.value;
sqlite> select * from old;
'1'|'new'
'2'|'old'
'3'|'old'
'4'|'new'
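To get the merged rows back out of sqlite as a pipe-delimited file, you can reuse the original header line and dump the table from the shell. This assumes the session above was run against a database file (merged.db is just an illustrative name) rather than an in-memory database:
{ head -n 1 old.csv; sqlite3 -separator '|' merged.db 'select * from old;'; } > merged.csv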
I immediately thought of join, but you cannot specify "take this column if there's a match, otherwise use another column, and have either output end up in a single column".
For command-line processing of CSV files, I really like GoCSV. It has its own CSV-aware join command—which is also limited like join (above)—and it has other commands that we can chain together to produce the desired output.
GoCSV uses a streaming/buffered reader/writer for as many subcommands as it can. Every command but join operates in this buffered-in-buffered-out fashion, but join needs to read both sides in total to match. Still, GoCSV is compiled and just really, really fast.
All GoCSV commands read the delimiter to use from the GOCSV_DELIMITER environment variable, so your first order of business is to export that for your pipe delimiter:
export GOCSV_DELIMITER='|'
Joining is easy: just specify the column from each file to use as the key. I'm also going to rename the columns now so that we're set up for the conditional logic in the next step. If your columns vary from file to file, you'll want to rename each set of columns first, before you join.
I'm telling gocsv join to use the first column from both files (-c 1,1) and to do an outer join (-outer) so that both left and right sides are kept, regardless of match:
gocsv join -c 1,1 -outer file1.csv file2.csv \
| gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right'
| id_left | val_left | id_right | val_right |
|---------|----------|----------|-----------|
| 1       | old      | 1        | new       |
| 2       | old      |          |           |
| 3       | old      |          |           |
|         |          | 4        | new       |
There's no way to change a value in an existing column based on another column's value, but we can add new columns and use a templating language to define the logic we need.
The following syntax creates two new columns, id_final and val_final. For each, if there's a value in the right-hand column (id_right or val_right) it is used, otherwise the left-hand value is used. This, combined with the outer join of "left then right" from before, gives us the effect of the right side updating/overwriting the left side where the IDs matched:
... \
| gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
| gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}'
| id_left | val_left | id_right | val_right | id_final | val_final |
|---------|----------|----------|-----------|----------|-----------|
| 1       | old      | 1        | new       | 1        | new       |
| 2       | old      |          |           | 2        | old       |
| 3       | old      |          |           | 3        | old       |
|         |          | 4        | new       | 4        | new       |
Finally, we can select just the "final" fields and rename them back to their original names:
... \
| gocsv select -c 'id_final','val_final' \
| gocsv rename -c 1,2 -names 'unique_id','value'
| unique_id | value |
|-----------|-------|
| 1         | new   |
| 2         | old   |
| 3         | old   |
| 4         | new   |
GoCSV has pre-built binaries for modern platforms.
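For reference, the three steps above can be chained into one pipeline; this is just the commands already shown, written end to end with the question's file names:
export GOCSV_DELIMITER='|'
gocsv join -c 1,1 -outer file1.csv file2.csv \
  | gocsv rename -c 1,2,3,4 -names 'id_left','val_left','id_right','val_right' \
  | gocsv add -name 'id_final' -t '{{ if .id_right }}{{ .id_right }}{{ else }}{{ .id_left }}{{ end }}' \
  | gocsv add -name 'val_final' -t '{{ if .val_right }}{{ .val_right }}{{ else }}{{ .val_left }}{{ end }}' \
  | gocsv select -c 'id_final','val_final' \
  | gocsv rename -c 1,2 -names 'unique_id','value' \
  > merged.csv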
I use Miller and run
mlr --csv --fs "|" join --ul --ur -j unique_id --lp "l#" --rp "r#" -f 01.csv \
    then put 'if(is_not_null(${r#value})){$value=${r#value}}else{$value=$value}' \
    then cut -x -r -f '#' 02.csv
and I have
unique_id|value
'1'|'new'
'4'|'new'
'2'|'old'
'3'|'old'
I run a full join and use an if condition to check whether there is a value on the right side; if there is, I use it.
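If the output should come back in key order (the desired outcome lists 1 through 4), a sort verb can be appended to the same chain; this is plain Miller, nothing beyond the commands already used plus sort:
mlr --csv --fs "|" join --ul --ur -j unique_id --lp "l#" --rp "r#" -f 01.csv \
    then put 'if(is_not_null(${r#value})){$value=${r#value}}else{$value=$value}' \
    then cut -x -r -f '#' \
    then sort -f unique_id 02.csv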

Change table column name parquet format Hadoop

I have a table with columns a, b, c.
The data is stored on HDFS as parquet. Is it possible to change a specific column name even if the parquet was already written with the schema a, b, c?
read each file in a loop
create a new df with the changed column name
write the new df in append mode to another dir
move this new dir to the read dir (a sketch of this step follows the code below)
import subprocess

# List the parquet files in the output directory.
# Assumes an existing SparkSession `spark` and that OutDir/newdir are already defined.
cmd = ['hdfs', 'dfs', '-ls', OutDir]
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
for i in process.communicate():
    if i:
        for j in i.decode('utf-8').strip().split():
            if j.endswith('snappy.parquet'):
                print('reading file ', j)
                mydf = spark.read.format("parquet").option("inferSchema", "true") \
                    .option("header", "true") \
                    .load(j)
                print('df built on bad file')
                mydf.createOrReplaceTempView("dtl_rev")
                # rename the columns via SQL aliases
                ssql = """select old_name AS new_name,
                                 old_col AS new_col from dtl_rev"""
                newdf = spark.sql(ssql)
                print('df built with renamed columns')
                newdf.write.format("parquet").mode("append").save(newdir)
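Step 4 (moving the rewritten directory back over the directory you read from) is not shown above. A minimal sketch from the shell, with placeholder paths standing in for the OutDir and newdir values used in the code:
# back up the original directory, then promote the rewritten copy
hdfs dfs -mv /data/out_dir /data/out_dir_backup
hdfs dfs -mv /data/new_dir /data/out_dir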
We cannot rename a column in the existing files; parquet stores the schema in the data file itself.
We can check the schema with the command below:
parquet-tools schema part-m-00000.parquet
To rename a column we have to back the data up into a temp table and re-ingest the historical data.
Try using ALTER TABLE:
desc p;
+-------------------------+------------+----------+--+
|        col_name         | data_type  | comment  |
+-------------------------+------------+----------+--+
| category_id             | int        |          |
| category_department_id  | int        |          |
| category_name           | string     |          |
+-------------------------+------------+----------+--+
alter table p change column category_id id int;
desc p;
+-------------------------+------------+----------+--+
|        col_name         | data_type  | comment  |
+-------------------------+------------+----------+--+
| id                      | int        |          |
| category_department_id  | int        |          |
| category_name           | string     |          |
+-------------------------+------------+----------+--+

Automatically generating documentation about the structure of the database

There is a database that contains several views and tables.
I need to create a report (documentation of the database) listing all the fields in these tables, indicating the type and, if possible, the minimum/maximum values and the value from the first row. For example:
.------------.--------.--------.--------------.--------------.--------------.
| Table name | Column | Type   | MinValue     | MaxValue     | FirstRow     |
:------------+--------+--------+--------------+--------------+--------------:
| Table1     | day    | date   | '2010-09-17' | '2016-12-10' | '2016-12-10' |
:------------+--------+--------+--------------+--------------+--------------:
| Table1     | price  | double | 1030.8       | 29485.7      | 6023.8       |
:------------+--------+--------+--------------+--------------+--------------:
| …          |        |        |              |              |              |
:------------+--------+--------+--------------+--------------+--------------:
| TableN     | day    | date   | '2014-06-20' | '2016-11-28' | '2016-11-16' |
:------------+--------+--------+--------------+--------------+--------------:
| TableN     | owner  | string | NULL         | NULL         | 'Joe'        |
'------------'--------'--------'--------------'--------------'--------------'
I think that executing many queries like
SELECT MAX(column_name) as max_value, MIN(column_name) as min_value
FROM table_name
will be inefficient on the huge tables stored in Hadoop.
After reading the documentation I found an article about "Statistics in Hive".
It seems I must use a request like this:
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;
But this command ended with an error:
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.ColumnStatsTask
Do I understand correctly that this request adds information to the description of the table and does not display the result? Will this request work with views?
Please suggest how to effectively and automatically create documentation for a database in Hive.
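For illustration, a minimal sketch of the statistics route (table and column names are placeholders): once ANALYZE succeeds, DESCRIBE FORMATTED with a column name prints the stored min/max/null counts, so a loop over the tables could collect them:
for t in table1 table2; do   # placeholder table names
    hive -e "ANALYZE TABLE $t COMPUTE STATISTICS FOR COLUMNS;"
    hive -e "DESCRIBE FORMATTED $t day;"   # 'day' is a placeholder column; output includes min/max once stats exist
done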

Dropping hive partition based on certain condition in runtime

I have a table in hive built using the following command:
create table t1 (x int, y int, s string) partitioned by (wk int) stored as sequencefile;
The table has the data below:
select * from t1;
+-------+-------+-------+--------+--+
| t1.x  | t1.y  | t1.s  | t1.wk  |
+-------+-------+-------+--------+--+
| 1     | 2     | abc   | 10     |
| 4     | 5     | xyz   | 11     |
| 7     | 8     | pqr   | 12     |
+-------+-------+-------+--------+--+
Now the requirement is to drop the oldest partition when the partition count is >= 2.
Can this be handled in HQL or through a shell script, and how?
Consider that I will be using the database name as a variable, like hive -e 'use "$dbname"; show partitions t1'.
If your partitions are ordered by date, you could write a shell script that uses hive -e 'SHOW PARTITIONS t1' to get all partitions. In your example it will return:
wk=10
wk=11
wk=12
Then you can issue hive -e 'ALTER TABLE t1 DROP PARTITION (wk=10)' to remove the first partition.
So something like:
db=mydb
if (( $(hive -e "use $db; SHOW PARTITIONS t1" | grep wk | wc -l) < 2 )); then
    exit
fi
partition=$(hive -e "use $db; SHOW PARTITIONS t1" | grep wk | head -1)
hive -e "use $db; ALTER TABLE t1 DROP PARTITION ($partition)"

vsql/vertica, how to copy text input file's name into destination table

I have to copy an input text file (text_file.txt) into a table (table_a). I also need to include the input file's name in the table.
My code is:
\set t_pwd `pwd`
\set input_file '\'':t_pwd'/text_file.txt\''
copy table_a
( column1
,column2
,column3
,FileName :input_file
)
from :input_file
The last line does not copy the input text file's name into the table.
How can I copy the input text file's name into the table (without manually typing the file name)?
Solution 1
This might not be the perfect solution for your job, but I think it will do the job:
You can get the file name, store it in a TBL variable, and then append that variable to the end of each line in the CSV file that you are about to load into Vertica.
Depending on your CSV file size this can be quite time- and CPU-consuming.
export TBL=$(ls -1 *.txt)
sed -i -e "s/$/|$TBL/" "$TBL"
Example:
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10
[dbadmin@bih001 ~]$ export TBL=$(ls -1 | grep load); sed -i -e "s/$/|$TBL/" "$TBL"
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10|load_data1
Solution 2
You can use a DEFAULT CONSTRAINT; see the example below:
1. Create your table with a DEFAULT CONSTRAINT
[dbadmin@bih001 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> create table TBL (id int ,CSV_FILE_NAME varchar(200) default 'TBL');
CREATE TABLE
dbadmin=> \dt
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
See the DEFAULT CONSTRAINT: it has the 'TBL' default value
dbadmin=> \d TBL
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+---------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | 'TBL' | f | f |
(2 rows)
2. Now set up your COPY variables
- insert some data and alter the DEFAULT CONSTRAINT value to your current :input_file value.
dbadmin=> \set t_pwd `pwd`
dbadmin=> \set CSV_FILE `ls -1 | grep load*`
dbadmin=> \set input_file '\'':t_pwd'/':CSV_FILE'\''
dbadmin=>
dbadmin=>
dbadmin=> insert into TBL values(1);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+---------------
1 | TBL
(1 row)
dbadmin=> ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
ALTER TABLE
dbadmin=> \dt TBL;
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
dbadmin=> \d TBL;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+----------------------------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | '/home/dbadmin/load_data1' | f | f |
(2 rows)
dbadmin=> insert into TBL values(2);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+--------------------------
1 | TBL
2 | /home/dbadmin/load_data1
(2 rows)
Now you can implement this in your copy script.
Example:
\set t_pwd `pwd`
\set CSV_FILE `ls -1 | grep load*`
\set input_file '\'':t_pwd'/':CSV_FILE'\''
ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
copy TBL from :input_file DELIMITER '|' DIRECT;
Solution 3
Use the LOAD_STREAMS table
Example:
When loading a table, give it a stream name - this way you can identify the file name / stream name:
COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';
Here is how you can query your load_streams table:
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, unsorted_row_count, sorted_row_count,
sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
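So, to make each load identifiable by its file, you could pass the file name in as the stream name from the shell and look it up afterwards (the path below is a placeholder):
FILE=text_file.txt
vsql -c "COPY table_a FROM '/path/to/$FILE' DELIMITER '|' DIRECT STREAM NAME '$FILE';"
vsql -c "SELECT stream_name, table_name, accepted_row_count FROM load_streams WHERE stream_name = '$FILE';"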
Makes sense? Hope this helped!
If you do not need to do it purely from inside vsql, it might be possible to cheat a bit and move the logic outside Vertica, into bash for example:
FILE=text_file.txt
(
    while IFS= read -r LINE; do
        echo "$LINE|$FILE"
    done < "$FILE"
) | vsql -c 'copy table_a (...) FROM STDIN'
That way you basically COPY FROM STDIN, adding the filename to each line before it even reaches Vertica.
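For large files the read loop can be replaced with awk, which does the same thing in one pass (the column list in the COPY stays elided, as in the snippet above):
FILE=text_file.txt
awk -v fname="$FILE" '{ print $0 "|" fname }' "$FILE" | vsql -c 'copy table_a (...) FROM STDIN'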
