Dropping hive partition based on certain condition in runtime - shell

I have a table in hive built using the following command:
create table t1 (x int, y int, s string) partitioned by (wk int) stored as sequencefile;
The table has the data below:
select * from t1;
+-------+-------+-------+--------+--+
| t1.x | t1.y | t1.s | t1.wk |
+-------+-------+-------+--------+--+
| 1 | 2 | abc | 10 |
| 4 | 5 | xyz | 11 |
| 7 | 8 | pqr | 12 |
+-------+-------+-------+--------+--+
The requirement is to drop the oldest partition whenever the partition count is >= 2.
Can this be handled in HQL or through a shell script, and if so, how?
Note that the database name will be a variable, as in hive -e "use $dbname; show partitions t1".

If your partitions sort in chronological order, you could write a shell script that uses hive -e 'SHOW PARTITIONS t1' to list all partitions. In your example, it will return:
wk=10
wk=11
wk=12
Then you can issue hive -e 'ALTER TABLE t1 DROP PARTITION (wk=10)' to remove the oldest partition.
So something like:
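db=mydb
# List the partitions once; each line looks like wk=10
partitions=$(hive -e "use $db; SHOW PARTITIONS t1" | grep '^wk=')
# Stop if there are fewer than 2 partitions
if (( $(printf '%s' "$partitions" | grep -c '^wk=') < 2 )); then
    exit 0
fi
# Sort numerically on the value so that wk=9 counts as older than wk=10
partition=$(printf '%s\n' "$partitions" | sort -t= -k2,2n | head -n 1)
hive -e "use $db; ALTER TABLE t1 DROP PARTITION ($partition)"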
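Before running the DROP for real, it can help to dry-run the selection logic against a sample listing. The sketch below is a hypothetical helper (drop_oldest_stmt is not part of Hive) that reads SHOW PARTITIONS output on stdin and only prints the statement it would run. It assumes every partition line has the form wk=<number>, and the table name and threshold are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: print the DROP statement for the oldest partition,
# or print nothing when fewer than min_count partitions are listed.
# Input: SHOW PARTITIONS output on stdin, one "wk=<number>" per line.
drop_oldest_stmt() {
    local table=$1 min_count=${2:-2}
    local parts count oldest
    # Keep only well-formed lines and sort numerically on the value after '='
    parts=$(grep -E '^wk=[0-9]+$' | sort -t= -k2,2n)
    count=$(printf '%s' "$parts" | grep -c '^wk=')
    if (( count < min_count )); then
        return 0
    fi
    oldest=$(printf '%s\n' "$parts" | head -n 1)
    printf 'ALTER TABLE %s DROP PARTITION (%s);\n' "$table" "$oldest"
}

# Dry run against a sample listing (no Hive needed)
printf 'wk=10\nwk=9\nwk=12\n' | drop_oldest_stmt t1
# prints: ALTER TABLE t1 DROP PARTITION (wk=9);
```

To use it against Hive, pipe in the real listing, for example hive -S -e "use $db; SHOW PARTITIONS t1" | drop_oldest_stmt t1, review the output, and then pass it to hive -e.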

Related

hive table shows 0 results while querying

My Hive table is a managed table, and I can see its files in HDFS.
However, querying it through Hive does not return any rows.
hive> describe formatted emp
Result -
| Table Type: | MANAGED_TABLE
| Table Parameters: | NULL
| 2 | bucketing_version
| 1376 | numFiles
| 43 | numPartitions
| 0 | numRows
| gzip | parquet.compression
| 0 | rawDataSize
| 4770821594 | totalSize
| true | transactional
| insert_only | transactional_properties
| 1612857428 | transient_lastDdlTime
While selecting data from table -
select * from emp;
it fetches no results.
Why is there a difference between what is in HDFS and the select output?
This command worked for me:
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS;

How do we get the 1000 tables description using hive?

I have 1000 tables and need to run describe <table name>; on each of them. Instead of running them one by one, can you please give me one command that fetches the descriptions of "N" tables in a single shot?
You can write a shell script and call it with a parameter. For example, the following script receives a schema name, builds the list of tables in that schema, calls DESCRIBE EXTENDED on each one, extracts the location, and prints it for the first 1000 tables in the schema, ordered by name. You can modify it and use it as a single command:
#!/bin/bash
# Create the table list for a schema (script parameter)
HIVE_SCHEMA=$1
echo "Processing Hive schema $HIVE_SCHEMA..."
tablelist="tables_$HIVE_SCHEMA"
hive -e "set hive.cli.print.header=false; use $HIVE_SCHEMA; show tables;" > "$tablelist"
# Maximum number of tables to process
tableNum_limit=1000
# For each table, in name order:
for table in $(sort "$tablelist" | head -n "$tableNum_limit")
do
    echo "Processing table $table ..."
    # Call DESCRIBE
    out=$(hive -S -e "use $HIVE_SCHEMA; DESCRIBE EXTENDED $table")
    # Extract the location, for example
    table_location=$(echo "$out" | grep -Eo 'location:[^,]+' | sed 's/location://')
    echo "Table location: $table_location"
    # Do something else here
done
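Each hive call in the loop above starts a new client, which can be slow for 1000 tables. One possible variation, sketched here under the assumption that a single batch run is acceptable, is to generate one HQL file and run it once. make_describe_script is a hypothetical helper name; it only produces text and does not call Hive itself.

```shell
#!/bin/bash
# Hypothetical helper: emit one HQL script that describes every table
# named on stdin (one table name per line), sorted by name and capped
# at "limit" tables. Nothing in this function calls Hive.
make_describe_script() {
    local schema=$1 limit=${2:-1000}
    printf 'use %s;\n' "$schema"
    # Drop blank lines, sort by name, keep the first "limit" tables
    grep -v '^[[:space:]]*$' | sort | head -n "$limit" |
        while IFS= read -r table; do
            printf 'DESCRIBE EXTENDED %s;\n' "$table"
        done
}

# Example with a hard-coded table list instead of real "show tables" output
printf 'orders\ncustomers\n' | make_describe_script my_db_1 1000
```

With the table list file from the script above, this could be used as make_describe_script "$HIVE_SCHEMA" 1000 < "$tablelist" > describe_all.hql followed by hive -S -f describe_all.hql, so that Hive starts only once.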
Alternatively, query the metastore directly.
Demo
Hive
create database my_db_1;
create database my_db_2;
create database my_db_3;
create table my_db_1.my_tbl_1 (i int);
create table my_db_2.my_tbl_2 (c1 string,c2 date,c3 decimal(12,2));
create table my_db_3.my_tbl_3 (x array<int>,y struct<i:int,j:int,k:int>);
MySQL (Metastore)
use metastore;
select d.name as db_name
,t.tbl_name
,c.integer_idx + 1 as col_position
,c.column_name
,c.type_name
from DBS as d
join TBLS as t
on t.db_id = d.db_id
join SDS as s
on s.sd_id = t.sd_id
join COLUMNS_V2 as c
on c.cd_id = s.cd_id
where d.name like 'my\_db\_%'
order by d.name
,t.tbl_name
,c.integer_idx
;
+---------+----------+--------------+-------------+---------------------------+
| db_name | tbl_name | col_position | column_name | type_name |
+---------+----------+--------------+-------------+---------------------------+
| my_db_1 | my_tbl_1 | 1 | i | int |
| my_db_2 | my_tbl_2 | 1 | c1 | string |
| my_db_2 | my_tbl_2 | 2 | c2 | date |
| my_db_2 | my_tbl_2 | 3 | c3 | decimal(12,2) |
| my_db_3 | my_tbl_3 | 1 | x | array<int> |
| my_db_3 | my_tbl_3 | 2 | y | struct<i:int,j:int,k:int> |
+---------+----------+--------------+-------------+---------------------------+

How to create Hive table with user specified number of records?

Is it possible to create a hive table with user-specified number of records?
For example, I want to create a table with x number of rows (where x is defined by the user). The table would have two columns 1. unique row id [could be auto-incremented] 2. Randomly generated String.
Is this possible using Hive?
set N=7;
select pe.i+1 as n
,java_method ('org.apache.commons.lang.RandomStringUtils','randomAlphabetic',10) as str
from (select 1) x
lateral view posexplode(split(space(${hiveconf:N}-1),' ')) pe as i,x
;
+---+------------+
| n | str |
+---+------------+
| 1 | udttBCmtxT |
| 2 | kkrMQmirSG |
| 3 | iYDABgXOvW |
| 4 | DKHKgtXKPS |
| 5 | ylebKcdcGj |
| 6 | DaujBCkCtz |
| 7 | VMaWfbtzFY |
+---+------------+
Functions used: posexplode, java_method, RandomStringUtils
Specifying a limit on the number of rows when creating the table may not be possible, but you can limit the number of rows inserted into the table by using a LIMIT clause:
-- <filename:dbloader.sql>
create table ${hiveconf:TABLENAME} (id int, string1 string);
insert into ${hiveconf:TABLENAME}
select id, string1 from oldtable limit ${hiveconf:ROWLIMIT};
and when submitting the Hive script:
hive --hiveconf TABLENAME='XYZ' --hiveconf ROWLIMIT=1000 -f dbloader.sql
As for creating a unique incremental id, you will have to write a UDF for it.
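If generating the rows outside Hive is acceptable, another route is to build a delimited file with the requested number of rows and load it. The sketch below is not from the answers above: gen_rows is a hypothetical helper that uses awk to print N tab-separated rows, each with a sequential id and a random 10-letter string. The randomness comes from awk's rand(), which is fine for test data but not for anything security-related.

```shell
#!/bin/bash
# Hypothetical helper: print n rows of "<id><TAB><10 random letters>".
# Ids run from 1 to n; strings use awk's rand(), seeded from the clock.
gen_rows() {
    local n=$1
    awk -v n="$n" 'BEGIN {
        srand()
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        for (i = 1; i <= n; i++) {
            s = ""
            # Pick 10 letters, each at a random position 1..52
            for (j = 0; j < 10; j++)
                s = s substr(letters, int(rand() * 52) + 1, 1)
            printf "%d\t%s\n", i, s
        }
    }'
}

# Example: 7 rows, as in the query above
gen_rows 7
```

The output could be written to a file with gen_rows 1000 > rows.tsv and then loaded into a tab-delimited text table with LOAD DATA LOCAL INPATH; the file name and table layout here are assumptions.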

vsql/vertica, how to copy text input file's name into destination table

I have to copy an input text file (text_file.txt) into a table (table_a). I also need to include the input file's name in the table.
My code is:
\set t_pwd `pwd`
\set input_file '\'':t_pwd'/text_file.txt\''
copy table_a
( column1
,column2
,column3
,FileName :input_file
)
from :input_file
The last column entry (FileName) does not copy the input text file's name into the table.
How can I copy the input text file's name into the table (without manually typing the file name)?
Solution 1
This might not be the perfect solution, but I think it will do the job:
You can get the file name, store it in a TBL variable, and then append that variable to the end of each line of the CSV file you are about to load into Vertica.
Depending on the size of your CSV file, this can be quite time- and CPU-consuming.
The command below assumes the directory contains exactly one .txt file:
export TBL=$(ls -1 *.txt); sed -i -e 's/$/,'"$TBL"'/' "$TBL"
Example:
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10
[dbadmin@bih001 ~]$ export TBL=`ls -1 | grep load*` | sed -e 's/$/|'$TBL'/' -i $TBL
[dbadmin@bih001 ~]$ cat load_data1
1|2|3|4|5|6|7|8|9|10||load_data1
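Because sed -i rewrites the source file, running the command twice appends the name twice. A non-destructive variant, sketched here with a hypothetical helper named append_filename, writes the extended lines to stdout and leaves the file untouched. It assumes a single-character delimiter (default |) and that only the base name of the file is wanted.

```shell
#!/bin/bash
# Hypothetical helper: print each line of a file with the delimiter and
# the file's base name appended. The input file itself is not modified.
append_filename() {
    local file=$1 delim=${2:-|}
    local name
    name=$(basename "$file")
    # Pass values through the environment so awk does not interpret
    # backslashes in the file name as escape sequences
    FNAME="$name" DELIM="$delim" \
        awk '{ print $0 ENVIRON["DELIM"] ENVIRON["FNAME"] }' "$file"
}

# Example with a throwaway sample file
sample=$(mktemp)
printf '1|2|3\n4|5|6\n' > "$sample"
append_filename "$sample"
rm -f "$sample"
```

The output can be redirected to a new file for loading, or piped straight into a COPY ... FROM STDIN.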
Solution 2
You can use a DEFAULT constraint. See the example:
1. Create your table with a DEFAULT constraint
[dbadmin@bih001 ~]$ vsql
Password:
Welcome to vsql, the Vertica Analytic Database interactive terminal.
Type: \h or \? for help with vsql commands
\g or terminate with semicolon to execute query
\q to quit
dbadmin=> create table TBL (id int ,CSV_FILE_NAME varchar(200) default 'TBL');
CREATE TABLE
dbadmin=> \dt
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
Note that the DEFAULT constraint has the default value 'TBL':
dbadmin=> \d TBL
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+---------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | 'TBL' | f | f |
(2 rows)
2. Now set up your COPY variables
- Insert some data, then alter the DEFAULT constraint value to your current :input_file value.
dbadmin=> \set t_pwd `pwd`
dbadmin=> \set CSV_FILE `ls -1 | grep load*`
dbadmin=> \set input_file '\'':t_pwd'/':CSV_FILE'\''
dbadmin=>
dbadmin=>
dbadmin=> insert into TBL values(1);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+---------------
1 | TBL
(1 row)
dbadmin=> ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
ALTER TABLE
dbadmin=> \dt TBL;
List of tables
Schema | Name | Kind | Owner | Comment
--------+------+-------+---------+---------
public | TBL | table | dbadmin |
(1 row)
dbadmin=> \d TBL;
List of Fields by Tables
Schema | Table | Column | Type | Size | Default | Not Null | Primary Key | Foreign Key
--------+-------+---------------+--------------+------+----------------------------+----------+-------------+-------------
public | TBL | id | int | 8 | | f | f |
public | TBL | CSV_FILE_NAME | varchar(200) | 200 | '/home/dbadmin/load_data1' | f | f |
(2 rows)
dbadmin=> insert into TBL values(2);
OUTPUT
--------
1
(1 row)
dbadmin=> select * from TBL;
id | CSV_FILE_NAME
----+--------------------------
1 | TBL
2 | /home/dbadmin/load_data1
(2 rows)
Now you can implement this in your copy script.
Example:
\set t_pwd `pwd`
\set CSV_FILE `ls -1 | grep load*`
\set input_file '\'':t_pwd'/':CSV_FILE'\''
ALTER TABLE TBL ALTER COLUMN CSV_FILE_NAME SET DEFAULT :input_file;
copy TBL from :input_file DELIMITER '|' DIRECT;
Solution 3
Use the LOAD_STREAMS table
Example:
When loading a table, give the load a stream name; this way you can identify the file name / stream name:
COPY mytable FROM myfile DELIMITER '|' DIRECT STREAM NAME 'My stream name';
Here is how you can query the load_streams table:
=> SELECT stream_name, table_name, load_start, accepted_row_count,
rejected_row_count, read_bytes, unsorted_row_count, sorted_row_count,
sort_complete_percent FROM load_streams;
-[ RECORD 1 ]----------+---------------------------
stream_name | fact-13
table_name | fact
load_start | 2010-12-28 15:07:41.132053
accepted_row_count | 900
rejected_row_count | 100
read_bytes | 11975
input_file_size_bytes | 0
parse_complete_percent | 0
unsorted_row_count | 3600
sorted_row_count | 3600
sort_complete_percent | 100
Makes sense? Hope this helped!
If you do not need to do it purely from inside vsql, it might be possible to cheat a bit and move the logic outside Vertica, into bash for example:
FILE=text_file.txt
(
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r LINE; do
printf '%s|%s\n' "$LINE" "$FILE"
done < "$FILE"
) | vsql -c 'copy table_a (...) FROM STDIN'
That way you basically COPY FROM STDIN, adding the filename to each line before it even reaches Vertica.
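Another option that may be worth checking against the documentation for your Vertica version: the COPY column list can take computed columns of the form column AS expression, which would let the file name be supplied as a literal. The sketch below only builds the statement text in bash. build_copy_stmt is a hypothetical helper, the column names are taken from the question, and the AS form is an assumption to verify before use.

```shell
#!/bin/bash
# Hypothetical helper: print a COPY statement that fills the FileName
# column from a literal (assumes COPY supports "column AS expression").
# The path must not contain single quotes, since it is quoted as-is.
build_copy_stmt() {
    local table=$1 file=$2
    local path
    # Resolve the file to an absolute path
    path="$(cd "$(dirname "$file")" && pwd)/$(basename "$file")"
    printf "COPY %s (column1, column2, column3, FileName AS '%s')\n" "$table" "$path"
    printf "FROM '%s' DELIMITER '|';\n" "$path"
}

# Example: print the statement for a file in the current directory
build_copy_stmt table_a ./text_file.txt
```

The generated text could then be passed to vsql, for example vsql -c "$(build_copy_stmt table_a text_file.txt)".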

Oracle explain plan over simple select performs multiple hash joins when multiple columns are indexed in a table

I am currently running into an issue with my Oracle instance. I have two simple select statements:
select * from dog_vets
and
select * from dog_statuses
and the following fiddle
My explain plan on dog_vets is as follows:
0 | Select Statement
1 | Table Access Full Scan dog_vets
My explain plan on dog_statuses is as follows:
ID|Operation | Name | Rows |Bytes | cost | time
0 | Select Statement | | 20G | 500M | 100000 | 999:99:17
1 | View | index%_join_001 | 20G | 500M | 100000 | 999:99:17
2 | Hash Join | | | | |
3 | Hash Join | | | | |
4 | Index fast full scan dog_statuses_check_up | | 20G | 500M | 100000 | 32:15:00
5 | Index fast full scan dog_statuses_sick| | 20G | 500M | 100000 | 35:19:00
To get this type of output execute the following statement:
explain plan for
select * from dog_vets;
OR
explain plan for
select * from dog_statuses;
and then
select * from table(dbms_xplan.display);
Now my question is: why do multiple indexes result in a view (materialized, I assume) being created in the plan above, and what kind of performance hit am I taking on this type of query? As it stands, dog_vets has ~300 million records and dog_statuses has about 500 million. I have yet to get select * from dog_statuses to return in under 10 hours, primarily because the query dies before it completes.
DDL
In case sql fiddle dies:
create table dog_vets
(
name varchar2(50),
founded timestamp,
staff_count number
);
create table dog_statuses
(
check_up timestamp,
sick varchar2(1)
);
create index dog_vet_name
on dog_vets(name);
create index dog_status_check_up
on dog_statuses(check_up);
create index dog_status_sick
on dog_statuses(sick);
You could try telling the optimizer to ignore the indexes:
SELECT /*+NO_INDEX(dog_statuses)*/ *
FROM dog_statuses
