Bash command to backfill Hive table—running multiple Hive commands with changing date variables - bash

Trying to figure out a way to backfill partitions of a ds-partitioned Hive table.
I know how to run a Hive command from CLI, e.g.
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
What I would like to do is provide a .txt file of different ds values and have a new job run for each of them, e.g.
$HIVE_HOME/bin/hive -e 'INSERT OVERWRITE TABLE tab1 PARTITION (ds = $DS_VARIABLE_HERE)
select a.col from tab1 a where ds = $DS_VARIABLE_HERE'
But I'm not so sure how to do this
I'm thinking of trying out
cat date_file.txt | hive -e 'query here'
But I'm not sure how to place the variable from date_file.txt into the Hive query string.

My suggestion is to use a shell loop to iterate through the values:
Option 1:
If you have a fixed set of values you want to iterate through, then:
DS_VARIABLE_HERE=('val1' 'val2' 'val3')
for ((i=0;i<${#DS_VARIABLE_HERE[@]};i++))
do
$HIVE_HOME/bin/hive -e "INSERT OVERWRITE TABLE tab1 PARTITION (ds = '${DS_VARIABLE_HERE[$i]}') select a.col from tab1 a where ds = '${DS_VARIABLE_HERE[$i]}'"
done
Option 2:
If you want to iterate through, let's say, 1 to 10:
for ((i=1;i<=10;i++))
do
$HIVE_HOME/bin/hive -e "INSERT OVERWRITE TABLE tab1 PARTITION (ds = '${i}') select a.col from tab1 a where ds = '${i}'"
done
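To drive the loop from your date_file.txt instead (assuming one ds value per line), a minimal sketch reusing the placeholder query from the question:
while read -r DS
do
$HIVE_HOME/bin/hive -e "INSERT OVERWRITE TABLE tab1 PARTITION (ds = '${DS}') select a.col from tab1 a where ds = '${DS}'"
done < date_file.txt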

Related

How to use IF else statement outside hive select?

I want to execute a select statement based on a region check. If the region value is HK then the table should be created from temp.temp1; otherwise it should be created from temp.temp2.
eg:
**beeline -e "
if [ '$REGION' == 'HK' ]
then
Create table region as Select * from temp.temp1;
else
Create table region as Select * from temp.temp2;
fi**
"**
Is there any possible way to do it?
Hive itself does not support if-else statements; there's the HPL/SQL procedural extension that may be useful in your case.
I suggest a slightly different approach, though: if the $REGION variable comes from outside of beeline and the tables' schemas match, you can union the results with a corresponding where condition on each part:
create table region as
select *
from temp.temp1
where '$REGION' == 'HK'
union all
select *
from temp.temp2
where '$REGION' != 'HK'
Hive will build the execution plan and get rid of one of the union parts, so it won't affect the real execution time.
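If $REGION is set in the shell, one way to feed it into the query is beeline's --hivevar option. A minimal sketch, assuming the query is saved in a file (region.hql is a hypothetical name):
-- region.hql
create table region as
select * from temp.temp1 where '${hivevar:REGION}' == 'HK'
union all
select * from temp.temp2 where '${hivevar:REGION}' != 'HK';
Run it with:
beeline --hivevar REGION="$REGION" -f region.hql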
Yes, Hive itself doesn't support if-else statements. What I have implemented for now is:
if [ "$REGION" == 'HK' ]
then
beeline -e "Select * from temp.temp1;"
else
beeline -e "Select * from temp.temp2;"
fi
I know this is repetitive, but for now this is what we have implemented to execute queries for different regions/blocks.

Hive: Conditionally truncate and load the table

I am trying to resolve the issue where, if all categories of the source table are available in the target, the target table should be truncated and loaded; otherwise nothing should happen.
I haven't found any solution using Hive alone and ended up using a shell script as well to resolve this issue.
Is it possible to avoid the shell script?
Current Approach:
create_ind_table.hql:
create temporary table temp.master_source_join
as select case when source.program_type_cd=master.program_type_cd then 1 else 0 end as IND
from source left join master
on source.program_type_cd=master.program_type_cd;
-- IND will be 1 only if all the categories from source are present in master, else 0
drop table if exists temp.indicator;
create table temp.indicator
as select min(ind)*max(ind) as ind from temp.master_source_join;
And the following is the script I call to truncate and load the master table if all the source table categories are present in master.
truncate_load_master.sh
beeline_cmd="beeline -u 'jdbc:hive2://abc.com:2181,abc1.com:2181,abc2.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' --showHeader=false --silent=true"
${beeline_cmd} -f create_ind_table.hql
## if indicator is 1, all the source categories are present in master; else not.
a=`${beeline_cmd} -e "select ind from temp.indicator;"`
temp=`echo $a | sed -e 's/-//g' | sed -e 's/+//g' | sed -e 's/|//g'`
echo $temp
if [ ${temp} -eq 1 ]
then
echo "truncate and load the traget table"
${beeline_cmd} -e "insert overwrite table temp.master select * from temp.source;"
else
echo "nothing to load"
fi
Query with dynamic partitioning will overwrite only partitions existing in the source dataset. Add a dummy partition to your table, like in this answer: https://stackoverflow.com/a/47505850/2700344
You can calculate your flag using analytic min() in the same subquery and filter by it.
The calculated IND will be the same for all rows returned, and analytic min() alone is enough; there is no need to calculate max(). Filter by IND=1: the query returns no rows if min() over() = 0, and the table will not be overwritten.
--enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table temp.master PARTITION(dummy_part)
select s.col1, s.col2, --list all columns here you need to insert
'dummy_value' as dummy_part --dummy partition column
from
(
select s.*,
min(case when s.program_type_cd=m.program_type_cd then 1 else 0 end ) over() as IND
from source s left join master m
on s.program_type_cd=m.program_type_cd
)s where ind=1 --filter will not return rows if min=0
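For reference, a minimal sketch of the one-time setup the linked answer describes; the column list here is hypothetical and should match your real table:
-- hypothetical one-time DDL: recreate the target with a dummy partition column
create table temp.master (program_type_cd string, col1 string)
partitioned by (dummy_part string);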

How to extract last 7 days of rows from the max date in shell

I'm passing max(pay_date) from a Hive table to a shell variable Max_date.
The datatype of the pay_date field is Date.
I want to extract the rows from the last 7 days of pay_date, counting back from the Max_date in the table.
I used the script below:
#!/bin/bash
Max_date=$(hive -e "select max(pay_date) from dbname.tablename;")
hive -e "select pay_date from dbname.tablename where pay_date >= date_sub(\"$Max_date\",7);"
It's not giving me any output.
I'm stuck with passing a variable which has date value and use that in date_sub function for last 7 days of rows.
Please let me know if I'm missing some absolute basics.
You can run a query like this:
hive -e "select o1.order_date from orders o1 join (select max(order_date) order_date from orders) o2 on 1=1 where o1.order_date >= date_sub(o2.order_date, 7);"
You can also use this as part of a shell script:
max_date=$(hive -e "select max(order_date) from orders" 2>/dev/null)
hive -e "select order_date from orders where order_date >= date_sub('$max_date', 7);"

how to pass string(s) stored in a variable to sybase SQL IN clause from bash script

I have an array of empIds, which are of type String; now I want to pass it to the IN clause of SQL from bash.
I had tried below
#make an array of empIds
empIdsarray=(e123 e456 e675 e897)
for j in "${empIdsarray[@]}"
do
inclause=\"$j\",$inclause
done
#remove the trailing comma below
inclause=`echo $inclause|sed 's/,$//'`
isql -U$user -P$pwd -D$db -S$server <<QRY > tempRS.txt
select * from emp where empId IN ($inclause)
go
quit
go
QRY
I tried IN('$inclause') as well, but nothing works; the output is blank, although when I run the query in the DB directly it gives results.
Any help is appreciated.
# it should execute like
select * from emp where empId IN ("e123", "e456", "e675", "e897")
Thanks in advance.
In BASH you can do:
#!/bin/bash
#make an array of empIds
empIdsarray=(e123 e456 e675 e897)
printf -v inclause '"%s",' "${empIdsarray[@]}"
isql -U$user -P$pwd -D$db -S$server <<QRY > tempRS.txt
select * from emp where empId IN (${inclause%,})
go
quit
go
QRY
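Here printf -v applies the '"%s",' format once per array element and stores the concatenation in inclause, so for the sample array it expands to "e123","e456","e675","e897", and ${inclause%,} strips the trailing comma inside the heredoc.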

Get the sysdate -1 in Hive

Is there any way to get the current date minus 1 in Hive, i.e. always yesterday's date?
And in this format: 20120805?
I can run my query like this to get the data for yesterday's date, as today is Aug 6th:
select * from table1 where dt = '20120805';
But when I try to do it this way with the date_sub function (the table below is partitioned on the date (dt) column):
select * from table1 where dt = date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(),
'yyyyMMdd')) , 1) limit 10;
it looks for the data in all the partitions. Why? Am I doing something wrong in my query?
How can I make the evaluation happen up front so the whole table isn't scanned?
Try something like:
select * from table1
where dt >= from_unixtime(unix_timestamp()-1*60*60*24, 'yyyyMMdd');
This works if you don't mind that Hive scans the entire table. from_unixtime is not deterministic, so the query planner in Hive won't optimize it for you. In many cases (for example, log files), not specifying a deterministic partition key causes a very large Hadoop job to start, since it scans the whole table rather than just the rows with the given partition key.
If this matters to you, you can launch hive with an additional option
$ hive -hiveconf date_yesterday=20150331
And in the script or hive terminal use
select * from table1
where dt >= '${hiveconf:date_yesterday}';
The name of the variable doesn't matter, nor does how you compute the value; in this case you can set it with a Unix command to get the prior date. In the specific case of the OP:
$ hive -hiveconf date_yesterday=$(date --date yesterday "+%Y%m%d")
In MySQL:
select DATE_FORMAT(curdate() - interval 1 day, '%Y%m%d');
In SQL Server:
SELECT convert(varchar, getdate()-1, 112)
Use this query:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()-1*24*60*60,'yyyyMMdd');
It looks like DATE_SUB assumes date in format yyyy-MM-dd. So you might have to do some more format manipulation to get to your format. Try this:
select * from table1
where dt = FROM_UNIXTIME(
UNIX_TIMESTAMP(
DATE_SUB(
FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')
, 1)
, 'yyyy-MM-dd')
, 'yyyyMMdd') limit 10;
Use this:
select * from table1 where dt = date_format(concat(year(date_sub(current_timestamp,1)),'-', month(date_sub(current_timestamp,1)), '-', day(date_sub(current_timestamp,1))), 'yyyyMMdd') limit 10;
This will give a deterministic result (a string) of your partition.
I know it's super verbose.
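On Hive 1.2+, where current_date and date_format are available, a shorter equivalent (offered as a sketch) is:
select * from table1
where dt = date_format(date_sub(current_date, 1), 'yyyyMMdd')
limit 10;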
