Hive: Conditionally truncate and load the table - hadoop

I am trying to solve the following problem: if all categories of the source table are available in the target table, then truncate and load the target table; otherwise do nothing.
I haven't found a solution using Hive alone and ended up using a shell script as well.
Is it possible to avoid the shell script?
Current Approach:
create_ind_table.hql:
create temporary table temp.master_source_join
as select case when source.program_type_cd=master.program_type_cd then 1 else 0 end as IND
from source left join master
on source.program_type_cd=master.program_type_cd;
--if all the categories from source are present in master, this will contain 1, else 0
drop table if exists temp.indicator;
create table temp.indicator
as select min(ind)*max(ind) as ind from temp.master_source_join;
And the following is the script I call to truncate and load the master table if all the source table categories are present in master.
truncate_load_master.sh
beeline_cmd="beeline -u 'jdbc:hive2://abc.com:2181,abc1.com:2181,abc2.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' --showHeader=false --silent=true"
${beeline_cmd} -f create_ind_table.hql
## if the indicator is 1, all the source categories are present in master; otherwise they are not.
a=`${beeline_cmd} -e "select ind from temp.indicator;"`
temp=`echo $a | sed -e 's/-//g' -e 's/+//g' -e 's/|//g'`
echo $temp
if [ "${temp}" -eq 1 ]
then
echo "truncate and load the target table"
${beeline_cmd} -e "insert overwrite table temp.master select * from temp.source;"
else
echo "nothing to load"
fi

A query with dynamic partitioning will overwrite only those partitions that exist in the source dataset. Add a dummy partition to your table, as in this answer: https://stackoverflow.com/a/47505850/2700344
You can calculate your flag using analytic min() in the same subquery and filter by it.
The calculated IND will be the same for all rows returned, and analytic min() alone is enough; there is no need to calculate max(). Filter by IND=1: the query will return no rows if min() over() = 0 and therefore will not overwrite the table.
--enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table temp.master PARTITION(dummy_part)
select s.col1, s.col2, --list here all the columns you need to insert
       'dummy_value' as dummy_part --dummy partition column
from
(
  select s.*,
         min(case when s.program_type_cd=m.program_type_cd then 1 else 0 end) over() as IND
  from source s left join master m
    on s.program_type_cd=m.program_type_cd
) s where IND=1 --the filter returns no rows if min=0
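For this to work, temp.master itself must be partitioned by the dummy column. A minimal DDL sketch, using the placeholder column names from the query above (adjust names and types to the real schema):
--hypothetical schema: master rebuilt with a dummy partition column
create table temp.master (
  col1 string, --placeholder columns
  col2 string,
  program_type_cd string
)
partitioned by (dummy_part string);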

Related

How to use IF else statement outside hive select?

I want to execute a select statement based on a region check. If the region value is HK, the table should be created from temp.temp1; otherwise it has to be created from temp.temp2.
eg:
beeline -e "
if [ '$REGION' == 'HK' ]
then
Create table region as Select * from temp.temp1;
else
Create table region as Select * from temp.temp2;
fi
"
Is there any possible way to do it?
Hive itself does not support if-else statements; there is the HPL/SQL procedural extension that may be useful in your case.
Though I suggest a slightly different approach: if the $REGION variable comes from outside of beeline and the tables' schemas match, you can union the results with corresponding where conditions:
create table region as
select *
from temp.temp1
where '$REGION' == 'HK'
union all
select *
from temp.temp2
where '$REGION' != 'HK'
Hive will build the execution plan and get rid of one of the union parts, so it won't affect the real execution time.
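For instance, when $REGION is set in the calling shell, the substitution happens before beeline ever sees the query. A minimal sketch (the JDBC URL is a placeholder, not from the original post):
REGION='HK'
# shell expands $REGION inside the double-quoted -e string
beeline -u "jdbc:hive2://your-hiveserver:10000/" -e "
create table region as
select * from temp.temp1 where '$REGION' == 'HK'
union all
select * from temp.temp2 where '$REGION' != 'HK';"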
Yes, Hive itself doesn't support if-else statements. What I have implemented for now is:
if [ '$REGION' == 'HK' ]
then
beeline -e " Select * from temp.temp1; "
else
beeline -e " Select * from temp.temp2;"
fi
"
I know this is repetitive, but for now this is what we have implemented to execute queries for different regions/blocks.
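One way to avoid the duplicated beeline calls is to resolve the table name in the shell first; a minimal sketch of the same logic:
# pick the source table based on the region, then issue a single beeline call
if [ "$REGION" == 'HK' ]
then
  src=temp.temp1
else
  src=temp.temp2
fi
beeline -e "Select * from ${src};"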

Bash command to backfill Hive table—running multiple Hive commands with changing date variables

Trying to figure out a way to backfill partitions of a ds-partitioned Hive table.
I know how to run a Hive command from CLI, e.g.
$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
What I would like to do is provide a .txt file of different ds values and have a new job run for each of them, e.g.
$HIVE_HOME/bin/hive -e 'INSERT OVERWRITE PARTITION ds = $DS_VARIABLE_HERE
select a.col from tab1 a where ds = $DS_VARIABLE_HERE'
But I'm not so sure how to do this
I'm thinking of trying out
cat date_file.txt | hive -e 'query here'
But I'm not sure how to get the values from date_file.txt into the Hive query string.
My suggestion is to use a shell loop to iterate through the values:
Option 1:
If you have a fixed set of values you want to iterate through, then:
DS_VARIABLE_HERE=('val1' 'val2' 'val3')
for ((i=0;i<${#DS_VARIABLE_HERE[@]};i++))
do
$HIVE_HOME/bin/hive -e "INSERT OVERWRITE PARTITION ds = ${DS_VARIABLE_HERE[$i]} select a.col from tab1 a where ds = ${DS_VARIABLE_HERE[$i]}"
done
Option 2:
If you want to iterate through, let's say, 1 to 10:
for ((i=1;i<=10;i++))
do
$HIVE_HOME/bin/hive -e "INSERT OVERWRITE PARTITION ds = ${i} select a.col from tab1 a where ds = ${i}"
done
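Option 3:
And to drive the loop from the date_file.txt mentioned in the question, a while-read loop is a minimal sketch (assuming the file holds one ds value per line):
# run one Hive job per ds value listed in date_file.txt
while read -r DS
do
$HIVE_HOME/bin/hive -e "INSERT OVERWRITE PARTITION ds = ${DS} select a.col from tab1 a where ds = ${DS}"
done < date_file.txt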

Incremental data load to hive table

I am trying to load incremental data from one Hive external table into another Hive table. I have a date timestamp field on the source table to identify the rows newly added to it on a daily basis. My task is to extract the rows that are newly added to the source and insert them into the target table.
I am using Hive 0.14.
I tried the queries below but could not make them work.
INSERT INTO TABLE TARGET PARTITION (FIELD_DATE)
SELECT A.FIELD1, A.FIELD2, A.FIELD3,
CASE WHEN LENGTH(A.FIELD4)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(A.FIELD5)=0 THEN 0 ELSE 1 END
FROM SOURCE A, (Select max(FIELD_TIMESTAMP) from TARGET) T
where A.FIELD_TIMESTAMP > T.FIELD_TIMESTAMP;
The above query runs for hours without giving any result.
I also tried to execute the query below, and later found that Hive does not support subqueries in the WHERE clause (I got a ParseException).
INSERT INTO TABLE TARGET PARTITION (FIELD_DATE)
SELECT A.FIELD1, A.FIELD2, A.FIELD3,
CASE WHEN LENGTH(A.FIELD4)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(A.FIELD5)=0 THEN 0 ELSE 1 END
FROM SOURCE A, TARGET T
where A.FIELD_TIMESTAMP > (Select max(FIELD_TIMESTAMP) from TARGET);
Please help me out in selecting only the rows that have been added after my initial load.
Thank you.
Try this:
INSERT INTO TABLE TARGET PARTITION (FIELD_DATE)
SELECT A.FIELD1, A.FIELD2, A.FIELD3,
CASE WHEN LENGTH(A.FIELD4)=0 THEN 0 ELSE 1 END,
CASE WHEN LENGTH(A.FIELD5)=0 THEN 0 ELSE 1 END,
A.FIELD_DATE --the dynamic partition column must be selected last
FROM SOURCE A JOIN
(Select max(FIELD_TIMESTAMP) as FIELD_TIMESTAMP from TARGET) T
on 1=1
where A.FIELD_TIMESTAMP > T.FIELD_TIMESTAMP;
This is the tested query for your reference:
insert into orders_target
select o.* from orders_source o join
(select max(o1.order_date) order_date from orders_target o1) o2
on 1=1
where o.order_date > o2.order_date;
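One more note: because the first insert uses a dynamic partition (FIELD_DATE), dynamic partitioning must be enabled in the session, using the same settings as in the first answer on this page:
--enable dynamic partitioning before INSERT ... PARTITION (FIELD_DATE)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;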

Oracle SQL Developer verbose output

I'm currently using Oracle SQL Developer 4.0.1.14, doing something like this:
UPDATE mytable SET myfield = 'myvalue' WHERE myotherfield = 234;
INSERT INTO mytable (myfield , myotherfield) VALUES ('abc', 123);
INSERT INTO mytable (myfield , myotherfield) VALUES ('abd', 124);
...
... and many many more lines.
If one runs the whole script, one only sees output like this:
8 rows updated.
1 rows inserted.
1 rows inserted.
...
In a small script this is not a problem, since you can easily see which line caused which output. But in a script with 1k lines, finding a "0 rows updated." and the command that caused it ends in frustration.
So I would rather have output like this:
> UPDATE mytable SET myfield = 'myvalue' WHERE myotherfield = 7393;
8 rows updated.
> INSERT INTO mytable (myfield , myotherfield) VALUES ('abc', 123);
1 rows inserted.
> INSERT INTO mytable (myfield , myotherfield) VALUES ('abd', 124);
1 rows inserted.
...
I know this is possible in SQL*Plus when running a script in verbose mode. But this should also be possible in SQL Developer, shouldn't it?
I would highly appreciate it if someone could tell me how.
Thanks a lot!
Try set echo on (a SQL*Plus command which works in SQL Developer too), e.g.
create table t(a varchar2(100));
set echo on
update t set a = 0;
update t set a = 0;
update t set a = 0;
The output (after F5, Run Script) is:
table T created.
> update t set a = 0
0 rows updated.
> update t set a = 0
0 rows updated.
> update t set a = 0
0 rows updated.

Shell script to insert multiple records into a database

I have a table in an Informix DB into which I want to insert multiple records at a time.
Data for one of the columns should be unique; the other columns' data may be the same for all the records I insert.
A typical insert statement I use to insert one row:
insert into employee(empid, country, state) values(1, 'us', 'ca')
Now I want to pass different values for the 'empid' column, and the data in the rest of the columns can remain the same.
I am looking for something like looping over empid and prompting the user to enter the start and end range of values for empid.
When the user enters a start value of 1 and an end value of 100, the script should insert 100 records with empid's from 1 to 100.
Please notice that you need to pass the start and end parameters as input arguments to your script:
#!/bin/bash
start=$1
end=$2
while [[ $start -le $end ]]
do
echo "insert into employee(empid, country, state) values(${start}, 'us', 'ca');"
start=`expr $start + 1`
done
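Note that the script only echoes the statements; to actually run them, pipe its output into dbaccess (the script and database names here are placeholders):
sh insert_employee.sh 1 100 | dbaccess mydatabasename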
This solution uses a stored procedure to wrap the insert statement in a loop. Probably more suitable for a large amount of data where performance is critical:
File InsertEmployee.sql
drop procedure InsertEmployee;
create procedure InsertEmployee(
p_start_empid like employee.empid,
p_end_empid like employee.empid,
p_country like employee.country,
p_state like employee.state
) returning char(255);
define i integer;
let i = p_start_empid;
while i <= p_end_empid
insert into employee (
empid,
country,
state
) values (
i,
p_country,
p_state
);
-- logging
-- if (DBINFO('sqlca.sqlerrd2') > 0) then
-- return "inserted empid=" || i || " country=" || p_country || " state=" || p_state with resume;
-- end if
let i = i + 1;
end while;
end procedure;
Load stored procedure into your database:
dbaccess mydatabasename InsertEmployee.sql
Call the stored procedure from the shell prompt:
echo 'execute procedure InsertEmployee(1,100,"us","ca");' | dbaccess mydatabasename
#!/bin/bash
declare -a namesArray=("name1" "name2" "name3")
inserts=""
for i in "${namesArray[@]}"
do
inserts+="INSERT INTO persons(id, name) VALUES (0,'$i');"
done
echo $inserts | dbaccess yourDataBase
That will insert 3 rows (I assume your primary key is a serial column; that's why 0 is passed in the values field). In Informix you cannot add multiple rows in the same insert, which is why I create one insert per row.
Informix: INSERT INTO table VALUES (0);
MySQL & SQL Server: INSERT INTO table VALUES (0),(1),(2); <- 3 rows
