Cannot define a dynamic parameter in a Hive query - hadoop

I'm trying to set up some views in Hive that take a date as a dynamic parameter. In my working below, I've moved the hiveconf variable into the SELECT clause so we can see what's going on, but the principle remains the same.
According to this and this, I should be able to include "${hiveconf:dateRangeFrom}" in my CREATE VIEW statement and then supply the hiveconf:dateRangeFrom variable at run time for maximum happiness, but this is just not happening. Hive appears to take whatever value is assigned to the variable when the view is created and hard-code it into the view definition, rather than substituting it at run time as you might expect.
I've got a workaround, whereby I supply a parameter to a SQL file that then creates all the views, substituting the desired value in, but this isn't sustainable.
All the working is below, so you can see how I came to this conclusion. Any ideas?
1) Supply a hiveconf value to a simple query
(needs to be a date for the ultimate query)
hive -e "Select ${hiveconf:dateRangeFrom} , unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');" --hiveconf dateRangeFrom='2014-01-01'
The date is returned as supplied and converted to a unix timestamp (e.g. "2014-01-01"=1388534400, "2014-09-12"=1410476400). The script can be run repeatedly, with the results changing according to the parameter.
2) Create a view that returns this data
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
This returns the error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hivevar' in select clause
Presumably because it is trying to do a replacement, but the ${hivevar:dateRangeFrom} variable has not been initialized at this point
According to:
Creating Views in Hive with parameter and
http://mail-archives.apache.org/mod_mbox/hive-user/201205.mbox/%3CBAY151-W9BC976D584FD172E7D70BC0160#phx.gbl%3E
Then variables can be used in Hive views, as long as quotes are used around them:
CREATE VIEW get_date AS
SELECT "${hiveconf:dateRangeFrom}", unix_timestamp("${hiveconf:dateRangeFrom}" , 'yyyy-MM-dd');
This allows the view to be created, so trying to call the view using a parameter:
hive -e "Select * from get_date" --hiveconf dateRangeFrom='2014-01-01'
just returns the variable name:
${hiveconf:dateRangeFrom} NULL
Time taken: 20.614 seconds, Fetched: 1 row(s)
Using single quotes instead:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT '${hiveconf:dateRangeFrom}', unix_timestamp('${hiveconf:dateRangeFrom} ', 'yyyy-MM-dd');
Gives the same result, just the variable name.
3) Create a view in an interactive session with the variable already set
SET hiveconf:dateRangeFrom="2014-02-01";
Rebuild the original view, with the variables without quotes
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
Then calling "select * from get_date;" from within the session gives the expected result.
As does calling from the command line, with the same parameter value:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-02-01'
However, if we call the view with a different parameter, then we still get the original answer:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-09-12'
2014-02-01 1391212800
Time taken: 24.773 seconds, Fetched: 1 row(s)
If we set the variable inside a new session:
SET hiveconf:dateRangeFrom="2014-06-01";
or even do not set it at all, we still get the same result.
Looking at the extended view definition, the reason is obvious:
hive> describe extended get_date;
OK
_c0 string
_c1 bigint
Detailed Table Information Table(tableName:get_date, dbName:default, owner:
36015to, createTime:1410523149, lastAccessTime:0, retention:0, sd:StorageDescrip
tor(cols:[FieldSchema(name:_c0, type:string, comment:null), FieldSchema(name:_c1
, type:bigint, comment:null)], location:null, inputFormat:org.apache.hadoop.mapr
ed.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequen
ceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:nu
ll, serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], parameter
s:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValu
eLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{
transient_lastDdlTime=1410523149}, ***viewOriginalText:SELECT "2014-02-01", unix_t
imestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", un
ix_timestamp("2014-02-01" , 'yyyy-MM-dd')***, tableType:VIRTUAL_VIEW)
Time taken: 0.123 seconds, Fetched: 4 row(s)
The variable substitution took place when the view was created, and hard-coded that date into the definition:
viewOriginalText:SELECT "2014-02-01", unix_t
imestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", un
ix_timestamp("2014-02-01" , 'yyyy-MM-dd')
4) Switch off variable substitution
Hive is clearly putting in the current value of the variable when the view is created, so I tried switching substitution off and recreating the view:
hive> set hive.variable.substitute;
hive.variable.substitute=true
hive> set hive.variable.substitute = false;
hive> set hive.variable.substitute;
hive.variable.substitute=false
The Create View statement still fails with the same error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hiveconf' in select clause
5) Workaround
If we create a sql file that creates the views, testParam.sql, we can sort of get around the problem:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT '${hiveconf:dateRangeFrom}', unix_timestamp('${hiveconf:dateRangeFrom}' , 'yyyy-MM-dd');
SELECT * FROM get_date;
Calling that from the command line gives the expected results:
hive -f testParam.sql --hiveconf dateRangeFrom='2014-08-01'
2014-08-01 1406847600
Time taken: 20.763 seconds, Fetched: 1 row(s)
hive -f testParam.sql --hiveconf dateRangeFrom='2014-09-12'
2014-09-12 1410476400
Time taken: 19.74 seconds, Fetched: 1 row(s)
This does work, and will be fine for now, but is hardly ideal for a distributed, multi-user environment. Looking at the view meta-data, we can see that the view is always destroyed and rebuilt with the latest parameters:
transient_lastDdlTime=1410525287}, viewOriginalText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), viewExpandedText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), tableType:VIRTUAL_VIEW)
So, how can I create a view that can be supplied with dynamic parameters at run time, without constantly rebuilding it?

How are you defining dateRangeFrom? I think dateRangeFrom can be generated dynamically from the current_date function, adding or subtracting days as your requirement dictates. You can simply use Hive functions for that.

I don't know if this is what you're looking for!
If you are passing values from a bash script, this should do the job:
dateRangeFrom=$(date +"%Y-%m-%d")
hive -e "Select '${dateRangeFrom}' , unix_timestamp('${dateRangeFrom}' , 'yyyy-MM-dd');"
If you want to set the value in the Hive script itself, you can do something like this (the dollar signs are escaped so that bash leaves the placeholders for Hive to substitute):
hive -e "SET hivevar:dateRangeFrom=2017-11-21; USE mydb; SELECT '\${dateRangeFrom}', unix_timestamp('\${dateRangeFrom}', 'yyyy-MM-dd');"
If you want to store the same Hive query in an HQL file and run it externally, then you need to pass the value like this:
hive -f /abc/user/script.hql --hivevar dateRangeFrom=2017-11-21
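A minimal sketch of that last variant, assuming a POSIX shell: the query lives in a file with the placeholder left intact, and each run supplies its own date. The file path and the helper name hive_cmd_for_date are made up for illustration, and the command is printed rather than executed so the sketch runs without a Hive install.

```shell
# Hypothetical location for the query file; any writable path works.
hql_file="${TMPDIR:-/tmp}/get_date_$$.hql"

# The quoted delimiter ('EOF') stops the shell from expanding ${...},
# so the placeholder reaches Hive untouched.
cat > "$hql_file" <<'EOF'
SELECT '${hivevar:dateRangeFrom}',
       unix_timestamp('${hivevar:dateRangeFrom}', 'yyyy-MM-dd');
EOF

# Hypothetical helper: prints the hive command line for a given date.
# Drop the printf and run the command directly to execute it for real.
hive_cmd_for_date() {
  printf 'hive -f %s --hivevar dateRangeFrom=%s\n' "$hql_file" "$1"
}

hive_cmd_for_date 2014-08-01
hive_cmd_for_date 2014-09-12
```

Because the value is substituted each time the file is run, nothing is stored in the metastore, unlike a view, whose text is fixed when it is created.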

Related

Getting Unknown Command error on IF-THEN-ELSE

I have the following query that I am using in Oracle 11g
IF EXISTS (SELECT * FROM EMPLOYEE_MASTER WHERE EMPID='ABCD32643')
THEN
update EMPLOYEE_MASTER set EMPID='A62352',EMPNAME='JOHN DOE',EMPTYPE='1' where EMPID='ABCD32643' ;
ELSE
insert into EMPLOYEE_MASTER(EMPID,EMPNAME,EMPTYPE) values('A62352','JOHN DOE','1') ;
END IF;
On running the statement I get the following output:
Error starting at line : 4 in command -
ELSE
Error report -
Unknown Command
1 row inserted.
Error starting at line : 6 in command -
END IF
Error report -
Unknown Command
The values get inserted with error when I run it directly. But when I try to execute this query through my application I get an oracle exception because of the error generated :
ORA-00900: invalid SQL statement
And hence the values are not inserted.
I am relatively new to Oracle. Please advise on what's wrong with the above query so that I could run this query error free.
If MERGE doesn't work for you, try the following:
begin
update EMPLOYEE_MASTER set EMPID='A62352',EMPNAME='JOHN DOE',EMPTYPE='1'
where EMPID='ABCD32643' ;
if SQL%ROWCOUNT=0 then
insert into EMPLOYEE_MASTER(EMPID,EMPNAME,EMPTYPE)
values('A62352','JOHN DOE','1') ;
end if;
end;
Here you do the update on spec, then check whether it found a matching row, and insert if it didn't.
"what's wrong with the above query "
What's wrong with the query is that it is not a query (SQL). It should be a program snippet (PL/SQL), but it isn't written as a PL/SQL block, framed by the BEGIN and END; keywords.
But turning it into an anonymous PL/SQL block won't help. Oracle PL/SQL does not support the IF EXISTS (select ... syntax.
Fortunately Oracle SQL does support the MERGE statement, which does the same thing as your code, with less typing.
merge into EMPLOYEE_MASTER em
using ( select 'A62352' as empid,
'JOHN DOE' as empname,
'1' as emptype
from dual ) q
on (q.empid = em.empid)
when not matched then
insert (EMPID,EMPNAME,EMPTYPE)
values (q.empid, q.empname, q.emptype)
when matched then
update
set em.empname = q.empname, em.emptype = q.emptype
/
Except that you're trying to update empid as well. That's not supported in MERGE. Why would you want to change the primary key?
"Does this query need me to add values to all columns in the table? "
The INSERT can have all the columns in the table. The UPDATE cannot change the columns used in the ON clause (usually the primary key) because that's a limitation of the way MERGE works. I think it's the same key preservation mechanism we see when updating views. Find out more.

Hive syntax: purpose of curly braces and dollar sign

I'm reading over some Hive scripts from another team in my company and having trouble understanding a specific part of one. The part in question is: where dt='${product_dt}', which can be found on the third line from the bottom of the code chunk below.
I've never seen this syntax before nor am I able to find anything via Google search (probably because I don't know the correct search terms to use). Any insight into what that where row filter step is doing would be appreciated.
set hive.security.authorization.enabled=false;
add jar /opt/mobiletl/prod_workflow_dir/lib/hiveudf_hash.jar;
create temporary function hash_string as 'HashString';
drop table 00_truthset_product_email_uid_pid;
create table 00_truthset_product_email_uid_pid as
select distinct email,
concat_ws('|', hash_string(lower(email), "SHA-1"),
hash_string(lower(email), "MD5"),
hash_string(upper(email), "SHA-1"),
hash_string(upper(email), "MD5")) as hashed_email,
uid, address_id, confidencescore
from product.prod_vintages
where dt='${product_dt}'
and email is not null and email != ''
and address_id is not null and address_id != '';
I tried set product_dt = 2014-12;, but it doesn't seem to work:
hive> SELECT dt FROM enabilink.prod_vintages GROUP BY dt LIMIT 10;
. . .
dt
2014-12
2015-01
2015-02
2015-03
2015-05
2015-07
2015-10
2016-01
2016-02
2016-03
hive> set product_dt = 2014-12;
hive> SELECT email FROM product.prod_vintages WHERE dt='${product_dt}';
. . .
Total MapReduce CPU Time Spent: 2 seconds 570 msec
OK
email
Time taken: 25.801 seconds
Those are variables set in Hive. If you have set a variable before the query (in the same session), Hive will replace the placeholder with the specified value,
for example:
set product_dt=03-11-2012;
Edit
Make sure that you remove any spaces in your dt field (use the trim UDF). Also, set the variable without spaces.
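To make the substitution concrete: Hive swaps the placeholder for the variable's value as plain text before the query is parsed. The sketch below only imitates that step with sed in a POSIX shell (it is not Hive's own code); the variable names template and rendered are illustrative.

```shell
# Template line taken from the script in the question; the backslash keeps
# the shell from expanding the placeholder itself.
template="where dt='\${product_dt}'"
product_dt='2014-12'

# Imitate the textual substitution: swap the placeholder for the value.
# (Assumes the value contains no '/' or '&', which sed treats specially.)
rendered=$(printf '%s\n' "$template" | sed 's/\${product_dt}/'"$product_dt"'/g')

printf '%s\n' "$rendered"   # prints: where dt='2014-12'
```

In the Hive CLI, an unqualified ${product_dt} is typically resolved from the hivevar namespace, so setting it with set hivevar:product_dt=2014-12; (or passing --hivevar product_dt=2014-12 on the command line) may be what the script expects. Treat that as an assumption to check against your Hive version.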

Searching first condition first and only if not available then second condition

I am writing an SQL query where the query should first search the first value, and only if this value is missing the query should search for the second value.
I have two tables. One of these tables contains the modification date (this is not always filled and can be null) and a creation date which is always filled.
Now what I want is that the query first looks in the table with the modification date and only if it is null looks at the table with the creation date.
Example of the query:
Select *
from all_articles
where to_char(modification_date, 'YYYYMMDD') = to_char(sysdate, 'YYYYMMDD')-1
-- if this is an empty record then
to_char(creation_date, 'YYYYMMDD') = to_char(sysdate, 'YYYYMMDD')-1
Can anyone help me with this query?
Almost all the major RDBMSs have built-in functions to handle such a situation.
Oracle has the NVL function, which works as follows:
NVL(Modified_dt, Create_dt);
The above will return Modified_dt column data by default. However, if that isn't available, it will return Create_dt.
See here for details:
http://www.w3schools.com/sql/sql_isnull.asp

birt sql dataset with parameters sql error

I have a birt dataset for a db2 query. My query works fine without parameters with the following query...
with params as (SELECT '2014-02-16' enddate,'1' locationid FROM sysibm.sysdummy1)
select
t.registerid
from (
select
...
FROM params, mytable sos
WHERE sos.locationid=params.locationid
AND sos.repositorytype ='xxx'
AND sos.repositoryaccountability='xxx'
AND sos.terminalid='xxx'
AND DATE(sos.balanceDate) between date(params.enddate)-6 DAY and date(params.enddate)
GROUP BY sos.terminalid,sos.balancedate,params.enddate) t
GROUP BY
t.registerid
WITH UR
But when I change the top line to ...
with params as (SELECT ? enddate,? locationid FROM sysibm.sysdummy1)
And make the two input parameters of string datatype, I get DB2 error SQLCODE -418. But I know that it is not my query, because my query works.
What is the right way for me to set up the parameters so there is no error?
thanks
I'm not familiar with DB2 programming, but on Oracle the ? works anywhere in the query.
Have you looked at http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=%2Fcom.ibm.db2z9.doc.codes%2Fsrc%2Ftpc%2Fn418.htm?
Seems that on DB2 it's a bit more complicated and you should use "typed parameter markers".
The doc says:
Typed parameter marker
A parameter marker that is specified with its target data type. A typed parameter marker has the general form:
CAST(? AS data-type)
This invocation of a CAST specification is a "promise" that the data type of the parameter at run time will be of the data type that is specified or some data type that is assignable to the specified data type.
Apart from that, always ensure that your date strings are in the format that the DB expects, and use explicit format masks in the date function, like this:
with params as (
SELECT cast (? as varchar(10)) enddate,
cast (? as varchar(80)) locationid
FROM sysibm.sysdummy1
)
select
...
from params, ...
where ...
AND DATE(sos.balanceDate) between date(XXX(params.enddate))-6 DAY and date(XXX(params.enddate))
...
Unfortunately I cannot tell you how the XXX function should look on DB2.
On Oracle, an example would be
to_date('2014-02-18', 'YYYY-MM-DD')
On DB2, see Converting a string to a date in DB2
In addition to hvb's answer, I see two options:
Option 1: you could use a DB2 stored procedure instead of a plain SQL query. That avoids the limitations you are facing, which are due to JDBC query parameters.
Option 2: we should be able to remove the first line of the query ("with params as") and replace it with question marks within the query:
select
t.registerid
from (
select
sos.terminalid,sos.balancedate,max(sos.balanceDate) as maxdate
FROM mytable sos
WHERE sos.locationid=?
AND sos.repositorytype ='xxx'
AND sos.repositoryaccountability='xxx'
AND sos.terminalid='xxx'
AND DATE(sos.balanceDate) between date(?)-6 DAY and date(?)
GROUP BY sos.terminalid,sos.balancedate) t
GROUP BY
t.registerid
A minor drawback is that this time we need to declare 3 dataset parameters in BIRT instead of 2. More annoyingly, I removed params.endDate from the GROUP BY and replaced it with max(sos.balanceDate) in the select clause. This is very close but not strictly equivalent. If this is not acceptable in your context, a stored procedure might be the best option.

Get the sysdate -1 in Hive

Is there any way to get the current date minus one in Hive, i.e. always yesterday's date?
And in this format: 20120805?
I can run my query like this to get the data for yesterday's date, as today is Aug 6th:
select * from table1 where dt = '20120805';
But I then tried it this way, using the date_sub function to get yesterday's date, as the table below is partitioned on the date (dt) column:
select * from table1 where dt = date_sub(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(),
'yyyyMMdd')) , 1) limit 10;
It looks for the data in all the partitions. Why? Am I doing something wrong in my query?
How can I make the evaluation happen in a subquery to avoid scanning the whole table?
Try something like:
select * from table1
where dt >= from_unixtime(unix_timestamp()-1*60*60*24, 'yyyyMMdd');
This works if you don't mind that hive scans the entire table. from_unixtime is not deterministic, so the query planner in Hive won't optimize for you. For many cases (for example log files), not specifying a deterministic partition key can cause a very large hadoop job to start since it will scan the whole table, not just the rows with the given partition key.
If this matters to you, you can launch hive with an additional option
$ hive -hiveconf date_yesterday=20150331
And in the script or hive terminal use
select * from table1
where dt >= ${hiveconf:date_yesterday};
The name of the variable doesn't matter, nor does the value; you can set it to the prior date using unix commands. In the specific case of the OP:
$ hive -hiveconf date_yesterday=$(date --date yesterday "+%Y%m%d")
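As a sketch of the same idea in script form, assuming a POSIX shell: GNU date accepts --date yesterday, while BSD/macOS date uses -v-1d, so the hypothetical helper below tries one and falls back to the other. The file name query.hql is a placeholder, and the command is printed rather than run.

```shell
# Hypothetical helper: yesterday's date as yyyyMMdd (e.g. 20120805).
yesterday() {
  # GNU date first; fall back to the BSD/macOS form if that fails.
  date --date yesterday '+%Y%m%d' 2>/dev/null || date -v-1d '+%Y%m%d'
}

date_yesterday=$(yesterday)

# Print the command instead of running it, so no Hive install is needed.
# query.hql is a placeholder for a script that references
# ${hiveconf:date_yesterday}.
printf 'hive --hiveconf date_yesterday=%s -f query.hql\n' "$date_yesterday"
```

Because the shell computes the value before Hive starts, the query sees a constant for dt, which is what the answer above relies on to avoid scanning the whole table.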
In mysql:
select DATE_FORMAT(DATE_SUB(CURDATE(), INTERVAL 1 DAY),'%Y%m%d');
In sqlserver :
SELECT convert(varchar,getDate()-1,112)
Use this query:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP()-1*24*60*60,'%Y%m%d');
It looks like DATE_SUB assumes date in format yyyy-MM-dd. So you might have to do some more format manipulation to get to your format. Try this:
select * from table1
where dt = FROM_UNIXTIME(
UNIX_TIMESTAMP(
DATE_SUB(
FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')
, 1)
)
, 'yyyyMMdd') limit 10;
Use this:
select * from table1 where dt = date_format(concat(year(date_sub(current_timestamp,1)),'-', month(date_sub(current_timestamp,1)), '-', day(date_sub(current_timestamp,1))), 'yyyyMMdd') limit 10;
This will give a deterministic result (a string) of your partition.
I know it's super verbose.

Resources