Hive syntax: purpose of curly braces and dollar sign

I'm reading over some Hive scripts from another team in my company and having trouble understanding a specific part of them. The part in question is where dt='${product_dt}', which can be found on the third line from the bottom of the code chunk below.
I've never seen this syntax before, nor am I able to find anything about it via Google search (probably because I don't know the correct search terms to use). Any insight into what that where row filter step is doing would be appreciated.
set hive.security.authorization.enabled=false;
add jar /opt/mobiletl/prod_workflow_dir/lib/hiveudf_hash.jar;
create temporary function hash_string as 'HashString';
drop table 00_truthset_product_email_uid_pid;
create table 00_truthset_product_email_uid_pid as
select distinct email,
concat_ws('|', hash_string(lower(email), "SHA-1"),
hash_string(lower(email), "MD5"),
hash_string(upper(email), "SHA-1"),
hash_string(upper(email), "MD5")) as hashed_email,
uid, address_id, confidencescore
from product.prod_vintages
where dt='${product_dt}'
and email is not null and email != ''
and address_id is not null and address_id != '';
I tried set product_dt = 2014-12;, but it doesn't seem to work:
hive> SELECT dt FROM enabilink.prod_vintages GROUP BY dt LIMIT 10;
. . .
dt
2014-12
2015-01
2015-02
2015-03
2015-05
2015-07
2015-10
2016-01
2016-02
2016-03
hive> set product_dt = 2014-12;
hive> SELECT email FROM product.prod_vintages WHERE dt='${product_dt}';
. . .
Total MapReduce CPU Time Spent: 2 seconds 570 msec
OK
email
Time taken: 25.801 seconds

Those are variables set in Hive. If you have set the variables before the query (in the same session), Hive will replace them with the specified values.
For example:
set product_dt=03-11-2012;
Edit
Make sure that you are removing the spaces in your dt field (use the trim UDF). Also, set the variable without spaces.
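For instance, here is a minimal sketch of a corrected session (table and column names taken from the question; the bare ${product_dt} reference is resolved from variables set earlier in the same session):
set product_dt=2014-12;
SELECT email
FROM product.prod_vintages
WHERE trim(dt) = '${product_dt}';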

Related

Set variables in HIVE query

I am trying to follow the post here to set a variable in my Hive query. Assume I have the following file in HDFS:
/home/hduser/test/hr.txt
Berg,12000
Faviet,9000
Chen,8200
Urman,7800
Sciarra,7700
Popp,6900
Paino,8790
I then created my schema on top of the data as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/hduser/test/';
I want to create 4 tiles for the table, but I don't want to hardcode the number of tiles; instead I want to pass it in as a variable. My code is below:
SET q1=select ceiling(count(*)/2) from employees;
SELECT lname,
salary,
NTILE(${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
However, this throws an error:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Number of tiles must be an int expression
I tried to use quotes when calling the variable, as in '${hiveconf:q1}', but that didn't seem to help: SET stores the literal query text rather than its result, so NTILE never receives an int. If I hardcode the number of tiles (which I am trying to avoid), the workflow will go something like this:
SELECT lname,
salary,
NTILE(4) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
which yields
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Thoughts?
When there isn't a documented way, one can use documented features to provide a clean enough hack :)
Here's my attempt, using dfs commands from Hive, shell commands from Hive, the source command, and what not. I guess it might not work out of the box with queries through HiveServer2. I would be glad if there were a prettier way.
Let's go
Basic setup
SET EMPLOYEE_TABLE_LOCATION=/home/hduser/test/;
CREATE EXTERNAL TABLE IF NOT EXISTS employees (lname STRING, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '${hiveconf:EMPLOYEE_TABLE_LOCATION}';
SET PATH_TO_SETTINGS_FILE=hdfs:/tmp/query_to_setting;
SET FILENAME_ON_LOCAL_FS=query_to_setting.sql;
Generate a file in HDFS with content "SET q1=<the-query-result>;"
CREATE TABLE query_to_setting_table
LOCATION '${hiveconf:PATH_TO_SETTINGS_FILE}'
AS
SELECT concat('SET q1=', ceiling(count(*)/2),'\;') from employees;
Source the generated file as you would any SQL file.
First pull the file to the local FS, since 'source' only operates on the local disk...
dfs -get ${hiveconf:PATH_TO_SETTINGS_FILE}/000000_0 ${hiveconf:FILENAME_ON_LOCAL_FS};
source ${hiveconf:FILENAME_ON_LOCAL_FS};
Try the setting
hive> SET q1;
q1=4
Use the setting in a query
hive > SELECT lname,
salary,
NTILE( ${hiveconf:q1}) OVER (
ORDER BY salary DESC) AS quartile
FROM employees;
OK
Berg 12000 1
Faviet 9000 1
Paino 8790 2
Chen 8200 2
Urman 7800 3
Sciarra 7700 3
Popp 6900 4
Optional cleanup
!rm ${hiveconf:FILENAME_ON_LOCAL_FS};
DROP TABLE query_to_setting_table;
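An alternative sketch, not from the original answer: if you can wrap the call in a shell script, compute the value outside Hive and pass it back in with --hivevar, avoiding the settings table and the source step entirely:
# Capture the single-value result in a shell variable (-S = silent mode)
q1=$(hive -S -e 'select ceiling(count(*)/2) from employees;')
# Feed it back in; single quotes keep the shell from expanding ${hivevar:q1}
hive --hivevar q1="$q1" -e '
SELECT lname,
salary,
NTILE(${hivevar:q1}) OVER (ORDER BY salary DESC) AS quartile
FROM employees;'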

H2-Database CommandCentre: CSVREAD skips loading the first(!) csv-Line of Data

Folks,
H2 skips/drops the FIRST line of the following csv dataset, and I couldn't find a solution or workaround.
I have already looked through the various H2 tutorials and of course skimmed the internet.
Am I the only one (a newbie; my "home" is the IBM mainframe) who has such a problem inserting into an H2 database using CSVREAD?
In this example I expected the CSVREAD utility to insert 5 (five!) lines into the created table "VL01T098".
!!! There is no column header line in the csv dataset - I get the data this way only !!!
AJ52B1;999;2013-01-04;2014-03-01;03Z;A
AJ52C1;777;2012-09-03;2012-08-19;03Z;
AJ52B1;;2013-01-04;2014-03-01;;X
AJ52B1;321;2014-05-12;;03Z;Y
AJ52B1;999;;2014-03-01;03Z;Z
And here is my SQL (from the H2-joboutput):
DROP TABLE IF EXISTS VL01T098;
Update count: 0
(0 ms)
CREATE TABLE VL01T098 (
MODELL CHAR(6)
, FZG_STAT CHAR(3)
, ABGABE_DATUM DATE
, VERSAND_DATUM DATE
, FZG_GRUPPE CHAR(3)
, AV_KZ CHAR(1))
AS SELECT * FROM
CSVREAD
('D:\VL01D_Test\LOAD-csv\T098.csv',
null,
'charset=UTF-8 fieldSeparator=; lineComment=#');
COMMIT;
select count(*) from VL01T098;
select * from VL01T098;
MODELL FZG_STAT ABGABE_DATUM VERSAND_DATUM FZG_GRUPPE AV_KZ
AJ52C1 777 2012-09-03 2012-08-19 03Z null
AJ52B1 null 2013-01-04 2014-03-01 null X
AJ52B1 321 2014-05-12 null 03Z Y
AJ52B1 999 null 2014-03-01 03Z Z
(4 rows, 0 ms)
Where has just the first csv line gone ... and why is it lost?
Could you please help an H2 newbie ... with some IBM DB2 experience?
Many thanks in advance
Achim
You didn't specify a column list in the CSVREAD function. That means the column list is read from the file, as documented:
If the column names are specified (a list of column names separated with the fieldSeparator), those are used, otherwise (or if they are set to NULL) the first line of the file is interpreted as the column names.
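A minimal sketch of the fix, then: pass the column list explicitly as the second argument (names separated by the configured fieldSeparator), so the first data line is no longer consumed as a header:
CREATE TABLE VL01T098 (
MODELL CHAR(6)
, FZG_STAT CHAR(3)
, ABGABE_DATUM DATE
, VERSAND_DATUM DATE
, FZG_GRUPPE CHAR(3)
, AV_KZ CHAR(1))
AS SELECT * FROM
CSVREAD
('D:\VL01D_Test\LOAD-csv\T098.csv',
'MODELL;FZG_STAT;ABGABE_DATUM;VERSAND_DATUM;FZG_GRUPPE;AV_KZ',
'charset=UTF-8 fieldSeparator=; lineComment=#');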

How to enhance performance of sql query?

Below is the major part of my script which interacts with an Oracle database through SQL*Plus.
#--------------- Now connecting to sqlplus
`$SQLPLUS \@${basePath}/VoucherQuery.sql $startdate > ${basePath}/logs/QueryResult.$currentDate.log`;
if ( $? == 0) {
logger("Processing with the sqlplus is completed. For more details check ${basePath}/logs/QueryResult.$currentDate.log ", 0); }
else {
logger("Not able to fetch data from sqlplus. Please check", 1); exit;}
#print "select * from sample where SERIALNUMBER = $serial\n";
open (FH, "${basePath}/logs/QueryResult.$currentDate.log") or die "Can't open query ${basePath}/logs/QueryResult.$currentDate.log file: $!\n";
my ($serial_number, $state, $at, $operator_id, $old_state);
while (my $data = <FH>) {
chomp ($data);
#print $data."\n";
my @data = split (/\s+/, $data);
my ($serial_number, $state, $at, $operator_id, $old_state) = @data[0..4];
my ($date, $time) = split (/T/, $at);                 # "at" looks like YYYYMMDDThh:mm:ss
$date =~ s/(\d{4})(\d{2})(\d{2})/$3-$month{$2}-$1/;   # YYYYMMDD -> DD-Mon-YYYY (%month is defined elsewhere in the script)
$date =~ s/(.*)(\d{2})(\d{2})/$1$3/;                  # drop the century: DD-Mon-YYYY -> DD-Mon-YY
print WFH "$circle,$date,$time,$operator_id,$serial_number,$old_state,$state\n";
}
close(FH);
close(WFH);
-------------------------------------------------------------
>cat VoucherQuery.sql
SELECT * FROM (SELECT serialnumber, state, at, operatorid, lag(state) OVER ( PARTITION BY serialnumber ORDER BY at) AS previous FROM VOUCHER) WHERE at LIKE '&1';
But the database table contains millions of records and even a simple select count(*) query isn't able to generate output. Now the problem is that no constraints have been defined while creating the database.
I have experience of scripting, but I am quite a novice in SQL queries as far as performance is concerned.
I want to ask
How much difference will it make if I define the primary key constraint within the table? (It is a third-party server so I have to be sure before making any changes.)
Will an index improve the performance? How could it help in this specific query?
Should I break this query into more, simpler queries?
Here is the table description
SQL> DESC VOUCHER;
Name Null? Type
-------------------------- -------- -------------------------------
SERIALNUMBER VARCHAR2(20)
STATE VARCHAR2(4000)
AT VARCHAR2(4000)
OPERATORID VARCHAR2(4000)
SUBSCRIBERID VARCHAR2(20)
TRANSACTIONID VARCHAR2(20)
One more thing: I have to deal with SQL*Plus only; I can't use DBI with the DBD::Oracle module because of Solaris issues. I want to solve this on my own, but I need your advice on these performance issues, as I can't use hit-and-miss methods on them.
I am far less experienced in Oracle than in other databases, so this may well not be correct. But if it works, it should certainly be faster than your existing SQL, because it limits the data it is working on first, before adding the previous column.
SELECT
serialnumber,
state,
at,
operatorid,
LAG(state) OVER (PARTITION BY serialnumber ORDER BY at) AS previous
FROM
voucher
WHERE
serialnumber in (SELECT DISTINCT serialnumber FROM voucher WHERE at LIKE '&1')
Note that it won't produce exactly the same result as your own query because it will include all records relating to a given serialnumber as long as at least one of those records has an at column that matches &1.
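On the index question from the post: only a sketch, and it assumes you are allowed to add objects to the third-party schema, but since the query filters on at and partitions/orders by serialnumber, at, indexes along those lines are the usual first thing to try:
-- Index names are hypothetical; columns are taken from the DESC VOUCHER output.
-- A plain index on AT can serve the LIKE '&1' filter, provided the pattern
-- does not begin with a wildcard.
CREATE INDEX voucher_at_ix ON voucher (at);
-- A composite index matching LAG(state) OVER (PARTITION BY serialnumber ORDER BY at):
CREATE INDEX voucher_sn_at_ix ON voucher (serialnumber, at);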

Cannot define a dynamic parameter in a Hive query

I'm trying to set up some Views in Hive that will take a date as a dynamic parameter. In my working below, I've swapped to using the hiveconf variable in the Select clause, so we can see what's going on, but the principle remains the same.
According to this and this, I should be able to include "${hiveconf:dateRangeFrom}" in my Create View statement and supply the hiveconf:dateRangeFrom variable at run time for maximum happiness, but this is just not happening - Hive appears to be using whatever value is assigned to the variable when the View is created and hard-coding it into the View definition, not substituting it at runtime as you might expect.
I've got a workaround, whereby I supply a parameter to a sql file that then creates all the views, substituting the desired value in, but this isn't sustainable.
All the working is below, so you can see how I came to this conclusion. Any ideas?
1) Supply a hiveconf value to a simple query
(needs to be a date for the ultimate query)
hive -e "Select ${hiveconf:dateRangeFrom} , unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');" --hiveconf dateRangeFrom='2014-01-01'
The date will be returned as supplied, and converted to a unix timestamp (eg "2014-01-01"=1388534400, "2014-09-12"=1410476400). The script can be run repeatedly, with results changing accordingly with the parameter.
2) Create a view that returns this data
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
This returns the error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hivevar' in select clause
Presumably this is because it is trying to do a replacement, but the ${hiveconf:dateRangeFrom} variable has not been initialized at this point.
According to:
Creating Views in Hive with parameter and
http://mail-archives.apache.org/mod_mbox/hive-user/201205.mbox/%3CBAY151-W9BC976D584FD172E7D70BC0160#phx.gbl%3E
Then variables can be used in Hive views, as long as quotes are used around them:
CREATE VIEW get_date AS
SELECT "${hiveconf:dateRangeFrom}", unix_timestamp("${hiveconf:dateRangeFrom}" , 'yyyy-MM-dd');
This allows the view to be created, so trying to call the view using a parameter:
hive -e "Select * from get_date" --hiveconf dateRangeFrom='2014-01-01'
just returns the variable name:
${hiveconf:dateRangeFrom} NULL
Time taken: 20.614 seconds, Fetched: 1 row(s)
Using single quotes instead:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT '${hiveconf:dateRangeFrom}', unix_timestamp('${hiveconf:dateRangeFrom} ', 'yyyy-MM-dd');
Gives the same result, just the variable name.
3) Create a view in an interactive session with the variable already set
SET hiveconf:dateRangeFrom="2014-02-01";
Rebuild the original view, with the variables without quotes
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
Then calling "select * from get_date;" from within the session gives the expected result.
As does calling from the command line, with the same parameter value:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-02-01'
However, if we call the view with a different parameter, then we still get the original answer:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-09-12'
2014-02-01 1391212800
Time taken: 24.773 seconds, Fetched: 1 row(s)
If we set the variable inside a new session:
SET hiveconf:dateRangeFrom="2014-06-01";
or even do not set it at all, we still get the same result.
Looking at the extended view definition, the reason is obvious:
hive> describe extended get_date;
OK
_c0 string
_c1 bigint
Detailed Table Information Table(tableName:get_date, dbName:default, owner:36015to, createTime:1410523149, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:_c0, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], location:null, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1410523149}, viewOriginalText:SELECT "2014-02-01", unix_timestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", unix_timestamp("2014-02-01" , 'yyyy-MM-dd'), tableType:VIRTUAL_VIEW)
Time taken: 0.123 seconds, Fetched: 4 row(s)
The variable substitution took place when the view was created, and hard-coded that date into the definition:
viewOriginalText:SELECT "2014-02-01", unix_timestamp("2014-02-01" , 'yyyy-MM-dd'),
viewExpandedText:SELECT "2014-02-01", unix_timestamp("2014-02-01" , 'yyyy-MM-dd')
4) Switch off variable substitution
Hive is clearly substituting the current value of the variable when the view is created, so I tried switching substitution off and recreating the view:
hive> set hive.variable.substitute;
hive.variable.substitute=true
hive> set hive.variable.substitute = false;
hive> set hive.variable.substitute;
hive.variable.substitute=false
The Create View statement still fails with the same error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hiveconf' in select clause
5) Workaround
If we create a sql file that creates the views, testParam.sql, we can sort of get around the problem:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hivevar:dateRangeFrom}, unix_timestamp(${hivevar:dateRangeFrom} , 'yyyy-MM-dd');
SELECT * FROM get_date;
Calling that from the command line gives the expected results:
hive -f testParam.sql --hiveconf dateRangeFrom='2014-08-01'
2014-08-01 1406847600
Time taken: 20.763 seconds, Fetched: 1 row(s)
hive -f testParam.sql --hiveconf dateRangeFrom='2014-09-12'
2014-09-12 1410476400
Time taken: 19.74 seconds, Fetched: 1 row(s)
This does work, and will be fine for now, but is hardly ideal for a distributed, multi-user environment. Looking at the view meta-data, we can see that the view is always destroyed and rebuilt with the latest parameters:
transient_lastDdlTime=1410525287}, viewOriginalText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), viewExpandedText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), tableType:VIRTUAL_VIEW)
So, how does one create a view that can be supplied with dynamic parameters at runtime, without constantly rebuilding it?
How are you defining dateRangeFrom? I think dateRangeFrom can be dynamically generated from the current_date function by adding and subtracting days based on your requirement. You can simply use Hive date functions for that.
I don't know if this is what you're looking for!
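For instance, a minimal sketch (my_table and dt are placeholder names) of a view whose date range moves with the current date, so it needs neither a parameter nor a rebuild:
CREATE VIEW recent_30_days AS
SELECT *
FROM my_table -- hypothetical table with a 'yyyy-MM-dd' dt column
WHERE dt >= date_sub(current_date, 30); -- current_date requires Hive 1.2+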
If you are passing values from a bash script, this should do the job:
dateRangeFrom=$(date +"%Y-%m-%d")
hive -e "Select '${dateRangeFrom}' , unix_timestamp('${dateRangeFrom}' , 'yyyy-MM-dd');"
If you want to set the value in hive script itself you can do something like this
hive -e "SET hivevar:dateRangeFrom=2017-11-21;USE mydb; Select '${dateRangeFrom}' , unix_timestamp('${dateRangeFrom}' , 'yyyy-MM-dd');"
If you want to store the same hive query in a HQL file and run it externally, then you need to pass it like this
hive -f /abc/user/script.hql --hivevar dateRangeFrom=2017-11-21

Oracle - dynamic column name in select statement

Question:
Is it possible to have a column name in a select statement changed based on a value in its result set?
For example, if a year value in a result set is less than 1950, name the column OldYear, otherwise name the column NewYear. The year value in the result set is guaranteed to be the same for all records.
I'm thinking this is impossible, but here was my failed attempt to test the idea:
select 1 as
(case
when 2 = 1 then "name1";
when 1 = 1 then "name2")
from dual;
You can't vary a column name per row of a result set. This is basic to relational databases. The names of columns are part of the table "header" and a name applies to the column under it for all rows.
Re comment: OK, maybe the OP Americus means that the result is known to be exactly one row. But regardless, SQL has no syntax to support a dynamic column alias. Column aliases must be constant in a query.
Even dynamic SQL doesn't help, because you'd have to run the query twice. Once to get the value, and a second time to re-run the query with a different column alias.
The "correct" way to do this in SQL is to have both columns, and have the column that is inappropriate be NULL, such as:
SELECT
CASE WHEN year < 1950 THEN year ELSE NULL END AS OldYear,
CASE WHEN year >= 1950 THEN year ELSE NULL END AS NewYear
FROM some_table_with_years;
There is no good reason to change the column name dynamically - it's analogous to the name of a variable in procedural code - it's just a label that you might refer to later in your code, so you don't want it to change at runtime.
I'm guessing what you're really after is a way to format the output (e.g. for printing in a report) differently depending on the data. In that case I would generate the heading text as a separate column in the query, e.g.:
SELECT 1 AS mydata
,case
when 2 = 1 then 'name1'
when 1 = 1 then 'name2'
end AS myheader
FROM dual;
Then the calling procedure would take the values returned for mydata and myheader and format them for output as required.
You will need something similar to this:
select 'select ' || CASE WHEN YEAR<1950 THEN 'OLDYEAR' ELSE 'NEWYEAR' END || ' FROM TABLE1' from TABLE_WITH_DATA
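Note that this only generates the text of a query; a second step still has to execute it, e.g. by opening a ref cursor on the generated string (a sketch with hypothetical names; in practice this would live in a function returning SYS_REFCURSOR):
DECLARE
v_sql VARCHAR2(200);
v_cur SYS_REFCURSOR;
BEGIN
-- Build the statement text; assumes every row yields the same alias choice.
SELECT 'select ' || CASE WHEN year < 1950 THEN 'oldyear' ELSE 'newyear' END || ' from table1'
INTO v_sql
FROM table_with_data
WHERE ROWNUM = 1;
-- Execute it; the caller would fetch rows from v_cur.
OPEN v_cur FOR v_sql;
END;
/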
This solution requires that you launch SQL*Plus with a .sql file from a .bat file, or by some other method, using the appropriate Oracle credentials. The .bat file can be kicked off manually, from a server scheduled task, Control-M job, etc...
Output is a .csv file. This also requires that you replace all commas in the output with some other character, or risk column/data mismatch in the output.
The trick is that your column headers and data are selected in two different SELECT statements.
It isn't perfect, but it does work, and it's the closest to standard Oracle SQL that I've found for a dynamic column header outside of a development environment. We use this extensively to generate recurring daily/weekly/monthly reports to users without resorting to a GUI. Output is saved to a shared network drive directory/Sharepoint.
REM BEGIN runExtract1.bat file -----------------------------------------
sqlplus username/password@database @C:\DailyExtracts\Extract1.sql > C:\DailyExtracts\Extract1.log
exit
REM END runExtract1.bat file -------------------------------------------
REM BEGIN Extract1.sql file --------------------------------------------
set colsep ,
set pagesize 0
set trimspool on
set linesize 4000
column dt new_val X
select to_char(sysdate,'MON-YYYY') dt from dual;
spool c:\DailyExtracts\&X._Extract1.csv
select '&X-Project_id', 'datacolumn2-Project_Name', 'datacolumn3-Plant_id' from dual;
select
PROJ_ID
||','||
replace(PROJ_NAME,',',';')-- "Project Name"
||','||
PLANT_ID-- "Plant ID"
from PROJECTS
where ADDED_DATE >= TO_DATE('01-'||(select to_char(sysdate,'MON-YYYY') from dual));
spool off
exit
/
REM ------------------------------------------------------------------
CSV OUTPUT (opened in Excel and copy/pasted):
old 1: select '&X-Project_id' 'datacolumn2-Project_Name' 'datacolumn3-Plant_id' from dual
new 1: select 'MAR-2018-Project_id' 'datacolumn2-Project_Name' 'datacolumn3-Plant_id' from dual
MAR-2018-Project_id datacolumn2-Project_Name datacolumn3-Plant_id
31415 name1 1007
31415 name1 2032
32123 name2 3302
32123 name2 3384
32963 name3 2530
33629 name4 1161
34180 name5 1173
34180 name5 1205
...
...
etc...
135 rows selected.
