Strange beeline error: "Error: (state=,code=0)" - hadoop

I'm seeing a very strange error when running my HiveQL through beeline:
Error: (state=,code=0)
Error: (state=,code=0)
Aborting command set because "force" is false and command failed: "create table some_database.some_table..."
My query is quite complex, utilizing UNIONS and transforms, but it runs fine when I submit it using the Hive client. It looks something like this:
create table some_database.some_table
stored as rcfile
as select * from (
from some_other_db.table_1
select transform (*)
using "hdfs:///some/transform/script.py"
as (
some_field_1 string,
some_field_2 double
)
union all
from some_other_db.table_2
select transform (*)
using "hdfs:///some/transform/script.py"
as (
some_field_1 string,
some_field_2 double
)
union all
from some_other_db.table_3
select transform (*)
using "hdfs:///some/transform/script.py"
as (
some_field_1 string,
some_field_2 double
)
) all_unions
;
I'm using:
CDH 4.3.0-1
Hive 0.10.0-cdh4.3.0
Beeline version 0.10.0-cdh4.3.0
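One way to get more detail than "Error: (state=,code=0)" is to run beeline with verbose output; --force=true (the "force" the abort message refers to) would also let the remaining commands run after a failure. A sketch, with a hypothetical JDBC URL and script file:
beeline -u jdbc:hive2://localhost:10000 --verbose=true --force=true -f create_some_table.hql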

Related

How to use IF else statement outside hive select?

I want to execute a select statement based on a region check. If the region value is HK then the table should be created from temp.temp1; otherwise it should be created from temp.temp2.
e.g.:
beeline -e "
if [ '$REGION' == 'HK' ]
then
Create table region as Select * from temp.temp1;
else
Create table region as Select * from temp.temp2;
fi
"
Is there any possible way to do it?
Hive itself does not support if-else statements; there's the HPL/SQL procedural extension that may be useful in your case.
That said, I suggest a slightly different approach: if the $REGION variable comes from outside beeline and the tables' schemas match, you can union the results with the corresponding WHERE conditions:
create table region as
select *
from temp.temp1
where '$REGION' == 'HK'
union all
select *
from temp.temp2
where '$REGION' != 'HK'
Hive will build the execution plan and get rid of one of the union parts, so it won't affect the real execution time.
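For instance, if $REGION is a shell variable that gets expanded before beeline sees the query, the call might look like this (a sketch along those lines):
REGION='HK'
beeline -e "
create table region as
select * from temp.temp1 where '$REGION' = 'HK'
union all
select * from temp.temp2 where '$REGION' != 'HK';
"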
Yes, Hive itself doesn't support if-else statements. What I have implemented for now is:
if [ '$REGION' == 'HK' ]
then
beeline -e " Select * from temp.temp1; "
else
beeline -e " Select * from temp.temp2;"
fi
"
I know this is repetitive, but for now this is what we have implemented to execute queries for different regions/blocks.

How to modify Oracle SQL query to Snowflake

Using Oracle SQL, there is a function, noted below, that will allow you to create a "list" of names, phone numbers, etc., without using
multiple DUAL queries and UNION/UNION ALL to get more than one record.
The query below produces a list in this case of 10 names.
SELECT COLUMN_VALUE USERNAME
FROM TABLE(SYS.DBMS_DEBUG_VC2COLL(
'WARNER,JEFF',
'MALITO,CARL',
'MOODY,JEANNE',
'PHILLIPS,HUGH & KELLY',
'PATSANTARAS,VICTORIA',
'BROWN,ROLAND',
'RADOSEVICH,MIKE',
'RIDER,JACK',
'MACLEOD,LENARD',
'SCOTT,DAN' ))
However, when trying to run this same query in Snowflake, it will not work.
I receive this error: SQL compilation error: Invalid identifier SYS.DBMS_DEBUG_VC2COLL
Is there a "Snowflake version" of this query that can be used?
Here are some options, you can see which works best for you.
This works if you can get your SQL to look similar:
SELECT $1::VARCHAR AS column_value
FROM (VALUES ('WARNER,JEFF'), ('MACLEOD,LENARD'), ('SCOTT,DAN'));
This also works if you can get your list to be in a single string, delimited by a pipe or similar:
SELECT value::VARCHAR AS column_value
FROM LATERAL FLATTEN(INPUT=>SPLIT('WARNER,JEFF|MACLEOD,LENARD|SCOTT,DAN', '|'));
If you have the strings in the format 'a','b' and find it painful to do one of the above, I'd do something like this:
SELECT value::VARCHAR AS column_value
FROM LATERAL FLATTEN(INPUT=>SPLIT(ARRAY_TO_STRING(ARRAY_CONSTRUCT('WARNER,JEFF', 'MALITO,CARL', 'MOODY,JEANNE'), '|'), '|'));
Similar to the above suggestions, you can try this:
SELECT VALUE::VARCHAR as column_name
FROM TABLE(FLATTEN(INPUT => ARRAY_CONSTRUCT('WARNER,JEFF', 'MALITO,CARL', 'MOODY,JEANNE'), MODE => 'array'));

Error in Hive : For Exists/Not Exists operator SubQuery must be Correlated

select * from students1;
students1.name students1.age students1.gpa
fred 35 1.28
barney 32 2.32
shyam 32 2.32
select * from students2;
students2.name students2.age
fred 35
barney 32
When I am running this query
select
name,age from students1
where not exists
(select name,age from students2);
I am getting the error below:
Error while compiling statement: FAILED: SemanticException line 39:22
Invalid SubQuery expression 'age' in definition of SubQuery sq_1 [
exists (select name,age from students2) ] used as sq_1 at Line 3:10:
For Exists/Not Exists operator SubQuery must be Correlated.
The error message is clear. The subquery should be correlated when using exists/not exists.
select name,age
from students1 s1
where not exists (select 1
from students2 s2
where s1.name=s2.name and s1.age=s2.age
)
You are trying to achieve a MINUS output of a query, which is unfortunately not available in Hive.
You can read through the limitations of HQL vs SQL here: HQL vs SQL
For usage of not exists, the manual has a good example:
subqueries in hive
MINUS is now available in Hive. You can achieve this as follows:
select name,age from students1
MINUS
select name,age from students2;
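On Hive versions without MINUS, a left anti-join would give the same result (a sketch using the tables above; it keeps rows from students1 with no matching name and age in students2):
select s1.name, s1.age
from students1 s1
left join students2 s2
on s1.name = s2.name and s1.age = s2.age
where s2.name is null;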

Oracle: ORA-01722: invalid number

I have a query which works fine when I run it inside sqlplus:
SQL> SELECT T_0.ID AS ATTR_1_, T_0_0.ID AS ATTR_2_,
CASE WHEN ( T_0.ID=1 AND ( T_0_0.ID=3 OR T_0_1.ID='val_1') )
THEN 'val_1' ELSE 'val_2' END AS TXT, T_0_1.ID,
CASE WHEN T_0.ID='boo' THEN 'boo' END AS EXTRA_FIELD
FROM TEST_TABLE T_0
INNER JOIN TEST_TABLE_2 T_0_0 ON ( T_0_0.ATTR=T_0.ID )
INNER JOIN TEST_TABLE_3 T_0_1 ON ( T_0_1.ID = T_0_0.ID )
WHERE ( ( T_0.ID=1 AND T_0_0.ID=3 )
OR T_0_1.ID=2 OR T_0_0.TXT='val_2');
no rows selected
Although it returns nothing, it still works and does not result in an error. However, when I do the same thing in Python, using bindings, I get this error message:
cx_Oracle.DatabaseError: ORA-01722: invalid number
This is how my query looks in Python, before I do cursor.execute:
SELECT T_0.ID AS ATTR_1_, T_0_0.ID AS ATTR_2_,
CASE WHEN ( T_0.ID=:TXT_ AND ( T_0_0.ID=:TXT__ OR T_0_1.ID=:TXT___ ) )
THEN :TXT___ ELSE :TXT____ END AS TXT, T_0_1.ID,
CASE WHEN T_0.ID=:EXTRA_FIELD THEN :EXTRA_FIELD END AS EXTRA_FIELD
FROM TEST_TABLE T_0
INNER JOIN TEST_TABLE_2 T_0_0 ON ( T_0_0.ATTR=T_0.ID )
INNER JOIN TEST_TABLE_3 T_0_1 ON ( T_0_1.ID = T_0_0.ID )
WHERE ( ( T_0.ID=:ID AND T_0_0.ID=:ID_ )
OR T_0_1.ID=:ID__ OR T_0_0.TXT=:TXT )
The query is just a double-quoted string "SELECT ...". And this is what the dictionary of bind variables looks like:
OrderedDict([('TXT_', 1), ('TXT__', 3), ('TXT___', 'val_1'),
('TXT____', 'val_2'), ('EXTRA_FIELD', 'boo'), ('ID', 1),
('ID_', 3), ('ID__', 2), ('TXT', 'val_2')])
So, as you can see, I have a perfect dictionary: number values are just numbers without quotes, string values are just strings with single quotes. I know you will ask about the schema of the tables, so here it is:
SQL> SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'TEST_TABLE';

COLUMN_NAME  DATA_TYPE
-----------  ---------
ID           NUMBER

SQL> SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'TEST_TABLE_2';

COLUMN_NAME  DATA_TYPE
-----------  ---------
ATTR         NUMBER
ID           NUMBER
TXT          VARCHAR2

SQL> SELECT COLUMN_NAME, DATA_TYPE FROM USER_TAB_COLUMNS
WHERE TABLE_NAME = 'TEST_TABLE_3';

COLUMN_NAME  DATA_TYPE
-----------  ---------
ID           NUMBER
So, it seems like one and the same query works fine in the console but does not work when using Python. Why is that?
EDIT
And here is proof: a screenshot of two console windows. In the first console I run the query in sqlplus; in the second console I print the SQL query and the dictionary used for the bind variables.
EDIT
Oh, it's even more interesting. I was able to reproduce this error in the Oracle shell, and it looks like an Oracle 11g bug. So, look at this:
Please pay attention to the fact that the ID field has a NUMBER type. And then pay attention to these two screens:
In the screen above you can see that everything is OK. However, if we slightly change it by adding OR T_0_1.ID=2 to the WHERE part, then it breaks:
So, this problem is reproducible even in the Oracle shell. You can do it using the schema I provided above.
EDIT
I updated the topic of my question, because it has nothing to do with Python. The whole problem is with Oracle itself.
EDIT
BTW, my last comment does not contradict the beginning of my investigation. The thing is, if I have some data in TEST_TABLE_3, then the query breaks. And if I delete the data, then it starts working. Here is the proof:
How can data affect correctness of the query??
On your last screen, just below the last line of the statement, you have
CASE WHEN ( T_0.ID=1 AND ( T_0_0.ID=3 OR T_0_1.ID='VAL_1') )
and there's an asterisk (here it helps, but sometimes it can lead in the wrong direction) showing the place of the encountered issue:
T_0_1.ID='VAL_1'
In your table the ID column is of NUMBER type, while 'VAL_1' is a VARCHAR.
As the comparison rules state:
When comparing a character value with a numeric value, Oracle converts the character data to a numeric value.
(see https://docs.oracle.com/database/121/SQLRF/sql_elements002.htm#SQLRF00214)
When Oracle encounters this, it tries to cast your string to a number, and you get the error.
How can data affect correctness of the query??
When there's no data in the table, no record is returned from it, hence there's no need to check the column value for equality; the comparison is never executed and no error is shown.
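One way to sidestep the implicit conversion (a sketch, not part of the original answer) is to force the comparison onto the character side, so Oracle never tries to cast 'val_1' to a number:
CASE WHEN ( T_0.ID=1 AND ( T_0_0.ID=3 OR TO_CHAR(T_0_1.ID)='val_1') )
THEN 'val_1' ELSE 'val_2' END AS TXT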

Cannot define a dynamic parameter in a Hive query

I'm trying to set up some views in Hive that will take a date as a dynamic parameter. In my working below I've switched to using the hiveconf variable in the SELECT clause so we can see what's going on, but the principle remains the same.
According to this and this, I should be able to include a reference like ${hiveconf:dateRangeFrom} in my CREATE VIEW statement and supply the hiveconf:dateRangeFrom variable at run time for maximum happiness, but this is just not happening: Hive appears to use whatever value is assigned to the variable when the view is created and hard-codes it into the view definition, rather than substituting it at runtime as you might expect.
I've got a workaround whereby I supply a parameter to a SQL file that then creates all the views, substituting the desired value in, but this isn't sustainable.
All the working is below, so you can see how I came to this conclusion. Any ideas?
1) Supply a hiveconf value to a simple query
(needs to be a date for the ultimate query)
hive -e "Select ${hiveconf:dateRangeFrom} , unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');" --hiveconf dateRangeFrom='2014-01-01'
The date will be returned as supplied, and converted to a unix timestamp (e.g. "2014-01-01"=1388534400, "2014-09-12"=1410476400). The script can be run repeatedly, with results changing accordingly with the parameter.
2) Create a view that returns this data
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
This returns the error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hivevar' in select clause
Presumably because it is trying to do a replacement, but the ${hivevar:dateRangeFrom} variable has not been initialized at this point
According to:
Creating Views in Hive with parameter and
http://mail-archives.apache.org/mod_mbox/hive-user/201205.mbox/%3CBAY151-W9BC976D584FD172E7D70BC0160#phx.gbl%3E
Then variables can be used in Hive views, as long as quotes are used around them:
CREATE VIEW get_date AS
SELECT "${hiveconf:dateRangeFrom}", unix_timestamp("${hiveconf:dateRangeFrom}" , 'yyyy-MM-dd');
This allows the view to be created, so trying to call the view using a parameter:
hive -e "Select * from get_date" --hiveconf dateRangeFrom='2014-01-01'
just returns the variable name:
${hiveconf:dateRangeFrom} NULL
Time taken: 20.614 seconds, Fetched: 1 row(s)
Using single quotes instead:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT '${hiveconf:dateRangeFrom}', unix_timestamp('${hiveconf:dateRangeFrom} ', 'yyyy-MM-dd');
Gives the same result, just the variable name.
3) Create a view in an interactive session with the variable already set
SET hiveconf:dateRangeFrom="2014-02-01";
Rebuild the original view, with the variables without quotes
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hiveconf:dateRangeFrom}, unix_timestamp(${hiveconf:dateRangeFrom} , 'yyyy-MM-dd');
Then calling "select * from get_date;" from within the session gives the expected result.
As does calling from the command line, with the same parameter value:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-02-01'
However, if we call the view with a different parameter, then we still get the original answer:
hive -e "Select * from get_date;" --hiveconf dateRangeFrom='2014-09-12'
2014-02-01 1391212800
Time taken: 24.773 seconds, Fetched: 1 row(s)
If we set the variable inside a new session:
SET hiveconf:dateRangeFrom="2014-06-01";
or even do not set it at all, we still get the same result.
Looking at the extended view definition, the reason is obvious:
hive> describe extended get_date;
OK
_c0 string
_c1 bigint
Detailed Table Information Table(tableName:get_date, dbName:default, owner:
36015to, createTime:1410523149, lastAccessTime:0, retention:0, sd:StorageDescrip
tor(cols:[FieldSchema(name:_c0, type:string, comment:null), FieldSchema(name:_c1
, type:bigint, comment:null)], location:null, inputFormat:org.apache.hadoop.mapr
ed.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequen
ceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:nu
ll, serializationLib:null, parameters:{}), bucketCols:[], sortCols:[], parameter
s:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValu
eLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{
transient_lastDdlTime=1410523149}, viewOriginalText:SELECT "2014-02-01", unix_t
imestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", un
ix_timestamp("2014-02-01" , 'yyyy-MM-dd'), tableType:VIRTUAL_VIEW)
Time taken: 0.123 seconds, Fetched: 4 row(s)
The variable substitution took place when the view was created, and hard-coded that date into the definition:
viewOriginalText:SELECT "2014-02-01", unix_t
imestamp("2014-02-01" , 'yyyy-MM-dd'), viewExpandedText:SELECT "2014-02-01", un
ix_timestamp("2014-02-01" , 'yyyy-MM-dd')
4) Switch off variable substitution
Hive is clearly substituting in the current value of the variable when the CREATE VIEW statement runs, so I tried switching substitution off and recreating the view:
hive> set hive.variable.substitute;
hive.variable.substitute=true
hive> set hive.variable.substitute = false;
hive> set hive.variable.substitute;
hive.variable.substitute=false
The Create View statement still fails with the same error:
FAILED: ParseException line 2:8 cannot recognize input near '$' '{' 'hiveconf' in select clause
5) Workaround
If we create a sql file that creates the views, testParam.sql, we can sort of get around the problem:
DROP VIEW get_date;
CREATE VIEW get_date AS
SELECT ${hivevar:dateRangeFrom}, unix_timestamp(${hivevar:dateRangeFrom} , 'yyyy-MM-dd');
SELECT * FROM get_date;
Calling that from the command line gives the expected results:
hive -f testParam.sql --hiveconf dateRangeFrom='2014-08-01'
2014-08-01 1406847600
Time taken: 20.763 seconds, Fetched: 1 row(s)
hive -f testParam.sql --hiveconf dateRangeFrom='2014-09-12'
2014-09-12 1410476400
Time taken: 19.74 seconds, Fetched: 1 row(s)
This does work, and will be fine for now, but is hardly ideal for a distributed, multi-user environment. Looking at the view meta-data, we can see that the view is always destroyed and rebuilt with the latest parameters:
transient_lastDdlTime=1410525287}, viewOriginalText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), viewExpandedText:SELECT '2014-09-12', unix_timestamp('2014-09-12' , 'yyyy-MM-dd'), tableType:VIRTUAL_VIEW)
So, how do you create a view that can be supplied with dynamic parameters at runtime without constantly rebuilding it?
How are you defining dateRangeFrom? I think dateRangeFrom can be generated dynamically from the current_date function by adding or subtracting days based on your requirement. You can simply use Hive functions for that.
I don't know if this is what you're looking for!
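For example (a sketch; current_date and date_sub are standard Hive date functions from Hive 1.2 onwards):
SELECT date_sub(current_date, 30) AS dateRangeFrom; -- 30 days before today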
If you are passing values from a bash script, this should do the job:
dateRangeFrom=$(date +"%Y-%m-%d")
hive -e "Select '${dateRangeFrom}' , unix_timestamp('${dateRangeFrom}' , 'yyyy-MM-dd');"
If you want to set the value in the Hive script itself, you can do something like this (the backslashes stop the shell from expanding the variables, so Hive performs the substitution):
hive -e "SET hivevar:dateRangeFrom=2017-11-21; USE mydb; Select '\${dateRangeFrom}', unix_timestamp('\${dateRangeFrom}', 'yyyy-MM-dd');"
If you want to store the same hive query in a HQL file and run it externally, then you need to pass it like this
hive -f /abc/user/script.hql --hivevar dateRangeFrom=2017-11-21
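For completeness, a minimal script.hql for that invocation might look like this (hypothetical file contents; the variable is referenced through the hivevar namespace):
-- contents of /abc/user/script.hql (hypothetical)
USE mydb;
SELECT '${hivevar:dateRangeFrom}',
unix_timestamp('${hivevar:dateRangeFrom}', 'yyyy-MM-dd');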
