SparkSQL does not work with Hive UDAF - hadoop

I am using AWS EMR + Spark 1.6.1 + Hive 1.0.0
I have this UDAF and have included it in Spark's classpath: https://github.com/scribd/hive-udaf-maxrow/blob/master/src/com/scribd/hive/udaf/GenericUDAFMaxRow.java
I registered it in Spark with sqlContext.sql("CREATE TEMPORARY FUNCTION maxrow AS 'some.cool.package.hive.udf.GenericUDAFMaxRow'")
However, when I call it in Spark in the following query
CREATE VIEW VIEW_1 AS
SELECT
  a.A,
  a.B,
  maxrow(a.C,
         a.D,
         a.E,
         a.F,
         a.G,
         a.H,
         a.I) AS m
FROM
  table_1 a
JOIN
  table_2 b
ON
  b.Z = a.D
  AND b.Y = a.C
JOIN dummy_table
GROUP BY
  a.A,
  a.B
It gave me this error
16/05/18 19:49:14 WARN RowResolver: Duplicate column info for a.A was overwritten in RowResolver map: _col0: string by _col0: string
16/05/18 19:49:14 WARN RowResolver: Duplicate column info for a.B was overwritten in RowResolver map: _col1: bigint by _col1: bigint
16/05/18 19:49:14 ERROR Driver: FAILED: SemanticException [Error 10002]: Line 16:32 Invalid column reference 'C'
org.apache.hadoop.hive.ql.parse.SemanticException: Line 16:32 Invalid column reference 'C'
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10643)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10591)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3656)
But if I removed the GROUP BY clause and the aggregate function, it worked. So I suspect that SparkSQL somehow does not recognize it as an aggregate function.
Any help is appreciated. Thanks.
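For what it's worth, a commonly suggested pure-SQL alternative (my sketch, not from the original post, and unverified on Spark 1.6 specifically) is the built-in max over a named struct: structs compare field by field from left to right, so max picks the row with the largest C, with ties broken by D, and so on, approximating maxrow without any Hive UDAF.
CREATE VIEW VIEW_1 AS
SELECT
  a.A,
  a.B,
  -- max over a struct keeps the whole tuple of the winning row;
  -- field order in the struct defines the comparison order
  max(named_struct('C', a.C, 'D', a.D, 'E', a.E, 'F', a.F,
                   'G', a.G, 'H', a.H, 'I', a.I)) AS m
FROM
  table_1 a
JOIN
  table_2 b
ON
  b.Z = a.D
  AND b.Y = a.C
JOIN dummy_table
GROUP BY
  a.A,
  a.B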

Related

Clickhouse - Alias in view

I would like to create a view with aliases in the SELECT query.
After trying with this syntax, it does not work.
Does ClickHouse not support aliases in a view query, or is my syntax bad?
Error message:
Received exception from server (version 20.3.5): Code: 352.
DB::Exception: Received from localhost:9000. DB::Exception: Cannot
detect left and right JOIN keys. JOIN ON section is ambiguous..
Error message if I drop the aliases in the JOIN (ON A.column1 = B.column1 ---> ON table_a.column1 = table_b.column1):
Received exception from server (version 20.3.5): Code: 47.
DB::Exception: Received from localhost:9000. DB::Exception: Missing
columns:
Create table:
CREATE TABLE IF NOT EXISTS table_a
(
`column1` Nullable(Int32),
`column2` Nullable(Int32),
`column3` Nullable(Int32),
`column4` Nullable(Int32)
)
ENGINE = MergeTree()
PARTITION BY tuple()
ORDER BY tuple();
CREATE TABLE IF NOT EXISTS table_b
(
`column1` Nullable(Int32),
`column2` Nullable(Int32),
`column3` Nullable(Int32),
`column4` Nullable(Int32)
)
ENGINE = MergeTree()
PARTITION BY tuple()
ORDER BY tuple();
View query:
CREATE VIEW IF NOT EXISTS view_table_AB AS
SELECT
A.column1,
A.column2,
A.column3,
A.column4,
B.column1,
B.column2,
B.column3,
B.column4
FROM table_a AS A
INNER JOIN table_b AS B ON A.column1 = B.column1;
DOC clickhouse: https://clickhouse.tech/docs/fr/sql-reference/syntax/#syntax-expression_aliases
Thank you for your help
It looks like a bug. I filed CH Issue 11000; let's wait for the answer.
As a workaround, specify the table-name prefix instead of aliases:
CREATE VIEW IF NOT EXISTS view_table_AB AS
SELECT
table_a.column1,
table_a.column2,
table_a.column3,
table_a.column4,
table_b.column1,
table_b.column2,
table_b.column3,
table_b.column4
FROM table_a
INNER JOIN table_b ON table_a.column1 = table_b.column1;
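Note that this view still selects column1 through column4 from both tables, which yields duplicate output column names. If that is a problem, output-column aliases (the a_/b_ names below are made up) work alongside the table-name prefixes:
CREATE VIEW IF NOT EXISTS view_table_AB AS
SELECT
    table_a.column1 AS a_column1,  -- made-up output aliases to disambiguate
    table_a.column2 AS a_column2,
    table_a.column3 AS a_column3,
    table_a.column4 AS a_column4,
    table_b.column1 AS b_column1,
    table_b.column2 AS b_column2,
    table_b.column3 AS b_column3,
    table_b.column4 AS b_column4
FROM table_a
INNER JOIN table_b ON table_a.column1 = table_b.column1;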

Hive gives error when trying to find record with min subquery

In hive,
I am trying to select the entry with the minimum timestamp; however, it throws the following error and I am not sure why.
select * from sales where partition_batch_ts = (select max(partition_batch_ts) from sales);
Error
Error while compiling statement: FAILED: ParseException line 1:91 cannot recognize input near 'select' 'max' '(' in expression specification
I think you need to use proper table aliases. Also, IN must be used instead of =:
SELECT s1.*
FROM sales s1
WHERE s1.partition_batch_ts IN
(SELECT MAX(partition_batch_ts)
FROM sales s2);
From Hive manual, SUBQUERIES :
As of Hive 0.13 some types of subqueries are supported in the WHERE
clause.
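On Hive 0.11 or later, a window function also sidesteps the subquery restriction entirely (my sketch, not from the original answer):
SELECT sub.*
FROM (
  SELECT s.*,
         -- rank rows by timestamp, newest first; rnk is a helper column
         RANK() OVER (ORDER BY partition_batch_ts DESC) AS rnk
  FROM sales s
) sub
WHERE sub.rnk = 1;
Like IN, RANK() keeps every row that ties for the maximum; note that sub.* also returns the helper rnk column, so list the columns explicitly if you do not want it.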

Issue with Spring data jpa Db2 pagination

I am using Spring Data JPA with DB2. When I use a paging repository and query for the second page, it throws an error.
This is the generated query
SELECT *
FROM (SELECT inner2_.*,
ROWNUMBER()
OVER(
ORDER BY ORDER OF inner2_) AS rownumber_
FROM (SELECT db2DATAa0_.c_type AS col_0_0_,
db2DATAa0_.h_proc AS col_1_0_,
db2DATAa0_.n_vin AS col_2_0_,
db2DATAa0_.i_cust AS col_3_0_
FROM dcu.v_rpt_data_hist db2DATAa0_
WHERE db2DATAa0_.reportid = '0H000488089'
AND ( db2DATAa0_.c_type = 'S'
OR db2DATAa0_.c_type = 'N'
OR db2DATAa0_.c_type = 'A'
OR db2DATAa0_.c_type = 'T' )
ORDER BY db2DATAa0_.h_proc desc
FETCH first 30 ROWS only) AS inner2_) AS inner1_
WHERE rownumber_ > 15
ORDER BY rownumber_
Error:
2719372 [2016-10-21 16:29:02,040] [RxCachedThreadScheduler-13] WARN org.hibernate.engine.jdbc.spi.SqlExceptionHelper - SQL Error: -199, SQLState: 42601
2719379 [2016-10-21 16:29:02,047] [RxCachedThreadScheduler-13] ERROR org.hibernate.engine.jdbc.spi.SqlExceptionHelper - DB2 SQL Error: SQLCODE=-199, SQLSTATE=42601, SQLERRMC=OF;??( [ DESC ASC NULLS RANGE CONCAT || / MICROSECONDS MICROSECOND, DRIVER=3.57.82
Any idea?
Your error states ILLEGAL USE OF KEYWORD OF. TOKEN [DESC ASC NULLS RANGE CONCAT] WAS EXPECTED.
I identified this as the critical part of the query:
ORDER BY ORDER OF inner2_
DB2 expects one of DESC, ASC, NULLS, RANGE, CONCAT after the second ORDER keyword.
This issue can be resolved by changing the dialect.
Change the dialect in your configuration or property file to DB2ZOSDialect.
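To illustrate why the dialect matters (a hand-written sketch, not what Hibernate actually emits): DB2 for z/OS rejects ORDER BY ORDER OF, so the pagination query has to repeat the ordering expression explicitly, e.g.:
SELECT *
FROM (SELECT inner2_.*,
             ROW_NUMBER()
               OVER(
                 -- repeat the inner ordering instead of ORDER BY ORDER OF inner2_
                 ORDER BY inner2_.col_1_0_ DESC) AS rownumber_
      FROM (SELECT db2DATAa0_.c_type AS col_0_0_,
                   db2DATAa0_.h_proc AS col_1_0_,
                   db2DATAa0_.n_vin AS col_2_0_,
                   db2DATAa0_.i_cust AS col_3_0_
            FROM dcu.v_rpt_data_hist db2DATAa0_
            WHERE db2DATAa0_.reportid = '0H000488089'
              -- OR chain from the original condensed to IN for brevity
              AND db2DATAa0_.c_type IN ('S', 'N', 'A', 'T')
            ORDER BY db2DATAa0_.h_proc DESC
            FETCH FIRST 30 ROWS ONLY) AS inner2_) AS inner1_
WHERE rownumber_ > 15
ORDER BY rownumber_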

Hive LATERAL VIEW and WHERE Clause using Sub query

I'm looking for a way to optimize my query.
We have a table of events called lea, with a column app_properties containing tags stored as a comma-separated string.
I would like to select all the events that match the result of a query that selects the desired tags.
My first try:
SELECT uuid, app_properties, tag
FROM events
LATERAL VIEW explode(split(app_properties, '(, |,)')) tag_table AS tag
WHERE tag IN (SELECT source_value FROM mapping WHERE indicator = 'Bandwidth Usage')
But Hive will not allow this...
FAILED: SemanticException [Error 10249]: Line 4:6 Unsupported SubQuery Expression 'tag': Correlating expression cannot contain unqualified column references.
Gave it another try by replacing WHERE tag IN with WHERE tag_table.tag IN, but no luck...
FAILED: SemanticException Line 4:6 Invalid table alias tag_table' in definition of SubQuery sq_1 [tag_table.tag IN (SELECT source_value FROM mapping WHERE indicator = 'Bandwidth Usage')] used as sq_1 at Line 4:20.
In the end, the query below gives the desired result, but I have a feeling that this is not the most optimized way of solving this use case. Has anyone run into the same use case where you need to select from a LATERAL VIEW using a subquery?
SELECT to_date(substring(events.time, 0, 10)) as date, t2.code, t2.indicator, count(1) as total
FROM events
LEFT JOIN (
SELECT distinct t.uuid, im.code, im.indicator
FROM mapping im
RIGHT JOIN (
SELECT tag, uuid
FROM events
LATERAL VIEW explode(split(app_properties, '(, |,)')) tag_table AS tag
) t
ON im.source_value = t.tag AND im.indicator = 'Bandwidth Usage'
WHERE im.source_value IS NOT NULL
) t2 ON (events.uuid = t2.uuid)
WHERE t2.code IS NOT NULL
GROUP BY to_date(substring(events.time, 0, 10)), t2.code, t2.indicator;
A Hive subquery in the WHERE clause can be used with IN, NOT IN, EXISTS, or NOT EXISTS as follows. If the alias (see the following example for the employee table) is not specified before the columns (name) in the WHERE condition, Hive will report the error Correlating expression cannot contain unqualified column references. This is a limitation of the Hive subquery.
From Apache Hive Essentials.
I guess this problem is also caused by that subquery limitation: events should have an alias, and the correlated column should be qualified with it.
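Concretely (my sketch; whether the analyzer then accepts the lateral-view column inside the correlated subquery still depends on your Hive version), the first attempt with everything aliased and qualified would look like:
SELECT e.uuid, e.app_properties, tag_table.tag
FROM events e
LATERAL VIEW explode(split(e.app_properties, '(, |,)')) tag_table AS tag
-- every column reference below is qualified with a table or view alias
WHERE tag_table.tag IN (
  SELECT m.source_value FROM mapping m WHERE m.indicator = 'Bandwidth Usage');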

Strange error when using Hive UDF through JDBC client

Hi all. I met a strange error when using a Hive UDF through the JDBC client.
I have a UDF called reformat_date that helps me convert a string into timestamp format. I first execute ADD JAR and CREATE TEMPORARY FUNCTION, and both work fine.
The SQL can also be explained and executed in Hive CLI mode. But when I use the JDBC client, I get these errors:
Query returned non-zero code: 10, cause:
FAILED: Error in semantic analysis: Line 1:283 Wrong arguments ''20121201000000'':
org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to execute method public org.apache.hadoop.io.Text com.aa.datawarehouse.hive.udf.ReformatDate.evaluate(org.apache.hadoop.io.Text) on object com.aa.datawarehouse.hive.udf.ReformatDate#4557e3e8 of class com.aa.datawarehouse.hive.udf.ReformatDate with arguments {20121201000000:org.apache.hadoop.io.Text} of size 1:
at com.aa.statistic.dal.impl.TjLoginDalImpl.selectAwakenedUserCount(TjLoginDalImpl.java:258)
at com.aa.statistic.backtask.service.impl.UserBehaviorAnalysisServiceImpl.recordAwakenedUser(UserBehaviorAnalysisServiceImpl.java:326)
at com.aa.statistic.backtask.controller.BackstatisticController$21.execute(BackstatisticController.java:773)
at com.aa.statistic.backtask.controller.BackstatisticController$DailyExecutor.execute(BackstatisticController.java:823)
My SQL is
select count(distinct a.user_id) as cnt
from (
  select user_id, user_kind, login_date, login_time
  from tj_login_hive
  where p_month = '2012_12'
    and login_date = '20121201'
    and user_kind = '0'
) a
join (
  select user_id
  from tj_login_hive
  where p_month <= '2012_12'
    and datediff(to_date(reformat_date(concat('20121201', '000000'))),
                 to_date(reformat_date(concat(login_date, '000000')))) >= 90
) b on a.user_id = b.user_id
Thanks.
I think your UDF threw an exception.
If reformat_date is a function that you wrote, you should check your logic.
If not, you should check the UDF's specification.
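For example, a quick way to reproduce the failure in isolation over the same JDBC connection (my sketch; the jar path is a placeholder, and tj_login_hive just serves as a dummy row source):
ADD JAR /path/to/your-udf.jar;  -- placeholder path, point this at your jar
CREATE TEMPORARY FUNCTION reformat_date AS 'com.aa.datawarehouse.hive.udf.ReformatDate';
-- if this single call throws the same HiveException, the bug is inside
-- ReformatDate.evaluate() rather than in the surrounding join query
select reformat_date(concat('20121201', '000000')) from tj_login_hive limit 1;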
