Does Impala support a Java UDF in the WHERE clause? - user-defined-functions

I can use a Java-based UDF in Hive and Impala, but it throws a ClassNotFound error when the UDF is called in the WHERE clause.
The UDF cannot be used when it is referenced in the WHERE clause, but it works properly when it is only referenced in the SELECT list, with Impala 2.9.0-cdh5.12.1.
In Hive, select udfjson(memo,state) from tableA where udfjson(memo,state) = 0 and name = 'test' works properly, but not in Impala.
Executing select udfjson(memo,state) from tableA where name = 'test' in Impala is OK. The UDF can be used in Impala only when it is not in the WHERE clause.
Here is the error:
Error(255): Unknown error 255
Root cause: NoClassDefFoundError: org/apache/hadoop/hdfs/DFSInputStream$ByteArrayStrategy
Is it possible to reference a UDF in the WHERE clause with Impala?

Use a sub-query:
select * from
(
select udfjson(memo,state) as state from tableA where name = 'test'
)s
where s.state=0
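The same rewrite can also be written as a WITH clause, which Impala supports and treats as an alias for the sub-query; a sketch using the same names as above:
with t as (
select udfjson(memo,state) as state from tableA where name = 'test'
)
select * from t where t.state = 0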

Related

Spark SQL throwing error "java.lang.UnsupportedOperationException: Unknown field type: void"

I am getting the below error in Spark (1.6) SQL while creating a table with a column value defaulted to NULL. Example: create table test as select column_a, NULL as column_b from test_temp;
The same thing works in Hive and creates the column with data type "void".
I am using an empty string instead of NULL to avoid the exception, but then the new column gets a string data type.
Is there any better way to insert null values into a Hive table using Spark SQL?
2017-12-26 07:27:59 ERROR StandardImsLogger$:177 - org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Unknown field type: void
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:789)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:746)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply$mcV$sp(ClientWrapper.scala:428)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:426)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$createTable$1.apply(ClientWrapper.scala:426)
at org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:293)
at org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:239)
at org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:238)
at org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:281)
at org.apache.spark.sql.hive.client.ClientWrapper.createTable(ClientWrapper.scala:426)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute$1(CreateTableAsSelect.scala:72)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$1(CreateTableAsSelect.scala:47)
at org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:56)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:153)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:829)
I couldn't find much information regarding the datatype void but it looks like it is somewhat equivalent to the Any datatype we have in Scala.
The table at the end of this page explains that a void can be cast to any other data type.
Here are some JIRA issues that are somewhat similar to the problem you are facing:
HIVE-2901
HIVE-747
So, as mentioned in the comment, instead of a bare NULL you can cast it explicitly to one of the other data types.
select cast(NULL as string) as column_b
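Applied to the CREATE TABLE AS SELECT from the question, the cast would look like this (a sketch reusing the question's table and column names):
create table test as select column_a, cast(NULL as string) as column_b from test_temp;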
I started to get a similar issue. I boiled the code down to the following example:
WITH DATA
AS (
SELECT 1 ISSUE_ID,
DATE(NULL) DueDate,
MAKE_DATE(2000,01,01) DDate
UNION ALL
SELECT 1 ISSUE_ID,
MAKE_DATE(2000,01,01),
MAKE_DATE(2000,01,02)
)
SELECT ISNOTNULL(lag(IT.DueDate, 1) OVER (PARTITION by IT.ISSUE_ID ORDER BY IT.DDate ))
AND ISNULL(IT.DueDate)
FROM DATA IT

Add a partition to a Hive table based on a sub-query

I am trying to add a partition to a Hive table (partitioned by date).
My problem is that the date needs to be fetched from another table.
My query looks like:
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION(server_date = (SELECT max(server_date) FROM processed_table));
When I run the query, Hive throws the following error:
Error: Error while compiling statement: FAILED: ParseException line 1:84 cannot recognize input near '(' 'SELECT' 'max' in constant (state=42000,code=40000)
Hive does not allow functions, UDFs, or subqueries to be used for the partition column value in ADD PARTITION; the value must be a constant.
Approach 1:
To achieve this, you can run the inner query first, store the result in a shell variable, and then execute the ALTER TABLE statement with that value.
server_date=$(hive -e "set hive.cli.print.header=false; select max(server_date) from processed_table;")
hive -hiveconf "server_date"="$server_date" -f your_hive_script.hql
Inside your script you can use the following statement:
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION(server_date =${hiveconf:server_date});
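If server_date is a string or date typed partition column, the substituted value will usually also need quotes; a sketch of the same statement, assuming a string-typed column:
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION(server_date='${hiveconf:server_date}');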
For more information, refer to the Hive documentation on variable substitution.
Approach 2:
In this approach, you will need to create a temporary table if the partition data you are expecting is not already loaded in another partitioned table.
This assumes your data does not have the server_date column.
Load the data into the temporary table.
set hive.exec.dynamic.partition=true;
Execute the below query:
INSERT OVERWRITE TABLE my_table PARTITION (server_date)
SELECT b.column1, b.column2,........,a.server_date as server_date FROM (select max(server_date) as server_date from processed_table) a, temp_table b; -- temp_table: the temporary table loaded in the previous step
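Note that because server_date is supplied dynamically with no static partition value, Hive's default configuration also requires nonstrict dynamic-partition mode; a sketch of the settings that would precede the INSERT:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;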

How to fix partition predicate error without using (set hive.mapred.mode=nonstrict)

Is there any way to run a Hive query against a partitioned table when strict mode is enabled, without using set hive.mapred.mode=nonstrict?
I have two tables, both partitioned by dte. When I do a union operation on both tables and try to filter with two WHERE conditions, I get a partition predicate error.
Query
with a as (select "table1" as table_name,column1,column2,column3 from table_one
union all
select "table2" as table_name,column1,column2,column3 from table_two)
select * from a where dte='2017-08-01' and table_name='table1';
Error
Error: Error while compiling statement: FAILED: SemanticException
[Error 10041]: No partition predicate found for Alias "table_one"
Table "table_one" (state=42000,code=10041)

No partition predicate found for Alias even when the partition predicate is present in the query

I have a table pos.pos_inv in hdfs which is partitioned by yyyymm. Below is the query:
select DATE_ADD(to_date(from_unixtime(unix_timestamp(Inv.actvydt, 'MM/dd/yyyy'))),5),
to_date(from_unixtime(unix_timestamp(Inv.actvydt, 'MM/dd/yyyy'))),yyyymm
from pos.pos_inv inv
INNER JOIN pos.POSActvyBrdg Brdg ON Brdg.EIS_POSActvyBrdgId = Inv.EIS_POSActvyBrdgId
where to_date(from_unixtime(unix_timestamp(Inv.nrmlzdwkenddt, 'MM/dd/yyyy')))
BETWEEN DATE_SUB(to_date(from_unixtime(unix_timestamp(Inv.actvydt, 'MM/dd/yyyy'))),6)
and DATE_ADD(to_date(from_unixtime(unix_timestamp(Inv.actvydt, 'MM/dd/yyyy'))),6)
and inv.yyyymm=201501
I have provided the partition value for the query as 201501, but I still get the error:
Error while compiling statement: FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "inv" Table "pos_inv"
(Schema) The partition column yyyymm is of int type, and actvydt is a date stored as a string type.
This happens because Hive is set to strict mode.
Strict mode forces a query on a partitioned table to name specific partitions, so that only the corresponding partition folders in HDFS are read.
Setting hive.mapred.mode=nonstrict; will make it work.
The error in your query says: No partition predicate found for Alias "inv" Table "pos_inv".
So the WHERE clause must include a predicate on the partition column of the partitioned table (pos_inv), and not only on its other columns, as you have done.
set hive.mapred.mode=nonstrict allows you to access the whole table rather than particular partitions. In some cases reading the whole dataset is necessary, for example with window functions such as rank() over.
This happens when the two tables have the same column name (possibly the same partition column). Try handling the tables separately, each with its own WHERE condition, like below:
WITH tableA as
(
-- All your where clause here
),
tableB AS
(
-- All your where clause here
)
select tableA.*, tableB.* from tableA join tableB on <join condition>
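Applied to the query above, a sketch might look like the following (this assumes pos.POSActvyBrdg is also partitioned by yyyymm, which the question does not state; if it is not partitioned, only the inv CTE needs the filter):
WITH inv AS
(
SELECT * FROM pos.pos_inv WHERE yyyymm = 201501
),
brdg AS
(
SELECT * FROM pos.POSActvyBrdg WHERE yyyymm = 201501
)
SELECT DATE_ADD(to_date(from_unixtime(unix_timestamp(inv.actvydt, 'MM/dd/yyyy'))),5),
to_date(from_unixtime(unix_timestamp(inv.actvydt, 'MM/dd/yyyy'))), inv.yyyymm
FROM inv
INNER JOIN brdg ON brdg.EIS_POSActvyBrdgId = inv.EIS_POSActvyBrdgId
WHERE to_date(from_unixtime(unix_timestamp(inv.nrmlzdwkenddt, 'MM/dd/yyyy')))
BETWEEN DATE_SUB(to_date(from_unixtime(unix_timestamp(inv.actvydt, 'MM/dd/yyyy'))),6)
AND DATE_ADD(to_date(from_unixtime(unix_timestamp(inv.actvydt, 'MM/dd/yyyy'))),6);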

Greenplum: Find associated error table of any external table

Is there any way to find the list of all error tables associated with each external table?
Actual requirement: I am using external tables in Greenplum, with data coming from the source system as files and ingested into Greenplum via external tables. I want to report all the rejected rows back to the source system.
Regards,
Gurupreet
http://gpdb.docs.pivotal.io/4340/admin_guide/load/topics/g-viewing-bad-rows-in-the-error-table-or-error-log.html
You basically just use the built-in function gp_read_error_log() and pass in the external table name to get the errors associated with the files. There is an example in the above link too.
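For example, assuming an external table named ext_sales (a hypothetical name), the call would look like this:
SELECT * FROM gp_read_error_log('ext_sales');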
The field fmterrtbl of pg_exttable contains the oid of the error table for any external table. So the query to find the error table for all external tables in the database is:
SELECT
external_namespace.nspname AS external_schema, external_class.relname AS external_table,
error_namespace.nspname AS error_schema, error_class.relname AS error_table
FROM pg_exttable AS external_tables
INNER JOIN pg_class AS external_class ON external_class.oid = external_tables.reloid
INNER JOIN pg_namespace AS external_namespace ON external_namespace.oid = external_class.relnamespace
LEFT JOIN (
pg_class AS error_class
INNER JOIN pg_namespace AS error_namespace ON error_namespace.oid = error_class.relnamespace
) ON error_class.oid = external_tables.fmterrtbl
The error_schema and error_table fields will be NULL for external tables with no error table.

Resources