AvroSerDe: Encountered exception determining schema

AvroSerDe: Encountered exception determining schema - hadoop

I'm moving data from one hive table to another using Spark after applying some transformations. While doing so, I'm getting an exception saying avsc file can't be read (as per my understanding) but I can see the avsc file in HDFS.
Please advise what could be the reason.
I've source hive table defined as:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE TABLE `exchr`( |
| `exchr_sk` bigint, |
| `rec_ctry_cd` string, |
| `cob_dt` date, |
| `ccy_cd` string, |
| `exchr_val` decimal(10,5)) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.mapred.TextInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
| LOCATION |
| 'hdfs://xxxx/EXCHR' |
| TBLPROPERTIES ( |
| 'COLUMN_STATS_ACCURATE'='true', |
| 'last_modified_by'='hive', |
| 'last_modified_time'='1500408192', |
| 'numFiles'='0', |
| 'numRows'='3', |
| 'rawDataSize'='73', |
| 'totalSize'='0', |
| 'transient_lastDdlTime'='1501780655') |
+----------------------------------------------------+--+
I've a target hive table defined as:
+----------------------------------------------------+--+
| createtab_stmt |
+----------------------------------------------------+--+
| CREATE EXTERNAL TABLE `fx_rate`( |
| `stamp` string COMMENT '', |
| `curr_code` string COMMENT '', |
| `fx_rate` double COMMENT '') |
| PARTITIONED BY ( |
| `cb_dt` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://xxxx/fx_rate' |
| TBLPROPERTIES ( |
| 'avro.schema.url'='hdfs://xxxx/fx_rate.avsc', |
| 'transient_lastDdlTime'='1502313078') |
+----------------------------------------------------+--+
when i'm trying to insert data into this table using spark with below code, I encounter Encountered exception determining schema. Returning signal schema to indicate problem
Spark code:
val TABLE_2_0 = sqlContext.sql("select * from xxxx.exchr");
val SELECT_1_0 = TABLE_2_0.select(TABLE_2_0.col("*"));
val SELECT_0_0 = SELECT_1_0.select(date_format(SELECT_1_0.col("cob_dt"), "yyyymmdd").as("stamp"), lit(null).as("curr_code"), round(lit(10)/SELECT_1_0.col("exchr_val"),10).as("fx_rate"), date_format(SELECT_1_0.col("cb_dt"), "yyyymmdd").as("cb_dt")).limit(1000);
SELECT_0_0.toDF().write.mode("append").insertInto("xxxx.fx_rate")
Exception trace:
java.lang.NullPointerException
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:381)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFromFS(AvroSerdeUtils.java:154)
at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:135)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:172)
at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:103)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.newSerializer(InsertIntoHiveTable.scala:59)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.outputClass$lzycompute(InsertIntoHiveTable.scala:53)
17/08/10 23:46:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 3, myhost.net, executor 2): org.apache.hadoop.hive.serde2.SerDeException: Encountered exception determining schema. Returning signal schema to indicate problem: null
at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:523)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:97)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:88)
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializer(TableDesc.java:81)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:92)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Related

Dbeaver does not display "Table description" of table in a Hive database

I created a Hive table with table description metadata using this command:
create table sbx_ppppp.comments (
s string comment 'uma string',
i int comment 'um inteiro'
) comment 'uma tabela com comentários';
But it isn't correctly displayed when I double click the table:
The table description also isn't displayed in the table tooltip or in the table list when I double click the database name.
When I run the describe formatted table sbx_ppppp.comments command with the comment is correctly displayed as table property:
col_name |data_type |comment |
----------------------------+------------------------------------------------+---------------------------------------------------------------------------+
# col_name |data_type |comment |
s |string |uma string |
i |int |um inteiro |
| | |
# Detailed Table Information| | |
Database: |sbx_ppppp | |
OwnerType: |USER | |
Owner: |ppppp | |
CreateTime: |Fri Apr 29 18:31:31 BRT 2022 | |
LastAccessTime: |UNKNOWN | |
Retention: |0 | |
Location: |hdfs://BNDOOP03/corporativo/sbx_ppppp/comments | |
Table Type: |MANAGED_TABLE | |
Table Parameters: | | |
|COLUMN_STATS_ACCURATE |{\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"i\":\"true\",\"s\":\"true\"}}|
|bucketing_version |2 |
|comment |uma tabela com comentários |
|numFiles |0 |
|numRows |0 |
|rawDataSize |0 |
|totalSize |0 |
|transactional |true |
|transactional_properties |default |
|transient_lastDdlTime |1651267891 |
| | |
# Storage Information | | |
SerDe Library: |org.apache.hadoop.hive.ql.io.orc.OrcSerde | |
InputFormat: |org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | |
OutputFormat: |org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat| |
Compressed: |No | |
Num Buckets: |-1 | |
Bucket Columns: |[] | |
Sort Columns: |[] | |
Storage Desc Params: | | |
|serialization.format |1 |
In "Table Parameters" you can see the value "uma tabela com comentários" for the "comment" parameter.
I'm using Cloudera ODBC driver version 2.6.11.1011 to connect to Hive. DBeaver is version 22.0.3.202204170718. I don't know if this is a bug in DBeaver or in Cloudera ODBC driver. Maybe I'm not correctly setting the table description.

How to drop hive partitions with hivevar passed as partition variable?

I have been trying to run this piece of code to drop current day's partition from hive a table and for some reason it does not drop the partition from the hive table. Not sure what's worng.
Table Name : prod_db.products
desc:
+----------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+----------------------------+-----------------------+-----------------------+--+
| name | string | |
| cost | double | |
| load_date | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| load_date | string | |
+----------------------------+-----------------------+-----------------------+--+
## I am using the following code
SET hivevar:current_date=current_date();
ALTER TABLE prod_db.products DROP PARTITION(load_date='${current_date}');
Before and After picture of partitions:
+-----------------------+--+
| partition |
+-----------------------+--+
| load_date=2022-04-07 |
| load_date=2022-04-11 |
| load_date=2022-04-18 |
| load_date=2022-04-25 |
+-----------------------+--+
It runs without any error but doesn't work but won't drop the partition. Table is internal/managed.
I tried different ways mentioned on stack but it is just not working for me.
Help.

You dont need to set a variable. You can directly drop using direct sql.
Alter table prod_db.products
drop partition (load_date= current_date());

Unable to query/select data those inserted through Spark SQL

I am trying to insert data into a Hive Managed table that has a partition.
Show create table output for reference.
+--------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+--------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `part_test08`( |
| `id` string, |
| `name` string, |
| `baseamount` double, |
| `billtoaccid` string, |
| `extendedamount` double, |
| `netamount` decimal(19,5), |
| `netunitamount` decimal(19,5), |
| `pricingdate` timestamp, |
| `quantity` int, |
| `invoiceid` string, |
| `shiptoaccid` string, |
| `soldtoaccid` string, |
| `ingested_on` timestamp, |
| `external_id` string) |
| PARTITIONED BY ( |
| `productid` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'wasb://blobrootpath/hive/warehouse/db_103.db/part_test08' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'transactional'='true', |
| 'transactional_properties'='default', |
| 'transient_lastDdlTime'='1549962363') |
+--------------------------------------------------------------------------------------------------+--+
Trying to execute SQL statement to insert records into the part table like below
sparkSession.sql("INSERT INTO TABLE db_103.part_test08 PARTITION(ProductId) SELECT reflect('java.util.UUID', 'randomUUID'),stg_name,stg_baseamount,stg_billtoaccid,stg_extendedamount,stg_netamount,stg_netunitamount,stg_pricingdate,stg_quantity,stg_invoiceid,stg_shiptoaccid,stg_soldtoaccid,'2019-02-12 09:06:07.566',stg_id,stg_ProductId FROM tmp_table WHERE part_id IS NULL");
Without insert statement, if we run select query then getting below data.
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
|reflect(java.util.UUID, randomUUID)|stg_name|stg_baseamount| stg_billtoaccid|stg_extendedamount|stg_netamount|stg_netunitamount| stg_pricingdate|stg_quantity|stg_invoiceid| stg_shiptoaccid| stg_soldtoaccid|2019-02-12 09:06:07.566|stg_id|stg_ProductId|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
| 4e0b4331-b551-42d...| OLI6| 16.0|2DD4E682-6B4F-E81...| 34.567| 1166.74380| 916.78000|2018-10-18 05:06:22| 13| I1|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 6| P3|
| 8b327a8e-dd3c-445...| OLI7| 16.0|2DD4E682-6B4F-E81...| 34.567| 766.74380| 1016.78000|2018-10-18 05:06:22| 13| I6|2DD4E682-6B4F-E81...|2DD4E682-6B4F-E81...| 2019-02-12 09:06:...| 7| P4|
| c0e14b9a-8d1a-426...| OLI5| 14.6555| null| 34.56| 500.87000| 814.65000|2018-10-11 05:06:22| 45| I4|29B73C4E-846B-E71...|29B73C4E-846B-E71...| 2019-02-12 09:06:...| 5| P1|
+-----------------------------------+--------+--------------+--------------------+------------------+-------------+-----------------+-------------------+------------+-------------+--------------------+--------------------+-----------------------+------+-------------+
Earlier I was getting error while inserting into Managed table. But after restarting Hive & Thrift services now there is no error in execution of job but not able to see those inserted data while doing select query through beeline/program. I can see partition with delta files also got inserted into hive/warehouse, see below screenshot.
Also, I can see some warnings as below not sure its related to the error or not.
Cannot get ACID state for db_103.part_test08 from null
One more note: If I use External Table then it is working fine can able to view data as well.
We are using Azure HDInsight Spark 2.3 (HDI 4.0 Preview) cluster with below Service Stacks.
HDFS: 3.1.1
Hive: 3.1.0
Spark2: 2.3.1

Have you added below set commands while trying to insert data.
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I have faced similar issue where i was not allowed to do any read / write operation, after adding the above properties I was able to query the table.
Since you have not faced any issue with external table, not sure if this is going to solve your problem.

Hive: Error while fetching data

I tried to connect to Hive using the below query:
select * from some-table where yyyy = 2018 and mm = 01 and dd = 05 runs
The query ran successfully.
After adding one more filter, i.e string data type
The following error is generated:
java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.serde2.io.DateWritable cannot be cast to
org.apache.hadoop.io.Text

The error is generated by Serializer-Deserializers.
Root Cause: When you created the table, you probably didn't define the STORED AS tag. Try to describe your table using desc <table name> and you may see something like this:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
Which is not in the good practice. Your SerDes use Lazy Evaluations by default. Create a table using STORED AS ORC and then try to describe your table and the result may be different this time:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
Try this and you may be able to resolve the issue.

Spark avro insertInto file extension

I've an external Hive table based on Avro.
| CREATE EXTERNAL TABLE `temp_avro`( |
| `string1` string COMMENT '') |
| PARTITIONED BY ( |
| `string2` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://xxx/xxx/temp_avro' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1503938718') |
I'm trying to write to this table using Spark as:
SELECT_0_0.toDF().write.mode("append").insertInto("temp_avro")
With this, the avro file gets created in the HDFS location without avro extension (with names part-00001, part-00002, and so on). Is there a way to have the file name with extension .avro

Try using coalesce to reduce parts combining to one before saving your results
SELECT_0_0.toDF().coalesce(1).write.mode("append").insertInto("temp_avro")

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

AvroSerDe: Encountered exception determining schema - hadoop

Related

Dbeaver does not display "Table description" of table in a Hive database

How to drop hive partitions with hivevar passed as partition variable?

Unable to query/select data those inserted through Spark SQL

Hive: Error while fetching data

Spark avro insertInto file extension

Categories

Resources